LeMat-Rho: High-Fidelity Charge Density Dataset for Atomistic Materials Modeling
¹Department of Materials, Imperial College, London, UK
²Entalpic, Paris, FR
³Hugging Face
⁴Department of Materials, EPFL, Lausanne CH
Ab initio datasets have significantly advanced atomistic modeling by enabling the development of accurate machine learning interatomic potentials. Atom-centered models predict atomic energies and forces with high precision, as these values are derived from charge densities calculated using the Kohn-Sham equations. Access to extensive, high-accuracy charge density datasets provides additional physical insights for atomistic machine-learning approaches.
As part of the LeMaterial Open Science initiative, this is the initial release of LeMat-Rho, an r²SCAN charge-density dataset encompassing 70,000 materials. This dataset provides the research community with high-quality quantum-mechanical data describing electron distributions in materials. The r²SCAN meta-GGA exchange-correlation functional provides a more accurate representation of bonding and energetics than generalized-gradient approximations such as PBE, as illustrated in the Figure below (Fig. 1).
The workflow automates high-throughput density functional theory (DFT) calculations, beginning with input crystal structures from LeMat-Bulk and proceeding through a configurable workflow builder to the VASP ab initio code. The pipeline achieves robust convergence and high-fidelity electronic structure data using a staged PBE-to-r²SCAN workflow. It starts with a PBE static pre-calculation that produces reasonable initial wavefunctions at low computational cost. This is followed by two consecutive r²SCAN relaxations and a final static evaluation. Two relaxations are needed because, as the cell volume equilibrates, the basis set must be reset to ensure optimal convergence.
The approach produces converged electronic-structure outputs, including the charge density, wavefunction, energies, and forces. All simulation results are systematically curated and stored in AWS Open Data. The dataset is now publicly available on Hugging Face, serving as an open resource for materials discovery, electronic structure analysis, and machine learning research.
The charge densities are then passed to a Bader analysis code to provide information on charge localization and chemical bonding. Finally, the dataset is formatted for ease of use before publication on Hugging Face.
To make efficient use of computational resources, we used the high-throughput tools Atomate and JobFlow. The Hugging Face dataset is populated using tools from the pymatgen Python library.
Figure 2: Illustration of the LeMat-Rho implementation workflow
The resulting dataset is enriched in lithium- and cobalt-containing compounds because the initial crystal structures were sourced primarily from repositories developed for battery materials exploration, as shown in Figure 3. To maximize computational throughput and maintain consistency across calculations, the workflow preferentially selected structures with small primitive cells, thereby reducing the representation of larger, more compositionally complex oxide systems.
Figure 3: Periodic table heatmap showing the percentage occurrence of each chemical element across the dataset.
The workflow employed strict convergence criteria to ensure high-quality DFT results. Electronic self-consistent field (SCF) calculations were converged to an energy tolerance below 1 × 10⁻⁶ eV, while ionic relaxations were considered converged when the maximum residual force magnitude fell below 0.02 eV/Å. Overall workflow robustness was high. Across the completed dataset, 98% of structures achieved residual forces below 0.0182 eV/Å and residual stresses below 4.19 kbar, indicating strong structural and mechanical convergence as shown in Figure 4.
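The force criterion above can be checked against the `forces` field of any row. The following is a minimal sketch, assuming that "maximum residual force magnitude" means the largest per-site force norm; the function names are illustrative and not part of the LeMat-Rho tooling.

```python
import math

# Ionic convergence threshold quoted in the text, in eV/Å.
FORCE_TOLERANCE = 0.02


def max_force_norm(forces):
    """Return the largest Euclidean norm among per-site force vectors (eV/Å)."""
    return max(
        (math.sqrt(fx * fx + fy * fy + fz * fz) for fx, fy, fz in forces),
        default=0.0,
    )


def is_relaxed(forces, tolerance=FORCE_TOLERANCE):
    """True when every site carries a residual force below the tolerance."""
    return max_force_norm(forces) < tolerance
```

Applied to a row, `is_relaxed(row["forces"])` gives a quick sanity check before a structure is used downstream.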
To facilitate machine-learning applications and large-scale data distribution, volumetric charge densities were transformed into a standardized 15×15×15 representation using the pyrho package. This compression preserves the global spatial distribution of electron density while reducing storage requirements sufficiently for Parquet-based datasets and efficient access through Hugging Face.
To enable broader analysis of the compounds' chemical properties, we provide Bader charges and Bader atomic volumes for each structure. The goal is to quantify the redistribution of electron density among atoms, serving as a measure of charge transfer and bonding characteristics. The accompanying figure (Fig. 5) shows the distribution of Bader charges for three representative elements, highlighting the diversity of chemical bonding in the dataset. These distributions indicate how atomic charge states vary between compounds and may serve as valuable descriptors for understanding structure–property relationships.
Figure 5: Histograms showing the distribution of Bader charges for oxygen (O), lithium (Li), and iron (Fe) atoms in the dataset. Oxygen exhibits predominantly negative charges, while lithium tends to be positively charged, consistent with their typical ionic behavior. Iron displays a broader distribution, reflecting its diverse oxidation states and bonding environments across different compounds.
This dataset of charge densities provides a compact and scalable representation of electronic structure information across a wide range of materials. We hope it will support the development of new machine learning models capable of learning complex material properties directly from electron density data, while also enabling faster initialization of DFT calculations.
By releasing this dataset, we hope to contribute to the accelerating pace of research in computational materials science.
Download and use within Python
from datasets import load_dataset
dataset = load_dataset('LeMaterial/LeMat-Rho')
# convert to Pandas, if you prefer working with this type of object:
df = dataset['train'].to_pandas()
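For large-scale use it may be preferable to stream rows instead of downloading the full dataset first. The sketch below assumes the `train` split used above and relies on the `streaming=True` option of `load_dataset`; the helper names `stream_lemat_rho`, `contains_elements`, and `iter_matching` are hypothetical and only illustrate filtering on the `elements` field.

```python
def stream_lemat_rho():
    """Open the dataset lazily; rows are fetched as they are iterated."""
    # Imported here so the filtering helpers below work without `datasets`.
    from datasets import load_dataset

    return load_dataset("LeMaterial/LeMat-Rho", split="train", streaming=True)


def contains_elements(row, required):
    """True when every required element symbol appears in the row."""
    return set(required).issubset(row["elements"])


def iter_matching(rows, required):
    """Yield only the rows whose `elements` field includes all required symbols."""
    for row in rows:
        if contains_elements(row, required):
            yield row
```

For example, `next(iter_matching(stream_lemat_rho(), ["Li", "O"]))` would return the first streamed structure containing both lithium and oxygen.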
Data fields
| Feature name | Data type | Description | OPTIMADE required field |
|---|---|---|---|
| elements | Sequence[String] | A list of elements in the structure. For example, a structure with composition Li2O7 will have ["Li", "O"] in its elements. | ✅ |
| nsites | Integer | The total number of sites in the structure. For example a structure with an un-reduced composition of Li4O2 will have a total of 6 sites. | ✅ |
| chemical_formula_anonymous | String | Anonymous formula for a chemical structure, sorted by largest contributing species and reduced by the greatest common divisor. For example, a structure with an un-reduced composition of O2Li4 will have an anonymous formula of A2B. A trailing "1" in an element count is dropped (i.e. A2B, not A2B1). | ✅ |
| chemical_formula_reduced | String | Chemical composition reduced by the greatest common divisor. For example, a structure with an un-reduced composition of O2Li4 will have a reduced composition of Li2O. Elements with a reduced count of 1 have the "1" dropped. Elements are sorted alphabetically. Note: this differs from pymatgen's composition reduction, which takes into account elements that exist in diatomic states. | ✅ |
| chemical_formula_descriptive | String | A more descriptive chemical formula for the structure. For example, a fictive structure of a 6-fold hydrated Na ion might have a descriptive chemical formula of Na(H2O)6, and a titanium chloride organic dimer might have a descriptive formula of [(C5H5)2TiCl]2. Note: this field is not standardized across the database. Where available, the formula was scraped as-is from the respective database; otherwise it may be the same as chemical_formula_reduced. | ✅ |
| space_group_it_number | Integer | The international space group of the bulk structure as computed by Moyopy | ✅ |
| nelements | Integer | Total number of different elements in a structure. For example Li4O2 has only 2 separate elements. | ✅ |
| dimension_types | Sequence[Integer], shape = 3 | Periodic boundary conditions for a given structure. Because all of our materials are bulk materials for this database it is [1, 1, 1], meaning it is periodic in x, y, and z dimensions. | ✅ |
| nperiodic_dimensions | Integer | Number of periodic dimensions. Bulk materials have value 3. | ✅ |
| lattice_vectors | Sequence[Sequence[Float]], shape = 3×3 | The lattice matrix of the structure, one lattice vector per row. For example, a cubic system with a lattice parameter a=4.5 will have a [[4.5,0,0],[0,4.5,0],[0,0,4.5]] lattice vector entry. | ✅ |
| immutable_id | String | The material ID associated with the structure from the respective database. Note: OQMD IDs are simply integers, thus we converted them to be “oqmd-YYY” | ✅ |
| cartesian_site_positions | Sequence[Sequence[Float]], shape = N×3 | In cartesian units (not fractional units) the coordinates of the species. These match the ordering of all site based properties such as species_at_sites, magnetic_moments and forces. For example a material with a single element placed at a fractional coordinate of [0.5, 0.5, 0.5] with a cubic lattice with a=2, will have a cartesian_site_positions of [1, 1, 1]. | ✅ |
| species | JSON | An Optimade field that includes information about the species themselves, such as their mass, their name, their labels, etc. | ✅ |
| species_at_sites | Sequence[String] | An array of the chemical elements belonging to each site. For example, a structure with an un-reduced composition of Li2O2 may have an entry of ["Li", "Li", "O", "O"] for this field, where each species should match the other site-based properties such as cartesian_site_positions. | ✅ |
| last_modified | DateTime | The date that the entry was last modified from the respective database it was pulled from. | ✅ |
| elements_ratios | Sequence[Float] or Dictionary | The fractional composition for a given structure in dictionary format. For example, a structure with an un-reduced composition of Li2O4 would have an entry of {'Li': 0.3333, 'O': 0.6667}. | ✅ |
| stress_tensor | Sequence[Sequence[Float]], shape = 3×3 | The full 3×3 stress tensor in units of kbar (written kB in VASP output). Note: for OQMD, stress tensors were given in Voigt notation and were converted to the full tensor. | |
| energy | Float | The uncorrected energy from VASP in eV. | |
| energy_corrected | Float | Energy after any applied post-processing corrections in eV. If no correction is applied, this equals energy. | |
| magnetic_moments | Sequence[Float] | The magnetic moment per site given in µB. | |
| forces | Sequence[Sequence[Float]], shape = N×3 | The force on each site in the x, y, and z directions, given in eV/Å. Sites follow the same order as the other site-specific fields. | |
| total_magnetization | Float | The total magnetization of the structure in µB. | |
| charges | Sequence[Float] | Site-resolved atomic charges computed using the default charge partitioning scheme, if available. | |
| dos_ef | Float | Density of states evaluated at the Fermi level. Units depend on the underlying electronic structure calculation. | |
| functional | String | The functional used to calculate the data point in the row. Here all entries should be r2scan. | |
| cross_compatibility | Boolean | Whether or not this data can be mixed with other rows from a DFT calculation parameter perspective. | |
| bawl_fingerprint | String | Unique materials fingerprint generated using the BAWL fingerprint methodology. More details in the LeMaterial blogpost | |
| compressed_charge_density | Compressed Array / Nested Sequence[Float] | Compressed representation of the valence charge density grid. Intended for reconstruction of the full charge density. | |
| compressed_aeccar0 | Compressed Array / Nested Sequence[Float] | Compressed representation of the VASP AECCAR0 charge-density file containing core-electron contributions. | |
| compressed_aeccar1 | Compressed Array / Nested Sequence[Float] | Compressed representation of the VASP AECCAR1 charge-density file containing valence pseudo-charge contributions. | |
| compressed_aeccar2 | Compressed Array / Nested Sequence[Float] | Compressed representation of the VASP AECCAR2 charge-density file containing all-electron valence charge contributions. | |
| charge_density_grid_shape | Sequence[Integer], shape = 3 | Dimensions of the underlying charge-density grid. Example: [15, 15, 15]. | |
| bader_charges | Sequence[Float] | Atomic charges obtained from Bader charge analysis. Values correspond to the ordering in species_at_sites. | |
| bader_atomic_volume | Sequence[Float] | Atomic volumes from Bader partitioning, typically reported in ų. Ordering matches species_at_sites. | |
| ddec6_charges | Sequence[Float] | Atomic charges obtained from the DDEC6 charge partitioning method. May be null if unavailable. | |
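Several of the composition fields in the table can be derived from `species_at_sites` alone. The sketch below follows the conventions stated in the field descriptions (alphabetical ordering, reduction by the greatest common divisor, dropped "1"); it is an illustration and not the code used to build the dataset, and the function name `describe_composition` is hypothetical. Anonymous labels are limited here to 26 distinct elements for simplicity.

```python
from collections import Counter
from functools import reduce
from math import gcd
from string import ascii_uppercase


def _term(symbol, count):
    """Format one formula term, dropping a count of 1."""
    return symbol if count == 1 else f"{symbol}{count}"


def describe_composition(species_at_sites):
    """Derive composition fields from the per-site element list."""
    counts = Counter(species_at_sites)
    nsites = len(species_at_sites)
    elements = sorted(counts)  # alphabetical order
    divisor = reduce(gcd, counts.values())
    reduced = {el: counts[el] // divisor for el in elements}

    # Anonymous formula: largest contribution first, labelled A, B, C, ...
    ordered = sorted(reduced.values(), reverse=True)
    anonymous = "".join(_term(ascii_uppercase[i], n) for i, n in enumerate(ordered))

    return {
        "elements": elements,
        "nelements": len(elements),
        "nsites": nsites,
        "chemical_formula_reduced": "".join(_term(el, reduced[el]) for el in elements),
        "chemical_formula_anonymous": anonymous,
        "elements_ratios": [counts[el] / nsites for el in elements],
    }
```

For the Li4O2 example used in the table, this yields 6 sites, a reduced formula of Li2O, and an anonymous formula of A2B.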
Notes
elements_ratios
Unlike the original schema, this dataset stores elemental fractions as a sequence rather than a dictionary. The values correspond to the ordering in the elements field.
Example:
{
"elements": ["Be", "N", "Tb"],
"elements_ratios": [0.25, 0.50, 0.25]
}
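If the dictionary form is more convenient, it can be rebuilt by pairing the two fields. A minimal sketch, assuming both fields are present and equally long in every row (the helper name `ratios_as_dict` is illustrative):

```python
def ratios_as_dict(row):
    """Pair each element symbol with its fractional composition."""
    elements = row["elements"]
    ratios = row["elements_ratios"]
    if len(elements) != len(ratios):
        raise ValueError("elements and elements_ratios must have the same length")
    return dict(zip(elements, ratios))
```

For the example above this gives a mapping with Be at 0.25, N at 0.50, and Tb at 0.25.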
Charge Density Fields
The fields compressed_charge_density, compressed_aeccar0, compressed_aeccar1, and compressed_aeccar2 store compressed volumetric electronic density data. These fields can be used to reconstruct charge density distributions and related electronic structure properties.
To support different machine-learning and data-analysis workflows, LeMat-Rho is distributed in two compressed charge-density formats on Hugging Face.
Fixed Grid Dataset
The default LeMat-Rho dataset stores charge densities on a standardized 15×15×15 grid generated with pyrho for each material. This representation provides a uniform tensor shape across the entire dataset, simplifying batching, storage, and training of machine-learning models.
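Because every fixed-grid entry shares the same shape, each density can be flattened into a feature vector of identical length. The sketch below assumes the density field arrives as nested Python lists, as when indexing the dataset directly; after `to_pandas()` the values may instead be NumPy arrays, which this helper does not handle. The function names are illustrative.

```python
def infer_shape(nested):
    """Return the shape of a regular nested list, e.g. [15, 15, 15]."""
    shape = []
    level = nested
    while isinstance(level, (list, tuple)):
        shape.append(len(level))
        if not level:
            break
        level = level[0]
    return shape


def flatten(nested):
    """Flatten nested lists into a single list, last axis varying fastest."""
    if not isinstance(nested, (list, tuple)):
        return [nested]
    flat = []
    for item in nested:
        flat.extend(flatten(item))
    return flat


def to_feature_vector(row, field="compressed_charge_density"):
    """Check the grid against charge_density_grid_shape, then flatten it."""
    grid = row[field]
    expected = list(row["charge_density_grid_shape"])
    if infer_shape(grid) != expected:
        raise ValueError(f"expected shape {expected}, got {infer_shape(grid)}")
    return flatten(grid)
```

A 15×15×15 grid flattens to 3375 values, so a batch of rows maps directly onto a fixed-width matrix.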
Adaptive Grid Dataset
For applications where preserving a consistent spatial resolution is more important than maintaining a fixed tensor shape, we additionally provide an adaptive-grid version of the dataset. This representation better preserves the spatial features of the original charge density and is often preferable for physical analysis.
In this representation, the number of grid points along each lattice direction is determined from the corresponding lattice-vector length:
grid_points[i] = max(5, ceil(lattice_length[i] / 0.2))
This produces approximately one grid point every 0.2 Å along each crystallographic direction.
Examples:
- 5 Å lattice vector → 25 grid points
- 10 Å lattice vector → 50 grid points
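The grid rule can be evaluated directly from the `lattice_vectors` field. This is a minimal sketch of the formula above, not the dataset's own implementation; the small tolerance subtracted before `ceil` is an added assumption to keep floating-point round-off from bumping an exact multiple of the spacing up by one point.

```python
import math


def lattice_lengths(lattice_vectors):
    """Length of each lattice vector (rows of the 3×3 matrix), in Å."""
    return [math.sqrt(sum(c * c for c in vector)) for vector in lattice_vectors]


def adaptive_grid_shape(lattice_vectors, spacing=0.2, minimum=5):
    """Apply grid_points[i] = max(minimum, ceil(length[i] / spacing))."""
    return [
        # 1e-9 guards against round-off only; it is not part of the stated rule.
        max(minimum, math.ceil(length / spacing - 1e-9))
        for length in lattice_lengths(lattice_vectors)
    ]
```

For an orthorhombic cell with lattice lengths of 5 Å, 10 Å, and 0.6 Å, the function returns 25 and 50 points for the first two axes and falls back to the minimum of 5 for the third.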
Atomic Charge Analyses
The bader_charges, bader_atomic_volume, and ddec6_charges fields contain atom-resolved properties. The ordering of values matches the ordering used in:
- species_at_sites
- cartesian_site_positions
- forces
- magnetic_moments
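Since these fields share one site ordering, per-site values can be grouped by element with a simple `zip`. The sketch below averages a per-site field over each element; the helper name `mean_by_element` is hypothetical, and null fields (possible for `ddec6_charges`) are returned as an empty mapping.

```python
from collections import defaultdict


def mean_by_element(row, field="bader_charges"):
    """Average a per-site property over all sites of each element."""
    values = row.get(field)
    if values is None:  # e.g. ddec6_charges may be null
        return {}
    grouped = defaultdict(list)
    for element, value in zip(row["species_at_sites"], values):
        grouped[element].append(value)
    return {element: sum(vals) / len(vals) for element, vals in grouped.items()}
```

Collecting these averages across rows is one way to build per-element charge distributions similar to those in Figure 5.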
S3 Bucket Structure and Access
The complete LeMat-Rho workflow outputs are publicly available through the AWS S3 bucket:
s3://lemat-rho
The bucket contains one directory per material, identified by its source database ID (e.g., Alexandria, Materials Project, or OQMD):
lemat-rho/
├── agm001987721/
│   ├── LeMatRhoPreStaticMaker/
│   │   ├── OUTCAR.gz
│   │   └── vasprun.xml.gz
│   ├── LeMatRhoRelaxMaker_1/
│   │   ├── OUTCAR.gz
│   │   └── vasprun.xml.gz
│   ├── LeMatRhoRelaxMaker_2/
│   │   ├── OUTCAR.gz
│   │   └── vasprun.xml.gz
│   └── LeMatRhoStaticMaker/
│       ├── CHG.gz
│       ├── AECCAR0.gz
│       ├── AECCAR1.gz
│       ├── AECCAR2.gz
│       ├── OUTCAR.gz
│       └── vasprun.xml.gz
├── mp-1234/
└── oqmd-5678/
The final charge-density outputs are located in the LeMatRhoStaticMaker directory:
| File | Description |
|---|---|
| CHG.gz | Total valence charge density |
| AECCAR0.gz | Core-electron charge density |
| AECCAR1.gz | Valence pseudo-charge density |
| AECCAR2.gz | All-electron valence charge density |
| vasprun.xml.gz | Complete VASP calculation output |
| OUTCAR.gz | Detailed VASP log and electronic-structure information |
Accessing Files
Individual files can be downloaded directly using the AWS CLI:
aws s3 cp s3://lemat-rho/agm001987721/LeMatRhoStaticMaker/CHG.gz .
or via HTTPS:
https://lemat-rho.s3.amazonaws.com/agm001987721/LeMatRhoStaticMaker/CHG.gz
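The same HTTPS pattern can be used from Python with only the standard library. This is a minimal sketch, assuming the bucket layout shown above; the helper names `static_file_url` and `fetch_text` are illustrative, and `fetch_text` holds the whole decompressed file in memory, which may be unsuitable for very large grids.

```python
import gzip
import urllib.request

BASE_URL = "https://lemat-rho.s3.amazonaws.com"
STATIC_DIR = "LeMatRhoStaticMaker"  # directory holding the final outputs


def static_file_url(material_id, filename="CHG.gz"):
    """Build the HTTPS URL of a file from the final static calculation."""
    return f"{BASE_URL}/{material_id}/{STATIC_DIR}/{filename}"


def fetch_text(url):
    """Download a gzip-compressed text file and return its decoded contents."""
    with urllib.request.urlopen(url) as response:
        return gzip.decompress(response.read()).decode("utf-8")
```

For example, `fetch_text(static_file_url("agm001987721"))` would return the decompressed CHG file as a string, which can then be written to disk or parsed with a tool such as pymatgen.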
The Hugging Face dataset contains compressed and processed representations of these charge densities, while the S3 bucket provides access to the original VASP outputs and volumetric charge-density files used to generate the dataset.
Software and Infrastructure
LeMat-Rho was generated using an automated high-throughput computational workflow built on the open-source LeMaterial ecosystem. Workflow execution and job management were orchestrated using FireWorks, Atomate, and JobFlow, enabling robust scheduling, monitoring, and recovery of large-scale density functional theory calculations.
Materials representations, structure analysis, and post-processing operations were performed using pymatgen. To facilitate machine-learning applications and large-scale data distribution, volumetric charge densities were compressed into a standardized 15×15×15 representation using the pyrho package.
All calculation outputs were curated and stored through the AWS Open Data infrastructure and subsequently distributed through the Hugging Face Datasets platform, providing efficient public access to the resulting dataset.
Citation Information
We are currently preparing a pre-print describing our methods, the materials fingerprint method, and the dataset. In the meantime, the following can be cited:
@misc{LeMat-Rho_2026,
author = {Mathilde L. D. Franckel and Richard Tran and Martin Siron and Daniel Speckhard and Georgia Channing and Guilherme Penedo and Ali Ramlaoui and Alexandre Duval and Jonathan Schmidt},
title = {LeMat-Rho: High-Fidelity Charge Density Dataset for Machine Learning and Atomistic Materials Modelling},
year = 2026,
url = {https://huggingface.co/datasets/LeMaterial/LeMat-Rho},
doi = {},
publisher = {Hugging Face}
}
Acknowledgements:
- LeMaterial is the Open Science initiative hosted by Entalpic
- This project was supported by Hugging Face
- The dataset was made possible by the AWS Open Data program
- Original LeMat-Bulk structures were selected from Alexandria, Materials Project, and OQMD
License
This database is licensed under the Creative Commons Attribution 4.0 (CC BY 4.0) License.
![agm001990318: PBE charge density from a static PBE calculation on the r²SCAN final structure (isosurface level: 0.056 e/ų)](https://cdn-uploads.huggingface.co/production/uploads/682dba9647537ba4f1f03397/gmTssRMgdKJWuFv2lcs4M.png)



