LeMat-Rho: High-Fidelity Charge Density Dataset for Atomistic Materials Modeling

Published June 18, 2026

Mathilde L. D. Franckel¹², Richard Tran², Martin Siron², Daniel Speckhard², Georgia Channing³, Guilherme Penedo³, Ali Ramlaoui², Alexandre Duval², Jonathan Schmidt⁴

¹Department of Materials, Imperial College, London, UK

²Entalpic, Paris, FR

³Hugging Face

⁴Department of Materials, EPFL, Lausanne, CH

Ab initio datasets have significantly advanced atomistic modeling by enabling the development of accurate machine learning interatomic potentials. Atom-centered models predict atomic energies and forces with high precision, but in density functional theory these quantities are themselves derived from the charge density obtained by solving the Kohn-Sham equations. Access to extensive, high-accuracy charge density datasets therefore gives atomistic machine-learning approaches additional physical insight.

As part of the LeMaterial Open Science initiative, this is the initial release of LeMat-Rho, an r²SCAN charge-density dataset encompassing 70,000 materials. This dataset provides the research community with high-quality quantum-mechanical data describing electron distributions in materials. The r²SCAN meta-GGA exchange-correlation functional provides a more accurate representation of bonding and energetics than generalized-gradient approximations such as PBE, as illustrated in Figure 1.

Figure 1.a: PBE charge density from a static PBE calculation on the final r²SCAN structure of agm001990318 (isosurface level: 0.056 e/Å³).

Figure 1.b: r²SCAN charge density of the agm001990318 compound taken from the Alexandria dataset (isosurface level: 0.056 e/Å³).

The workflow automates high-throughput density functional theory (DFT) calculations, taking input crystal structures from LeMat-Bulk and passing them through a configurable workflow builder to the VASP ab initio code. The pipeline achieves robust convergence and high-fidelity electronic structure data using a staged PBE-to-r²SCAN workflow. It starts with a PBE static pre-calculation that aims to produce reasonable initial wavefunctions at a low computational cost. This is followed by two consecutive r²SCAN relaxations and a final static evaluation. Two relaxations are used because the plane-wave basis set is fixed at the start of each run: as the cell volume equilibrates, the basis set must be reset to ensure optimal convergence.

The approach produces converged electronic-structure outputs, including the charge density, wavefunction, energies, and forces. All simulation results are systematically curated and stored in AWS Open Data. The dataset is now publicly available on Hugging Face, serving as an open resource for materials discovery, electronic structure analysis, and machine learning research.

The charge densities are fed into a Bader analysis code to provide information on charge localization and chemical bonding. We then formatted the dataset for ease of use and published it on Hugging Face.

To make the most of the available computational resources, we used the high-throughput workflow tools Atomate and JobFlow. The Hugging Face dataset is populated using an array of tools from the pymatgen Python library.

Figure 2: Illustration of the LeMat-Rho implementation workflow

The resulting dataset is enriched in lithium- and cobalt-containing compounds because the initial crystal structures were sourced primarily from repositories developed for battery materials exploration, as shown in Figure 3. To maximize computational throughput and maintain consistency across calculations, the workflow preferentially selected structures with small primitive cells, thereby reducing the representation of larger, more compositionally complex oxide systems.

Figure 3: Periodic table heatmap showing the percentage occurrence of each chemical element across the dataset.

The workflow employed strict convergence criteria to ensure high-quality DFT results. Electronic self-consistent field (SCF) calculations were converged to an energy tolerance below 1 × 10⁻⁶ eV, while ionic relaxations were considered converged when the maximum residual force magnitude fell below 0.02 eV/Å. Overall workflow robustness was high. Across the completed dataset, 98% of structures achieved residual forces below 0.0182 eV/Å and residual stresses below 4.19 kbar, indicating strong structural and mechanical convergence as shown in Figure 4.

Figure 4.a: Distribution of atomic force magnitudes across the dataset. Forces were computed as the Euclidean norm of the atomic Cartesian force components. Dashed vertical lines indicate the 80th, 98th, and 99th percentiles of the force magnitude distribution.

Figure 4.b: Distribution of stress magnitudes across the dataset, separated into normal and shear stress contributions derived from the stress tensor components. Dashed vertical lines indicate the 80th, 98th, and 99th percentiles of the total stress magnitude distribution.
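
The force criterion above can be checked directly against the `forces` field of a dataset row. The following is a minimal sketch, assuming the forces are given as an N×3 array in eV/Å; the helper names `max_force_norm` and `is_relaxed` and the example values are hypothetical and not part of the LeMat-Rho tooling.

```python
import numpy as np

FORCE_TOL = 0.02  # eV/Å, ionic convergence threshold quoted in the text


def max_force_norm(forces):
    """Return the largest per-atom force magnitude (Euclidean norm) in eV/Å."""
    forces = np.asarray(forces, dtype=float).reshape(-1, 3)
    return float(np.linalg.norm(forces, axis=1).max())


def is_relaxed(forces, tol=FORCE_TOL):
    """True if every atomic force magnitude is below the tolerance."""
    return max_force_norm(forces) < tol


# Made-up forces for a two-atom cell, for illustration only.
example_forces = [[0.003, -0.004, 0.0], [0.0, 0.0, 0.012]]
print(max_force_norm(example_forces))
print(is_relaxed(example_forces))
```

The same per-atom norms could be collected over all rows to reproduce percentile statistics like those in Figure 4.a.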

To facilitate machine-learning applications and large-scale data distribution, volumetric charge densities were transformed into a standardized 15×15×15 representation using the pyrho package. This compression preserves the global spatial distribution of electron density while reducing storage requirements enough for Parquet-based datasets and efficient access through Hugging Face.

To enable broader analysis of the compounds' chemical properties, we provide Bader charges and Bader atomic volumes for each structure. These quantities describe how electron density is redistributed among atoms and serve as a measure of charge transfer and bonding character. Figure 5 shows the distribution of Bader charges for three representative elements, highlighting the diversity of chemical bonding in the dataset. These distributions indicate how atomic charge states vary between compounds and may serve as valuable descriptors for understanding structure–property relationships.

Figure 5: Histograms showing the distribution of Bader charges for oxygen (O), lithium (Li), and iron (Fe) atoms in the dataset. Oxygen exhibits predominantly negative charges, while lithium tends to be positively charged, consistent with their typical ionic behavior. Iron displays a broader distribution, reflecting its diverse oxidation states and bonding environments across different compounds.
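
Per-element distributions like those in Figure 5 can be assembled by grouping `bader_charges` by the element listed at the same index in `species_at_sites`. The sketch below assumes rows are plain dictionaries with those two fields; the function name `bader_charges_by_element` and the charge values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def bader_charges_by_element(rows):
    """Collect Bader charges per element across an iterable of dataset rows."""
    grouped = defaultdict(list)
    for row in rows:
        species = row["species_at_sites"]
        charges = row["bader_charges"]
        if charges is None or len(charges) != len(species):
            continue  # skip rows without usable Bader data
        for element, charge in zip(species, charges):
            grouped[element].append(charge)
    return dict(grouped)


# Synthetic rows with invented charges, for illustration only.
rows = [
    {"species_at_sites": ["Li", "Li", "O"], "bader_charges": [0.9, 0.9, -1.8]},
    {"species_at_sites": ["Fe", "O"], "bader_charges": [1.4, -1.4]},
]
by_element = bader_charges_by_element(rows)
for element, values in sorted(by_element.items()):
    print(element, len(values), round(mean(values), 3))
```

The resulting lists can be passed to any histogram routine to compare elements.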

This dataset of charge densities provides a compact and scalable representation of electronic structure information across a wide range of materials. We hope it will support the development of new machine learning models capable of learning complex material properties directly from electron density data, while also enabling faster initialization of DFT calculations.

By releasing this dataset, we hope to contribute to the accelerating progress of computational materials science.

Download and use within Python

from datasets import load_dataset

dataset = load_dataset('LeMaterial/LeMat-Rho')

# convert to Pandas, if you prefer working with this type of object:
df = dataset['train'].to_pandas()

Data fields

Feature name	Data type	Description	OPTIMADE required field
elements	Sequence[String]	A list of elements in the structure. For example, a structure with composition Li2O7 will have ["Li", "O"] in its elements.	✅
nsites	Integer	The total number of sites in the structure. For example a structure with an un-reduced composition of Li4O2 will have a total of 6 sites.	✅
chemical_formula_anonymous	String	Anonymous formula for a chemical structure, sorted by largest contributing species and reduced by the greatest common divisor. For example, a structure with an un-reduced composition of O2Li4 will have an anonymous formula of A2B. A trailing "1" in an element's amount is dropped (i.e., not A2B1).	✅
chemical_formula_reduced	String	Chemical composition reduced by the greatest common divisor. For example, a structure with an un-reduced composition of O2Li4 will have a reduced composition of Li2O. Elements with a reduced amount of 1 have the "1" dropped. Elements are sorted alphabetically. Note: this differs from pymatgen's composition reduction method, which takes into account that certain elements exist in diatomic states.	✅
chemical_formula_descriptive	String	A more descriptive chemical formula for the structure. For example, a fictive structure of a 6-fold hydrated Na ion might have a descriptive chemical formula of Na(H2O)6, or a titanium chloride organic dimer might have a descriptive formula of [(C5H5)2TiCl]2. Note: this field is not standardized across the database. Where available, it was scraped as is from the respective databases; otherwise it may be the same as the reduced chemical formula.	✅
space_group_it_number	Integer	The international space group of the bulk structure as computed by Moyopy	✅
nelements	Integer	Total number of different elements in a structure. For example Li4O2 has only 2 separate elements.	✅
dimension_types	Sequence[Integer], shape = 3	Periodic boundary conditions for a given structure. Because all of our materials are bulk materials for this database it is [1, 1, 1], meaning it is periodic in x, y, and z dimensions.	✅
nperiodic_dimensions	Integer	Number of periodic dimensions. Bulk materials have value `3`.	✅
lattice_vectors	Sequence[Sequence[Float]], shape = 3×3	The lattice matrix of the structure. For example, a cubic system with a lattice parameter a=4.5 will have a [[4.5,0,0],[0,4.5,0],[0,0,4.5]] lattice vector entry.	✅
immutable_id	String	The material ID associated with the structure from the respective database. Note: OQMD IDs are simply integers, thus we converted them to be “oqmd-YYY”	✅
cartesian_site_positions	Sequence[Sequence[Float]], shape = N×3	The coordinates of each site in Cartesian units (not fractional units). These match the ordering of all site-based properties such as species_at_sites, magnetic_moments, and forces. For example, a material with a single element placed at a fractional coordinate of [0.5, 0.5, 0.5] in a cubic lattice with a=2 will have a cartesian_site_positions of [[1, 1, 1]].	✅
species	JSON	An Optimade field that includes information about the species themselves, such as their mass, their name, their labels, etc.	✅
species_at_sites	Sequence[String]	An array of the chemical elements belonging to each site. For example, a structure with an un-reduced composition of Li2O2 may have an entry of ["Li", "Li", "O", "O"] for this field, where each species should match the other site-based properties such as cartesian_site_positions.	✅
last_modified	DateTime	The date that the entry was last modified from the respective database it was pulled from.	✅
elements_ratios	Sequence[Float] or Dictionary	The fractional composition for a given structure. In dictionary format, a structure with an unreduced composition of Li2O4 would have an entry of {'Li': 0.3333, 'O': 0.6667}. In this dataset the values are stored as a sequence ordered like `elements` (see Notes).	✅
stress_tensor	Sequence[Sequence[Float]], shape = 3×3	The full 3×3 stress tensor in units of kB (kilobar). Note: for OQMD, stress tensors were given in Voigt notation and were converted to the full tensor.
energy	Float	The uncorrected energy from VASP in eV.
energy_corrected	Float	Energy after any applied post-processing corrections in eV. If no correction is applied, this equals `energy`.
magnetic_moments	Sequence[Float]	The magnetic moment per site given in µB.
forces	Sequence[Sequence[Float]], shape = N×3	The force on each site in the x, y, and z directions, given in eV/Å. Sites follow the same ordering as the other site-specific fields.
total_magnetization	Float	The total magnetization of the structure in µB.
charges	Sequence[Float]	Site-resolved atomic charges computed using the default charge partitioning scheme, if available.
dos_ef	Float	Density of states evaluated at the Fermi level. Units depend on the underlying electronic structure calculation.
functional	String	What functional was used to calculate the data point in the row. Here all entries should be `r2scan`.
cross_compatibility	Boolean	Whether or not this data can be mixed with other rows from a DFT calculation parameter perspective.
bawl_fingerprint	String	Unique materials fingerprint generated using the BAWL fingerprint methodology. More details in the LeMaterial blogpost
compressed_charge_density	Compressed Array / Nested Sequence[Float]	Compressed representation of the valence charge density grid. Intended for reconstruction of the full charge density.
compressed_aeccar0	Compressed Array / Nested Sequence[Float]	Compressed representation of the VASP AECCAR0 charge-density file containing core-electron contributions.
compressed_aeccar1	Compressed Array / Nested Sequence[Float]	Compressed representation of the VASP AECCAR1 charge-density file containing valence pseudo-charge contributions.
compressed_aeccar2	Compressed Array / Nested Sequence[Float]	Compressed representation of the VASP AECCAR2 charge-density file containing all-electron valence charge contributions.
charge_density_grid_shape	Sequence[Integer], shape = 3	Dimensions of the underlying charge-density grid. Example: `[15, 15, 15]`.
bader_charges	Sequence[Float]	Atomic charges obtained from Bader charge analysis. Values correspond to the ordering in `species_at_sites`.
bader_atomic_volume	Sequence[Float]	Atomic volumes from Bader partitioning, typically reported in Å³. Ordering matches `species_at_sites`.
ddec6_charges	Sequence[Float]	Atomic charges obtained from the DDEC6 charge partitioning method. May be null if unavailable.
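
The formula conventions described for `chemical_formula_reduced` and `chemical_formula_anonymous` can be reproduced from `species_at_sites`. This is a minimal sketch of those rules as stated in the table, not the code used to build the dataset; the function names are hypothetical, and the anonymous formula is assumed to need at most 26 letters.

```python
from collections import Counter
from functools import reduce
from math import gcd
from string import ascii_uppercase


def reduced_formula(species_at_sites):
    """Divide element counts by their GCD and sort elements alphabetically."""
    counts = Counter(species_at_sites)
    divisor = reduce(gcd, counts.values())
    parts = []
    for element in sorted(counts):
        n = counts[element] // divisor
        parts.append(element if n == 1 else f"{element}{n}")
    return "".join(parts)


def anonymous_formula(species_at_sites):
    """Replace elements by A, B, ... ordered by decreasing reduced amount."""
    counts = Counter(species_at_sites)
    divisor = reduce(gcd, counts.values())
    amounts = sorted((n // divisor for n in counts.values()), reverse=True)
    return "".join(
        letter if n == 1 else f"{letter}{n}"
        for letter, n in zip(ascii_uppercase, amounts)
    )


# The O2Li4 example from the table.
sites = ["O"] * 2 + ["Li"] * 4
print(reduced_formula(sites))    # Li2O
print(anonymous_formula(sites))  # A2B
```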

Notes

`elements_ratios`

Unlike the original schema, this dataset stores elemental fractions as a sequence rather than a dictionary. The values correspond to the ordering in the elements field.

Example:

{
  "elements": ["Be", "N", "Tb"],
  "elements_ratios": [0.25, 0.50, 0.25]
}
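
If the dictionary form of the original schema is more convenient, it can be rebuilt by pairing the two sequences. A small sketch, assuming a row is available as a plain dictionary; `ratios_as_dict` is a hypothetical helper name.

```python
def ratios_as_dict(row):
    """Pair each element with its fractional composition."""
    elements = row["elements"]
    ratios = row["elements_ratios"]
    if len(elements) != len(ratios):
        raise ValueError("elements and elements_ratios differ in length")
    return dict(zip(elements, ratios))


# The example row shown above.
row = {"elements": ["Be", "N", "Tb"], "elements_ratios": [0.25, 0.50, 0.25]}
print(ratios_as_dict(row))
```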

Charge Density Fields

The fields compressed_charge_density, compressed_aeccar0, compressed_aeccar1, and compressed_aeccar2 store compressed volumetric electronic density data. These fields can be used to reconstruct charge density distributions and related electronic structure properties.

To support different machine-learning and data-analysis workflows, LeMat-Rho is distributed in two compressed charge-density formats on Hugging Face.

Fixed Grid Dataset

The default LeMat-Rho dataset stores charge densities on a standardized 15×15×15 grid generated with pyrho for each material. This representation provides a uniform tensor shape across the entire dataset, simplifying batching, storage, and training of machine-learning models.
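
The batching benefit of a uniform shape can be illustrated as follows. This sketch assumes that a decoded `compressed_charge_density` entry is a nested (or flat, C-ordered) sequence of floats whose shape is given by `charge_density_grid_shape`; the actual on-disk encoding may differ, and the rows below are random placeholders rather than real densities.

```python
import numpy as np


def to_grid(values, shape=(15, 15, 15)):
    """Convert a nested or flat sequence of floats into a 3D array."""
    grid = np.asarray(values, dtype=np.float32)
    if grid.size != int(np.prod(shape)):
        raise ValueError(f"expected {shape}, got {grid.size} values")
    return grid.reshape(shape)


def stack_batch(rows, field="compressed_charge_density"):
    """Stack same-shaped grids from several rows into one batch tensor."""
    grids = [to_grid(r[field], tuple(r["charge_density_grid_shape"])) for r in rows]
    return np.stack(grids)


# Placeholder rows filled with random numbers, for illustration only.
rng = np.random.default_rng(0)
fake_rows = [
    {
        "compressed_charge_density": rng.random((15, 15, 15)).tolist(),
        "charge_density_grid_shape": [15, 15, 15],
    }
    for _ in range(4)
]
batch = stack_batch(fake_rows)
print(batch.shape)  # (4, 15, 15, 15)
```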

Adaptive Grid Dataset

For applications where preserving a consistent spatial resolution is more important than maintaining a fixed tensor shape, we additionally provide an adaptive-grid version of the dataset. This representation better preserves the spatial features of the original charge density and is often preferable for physical analysis.

In this representation, the number of grid points along each lattice direction is determined from the corresponding lattice-vector length:

grid_points[i] = max(5, ceil(lattice_length[i] / 0.2))

This produces approximately one grid point every 0.2 Å along each crystallographic direction.

Examples:

5 Å lattice vector → 25 grid points
10 Å lattice vector → 50 grid points
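
The grid-size rule above can be evaluated for a full cell from its `lattice_vectors`. A minimal sketch, assuming each row of the 3×3 matrix is one lattice vector; `adaptive_grid_shape` is a hypothetical name, and the spacing and minimum are taken from the formula.

```python
import math

import numpy as np


def adaptive_grid_shape(lattice_vectors, spacing=0.2, min_points=5):
    """Grid points per direction: max(min_points, ceil(length / spacing))."""
    lengths = np.linalg.norm(np.asarray(lattice_vectors, dtype=float), axis=1)
    return tuple(max(min_points, math.ceil(length / spacing)) for length in lengths)


# Orthorhombic cell with edges of 5 Å, 10 Å, and 0.5 Å (the last hits the minimum).
print(adaptive_grid_shape([[5, 0, 0], [0, 10, 0], [0, 0, 0.5]]))  # (25, 50, 5)
```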

Atomic Charge Analyses

The bader_charges, bader_atomic_volume, and ddec6_charges fields contain atom-resolved properties. The ordering of values matches the ordering used in:

species_at_sites
cartesian_site_positions
forces
magnetic_moments
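
Because these fields share one site ordering, they can be zipped into per-site records. The sketch below also converts Cartesian positions to fractional coordinates, assuming each row of `lattice_vectors` is one lattice vector; the helper names and the charge value are hypothetical.

```python
import numpy as np


def cartesian_to_fractional(cartesian_site_positions, lattice_vectors):
    """Solve cart = frac @ lattice for frac (lattice vectors stored as rows)."""
    lattice = np.asarray(lattice_vectors, dtype=float)
    cart = np.asarray(cartesian_site_positions, dtype=float).reshape(-1, 3)
    return cart @ np.linalg.inv(lattice)


def site_records(row):
    """Combine site-resolved fields that share the same ordering."""
    frac = cartesian_to_fractional(row["cartesian_site_positions"], row["lattice_vectors"])
    return [
        {"species": s, "fractional": f.tolist(), "bader_charge": q}
        for s, f, q in zip(row["species_at_sites"], frac, row["bader_charges"])
    ]


# Cubic cell with a = 2 Å and one site at the cell center (charge value is made up).
row = {
    "lattice_vectors": [[2, 0, 0], [0, 2, 0], [0, 0, 2]],
    "cartesian_site_positions": [[1, 1, 1]],
    "species_at_sites": ["Li"],
    "bader_charges": [0.0],
}
records = site_records(row)
print(records[0])
```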

S3 Bucket Structure and Access

The complete LeMat-Rho workflow outputs are publicly available through the AWS S3 bucket:

s3://lemat-rho

The bucket contains one directory per material, identified by its source database ID (e.g., Alexandria, Materials Project, or OQMD):

lemat-rho/
├── agm001987721/
│   ├── LeMatRhoPreStaticMaker/
│   │   ├── OUTCAR.gz
│   │   └── vasprun.xml.gz
│   ├── LeMatRhoRelaxMaker_1/
│   │   ├── OUTCAR.gz
│   │   └── vasprun.xml.gz
│   ├── LeMatRhoRelaxMaker_2/
│   │   ├── OUTCAR.gz
│   │   └── vasprun.xml.gz
│   └── LeMatRhoStaticMaker/
│       ├── CHG.gz
│       ├── AECCAR0.gz
│       ├── AECCAR1.gz
│       ├── AECCAR2.gz
│       ├── OUTCAR.gz
│       └── vasprun.xml.gz
├── mp-1234/
└── oqmd-5678/

The final charge-density outputs are located in the LeMatRhoStaticMaker directory:

File	Description
`CHG.gz`	Total valence charge density
`AECCAR0.gz`	Core-electron charge density
`AECCAR1.gz`	Valence pseudo-charge density
`AECCAR2.gz`	All-electron valence charge density
`vasprun.xml.gz`	Complete VASP calculation output
`OUTCAR.gz`	Detailed VASP log and electronic-structure information

Accessing Files

Individual files can be downloaded directly using the AWS CLI:

aws s3 cp s3://lemat-rho/agm001987721/LeMatRhoStaticMaker/CHG.gz .

or via HTTPS:

https://lemat-rho.s3.amazonaws.com/agm001987721/LeMatRhoStaticMaker/CHG.gz

The Hugging Face dataset contains compressed and processed representations of these charge densities, while the S3 bucket provides access to the original VASP outputs and volumetric charge-density files used to generate the dataset.
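
For scripted access, the HTTPS URLs can be built from the directory layout shown above. This is a sketch under the assumption that every material follows the same four-directory layout; `file_url` and `fetch_text` are hypothetical helpers, and `fetch_text` requires network access and is not called here.

```python
import gzip
import urllib.request

BUCKET_URL = "https://lemat-rho.s3.amazonaws.com"

# Workflow stage directories as they appear in the bucket listing above.
STAGE_DIRS = (
    "LeMatRhoPreStaticMaker",
    "LeMatRhoRelaxMaker_1",
    "LeMatRhoRelaxMaker_2",
    "LeMatRhoStaticMaker",
)
CHARGE_FILES = ("CHG.gz", "AECCAR0.gz", "AECCAR1.gz", "AECCAR2.gz")


def file_url(material_id, filename, stage="LeMatRhoStaticMaker"):
    """Build the HTTPS URL of one output file for a given material."""
    if stage not in STAGE_DIRS:
        raise ValueError(f"unknown stage directory: {stage}")
    return f"{BUCKET_URL}/{material_id}/{stage}/{filename}"


def fetch_text(url, timeout=60):
    """Download a gzip-compressed text file and return its decoded contents."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return gzip.decompress(response.read()).decode("utf-8", errors="replace")


urls = [file_url("agm001987721", name) for name in CHARGE_FILES]
print(urls[0])
```

Volumetric files can be large, so streaming to disk may be preferable to decompressing in memory for bulk downloads.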

Software and Infrastructure

LeMat-Rho was generated using an automated high-throughput computational workflow built on the open-source LeMaterial ecosystem. Workflow execution and job management were orchestrated using FireWorks, Atomate, and JobFlow, enabling robust scheduling, monitoring, and recovery of large-scale density functional theory calculations.

Materials representations, structure analysis, and post-processing operations were performed using pymatgen. To facilitate machine-learning applications and large-scale data distribution, volumetric charge densities were compressed into a standardized 15×15×15 representation using the pyrho package.

All calculation outputs were curated and stored through the AWS Open Data infrastructure and subsequently distributed through the Hugging Face Datasets platform, providing efficient public access to the resulting dataset.

Citation Information

We are currently preparing a pre-print describing our methods, the materials fingerprint method, and the dataset. In the meantime, the following can be cited:

@misc{LeMat-Rho_2026,
    author = {Mathilde L. D. Franckel and Richard Tran and Martin Siron and Daniel Speckhard and Georgia Channing and Guilherme Penedo and Ali Ramlaoui and Alexandre Duval and Jonathan Schmidt},
    title = {LeMat-Rho: High-Fidelity Charge Density Dataset for Machine Learning and Atomistic Materials Modelling},
    year = 2026,
    url = {https://huggingface.co/datasets/LeMaterial/LeMat-Rho},
    doi = {},
    publisher = {Hugging Face}
}

Acknowledgements:

LeMaterial is the Open Science initiative hosted by Entalpic
This project was supported by Hugging Face
The dataset was made possible by the AWS Open Data program
Original LeMat-Bulk structures were selected from Alexandria, Materials Project, and OQMD

License

This database is licensed under the Creative Commons Attribution 4.0 License.
