6.06 TB
62,492 files
Updated 28 days ago

Major TOM Core AlphaEarth Embeddings Subset

This is a prototype dataset. It only includes some of the AlphaEarth embeddings stored in Major TOM grid cells.

This dataset is aimed mainly at experimentation and prototyping. It is particularly useful alongside other datasets published within the Major TOM project.

Content

| Field | Type | Description |
|---|---|---|
| grid_cell | string | Major TOM cell |
| year | int | Year of the sample |
| thumbnail | image | 3-dimensional PCA (jpeg-compressed) |
| centre_lat | float | Latitude of the fragment centre |
| centre_lon | float | Longitude of the fragment centre |
| subdir | string | Subdirectory where the .tif file is stored |
| embedding | array | Average embedding for the grid cell [1,64] |
| utm_crs | string | CRS of the original product |
| utm_footprint | string | Polygon footprint (image UTM) of the fragment |
| geometry | geometry | Polygon footprint (WGS84) of the fragment |
| geotransform | array | Geotransform of the footprint |
| grid_row_u | int | Major TOM cell row |
| grid_col_r | int | Major TOM cell col |
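
The `embedding` field holds one averaged 64-dimensional vector per grid cell, which lends itself to quick similarity searches across cells. The following is a minimal sketch rather than part of the dataset tooling: the helper name `most_similar` is hypothetical, the input is synthetic random data standing in for the real column, and it assumes the column has already been loaded into a NumPy array of shape (n, 1, 64) or (n, 64).

```python
import numpy as np

def most_similar(embeddings: np.ndarray, query_idx: int, k: int = 3) -> np.ndarray:
    """Return the indices of the k rows closest to row `query_idx` by cosine similarity."""
    # Flatten a possible (n, 1, 64) layout to (n, 64).
    flat = embeddings.reshape(len(embeddings), -1).astype(np.float64)
    # Normalise each row so that a dot product equals cosine similarity.
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.clip(norms, 1e-12, None)
    scores = unit @ unit[query_idx]
    scores[query_idx] = -np.inf  # never return the query row itself
    return np.argsort(-scores)[:k]

# Synthetic stand-in for the `embedding` column (assumption: 5 cells, shape [1, 64] each).
rng = np.random.default_rng(0)
fake = rng.normal(size=(5, 1, 64))
# Make row 3 point in almost the same direction as row 0.
fake[3] = 2.0 * fake[0] + 1e-3 * rng.normal(size=(1, 64))

neighbours = most_similar(fake, query_idx=0, k=2)
print(neighbours)
```

In this synthetic setup the first returned index is the row that was constructed to resemble the query. With the real metadata, the returned indices could be used to look up the matching `grid_cell` and `subdir` values.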

Example Access

from fsspec.parquet import open_parquet_file
import pyarrow.parquet as pq
import rasterio as rio

# 1. Read the metadata (only the columns needed to locate a file)
metadata_url = "https://huggingface.co/datasets/Major-TOM/Core-AlphaEarth-Embeddings/resolve/main/metadata.parquet"
columns = ["grid_cell", "subdir"]
row_idx = 0  # index of the Parquet row group to fetch, not of a single row

with open_parquet_file(metadata_url, columns=columns) as f:
    with pq.ParquetFile(f) as pf:
        row_group = pf.read_row_group(row_idx, columns=columns)

# Take the first sample in that row group
subdir = row_group["subdir"][0].as_py()
grid_cell = row_group["grid_cell"][0].as_py()

# 2. Read the data directly from the remote GeoTIFF via GDAL's /vsicurl/ handler
file_url = f"https://huggingface.co/datasets/Major-TOM/Core-AlphaEarth-Embeddings/resolve/main/{subdir}/{grid_cell}.tif"

with rio.open(f"/vsicurl/{file_url}") as src:
    embedding_array = src.read()  # NumPy array of shape (bands, height, width)
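
Once a tile is loaded, it is often convenient to work with one vector per pixel rather than a band-first raster. The sketch below shows one way to do that, under stated assumptions: `tile_url` and `to_pixel_vectors` are hypothetical helper names, the URL pattern is copied from the example above, and a small synthetic array stands in for a real tile (assumed here to have 64 bands, matching the embedding dimensionality described in this README).

```python
import numpy as np

BASE_URL = "https://huggingface.co/datasets/Major-TOM/Core-AlphaEarth-Embeddings/resolve/main"

def tile_url(subdir: str, grid_cell: str) -> str:
    """Build the GeoTIFF URL for one grid cell, following the pattern used in the example."""
    return f"{BASE_URL}/{subdir}/{grid_cell}.tif"

def to_pixel_vectors(embedding_array: np.ndarray) -> np.ndarray:
    """Reshape a band-first (bands, height, width) array to (height * width, bands)."""
    bands, height, width = embedding_array.shape
    return embedding_array.reshape(bands, height * width).T

# Synthetic stand-in for `embedding_array` (assumption: 64 bands, tiny 4 x 5 raster).
demo = np.arange(64 * 4 * 5, dtype=np.float32).reshape(64, 4, 5)
vectors = to_pixel_vectors(demo)
print(vectors.shape)

# Placeholder values for illustration only; real ones come from metadata.parquet.
print(tile_url("some_subdir", "some_cell"))
```

Row `i` of `vectors` corresponds to the pixel at row `i // width`, column `i % width` of the original raster, so results computed per vector can be reshaped back to an image with `.reshape(height, width)`.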

Coverage (zoom in)

This is a sample dataset covering 62,489 grid cells, each containing 1,068 by 1,068 embeddings of dimensionality 64. In total, that amounts to 71,276,453,136 (71 billion) individual embeddings, covering over 7 million square kilometers.
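
The totals quoted above can be reproduced with a few lines of arithmetic. This is a quick check, not an official computation: the cell count and cell size come from the paragraph above, while the 10 m pixel spacing is an assumption made here to turn pixel counts into area.

```python
GRID_CELLS = 62_489      # grid cells covered (from the text above)
SIDE = 1_068             # embeddings per cell side (from the text above)
PIXEL_SIZE_M = 10.0      # assumption: 10 m spacing between embeddings

embeddings_per_cell = SIDE * SIDE
total_embeddings = GRID_CELLS * embeddings_per_cell

# Area of one cell in km^2, then summed over all cells.
cell_area_km2 = (SIDE * PIXEL_SIZE_M / 1000.0) ** 2
total_area_km2 = GRID_CELLS * cell_area_km2

print(f"{total_embeddings:,} embeddings")   # 71,276,453,136 embeddings
print(f"{total_area_km2:,.0f} km^2")
```

Under the 10 m assumption, the area comes out at roughly 7.13 million square kilometers, consistent with the "over 7 million" figure.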

[Image: coverage map]

Examples

[Eight example images]

Data Source

The source data is produced by Google and Google DeepMind.

🗄 Source Dataset: https://developers.google.com/earth-engine/datasets/catalog/GOOGLE_SATELLITE_EMBEDDING_V1_ANNUAL

📝 Blog post: https://deepmind.google/discover/blog/alphaearth-foundations-helps-map-our-planet-in-unprecedented-detail/

📜 Paper: https://arxiv.org/abs/2507.22291

AlphaEarth Citation:

@misc{brown2025alphaearthfoundationsembeddingfield,
      title={AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data}, 
      author={Christopher F. Brown and Michal R. Kazmierski and Valerie J. Pasquarella and William J. Rucklidge and Masha Samsikova and Chenhui Zhang and Evan Shelhamer and Estefania Lahera and Olivia Wiles and Simon Ilyushchenko and Noel Gorelick and Lihui Lydia Zhang and Sophia Alj and Emily Schechter and Sean Askay and Oliver Guinan and Rebecca Moore and Alexis Boukouvalas and Pushmeet Kohli},
      year={2025},
      eprint={2507.22291},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Major TOM Citation:

@inproceedings{Francis2024MajorTOM,
  author={Francis, Alistair and Czerkawski, Mikolaj},
  booktitle={IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium}, 
  title={Major TOM: Expandable Datasets for Earth Observation}, 
  year={2024},
  pages={2935-2940},
  doi={10.1109/IGARSS53475.2024.10640760}
}

@misc{Czerkawski2024EmbeddedMajorTOM,
      title={Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space}, 
      author={Mikolaj Czerkawski and Marcin Kluczek and Jędrzej S. Bojanowski},
      year={2024},
      eprint={2412.05600},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.05600}, 
}

Credits

This dataset was curated by Mikolaj Czerkawski (@mikonvergence) from Asterisk Labs.

Thank you to Cesar Aybar (https://github.com/csaybar) and Julio Contreras (https://github.com/JulioContrerasH) for supporting me with the core code used to acquire this dataset at scale.

Last updated
May 21