Core-AlphaEarth-Embeddings-bucket

6.06 TB

62,492 files

Updated 28 days ago

Name	Size	Uploaded	Xet hash
2017		28 days ago	695 items
2018		28 days ago	2,814 items
2019		28 days ago	12,072 items
2020		28 days ago	13,729 items
2021		28 days ago	15,506 items
2022		28 days ago	10,759 items
2023		28 days ago	6,903 items
2024		28 days ago	11 items
.gitattributes	10.1 MB xet	28 days ago	7d913d1e
README.md	7.31 kB xet	28 days ago	7d3a6df3
metadata.parquet	8.88 GB xet	28 days ago	65db1a2e

README.md

Major TOM Core AlphaEarth Embeddings Subset

This is a prototype dataset. It includes only a subset of the AlphaEarth embeddings, stored in Major TOM grid cells.

This dataset is mainly intended for experimentation and prototyping. It is particularly useful alongside other datasets published within the Major TOM project.

Content

Field	Type	Description
grid_cell	string	Major TOM cell
year	int	year of the sample
thumbnail	image	3-dimensional PCA (jpeg-compressed)
centre_lat	float	Latitude of the fragment centre
centre_lon	float	Longitude of the fragment centre
subdir	string	subdirectory where the .tif file is stored
embedding	array	average embedding for the grid cell [1,64]
utm_crs	string	CRS of the original product
utm_footprint	string	Polygon footprint (image UTM) of the fragment
geometry	geometry	Polygon footprint (WGS84) of the fragment
geotransform	array	geotransform of the footprint
grid_row_u	int	Major TOM cell row
grid_col_r	int	Major TOM cell col

Example Access

from fsspec.parquet import open_parquet_file
import pyarrow.parquet as pq
import rasterio as rio

# 1. Read the metadata (only the columns needed to locate a file)
metadata_url = "https://huggingface.co/datasets/Major-TOM/Core-AlphaEarth-Embeddings/resolve/main/metadata.parquet"
columns = ["grid_cell", "subdir"]
row_idx = 0  # index of the Parquet row group to read, not of a single row

with open_parquet_file(metadata_url, columns=columns) as f:
    with pq.ParquetFile(f) as pf:
        row_group = pf.read_row_group(row_idx, columns=columns)

# Take the first sample in that row group
subdir = row_group['subdir'][0].as_py()
grid_cell = row_group['grid_cell'][0].as_py()

# 2. Read the data (GDAL's /vsicurl/ prefix streams the remote .tif over HTTP)
file_url = f"https://huggingface.co/datasets/Major-TOM/Core-AlphaEarth-Embeddings/resolve/main/{subdir}/{grid_cell}.tif"

with rio.open(f"/vsicurl/{file_url}") as src:
    embedding_array = src.read()  # all bands, shape (bands, height, width)
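
Once a cell has been read, a common next step is to treat each pixel as one embedding vector. The sketch below shows one way to do that with NumPy, assuming the array is band-first, (bands, height, width), as rasterio's `read()` returns it, with one band per embedding dimension. It uses a small random array in place of a real cell so that it runs offline. The helpers `flatten_embeddings` and `mean_embedding` are hypothetical. The plain mean shown here is not necessarily how the `embedding` field in the metadata was computed, since any nodata handling is not described in this README.

```python
import numpy as np


def flatten_embeddings(arr):
    """Reshape (bands, height, width) into (height * width, bands).

    Row r * width + c of the result is the vector at pixel (r, c).
    """
    bands, height, width = arr.shape
    return arr.reshape(bands, height * width).T


def mean_embedding(arr):
    """Average all pixel vectors into a single [1, bands] vector."""
    return flatten_embeddings(arr).mean(axis=0, keepdims=True)


# Random stand-in for one cell: 64 bands on a 4 x 5 pixel grid.
rng = np.random.default_rng(0)
embedding_array = rng.standard_normal((64, 4, 5)).astype(np.float32)

pixels = flatten_embeddings(embedding_array)
cell_mean = mean_embedding(embedding_array)
print(pixels.shape, cell_mean.shape)
```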

Coverage (zoom in)

This is a sample dataset covering 62,489 grid cells, each containing 1,068 by 1,068 embeddings of dimensionality 64. In total, that amounts to 71,276,453,136 (71 billion) individual embeddings, covering over 7 million square kilometers.
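
The totals above can be reproduced with a few lines of arithmetic. The cell count, cell size and dimensionality are taken from the paragraph above. The 10 m pixel spacing used for the area estimate is an assumption that is not stated in this README.

```python
GRID_CELLS = 62_489   # grid cells in this subset
SIDE = 1_068          # embeddings along each side of a cell
DIM = 64              # values per embedding

embeddings_per_cell = SIDE * SIDE
total_embeddings = GRID_CELLS * embeddings_per_cell
total_values = total_embeddings * DIM

# Assumption: each embedding covers a 10 m x 10 m pixel.
PIXEL_SIZE_M = 10
cell_area_km2 = (SIDE * PIXEL_SIZE_M / 1000) ** 2
total_area_km2 = GRID_CELLS * cell_area_km2

print(f"{total_embeddings:,} embeddings")   # 71,276,453,136 embeddings
print(f"{total_area_km2 / 1e6:.2f} million km^2")
```

Under that assumption each cell spans roughly 10.68 km on a side, which is consistent with the figure of over 7 million square kilometers.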

Examples

Data Source

The source data is produced by Google and Google DeepMind.

🗄 Source Dataset: https://developers.google.com/earth-engine/datasets/catalog/GOOGLE_SATELLITE_EMBEDDING_V1_ANNUAL

📝 Blog post: https://deepmind.google/discover/blog/alphaearth-foundations-helps-map-our-planet-in-unprecedented-detail/

📜 Paper: https://arxiv.org/abs/2507.22291

AlphaEarth Citation:

@misc{brown2025alphaearthfoundationsembeddingfield,
      title={AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data}, 
      author={Christopher F. Brown and Michal R. Kazmierski and Valerie J. Pasquarella and William J. Rucklidge and Masha Samsikova and Chenhui Zhang and Evan Shelhamer and Estefania Lahera and Olivia Wiles and Simon Ilyushchenko and Noel Gorelick and Lihui Lydia Zhang and Sophia Alj and Emily Schechter and Sean Askay and Oliver Guinan and Rebecca Moore and Alexis Boukouvalas and Pushmeet Kohli},
      year={2025},
      eprint={2507.22291},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Major TOM Citation:

@inproceedings{Francis2024MajorTOM,
  author={Francis, Alistair and Czerkawski, Mikolaj},
  booktitle={IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium}, 
  title={Major TOM: Expandable Datasets for Earth Observation}, 
  year={2024},
  pages={2935-2940},
  doi={10.1109/IGARSS53475.2024.10640760}
}

@misc{Czerkawski2024EmbeddedMajorTOM,
      title={Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space}, 
      author={Mikolaj Czerkawski and Marcin Kluczek and Jędrzej S. Bojanowski},
      year={2024},
      eprint={2412.05600},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.05600}, 
}

Credits

This dataset was curated by Mikolaj Czerkawski (@mikonvergence) from Asterisk Labs.

Thank you to Cesar Aybar (https://github.com/csaybar) and Julio Contreras (https://github.com/JulioContrerasH) for supporting me with the core code used to acquire this dataset at scale.

Last updated: May 21