Audio Datasets
nyu-dice-lab/wavepulse-radio-raw-transcripts • Viewer • Updated Feb 18, 2025 • 565M • 1.16k • 8
laion/LAION-DISCO-12M • Viewer • Updated Nov 14, 2024 • 12.3M • 680 • 43
laion/LAION-Audio-300M • Viewer • Updated Jan 10, 2025 • 229M • 13.9k • 61
Video Datasets
nkp37/OpenVid-1M • Viewer • Updated 26 days ago • 1.45M • 42.2k • 262
Koala-36M/Koala-36M-v1 • Viewer • Updated Oct 12, 2024 • 36M • 381 • 56
OpenGVLab/InternVid-Full • Viewer • Updated Jun 5, 2024 • 47.6M • 152 • 16
1x-technologies/world_model_raw_data • Updated Apr 20, 2025 • 195 • 6
Text Datasets
TxT360: Trillion Extracted Text 📖 • Running • 133 • Explore the TxT360 LLM pre‑training dataset
CASIA-LM/ChineseWebText2.0 • Viewer • Updated Dec 2, 2024 • 2k • 2.6k • 29
HPLT/HPLT2.0_cleaned • Viewer • Updated Nov 13, 2025 • 9.03B • 26.1k • 42
TrevorDohm/Pile_Tokenized • Viewer • Updated Feb 20, 2024 • 134M • 7
Robotic Datasets
agibot-world/AgiBotWorld-Alpha • Viewer • Updated Sep 29, 2025 • 49.8M • 10.3k • 219
Image Datasets
kakaobrain/coyo-700m • Viewer • Updated Aug 30, 2022 • 747M • 2.75k • 159
mlfoundations/datacomp_1b • Viewer • Updated Aug 21, 2023 • 1.39B • 7.82k • 42