KletterMix: Climbing Toward High-Quality German Pretraining Data
Abstract
A high-quality German-language corpus for language model pretraining is introduced through careful translation of an English corpus while preserving document structure and metadata, demonstrating improved downstream performance in German-language tasks.
High-quality pretraining data is a central ingredient in modern language models, but German-language resources remain far less developed than their English counterparts: they are often smaller, less carefully curated, weakly documented, and rarely validated through controlled training experiments. We introduce KletterMix, a high-quality German corpus for language model pretraining and annealing, designed as a reusable dataset artifact for the natural language processing and modeling community. KletterMix is built by translating a state-of-the-art English pretraining corpus into German while preserving document boundaries, metadata, source structure, and topical diversity. This construction yields a German corpus with the scale and diversity of a modern pretraining dataset, while enabling direct comparison to its English source. We document the dataset through a broad set of corpus-level analyses, including translation quality, document length distributions, topic coverage, source composition, and geographic metadata. Using COMETKiwi, we show that the translated documents achieve strong quality across diverse domains, suggesting that careful translation can preserve much of the semantic and stylistic richness of the original corpus. Beyond dataset construction, we evaluate KletterMix as training data. Through controlled pretraining and annealing ablations against established German corpora, we show that models trained on KletterMix achieve measurable improvements on German-language downstream evaluations. These results demonstrate that carefully curated translated data can substantially strengthen the German pretraining data ecosystem.
Community
Hi @Maurice Kraus and team,
the paper looks really interesting, I have to check the translated dataset.
After reading the paper once, I was wondering whether the translation prompt is too short and lacks potential filtering and instructions, compared to the translation prompt of the FineTranslations project, which can be found here. In general, I am missing a reference to the FineTranslations project (see here), which should definitely be added :)
Hey @stefan-it, thanks for the interest and the heads-up.
We were not aware of this dataset, as there is no accompanying paper. Yes, our prompt is indeed short because, during our initial probing and manual inspection, a longer prompt did not seem necessary. Also, compared to FineTranslations, our dataset is based on ClimbMix, a high-quality dataset that has already been filtered (using Nemotron CC and other measures).
Furthermore, we provide proxy-score-based measures to further improve data quality.
That said, we will take a closer look at this dataset and will definitely include it in a future revised version of the paper.
Bests,
Maurice + Authors
Many thanks Maurice!
I also really like the cluster labels in the dataset. Maybe these cluster labels could be used to define subsets for users who are only interested in specific clusters. We took a similar approach with our German Commons dataset.
And I would really like to see some kind of dataset subsets/samples based on token counts, e.g. FineWeb offers sample-350BT and sample-10BT splits. But enough feature requests for now 😅