arxiv:2606.03773

KletterMix: Climbing Toward High-Quality German Pretraining Data

Published on Jun 2

Authors:

Maurice Kraus, Ruben Härle

Abstract

A high-quality German-language corpus for language model pretraining is introduced through careful translation of an English corpus while preserving document structure and metadata, demonstrating improved downstream performance in German-language tasks.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

High-quality pretraining data is a central ingredient in modern language models, but German-language resources remain far less developed than their English counterparts: they are often smaller, less carefully curated, weakly documented, and rarely validated through controlled training experiments. We introduce KletterMix, a high-quality German corpus for language model pretraining and annealing, designed as a reusable dataset artifact for the natural language processing and modeling community. KletterMix is built by translating a state-of-the-art English pretraining corpus into German while preserving document boundaries, metadata, source structure, and topical diversity. This construction yields a German corpus with the scale and diversity of a modern pretraining dataset, while enabling direct comparison to its English source. We document the dataset through a broad set of corpus-level analyses, including translation quality, document length distributions, topic coverage, source composition, and geographic metadata. Using COMETKiwi, we show that the translated documents achieve strong quality across diverse domains, suggesting that careful translation can preserve much of the semantic and stylistic richness of the original corpus. Beyond dataset construction, we evaluate KletterMix as training data. Through controlled pretraining and annealing ablations against established German corpora, we show that models trained on KletterMix achieve measurable improvements on German-language downstream evaluations. These results demonstrate that carefully curated translated data can substantially strengthen the German pretraining data ecosystem.
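The construction described in the abstract (translate the text, keep document boundaries and metadata intact) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the `Document` record, its fields, and the `translate` callable are hypothetical and are not taken from the paper, which does not describe its pipeline at this level of detail.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Document:
    """Hypothetical record: one source document plus its metadata."""
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)


def translate_corpus(
    docs: Iterable[Document],
    translate: Callable[[str], str],
) -> Iterator[Document]:
    """Translate each document on its own, one output per input.

    Only the text field changes; the id and metadata are carried over,
    so every translated document can be paired with its source.
    `translate` is a placeholder for whatever translation model is used.
    """
    for doc in docs:
        yield replace(doc, text=translate(doc.text))
```

Because the mapping is one document in, one document out, the translated corpus keeps a direct alignment with its English source, which is what makes the side-by-side comparison mentioned in the abstract possible.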

Community

stefan-it

about 21 hours ago

Hi @Maurice Kraus and team,

the paper looks really interesting; I still have to check out the translated dataset.

After a first read, I was wondering whether the translation prompt is too short and lacks filtering steps and instructions, compared with the translation prompt of the FineTranslations project, which can be found here. In general, I am missing a reference to the FineTranslations project (see here), which should definitely be added :)

mkrausio

Paper author, about 21 hours ago (edited about 21 hours ago)

Hey @stefan-it, thanks for the interest and the heads-up.

We were not aware of this dataset, as there is no accompanying paper. Yes, our prompt is indeed short because, during our initial probing and manual inspection, a longer prompt did not seem necessary. Also, compared to FineTranslations, our dataset is based on ClimbMix, a high-quality dataset that has already been filtered (using Nemotron CC and other measures).

Furthermore, we provide proxy-score-based measures to further improve data quality.

That said, we will take a closer look at this dataset and will definitely include it in a future revised version of the paper.

Best,
Maurice + Authors

stefan-it

about 20 hours ago

Many thanks Maurice!

I also really like the cluster labels in the dataset. Maybe they could be exposed as subsets for users who are only interested in specific clusters. We took a similar approach with our German Commons dataset.
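As a rough sketch of the subset idea, assuming each record is a dict carrying its cluster label under a key such as `cluster_label` (a hypothetical field name; the dataset's actual column name is not given in this thread):

```python
from typing import Iterable, Iterator


def filter_by_cluster(
    records: Iterable[dict],
    wanted: Iterable[str],
    label_key: str = "cluster_label",  # hypothetical field name
) -> Iterator[dict]:
    """Yield only the records whose cluster label is in `wanted`.

    Records without the label key are skipped rather than raising.
    """
    wanted_set = set(wanted)
    for record in records:
        if record.get(label_key) in wanted_set:
            yield record
```

Since this works on any iterable, it can be applied lazily to a streamed dataset without loading the full corpus.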

I would also really like to see dataset subsets/samples based on token counts, e.g. FineWeb offers sample-350BT and sample-10BT splits. But enough feature requests for now 😅
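A token-budget sample of this kind could be drawn roughly as follows. This is only a sketch under stated assumptions: whitespace splitting stands in for a real tokenizer, the `text` key is a hypothetical field name, and nothing here reflects how FineWeb or KletterMix actually build their splits.

```python
import random
from typing import Sequence


def count_tokens(text: str) -> int:
    """Crude token count via whitespace splitting (stand-in for a tokenizer)."""
    return len(text.split())


def sample_by_token_budget(
    records: Sequence[dict],
    budget: int,
    seed: int = 0,
    text_key: str = "text",  # hypothetical field name
) -> tuple[list[dict], int]:
    """Shuffle deterministically, then keep documents while they fit the budget.

    Documents that would exceed the remaining budget are skipped, so the
    total never goes over `budget`. Returns the sample and its token count.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    sample, used = [], 0
    for record in shuffled:
        n = count_tokens(record[text_key])
        if used + n > budget:
            continue
        sample.append(record)
        used += n
    return sample, used
```

Fixing the seed makes the sample reproducible, which matters if different groups are meant to train on the same subset.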
