Demystifying Training-Time Augmentation for Data-Constrained Language Model Pretraining
Abstract
Training-time data augmentation techniques help mitigate overfitting in autoregressive language model pretraining by delaying performance deterioration and improving final model quality when training on fixed datasets for many epochs.
As AI labs approach a data ceiling where compute capacity outpaces the rate of new high-quality text generation, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then continuously deteriorating. We investigate training-time data augmentation as a regularizer to mitigate this overfitting and enable productive training for hundreds of epochs on the same data. We introduce three orthogonal categories of augmentation for AR pretraining: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction (x_{t+i} for i > 1). Through systematic ablations, we find that individual augmentations delay overfitting and lower validation loss relative to the baseline, with random token replacement achieving the best minimum loss among individual methods. Combining augmentation categories further lowers the minimum validation loss. Our experiments demonstrate that data augmentations mitigate AR pretraining's data inefficiency and offer a promising solution to the data-constrained regime. All code and data are available at https://github.com/michaelchen-lab/data-augmentations-for-pretraining.
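The three augmentation categories named in the abstract can be illustrated with a minimal Python sketch on lists of token ids. This is not taken from the authors' repository. The function names, the sentinel ids, the prefix-suffix-middle ordering used for Fill-in-the-Middle, and the choice to corrupt only the input tokens are all assumptions made for illustration.

```python
import random


def random_token_replacement(tokens, vocab_size, p, rng):
    """Token-level noise: with probability p, swap a token for a uniformly
    sampled id in [0, vocab_size). Assumed to be applied to inputs only."""
    return [rng.randrange(vocab_size) if rng.random() < p else t for t in tokens]


def mask_tokens(tokens, mask_id, p, rng):
    """Token-level noise: with probability p, swap a token for mask_id
    (mask_id is a hypothetical reserved id)."""
    return [mask_id if rng.random() < p else t for t in tokens]


def right_to_left(tokens):
    """Sequence permutation: reverse the sequence so that ordinary
    next-token prediction runs right to left over the original text."""
    return tokens[::-1]


def fill_in_the_middle(tokens, rng, pre_id, suf_id, mid_id):
    """Sequence permutation: cut the sequence at two random points and emit
    prefix, suffix, then middle, separated by hypothetical sentinel ids."""
    i, j = sorted(rng.randint(0, len(tokens)) for _ in range(2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return [pre_id] + prefix + [suf_id] + suffix + [mid_id] + middle


def offset_pairs(tokens, offset):
    """Target offset prediction: pair input x_t with target x_{t+offset}.
    offset=1 recovers standard next-token prediction."""
    if offset < 1:
        raise ValueError("offset must be >= 1")
    return tokens[:-offset], tokens[offset:]


rng = random.Random(0)
example = [10, 11, 12, 13, 14, 15]
noisy_inputs = random_token_replacement(example, vocab_size=100, p=0.15, rng=rng)
fim_view = fill_in_the_middle(example, rng, pre_id=100, suf_id=101, mid_id=102)
inputs, targets = offset_pairs(example, offset=2)  # ([10, 11, 12, 13], [12, 13, 14, 15])
```

In a training loop, a sketch like this would draw a fresh augmented view of each sequence every epoch, so the model rarely sees exactly the same input twice even though the underlying corpus is fixed.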
Community
As high-quality text grows scarce, language model pretraining is entering a data-constrained, compute-abundant regime that requires many epochs over a fixed corpus, a setting in which standard autoregressive (AR) training overfits and eventually degrades. This work aims to demystify training-time data augmentation as a remedy, systematically separating the augmented training views that regularize many-epoch AR training from those that fail or interfere.