Readout recipe-control models (31M, 4 matched arms)
Four matched 31M-parameter Pythia-style (GPTNeoX architecture) pretraining
runs, used as the optimizer-recipe sensitivity control in "Learning to Read
Out: Unembedding Dynamics in Language Model Pretraining" (appendix,
fig:app-recipe-control-geometry). All arms share one tokenized Pile slice
(pile_10B_seed1234.bin, GPT-NeoX-20B tokenizer, data_seed=1234), the same
parameter seed (0), data order, fp16 precision, global batch of 1024
(2,097,152 tokens/step), weight decay of 0.1, and a 10B-token budget. Each
arm perturbs exactly one recipe axis:
| Arm | Perturbation |
|---|---|
| baseline/ | none (peak_lr 1e-3, warmup 1430 steps, W_U lr multiplier 1.0) |
| long_warmup/ | extended LR warmup |
| wu_lr_0p25/ | output-readout (W_U) learning-rate multiplier 0.25× |
| wu_lr_4x/ | output-readout (W_U) learning-rate multiplier 4× |
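
As a rough sanity check on the shared budget, the figures quoted above can be combined arithmetically. The sketch below assumes that "global batch 1024" counts sequences and that "10B" means exactly 10^10 tokens; neither is stated explicitly here, so treat each arm's config.json as authoritative.

```python
# Figures quoted in this README.
GLOBAL_BATCH = 1024                # assumed to be sequences per step
TOKENS_PER_STEP = 2_097_152
BASELINE_WARMUP_STEPS = 1430

# Assumption: "10B-token budget" read as exactly 1e10 tokens.
TOKEN_BUDGET = 10_000_000_000

# Implied sequence length under the sequences-per-step assumption.
seq_len = TOKENS_PER_STEP // GLOBAL_BATCH

# Approximate optimizer steps needed to consume the budget.
total_steps = TOKEN_BUDGET / TOKENS_PER_STEP

# Baseline warmup expressed in tokens and as a share of the run.
warmup_tokens = BASELINE_WARMUP_STEPS * TOKENS_PER_STEP
warmup_fraction = BASELINE_WARMUP_STEPS / total_steps

print(f"implied sequence length: {seq_len}")
print(f"approx. total steps:     {total_steps:.1f}")
print(f"baseline warmup tokens:  {warmup_tokens:,}")
print(f"baseline warmup share:   {warmup_fraction:.1%}")
```

Under these assumptions, the baseline warmup covers roughly 30% of the run, which gives a reference point for reading the long_warmup/ arm.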
Each arm ships config.json (full training config), metrics.csv
(train/val curves), and ckpts/step<N>/model_fp16.pt checkpoint trajectories
at token-milestone steps.
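
To walk a checkpoint trajectory in training order, something like the following may be enough. This is a minimal sketch that relies only on the ckpts/step<N>/model_fp16.pt layout described above; the helper name list_checkpoints is hypothetical and not part of the code release.

```python
import re
from pathlib import Path

# Matches directory names of the form "step<N>", e.g. "step1000".
STEP_DIR = re.compile(r"^step(\d+)$")


def list_checkpoints(arm_dir):
    """Return (step, path) pairs for ckpts/step<N>/model_fp16.pt, sorted by step.

    Hypothetical helper: only the directory layout is taken from this README.
    """
    found = []
    ckpt_root = Path(arm_dir) / "ckpts"
    for child in ckpt_root.iterdir():
        match = STEP_DIR.match(child.name)
        ckpt = child / "model_fp16.pt"
        if match and ckpt.is_file():
            found.append((int(match.group(1)), ckpt))
    # Sort numerically so that step200 comes before step1000.
    return sorted(found, key=lambda pair: pair[0])


# Example usage (path is illustrative):
# for step, path in list_checkpoints("baseline"):
#     print(step, path)
#     # Loading is left to the reader; the contents of model_fp16.pt
#     # (e.g. a plain state dict) are an assumption, not documented here:
#     # state = torch.load(path, map_location="cpu")
```

Sorting on the parsed integer rather than the directory name avoids lexicographic ordering problems if step numbers are not zero-padded.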
The trainer, tokenizer pipeline, and slice-building scripts are in the code
release (https://github.com/hematteo/learning-to-read-out) under
experiments/ablations/pretraining_recipe_control/. To rebuild the exact
slice, use trainer/tokenize_slice.py or scripts/fetch_pythia_preshuffled.py
(sources and licences in docs/DATA.md). These are research artifacts for
analyzing W_U readout geometry across training, not general-purpose language
models.
Citation
@misc{he2026learningtoreadout,
title = {Learning to Read Out: Unembedding Dynamics in Language Model Pretraining},
author = {He, Matteo and Shen, William F. and Iacob, Alex and Jovanovic, Andrej
and Qiu, Xinchi and Lane, Nicholas D.},
year = {2026},
note = {Under review. Code: https://github.com/hematteo/learning-to-read-out},
}
License: MIT. Trained on a slice of the Pile (monology/pile-uncopyrighted /
EleutherAI/pile-standard-pythia-preshuffled); see the Pile's data statement
for upstream text provenance.