Abstract
Separating state prediction from token prediction in Transformers improves language modeling performance and efficiency across different scales.
Transformers use a single forward computation stream both to predict the next token and to store state that is useful for predicting future tokens. We formulate the state-prediction separation hypothesis: disentangling these two roles yields better language modeling performance. We design a Transformer variant that uses two computation streams to separate the two functions, and we conduct pretraining experiments at various scales. Our experiments show that state-prediction separation consistently offers better data and compute efficiency, improving validation loss and outperforming standard Transformers by 2 to 3 percentage points on average on downstream tasks. We also conduct extensive empirical analysis that rules out potential confounders and demonstrates the fundamental difference in the gradients that our design entails.
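The abstract does not specify how the two streams are wired together, so the following is only a toy sketch of one possible reading, not the paper's architecture. It assumes a state stream that attends causally to itself and a prediction stream that reads keys and values from the state stream but never writes back to it. The names `two_stream_layer`, `make_params`, and the parameter keys are hypothetical and chosen for illustration only.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attend(queries, keys, values):
    """Single-head causal attention: position t attends to positions 0..t."""
    d = len(keys[0])
    out = []
    for t, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(t + 1)
        ]
        weights = softmax(scores)
        out.append([
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ])
    return out


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.3) for _ in range(cols)] for _ in range(rows)]


def make_params(rng, d):
    """Hypothetical parameter set: shared K/V plus one query map per stream."""
    return {name: random_matrix(rng, d, d) for name in ("k", "v", "s_q", "p_q")}


def two_stream_layer(state, pred, params):
    """One toy layer (assumed design, not taken from the paper).

    state: per-position vectors that later positions may read.
    pred:  per-position vectors used only for the current prediction.
    """
    keys = [matvec(params["k"], s) for s in state]
    values = [matvec(params["v"], s) for s in state]

    # State stream: queries, keys, and values all come from the state stream.
    s_queries = [matvec(params["s_q"], s) for s in state]
    s_update = causal_attend(s_queries, keys, values)
    new_state = [[a + b for a, b in zip(s, u)] for s, u in zip(state, s_update)]

    # Prediction stream: queries come from pred, keys/values from state.
    # Nothing computed here is fed back into the state stream.
    p_queries = [matvec(params["p_q"], p) for p in pred]
    p_update = causal_attend(p_queries, keys, values)
    new_pred = [[a + b for a, b in zip(p, u)] for p, u in zip(pred, p_update)]
    return new_state, new_pred


rng = random.Random(0)
d, T = 4, 5
x = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]
params = make_params(rng, d)
state, pred = two_stream_layer(x, x, params)  # both streams start from the embeddings
```

In this sketch the state stream's forward values are independent of the prediction-stream query weights `p_q`, which is one simple way the two roles could be kept apart. Whether the paper's design shares keys and values in this manner is not stated in the abstract.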
Community
Code coming soon!