arxiv:2607.01218

The State-Prediction Separation Hypothesis

Published on Jul 1

Submitted by Nathan Godey on Jul 2

Cornell LIL Lab

Abstract

Separating state prediction from token prediction in Transformers improves language modeling performance and efficiency across different scales.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Transformers use the same forward computation stream both to predict the next token and to store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer variant that uses two computation streams to separate the two functions, and conduct pretraining experiments across various scales. Our experiments show that state-prediction separation consistently offers better data and compute efficiency, improving validation loss and outperforming standard Transformers by 2 to 3 percentage points on average on downstream tasks. We also conduct extensive empirical analysis that rules out potential confounders and demonstrates the fundamental difference in the gradients our design entails.