arXiv:2510.14751

Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries

Published in October 2025

Abstract

AI-generated summary: Future summary prediction trains an auxiliary head to predict compact long-term representations, improving language model performance on reasoning and code tasks compared to traditional next-token prediction.

Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example a bag-of-words summary of the future sequence, and learned summaries, which use embeddings produced by a reverse language model trained in right-to-left order. Large-scale pretraining experiments with 3B- and 8B-parameter models demonstrate that FSP improves over both NTP and MTP across math, reasoning, and coding benchmarks.
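
To make the objective concrete, below is a minimal PyTorch sketch of the handcrafted bag-of-words variant of FSP. The abstract only states that an auxiliary head predicts a compact summary of the future sequence alongside standard next-token training, so the future window size, the single linear auxiliary head, the multi-label binary cross-entropy loss, and the aux_weight hyperparameter here are illustrative assumptions, not the paper's actual recipe.

# Minimal sketch of future summary prediction (FSP) with a handcrafted
# bag-of-words summary target. The window size, the linear auxiliary head,
# the multi-label BCE loss, and aux_weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def bag_of_words_targets(tokens, vocab_size, window):
    # For each position t, mark which token ids occur in the next `window`
    # tokens, tokens[t+1 : t+1+window]; this multi-hot vector is the
    # handcrafted "future summary" the auxiliary head must predict.
    B, T = tokens.shape
    targets = torch.zeros(B, T, vocab_size, device=tokens.device)
    for t in range(T - 1):
        future = tokens[:, t + 1 : t + 1 + window]   # (B, <=window) token ids
        targets[:, t].scatter_(1, future, 1.0)       # set multi-hot entries
    return targets

class FSPHead(nn.Module):
    # Auxiliary head on top of the backbone's hidden states; a single
    # linear projection is an assumption, the paper may use a deeper head.
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, hidden):
        return self.proj(hidden)                     # (B, T, V) logits

def fsp_training_loss(hidden, ntp_logits, tokens, fsp_head,
                      window=64, aux_weight=0.1):
    # Standard next-token cross-entropy: predict tokens[:, 1:] from
    # positions [:, :-1].
    vocab_size = ntp_logits.size(-1)
    ntp_loss = F.cross_entropy(
        ntp_logits[:, :-1].reshape(-1, vocab_size),
        tokens[:, 1:].reshape(-1),
    )
    # Auxiliary FSP loss: multi-label BCE against the bag-of-words summary
    # of the future window at every position.
    summary_targets = bag_of_words_targets(tokens, vocab_size, window)
    fsp_loss = F.binary_cross_entropy_with_logits(fsp_head(hidden), summary_targets)
    return ntp_loss + aux_weight * fsp_loss

The learned-summary variant described in the abstract would instead regress the auxiliary head's output onto embeddings produced by a separately trained right-to-left (reverse) language model; a hypothetical swap would replace the BCE term with a regression loss (e.g. MSE or cosine distance) against those embeddings.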


Get this paper in your agent:

hf papers read 2510.14751

Don't have the latest CLI? Install it with:

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 0


Spaces citing this paper 1

Collections including this paper 0
