From P(y|x) to P(y): Investigating Reinforcement Learning in Pre-train Space
Abstract
PreRL applies reward-driven online updates to the marginal distribution in pre-train space, while DSRL uses NSR-PreRL to expand reasoning horizons before standard RL fine-tuning.
While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.
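The claimed gradient alignment between log P(y) and log P(y|x) can be illustrated with a toy model. Everything below is an illustrative assumption, not the paper's setup: a bigram softmax LM whose next-token logits given previous token `prev` are the row `W[prev]`, with made-up token ids for the BOS symbol and the prompt's last token.

```python
import numpy as np

# Toy illustration (not the paper's model): bigram softmax LM.
rng = np.random.default_rng(0)
V = 12                                   # assumed toy vocabulary size
BOS = 0                                  # begin-of-sequence token for unconditional P(y)
W = rng.normal(scale=0.5, size=(V, V))   # logits for next token given prev = W[prev]

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_logprob(tokens, start):
    """Gradient of sum_t log P(token_t | prev_t) with respect to W."""
    grad = np.zeros_like(W)
    prev = start
    for t in tokens:
        p = softmax(W[prev])
        g = -p
        g[t] += 1.0                      # d log softmax(z)[t] / dz = onehot(t) - softmax(z)
        grad[prev] += g
        prev = t
    return grad

x_last = 5                               # last token of a hypothetical prompt x
y = [3, 7, 2, 9, 4, 1, 8, 6]             # one shared "reasoning trajectory"

g_marginal = grad_logprob(y, start=BOS)      # gradient of log P(y): unconditional
g_cond     = grad_logprob(y, start=x_last)   # gradient of log P(y|x): prompt-conditioned

cos = (g_marginal * g_cond).sum() / (
    np.linalg.norm(g_marginal) * np.linalg.norm(g_cond))
print(f"cosine similarity of the two gradients: {cos:.3f}")
```

In this toy, only the first decoding step sees a different context (BOS versus the prompt's last token), so for long trajectories the two gradients are dominated by shared terms and the cosine is close to 1. In a real transformer every step's hidden state depends on the prompt, so the strong alignment the paper reports is an empirical finding rather than an identity.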
Community
We’re excited to share our new paper: From P(y|x) to P(y): Investigating Reinforcement Learning in Pre-train Space. https://arxiv.org/abs/2604.14142
Most existing RL for LLM reasoning optimizes the policy conditioned on a given question, i.e., P(y|x).
In this work, we ask a different question:
Can we directly optimize reasoning trajectories themselves in pre-train space, instead of only optimizing them conditioned on a specific problem?
Our motivation is simple: if reasoning knowledge is already internalized in the model, then optimizing trajectories at the level of P(y) may provide a way to shape that internalized reasoning space more directly.
We introduce PreRL, which applies reward-driven online updates in pre-train space, and find a surprising result:
Negative Sample Reinforcement (NSR) is especially effective.
Instead of reinforcing only correct trajectories, pruning incorrect ones in pre-train space can strongly stimulate reasoning behaviors and provide a better foundation for subsequent RL.
Building on this, we propose DSRL, which first performs NSR-PreRL warmup and then switches to standard RL.
Across benchmarks, this leads to better reasoning performance, stronger exploration, and improved efficiency.
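The two-stage schedule can be sketched as a single REINFORCE-style update whose reward sign and conditioning change between phases. This is a toy sketch under the same bigram assumption, with hypothetical trajectories and a made-up prompt token; it shows only the control flow of DSRL, not the actual training recipe.

```python
import numpy as np

rng = np.random.default_rng(2)
V = 8
BOS = 0                                   # start token for the unconditional pre-train space
W = rng.normal(scale=0.5, size=(V, V))    # toy bigram policy (illustrative only)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def logprob(tokens, start):
    lp, prev = 0.0, start
    for t in tokens:
        lp += np.log(softmax(W[prev])[t])
        prev = t
    return lp

def reinforce_step(tokens, start, reward, lr=0.3):
    """REINFORCE on log P(tokens | start); reward=-1 prunes, reward=+1 reinforces."""
    prev = start
    for t in tokens:
        p = softmax(W[prev])
        g = -p
        g[t] += 1.0                       # onehot(t) - softmax: log-likelihood gradient
        W[prev] += reward * lr * g
        prev = t

x = 5                                     # hypothetical prompt's last token
bad, good = [2, 6, 1], [3, 7, 4]          # hypothetical incorrect / correct trajectories

# Phase 1 -- NSR-PreRL warmup: prune incorrect trajectories unconditionally (from BOS).
bad_before = logprob(bad, BOS)
for _ in range(10):
    reinforce_step(bad, BOS, reward=-1)

# Phase 2 -- standard RL: reinforce correct trajectories conditioned on the prompt x.
good_before = logprob(good, x)
for _ in range(10):
    reinforce_step(good, x, reward=+1)

print(f"log P(bad | BOS): {bad_before:.2f} -> {logprob(bad, BOS):.2f}")
print(f"log P(good | x):  {good_before:.2f} -> {logprob(good, x):.2f}")
```

The design choice the sketch highlights: the warmup acts on the marginal distribution (start = BOS), carving away incorrect regions globally, while the second phase does fine-grained, prompt-conditioned optimization inside the subspace that survives.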
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Reinforcement-aware Knowledge Distillation for LLM Reasoning (2026)
- Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance (2026)
- FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization (2026)
- Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models (2026)
- Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning (2026)
- Offline Exploration-Aware Fine-Tuning for Long-Chain Mathematical Reasoning (2026)
- Experience is the Best Teacher: Motivating Effective Exploration in Reinforcement Learning for LLMs (2026)