Model Card for LamPO

LamPO (Lambda Policy Optimization) is a reinforcement learning framework for improving the reasoning capabilities of language models. It extends Group Relative Policy Optimization (GRPO) by replacing scalar group-mean advantage estimation with a pairwise decomposed advantage inspired by learning-to-rank methods such as LambdaRank.

Links: Paper 1; Paper 2

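Special thanks: We thank a paper-coaching agency for its comprehensive guidance; without it this paper would not exist. (It cost money, but it was well worth it. Highly recommended!)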

Instead of comparing each generated response only against a group average, LamPO learns from fine-grained pairwise reward differences among the reasoning trajectories sampled for the same prompt. This helps the model distinguish high-quality reasoning paths, improves credit assignment, and reduces optimization instability during RL training.
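To make the contrast concrete, here is a minimal sketch of the two advantage styles for one group of sampled responses. It is not the paper's formula: the function names, the `tanh` pair weight, and the `temperature` parameter are all assumptions chosen for illustration. The group-mean version follows the usual GRPO-style standardization.

```python
import math
from statistics import fmean, pstdev


def group_mean_advantages(rewards, eps=1e-8):
    """GRPO-style baseline: standardize each reward against the group mean and std."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def pairwise_advantages(rewards, temperature=1.0):
    """Hypothetical pairwise advantage (an assumption, not the paper's definition).

    Each response i is compared with every other response j in the group, and the
    bounded difference tanh((r_i - r_j) / temperature) is averaged over the pairs.
    """
    g = len(rewards)
    if g < 2:
        return [0.0] * g
    # The j == i term contributes tanh(0) = 0, so it can stay in the sum.
    return [
        sum(math.tanh((r_i - r_j) / temperature) for r_j in rewards) / (g - 1)
        for r_i in rewards
    ]


if __name__ == "__main__":
    rewards = [1.0, 0.2, 0.2, 0.0]
    print(group_mean_advantages(rewards))
    print(pairwise_advantages(rewards))
```

A nonlinear pair weight is used here on purpose: averaging the raw differences r_i - r_j over the group is just a rescaled r_i minus the group mean, so a purely linear pairwise sum would collapse back to the scalar baseline.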

Key Features

Pairwise Decomposed Advantage: Uses pairwise comparisons between generated trajectories rather than a single scalar group baseline.
Critic-Free RL Optimization: Preserves the efficiency of GRPO without requiring a separate value model.
Semantic Density Reward: Adds dense reasoning supervision using semantic overlap between generated reasoning traces and ground-truth solutions.
Improved Reasoning Performance: Demonstrates consistent gains on math reasoning and QA benchmarks such as AIME, MATH-500, and GPQA-Diamond.
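The semantic density reward can be pictured with a simple overlap score. This card does not specify the overlap metric, so the sketch below substitutes a bag-of-words F1 between a generated reasoning trace and the ground-truth solution; `tokenize` and `overlap_reward` are hypothetical names, and the actual reward may well use a different notion of semantic overlap (for example, embeddings).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_reward(trace, reference):
    """Token-level F1 between a reasoning trace and a reference solution.

    This is a stand-in metric (an assumption), returning a dense score in [0, 1].
    """
    trace_counts = Counter(tokenize(trace))
    ref_counts = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token.
    shared = sum((trace_counts & ref_counts).values())
    if shared == 0:
        return 0.0
    precision = shared / sum(trace_counts.values())
    recall = shared / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(overlap_reward("x = 2 so x squared is 4", "x squared is 4"))
```

Unlike a binary correctness reward, a score of this kind varies across partially correct traces, which is what makes it usable as dense supervision.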

Authors

This work is based on the paper:

“LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models” (Links: Paper 1; Paper 2)

Authors:

Zhe Yuan — Pinterest
Yipeng Zhou — Facebook
Jinghan Li — University of Michigan - Ann Arbor
Xinyuan Chen — Mississippi State University
Bowen Deng — Carnegie Mellon University
Zhiqian Chen — Mississippi State University
Liang Zhao — Emory University

Corresponding author: Zhiqian Chen — zchen@cse.msstate.edu

Citation

@article{yuan2026lambdapo,
  title={LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models},
  author={Yuan, Zhe and Zhou, Yipeng and Li, Jinghan and Chen, Xinyuan and Deng, Bowen and Chen, Zhiqian and Zhao, Liang},
  year={2026}
}

Downloads last month: 9,510

Safetensors

Model size: 1.18M params
Tensor type: F32


Paper for xychen123/LamPO

LamPO: A Lambda Style Policy Optimization for Reasoning Language Models

Paper • 2605.21235 • Published 3 days ago