Model Card for LamPO

LamPO (Lambda Policy Optimization) is a reinforcement learning framework for improving the reasoning capabilities of language models. It extends Group Relative Policy Optimization (GRPO) by replacing scalar group-mean advantage estimation with a pairwise decomposed advantage inspired by learning-to-rank methods such as LambdaRank.

Links: Paper 1; Paper 2

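Special thanks: We thank a paper-tutoring service for its comprehensive guidance; without it, this paper would not exist. (It cost money, but it was well worth it. Highly recommended!)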

Instead of comparing each generated response only against a group average, LambdaPO learns from fine-grained pairwise reward differences among sampled reasoning trajectories. This helps the model better distinguish high-quality reasoning paths, improve credit assignment, and reduce unstable optimization behavior during RL training.
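
The contrast between a group-mean baseline and pairwise reward differences can be sketched in a few lines. This is a minimal illustration and not the paper's formula: the function names, the `gap_weight` weighting, and the `temperature` parameter are all assumptions made for the example. It shows one useful fact: with uniform pair weights, averaged pairwise differences reduce exactly to the group-mean advantage, so any benefit of a pairwise scheme has to come from how pairs are weighted.

```python
import math


def group_mean_advantages(rewards):
    """GRPO-style baseline: reward minus the group mean (std normalization omitted)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def gap_weight(r_i, r_j, temperature=1.0):
    """Hypothetical lambda-style weight: a sigmoid of the absolute reward gap.

    This is an assumed weighting chosen for illustration, not taken from the paper.
    It is symmetric in (r_i, r_j), so the resulting advantages sum to zero.
    """
    return 1.0 / (1.0 + math.exp(-abs(r_i - r_j) / temperature))


def pairwise_advantages(rewards, weight=None):
    """Assumed pairwise form: A_i = (1/G) * sum_j w(r_i, r_j) * (r_i - r_j)."""
    if weight is None:
        # Uniform weights: every pair counts equally.
        weight = lambda r_i, r_j: 1.0
    group_size = len(rewards)
    return [
        sum(weight(r_i, r_j) * (r_i - r_j) for r_j in rewards) / group_size
        for r_i in rewards
    ]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 0.5]  # toy rewards for one sampled group
    print("group-mean:", group_mean_advantages(rewards))
    print("pairwise (uniform):", pairwise_advantages(rewards))
    print("pairwise (gap-weighted):", pairwise_advantages(rewards, gap_weight))
```

Under the assumed `gap_weight`, pairs with larger reward gaps contribute more than under the uniform scheme, which is one way a pairwise signal can separate trajectories that a single scalar baseline treats alike.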

Key Features

  • Pairwise Decomposed Advantage: Uses pairwise comparisons between generated trajectories rather than a single scalar group baseline.
  • Critic-Free RL Optimization: Preserves the efficiency of GRPO without requiring a separate value model.
  • Semantic Density Reward: Adds dense reasoning supervision using semantic overlap between generated reasoning traces and ground-truth solutions.
  • Improved Reasoning Performance: Demonstrates consistent gains on math reasoning and QA benchmarks such as AIME, MATH-500, and GPQA-Diamond.
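
To make the Semantic Density Reward bullet more concrete, the sketch below adds a dense overlap term to a sparse outcome reward. The card does not specify how semantic overlap is measured, so this example substitutes plain token-level F1 as a lexical stand-in; `token_f1`, `shaped_reward`, and the mixing coefficient `alpha` are hypothetical names and values, not the authors' implementation.

```python
from collections import Counter


def token_f1(candidate, reference):
    """Token-level F1 between two strings (a lexical stand-in for semantic overlap)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    common = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(cand_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def shaped_reward(outcome_reward, trace, reference, alpha=0.1):
    """Assumed shaping: outcome reward plus alpha times the overlap score."""
    return outcome_reward + alpha * token_f1(trace, reference)


if __name__ == "__main__":
    reference = "factor the quadratic then solve each linear term"
    trace = "first factor the quadratic and solve each term"
    print(shaped_reward(1.0, trace, reference))
```

A real system would more plausibly score overlap with sentence embeddings or a learned similarity model; the lexical version is used here only because it runs with the standard library alone.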

Authors

This work is based on the paper:

“LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models” (Links: Paper 1; Paper 2)

Authors:

  • Zhe Yuan — Pinterest
  • Yipeng Zhou — Facebook
  • Jinghan Li — University of Michigan - Ann Arbor
  • Xinyuan Chen — Mississippi State University
  • Bowen Deng — Carnegie Mellon University
  • Zhiqian Chen — Mississippi State University
  • Liang Zhao — Emory University

Corresponding author: Zhiqian Chen (zchen@cse.msstate.edu)

Citation

@article{yuan2026lambdapo,
  title={LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models},
  author={Yuan, Zhe and Zhou, Yipeng and Li, Jinghan and Chen, Xinyuan and Deng, Bowen and Chen, Zhiqian and Zhao, Liang},
  year={2026}
}

Model size: 1.18M params (Safetensors, tensor type F32)