Papers
arxiv:2604.13902

DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off

Published on Apr 15 · Submitted by Xiaofan Li on Apr 20

Abstract

A novel reinforcement learning approach for large language models that addresses the exploration-exploitation trade-off through perplexity-based sample partitioning and bidirectional reward allocation mechanisms.

AI-generated summary

Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a critical challenge. In this paper, we analyze the exploration-exploitation dilemma posed by extremely hard and extremely easy samples during training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity-space disentangling strategy that divides the sample space into distinct exploration (high-perplexity) and exploitation (low-perplexity) subspaces, thereby mining the fine-grained samples that require an exploration-exploitation trade-off. Subsequently, we propose a bidirectional reward allocation mechanism that implements perplexity-guided exploration and exploitation with minimal impact on the verification rewards, enabling more stable policy optimization. Finally, we evaluate our method on two mainstream tasks, mathematical reasoning and function calling; experimental results demonstrate the superiority of the proposed method, confirming its effectiveness in enhancing LLM performance through a fine-grained exploration-exploitation trade-off.
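The abstract describes two components: partitioning rollouts by perplexity into exploration (high-perplexity) and exploitation (low-perplexity) subspaces, and a small bidirectional reward adjustment layered on top of the verifiable reward. The sketch below only illustrates that idea; it is not the paper's actual algorithm. The median-split threshold, the bonus scale `beta`, the sign conventions, and the helper names (`sequence_perplexity`, `shape_rewards`) are all assumptions.

```python
import math

def sequence_perplexity(token_logprobs):
    """PPL = exp(-mean token log-prob). Higher PPL means the policy was
    less confident on this sample, i.e. the rollout is more exploratory."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def shape_rewards(samples, beta=0.05):
    """Hypothetical bidirectional reward allocation (illustrative only).

    Splits a rollout batch into high-/low-perplexity subspaces at the
    median PPL (threshold choice is an assumption), then nudges the
    verifiable 0/1 reward by a small +/- beta so the shaping stays a
    minimal perturbation of the verification signal:
      - correct + high PPL -> +beta (reward successful exploration)
      - correct + low  PPL -> +0    (plain exploitation, keep as-is)
      - wrong   + low  PPL -> -beta (penalize confident mistakes)
      - wrong   + high PPL -> -0    (no extra penalty for exploring)
    """
    ppls = [sequence_perplexity(s["token_logprobs"]) for s in samples]
    threshold = sorted(ppls)[len(ppls) // 2]  # median split (assumed)
    shaped = []
    for s, ppl in zip(samples, ppls):
        r = float(s["verified_correct"])  # verifiable 0/1 reward
        explore = ppl >= threshold
        if s["verified_correct"] and explore:
            r += beta
        elif not s["verified_correct"] and not explore:
            r -= beta
        shaped.append(r)
    return shaped

# Toy batch: token log-probs per sampled response plus verifier outcome.
batch = [
    {"token_logprobs": [-0.1, -0.2, -0.05], "verified_correct": True},
    {"token_logprobs": [-1.4, -2.0, -1.1],  "verified_correct": True},
    {"token_logprobs": [-0.1, -0.3, -0.2],  "verified_correct": False},
    {"token_logprobs": [-1.8, -1.5, -2.2],  "verified_correct": False},
]
print(shape_rewards(batch))  # -> [1.0, 1.05, -0.05, 0.0]
```

Under these assumptions, the shaped rewards would then feed a GRPO/DAPO-style group-relative advantage as usual; keeping `beta` small is what would correspond to the abstract's "minimal impact on the verification rewards".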

Community

Paper submitter


Mathematical Reasoning Results

All scores are percentages; AVG is the arithmetic mean of the six benchmark columns (OLY and MIN presumably abbreviate OlympiadBench and Minerva Math).

Qwen3-4B-Base

| Method | AIME24 | AIME25 | MATH | AMC | OLY | MIN | AVG |
|---|---|---|---|---|---|---|---|
| Base | 9.58 | 3.75 | 48.48 | 31.48 | 25.52 | 24.82 | 23.94 |
| GRPO | 26.67 | 23.33 | 85.83 | 60.24 | 53.06 | 44.39 | 48.92 |
| DAPO | 26.25 | 23.75 | 86.43 | 61.90 | 53.88 | 44.34 | 49.43 |
| DAPO w/ EL | 26.67 | 24.58 | 86.78 | 62.95 | 54.53 | 44.53 | 50.01 |
| CDE | 26.67 | 24.17 | 85.93 | 62.35 | 52.25 | 43.11 | 49.08 |
| DiPO (ours) | 29.17 | 24.58 | 87.00 | 63.70 | 54.09 | 44.76 | 50.55 |

Qwen3-8B-Base

| Method | AIME24 | AIME25 | MATH | AMC | OLY | MIN | AVG |
|---|---|---|---|---|---|---|---|
| Base | 8.33 | 9.17 | 66.78 | 39.46 | 33.30 | 66.78 | 37.30 |
| GRPO | 31.67 | 24.58 | 89.08 | 69.28 | 56.20 | 48.62 | 53.24 |
| DAPO | 30.08 | 25.83 | 89.43 | 69.12 | 56.90 | 48.02 | 53.23 |
| DAPO w/ EL | 33.75 | 25.42 | 89.58 | 69.87 | 57.21 | 47.56 | 53.90 |
| CDE | 31.67 | 26.25 | 89.35 | 68.07 | 57.11 | 47.75 | 53.37 |
| DiPO (ours) | 35.00 | 27.50 | 89.55 | 71.23 | 57.73 | 47.75 | 54.79 |

Qwen2.5-7B

| Method | AIME24 | AIME25 | MATH | AMC | OLY | MIN | AVG |
|---|---|---|---|---|---|---|---|
| Base | 7.08 | 2.08 | 41.53 | 22.74 | 19.28 | 17.19 | 18.32 |
| GRPO | 20.42 | 15.42 | 79.15 | 58.43 | 42.42 | 36.95 | 42.13 |
| DAPO | 20.42 | 16.67 | 79.08 | 59.94 | 42.70 | 37.55 | 42.73 |
| DAPO w/ EL | 20.00 | 14.58 | 79.85 | 58.73 | 43.05 | 39.65 | 42.64 |
| CDE | 20.00 | 15.00 | 79.00 | 55.87 | 42.94 | 35.94 | 41.46 |
| DiPO (ours) | 22.92 | 16.67 | 80.35 | 60.09 | 43.72 | 37.59 | 43.56 |

BFCLv3 (Berkeley Function-Calling Leaderboard v3) Function Calling Results

Qwen2.5-3B-Instruct

| Method | Non-Live Acc | Live Acc | Multi-Turn Acc | Relevance Detection | Irrelevance Detection | Overall |
|---|---|---|---|---|---|---|
| Instruct | 42.52 | 53.96 | 1.00 | 44.44 | 82.49 | 33.04 |
| SFT400 | 69.29 | 41.40 | 0.00 | 94.44 | 60.14 | 34.08 |
| SFT400+PPO | 78.29 | 58.76 | 5.12 | 100.00 | 48.40 | 45.80 |
| SFT400+GRPO | 76.21 | 64.15 | 1.75 | 94.44 | 58.63 | 46.42 |
| PPO, Cold Start | 82.42 | 67.78 | 4.88 | 100.00 | 18.09 | 51.15 |
| ToolRL+GRPO | 81.58 | 73.78 | 3.75 | 100.00 | 56.44 | 52.98 |
| ToolRL+DAPO | 82.19 | 69.43 | 8.00 | 81.25 | 57.60 | 53.21 |
| ToolRL+DiPO | 83.42 | 73.06 | 8.62 | 100.00 | 54.16 | 55.03 |

Qwen2.5-7B-Instruct

| Method | Non-Live Acc | Live Acc | Multi-Turn Acc | Relevance Detection | Irrelevance Detection | Overall |
|---|---|---|---|---|---|---|
| Instruct | 66.02 | 53.51 | 4.25 | 76.47 | 62.66 | 41.97 |
| SFT400 | 69.29 | 41.40 | 0.00 | 94.44 | 8.11 | 34.08 |
| SFT400+PPO | 83.90 | 51.84 | 0.25 | 100.00 | 29.66 | 42.02 |
| SFT400+GRPO | 80.69 | 46.51 | 0.25 | 100.00 | 14.19 | 39.25 |
| PPO, Cold Start | 79.33 | 63.17 | 0.38 | 88.89 | 52.92 | 46.68 |
| ToolRL+GRPO | 86.17 | 74.90 | 18.12 | 83.33 | 76.68 | 58.38 |
| ToolRL+DAPO | 87.10 | 76.31 | 19.75 | 87.50 | 67.25 | 61.06 |
| ToolRL+DiPO | 86.21 | 76.83 | 24.50 | 87.50 | 69.57 | 62.51 |


Get this paper in your agent:

hf papers read 2604.13902
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 0 · Datasets citing this paper: 0 · Spaces citing this paper: 0 · Collections including this paper: 0

Cite arxiv.org/abs/2604.13902 in a model, dataset, or Space README.md, or add the paper to a collection, to link it from this page.