TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
Abstract
Trajectory Matching Policy Optimization (TMPO) addresses reward hacking in diffusion model alignment by replacing scalar reward maximization with trajectory-level reward distribution matching, improving generative diversity and maintaining reward performance through Softmax Trajectory Balance and dynamic stochastic tree sampling.
Reinforcement learning (RL) has shown extraordinary potential in aligning diffusion models to downstream tasks, yet most existing RL-based methods still suffer from significant reward hacking, which degrades generative diversity and quality by inducing visual mode collapse and amplifying unreliable rewards. We identify the root cause as the mode-seeking nature of these methods: they maximize expected reward without effectively constraining the probability distribution over acceptable trajectories, so the policy concentrates on a few high-reward paths. In contrast, we propose Trajectory Matching Policy Optimization (TMPO), which replaces scalar reward maximization with trajectory-level reward distribution matching. Specifically, TMPO introduces a Softmax Trajectory Balance (Softmax-TB) objective to match the policy probabilities of K trajectories to a reward-induced Boltzmann distribution. We prove that this objective inherits the mode-covering property of forward KL divergence, preserving coverage over all acceptable trajectories while optimizing reward. To further reduce multi-trajectory training time on large-scale flow-matching models, TMPO incorporates Dynamic Stochastic Tree Sampling, where trajectories share denoising prefixes and branch at dynamically scheduled steps, reducing redundant computation while improving training effectiveness. Extensive results across diverse alignment tasks such as human preference, compositional generation, and text rendering show that TMPO improves generative diversity over state-of-the-art methods by 9.1% and achieves competitive performance on all downstream and efficiency metrics, attaining the optimal trade-off between reward and diversity.
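To make the distribution-matching idea concrete, here is a minimal sketch of matching a policy's probabilities over K sampled trajectories to a reward-induced Boltzmann distribution. It is not the paper's Softmax-TB objective, whose exact form the abstract does not give. It assumes the target is softmax(reward / temperature) over the K trajectories, that the policy distribution is the softmax of the K trajectory log-probabilities, and that the mismatch is measured with forward KL. The function names and the `temperature` parameter are hypothetical.

```python
import numpy as np


def log_softmax(x):
    """Numerically stable log-softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    shifted = x - x.max()
    return shifted - np.log(np.exp(shifted).sum())


def softmax_matching_loss(traj_log_probs, rewards, temperature=1.0):
    """Forward KL from a Boltzmann target to the policy over K trajectories.

    Assumed form for illustration only, not the paper's exact objective.
    traj_log_probs: log-probability the policy assigns to each of K trajectories.
    rewards: scalar reward of each of the K trajectories.
    """
    # Policy distribution renormalized over the K sampled trajectories.
    log_q = log_softmax(traj_log_probs)
    # Reward-induced Boltzmann target: p_i proportional to exp(r_i / temperature).
    log_p = log_softmax(np.asarray(rewards, dtype=float) / temperature)
    p = np.exp(log_p)
    # KL(p || q): penalizes the policy for giving low mass to any trajectory
    # that the target supports, which is the mode-covering direction.
    return float(np.sum(p * (log_p - log_q)))


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]  # three equally good trajectories
    collapsed = softmax_matching_loss([0.0, 0.0, 10.0], rewards)
    spread = softmax_matching_loss([2.0, 2.0, 2.0], rewards)
    print(f"collapsed policy loss: {collapsed:.3f}")
    print(f"spread policy loss:    {spread:.3f}")
```

In this sketch a policy that puts nearly all of its mass on one of three equally rewarded trajectories incurs a large loss, while a policy that spreads its mass evenly incurs none. A pure expected-reward objective would score both policies identically, which is the contrast the abstract draws between mode-seeking and mode-covering behavior.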
Community
🎉 Official resources for our paper are now available!
📦 Code: https://github.com/MAIR-Lab-HUST/TMPO
🌍 Project Page: https://tmpo-diffusion.github.io/
TMPO proposes a fundamental shift in Diffusion RL alignment goals and achieves an improved reward–diversity trade-off for generative model alignment, delivering higher diversity with reduced training cost.
If you find our work useful, please consider starring the repository and citing our paper. Thanks for your support! ⭐🚀