arxiv:2605.10983

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

Published on May 13

Abstract

Trajectory Matching Policy Optimization (TMPO) addresses reward hacking in diffusion model alignment by replacing scalar reward maximization with trajectory-level reward distribution matching, improving generative diversity and maintaining reward performance through Softmax Trajectory Balance and dynamic stochastic tree sampling.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Reinforcement learning (RL) has shown extraordinary potential in aligning diffusion models to downstream tasks, yet most RL-based alignment methods still suffer from significant reward hacking, which degrades generative diversity and quality by inducing visual mode collapse and amplifying unreliable rewards. We identify the root cause as the mode-seeking nature of these methods: they maximize expected reward without effectively constraining the probability distribution over acceptable trajectories, so probability mass concentrates on a few high-reward paths. In contrast, we propose Trajectory Matching Policy Optimization (TMPO), which replaces scalar reward maximization with trajectory-level reward distribution matching. Specifically, TMPO introduces a Softmax Trajectory Balance (Softmax-TB) objective that matches the policy probabilities of K trajectories to a reward-induced Boltzmann distribution. We prove that this objective inherits the mode-covering property of forward KL divergence, preserving coverage over all acceptable trajectories while optimizing reward. To further reduce multi-trajectory training time on large-scale flow-matching models, TMPO incorporates Dynamic Stochastic Tree Sampling, in which trajectories share denoising prefixes and branch at dynamically scheduled steps, reducing redundant computation while improving training effectiveness. Extensive results across diverse alignment tasks, including human preference, compositional generation, and text rendering, show that TMPO improves generative diversity over state-of-the-art methods by 9.1% and achieves competitive performance on all downstream and efficiency metrics, attaining the optimal trade-off between reward and diversity.
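The two ideas in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the exact Softmax-TB loss is not given on this page, so the code below assumes one plausible reading, a forward KL between a Boltzmann target over K rewards and the policy's trajectory log-probabilities renormalized within the same group. The names `softmax_matching_loss`, `denoising_calls`, `beta`, and `branching` are hypothetical, and the branching schedule is passed in as a fixed input rather than scheduled dynamically.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def softmax_matching_loss(traj_logps, rewards, beta=1.0):
    """Forward KL(q || p) over K sampled trajectories.

    ASSUMPTION (not taken from the paper): the target q is softmax(r / beta)
    and the policy distribution p is the softmax of the K trajectory
    log-probabilities, i.e. both are normalized within the sampled group.
    """
    log_q = log_softmax([r / beta for r in rewards])
    log_p = log_softmax(traj_logps)
    return sum(math.exp(lq) * (lq - lp) for lq, lp in zip(log_q, log_p))


def denoising_calls(num_steps, branch_steps, branching=2):
    """Count denoiser calls for independent sampling vs. a shared-prefix tree.

    ASSUMPTION: every active path is duplicated `branching` times right
    before each step listed in `branch_steps`; steps before a branch point
    are computed once and shared by all descendants.
    """
    branch_set = set(branch_steps)
    active, tree_calls = 1, 0
    for t in range(num_steps):
        if t in branch_set:
            active *= branching
        tree_calls += active
    # `active` is now the number of leaf trajectories K.
    independent_calls = active * num_steps
    return independent_calls, tree_calls
```

Under these assumptions, the loss is zero when the trajectory log-probabilities equal the scaled rewards up to an additive constant, and it grows when the policy places nearly all of its mass on one of several equally rewarded trajectories, which is the mode-covering behaviour the abstract attributes to forward KL. For the cost count, `denoising_calls(10, (2, 5, 8))` yields 8 leaf trajectories and returns `(80, 36)`: 80 denoiser calls if the trajectories are sampled independently versus 36 when prefixes are shared.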

Community

ChaelChael

🎉 Official resources for our paper are now available!

📦 Code: https://github.com/MAIR-Lab-HUST/TMPO

🌍 Project Page: https://tmpo-diffusion.github.io/

TMPO proposes a fundamental shift in Diffusion RL alignment goals and achieves an improved reward–diversity trade-off for generative model alignment, delivering higher diversity with reduced training cost.

If you find our work useful, please consider starring the repository and citing our paper. Thanks for your support! ⭐🚀
