arxiv:2607.02291

Optimizing Visual Generative Models via Distribution-wise Rewards

Published on Jul 2

Submitted by Ruihang Li on Jul 3

Tencent Hunyuan

Abstract

A novel reinforcement learning framework for visual generation uses distribution-wise rewards to improve image diversity and quality while addressing mode collapse and computational efficiency issues.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Conventional reinforcement learning (RL) strategies for visual generation typically employ sample-wise reward functions, yet this practice frequently results in reward hacking that degrades image diversity and introduces visual anomalies. To address these limitations, we present a novel framework that finetunes generative models using distribution-wise rewards, ensuring better alignment with real-world data distributions. Unlike rewards that evaluate samples individually, a distribution-wise reward accounts for the distribution of the generated samples as a whole, mitigating the mode collapse that occurs when all samples are optimized independently in the same direction. To overcome the prohibitive computational cost of estimating these rewards, we introduce a subset-replace strategy that efficiently provides reward signals by updating only a small subset of a generated reference set. Additionally, we apply RL to optimize post-hoc model merging coefficients, potentially mitigating the train-inference inconsistency caused by introducing a stochastic differential equation (SDE) in regular RL practice. Extensive experiments show that our approach significantly improves FID-50K (lower is better) across various base models, from 8.30 to 5.77 for SiT and from 3.74 to 3.52 for EDM2. Qualitative evaluation also confirms that our method enhances perceptual quality while preserving sample diversity.
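The subset-replace idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the distribution-wise reward is a negative Fréchet distance (the quantity behind FID) computed on precomputed feature vectors, and the function names, set sizes, and toy Gaussian features are all hypothetical. A small batch of new samples is swapped into a fixed reference set, and the batch is scored by the distance between the modified set and the real-data statistics.

```python
import numpy as np


def gaussian_stats(feats):
    """Mean and covariance of feature rows (one row per image)."""
    return feats.mean(axis=0), np.cov(feats, rowvar=False)


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians fitted to feature sets."""
    diff = mu1 - mu2
    # tr((S1 S2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of S1 S2, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)


def subset_replace_reward(ref_feats, new_feats, real_mu, real_sigma, rng):
    """Assumed reward: negative Frechet distance after swapping the new
    batch into randomly chosen slots of the reference set."""
    idx = rng.choice(len(ref_feats), size=len(new_feats), replace=False)
    mixed = ref_feats.copy()
    mixed[idx] = new_feats
    mu, sigma = gaussian_stats(mixed)
    return -frechet_distance(mu, sigma, real_mu, real_sigma)


# Toy data standing in for image features (hypothetical, 8-dimensional).
data_rng = np.random.default_rng(0)
real_mu, real_sigma = gaussian_stats(data_rng.normal(size=(2000, 8)))
ref_feats = data_rng.normal(loc=0.5, size=(500, 8))   # slightly off-distribution
diverse_batch = data_rng.normal(size=(50, 8))         # matches the real data
collapsed_batch = np.full((50, 8), 2.0)               # every sample identical

# Same seed for both calls so the same reference slots are replaced.
r_diverse = subset_replace_reward(
    ref_feats, diverse_batch, real_mu, real_sigma, np.random.default_rng(1))
r_collapsed = subset_replace_reward(
    ref_feats, collapsed_batch, real_mu, real_sigma, np.random.default_rng(1))
print(f"diverse batch reward:   {r_diverse:.3f}")
print(f"collapsed batch reward: {r_collapsed:.3f}")
```

In this toy setup the collapsed batch receives the lower reward, because identical samples pull the set statistics away from the real data. A sample-wise reward that scores each image independently cannot penalize a batch in this way.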

Community

rhli (paper submitter)

Proposes a distribution-wise RL framework for visual generation that mitigates reward hacking and mode collapse. Using an efficient subset-replace strategy, the approach significantly improves visual quality and diversity on the SiT model.