Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability
Abstract
This work improves Large Audio Language Models for role-play text-to-speech with a new evaluation metric and a reinforcement learning approach, strengthening stylistic consistency and alignment with role-play instructions.
Recent advances in Large Audio Language Models (LALMs) have extended Text-to-Speech (TTS) to interactive role-play scenarios, which demand high expressiveness and strict adherence to role-play instructions. However, existing models struggle to maintain stylistic consistency with character profiles and scene descriptions across multi-turn dialogues. A critical bottleneck is the lack of objective metrics for quantifying speaking style. To bridge this gap, we propose Mean Continuation Log-Probability (MCLP) as both an evaluation metric and a reward signal, validated on LALM-based Role-Play TTS (RP-TTS) tasks. Specifically, we leverage the in-context learning capability of pre-trained LALMs to formulate MCLP as a continuation log-probability prediction. This metric quantifies stylistic consistency by measuring the likelihood of the ground-truth speech conditioned on the generated speech. Furthermore, we employ MCLP as a reinforcement learning reward to enhance the style alignment between generated speech and role-play instructions. To facilitate evaluation, we construct an RP-TTS dataset with rich scene and character annotations. Experimental results demonstrate that our method significantly outperforms strong LALM baselines on both objective and subjective metrics.
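As a rough illustration of the idea behind MCLP, the sketch below averages the log-probability that a model assigns to each ground-truth continuation token. It is not the authors' implementation: the abstract does not give the exact formulation, so the sketch assumes that speech is represented as discrete tokens and that a pre-trained LALM, prompted with the generated speech, has already produced one vector of logits per ground-truth position. The names `log_softmax` and `mean_continuation_log_prob` are hypothetical, and the logits in the usage lines are toy values.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mean_continuation_log_prob(
    continuation_logits: Sequence[Sequence[float]],
    continuation_tokens: Sequence[int],
) -> float:
    """Average log-probability of the ground-truth continuation tokens.

    Assumption: continuation_logits[t] holds the model's logits for position t
    of the ground-truth speech, computed with the generated speech (and any
    earlier ground-truth tokens) as context. Producing these logits with a
    real LALM is outside the scope of this sketch.
    """
    if len(continuation_logits) != len(continuation_tokens):
        raise ValueError("need exactly one logit vector per continuation token")
    if not continuation_tokens:
        raise ValueError("continuation must contain at least one token")

    total = 0.0
    for logits, token in zip(continuation_logits, continuation_tokens):
        total += log_softmax(logits)[token]
    return total / len(continuation_tokens)


# Toy usage with a 4-token vocabulary (illustrative values only).
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3                # model has no preference
peaked = [[5.0, 0.0, 0.0, 0.0],
          [0.0, 5.0, 0.0, 0.0],
          [0.0, 0.0, 5.0, 0.0]]                     # model favours the target
target = [0, 1, 2]

score_uniform = mean_continuation_log_prob(uniform, target)
score_peaked = mean_continuation_log_prob(peaked, target)
print(score_uniform, score_peaked)
```

Under these assumptions, a higher (less negative) score means the model finds the ground-truth speech a more likely continuation of the generated speech. Averaging over tokens rather than summing keeps scores comparable across utterances of different lengths.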
Models citing this paper 2
y-ren16/MCLP-RPTTS
Datasets citing this paper 1
y-ren16/WenetSpeech-RP