Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

Paper: arxiv.org
Blog post & Audio samples: kyutai.org

Overview

This model is a full-duplex spoken dialogue model post-trained with reinforcement learning (RL) to improve interactivity. Starting from Moshi (Défossez et al., 2024) or PersonaPlex (Roy et al., 2026), post-training targets the four canonical axes of full-duplex interactivity: pause handling, turn-taking, backchanneling, and user interruption. Each axis has its own reward, optimized with GRPO, and an additional LLM Judge reward preserves response quality.
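To make the GRPO step more concrete, here is a minimal sketch of the group-relative advantage that GRPO is generally built around: several responses are sampled for the same input, each is scored, and the scores are normalized within the group. This is not the training code for this model. The function names, the additive way of combining an axis reward with the judge reward, and the `judge_weight` parameter are all assumptions made for illustration.

```python
import numpy as np


def combined_reward(axis_reward, judge_reward, judge_weight=0.5):
    """Hypothetical combination of an axis-specific reward and an LLM-judge reward.

    The weighted sum and the default weight are assumptions for this sketch;
    the model card does not specify how the two rewards are combined.
    """
    return axis_reward + judge_weight * judge_reward


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Each response's advantage is its reward minus the group mean,
    divided by the group standard deviation (plus eps for stability).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


if __name__ == "__main__":
    # Four sampled responses to the same input, scored by a hypothetical reward.
    group = [combined_reward(a, j) for a, j in [(1.0, 0.8), (0.0, 0.9), (0.0, 0.2), (1.0, 0.4)]]
    print(group_relative_advantages(group))
```

Responses that score above the group average receive a positive advantage and are reinforced, while those below receive a negative one. When every response in a group scores the same, all advantages are zero and the group contributes no learning signal.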

Compared to the base models, the post-trained models reduce cases where the model inappropriately barges in on the user, substantially improve turn-taking response latency, and promote well-timed backchanneling, as evaluated on both Full-Duplex-Bench v1 (static, using pre-recorded audio input) and Full-Duplex-Bench v2 (dynamic, using real-time multi-turn dialogue).

Training Data

We construct RL training data from Seamless Interaction (Agrawal et al., 2025), a 4,000-hour two-party human conversation corpus in which each speaker is recorded on a separate channel. For each of the four interactivity axes, we use voice activity detection to automatically extract up to 2,000 relevant segments from this corpus.
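As a rough illustration of how voice activity detection can turn a single-speaker channel into candidate segments, the sketch below uses a simple frame-energy threshold. It is not the extraction pipeline used for this model: the energy-based detector, the 20 ms frame size, the threshold value, and both function names are assumptions, and a production setup would more likely rely on a trained VAD model.

```python
import numpy as np


def energy_vad(samples, sample_rate, frame_ms=20, threshold=0.01):
    """Return a boolean mask with one entry per frame: True where RMS energy exceeds threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    # Drop the trailing partial frame and split the signal into fixed-size frames.
    frames = np.asarray(samples[: n_frames * frame_len], dtype=np.float64)
    frames = frames.reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return rms > threshold


def mask_to_segments(mask, frame_ms=20):
    """Convert a frame-level speech mask into a list of (start_s, end_s) speech segments."""
    segments = []
    start = None
    for i, active in enumerate(mask):
        if active and start is None:
            start = i  # speech begins
        elif not active and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None  # speech ends
    if start is not None:
        # Speech runs until the end of the signal.
        segments.append((start * frame_ms / 1000, len(mask) * frame_ms / 1000))
    return segments
```

Running this on each speaker's channel separately gives two lists of speech segments. Comparing them is one way to locate events such as pauses within a turn, speaker changes, or overlapping speech, which correspond loosely to the interactivity axes described above.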

Models

We release two RL-trained models, one for each base model.

🤗 kyutai/moshika-rl-seamless: based on kyutai/moshika-pytorch-bf16
🤗 kyutai/personaplex-rl-seamless: based on nvidia/personaplex-7b-v1

Usage

This model uses the Moshi architecture and is compatible with the official Moshi inference code.

pip install moshi
python -m moshi.server --hf-repo kyutai/moshika-rl-seamless

Then open the local web client to start a real-time, full-duplex conversation. See the official Moshi README for further details on installation and usage.

Throughout our experiments, we prepend 3 seconds of silence to the input audio to allow time for Moshi to produce its conversation-initiating phrase before the user begins speaking.
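The silence padding described above can be reproduced with a few lines of NumPy. This is a minimal sketch that assumes the audio is already loaded as an array with time on the first axis; the function name and the 24 kHz sample rate in the usage example are illustrative assumptions, so use the sample rate of your own input.

```python
import numpy as np


def prepend_silence(audio, sample_rate, silence_s=3.0):
    """Return a copy of `audio` with `silence_s` seconds of zeros added at the start.

    `audio` is expected to have time on axis 0, with shape (samples,) or (samples, channels).
    """
    audio = np.asarray(audio)
    n_silence = int(round(sample_rate * silence_s))
    # Match the dtype and any trailing channel dimensions of the input.
    silence = np.zeros((n_silence,) + audio.shape[1:], dtype=audio.dtype)
    return np.concatenate([silence, audio], axis=0)


if __name__ == "__main__":
    sample_rate = 24000  # assumed for this example
    user_audio = np.ones(sample_rate, dtype=np.float32)  # 1 second of placeholder audio
    padded = prepend_silence(user_audio, sample_rate)
    print(padded.shape[0] / sample_rate, "seconds")
```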

Bias, Risks, and Limitations

This model is intended for research use only and is not recommended for providing advice or performing any professional duty. It should not be used to impersonate other people or for any malicious purpose.

The rule-based reward design for each axis requires manual engineering and becomes increasingly difficult to scale as the number of axes grows. We have also observed that the conversational style of the training data can affect the model's safety behavior, making the incorporation of safety-aware rewards or constraints into the RL process an important direction for future work.

License

This model is licensed under CC BY-NC 4.0.

Citation

@article{ohashi2026multifaceted,
  title={Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models},
  author={Ohashi, Atsumoto and Zeghidour, Neil and D{\'e}fossez, Alexandre and Kharitonov, Eugene},
  journal={arXiv preprint arXiv:2606.11167},
  year={2026}
}