Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

Overview

This model is a full-duplex spoken dialogue model post-trained with reinforcement learning (RL) to improve interactivity. Post-training starts from Moshi (Défossez et al., 2024) or PersonaPlex (Roy et al., 2026) and targets the four canonical axes of full-duplex interactivity: pause handling, turn-taking, backchanneling, and user interruption. Each axis has its own reward, optimized with GRPO, and an LLM Judge reward is added to preserve response quality.
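For readers unfamiliar with GRPO, the sketch below illustrates its core idea: rewards for several responses sampled from the same prompt are normalized against their own group mean and standard deviation. It is a generic illustration, not the training code used for this model. The `combined_reward` helper and its `judge_weight` parameter are hypothetical, as are the example reward values.

```python
from statistics import mean, pstdev


def combined_reward(axis_reward, judge_reward, judge_weight=0.5):
    # Hypothetical combination: the weighting actually used in training
    # is not given in this model card.
    return axis_reward + judge_weight * judge_reward


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses sampled for the same input (made-up reward values).
axis_rewards = [1.0, 0.0, 1.0, 0.0]   # e.g. a rule-based turn-taking reward
judge_rewards = [0.8, 0.6, 0.2, 0.4]  # e.g. an LLM judge quality score
rewards = [combined_reward(a, j) for a, j in zip(axis_rewards, judge_rewards)]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group average receive a positive advantage and are reinforced; those below average receive a negative one.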

Compared to the base models, the post-trained models reduce cases where the model inappropriately barges in on the user, substantially improve turn-taking response latency, and promote well-timed backchanneling, as evaluated on both Full-Duplex-Bench v1 (static, using pre-recorded audio input) and Full-Duplex-Bench v2 (dynamic, using real-time multi-turn dialogue).

Training Data

We construct RL training data from Seamless Interaction (Agrawal et al., 2025), a 4,000-hour two-party human conversation corpus in which each speaker is recorded on a separate channel. For each of the four interactivity axes, we use voice activity detection to automatically extract up to 2,000 relevant segments from this corpus.
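As a rough illustration of how per-channel voice activity detection can locate candidate segments, the sketch below marks frames whose RMS energy exceeds a threshold and merges nearby speech spans. It is a minimal stand-in under stated assumptions: the model card does not say which VAD was used, and `speech_segments`, the threshold, and the frame and gap sizes are all hypothetical.

```python
def speech_segments(samples, sample_rate, frame_ms=20, threshold=0.01, min_gap_ms=200):
    """Return (start_s, end_s) spans where frame RMS energy exceeds `threshold`.

    `samples` is one speaker's channel as floats in [-1, 1].
    Spans separated by less than `min_gap_ms` of silence are merged.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    active = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        rms = (sum(x * x for x in frame) / frame_len) ** 0.5
        active.append(rms > threshold)

    # Collect runs of consecutive active frames as (start_frame, end_frame).
    runs, start = [], None
    for idx, is_speech in enumerate(active):
        if is_speech and start is None:
            start = idx
        elif not is_speech and start is not None:
            runs.append((start, idx))
            start = None
    if start is not None:
        runs.append((start, len(active)))

    # Merge runs separated by short gaps.
    max_gap = min_gap_ms // frame_ms
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return [(s * frame_ms / 1000, e * frame_ms / 1000) for s, e in merged]


# Synthetic channel at 1 kHz: 0.5 s silence, 0.5 s speech, 1 s silence, 0.3 s speech.
channel = [0.0] * 500 + [0.5] * 500 + [0.0] * 1000 + [0.5] * 300
segments = speech_segments(channel, sample_rate=1000)
```

Running such a detector on each speaker's channel separately gives the timing of speech and silence for both parties, from which pauses, turn transitions, and overlaps can be identified.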

Models

We release two RL-trained models, one for each base model.

Usage

This model uses the Moshi architecture and is compatible with the official Moshi inference code.

pip install moshi
python -m moshi.server --hf-repo kyutai/moshika-rl-seamless

Then open the local web client to start a real-time, full-duplex conversation. See the official Moshi README for further details on installation and usage.

Throughout our experiments, we prepend 3 seconds of silence to the input audio to allow time for Moshi to produce its conversation-initiating phrase before the user begins speaking.
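The silence padding described above can be reproduced with the Python standard library. The following is a minimal sketch, assuming 16-bit PCM WAV input (for which zero bytes encode silence). The `prepend_silence` helper is hypothetical and is not part of the Moshi package; the in-memory clip and its sample rate are illustrative only.

```python
import io
import wave


def prepend_silence(src, dst, seconds=3.0):
    """Copy a 16-bit PCM WAV from `src` to `dst` with leading silence.

    `src` and `dst` may be file paths or binary file-like objects.
    """
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frames = reader.readframes(reader.getnframes())
    n_silent = int(params.framerate * seconds)
    # Zero bytes are silence for signed 16-bit PCM (not for 8-bit unsigned WAV).
    silence = b"\x00" * (n_silent * params.nchannels * params.sampwidth)
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)
        writer.writeframes(silence + frames)


# Demo on an in-memory 1-second mono clip.
source = io.BytesIO()
with wave.open(source, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(24000)
    w.writeframes(b"\x01\x00" * 24000)
source.seek(0)

padded = io.BytesIO()
prepend_silence(source, padded)
padded.seek(0)
with wave.open(padded, "rb") as r:
    total_frames = r.getnframes()
    first_frame = r.readframes(1)
```

With real files, the same function can be called with paths, for example `prepend_silence("input.wav", "padded.wav")`.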

Bias, Risks, and Limitations

This model is intended for research use only and is not recommended for providing advice or performing any professional duty. It should not be used to impersonate other people or for any malicious purpose.

The rule-based reward design for each axis requires manual engineering and becomes increasingly difficult to scale as the number of axes grows. We have also observed that the conversational style of the training data can affect the model's safety behavior, making the incorporation of safety-aware rewards or constraints into the RL process an important direction for future work.

License

This model is licensed under CC BY-NC 4.0.

Citation

@article{ohashi2026multifaceted,
  title={Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models},
  author={Ohashi, Atsumoto and Zeghidour, Neil and D{\'e}fossez, Alexandre and Kharitonov, Eugene},
  journal={arXiv preprint arXiv:2606.11167},
  year={2026}
}

Model size: 8B params (Safetensors, tensor type BF16)