ZONOS2

ZONOS2 is Zyphra's autoregressive text-to-speech model. It generates multi-codebook audio tokens, decodes them to 44.1 kHz audio with the Descript Audio Codec (DAC), and conditions the voice on a speaker embedding extracted from reference audio.

Original model: Zyphra/ZONOS2

Supported Repositories

| Repository | Format | Notes |
|---|---|---|
| `mlx-community/Zyphra-ZONOS2` | BF16 | Official MLX conversion for `mlx-audio` |

Installation

pip install mlx-audio

Usage

Python API:

from mlx_audio.audio_io import write as audio_write
from mlx_audio.tts import load

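# Load the MLX weights from the Hugging Face Hub.
model = load("mlx-community/Zyphra-ZONOS2", lazy=True)

# generate() returns an iterator of results; take the first one.
result = next(model.generate(
    text="Hello, this is ZONOS two running locally with MLX audio.",
    max_tokens=220,  # cap on the number of audio token frames
))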

audio_write("zonos2.wav", result.audio, result.sample_rate)
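
To get a rough feel for what `max_tokens=220` allows, the frame budget can be converted to seconds. The sketch below assumes that one token frame corresponds to 512 output samples, which is the hop length of the public 44.1 kHz Descript Audio Codec. This card does not state the ZONOS2 frame rate, so treat the result as an estimate. `frames_to_seconds`, `seconds_to_frames`, and `HOP_SAMPLES` are illustrative names, not part of `mlx-audio`.

```python
import math

SAMPLE_RATE = 44_100  # output sample rate stated in this card
HOP_SAMPLES = 512     # assumption: hop length of the public 44.1 kHz DAC


def frames_to_seconds(frames: int, sample_rate: int = SAMPLE_RATE,
                      hop: int = HOP_SAMPLES) -> float:
    """Approximate duration of audio decoded from `frames` token frames."""
    return frames * hop / sample_rate


def seconds_to_frames(seconds: float, sample_rate: int = SAMPLE_RATE,
                      hop: int = HOP_SAMPLES) -> int:
    """Smallest frame budget that covers `seconds` of audio."""
    return math.ceil(seconds * sample_rate / hop)


print(round(frames_to_seconds(220), 2))   # budget used in the example above
print(round(frames_to_seconds(1024), 2))  # default max_tokens
print(seconds_to_frames(10.0))            # frames needed for about 10 s
```

Under this assumption, 220 frames is about 2.5 seconds of audio, so longer sentences may need a larger `max_tokens`.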

Voice Cloning

Pass a short reference clip with `ref_audio`. Clean clips that contain only speech work best.

result = next(model.generate(
    text="This text will be spoken with the reference speaker.",
    ref_audio="speaker.wav",
    max_tokens=220,
))

You can also compute the speaker embedding once and reuse it from the Python API:

speaker = model.extract_speaker_embedding("speaker.wav")

result = next(model.generate(
    text="This reuses a precomputed speaker embedding.",
    speaker_embedding=speaker,
    max_tokens=220,
))
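
Reusing the embedding pays off when several lines are spoken in the same voice. The helper below is a minimal sketch that uses only the calls shown above (`extract_speaker_embedding`, `generate`, `result.audio`, `result.sample_rate`). The name `synthesize_batch`, the `writer` parameter, and the `line_000.wav` file naming are hypothetical choices for illustration.

```python
from pathlib import Path


def synthesize_batch(model, texts, speaker_wav, out_dir="outputs",
                     max_tokens=220, writer=None):
    """Speak each text with one shared speaker embedding; return the WAV paths."""
    if writer is None:
        # Same writer as in the usage example; imported lazily so the helper
        # can be exercised with a stub when mlx-audio is not installed.
        from mlx_audio.audio_io import write as writer

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Run the speaker encoder once instead of once per sentence.
    speaker = model.extract_speaker_embedding(speaker_wav)

    paths = []
    for index, text in enumerate(texts):
        result = next(model.generate(
            text=text,
            speaker_embedding=speaker,
            max_tokens=max_tokens,
        ))
        path = out / f"line_{index:03d}.wav"
        writer(str(path), result.audio, result.sample_rate)
        paths.append(path)
    return paths
```

With a loaded model, a call such as `synthesize_batch(model, ["First line.", "Second line."], "speaker.wav")` would write one file per text into `outputs`.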

CLI

python -m mlx_audio.tts.generate \
  --model mlx-community/Zyphra-ZONOS2 \
  --text "Hello, this is ZONOS two running with MLX audio." \
  --output_path outputs \
  --file_prefix zonos2

Voice cloning:

python -m mlx_audio.tts.generate \
  --model mlx-community/Zyphra-ZONOS2 \
  --text "This text will use the voice from the reference clip." \
  --ref_audio speaker.wav \
  --output_path outputs \
  --file_prefix zonos2_clone

Generation Parameters

| Parameter | Default | Description |
|---|---|---|
| `ref_audio` | `None` | Reference audio path or array for voice cloning |
| `speaker_embedding` | `None` | Precomputed 2048-D speaker embedding, Python API only |
| `max_tokens` | 1024 | Maximum number of audio token frames |
| `temperature` | 1.15 | Sampling temperature |
| `top_k` | 106 | Top-k sampling filter |
| `top_p` | 0.0 | Nucleus sampling filter, disabled at 0 |
| `min_p` | 0.18 | Minimum-probability sampling filter |
| `repetition_penalty` | 1.2 | Repetition penalty applied to recent audio tokens |
| `seed` | `None` | Seed for deterministic sampling |
| `text_normalization` | `True` | English text normalization toggle, Python API only |
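
The sampling parameters in the table can be pictured with the plain-Python sketch below, which applies temperature, top-k, and min-p filtering to a toy logit vector and then draws a token with a seeded generator. It illustrates the generic techniques only. The order of operations and the min-p rule used here (keep tokens whose probability is at least `min_p` times the highest probability) are assumptions, not a description of the `mlx-audio` implementation, and `top_p` and `repetition_penalty` are left out. `filter_probs` and `sample` are hypothetical helpers.

```python
import math
import random


def filter_probs(logits, temperature=1.15, top_k=106, min_p=0.18):
    """Return renormalized probabilities after temperature, top-k, and min-p."""
    # Temperature: values above 1 flatten the distribution, below 1 sharpen it.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep = set(order[:top_k])

    # Min-p (assumed rule): drop tokens below min_p times the best probability.
    threshold = min_p * max(probs)
    kept = [p if i in keep and p >= threshold else 0.0
            for i, p in enumerate(probs)]

    norm = sum(kept)
    return [p / norm for p in kept]


def sample(logits, seed=None, **kwargs):
    """Draw one token index; a fixed seed makes the draw repeatable."""
    probs = filter_probs(logits, **kwargs)
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, -1.0, -3.0]
print(filter_probs(toy_logits, top_k=3))
print(sample(toy_logits, seed=7) == sample(toy_logits, seed=7))
```

In this toy case the two least likely tokens fall below the min-p threshold and receive zero probability, so only the first two indices can be drawn.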

Notes

Output audio is mono 44.1 kHz.
DAC dependency: mlx-community/descript-audio-codec-44khz
Speaker encoder: marksverdhei/Qwen3-Voice-Embedding-12Hz-1.7B
Reference-audio speaker extraction uses the bundled speaker encoder.
English text normalization handles common written forms; unsupported languages fall back to raw UTF-8 byte prompting.
Streaming is not implemented yet. Use non-streaming generation.
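
The byte-level fallback noted above can be pictured as using the raw UTF-8 bytes of the text as the prompt sequence. The sketch shows only the encoding step with the standard library. How ZONOS2 maps byte values to token IDs is not described in this card, and `utf8_byte_prompt` is a hypothetical name.

```python
def utf8_byte_prompt(text: str) -> list[int]:
    """Return the UTF-8 byte values of `text`, one integer (0-255) per byte."""
    return list(text.encode("utf-8"))


# Non-ASCII characters expand to several bytes each.
print(utf8_byte_prompt("Grüß"))  # [71, 114, 195, 188, 195, 159]
```

A byte-level prompt is therefore longer than the character count whenever the text contains non-ASCII characters.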

License

See the Zyphra/ZONOS2 model card for upstream license and usage details.


Model size: 8B params (Safetensors, BF16)
