ZONOS2
ZONOS2 is Zyphra's autoregressive text-to-speech model. It generates multi-codebook audio tokens, decodes them to 44.1 kHz audio with the Descript Audio Codec (DAC), and supports speaker conditioning from reference audio.
Original model: Zyphra/ZONOS2
Supported Repositories
| Repository | Format | Notes |
|---|---|---|
| mlx-community/Zyphra-ZONOS2 | BF16 | Official MLX conversion for mlx-audio |
Installation
pip install mlx-audio
Usage
Python API:
from mlx_audio.audio_io import write as audio_write
from mlx_audio.tts import load
model = load("mlx-community/Zyphra-ZONOS2", lazy=True)
result = next(model.generate(
    text="Hello, this is ZONOS two running locally with MLX audio.",
    max_tokens=220,
))
audio_write("zonos2.wav", result.audio, result.sample_rate)
Voice Cloning
Pass a short reference clip with ref_audio. Clean speech-only clips work best.
result = next(model.generate(
    text="This text will be spoken with the reference speaker.",
    ref_audio="speaker.wav",
    max_tokens=220,
))
You can also compute the speaker embedding once and reuse it from the Python API:
speaker = model.extract_speaker_embedding("speaker.wav")
result = next(model.generate(
    text="This reuses a precomputed speaker embedding.",
    speaker_embedding=speaker,
    max_tokens=220,
))
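When several lines of text share one voice, the embedding only needs to be extracted once. The following is a minimal sketch of that pattern, not part of mlx-audio: `synthesize_many` is a hypothetical helper name, and it assumes only the `extract_speaker_embedding` and `generate` calls shown above.

```python
def synthesize_many(model, texts, speaker_wav, max_tokens=220):
    """Generate one result per text, extracting the speaker embedding once.

    Hypothetical helper: `model` is assumed to be the object returned by
    mlx_audio.tts.load(), exposing the two methods used in the examples above.
    """
    # Extract the embedding a single time and reuse it for every text.
    speaker = model.extract_speaker_embedding(speaker_wav)
    results = []
    for text in texts:
        # generate() is consumed with next(), as in the examples above.
        result = next(model.generate(
            text=text,
            speaker_embedding=speaker,
            max_tokens=max_tokens,
        ))
        results.append(result)
    return results
```

Each returned result can then be written with `audio_write`, as in the basic Python API example.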
CLI
python -m mlx_audio.tts.generate \
  --model mlx-community/Zyphra-ZONOS2 \
  --text "Hello, this is ZONOS two running with MLX audio." \
  --output_path outputs \
  --file_prefix zonos2
Voice cloning:
python -m mlx_audio.tts.generate \
  --model mlx-community/Zyphra-ZONOS2 \
  --text "This text will use the voice from the reference clip." \
  --ref_audio speaker.wav \
  --output_path outputs \
  --file_prefix zonos2_clone
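To drive the same CLI from a script, the argument list can be assembled in Python and passed to `subprocess`. This is a sketch only: `build_tts_command` and `run_tts` are hypothetical helper names, and the flags are limited to the ones shown in the commands above.

```python
import subprocess
import sys


def build_tts_command(text, output_path="outputs", file_prefix="zonos2",
                      ref_audio=None, model="mlx-community/Zyphra-ZONOS2"):
    """Return the argv list for the mlx_audio.tts.generate CLI shown above."""
    cmd = [
        sys.executable, "-m", "mlx_audio.tts.generate",
        "--model", model,
        "--text", text,
        "--output_path", output_path,
        "--file_prefix", file_prefix,
    ]
    if ref_audio is not None:
        # Add the voice-cloning flag only when a reference clip is given.
        cmd += ["--ref_audio", str(ref_audio)]
    return cmd


def run_tts(text, **kwargs):
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_tts_command(text, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or other special characters.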
Generation Parameters
| Parameter | Default | Description |
|---|---|---|
| ref_audio | None | Reference audio path or array for voice cloning |
| speaker_embedding | None | Precomputed 2048-D speaker embedding, Python API only |
| max_tokens | 1024 | Maximum number of audio token frames |
| temperature | 1.15 | Sampling temperature |
| top_k | 106 | Top-k sampling filter |
| top_p | 0.0 | Nucleus sampling filter, disabled at 0 |
| min_p | 0.18 | Minimum-probability sampling filter |
| repetition_penalty | 1.2 | Repetition penalty applied to recent audio tokens |
| seed | None | Seed for deterministic sampling |
| text_normalization | True | English text normalization toggle, Python API only |
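For intuition about how temperature, top_k, and min_p shape the sampling distribution, here is a generic pure-Python sketch of those three filters. It is not the mlx-audio implementation: `filter_probs` is a hypothetical name, the order in which the filters are applied is an assumption, and top_p and repetition_penalty are left out.

```python
import math


def filter_probs(logits, temperature=1.15, top_k=106, min_p=0.18):
    """Turn raw logits into a filtered, renormalized probability list.

    Generic illustration only; the filter order here is an assumption.
    """
    # Temperature: divide logits, then softmax (max subtracted for stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most probable tokens (ties at the cutoff are kept).
    if top_k and top_k < len(probs):
        cutoff = sorted(probs, reverse=True)[top_k - 1]
        probs = [p if p >= cutoff else 0.0 for p in probs]

    # Min-p: drop tokens whose probability is below min_p times the maximum.
    if min_p > 0.0:
        floor = min_p * max(probs)
        probs = [p if p >= floor else 0.0 for p in probs]

    # Renormalize so the surviving tokens sum to 1.
    kept = sum(probs)
    return [p / kept for p in probs]
```

In this sketch a higher temperature flattens the distribution before filtering, while a higher min_p removes more of the low-probability tail relative to the most likely token.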
Notes
- Output audio is mono 44.1 kHz.
- DAC dependency: mlx-community/descript-audio-codec-44khz
- Speaker encoder: marksverdhei/Qwen3-Voice-Embedding-12Hz-1.7B
- Reference-audio speaker extraction uses the bundled speaker encoder.
- English text normalization handles common written forms; unsupported languages fall back to raw UTF-8 byte prompting.
- Streaming is not implemented yet. Use non-streaming generation.
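Because generation is non-streaming, one way to handle long input is to split it into sentence-sized chunks and generate each chunk in turn. The sketch below assumes only the `model.generate` call shown earlier; `split_sentences`, `generate_chunks`, and the `max_chars` limit are hypothetical, and the punctuation-based splitting is a simplification aimed at English text.

```python
import re


def split_sentences(text, max_chars=200):
    """Split at sentence punctuation, packing sentences up to max_chars."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # The current chunk is full; start a new one with this sentence.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def generate_chunks(model, text, max_chars=200, **generate_kwargs):
    """Yield one non-streaming result per chunk, in order."""
    for chunk in split_sentences(text, max_chars):
        yield next(model.generate(text=chunk, **generate_kwargs))
```

A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence. Passing the same `speaker_embedding` to every chunk should help keep the voice consistent; how the per-chunk audio is joined depends on the array type of `result.audio`, which this card does not specify.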
License
See the Zyphra/ZONOS2 model card for upstream license and usage details.
Model size: 8B params
Tensor type: BF16