Chatterbox TTS - GGUF (ggml-quantised)
GGUF / ggml conversion of ResembleAI/chatterbox for use with CrispStrobe/CrispASR.
Chatterbox is a full TTS pipeline: character tokenizer → T3 (30-layer Llama AR, 520M parameters) → speech tokens → S3Gen (Conformer encoder + UNet1D CFM denoiser, 10 Euler steps) → HiFTGenerator vocoder (conv chains + Snake activations + iSTFT) → 24 kHz WAV. Distributed under the MIT license.
Two GGUF files are needed: the T3 model (text → speech tokens) and the S3Gen model (speech tokens → audio).
Files
| File | Quant | Size | Notes |
|---|---|---|---|
| chatterbox-t3-f16.gguf | F16 | 1.1 GB | T3 AR model, reference quality |
| chatterbox-t3-q8_0.gguf | Q8_0 | 542 MB | T3 AR model, recommended |
| chatterbox-t3-q4_k.gguf | Q4_K | 287 MB | T3 AR model, smallest |
| chatterbox-s3gen-f16.gguf | F16 | 548 MB | S3Gen + vocoder, reference quality |
| chatterbox-s3gen-q8_0.gguf | Q8_0 | 342 MB | S3Gen + vocoder, recommended |
| chatterbox-s3gen-q4_k.gguf | Q4_K | 237 MB | S3Gen + vocoder, smallest |
Note: vocoder weights (conv_pre, resblocks, conv_post, source fusion) are kept at F32 in all quant levels for audio quality. Quantization applies to the Conformer encoder, UNet decoder, and T3 Llama layers.
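As a rough sanity check on the sizes in the table, the tensor payload can be estimated as parameter count times bits per weight. The sketch below assumes the nominal ggml block layouts (F16 at 16 bits, Q8_0 at 8.5 bits, Q4_K at 4.5 bits per weight) and the 520M parameter figure quoted for T3. It ignores GGUF metadata and any tensors stored at a different precision, so it is only an approximation; `estimate_size_mb` is a hypothetical helper, not part of CrispASR.

```python
# Nominal bits per weight for the ggml types used in this repo.
# Q8_0: 32 int8 values plus one fp16 scale per block -> 34 bytes / 32 weights.
# Q4_K: one 144-byte super-block per 256 weights.
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_K": 4.5}


def estimate_size_mb(n_params: float, quant: str) -> float:
    """Approximate tensor payload in MB (10^6 bytes).

    Ignores file metadata and tensors kept at another precision.
    """
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e6


if __name__ == "__main__":
    t3_params = 520e6  # parameter count quoted for T3 in the overview
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{estimate_size_mb(t3_params, quant):.1f} MB")
```

For T3 the estimates land close to the listed file sizes, which suggests that most of the T3 weights are stored at the nominal quant level.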
Quick start
# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF
cmake --build build -j --target chatterbox
# 2. Pull both model files
huggingface-cli download cstr/chatterbox-GGUF chatterbox-t3-q8_0.gguf --local-dir .
huggingface-cli download cstr/chatterbox-GGUF chatterbox-s3gen-q8_0.gguf --local-dir .
# 3. Synthesise (C API / test binary; CLI adapter in progress)
# See tests/test_voc_wav.cpp for vocoder-only usage
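The download step can also be scripted from Python. This is a minimal sketch that assumes the `huggingface_hub` package is installed and that both files of a pair should use the same quant level. `model_filenames` and `download_pair` are hypothetical helper names introduced here for illustration.

```python
REPO_ID = "cstr/chatterbox-GGUF"
QUANT_LEVELS = ("f16", "q8_0", "q4_k")


def model_filenames(quant: str = "q8_0") -> tuple[str, str]:
    """Return the (T3, S3Gen) file names for one quant level."""
    if quant not in QUANT_LEVELS:
        raise ValueError(f"unknown quant level: {quant!r}")
    return (f"chatterbox-t3-{quant}.gguf", f"chatterbox-s3gen-{quant}.gguf")


def download_pair(quant: str = "q8_0", local_dir: str = ".") -> list[str]:
    """Download both model files and return their local paths."""
    # Imported lazily so model_filenames() works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in model_filenames(quant)
    ]
```

Calling `download_pair("q8_0")` would fetch the same two files as the `huggingface-cli` commands in step 2.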
Architecture
Text → Character tokenizer (704 tokens)
→ T3 Llama AR (30 layers, 1024D, 16 heads, RoPE, SwiGLU, CFG)
→ 25 Hz speech tokens (6561 codebook)
→ Conformer encoder (6 pre + 4 post upsample, 512D, 8 heads)
→ 80-channel mel spectrogram
→ UNet1D CFM denoiser (1 down + 12 mid + 1 up, 256 ch, 10 Euler steps)
→ HiFTGenerator vocoder (3× ConvTranspose1d + 9 ResBlocks + Snake + iSTFT)
→ 24 kHz mono WAV
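The two rates in the pipeline give a quick way to predict output length: at 25 speech tokens per second and a 24 kHz output, each token nominally corresponds to 960 samples. The sketch below only does that arithmetic. It assumes the end-to-end upsampling is exactly 24000 / 25 and ignores any padding or trimming the actual implementation may apply.

```python
TOKEN_RATE_HZ = 25        # speech tokens per second
SAMPLE_RATE_HZ = 24_000   # output sample rate

# Nominal number of output samples produced per speech token.
SAMPLES_PER_TOKEN = SAMPLE_RATE_HZ // TOKEN_RATE_HZ  # 960


def expected_audio(n_tokens: int) -> tuple[int, float]:
    """Return (sample count, duration in seconds) for n_tokens speech tokens."""
    n_samples = n_tokens * SAMPLES_PER_TOKEN
    return n_samples, n_samples / SAMPLE_RATE_HZ


if __name__ == "__main__":
    samples, seconds = expected_audio(125)
    print(samples, seconds)  # 120000 5.0
```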
Quality verification
ASR roundtrip on Python reference mel (no source fusion, deterministic):
| Metric | Value |
|---|---|
| ASR output (moonshine-base) | "Hello world" (correct) |
| Per-stage cosine vs Python ref | 1.000 (conv_pre through rb_2) |
| Waveform cosine vs torch.istft | 0.93 |
| STFT range | [-0.82, 2.0] (ref [-1.1, 1.7]) |
All quantization levels (F16/Q8_0/Q4_K) produce ASR-identical output on the reference mel.
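The cosine figures in the table compare an intermediate tensor or waveform against the Python reference. The source does not say exactly how they were computed; a minimal sketch, assuming both arrays are flattened to equal-length sequences of floats, could look like this:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A value of 1.000 means the two signals agree up to a positive scale factor, so cosine similarity on its own does not catch a pure gain difference.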
Conversion
python models/convert-chatterbox-to-gguf.py \
--input ResembleAI/chatterbox \
--output-dir .
Requires pip install gguf safetensors torch huggingface_hub.
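After conversion, a quick way to confirm that an output file is a plausible GGUF container is to read its fixed-size header. The sketch below assumes the little-endian GGUF v2+ layout (4-byte magic `GGUF`, a uint32 version, then uint64 tensor and metadata counts). `read_gguf_header` is a hypothetical helper, not part of the conversion script, and it does not validate anything beyond the header.

```python
import struct


def read_gguf_header(path: str) -> dict:
    """Read the fixed GGUF header fields (assumes little-endian, version >= 2)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        raw = f.read(20)  # uint32 version + uint64 tensor count + uint64 KV count
        if len(raw) != 20:
            raise ValueError("truncated GGUF header")
        version, n_tensors, n_kv = struct.unpack("<IQQ", raw)
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}
```

For example, `read_gguf_header("chatterbox-t3-q8_0.gguf")` would return the version and counts stored in that file. A tensor count of zero would indicate a failed conversion.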
Related models
- cstr/lahgtna-chatterbox-v1-GGUF: Arabic T3 variant (MIT, shares S3Gen)
- cstr/orpheus-3b-base-GGUF: Llama-3.2 + SNAC TTS
- cstr/qwen3-tts-0.6b-customvoice-GGUF: Qwen3-TTS with fixed speakers
License
MIT, same as the upstream ResembleAI/chatterbox.