# Chatterbox TTS — GGUF (ggml-quantised)

GGUF / ggml conversion of ResembleAI/chatterbox for use with CrispStrobe/CrispASR.

Chatterbox is a full TTS pipeline: character tokenizer → T3 (30-layer Llama AR, 520M) → speech tokens → S3Gen (Conformer encoder + UNet1D CFM denoiser, 10 Euler steps) → HiFTGenerator vocoder (conv chains + Snake activations + iSTFT) → 24 kHz WAV. Distributed under the MIT license.

Two GGUF files are needed: the T3 model (text β†’ speech tokens) and the S3Gen model (speech tokens β†’ audio).

## Files

| File | Quant | Size | Notes |
|---|---|---|---|
| chatterbox-t3-f16.gguf | F16 | 1.1 GB | T3 AR model — reference quality |
| chatterbox-t3-q8_0.gguf | Q8_0 | 542 MB | T3 AR model — recommended |
| chatterbox-t3-q4_k.gguf | Q4_K | 287 MB | T3 AR model — smallest |
| chatterbox-s3gen-f16.gguf | F16 | 548 MB | S3Gen + vocoder — reference quality |
| chatterbox-s3gen-q8_0.gguf | Q8_0 | 342 MB | S3Gen + vocoder — recommended |
| chatterbox-s3gen-q4_k.gguf | Q4_K | 237 MB | S3Gen + vocoder — smallest |

Note: vocoder weights (conv_pre, resblocks, conv_post, source fusion) are kept at F32 in all quant levels for audio quality. Quantization applies to the Conformer encoder, UNet decoder, and T3 Llama layers.
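
As a rough sanity check on the sizes above, the bits-per-weight of each ggml quant type follows from its block layout (block sizes per the ggml source; this is a sketch only, since the real files keep some tensors at F32/F16 and also carry GGUF metadata):

```python
# Bits per weight for ggml quant formats, derived from their block layouts.
# Q8_0: blocks of 32 weights -> 1 fp16 scale (2 bytes) + 32 int8 = 34 bytes.
# Q4_K: super-blocks of 256 weights -> 144 bytes (fp16 d + dmin,
#       12 bytes of 6-bit scales/mins, 128 bytes of 4-bit quants).
def bits_per_weight(block_bytes: int, block_weights: int) -> float:
    return block_bytes * 8 / block_weights

q8_0 = bits_per_weight(34, 32)    # 8.5 bits/weight
q4_k = bits_per_weight(144, 256)  # 4.5 bits/weight
f16 = 16.0

# Very rough size estimate for the ~520M-parameter T3 model as if *every*
# weight were quantised; the listed files deviate because of F32-kept tensors.
params = 520e6
for name, bpw in [("f16", f16), ("q8_0", q8_0), ("q4_k", q4_k)]:
    print(f"{name}: ~{params * bpw / 8 / 1e6:.0f} MB")
```

The F16 estimate (~1040 MB) lines up with the 1.1 GB file; the quantised estimates land slightly above the listed sizes because not every tensor in the file is the nominal quant type.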

## Quick start

```shell
# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF
cmake --build build -j --target chatterbox

# 2. Pull both model files
huggingface-cli download cstr/chatterbox-GGUF chatterbox-t3-q8_0.gguf --local-dir .
huggingface-cli download cstr/chatterbox-GGUF chatterbox-s3gen-q8_0.gguf --local-dir .

# 3. Synthesise (C API / test binary — CLI adapter in progress)
# See tests/test_voc_wav.cpp for vocoder-only usage
```
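
After downloading, a quick integrity check is to confirm each file begins with the 4-byte `GGUF` magic defined by the GGUF spec (a minimal sketch; the filenames are the Q8_0 pair from the table above):

```python
import os

def looks_like_gguf(path: str) -> bool:
    """True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

for name in ("chatterbox-t3-q8_0.gguf", "chatterbox-s3gen-q8_0.gguf"):
    if os.path.exists(name):
        print(name, "OK" if looks_like_gguf(name) else "NOT a GGUF file")
```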

## Architecture

```
Text → Character tokenizer (704 tokens)
     → T3 Llama AR (30 layers, 1024D, 16 heads, RoPE, SwiGLU, CFG)
     → 25 Hz speech tokens (6561 codebook)
     → Conformer encoder (6 pre + 4 post upsample, 512D, 8 heads)
     → 80-channel mel spectrogram
     → UNet1D CFM denoiser (1 down + 12 mid + 1 up, 256 ch, 10 Euler steps)
     → HiFTGenerator vocoder (3× ConvTranspose1d + 9 ResBlocks + Snake + iSTFT)
     → 24 kHz mono WAV
```
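
The rates in the diagram imply a fixed samples-per-token ratio, which is handy for estimating output length (simple arithmetic on the 25 Hz token rate and 24 kHz sample rate stated above):

```python
TOKEN_RATE_HZ = 25       # speech tokens per second (T3 output)
SAMPLE_RATE_HZ = 24_000  # output WAV sample rate

# Each AR speech token ultimately covers a fixed slice of audio.
samples_per_token = SAMPLE_RATE_HZ // TOKEN_RATE_HZ  # 960 samples/token

def estimated_duration_s(n_tokens: int) -> float:
    """Rough audio duration implied by an AR token count."""
    return n_tokens / TOKEN_RATE_HZ

print(samples_per_token)          # 960
print(estimated_duration_s(250))  # 10.0 (seconds of audio)
```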

## Quality verification

ASR round-trip on the Python reference mel (no source fusion, deterministic):

| Metric | Value |
|---|---|
| ASR output (moonshine-base) | "Hello world" (correct) |
| Per-stage cosine vs Python ref | 1.000 (conv_pre through rb_2) |
| Waveform cosine vs torch.istft | 0.93 |
| STFT range | [-0.82, 2.0] (ref [-1.1, 1.7]) |

All quantization levels (F16/Q8_0/Q4_K) produce ASR-identical output on the reference mel.
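
The per-stage and waveform figures above are cosine similarities; a minimal pure-Python version of that metric (the actual comparison harness lives in the CrispASR tests) looks like:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length float sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Identical signals score 1.0; the 0.93 waveform figure above means the
# ggml iSTFT output is close to, but not bit-exact with, torch.istft.
print(cosine([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # ~1.0
```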

## Conversion

```shell
python models/convert-chatterbox-to-gguf.py \
    --input ResembleAI/chatterbox \
    --output-dir .
```

Requires `pip install gguf safetensors torch huggingface_hub`.

## License

MIT — same as the upstream ResembleAI/chatterbox.
