vibevoice.cpp — quantized model bundle

Brought to you by the LocalAI team — the creators of LocalAI, the open-source AI engine that runs any model — LLMs, vision, voice, image, video — on any hardware. No GPU required.

Quantized GGUF weights for vibevoice.cpp, a C/C++ port of Microsoft VibeVoice (TTS + ASR) on top of ggml.

| File | Source | Quant | Size |
| --- | --- | --- | --- |
| `vibevoice-realtime-0.5B-q8_0.gguf` | `microsoft/VibeVoice-Realtime-0.5B` | Q8_0 (matmul) + F16 | ~1.6 GB |
| `vibevoice-asr-q8_0.gguf` | `microsoft/VibeVoice-ASR` | Q8_0 (matmul) + F16 | ~13 GB |
| `voice-en-Carter_man.gguf` | upstream voice prompt cache | F16 | 8 MB |
| `voice-en-Emma.gguf` | upstream voice prompt cache | F16 | 6 MB |
| `tokenizer.gguf` | Qwen2.5 BPE + VibeVoice specials | n/a | 6 MB |

Quantization scheme

`scripts/quantize_gguf.py` in the source repo quantizes only the LM matmul weights to Q8_0: attention q/k/v/o, ffn gate/up/down, and lm_head. Everything else (1-D conv kernels, RMSNorm scales, biases, layer-scale gammas, token embeddings, small scalars) passes through unchanged. The conv kernels in particular must stay unquantized: the conv1d implementation in vibevoice.cpp casts kernels to F16 inline rather than dequantizing on the fly, so quantizing them would corrupt the convolution outputs.
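
As a rough illustration of that selection rule, here is a minimal Python sketch. It is not the code from `scripts/quantize_gguf.py`; the tensor-name patterns and the `target_type` helper are hypothetical, and the real GGUF tensor names may differ.

```python
import re

# Hypothetical tensor-name patterns for the LM matmul weights.
# The actual names used in the GGUF files may differ.
QUANT_PATTERNS = [
    r"\.attn_(q|k|v|o)\.weight$",
    r"\.ffn_(gate|up|down)\.weight$",
    r"^lm_head\.weight$",
]
_QUANT_RE = re.compile("|".join(QUANT_PATTERNS))


def target_type(name: str, n_dims: int) -> str:
    """Return "Q8_0" for 2-D LM matmul weights and "keep" for everything else."""
    if n_dims == 2 and _QUANT_RE.search(name):
        return "Q8_0"
    return "keep"


# Illustrative (made-up) tensor names with their number of dimensions.
examples = [
    ("lm.blk.0.attn_q.weight", 2),     # attention projection: quantized
    ("lm.blk.0.ffn_down.weight", 2),   # feed-forward projection: quantized
    ("lm_head.weight", 2),             # output head: quantized
    ("lm.blk.0.attn_norm.weight", 1),  # RMSNorm scale: kept
    ("lm.blk.0.attn_q.bias", 1),       # bias: kept
    ("decoder.conv1.weight", 3),       # conv kernel: kept
    ("token_embd.weight", 2),          # token embeddings: kept
]
for name, nd in examples:
    print(f"{name:28s} -> {target_type(name, nd)}")
```

An allow-list like this keeps the default safe: any tensor that does not match a known matmul pattern is left at its original precision.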

Q8_0 was chosen because it can be implemented in pure Python with gguf-py and gives a ~60% size reduction on the 7B ASR model with no measurable quality regression in the closed-loop TTS → ASR roundtrip test.
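
To show why Q8_0 is easy to write in pure Python, the sketch below packs one block by hand using only the standard library. In ggml a Q8_0 block holds 32 weights as one F16 scale plus 32 signed bytes, which is 34 bytes, or 8.5 bits per weight. This is an illustration of the format, not the code used to produce this bundle; the function names and sample weights are made up. The overall saving for a whole file depends on the source precision and on how many tensors stay unquantized, so the per-block arithmetic here does not by itself reproduce the ~60% figure.

```python
import struct

QK8_0 = 32  # weights per Q8_0 block in ggml


def quantize_q8_0_block(values):
    """Pack 32 floats into one Q8_0 block: an F16 scale followed by 32 int8 values."""
    assert len(values) == QK8_0
    amax = max(abs(v) for v in values)
    d = amax / 127.0                 # scale so the largest magnitude maps to 127
    inv = 1.0 / d if d else 0.0
    # Note: Python's round() rounds halves to even, while C's roundf() rounds
    # halves away from zero, so exact ties may differ from ggml by one step.
    qs = [max(-127, min(127, round(v * inv))) for v in values]
    return struct.pack("<e32b", d, *qs)


def dequantize_q8_0_block(block):
    """Unpack one Q8_0 block back into 32 floats."""
    d, *qs = struct.unpack("<e32b", block)
    return [d * q for q in qs]


# Made-up sample weights in roughly [-0.32, 0.31].
weights = [((i * 37) % 64 - 32) / 100.0 for i in range(QK8_0)]
block = quantize_q8_0_block(weights)
restored = dequantize_q8_0_block(block)

bits_per_weight = len(block) * 8 / QK8_0
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"block size: {len(block)} bytes, {bits_per_weight} bits per weight")
print(f"saving vs F16: {1 - bits_per_weight / 16:.1%}, vs F32: {1 - bits_per_weight / 32:.1%}")
print(f"max abs round-trip error: {max_err:.5f}")
```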

Quickstart

git clone --recursive https://github.com/mudler/vibevoice.cpp
cd vibevoice.cpp && cmake -B build -DVIBEVOICE_BUILD_TESTS=ON && cmake --build build -j

# Pull this bundle
mkdir -p models && cd models
hf download mudler/vibevoice.cpp-models --local-dir .
cd ..

# TTS
build/bin/vibevoice-cli tts \
    --model models/vibevoice-realtime-0.5B-q8_0.gguf \
    --voice models/voice-en-Carter_man.gguf \
    --tokenizer models/tokenizer.gguf \
    --text "Hello world this is a test of the synthesis system." \
    --out hello.wav

# ASR
build/bin/vibevoice-cli asr \
    --model models/vibevoice-asr-q8_0.gguf \
    --tokenizer models/tokenizer.gguf \
    --audio hello.wav
# -> [{"Start":0,"End":2.8,"Speaker":0,"Content":"Hello world, this is a test of the synthesis system."}]

Closed-loop verification

The `test_closed_loop` ctest in vibevoice.cpp runs TTS → ASR end to end and asserts at least 80% source-word recall in the recovered transcript. With this bundle (both Q8_0 models) it passes at 10/10 (100%).
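
For readers who want to run a similar check on their own output, here is a minimal Python sketch of a source-word recall metric applied to the Quickstart example. It is an assumption about how such a metric can be computed, not the logic of `test_closed_loop`, which may normalize text differently; the `words` and `source_word_recall` helpers are hypothetical.

```python
import json
import re
from collections import Counter


def words(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def source_word_recall(source, transcript):
    """Count source words that also appear in the transcript (each transcript word is used once)."""
    src = words(source)
    available = Counter(words(transcript))
    hits = 0
    for w in src:
        if available[w] > 0:
            available[w] -= 1
            hits += 1
    return hits, len(src)


# Input text and ASR output copied from the Quickstart section.
source_text = "Hello world this is a test of the synthesis system."
asr_output = (
    '[{"Start":0,"End":2.8,"Speaker":0,'
    '"Content":"Hello world, this is a test of the synthesis system."}]'
)

segments = json.loads(asr_output)
transcript = " ".join(seg["Content"] for seg in segments)
hits, total = source_word_recall(source_text, transcript)
print(f"{hits}/{total} = {hits / total:.0%}")  # prints "10/10 = 100%"
assert hits / total >= 0.80
```

Stripping punctuation before comparing matters here: the ASR transcript adds a comma after "world" that is not in the input text.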

License

Weights are derived from Microsoft VibeVoice (VibeVoice-Realtime-0.5B and VibeVoice-ASR); follow the upstream model licenses for use. The conversion and quantization tooling is released under MIT as part of vibevoice.cpp.
