matcha-zh-en-8k: Matcha zh-en with a distilled 8 kHz vocoder
8 kHz telephony variant of csukuangfj/matcha-icefall-zh-en,
a Chinese-English code-mixed Matcha-TTS. The
acoustic model is unchanged (that's where the zh-en quality lives); only the vocoder is
replaced: the stock 16 kHz vocos-16khz-univ.onnx is distilled into an 8 kHz vocos-8khz-univ.onnx.
Built for a phone-attendant product on Jetson Nano gen1 CPU (ONNX Runtime CPU via sherpa-onnx):
the audio leaves through an 8 kHz G.711 channel, so native 8 kHz output avoids the on-device 16k→8k
resample and ships a smaller vocoder. It is a drop-in for the stock sherpa-onnx-offline-tts; only the
model dir changes.
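For readers unfamiliar with the channel: G.711 carries 8 kHz audio as one companded 8-bit byte per sample. The sketch below is a pure-Python μ-law encoder in the style of the classic table-free reference implementations. Whether the target deployment uses μ-law or A-law is not stated here, so μ-law is an assumption, and the function names are illustrative only.

```python
BIAS = 0x84    # 132, added to the magnitude before picking a segment
CLIP = 32635   # largest magnitude that still fits after adding BIAS


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Segment number = index of the highest set bit minus 7.
    # After adding BIAS, bit 7 is always set, so the result is in 0..7.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    # Four bits just below the leading bit form the mantissa.
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are stored bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def encode_pcm16(samples) -> bytes:
    """Encode an iterable of 16-bit PCM samples to mu-law bytes."""
    return bytes(linear_to_ulaw(s) for s in samples)
```

Because the codec expects 8 kHz input, audio produced natively at 8 kHz can be passed to an encoder like this without a resampling step in between.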
Listen: zh/en code-mixed (rendered by stock sherpa-onnx, 8 kHz)
| text | 8 kHz | through G.711 |
|---|---|---|
| 您好,這裡是台灣精密,請問您要找 David 還是 Michael? | | |
| Kevin 說下午的 meeting 改到 Zoom 線上進行。 | | |
| Please hold,我幫您查詢 VIP 會員資料,請稍候。 | | |
| The quick brown fox 和敏捷的棕色狐狸,一起測試中英文。 | | |
Full A/B vs the 16 kHz original and vs MeloTTS-8k: Luigi/zh-en-tts-8k-comparison.
Results
| metric | value |
|---|---|
| PESQ-NB (8k vocoder vs stock 16k vocoder → 8 kHz) | 3.80 (ceiling ≈ 4.5) |
| Full-pipeline RTF, x86 ORT CPU, 4 threads | 0.014 |
| Predicted Jetson Nano A57 RTF (×13 host factor) | ~0.18 |
| Vocoder size | 30 MB (vs 54 MB at 16 kHz) |
The vocoder accounts for only ~25 % of Matcha's compute (the flow-matching acoustic model dominates), so moving to 8 kHz is mainly a telephony-native and integration win rather than a large speedup. The models are fp32, ONNX opset 17, with no custom ops. Do not int8-quantize the vocoder convs: on the A57 this was measured to be counterproductive.
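The arithmetic behind the predicted RTF and the "not a large speed cut" remark can be checked in a few lines. This is an illustrative sketch only: it takes the ×13 host factor and the ~25 % vocoder share quoted above as given, and the helper names are hypothetical.

```python
# Illustrative arithmetic only; the inputs are the figures quoted above.
X86_RTF = 0.014        # full-pipeline RTF on x86 ORT CPU, 4 threads
HOST_FACTOR = 13       # x86 -> Jetson Nano A57 slowdown used in the table
VOCODER_SHARE = 0.25   # approximate fraction of compute spent in the vocoder


def predicted_rtf(host_rtf: float, factor: float) -> float:
    """Scale a measured host RTF by a host-to-device slowdown factor."""
    return host_rtf * factor


def max_speedup(share: float, component_speedup: float) -> float:
    """Amdahl-style bound: overall speedup when only one component gets faster."""
    return 1.0 / ((1.0 - share) + share / component_speedup)


if __name__ == "__main__":
    print(f"predicted A57 RTF: {predicted_rtf(X86_RTF, HOST_FACTOR):.3f}")
    # A vocoder that runs twice as fast only trims the vocoder's 25 % share.
    print(f"2x faster vocoder: {max_speedup(VOCODER_SHARE, 2.0):.2f}x overall")
    # Upper bound: a vocoder that costs nothing at all.
    print(f"free vocoder:      {max_speedup(VOCODER_SHARE, float('inf')):.2f}x overall")
```

Under these assumptions, even a vocoder with zero cost would cap the overall speedup at 1 / 0.75, which is why the 8 kHz variant is framed as an integration win.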
Usage: stock sherpa-onnx, only the model dir changes
sherpa-onnx-offline-tts \
  --matcha-acoustic-model=matcha-zh-en-8k/model-steps-3.onnx \
  --matcha-vocoder=matcha-zh-en-8k/vocos-8khz-univ.onnx \
  --matcha-lexicon=matcha-zh-en-8k/lexicon.txt \
  --matcha-tokens=matcha-zh-en-8k/tokens.txt \
  --matcha-data-dir=matcha-zh-en-8k/espeak-ng-data \
  --tts-rule-fsts=matcha-zh-en-8k/date-zh.fst,matcha-zh-en-8k/number-zh.fst,matcha-zh-en-8k/phone-zh.fst \
  --num-threads=4 --output-filename=out.wav \
  "中英文合成測試。It supports both English 和中文合成。"
Output is 8 kHz mono (sample_rate=8000 in the acoustic model metadata). See DEVICE_ACCEPTANCE.md.
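A quick way to confirm the output format is to read the WAV header. The sketch below uses only the Python standard library and assumes the CLI wrote a standard PCM WAV to out.wav; the function names are hypothetical and not part of sherpa-onnx.

```python
import wave


def describe_wav(path: str) -> dict:
    """Return basic header fields of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        return {
            "sample_rate": wf.getframerate(),
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "duration_s": wf.getnframes() / wf.getframerate(),
        }


def is_telephony_ready(path: str) -> bool:
    """True if the file is 8 kHz mono, as expected from this model."""
    info = describe_wav(path)
    return info["sample_rate"] == 8000 and info["channels"] == 1


if __name__ == "__main__":
    import sys

    # Default to the filename used in the command above.
    wav_path = sys.argv[1] if len(sys.argv) > 1 else "out.wav"
    print(describe_wav(wav_path))
    print("OK" if is_telephony_ready(wav_path) else "unexpected format")
```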
How it was made
Tokens were captured from sherpa-onnx's own frontend via its debug logs (Chinese via the pinyin lexicon,
English via espeak IPA) and fed to the unchanged acoustic model to dump real mels. The 8 kHz targets are
the stock 16 kHz vocoder's output, downsampled. A ConvNeXt-based vocoder (mel[80,T] → mag/x/y for an iSTFT
with n_fft=512/hop=128/win=512, set in the ONNX metadata so sherpa's iSTFT yields exactly 8 kHz) was trained
with multi-resolution STFT + telephony-band mel + MPD/MSD GAN losses. Training/eval code:
https://github.com/vieenrose/jetson-tts (matcha8k/).
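Why hop=128 gives exactly 8 kHz follows from the mel frame rate. The sketch below assumes the stock 16 kHz vocoder uses a hop of 256 samples, which is not stated above, so treat that constant as an assumption. It also shows one common reading of the mag/x/y outputs, with x and y as the cosine and sine of the phase; that interpretation and the function names are likewise illustrative.

```python
import math

STOCK_SAMPLE_RATE = 16000
STOCK_HOP = 256   # ASSUMPTION: hop length of the stock 16 kHz vocoder
NEW_HOP = 128     # hop stored in the 8 kHz vocoder's ONNX metadata
NEW_N_FFT = 512   # n_fft stored in the 8 kHz vocoder's ONNX metadata


def output_sample_rate(frames_per_second: float, hop: int) -> float:
    """Each mel frame advances the iSTFT output by `hop` samples."""
    return frames_per_second * hop


def to_complex_bin(mag: float, x: float, y: float) -> complex:
    """Combine one (mag, x, y) triple into a complex STFT bin: mag * (x + j*y)."""
    return mag * complex(x, y)


if __name__ == "__main__":
    # The acoustic model is unchanged, so its frame rate is fixed by the stock setup.
    fps = STOCK_SAMPLE_RATE / STOCK_HOP
    print(f"mel frame rate: {fps} frames/s")
    print(f"output rate with hop={NEW_HOP}: {output_sample_rate(fps, NEW_HOP)} Hz")
    print(f"frequency bins per frame: {NEW_N_FFT // 2 + 1}")
    # One bin with magnitude 2 and phase 0.5 rad, under the x=cos, y=sin reading.
    print(to_complex_bin(2.0, math.cos(0.5), math.sin(0.5)))
```

Halving the hop while keeping the mel frame rate fixed is what lets the same acoustic model drive a vocoder at half the sample rate.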
Contents
- model-steps-3.onnx: Matcha acoustic model (weights identical to stock; metadata sample_rate=8000)
- vocos-8khz-univ.onnx: distilled 8 kHz vocoder
- tokens.txt, lexicon.txt: identical to the stock matcha-icefall-zh-en
- date-zh.fst, number-zh.fst, phone-zh.fst: text-normalization FSTs (unchanged)
- espeak-ng-data/: English IPA phonemization data (unchanged)
- samples/: code-mixed 8 kHz + G.711 demo clips
- DEVICE_ACCEPTANCE.md: acceptance checklist (CLI, RTF, metrics)
Attribution
- Base model: csukuangfj/matcha-icefall-zh-en (acoustic model + tokens/lexicon/FSTs/espeak unchanged), originally from modelscope dengcunqin/matcha_tts_zh_en_20251010.
- Runtime contract: k2-fsa/sherpa-onnx (Apache-2.0).
- Vocoder architecture inspired by Vocos (MIT).