matcha-zh-en-8k — Matcha zh-en with a distilled 8 kHz vocoder

8 kHz telephony variant of csukuangfj/matcha-icefall-zh-en, a Chinese-English code-mixed Matcha-TTS model. The acoustic model is unchanged (that is where the zh-en quality lives); only the vocoder is replaced: the stock 16 kHz vocos-16khz-univ.onnx is distilled into an 8 kHz vocos-8khz-univ.onnx.

Built for a phone-attendant product running on the CPU of a first-generation Jetson Nano (ONNX Runtime CPU via sherpa-onnx). The audio leaves through an 8 kHz G.711 channel, so native 8 kHz output avoids the on-device 16 kHz→8 kHz resample and ships a smaller vocoder. It is a drop-in for the stock sherpa-onnx-offline-tts: only the model directory changes.
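For readers unfamiliar with the G.711 channel mentioned above, the following is a minimal sketch of what it does to each sample. It assumes the μ-law flavour of G.711 (the card does not say whether μ-law or A-law is used) and uses the continuous companding formula rather than the standard's exact segmented encoder, so it illustrates the idea and is not a bit-exact codec. All function names are hypothetical.

```python
import math

MU = 255            # mu-law parameter used by G.711 mu-law (assumed flavour)
SAMPLE_RATE = 8000  # samples per second on the telephony channel
BITS_PER_SAMPLE = 8

# G.711 carries 8-bit samples at 8 kHz, i.e. 64 kbit/s.
BITRATE_BPS = SAMPLE_RATE * BITS_PER_SAMPLE


def mulaw_compress(x: float) -> float:
    """Continuous mu-law compression of a sample in [-1, 1]."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    y = max(-1.0, min(1.0, y))
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def roundtrip_8bit(x: float) -> float:
    """Compress, quantize to 256 levels, then expand again."""
    y = mulaw_compress(x)
    code = round((y + 1.0) / 2.0 * 255)   # integer code in 0..255
    y_hat = code / 255 * 2.0 - 1.0
    return mulaw_expand(y_hat)
```

The logarithmic curve spends more of the 256 codes on quiet samples, which is why speech survives 8-bit quantization reasonably well.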

Listen — zh/en code-mixed (rendered by stock sherpa-onnx, 8 kHz)

Input texts (the 8 kHz and through-G.711 clips are in samples/):
您好,這裡是台灣精密,請問您要找 David 還是 Michael?
Kevin 說下午的 meeting 改到 Zoom 線上進行。
Please hold,我幫您查詢 VIP 會員資料,請稍候。
The quick brown fox 和敏捷的棕色狐狸,一起測試中英文。

Full A/B vs the 16 kHz original and vs MeloTTS-8k: Luigi/zh-en-tts-8k-comparison.

Results

metric	value
PESQ-NB (8k vocoder vs stock 16k vocoder → 8 kHz)	3.80 (ceiling ≈ 4.5)
Full-pipeline RTF, x86 ORT CPU, 4 threads	0.014
Predicted Jetson Nano A57 RTF (×13 host factor)	~0.18
Vocoder size	30 MB (vs 54 MB at 16 kHz)

The vocoder accounts for only ~25 % of Matcha's compute (the flow-matching acoustic model dominates), so moving to 8 kHz is mainly a telephony-native output and integration win rather than a large speed-up. Export: fp32, ONNX opset 17, no custom ops. Do not int8-quantize the vocoder convolutions: on the A57 this measured as counterproductive.
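The arithmetic behind these figures can be checked with a short sketch. The RTF, host factor, compute share and file sizes are taken from the table and paragraph above. The Amdahl's-law helper and the example vocoder speed-ups passed to it are illustrative assumptions, not measurements.

```python
HOST_RTF = 0.014          # full-pipeline RTF on x86, 4 threads (from the table)
HOST_TO_A57_FACTOR = 13   # host-to-Jetson slowdown factor (from the table)
VOCODER_SHARE = 0.25      # approximate vocoder share of total compute

# Predicted Jetson Nano RTF: 0.014 * 13 = 0.182, i.e. about 0.18.
predicted_a57_rtf = HOST_RTF * HOST_TO_A57_FACTOR

# Vocoder file shrinks from 54 MB to 30 MB, roughly 44 % smaller.
size_reduction = 1.0 - 30 / 54


def pipeline_speedup(vocoder_share: float, vocoder_speedup: float) -> float:
    """Amdahl's law: overall speed-up when only the vocoder gets faster."""
    return 1.0 / ((1.0 - vocoder_share) + vocoder_share / vocoder_speedup)


# Hypothetical cases: a vocoder twice as fast, and an infinitely fast one.
speedup_2x = pipeline_speedup(VOCODER_SHARE, 2.0)              # 1 / 0.875
speedup_limit = pipeline_speedup(VOCODER_SHARE, float("inf"))  # 1 / 0.75
```

Even an infinitely fast vocoder would cap the end-to-end gain at about 1.33×, which is consistent with treating the 8 kHz variant as an integration win rather than a speed optimization.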

Usage — stock sherpa-onnx, only the model dir changes

sherpa-onnx-offline-tts \
  --matcha-acoustic-model=matcha-zh-en-8k/model-steps-3.onnx \
  --matcha-vocoder=matcha-zh-en-8k/vocos-8khz-univ.onnx \
  --matcha-lexicon=matcha-zh-en-8k/lexicon.txt \
  --matcha-tokens=matcha-zh-en-8k/tokens.txt \
  --matcha-data-dir=matcha-zh-en-8k/espeak-ng-data \
  --tts-rule-fsts=matcha-zh-en-8k/date-zh.fst,matcha-zh-en-8k/number-zh.fst,matcha-zh-en-8k/phone-zh.fst \
  --num-threads=4 --output-filename=out.wav \
  "中英文合成測試。It supports both English 和中文合成。"

Output is 8 kHz mono (sample_rate=8000 in the acoustic model metadata). See DEVICE_ACCEPTANCE.md.
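A quick way to confirm the output format is to inspect the WAV header. The sketch below uses only Python's standard `wave` module and assumes the CLI writes an uncompressed PCM WAV file. The helper names are hypothetical and are not part of sherpa-onnx or of DEVICE_ACCEPTANCE.md.

```python
import wave


def describe_wav(path: str) -> tuple[int, int, float]:
    """Return (sample_rate_hz, channels, duration_seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate
    return rate, channels, seconds


def is_8k_mono(path: str) -> bool:
    """True if the file is 8 kHz, single channel."""
    rate, channels, _ = describe_wav(path)
    return rate == 8000 and channels == 1


# Example, after running the command above:
#   is_8k_mono("out.wav")
```

Dividing synthesis wall-clock time by the duration returned here gives the real-time factor for that utterance.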

How it was made

Tokens were captured from sherpa-onnx's own frontend via its debug logs (Chinese through the pinyin lexicon, English through espeak IPA) and fed to the unchanged acoustic model to dump real mels. The 8 kHz training targets are the stock 16 kHz vocoder's output, downsampled. A ConvNeXt-based vocoder maps mel[80,T] to mag/x/y for an iSTFT with n_fft=512, hop=128, win=512; these values are set in the ONNX metadata so that sherpa's iSTFT yields exactly 8 kHz. It was trained with multi-resolution STFT, telephony-band mel, and MPD/MSD GAN losses. Training/eval code: https://github.com/vieenrose/jetson-tts (matcha8k/).
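The iSTFT settings above fix the timing of the output, which the sketch below works through. The n_fft, hop, win and sample-rate values come from the paragraph. Two things are assumptions: that x and y are the cosine and sine of the phase (as in the Vocos iSTFT head), and that output length is roughly frames × hop, ignoring edge padding. Function names are hypothetical.

```python
N_FFT = 512
HOP = 128
WIN = 512
SAMPLE_RATE = 8000

frames_per_second = SAMPLE_RATE / HOP   # mel frames consumed per second of audio
freq_bins = N_FFT // 2 + 1              # one-sided spectrum size per frame
bin_width_hz = SAMPLE_RATE / N_FFT      # frequency resolution per bin
window_ms = 1000 * WIN / SAMPLE_RATE    # analysis window length


def approx_output_samples(num_frames: int) -> int:
    """Approximate waveform length for T mel frames (edge padding ignored)."""
    return num_frames * HOP


def to_complex(mag: float, x: float, y: float) -> complex:
    """Combine one bin's magnitude and unit phase vector into a complex value.

    Assumes x = cos(phase) and y = sin(phase).
    """
    return complex(mag * x, mag * y)
```

With these values the vocoder consumes 62.5 mel frames per second. A 16 kHz vocoder with hop=256 would consume the same 62.5 frames per second; if the stock vocoder uses that hop (an assumption, not stated in this card), this is what allows the acoustic model to stay unchanged while only the vocoder is swapped.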

model-steps-3.onnx     Matcha acoustic model (weights identical to stock; metadata sample_rate=8000)
vocos-8khz-univ.onnx   distilled 8 kHz vocoder
tokens.txt lexicon.txt identical to the stock matcha-icefall-zh-en
date-zh.fst number-zh.fst phone-zh.fst  text-normalization FSTs (unchanged)
espeak-ng-data/        English IPA phonemization data (unchanged)
samples/               code-mixed 8 kHz + G.711 demo clips
DEVICE_ACCEPTANCE.md   acceptance checklist (CLI, RTF, metrics)

Attribution

Base model: csukuangfj/matcha-icefall-zh-en (acoustic model + tokens/lexicon/FSTs/espeak unchanged), originally from modelscope dengcunqin/matcha_tts_zh_en_20251010.
Runtime contract: k2-fsa/sherpa-onnx (Apache-2.0).
Vocoder architecture inspired by Vocos (MIT).
