matcha-zh-en-8k: Matcha zh-en with a distilled 8 kHz vocoder

8 kHz telephony variant of csukuangfj/matcha-icefall-zh-en, a Chinese-English code-mixed Matcha-TTS. The acoustic model is unchanged (that's where the zh-en quality lives); only the vocoder is replaced: the stock 16 kHz vocos-16khz-univ.onnx is distilled into an 8 kHz vocos-8khz-univ.onnx.

Built for a phone-attendant product running on the CPU of a first-generation Jetson Nano (ONNX Runtime CPU via sherpa-onnx). The audio leaves through an 8 kHz G.711 channel, so native 8 kHz output avoids the on-device 16k→8k resample and ships a smaller vocoder. It is a drop-in for the stock sherpa-onnx-offline-tts: only the model dir changes.
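
For readers unfamiliar with the channel: G.711 carries 8 kHz audio as one companded 8-bit byte per sample, i.e. 64 kbit/s. The sketch below shows the μ-law variant for 16-bit signed PCM input. It is an illustration only: the model card does not say whether the deployment uses μ-law or A-law, and the names `linear_to_ulaw` and `encode_pcm16` are made up for this example.

```python
BIAS = 0x84    # 132, added to the magnitude before the segment search
CLIP = 32635   # largest magnitude that still fits in 15 bits after the bias


def linear_to_ulaw(sample: int) -> int:
    """Compand one 16-bit signed PCM sample into one G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Segment (exponent) = position of the highest set bit, searched from 7 down to 0.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are stored bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def encode_pcm16(samples) -> bytes:
    """Encode an iterable of 16-bit PCM samples, one output byte per sample."""
    return bytes(linear_to_ulaw(s) for s in samples)


print(encode_pcm16([0, 32767, -32768]).hex())  # ff8000
```

Because the codec consumes exactly one sample per byte at 8 kHz, a vocoder that already emits 8 kHz audio can feed it without an intermediate resampler.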

Listen: zh/en code-mixed (rendered by stock sherpa-onnx, 8 kHz)

text (audio: 8 kHz through G.711)
您好,這裡是台灣精密,請問您要找 David 還是 Michael?
Kevin 說下午的 meeting 改到 Zoom 線上進行。
Please hold,我幫您查詢 VIP 會員資料,請稍候。
The quick brown fox 和敏捷的棕色狐狸,一起測試中英文。

Full A/B vs the 16 kHz original and vs MeloTTS-8k: Luigi/zh-en-tts-8k-comparison.

Results

metric                                              value
PESQ-NB (8k vocoder vs stock 16k vocoder → 8 kHz)   3.80 (ceiling ≈ 4.5)
Full-pipeline RTF, x86 ORT CPU, 4 threads           0.014
Predicted Jetson Nano A57 RTF (×13 host factor)     ~0.18
Vocoder size                                        30 MB (vs 54 MB at 16 kHz)
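
The predicted Jetson figure is plain arithmetic on the measured host RTF. The sketch below takes RTF as synthesis time divided by audio duration (the usual definition, assumed here) and uses the ×13 host factor from the table; the helper names and the 5-second utterance are illustrative.

```python
def predicted_rtf(host_rtf: float, host_factor: float) -> float:
    """Scale a host-measured RTF by an empirical host-to-device slowdown factor."""
    return host_rtf * host_factor


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time needed to synthesize `audio_seconds` of speech at a given RTF."""
    return audio_seconds * rtf


jetson_rtf = predicted_rtf(0.014, 13)
print(f"{jetson_rtf:.3f}")                          # 0.182
print(f"{synthesis_seconds(5.0, jetson_rtf):.2f}")  # 0.91
```

An RTF below 1 means synthesis runs faster than real time; at ~0.18 a 5-second prompt would take a little under one second to render, if the prediction holds on the device.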

The vocoder accounts for only ~25 % of Matcha's compute (the flow-matching acoustic model dominates), so the move to 8 kHz is a win for telephony-native output and integration rather than a large speedup. The models are fp32, ONNX opset 17, with no custom ops. Do not int8-quantize the vocoder convs: on the A57 this was measured to be counterproductive.
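
Why a faster vocoder cannot buy much end-to-end speed follows from Amdahl's law. The sketch below plugs in the ~25 % vocoder share quoted above; the 2× vocoder speedup is a purely hypothetical input, not a measurement from this project.

```python
def overall_speedup(fraction: float, component_speedup: float) -> float:
    """Amdahl's law: whole-pipeline speedup when only one component,
    responsible for `fraction` of the runtime, becomes `component_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


VOCODER_FRACTION = 0.25  # approximate share of Matcha's compute, from the text

# Hypothetical: the vocoder becomes twice as fast.
print(round(overall_speedup(VOCODER_FRACTION, 2.0), 3))           # 1.143
# Upper bound: the vocoder costs nothing at all.
print(round(overall_speedup(VOCODER_FRACTION, float("inf")), 3))  # 1.333
```

Even an infinitely fast vocoder would cap the pipeline gain at about 1.33×, which is why the benefit is framed in terms of output format and integration.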

Usage: stock sherpa-onnx, only the model dir changes

sherpa-onnx-offline-tts \
  --matcha-acoustic-model=matcha-zh-en-8k/model-steps-3.onnx \
  --matcha-vocoder=matcha-zh-en-8k/vocos-8khz-univ.onnx \
  --matcha-lexicon=matcha-zh-en-8k/lexicon.txt \
  --matcha-tokens=matcha-zh-en-8k/tokens.txt \
  --matcha-data-dir=matcha-zh-en-8k/espeak-ng-data \
  --tts-rule-fsts=matcha-zh-en-8k/date-zh.fst,matcha-zh-en-8k/number-zh.fst,matcha-zh-en-8k/phone-zh.fst \
  --num-threads=4 --output-filename=out.wav \
  "中英文合成測試。It supports both English 和中文合成。"

Output is 8 kHz mono (sample_rate=8000 in the acoustic model metadata). See DEVICE_ACCEPTANCE.md.
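
A quick way to confirm the output format before wiring the file into a telephony path is to read the WAV header. This is a minimal sketch using Python's standard `wave` module, assuming `out.wav` is a PCM WAV that the module can parse; the helper names are illustrative.

```python
import wave


def wav_format(path: str):
    """Return (sample_rate_hz, channels, duration_seconds) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


def is_telephony_ready(path: str) -> bool:
    """True if the file is 8 kHz mono, i.e. needs no resampling or downmixing."""
    rate, channels, _ = wav_format(path)
    return rate == 8000 and channels == 1
```

Calling `is_telephony_ready("out.wav")` on the file produced by the command above should return `True` if the 8 kHz vocoder and metadata were picked up.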

How it was made

Tokens were captured from sherpa-onnx's own frontend (debug logs; Chinese via the pinyin lexicon, English via espeak IPA) and fed to the unchanged acoustic model to dump real mels. The 8 kHz targets are the stock 16 kHz vocoder's output, downsampled. A ConvNeXt-based vocoder (mel[80,T] → mag/x/y for an iSTFT with n_fft=512, hop=128, win=512, set in the ONNX metadata so sherpa's iSTFT yields exactly 8 kHz) was trained with multi-resolution STFT, telephony-band mel, and MPD/MSD GAN losses. Training/eval code: https://github.com/vieenrose/jetson-tts (matcha8k/).
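
The iSTFT parameters determine how mel frames map to output samples. The sketch below works through that arithmetic for n_fft=512 and hop=128 at 8 kHz. The comparison with a 16 kHz vocoder at hop=256 is an assumption about the stock model (its hop is not stated here), and the output length ignores edge padding, so treat it as approximate.

```python
def frame_rate(sample_rate: int, hop: int) -> float:
    """Mel/STFT frames per second of audio."""
    return sample_rate / hop


def num_bins(n_fft: int) -> int:
    """Frequency bins in a one-sided STFT."""
    return n_fft // 2 + 1


def istft_samples(num_frames: int, hop: int) -> int:
    """Approximate number of output samples for `num_frames` frames (edge padding ignored)."""
    return num_frames * hop


print(frame_rate(8000, 128))           # 62.5
print(frame_rate(16000, 256))          # 62.5 (hop=256 at 16 kHz is an assumption)
print(num_bins(512))                   # 257
print(istft_samples(100, 128) / 8000)  # 1.6 seconds for 100 mel frames
```

If the stock hop is indeed 256, both vocoders consume mel frames at the same rate, which is consistent with the acoustic model staying unchanged while only the vocoder and its metadata differ.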

Contents

model-steps-3.onnx     Matcha acoustic model (weights identical to stock; metadata sample_rate=8000)
vocos-8khz-univ.onnx   distilled 8 kHz vocoder
tokens.txt lexicon.txt identical to the stock matcha-icefall-zh-en
date-zh.fst number-zh.fst phone-zh.fst  text-normalization FSTs (unchanged)
espeak-ng-data/        English IPA phonemization data (unchanged)
samples/               code-mixed 8 kHz + G.711 demo clips
DEVICE_ACCEPTANCE.md   acceptance checklist (CLI, RTF, metrics)

Attribution
