Kokoro 82M: Core ML ANE chain + all 54 voices

A self-contained, fixed-revision Core ML build of the Kokoro-82M 7-stage Apple-Neural-Engine pipeline packaged with all 54 standard Kokoro voices, so a sandboxed Swift / Core ML app can offer every voice on the ANE path from a single hash-pinnable download. It mirrors the English ANE chain from FluidInference/kokoro-82m-coreml and adds the voice packs that repo did not ship in ANE form.

This was built for LocalMacTalk, a native-Swift, fully sandboxed macOS app for evaluating on-device multimodal models, whose speech-synthesis layer lets the user choose where Kokoro runs: GPU (MLX), the Neural Engine, or the CPU. Moving TTS onto the ANE keeps it off the Metal device the language model is using, which matters when a 12B LLM is decoding while speech is being synthesized. (You don't need LocalMacTalk to use these files; they're plain Core ML models and Kokoro voice packs.)

Why this repo exists

The excellent upstream FluidInference/kokoro-82m-coreml ships the 7-stage ANE acoustic chain and the Core ML G2P models, but under ANE/ it includes only a single English voice (af_heart.bin). The full standard voice set (af_*, am_*, bf_*, bm_*, and the multilingual voices) lives in hexgrad/Kokoro-82M only as PyTorch .pt tensors, not in the flat [510, 256] float32 layout the ANE chain consumes. So an app that wants user-selectable voices on the Core ML / ANE path had nothing to download.

This repo closes that gap and pins everything to one fixed commit (the upstream tracks a moving main), so a downstream app can hash-verify every artifact.
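As a rough illustration of what hash-pinning could look like on the consumer side, here is a minimal Python sketch. The manifest (a dict of relative path to SHA-256 hex digest) and the function names are hypothetical and not something this repo ships. Note that a .mlmodelc is a directory, so pins would apply to the individual files inside it.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(root, pinned):
    """Return the relative paths that are missing or do not match their pin.

    `pinned` is a hypothetical manifest: {relative_path: sha256_hex}.
    """
    bad = []
    for rel_path, expected in pinned.items():
        candidate = Path(root) / rel_path
        if not candidate.is_file() or sha256_of(candidate) != expected.lower():
            bad.append(rel_path)
    return bad
```

An app would call find_mismatches on the download directory and refuse to load anything if the returned list is non-empty.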

What's inside

  • 7-stage ANE acoustic chain (ANE/Kokoro{Albert,PostAlbert,Alignment,Prosody,Noise,Vocoder,Tail}.mlmodelc): mirrored from the upstream ANE/ (English variant of the laishere/kokoro chain).
  • Core ML G2P (G2PEncoder.mlmodelc, G2PDecoder.mlmodelc, g2p_vocab.json) and the phoneme table ANE/vocab.json: mirrored from the upstream repo root.
  • All 54 voices (voices/*.bin): 11 af / 9 am (American English), 4 bf / 4 bm (British English), plus Spanish, French, Hindi, Italian, Japanese, Portuguese, and Chinese (ef/em/ff/hf/hm/if/im/jf/jm/pf/pm/zf/zm).

The voice packs were converted byte-exact from hexgrad/Kokoro-82M/voices/*.pt by squeezing the [510, 1, 256] tensor to [510, 256] and writing little-endian float32. As a correctness check, the converted voices/af_heart.bin is byte-identical to the upstream ANE/af_heart.bin.
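The squeeze-and-write step can be sketched in plain Python. This is not the script that produced the files here, only an illustration of the layout change: pack_voice is a hypothetical name, and the nested-list input stands in for the loaded tensor (for example, something like torch.load(path).tolist(), which is an assumption about how the .pt file would be read).

```python
import struct

ROWS, COLS = 510, 256  # shape of the flat layout described above


def pack_voice(tensor):
    """Squeeze a [510][1][256] nested sequence and pack it as little-endian float32.

    Values that are already exact float32 numbers survive the round trip
    unchanged, which is what makes a byte-exact conversion possible.
    """
    if len(tensor) != ROWS:
        raise ValueError(f"expected {ROWS} rows, got {len(tensor)}")
    flat = []
    for row in tensor:
        # Drop the singleton middle axis: [1][256] -> [256].
        if len(row) != 1 or len(row[0]) != COLS:
            raise ValueError("expected each row to have shape [1][256]")
        flat.extend(row[0])
    # "<" forces little-endian regardless of the host byte order.
    return struct.pack(f"<{ROWS * COLS}f", *flat)
```

Writing the returned bytes to voices/<name>.bin and comparing the result for af_heart against the upstream ANE/af_heart.bin reproduces the correctness check described above.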

Voice-pack format

Each voices/<name>.bin is 522,240 bytes = 510 × 256 × 4 (little-endian float32), row-major:

  • row = utterance-length bucket, clamp(phoneme_count - 1, 0, 509) (BOS/EOS not counted)
  • cols [0:128] = style_timbre (fed to the Noise and Vocoder stages)
  • cols [128:256] = style_s (fed to the PostAlbert and Prosody stages)
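The layout above translates into a short lookup. The following Python sketch assumes the whole pack has been read into memory as bytes; style_for is a hypothetical helper name, and the slice names follow the list above.

```python
import struct

ROWS, COLS, HALF = 510, 256, 128
PACK_BYTES = ROWS * COLS * 4  # 522,240 bytes per voice pack


def style_for(pack, phoneme_count):
    """Return (row, style_timbre, style_s) for an utterance.

    `phoneme_count` excludes BOS/EOS, matching the bucket rule above.
    """
    if len(pack) != PACK_BYTES:
        raise ValueError(f"expected {PACK_BYTES} bytes, got {len(pack)}")
    # clamp(phoneme_count - 1, 0, 509)
    row = max(0, min(phoneme_count - 1, ROWS - 1))
    # Row-major float32: each row occupies COLS * 4 bytes.
    values = struct.unpack_from(f"<{COLS}f", pack, row * COLS * 4)
    style_timbre = values[:HALF]  # cols [0:128], for Noise and Vocoder
    style_s = values[HALF:]       # cols [128:256], for PostAlbert and Prosody
    return row, style_timbre, style_s
```

A caller would read voices/<name>.bin once, then call style_for with the phoneme count of each utterance to get the two 128-value slices.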

Using it with the FluidInference chain

The acoustic stages and G2P here are the same models the upstream repo provides, so the inference orchestration is identical to FluidInference's KokoroAne pipeline (see FluidInference/FluidAudio): text → G2P ([BOS]+graphemes+[EOS] → encoder → greedy decode with position_ids offset 2 and a causal mask) → IPA phonemes → input_ids via ANE/vocab.json → the 7 stages (with the fp16/fp32 boundary conversions at Noise / Vocoder / Tail) → 24 kHz audio. The only addition here is that you can load any of the 54 voice packs for the style_s / style_timbre slices instead of just af_heart.

Credits & license
