Voxi-TTS — Expressive Kokoro (British English, 8 emotions)

An expressive fine-tune of Kokoro-82M that speaks British English in 8 selectable emotional styles, each exposed as its own voicepack.

Emotion voicepacks
bf_neutral · bf_happy · bf_sad · bf_angry · bf_excited · bf_disgust · bf_sarcastic · bf_surprised

(bf_ = British female, matching the source speaker.)

How it was trained

  • Base: Kokoro-82M (StyleTTS2 + ISTFTNet decoder, 82M params)
  • Recipe: two-stage StyleTTS2 fine-tune via semidark/kikiri-tts (patched StyleTTS2)
  • Data: EMNS — Emotive Narrative Storytelling Corpus (OpenSLR 136, Apache-2.0), single British-English speaker, ~1.9h, 8 balanced emotions
  • Approach: each emotion treated as a distinct speaker (multispeaker), so the style space separates per emotion; one voicepack extracted per emotion
  • Hardware and schedule: NVIDIA A100-40GB, fp32, batch size 4. Stage 1: 12 epochs (mel loss 0.47→0.35). Stage 2: 24 epochs, with adversarial SLM training from epoch 3 (F0 loss 10.0→3.6, duration loss 1.5→0.83).
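
The emotion-as-speaker approach above can be sketched as follows. This is a minimal illustration rather than the kikiri-tts pipeline itself: the id ordering, the build_speaker_map and build_train_list helpers, and the example rows are hypothetical, and the path|phonemes|speaker_id layout assumes the usual StyleTTS2 train-list convention. The actual mapping used for this model is the one shipped in speaker_map.json.

```python
import json

# The 8 emotions from the model card. The id order is an assumption made for
# this sketch; the real mapping ships in speaker_map.json.
EMOTIONS = ["neutral", "happy", "sad", "angry",
            "excited", "disgust", "sarcastic", "surprised"]


def build_speaker_map(emotions):
    """Give each emotion its own integer "speaker" id."""
    return {emotion: idx for idx, emotion in enumerate(emotions)}


def build_train_list(utterances, speaker_map):
    """Format (wav_path, phonemes, emotion) rows as 'path|phonemes|speaker_id' lines."""
    lines = []
    for wav_path, phonemes, emotion in utterances:
        lines.append(f"{wav_path}|{phonemes}|{speaker_map[emotion]}")
    return lines


speaker_map = build_speaker_map(EMOTIONS)

# Hypothetical rows; real entries would come from the EMNS metadata.
rows = [
    ("wavs/0001.wav", "həlˈəʊ", "happy"),
    ("wavs/0002.wav", "ɡˈəʊ əwˈeɪ", "angry"),
]
train_lines = build_train_list(rows, speaker_map)

print(json.dumps(speaker_map, indent=2))
print("\n".join(train_lines))
```

Because every utterance of one emotion shares a speaker id, the style encoder is pushed to place each emotion in its own region of the style space, which is what makes one voicepack per emotion possible.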

Files

Path                              What
kokoro_voxi_v1.pth                Converted Kokoro-format inference weights (use this)
voices/bf_*.pt                    The 8 emotion voicepacks
config.json                       Kokoro model config
speaker_map.json                  emotion ↔ training speaker-id map
checkpoints/epoch_2nd_00022.pth   Full Stage-2 training checkpoint (to resume)
checkpoints/first_stage.pth       Stage-1 checkpoint (also used as style-encoder source for voicepack extraction)

Usage

import torch, numpy as np, soundfile as sf
from huggingface_hub import hf_hub_download
from kokoro import KModel, KPipeline

repo = "Joshuant/voxi-tts"
model_path = hf_hub_download(repo, "kokoro_voxi_v1.pth")
config_path = hf_hub_download(repo, "config.json")
voice_path = hf_hub_download(repo, "voices/bf_angry.pt")   # pick an emotion

km = KModel(repo_id="hexgrad/Kokoro-82M", config=config_path, model=model_path).eval()
pipe = KPipeline(lang_code="b", repo_id="hexgrad/Kokoro-82M", model=km)  # b = British English
voice = torch.load(voice_path, map_location="cpu", weights_only=True)

audio = np.concatenate([a for _, _, a in pipe("I can't believe you actually did that.", voice=voice)])
sf.write("out.wav", audio, 24000)

Switch emotion by loading a different voices/bf_<emotion>.pt.
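
If you switch emotions often, it may be convenient to resolve and cache all voicepacks up front. The sketch below is one way to do that; voicepack_filename and load_voicepacks are hypothetical helper names, not part of this repo or of kokoro, and the repo id is simply the one used in the usage example.

```python
EMOTIONS = ("neutral", "happy", "sad", "angry",
            "excited", "disgust", "sarcastic", "surprised")


def voicepack_filename(emotion):
    """Map an emotion name to its voicepack path inside the repo."""
    name = emotion.strip().lower()
    if name not in EMOTIONS:
        raise ValueError(
            f"unknown emotion {emotion!r}; choose from {', '.join(EMOTIONS)}"
        )
    return f"voices/bf_{name}.pt"


def load_voicepacks(repo="Joshuant/voxi-tts", emotions=EMOTIONS):
    """Download each voicepack once and return {emotion: tensor}."""
    # Imported lazily so the filename helper works without these packages.
    import torch
    from huggingface_hub import hf_hub_download

    voices = {}
    for emotion in emotions:
        path = hf_hub_download(repo, voicepack_filename(emotion))
        voices[emotion] = torch.load(path, map_location="cpu", weights_only=True)
    return voices
```

With voices = load_voicepacks(), switching emotion becomes pipe(text, voice=voices["sad"]), reusing the pipe object from the usage example.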

Expressive multi-emotion synthesis — inline [emotion] tags

The model has no SSML/tag parsing on its own (emotion = which voicepack you load). The included voxi_speak.py wrapper adds inline [emotion] tags by switching voicepacks per span and stitching the audio:

from voxi_speak import VoxiSpeaker
voxi  = VoxiSpeaker()                 # auto-downloads this repo from the Hub
audio = voxi.speak("[happy] I'm so glad you came! [sad] But now you must go. [angry] And you didn't even tell me!")
voxi.save("out.wav", audio)

The same wrapper also runs from the command line:

python voxi_speak.py "[angry] Get out! [neutral] ...please." -o out.wav
python voxi_speak.py --list           # emotions + aliases

  • Tags are case-insensitive and support aliases: [joy]→happy, [anger]→angry, [surprise]→surprised, [disgusted]→disgust, [sarcasm]→sarcastic, etc.
  • Text before the first tag uses the default emotion (neutral). Unknown tags are ignored with a warning, and the current emotion continues.
  • A small configurable silence (gap_ms) is inserted between emotion spans.
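
The tag rules in the list above can be sketched in a few lines of standard-library Python. This is an illustrative re-implementation under stated assumptions, not the code in voxi_speak.py: parse_emotion_spans is a hypothetical name, and the alias table contains only the aliases listed above (the wrapper may support more).

```python
import re
import warnings

EMOTIONS = {"neutral", "happy", "sad", "angry",
            "excited", "disgust", "sarcastic", "surprised"}
# Only the aliases named in the model card; the real wrapper may know more.
ALIASES = {"joy": "happy", "anger": "angry", "surprise": "surprised",
           "disgusted": "disgust", "sarcasm": "sarcastic"}
TAG_RE = re.compile(r"\[([^\[\]]+)\]")


def parse_emotion_spans(text, default="neutral"):
    """Split tagged text into a list of (emotion, text) spans."""
    spans, buffer, current = [], [], default

    def flush():
        # Emit the buffered text under the emotion that is active right now.
        chunk = " ".join(piece for piece in buffer if piece)
        if chunk:
            spans.append((current, chunk))
        buffer.clear()

    pos = 0
    for match in TAG_RE.finditer(text):
        buffer.append(text[pos:match.start()].strip())
        pos = match.end()
        tag = match.group(1).strip().lower()   # tags are case-insensitive
        emotion = ALIASES.get(tag, tag)
        if emotion in EMOTIONS:
            flush()
            current = emotion
        else:
            # Unknown tag: drop it and keep speaking in the current emotion.
            warnings.warn(f"ignoring unknown tag [{match.group(1)}]")
    buffer.append(text[pos:].strip())
    flush()
    return spans


print(parse_emotion_spans("Hello. [JOY] So glad you came! [anger] Get out!"))
```

Each resulting span would then be synthesized with the voicepack for its emotion.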

Available emotions: neutral · happy · sad · angry · excited · disgust · sarcastic · surprised.
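
The stitching step, with its gap_ms pause between emotion spans, might look like the sketch below. The stitch function and the 150 ms default are assumptions for illustration; only the gap_ms name and the 24000 Hz sample rate come from this card.

```python
import numpy as np

SAMPLE_RATE = 24000  # output rate used in the usage example


def stitch(segments, gap_ms=150, sample_rate=SAMPLE_RATE):
    """Concatenate audio segments with gap_ms of silence between them."""
    if not segments:
        return np.zeros(0, dtype=np.float32)
    # Number of silent samples that corresponds to gap_ms milliseconds.
    gap = np.zeros(int(sample_rate * gap_ms / 1000), dtype=np.float32)
    pieces = []
    for i, segment in enumerate(segments):
        if i > 0:
            pieces.append(gap)  # silence goes between spans, never at the ends
        pieces.append(np.asarray(segment, dtype=np.float32))
    return np.concatenate(pieces)


# Two dummy "spans" of 10 and 5 samples joined with a 100 ms pause.
demo = stitch([np.ones(10), np.ones(5)], gap_ms=100)
print(demo.shape)
```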

Notes & limitations

  • Trained on ~1.9h from a single speaker — it's one expressive British voice, not multi-speaker.
  • Emotion intensity varies; lower-resource emotions (e.g. sarcastic, disgust) may be subtler.
  • Inference needs misaki with phonemizer-fork + espeakng_loader, and lang_code="b".
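
A quick way to confirm the inference dependencies are importable before loading the model is sketched below. The import names are assumptions: phonemizer-fork is expected to install as phonemizer and espeakng_loader as espeakng_loader; adjust the tuple if your environment differs.

```python
import importlib.util

# Import names assumed for this sketch (see the note above).
REQUIRED_MODULES = ("kokoro", "misaki", "phonemizer", "espeakng_loader")


def missing_modules(modules=REQUIRED_MODULES):
    """Return the names in `modules` that cannot be found for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All inference dependencies found.")
```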

Credits & licenses

  • Base model: Kokoro-82M (Apache-2.0)
  • Training recipe: kikiri-tts
  • Dataset: EMNS (OpenSLR 136, Apache-2.0) — Kari Noriy, Xiaosong Yang, Jian Zhang (2023)

Released under Apache-2.0.
