HoLo-ToLk (TTS): a tokenizer-free HSL text-to-speech feasibility model

A working demonstration that the zero-parameter byte-signal substrate hsl-embedding-zero enables tokenizer-free text-to-speech: UTF-8 text bytes → frozen HSL 27-D features (no tokenizer / vocab / learned input door) → Pre-LN transformer → AR mel decoder + guided attention → HiFi-GAN → a natural-sounding voice.

The honest claim

Tokenizer-free TTS on the HSL substrate: text UTF-8 bytes → frozen HSL 27-D (no tokenizer / vocab / learned input door) → Pre-LN transformer → AR mel decoder + guided attention → HiFi-GAN → a natural-sounding voice. Held-out teacher-forced mel-L1 is 0.296 for seed 0, confirmed across seeds (seeds 0–3 = 0.296 / 0.293 / 0.292 / 0.290, range 0.006). Single-speaker (LJSpeech), English. This is a feasibility demonstration that HSL works as a tokenizer-free TTS input; it is NOT a multi-speaker or production TTS system, and it makes no SOTA claim.

seed    held-out mel-L1 (teacher-forced)
0       0.296
1       0.293
2       0.292
3       0.290
mean    0.293 (range 0.006)
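The summary row can be reproduced from the per-seed values listed above. This is a small arithmetic check written for illustration, not part of the repo; the variable names are hypothetical.

```python
from statistics import mean

# Held-out teacher-forced mel-L1 per seed, copied from the values above.
mel_l1 = {0: 0.296, 1: 0.293, 2: 0.292, 3: 0.290}

values = list(mel_l1.values())
avg = mean(values)                   # arithmetic mean over the four seeds
spread = max(values) - min(values)   # "range" = worst seed minus best seed

print(f"mean  = {avg:.5f}")
print(f"range = {spread:.3f}")
```

The unrounded mean is 0.29275, which the card reports to three decimals as 0.293.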

The genuine strength is the natural voice: guided attention drives a monotonic alignment, the stop token fires, and the HiFi-GAN output is intelligible and natural to listen to. But it is still one speaker and a feasibility demo.

Intended use

  • Research / educational demonstration that a lossless byte substrate (hsl-embedding-zero) works as a tokenizer-free / vocab-free TTS input: no learned input embedding, just raw UTF-8 bytes → frozen HSL → transformer.
  • A worked example of an AR mel-decoder + guided-attention TTS pipeline driven entirely by byte features, vocoded with a pretrained HiFi-GAN.

Not intended for production speech synthesis, multi-speaker / voice-cloning use, or any setting needing a specific or controllable voice.
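To make the "raw UTF-8 bytes" input concrete, here is a minimal sketch of the first stage only: turning text into byte values. The helper name text_to_bytes is hypothetical. The mapping from bytes to frozen 27-D HSL features is done by hsl-embedding-zero and is not shown, because its API is not described in this card.

```python
def text_to_bytes(text: str) -> list[int]:
    """Encode text as UTF-8 and return the byte values (each in 0..255)."""
    return list(text.encode("utf-8"))

# ASCII characters map to one byte each.
ascii_ids = text_to_bytes("Hi.")

# Non-ASCII characters expand to several bytes; no vocabulary lookup is involved.
accented_ids = text_to_bytes("café")

print(ascii_ids)     # [72, 105, 46]
print(accented_ids)  # [99, 97, 102, 195, 169]
```

Because every input is a sequence of values in 0..255, there is no out-of-vocabulary case to handle at this stage.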

How it works

UTF-8 text bytes → frozen 27-D hsl-embedding-zero features (no lens, since text is HSL's home) → LayerNorm + linear projection → Pre-LN transformer text encoder (4 layers, dim 384) → AR mel decoder (prenet → causal self-attention → cross-attention to text) with a guided-attention loss (monotonic alignment) → stop token → Postnet (5× conv1d residual) → 80-bin log-mel (22.05 kHz, jik876/HiFi-GAN-exact filterbank) → speechbrain/tts-hifigan-ljspeech vocoder → 22050 Hz waveform. See the repo README and tts_lens.py for the exact architecture.
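The guided-attention term can be illustrated with the commonly used diagonal-penalty formulation, in which attention mass far from the text/mel diagonal is penalised. This is a sketch under that assumption, in plain Python for readability; the exact form used in tts_lens.py may differ, and the function names here are hypothetical. The default sigma of 0.2 matches the --ga-sigma value given in this card.

```python
import math

def guided_attention_weights(n_text: int, n_mel: int, sigma: float = 0.2):
    """Penalty matrix W[t][n]: 0 on the diagonal, approaching 1 far from it."""
    return [
        [1.0 - math.exp(-((n / n_text - t / n_mel) ** 2) / (2.0 * sigma ** 2))
         for n in range(n_text)]
        for t in range(n_mel)
    ]

def guided_attention_loss(attn, sigma: float = 0.2) -> float:
    """Mean of attn * W, where attn[t][n] is mel-frame-to-text attention."""
    n_mel, n_text = len(attn), len(attn[0])
    w = guided_attention_weights(n_text, n_mel, sigma)
    total = sum(attn[t][n] * w[t][n] for t in range(n_mel) for n in range(n_text))
    return total / (n_mel * n_text)

# A diagonal (monotonic) alignment incurs no penalty; a reversed one does.
diagonal = [[1.0 if n == t else 0.0 for n in range(4)] for t in range(4)]
anti_diagonal = [[1.0 if n == 3 - t else 0.0 for n in range(4)] for t in range(4)]

print(guided_attention_loss(diagonal))           # 0.0
print(guided_attention_loss(anti_diagonal) > 0)  # True
```

Minimising this term pushes the decoder's cross-attention toward reading the text left to right as mel frames advance, which is the monotonic alignment described above.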

Training data

LJSpeech 1.1: a single-speaker, public-domain English read-speech corpus (~24h, one female narrator). ~12.8k clips for training; ~256 held out (every 50th clip, deterministic). Audio stored at 22.05 kHz mono FLAC; mel targets computed on the fly. The dataset is not redistributed; rebuild it with collect_ljspeech.py (streams from MikhailT/lj-speech, split full).
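A deterministic "every 50th clip" split can be sketched as follows. The function name, the toy clip IDs, and the choice of offset (positions 0, 50, 100, ...) are assumptions for illustration; collect_ljspeech.py and tts_lens.py define the actual ordering and offset.

```python
def split_every_kth(clip_ids, k=50):
    """Deterministic split: every k-th clip (by position) is held out."""
    train, held_out = [], []
    for i, clip_id in enumerate(clip_ids):
        (held_out if i % k == 0 else train).append(clip_id)
    return train, held_out

# Toy IDs standing in for real LJSpeech clip names.
clip_ids = [f"clip-{i:04d}" for i in range(200)]
train, held_out = split_every_kth(clip_ids)

print(len(train), len(held_out))  # 196 4
print(held_out)                   # ['clip-0000', 'clip-0050', 'clip-0100', 'clip-0150']
```

Because the split depends only on clip position, it is identical on every run as long as the clip order is fixed, with no random seed involved.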

Training procedure

AR transformer TTS with teacher forcing. dim 384 / enc-layers 4 / dec-layers 4 / heads 6 / ff 1536, batch 24, AdamW lr 3e-4 (cosine, 2000-step warmup), 40k steps, grad-clip 1.0, guided-attention loss (--ga-weight 1.0 --ga-sigma 0.2), SortaGrad short-text-first, seed 0. Loss = L1(mel_pre) + L1(mel_post) + BCE(stop) + guided-attention. The reported checkpoint is the one with the best held-out mel-L1 over training.
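The learning-rate schedule (peak 3e-4, 2000-step warmup, cosine decay over 40k steps) can be sketched as a pure function of the step. The linear warmup shape and the decay to zero at the final step are assumptions, since the card only says "cosine, 2000-step warmup"; lr_at is a hypothetical helper, not a function from the repo.

```python
import math

def lr_at(step: int, base_lr: float = 3e-4, warmup: int = 2000,
          total: int = 40_000) -> float:
    """Linear warmup to base_lr, then cosine decay to 0 at `total` steps."""
    if step < warmup:
        return base_lr * step / warmup
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / (total - warmup))
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

for s in (0, 1000, 2000, 21_000, 40_000):
    print(s, f"{lr_at(s):.2e}")
```

Under these assumptions the rate is half the peak at step 1000 (mid-warmup) and again at step 21,000 (the midpoint of the cosine phase).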

Evaluation

Held-out teacher-forced mel-L1 (L1 between predicted and target log-mel) on the LJSpeech held-out split. Reported: mel-L1 0.296 (seed 0); multi-seed seeds 0–3 = 0.296 / 0.293 / 0.292 / 0.290 (mean 0.293, range 0.006). Free-run (autoregressive) synthesis is the qualitative test: the voice is natural and intelligible with the HiFi-GAN vocoder; per-clip free-run quality varies and is best on clear, sentence-length English input.
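As a sketch of the metric, mel-L1 is the mean absolute difference between predicted and target log-mel values. The version below also masks padded frames and averages over all valid values in the batch; both choices are assumptions, since the card does not state how padding or per-clip averaging is handled. The function name is hypothetical.

```python
def masked_mel_l1(pred, target, lengths):
    """Mean |pred - target| over valid frames and mel bins.

    pred, target: nested lists shaped [batch][frames][bins].
    lengths: number of valid (unpadded) frames for each batch item.
    """
    total, count = 0.0, 0
    for p_item, t_item, n in zip(pred, target, lengths):
        # Only the first n frames are real; the rest is padding.
        for p_frame, t_frame in zip(p_item[:n], t_item[:n]):
            for p, t in zip(p_frame, t_frame):
                total += abs(p - t)
                count += 1
    return total / count

# One clip, 2 valid frames of 2 bins each, plus one padded frame that is ignored.
pred = [[[0.0, 1.0], [2.0, 3.0], [9.0, 9.0]]]
target = [[[0.5, 1.0], [2.0, 2.5], [0.0, 0.0]]]

print(masked_mel_l1(pred, target, lengths=[2]))  # 0.25
```

In the teacher-forced setting the decoder sees the ground-truth previous frames, so this number measures one-step prediction error rather than free-run synthesis quality.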

Limitations

  • Single-speaker (LJSpeech, one narrator): no speaker control or voice cloning; it only speaks in that one voice.
  • English only (LJSpeech English read speech): not trained or evaluated on other languages.
  • The headline number is a teacher-forced mel-L1, not a naturalness benchmark (no MOS); free-run quality varies clip to clip.
  • A feasibility demonstration, not state-of-the-art TTS or a claim of general superiority over dedicated TTS systems.

Files / checkpoint to upload

Upload the frozen seed-0 model to this model repo:

  • tts_lens_ga_best.pt: guided-attention AR TTS, 40k steps, seed 0, held-out mel-L1 0.296 (~254 MB). Self-contained: stores its own config, so tts_lens.py rebuilds the model on load.

Synthesize (free-run, natural voice; --vocoder hifigan is required):

pip install hsl-embedding-zero zstandard soundfile librosa speechbrain torch torchaudio
python tts_lens.py --synth --synth-n 5 --vocoder hifigan \
  --data ./data/ljspeech --synth-ckpt tts_lens_ga_best.pt

Acknowledgments

Independent research, developed in collaboration with AI assistants: Claude Code (Anthropic) and Codex. The HSL work and experimental direction are the author's; the tools assisted with engineering and review.

Citation

@software{woo_holotolk_tts_2026,
  author = {Jinhyun Woo},
  title  = {HoLo-ToLk (TTS): a tokenizer-free HSL text-to-speech feasibility model},
  year   = {2026},
  doi    = {10.5281/zenodo.21007475},
  url    = {https://github.com/Woojiggun/HoLo-ToLk-TTS}
}