OmniVoice Thai Fine-tuned · Thai Speech Synthesis Model

Created by UKA — AI Agent, Hacker & Cyber Security Expert

OmniVoice fine-tuned for Thai speech synthesis (Thai TTS)

Support me (BTC): bc1qf27cyk3vmugcdyv9xdtuv5jwz37863crpj5c9v

Thai 🇹🇭

About this model

OmniVoice Thai is a Thai text-to-speech model fine-tuned from k2-fsa/OmniVoice (Qwen3-0.6B). It uses diffusion-style masked token prediction (MaskGIT-style).

Usage

!pip install omnivoice

from omnivoice import OmniVoice
import soundfile as sf

model = OmniVoice.from_pretrained("hotdogs/omnivoice-thai")

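# Generate speech from text
audio = model.generate(
    text="สวัสดีครับ วันนี้อากาศดีมากเลย",  # Thai input: "Hello, the weather is really nice today"
    instruct="male, low pitch",  # text description of the target voice
)
# Save the first generated waveform at a 24 kHz sample rate
sf.write("output.wav", audio[0], 24000)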

Training Methodology

Base Model: k2-fsa/OmniVoice (Qwen3-0.6B, MaskGIT diffusion)
Dataset: Thanarit/Thai-Voice-Test7 + custom voice data, ~20,000 utterances in total (~12.6 hours, 2 speakers)
Audio Preprocessing: Resample 16kHz → 24kHz with torchaudio, tokenize with eustlb/higgs-audio-v2-tokenizer
Training Config:
- batch_tokens: 2,048 (per GPU forward pass)
- gradient_accumulation_steps: 8 (effective batch ≈ 16,384 tokens)
- learning_rate: 1e-5, cosine schedule, warmup 2%
- max_steps: 30,000, early stop when per-step loss < 3.0
- mixed_precision: fp16, attn_implementation: sdpa
Hardware: NVIDIA RTX 3090 24GB (Vast.ai cloud)
Training Time: ~1 hour 30 minutes (1,747 steps)
Monitoring: a Python watchdog script reads the training log every 2 seconds → auto-kills training when loss < 3.0
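
The monitoring step can be sketched roughly as below. This is a minimal illustration, not the author's actual script: the log path, the `loss=` line format (borrowed from the Training Log excerpt in this card), and the helper names `parse_loss`, `should_stop`, and `watch` are all assumptions.

```python
import re
import subprocess
import time
from pathlib import Path
from typing import Optional

# Assumed log format: per-step lines contain "loss=<number>".
LOSS_PATTERN = re.compile(r"loss=([0-9]*\.?[0-9]+)")
LOSS_THRESHOLD = 3.0  # early-stop threshold from the training config
POLL_SECONDS = 2.0    # polling interval from the monitoring description


def parse_loss(line: str) -> Optional[float]:
    """Return the per-step loss found in a log line, or None if absent."""
    match = LOSS_PATTERN.search(line)
    return float(match.group(1)) if match else None


def should_stop(line: str, threshold: float = LOSS_THRESHOLD) -> bool:
    """True when the line reports a loss strictly below the threshold."""
    loss = parse_loss(line)
    return loss is not None and loss < threshold


def watch(log_path: Path, trainer: subprocess.Popen) -> None:
    """Poll the log file and terminate the trainer once the threshold is crossed."""
    offset = 0
    while trainer.poll() is None:  # keep polling while training is running
        if log_path.exists():
            with log_path.open("r", encoding="utf-8", errors="replace") as fh:
                fh.seek(offset)      # skip what was already inspected
                chunk = fh.read()
                offset = fh.tell()
            if any(should_stop(line) for line in chunk.splitlines()):
                trainer.terminate()  # hypothetical "auto-kill" action
                return
        time.sleep(POLL_SECONDS)
```

A hypothetical invocation would start the trainer with `subprocess.Popen([...])` and then call `watch(Path("train.log"), trainer)`. Because the rule looks at single steps, one unusually low step is enough to end the run.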

Tools

Tool	Purpose
OmniVoice	TTS framework (MaskGIT)
PyTorch 2.8 + CUDA 13.0	Training backend
HuggingFace Accelerate	Distributed training
higgs-audio-v2-tokenizer	Audio tokenization
torchaudio	Audio preprocessing
NVIDIA RTX 3090 24GB	GPU compute (Vast.ai)
Hermes Agent	Autonomous AI agent for orchestration

Limitations

Trained on limited data (~12.6 hours, 2 speakers); may not generalize well
Transcripts come from ASR and may contain errors
The base model was trained on EN/ZH; Thai is new to it
Per-step loss oscillates between 3.5 and 5.5; smoothed loss is ~4.4
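
The gap between per-step and smoothed loss can be illustrated with a short sketch. The card does not say how the smoothed value is computed, so the exponential moving average here is an assumption, and the loss values are made up purely for illustration (they are not taken from the training run).

```python
from typing import List


def ema(values: List[float], alpha: float = 0.5) -> List[float]:
    """Exponential moving average; alpha is the weight of the newest value."""
    smoothed: List[float] = []
    current = None
    for value in values:
        # The first value seeds the average; later values are blended in.
        current = value if current is None else alpha * value + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


# Hypothetical per-step losses with one noisy dip below 3.0.
example_losses = [4.0, 5.0, 4.0, 5.0, 2.9, 5.0]
smoothed = ema(example_losses)

threshold = 3.0
per_step_crosses = min(example_losses) < threshold  # a single step dips below
smoothed_crosses = min(smoothed) < threshold        # the average does not
print(per_step_crosses, smoothed_crosses)
```

In this toy series one step drops under the threshold while the smoothed curve stays well above it, which is the same pattern as a best per-step loss under 3.0 alongside a smoothed loss near 4.4.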

English 🇬🇧

About

OmniVoice Thai is a Thai TTS model fine-tuned from k2-fsa/OmniVoice (Qwen3-0.6B) using MaskGIT-style masked token prediction.

Quick Start

!pip install omnivoice

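from omnivoice import OmniVoice
import soundfile as sf

model = OmniVoice.from_pretrained("hotdogs/omnivoice-thai")

# Voice cloning: condition generation on a reference recording
audio = model.generate(
    text="Hello, this is a test.",
    ref_audio="reference.wav",
)
sf.write("cloned.wav", audio[0], 24000)  # save at 24 kHz

# Voice design: describe the target voice in text
audio = model.generate(
    text="The weather is nice today.",
    instruct="female, high pitch, british accent",
)
sf.write("designed.wav", audio[0], 24000)  # save at 24 kHz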

Model Details

Creator: UKA
Base model: k2-fsa/OmniVoice (Qwen3-0.6B)
Training steps: 1,747 (early stop at per-step loss < 3.0)
Best per-step loss: 2.8775 (step 1,747)
Smoothed loss: ~4.4 (masked-prediction loss is naturally higher than autoregressive loss)
Dataset: ~20,000 utterances (~12.6 hrs, 2 speakers)
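
A few derived quantities follow from the figures quoted in this card. The sketch below is plain arithmetic on those approximate values; it assumes the ~90 minute training time and the batch settings from the training configuration, and the variable names are illustrative.

```python
# Figures quoted in this card (approximate where the card says "~").
dataset_hours = 12.6
num_utterances = 20_000
batch_tokens = 2_048          # tokens per GPU forward pass
grad_accum_steps = 8
training_steps = 1_747
training_minutes = 90         # "~1 hour 30 minutes"

# Average utterance length in seconds.
avg_utterance_seconds = dataset_hours * 3600 / num_utterances

# Tokens contributing to each optimizer update.
effective_batch_tokens = batch_tokens * grad_accum_steps

# Wall-clock time per logged step.
seconds_per_step = training_minutes * 60 / training_steps

print(f"average utterance: {avg_utterance_seconds:.2f} s")
print(f"effective batch:   {effective_batch_tokens} tokens")
print(f"time per step:     {seconds_per_step:.2f} s")
```

The utterances are short on average (a little over two seconds each), and the effective batch matches the ≈ 16,384 tokens stated in the training configuration.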

Training Log

Step 1747/30000 | loss=2.8775 (per-step) | lr=9.96e-06
Step 1700 | train/loss: 4.4281 (smoothed) | epoch: 1
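
The logged learning rate can be cross-checked against the stated schedule (peak 1e-5, cosine decay, 2% warmup, 30,000 max steps). The card does not name the exact scheduler, so this sketch assumes linear warmup followed by cosine decay to zero, a common convention; `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step: int, peak_lr: float = 1e-5, max_steps: int = 30_000,
          warmup_ratio: float = 0.02) -> float:
    """Learning rate at a given step under linear warmup + cosine decay (assumed)."""
    warmup_steps = round(max_steps * warmup_ratio)  # 2% of 30,000 steps
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"{lr_at(1747):.2e}")
```

Under these assumptions the value at step 1,747 formats as 9.96e-06, which agrees with the `lr` in the log line above.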

Limitations

Limited training data (~12.6 hrs, 2 speakers)
ASR-generated transcripts may contain errors
Base model was trained on EN/ZH; Thai is new territory

Model size: 0.6B params (Safetensors)
Tensor types: I64, F32

Model tree for hotdogs/omnivoice-thai

Qwen/Qwen3-0.6B-Base (base model) → Qwen/Qwen3-0.6B (fine-tuned) → k2-fsa/OmniVoice (fine-tuned) → hotdogs/omnivoice-thai (this model, fine-tuned)
