Moonshine-MN: Mongolian ASR

A fine-tune of UsefulSensors/moonshine-base for Mongolian (Cyrillic) speech recognition, trained on Mozilla Common Voice.

Performance

Checkpoint            WER
final (step 15000)    11.88%
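
The card does not say how the WER above was computed or on which split. As a point of reference only, here is a minimal sketch of word error rate as word-level edit distance divided by the reference length. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration, not the author's evaluation code; corpus-level WER is usually the total edit count divided by the total number of reference words, not the mean of per-utterance values.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count.

    Hypothetical helper: splits on whitespace and applies no text
    normalization (casing, punctuation), which a real evaluation may do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of three reference words is dropped.
    print(word_error_rate("сайн байна уу", "сайн байна"))
```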

Quickstart

import torch, librosa
from transformers import MoonshineForConditionalGeneration, AutoFeatureExtractor
from mn_tokenizer import MnBPETokenizer
from huggingface_hub import hf_hub_download

model = MoonshineForConditionalGeneration.from_pretrained("orgilj/moonshine-mn").eval()
fe    = AutoFeatureExtractor.from_pretrained("orgilj/moonshine-mn")
tok   = MnBPETokenizer(vocab_file=hf_hub_download("orgilj/moonshine-mn", "mn_bpe.model"))

def transcribe(path, num_beams=5):
    audio, _ = librosa.load(path, sr=16000)
    inp = fe(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        ids = model.generate(
            inp.input_values,
            num_beams=num_beams,
            max_new_tokens=180,        # under the model's max_length=194
        )
    return tok.decode_ids(ids[0].tolist())

if __name__ == "__main__":
    print(transcribe("/workspace/data/cv-corpus-24.0-2025-12-05/mn/clips/common_voice_mn_44590402.mp3"))
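
The quickstart caps generation at a fixed 180 tokens. For short clips, a tighter limit scaled to the audio duration can reduce the chance of repetition loops. The sketch below is one way to derive such a limit. The helper name `token_budget` is hypothetical, and the default of 6.5 tokens per second is an assumption borrowed from the upstream Moonshine usage notes; it has not been validated for this Mongolian tokenizer and may need to be raised.

```python
import math


def token_budget(num_samples: int,
                 sampling_rate: int = 16000,
                 tokens_per_second: float = 6.5,  # assumed rate, tune for this tokenizer
                 cap: int = 180) -> int:
    """Return a max_new_tokens value proportional to clip duration.

    The result is clamped to [1, cap], where cap matches the fixed limit
    used in the quickstart.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    duration_s = num_samples / sampling_rate
    return max(1, min(cap, math.ceil(duration_s * tokens_per_second)))


if __name__ == "__main__":
    # A 2-second clip at 16 kHz.
    print(token_budget(32000))
```

Inside `transcribe`, this could replace the constant, for example `max_new_tokens=token_budget(len(audio))`.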

Live streaming

# From finetune-moonshine-asr repo:
python scripts/stream_mn.py --model orgilj/moonshine-mn --live

Training details

  • Base: moonshine-base (hidden_size=416, 8 encoder and 8 decoder layers)
  • Tokenizer: custom Mongolian SentencePiece BPE (vocab 2003)
  • Data: Mozilla Common Voice Mongolian
  • Schedule: peak LR 1e-4 with cosine decay, 500 warmup steps, 15000 total steps, fp16, batch size 64
  • Model size: 48.7M params (safetensors, F32)
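
The learning-rate settings in the training details (1e-4, cosine, 500 warmup steps, 15000 total steps) can be sketched as a function of the step index. The version below assumes linear warmup from zero and a half-cosine decay to zero, which is the same shape as `get_cosine_schedule_with_warmup` in transformers; the card does not confirm which scheduler or final LR was actually used, and the function name `learning_rate` is illustrative.

```python
import math


def learning_rate(step: int,
                  peak_lr: float = 1e-4,
                  warmup_steps: int = 500,
                  total_steps: int = 15000) -> float:
    """Linear warmup to peak_lr, then half-cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped so steps past the end stay at 0.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 250, 500, 7750, 15000):
        print(step, f"{learning_rate(step):.2e}")
```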