RsGoksel_ITU_Mainframe — Turkish ASR (Whisper-large-v3-turbo, LoRA fine-tune)

A Turkish automatic-speech-recognition model: a LoRA fine-tune of openai/whisper-large-v3-turbo (809 M parameters, 128 mel bins, 32 encoder / 4 decoder layers), merged to full weights. Internal codename VH5.0d.

A CTranslate2 / faster-whisper build is at RsGoksel/RsGoksel_ITU_Mainframe-ct2.

Results (author-measured)

Evaluated with faster-whisper (CTranslate2) + VAD + a reimplemented Turkish normalizer (trnorm); references and hypotheses are passed through the same normalizer. These numbers are author-measured and are not (yet) the official ysdede/turkish_asr_leaderboard figures: the board uses WhisperX and its own trnorm. On zero-shot turbo, our pipeline scores ~1.2 pt harsher than the board's (13.50 % vs. 12.32 % on MediaSpeech), so the board's own ruler would, if anything, score this model better.

| Dataset (set) | This model (WER) | Zero-shot turbo (ours) | Best public open model (board) |
|---|---|---|---|
| ymoslem/MediaSpeech (tr, 2513) | 10.49 % | 13.50 % | large-v3-turbo, 12.32 % |
| erenfazlioglu/turkishvoicedataset (1000-utt sample) | 8.53 % | - | large-v3-turbo, 10.40 % |
| ysdede/commonvoice_17_tr_fixed (test, 9650) | 9.77 % (CER 2.63 %) | 13.09 % | small-turkish-0, 10.84 % |

CV17 note: the test split is held out for this model, but CommonVoice 17 has known train/test sentence overlap (kaldi #2141), and most Turkish CV models, including the board's, train on CV17. Treat the CV17 number as a fair comparison among CV-trained models, not as a clean out-of-distribution (OOD) number.
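
The scoring idea described above (normalize references and hypotheses identically, then count word errors) can be sketched in a few lines. This is only an illustration under stated assumptions: `normalize_tr` is a hypothetical toy normalizer, not the trnorm reimplementation used for the numbers in this card, and `corpus_wer` is a plain word-level Levenshtein WER rather than the author's harness.

```python
import re

def normalize_tr(text: str) -> str:
    """Toy Turkish normalizer (hypothetical, NOT trnorm)."""
    # str.lower() maps "I" to "i", but Turkish pairs I/ı and İ/i,
    # so handle the dotted and dotless capitals before lowercasing.
    text = text.replace("İ", "i").replace("I", "ı").lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return " ".join(text.split())         # collapse whitespace

def word_errors(ref_words, hyp_words) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        cur = [i]
        for j, h in enumerate(hyp_words, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def corpus_wer(refs, hyps) -> float:
    """Corpus WER in percent: total errors over total reference words."""
    errors = words = 0
    for ref, hyp in zip(refs, hyps):
        r = normalize_tr(ref).split()
        h = normalize_tr(hyp).split()
        errors += word_errors(r, h)
        words += len(r)
    return 100.0 * errors / words
```

Because both sides go through the same function, differences in casing or punctuation do not count as errors; only the normalizer's own choices (and the decoding pipeline) shift the score, which is why two harnesses can disagree by a point or so on the same model.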

Usage

transformers

import soundfile as sf
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the merged full-weight checkpoint in fp16 on the GPU.
m = WhisperForConditionalGeneration.from_pretrained(
    "RsGoksel/RsGoksel_ITU_Mainframe", torch_dtype=torch.float16).to("cuda").eval()
p = WhisperProcessor.from_pretrained("RsGoksel/RsGoksel_ITU_Mainframe")

# The feature extractor expects 16 kHz mono audio and pads or truncates to 30 s.
audio, sr = sf.read("audio.wav", dtype="float32")
assert sr == 16000, "resample to 16 kHz first"
feats = p.feature_extractor(audio, sampling_rate=sr, return_tensors="pt").input_features.to("cuda", torch.float16)
# Force Turkish transcription instead of relying on language auto-detection.
ids = m.generate(feats, language="tr", task="transcribe")
print(p.batch_decode(ids, skip_special_tokens=True)[0])
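
The snippet above handles a single window: Whisper's feature extractor pads or truncates its input to 30 s, so longer recordings need to be split first. A minimal sketch of naive fixed-length chunking follows; `chunk_bounds` is a hypothetical helper (not part of transformers), and hard cuts can split words at the boundaries, so VAD-based segmentation as in the faster-whisper path is usually the better choice.

```python
def chunk_bounds(n_samples: int, sample_rate: int = 16000, chunk_seconds: float = 30.0):
    """Return (start, end) sample indices of consecutive, non-overlapping windows."""
    step = int(sample_rate * chunk_seconds)
    return [(start, min(start + step, n_samples))
            for start in range(0, n_samples, step)]

# Intended use with the names from the snippet above (audio, sr, p, m):
#   for start, end in chunk_bounds(len(audio), sr):
#       feats = p.feature_extractor(audio[start:end], sampling_rate=sr,
#                                   return_tensors="pt").input_features
#       ...generate and decode each chunk, then join the texts...
```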

faster-whisper (use the -ct2 repo)

from faster_whisper import WhisperModel
model = WhisperModel("RsGoksel/RsGoksel_ITU_Mainframe-ct2", device="cuda", compute_type="float16")
segments, _ = model.transcribe("audio.wav", language="tr", beam_size=5, vad_filter=True)
print(" ".join(s.text for s in segments))
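
In faster-whisper, `segments` is a generator: decoding runs lazily as it is iterated, and it can be consumed only once. Each segment carries `start`, `end` and `text` attributes, so timestamps can be kept alongside the transcript. A small sketch, where `format_segments` and `FakeSegment` are hypothetical names used only for this illustration:

```python
from collections import namedtuple

def format_segments(segments) -> str:
    """Render segments as '[start -> end] text' lines (times in seconds)."""
    lines = []
    for s in segments:
        lines.append(f"[{s.start:.2f} -> {s.end:.2f}] {s.text.strip()}")
    return "\n".join(lines)

# Stand-in objects so the sketch runs without a model; with faster-whisper,
# pass the `segments` generator returned by model.transcribe(...) instead.
FakeSegment = namedtuple("FakeSegment", "start end text")
demo = [FakeSegment(0.0, 2.5, " Merhaba dünya."),
        FakeSegment(2.5, 4.0, " Nasılsınız?")]
print(format_segments(demo))
```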

Training

  • Base: openai/whisper-large-v3-turbo (MIT). Method: LoRA (r=32, α=64, dropout 0.05, targets q_proj,v_proj), merged to full weights.
  • Recipe: learning rate 1e-5, bf16, cosine schedule with 500 warmup steps, max-grad-norm 1.0. Early stopping uses a broadcast (MSP) dev slice with generation_config.language="tr" forced; without it, HF Trainer eval auto-detects the language and yields misleadingly high WER on broadcast audio.
  • Data: ~997 h Turkish (broadcast/podcast-heavy + CommonVoice 17 + FLEURS-train + a math set). A controlled ablation on the byte-identical 886 h subset matches the 997 h result, indicating the gain comes from Whisper pretraining, not fine-tuning-data volume.
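
"Merged to full weights" means the low-rank update is folded into each targeted projection matrix, so inference needs no adapter code. A toy sketch of the standard LoRA merge, W' = W + (α / r) · B · A, in pure Python follows. The helper names and the tiny matrices are illustrative assumptions, not the training code; with this card's r=32 and α=64 the scale factor α / r is 2.0.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def merge_lora(w, lora_a, lora_b, r, alpha):
    """Return W + (alpha / r) * B @ A.

    Shapes: w is (out, in), lora_a is (r, in), lora_b is (out, r).
    """
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy rank-1 example; alpha / r = 2.0, the same ratio as r=32, alpha=64.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]       # (r=1, in=2)
B = [[0.5], [0.0]]     # (out=2, r=1)
merged = merge_lora(W, A, B, r=1, alpha=2)
```

After merging, a single matrix multiply with `merged` gives the same output as applying the base weight and the scaled adapter separately, which is why the released checkpoint loads as an ordinary Whisper model.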

⚠️ Data provenance & license

This model is released for research / non-commercial use (CC-BY-NC-4.0). Reasons, disclosed openly:

  • The base model is MIT, but the fine-tuning corpus includes sources whose license/provenance is unclear (~60 % of the from-scratch corpus lineage) and at least one non-commercial (NC) source (a Khan-Academy-derived math set). We therefore release the weights under a non-commercial license.
  • No third-party audio is redistributed here — only the trained weights.
  • If you need a commercially-clean model, retrain on a license-clean corpus.

Citation / context

Part of the TurkSpeak project (two-track Turkish ASR study: from-scratch RNN-T vs. Whisper transfer, with a byte-identical controlled pretraining ablation and a training-free LODR decoding result). Base model: Whisper [Radford et al., arXiv:2212.04356]. LoRA [Hu et al., arXiv:2106.09685].

Evaluation results

  • MediaSpeech (Turkish): WER 10.49 % (author harness, faster-whisper+VAD+trnorm), self-reported
  • turkishvoicedataset: WER 8.53 % (author harness), self-reported
  • commonvoice_17_tr_fixed (test set): WER 9.77 % (author harness; see contamination note), self-reported