RsGoksel_ITU_Mainframe — Turkish ASR (Whisper-large-v3-turbo, LoRA fine-tune)

A Turkish automatic-speech-recognition model: a LoRA fine-tune of openai/whisper-large-v3-turbo (809 M parameters, 128 mel bins, 32 encoder / 4 decoder layers), merged to full weights. Internal codename VH5.0d.

A CTranslate2 / faster-whisper build is at RsGoksel/RsGoksel_ITU_Mainframe-ct2.

Results (author-measured)

Evaluated with faster-whisper (CTranslate2) + VAD + a reimplemented Turkish normalizer (trnorm); references and hypotheses are passed through the same normalizer. These are author-measured numbers, not (yet) the official ysdede/turkish_asr_leaderboard figures (the board uses WhisperX + its own trnorm). On the same model, our pipeline scores ~1.2 pt harsher than the board's (zero-shot turbo on MediaSpeech: our 13.50 % vs. the board's 12.32 %), so the board's own ruler would, if anything, score this model better.
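The scoring step described above (normalize both sides, then count word-level edits) can be sketched as follows. This is a minimal illustration, not the harness used for the reported numbers: `normalize_tr` is a hypothetical stand-in for trnorm, and pooling errors over the whole corpus (rather than averaging per-utterance WER) is an assumption.

```python
import re
import unicodedata


def normalize_tr(text: str) -> str:
    """Hypothetical stand-in normalizer (NOT the trnorm behind the reported numbers)."""
    # Turkish-aware lowercasing: map the dotted/dotless capitals before str.lower().
    text = text.replace("İ", "i").replace("I", "ı").lower()
    text = unicodedata.normalize("NFC", text)
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def word_errors(ref_words, hyp_words) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        cur = [i]
        for j, h in enumerate(hyp_words, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def corpus_wer(refs, hyps) -> float:
    """Pooled WER: total edits divided by total reference words."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        r = normalize_tr(ref).split()  # same normalizer on the reference ...
        h = normalize_tr(hyp).split()  # ... and on the hypothesis
        errors += word_errors(r, h)
        total += len(r)
    if total == 0:
        raise ValueError("no reference words to score against")
    return errors / total


refs = ["Bugün hava çok güzel.", "İstanbul'a gidiyorum"]
hyps = ["bugün hava güzel", "istanbul a gidiyoruz"]
print(f"WER = {corpus_wer(refs, hyps):.2%}")
```

Because the normalizer decides what counts as an error (casing, apostrophes, punctuation), two harnesses with different normalizers can disagree by a point or more on identical transcripts, which is why the harness offset matters when comparing against board numbers.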

Dataset (set)	This model (WER)	Zero-shot turbo, our harness (WER)	Best public open model, board (WER)
ymoslem/MediaSpeech (tr, 2513)	10.49 %	13.50 %	large-v3-turbo 12.32 %
erenfazlioglu/turkishvoicedataset (1000-utt sample)	8.53 %	—	large-v3-turbo 10.40 %
ysdede/commonvoice_17_tr_fixed (test, 9650)	9.77 % (CER 2.63 %)	13.09 %	small-turkish-0 10.84 %

CV17 note: the test split is held out for this model, but CommonVoice 17 has known train/test sentence overlap (kaldi #2141), and most Turkish CV models — including the board's — train on CV17; treat the CV17 number as a fair-among-CV-trained comparison, not a clean OOD number.
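A quick way to gauge the kind of sentence overlap mentioned in the CV17 note is to compare the transcript text of the two splits. The sketch below is illustrative only: `overlap_report` is a hypothetical helper, the inputs are plain lists of transcript strings, and `str.casefold` is a rough matching key that does not handle Turkish dotted/dotless i correctly.

```python
def overlap_report(train_sentences, test_sentences, key=str.casefold):
    """Count test sentences whose text also appears in the training split.

    `key` maps a sentence to the form used for matching; casefold + strip is a
    rough default, and a Turkish-aware normalizer would be stricter.
    """
    train_keys = {key(s).strip() for s in train_sentences}
    hits = sum(1 for s in test_sentences if key(s).strip() in train_keys)
    fraction = hits / len(test_sentences) if test_sentences else 0.0
    return hits, fraction


train = ["Merhaba dünya", "Nasılsın"]
test = ["merhaba dünya ", "Görüşürüz"]
hits, fraction = overlap_report(train, test)
print(f"{hits} of {len(test)} test sentences also occur in train ({fraction:.0%})")
```

A non-trivial overlap fraction means the test WER partly measures memorized sentences, so it should not be read as an out-of-domain estimate.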

Usage

transformers

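import soundfile as sf
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the merged weights in half precision on the GPU.
m = WhisperForConditionalGeneration.from_pretrained(
    "RsGoksel/RsGoksel_ITU_Mainframe", torch_dtype=torch.float16).to("cuda").eval()
p = WhisperProcessor.from_pretrained("RsGoksel/RsGoksel_ITU_Mainframe")

# Whisper expects 16 kHz mono; the feature extractor pads or truncates to a single 30 s window.
audio, sr = sf.read("audio.wav")
assert sr == 16000 and audio.ndim == 1, "resample / downmix to 16 kHz mono first"
feats = p.feature_extractor(audio, sampling_rate=sr, return_tensors="pt").input_features.to("cuda", torch.float16)

# Force Turkish transcription instead of relying on language auto-detection.
ids = m.generate(feats, language="tr", task="transcribe")
print(p.batch_decode(ids, skip_special_tokens=True)[0])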
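The transformers snippet above handles one window of audio, since the Whisper feature extractor pads or truncates its input to 30 s. For longer or multi-channel files, a minimal pre-processing sketch could look like this. `to_mono` and `split_into_windows` are hypothetical helpers, and the fixed, non-overlapping windows are a simplification that can cut words at the boundaries (VAD-based or overlapping chunking would be more robust).

```python
import numpy as np

TARGET_SR = 16_000   # sample rate Whisper expects
WINDOW_SECONDS = 30  # length of one Whisper input window


def to_mono(audio: np.ndarray) -> np.ndarray:
    """soundfile returns (frames, channels) for multi-channel files; average the channels."""
    return audio.mean(axis=1) if audio.ndim == 2 else audio


def split_into_windows(audio: np.ndarray, sr: int, seconds: int = WINDOW_SECONDS):
    """Cut a 1-D signal into consecutive windows of at most `seconds` each."""
    if sr != TARGET_SR:
        raise ValueError(f"expected {TARGET_SR} Hz audio, got {sr} Hz; resample first")
    step = seconds * sr
    return [audio[i:i + step] for i in range(0, len(audio), step)]


# Synthetic 65 s stereo signal standing in for sf.read("audio.wav").
stereo = np.zeros((TARGET_SR * 65, 2), dtype=np.float32)
windows = split_into_windows(to_mono(stereo), TARGET_SR)
print([len(w) / TARGET_SR for w in windows])  # window lengths in seconds
```

Each window can then go through `p.feature_extractor(...)` and `m.generate(...)` exactly as in the snippet above, with the decoded texts joined in order.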

faster-whisper (use the -ct2 repo)

from faster_whisper import WhisperModel
model = WhisperModel("RsGoksel/RsGoksel_ITU_Mainframe-ct2", device="cuda", compute_type="float16")
segments, _ = model.transcribe("audio.wav", language="tr", beam_size=5, vad_filter=True)
print(" ".join(s.text for s in segments))
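The segments yielded by `model.transcribe` carry `start`, `end` (in seconds) and `text`, so they can be turned into subtitles as well as plain text. The sketch below formats them as SRT; `DemoSegment` is a hypothetical stand-in so the example runs without the model, and `to_srt` accepts any objects with those three attributes.

```python
from dataclasses import dataclass


@dataclass
class DemoSegment:
    """Hypothetical stand-in with the start/end/text fields of a faster-whisper segment."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (the SRT timestamp layout)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render an iterable of segments as numbered SRT cues."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        span = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        cues.append(f"{i}\n{span}\n{seg.text.strip()}")
    return "\n\n".join(cues) + "\n"


demo = [DemoSegment(0.0, 1.25, " Merhaba"), DemoSegment(1.25, 3.0, " nasılsınız?")]
print(to_srt(demo))
```

`model.transcribe` returns its segments lazily as a generator, so pass it to `to_srt` once (or materialize it with `list(...)` first) rather than iterating it twice.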

Training

Base: openai/whisper-large-v3-turbo (MIT). Method: LoRA (r=32, α=64, dropout 0.05, targets q_proj,v_proj), merged to full weights.
Recipe: learning rate 1e-5, bf16, cosine schedule + 500 warmup, max-grad-norm 1.0, early-stop on a broadcast (MSP) dev slice with generation_config.language="tr" forced (HF Trainer eval otherwise auto-detects language and yields misleadingly high WER on broadcast audio).
Data: ~997 h Turkish (broadcast/podcast-heavy + CommonVoice 17 + FLEURS-train + a math set). A controlled ablation on the byte-identical 886 h subset matches the 997 h result, indicating the gain comes from Whisper pretraining, not fine-tuning-data volume.
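Two parts of the recipe above reduce to short formulas: merging a LoRA update into a full weight matrix, and the warmup + cosine learning-rate curve. The sketch below illustrates both with NumPy and the standard library. It is not the training code: the standard LoRA scaling alpha / r is assumed, "500 warmup" is read as 500 optimizer steps, and `total_steps` is a hypothetical value because the card does not state the run length.

```python
import math

import numpy as np


def merge_lora(W, A, B, r=32, alpha=64):
    """Fold a LoRA update into the base weight: W + (alpha / r) * B @ A.

    Shapes: W is (out, in), A is (r, in), B is (out, r).
    """
    return W + (alpha / r) * (B @ A)


def lr_at_step(step, total_steps, base_lr=1e-5, warmup_steps=500):
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Tiny demo with the same alpha / r ratio (2.0) as the card's r=32, alpha=64.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))
A = rng.normal(size=(4, 6))
B = rng.normal(size=(8, 4))
x = rng.normal(size=(3, 6))

merged = merge_lora(W, A, B, r=4, alpha=8)
adapter_out = x @ W.T + 2.0 * (x @ A.T) @ B.T  # base path + scaled low-rank path
print(np.allclose(x @ merged.T, adapter_out))

total_steps = 10_000  # hypothetical run length
print([lr_at_step(s, total_steps) for s in (0, 250, 500, total_steps)])
```

After merging, the model has the same layout as the base checkpoint, which is why the released weights load with plain `WhisperForConditionalGeneration` and convert to CTranslate2 without an adapter library.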

⚠️ Data provenance & license

This model is released for research / non-commercial use (CC-BY-NC-4.0). Reasons, disclosed openly:

The base model is MIT, but the fine-tuning corpus includes sources whose license/provenance is unclear (~60 % of the from-scratch corpus lineage) and at least one non-commercial (NC) source (a Khan-Academy-derived math set). We therefore release the weights under a non-commercial license.
No third-party audio is redistributed here — only the trained weights.
If you need a commercially-clean model, retrain on a license-clean corpus.

Citation / context

Part of the TurkSpeak project (two-track Turkish ASR study: from-scratch RNN-T vs. Whisper transfer, with a byte-identical controlled pretraining ablation and a training-free LODR decoding result). Base model: Whisper [Radford et al., arXiv:2212.04356]. LoRA [Hu et al., arXiv:2106.09685].

Papers

Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision", arXiv:2212.04356 (Dec 6, 2022).
Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models", arXiv:2106.09685 (Jun 17, 2021).
