Instructions to use RsGoksel/RsGoksel_ITU_Mainframe with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use RsGoksel/RsGoksel_ITU_Mainframe with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("automatic-speech-recognition", model="RsGoksel/RsGoksel_ITU_Mainframe")# Load model directly from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq processor = AutoProcessor.from_pretrained("RsGoksel/RsGoksel_ITU_Mainframe") model = AutoModelForSpeechSeq2Seq.from_pretrained("RsGoksel/RsGoksel_ITU_Mainframe") - Notebooks
- Google Colab
- Kaggle
RsGoksel_ITU_Mainframe — Turkish ASR (Whisper-large-v3-turbo, LoRA fine-tune)
A Turkish automatic-speech-recognition model: a LoRA fine-tune of
openai/whisper-large-v3-turbo
(809 M, 128-mel, 32 encoder / 4 decoder layers), merged to full weights.
Internal codename VH5.0d.
A CTranslate2 / faster-whisper build is at
RsGoksel/RsGoksel_ITU_Mainframe-ct2.
Results (author-measured)
Evaluated with faster-whisper (CTranslate2) + VAD + a reimplemented Turkish normalizer
(trnorm), references and hypotheses passed through the same normalizer. These are
author-measured numbers; they are not (yet) the official
ysdede/turkish_asr_leaderboard
figures (the board uses WhisperX + its own trnorm). Our pipeline was validated to be
~1.3 pt harsher than the board's (zero-shot turbo: our 13.50 % vs board 12.32 % on MediaSpeech),
so the board's own ruler would, if anything, score this model better.
| Dataset (set) | This model | Zero-shot turbo (ours) | Best public open model (board) |
|---|---|---|---|
| ymoslem/MediaSpeech (tr, 2513) | 10.49 % | 13.50 % | large-v3-turbo 12.32 % |
| erenfazlioglu/turkishvoicedataset (1000-utt sample) | 8.53 % | — | large-v3-turbo 10.40 % |
| ysdede/commonvoice_17_tr_fixed (test, 9650) | 9.77 % (CER 2.63 %) | 13.09 % | small-turkish-0 10.84 % |
CV17 note: the test split is held out for this model, but CommonVoice 17 has known train/test sentence overlap (kaldi #2141), and most Turkish CV models — including the board's — train on CV17; treat the CV17 number as a fair-among-CV-trained comparison, not a clean OOD number.
Usage
transformers
import torch, soundfile as sf
from transformers import WhisperForConditionalGeneration, WhisperProcessor
m = WhisperForConditionalGeneration.from_pretrained(
"RsGoksel/RsGoksel_ITU_Mainframe", torch_dtype=torch.float16).to("cuda").eval()
p = WhisperProcessor.from_pretrained("RsGoksel/RsGoksel_ITU_Mainframe")
audio, sr = sf.read("audio.wav") # 16 kHz mono
feats = p.feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda", torch.float16)
ids = m.generate(feats, language="tr", task="transcribe")
print(p.batch_decode(ids, skip_special_tokens=True)[0])
faster-whisper (use the -ct2 repo)
from faster_whisper import WhisperModel
model = WhisperModel("RsGoksel/RsGoksel_ITU_Mainframe-ct2", device="cuda", compute_type="float16")
segments, _ = model.transcribe("audio.wav", language="tr", beam_size=5, vad_filter=True)
print(" ".join(s.text for s in segments))
Training
- Base:
openai/whisper-large-v3-turbo(MIT). Method: LoRA (r=32, α=64, dropout 0.05, targetsq_proj,v_proj), merged to full weights. - Recipe: learning rate 1e-5, bf16, cosine schedule + 500 warmup, max-grad-norm 1.0,
early-stop on a broadcast (MSP) dev slice with
generation_config.language="tr"forced (HF Trainer eval otherwise auto-detects language and yields misleadingly high WER on broadcast audio). - Data: ~997 h Turkish (broadcast/podcast-heavy + CommonVoice 17 + FLEURS-train + a math set). A controlled ablation on the byte-identical 886 h subset matches the 997 h result, indicating the gain comes from Whisper pretraining, not fine-tuning-data volume.
⚠️ Data provenance & license
This model is released for research / non-commercial use (CC-BY-NC-4.0). Reasons, disclosed openly:
- The base model is MIT, but the fine-tuning corpus includes sources whose license/provenance is unclear (~60 % of the from-scratch corpus lineage) and at least one non-commercial (NC) source (a Khan-Academy-derived math set). We therefore release the weights under a non-commercial license.
- No third-party audio is redistributed here — only the trained weights.
- If you need a commercially-clean model, retrain on a license-clean corpus.
Citation / context
Part of the TurkSpeak project (two-track Turkish ASR study: from-scratch RNN-T vs. Whisper transfer, with a byte-identical controlled pretraining ablation and a training-free LODR decoding result). Base model: Whisper [Radford et al., arXiv:2212.04356]. LoRA [Hu et al., arXiv:2106.09685].
- Downloads last month
- 22
Model tree for RsGoksel/RsGoksel_ITU_Mainframe
Papers for RsGoksel/RsGoksel_ITU_Mainframe
LoRA: Low-Rank Adaptation of Large Language Models
Evaluation results
- WER (author harness, faster-whisper+VAD+trnorm) on MediaSpeech (Turkish)self-reported10.490
- WER (author harness) on turkishvoicedatasetself-reported8.530
- WER (author harness; see contamination note) on commonvoice_17_tr_fixed (test)test set self-reported9.770