English–Twi Code-Switching ASR Model — Kasanoma (wav2vec2)

Model Overview

This repository contains a fine-tuned checkpoint of facebook/wav2vec2-large-xlsr-53 for English–Twi code-switching speech transcription. It was fine-tuned on a bilingual dataset of utterances that mix English and Twi.

The model supports natural bilingual speech, including intra-sentential code-switching (switching languages within a sentence) and inter-sentential code-switching (switching between sentences).

How to Use

import torch
from datasets import load_dataset, Audio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model = Wav2Vec2ForCTC.from_pretrained("Kennethdot/kasanoma_wav2vec2")
processor = Wav2Vec2Processor.from_pretrained("Kennethdot/kasanoma_wav2vec2")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Load a sample from the test set
dataset = load_dataset(
    "Kennethdot/Ghana_English-Twi_Code-switching_ASR",
    split="test"
).cast_column("audio", Audio(sampling_rate=16000))

sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    padding=True
).to(device)

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)

transcription = processor.batch_decode(
    predicted_ids,
    group_tokens=True,
    skip_special_tokens=False
)[0].strip()

print(transcription)

Model Details

Property	Value
Task	English–Twi code-switching speech transcription
Base model	`facebook/wav2vec2-large-xlsr-53`
Fine-tuning dataset	`Kennethdot/Ghana_English-Twi_Code-switching_ASR`
Dataset size	~100 hours of English–Twi code-switched speech
Sampling rate	16,000 Hz

The dataset transcripts were normalised to remove punctuation marks (, ? . ! ; : " % ") that may destabilize training.
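
The normalisation step described above can be sketched roughly as follows. This is an illustration, not the script used to build the dataset: the function name `normalise_transcript` is hypothetical, and treating the two quote marks in the list as curly quotes is an assumption. Only the listed characters are stripped, so apostrophes and hyphens (as in "w'atry-i") are kept, and no lowercasing is applied because the card does not mention it.

```python
import re

# Punctuation listed in the model card. The straight double quote comes from
# the card; the curly quotes (\u201c, \u201d) are an ASSUMPTION about what the
# repeated quote mark in the list originally was.
PUNCT_TO_REMOVE = ',?.!;:"%\u201c\u201d'
_PUNCT_RE = re.compile("[" + re.escape(PUNCT_TO_REMOVE) + "]")


def normalise_transcript(text: str) -> str:
    """Strip the listed punctuation and collapse runs of whitespace."""
    stripped = _PUNCT_RE.sub("", text)
    return " ".join(stripped.split())


print(normalise_transcript("Adwuma no yɛ den dodo, I need a vacation."))
```

Applying the same function to reference transcripts at evaluation time keeps references and model output comparable.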

Evaluation Results

Model	CS WER (%)	Twi WER (%)	English WER (%)
Zero-shot XLSR-53	90.39	85.08	110.26
Fine-tuned Model (Kasanoma)	6.58	99.44	100.43

Note: The high monolingual WER scores reflect that this model is optimised for code-switched input. For purely Twi or purely English audio, a monolingual model is likely more appropriate.
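
Some of the WER values in the table are above 100. That is possible because WER counts substitutions, deletions, and insertions against the number of reference words, so a hypothesis with many extra words can exceed 100%. The sketch below is a minimal word-level implementation for a single utterance. It is not the evaluation script behind the table, which is not described in this card, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length, in percent."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One deleted word out of four reference words.
print(word_error_rate("I need a vacation", "I need vacation"))
# Three inserted words against a one-word reference gives a WER above 100.
print(word_error_rate("yes", "yes it is so"))
```

Corpus-level WER is usually computed by summing edits and reference words over all utterances rather than averaging per-utterance scores, so this function would need to be adapted for a full test set.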

Examples

The model produces fluent bilingual outputs with natural speech patterns:

Example 1 — Ma yɛnkɔgye yɛn ani, it has been a long week.
Example 2 — Adwuma no yɛ den dodo, I need a vacation.
Example 3 — Nsuomnam yɛ dɛ paa, w'atry-i grilled tilapia?

Limitations

The model performs well on English–Twi mixed speech. Keep the following in mind:

Input length: wav2vec2 processes raw waveforms directly, so memory usage grows with audio length. For long recordings, split the audio into overlapping windows (sliding-window chunking) and transcribe each chunk separately.
Out-of-distribution input: Performance may degrade on slang, idioms, informal Twi, spelling variation, proper names, or utterances far outside the training distribution.
Monolingual speech: The model is not optimised for purely English or purely Twi utterances.
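
The sliding-window chunking mentioned under input length could be sketched as below. This is one possible approach, not a procedure prescribed by the model authors: the helper `sliding_windows` and the window and overlap sizes in the example are hypothetical choices. It only computes sample ranges; each slice `array[start:end]` would then be passed through the processor and model as in How to Use.

```python
from typing import Iterator, Tuple


def sliding_windows(
    num_samples: int, window: int, overlap: int
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample indices of overlapping windows."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        yield start, end
        if end == num_samples:
            break  # the last window already reaches the end of the audio
        start += step


# Example values (assumptions): 20 s windows with 2 s overlap at 16,000 Hz,
# applied to a 45 s recording.
sampling_rate = 16_000
for start, end in sliding_windows(45 * sampling_rate, 20 * sampling_rate, 2 * sampling_rate):
    print(start / sampling_rate, end / sampling_rate)
```

Concatenating chunk transcripts naively can duplicate words that fall inside the overlap, so the results need merging. The transformers automatic-speech-recognition pipeline offers `chunk_length_s` and `stride_length_s` arguments that handle chunking and merging for CTC models, which may be simpler than doing it by hand.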

Human review is recommended for high-stakes use cases.

Ethical Considerations

Intended for research and educational use only
Should not be used for surveillance or unauthorized speech monitoring
Bias may exist due to dataset imbalance between languages

Safetensors
Model size: 0.3B params
Tensor type: F32

Model tree for Kennethdot/kasanoma_wav2vec2
Base model: facebook/wav2vec2-xls-r-300m