English–Twi Code-Switching ASR Model — Kasanoma (wav2vec2)

Model Overview

This repository contains a checkpoint of facebook/wav2vec2-large-xlsr-53 fine-tuned for English–Twi code-switching speech transcription. The fine-tuning data is Kennethdot/Ghana_English-Twi_Code-switching_ASR, a bilingual dataset of utterances that mix English and Twi.

The model supports natural bilingual speech, including intra-sentential and inter-sentential code-switching.

How to Use

import torch
from datasets import load_dataset, Audio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model = Wav2Vec2ForCTC.from_pretrained("Kennethdot/kasanoma_wav2vec2")
processor = Wav2Vec2Processor.from_pretrained("Kennethdot/kasanoma_wav2vec2")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Load a sample from the test set
dataset = load_dataset(
    "Kennethdot/Ghana_English-Twi_Code-switching_ASR",
    split="test"
).cast_column("audio", Audio(sampling_rate=16000))

sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    padding=True
).to(device)

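# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: pick the most likely token for each frame
predicted_ids = torch.argmax(logits, dim=-1)

# group_tokens=True merges repeated frame predictions into single characters
transcription = processor.batch_decode(
    predicted_ids,
    group_tokens=True,
    skip_special_tokens=False
)[0].strip()

print(transcription)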
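
The decoding step above takes the arg-max token per frame and then groups tokens. As a rough illustration of what that grouping does, here is a minimal pure-Python sketch of greedy CTC collapsing. The toy vocabulary, the blank id of 0, and the use of "|" as a word delimiter are assumptions for illustration only; the checkpoint ships its own vocabulary, and the processor handles all of this internally.

```python
from itertools import groupby


def ctc_greedy_collapse(token_ids, blank_id=0):
    """Collapse frame-level CTC predictions: merge repeats, then drop blanks."""
    merged = [token for token, _ in groupby(token_ids)]
    return [token for token in merged if token != blank_id]


# Hypothetical toy vocabulary (not the vocabulary of the real checkpoint).
vocab = {0: "<pad>", 1: "m", 2: "a", 3: "|"}

# One predicted id per audio frame; 0 acts as the CTC blank here.
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 1, 0, 2]

collapsed = ctc_greedy_collapse(frames)
text = "".join(vocab[i] for i in collapsed).replace("|", " ")
print(text)
```

Merging repeats before removing blanks matters: a blank between two identical ids is what lets a genuinely doubled character survive.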

Model Details

| Property | Value |
| --- | --- |
| Task | English–Twi code-switching speech transcription |
| Base model | facebook/wav2vec2-large-xlsr-53 |
| Fine-tuning dataset | Kennethdot/Ghana_English-Twi_Code-switching_ASR |
| Dataset size | ~100 hours of English–Twi code-switched speech |
| Sampling rate | 16,000 Hz |

The dataset transcripts were normalised to remove punctuation marks (, ? . ! ; : " % ") that may destabilise training.
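
A minimal sketch of this kind of transcript normalisation is shown below. It is not the author's preprocessing script: the function name `normalise_transcript` is hypothetical, the curly-quote characters are an assumption about what the two quotation marks in the list stand for, and collapsing whitespace afterwards is an added convenience. Apostrophes and hyphens are left alone, since forms such as w'atry-i rely on them.

```python
import re

# Characters listed in the model card. The curly quotes are an assumption
# about what the two quotation marks in that list originally were.
PUNCTUATION = ',?.!;:"%\u201c\u201d'
_PUNCT_RE = re.compile("[" + re.escape(PUNCTUATION) + "]")


def normalise_transcript(text):
    """Remove the listed punctuation marks and collapse leftover whitespace."""
    text = _PUNCT_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


print(normalise_transcript("Adwuma no yɛ den dodo, I need a vacation."))
print(normalise_transcript("Nsuomnam yɛ dɛ paa, w'atry-i grilled tilapia?"))
```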

Evaluation Results

| Model | CS WER | Twi WER | English WER |
| --- | --- | --- | --- |
| Zero-shot XLSR-53 | 90.39 | 85.08 | 110.26 |
| Fine-tuned Model (Kasanoma) | 6.58 | 99.44 | 100.43 |

Note: The high monolingual WER scores reflect that this model is optimised for code-switched input. For purely Twi or purely English audio, a monolingual model is likely more appropriate.
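
For readers unfamiliar with the metric, the sketch below computes word error rate under the usual definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, expressed as a percentage. The model card does not say which tool produced the reported scores, so this is an assumption about the definition rather than a reproduction of the evaluation. It does show why a WER above 100 is possible: inserted words count as errors but do not enlarge the denominator.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed part of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


print(word_error_rate("i need a vacation", "i need vacation"))  # one deletion out of four words
print(word_error_rate("a b", "a b c d e"))                      # three insertions against two words
```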

Examples

The model produces fluent bilingual outputs with natural speech patterns:

  • Example 1: Ma yɛnkɔgye yɛn ani, it has been a long week.
  • Example 2: Adwuma no yɛ den dodo, I need a vacation.
  • Example 3: Nsuomnam yɛ dɛ paa, w'atry-i grilled tilapia?

Limitations

The model performs well on English–Twi mixed speech. Keep the following in mind:

  • Input length: wav2vec2 processes raw waveforms directly but memory usage scales with audio length. For long recordings, apply sliding-window chunking.
  • Out-of-distribution input: Performance may degrade on slang, idioms, informal Twi, spelling variation, proper names, or utterances far outside the training distribution.
  • Monolingual speech: The model is not optimised for purely English or purely Twi utterances.
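
The sliding-window chunking mentioned under input length can be sketched as follows. This is only an illustration of the windowing arithmetic: the helper name `sliding_windows` and the 10-second window with an 8-second stride are hypothetical choices, not settings recommended by the model authors. Each (start, end) pair would index into the waveform array before it is passed to the processor; merging the overlapping transcripts is a separate step that this sketch does not cover.

```python
def sliding_windows(num_samples, window, stride):
    """Yield (start, end) sample indices that cover the audio with overlap."""
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    if num_samples <= 0:
        return
    start = 0
    while True:
        end = min(start + window, num_samples)
        yield start, end
        if end >= num_samples:
            break
        start += stride


SAMPLE_RATE = 16_000  # matches the sampling rate the model expects

# Hypothetical settings: 10 s windows that overlap by 2 s, over a 25 s recording.
spans = list(sliding_windows(num_samples=25 * SAMPLE_RATE,
                             window=10 * SAMPLE_RATE,
                             stride=8 * SAMPLE_RATE))
for start, end in spans:
    print(f"{start / SAMPLE_RATE:.1f}s to {end / SAMPLE_RATE:.1f}s")
```

The transformers automatic-speech-recognition pipeline also offers built-in chunking through its `chunk_length_s` and `stride_length_s` arguments, which may be simpler than handling windows by hand.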

Human review is recommended for high-stakes use cases.

Ethical Considerations

  • Intended for research and educational use only
  • Should not be used for surveillance or unauthorized speech monitoring
  • Bias may exist due to dataset imbalance between languages