English–Twi Code-Switching ASR Model — Kasanoma (wav2vec2)
Model Overview
This repository contains a checkpoint of facebook/wav2vec2-large-xlsr-53 fine-tuned for English–Twi code-switching speech transcription. The fine-tuning data is a realistic bilingual dataset of utterances that mix English and Twi.
The model supports natural bilingual speech, including intra-sentential and inter-sentential code-switching.
How to Use
import torch
from datasets import load_dataset, Audio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the fine-tuned model and its processor
model = Wav2Vec2ForCTC.from_pretrained("Kennethdot/kasanoma_wav2vec2")
processor = Wav2Vec2Processor.from_pretrained("Kennethdot/kasanoma_wav2vec2")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Load a sample from the test set
dataset = load_dataset(
    "Kennethdot/Ghana_English-Twi_Code-switching_ASR",
    split="test"
).cast_column("audio", Audio(sampling_rate=16000))
sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    padding=True
).to(device)

# Greedy CTC decoding: most likely token per frame, then collapse repeats
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(
    predicted_ids,
    group_tokens=True,
    skip_special_tokens=False
)[0].strip()
print(transcription)
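The `argmax` followed by `batch_decode(..., group_tokens=True)` is greedy CTC decoding: pick the most likely token for each frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that collapse in plain Python. The toy vocabulary, the `|` word delimiter, and the blank id `0` are illustrative assumptions, not the model's actual vocabulary.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one ships with the model's tokenizer.
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "m", 3: "a", 4: "e"}
BLANK_ID = 0  # assumption: the padding token doubles as the CTC blank


def ctc_greedy_collapse(frame_ids, vocab=TOY_VOCAB, blank_id=BLANK_ID):
    """Merge consecutive repeated ids, drop blanks, and map ids to text."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    kept = [token_id for token_id in merged if token_id != blank_id]
    text = "".join(vocab[token_id] for token_id in kept)
    # "|" marks word boundaries in this toy vocabulary.
    return text.replace("|", " ").strip()


# Per-frame argmax ids for a short utterance (made-up values).
frames = [2, 2, 0, 3, 3, 1, 1, 2, 0, 4, 4, 0, 4]
print(ctc_greedy_collapse(frames))  # prints "ma mee"
```

Note that the blank between the two runs of `4` is what lets the decoder emit a genuine double letter; without it the repeats would merge into one.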
Model Details
| Property | Value |
|---|---|
| Task | English–Twi code-switching speech transcription |
| Base model | facebook/wav2vec2-large-xlsr-53 |
| Fine-tuning dataset | Kennethdot/Ghana_English-Twi_Code-switching_ASR |
| Dataset size | ~100 hours of English–Twi code-switched speech |
| Sampling rate | 16,000 Hz |
The dataset transcripts were normalised by removing punctuation marks (, ? . ! ; : " % ") that may destabilise training.
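A minimal sketch of this kind of transcript normalisation is shown below. The authors' actual preprocessing script is not part of this card, so the function name `normalise_transcript`, the regular expression, and the whitespace clean-up are assumptions. Only the characters listed above are removed, so apostrophes and hyphens (as in w'atry-i) are kept.

```python
import re

# Characters listed in the model card; the exact pattern is an assumption.
PUNCTUATION_TO_REMOVE = re.compile(r'[,?.!;:"%]')


def normalise_transcript(text: str) -> str:
    """Strip the listed punctuation marks and collapse leftover whitespace."""
    stripped = PUNCTUATION_TO_REMOVE.sub("", text)
    return re.sub(r"\s+", " ", stripped).strip()


print(normalise_transcript("Adwuma no yɛ den dodo, I need a vacation."))
# prints "Adwuma no yɛ den dodo I need a vacation"
```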
Evaluation Results
| Model | Code-switched (CS) WER (%) | Twi WER (%) | English WER (%) |
|---|---|---|---|
| Zero-shot XLSR-53 | 90.39 | 85.08 | 110.26 |
| Fine-tuned Model (Kasanoma) | 6.58 | 99.44 | 100.43 |
Note: The high monolingual WER scores reflect that this model is optimised for code-switched input. For purely Twi or purely English audio, a monolingual model is likely more appropriate.
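For reference, word error rate is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. Because insertions are counted against the reference length, WER can exceed 100, as some values in the table above do, assuming the table reports percentages. The following is a plain-Python sketch of the metric, not the evaluation script used for this model; `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance over reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref_words)


print(word_error_rate("i need a vacation", "i need vacation"))  # prints 25.0
print(word_error_rate("yes", "yes it is"))                      # prints 200.0
```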
Examples
The model produces fluent bilingual outputs with natural speech patterns:
- Example 1 — Ma yɛnkɔgye yɛn ani, it has been a long week.
- Example 2 — Adwuma no yɛ den dodo, I need a vacation.
- Example 3 — Nsuomnam yɛ dɛ paa, w'atry-i grilled tilapia?
Limitations
The model performs well on English–Twi mixed speech. Keep the following in mind:
- Input length: wav2vec2 processes raw waveforms directly but memory usage scales with audio length. For long recordings, apply sliding-window chunking.
- Out-of-distribution input: Performance may degrade on slang, idioms, informal Twi, spelling variation, proper names, or utterances far outside the training distribution.
- Monolingual speech: The model is not optimised for purely English or purely Twi utterances.
Human review is recommended for high-stakes use cases.
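The sliding-window chunking mentioned under input length can be sketched as follows. The helper `chunk_bounds` is hypothetical, and the 20-second window and 2-second overlap are illustrative values, not settings recommended by the authors. It only computes sample index ranges; each slice of the waveform would then be transcribed separately and the overlapping text reconciled.

```python
def chunk_bounds(num_samples, sampling_rate=16000, chunk_s=20.0, overlap_s=2.0):
    """Return (start, end) sample indices of overlapping windows."""
    chunk = int(chunk_s * sampling_rate)
    step = chunk - int(overlap_s * sampling_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk_s must be positive and larger than overlap_s")
    if num_samples <= 0:
        return []

    bounds = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end >= num_samples:
            break
        start += step  # consecutive windows share overlap_s seconds of audio
    return bounds


# 50 seconds of 16 kHz audio split into 20 s windows with 2 s overlap.
for start, end in chunk_bounds(50 * 16000):
    print(start, end)
```

The Transformers automatic-speech-recognition pipeline also accepts `chunk_length_s` and `stride_length_s` arguments for CTC models, which may be a simpler option than slicing manually.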
Ethical Considerations
- Intended for research and educational use only
- Should not be used for surveillance or unauthorized speech monitoring
- Bias may exist due to dataset imbalance between languages
Model tree for Kennethdot/kasanoma_wav2vec2
Base model
facebook/wav2vec2-xls-r-300m