AfriHuBERT: A self-supervised speech representation model for African languages

Model description

This is a compact multilingual self-supervised speech encoder (HuBERT-style) trained for one iteration on over 10,000 hours of African-language data aggregated from various sources. The paper refers to this model as AfriHuBERT-s. Stronger variants that use mHuBERT-147 as the backbone are available as AfriHuBERT-o and AfriHuBERT-n.
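Because the encoder is HuBERT-style, it helps to know how many feature vectors to expect for a clip of a given length. The sketch below is a rough estimate only: it assumes 16 kHz input and the convolutional front end of the standard HuBERT base architecture (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2). Neither assumption is confirmed by this card for AfriHuBERT-s, so check the model configuration before relying on it. The names `CONV_KERNELS`, `CONV_STRIDES`, and `num_frames` are illustrative.

```python
# Assumption: standard HuBERT base convolutional feature extractor.
# These values are NOT taken from this model card; verify against the
# model's own configuration.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples: int) -> int:
    """Estimate the number of encoder frames for a raw waveform length.

    Applies the unpadded 1-D convolution length formula
    out = (in - kernel) // stride + 1 once per layer.
    """
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        if length < kernel:
            return 0  # clip is too short to produce any frame
        length = (length - kernel) // stride + 1
    return length


if __name__ == "__main__":
    sample_rate = 16_000  # assumed input sampling rate in Hz
    for seconds in (1, 5, 10):
        print(seconds, "s ->", num_frames(seconds * sample_rate), "frames")
```

Under these assumptions the total stride is 320 samples, so each frame covers roughly 20 ms of audio.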

Pretraining data

Dataset: AfriHuBERT was trained on data from 11 major sources, including BibleTTS, Kallaama, MMS Ulab v2, NaijaVoices, and NCHLT. All sources and their licenses are shown in the table below. Please refer to the paper for more information.

Language Coverage

AfriHuBERT covers 1,230 languages in total, including 1,226 indigenous African languages.

BibTeX entry and citation info.

@misc{alabi2024afrihubertselfsupervisedspeechrepresentation,
      title={AfriHuBERT: A self-supervised speech representation model for African languages}, 
      author={Jesujoba O. Alabi and Xuechen Liu and Dietrich Klakow and Junichi Yamagishi},
      year={2024},
      eprint={2409.20201},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.20201}, 
}

Paper for ajesujoba/AfriHuBERTs

AfriHuBERT: A self-supervised speech representation model for African languages (arXiv 2409.20201, published Sep 30, 2024)