- The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages (Paper 2603.29244)
- Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering (Paper 2604.22723)
- Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data (Paper 2604.22730)
- Attention Sinks in Massively Multilingual Neural Machine Translation: Discovery, Analysis, and Mitigation (Paper 2605.01229)
Thiomi NLP
Organization Card
Thiomi is a research and engineering effort focused on closing the gap between the NLP infrastructure available for major African languages and what exists for English. We build:
- Datasets — community-collected text and speech corpora for languages that aren't well represented in public scraped data
- Models — morphological analyzers, ASR systems, and translation models trained on those corpora, with the architectures and recipes that work best for each language family
- Methods — open recipes for cross-lingual transfer, zero-shot morphological discovery, and other techniques that let small target datasets do useful work