Thiomi NLP

Recent Activity

mutisya authored a paper about 2 months ago

The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages

mutisya authored a paper about 2 months ago

Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data

mutisya authored a paper about 2 months ago

Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering

Papers

Attention Sinks in Massively Multilingual Neural Machine Translation: Discovery, Analysis, and Mitigation

Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering

Organization Card

Thiomi is a research and engineering effort focused on closing the gap between major African languages and the NLP infrastructure that exists for English. We build:

Datasets — community-collected text and speech corpora for languages that aren't well represented in public scraped data
Models — morphological analyzers, ASR systems, and translation models trained on those corpora, with the architectures and recipes that work best for each language family
Methods — open recipes for cross-lingual transfer, zero-shot morphological discovery, and other techniques that let small target datasets do useful work

Collections 1

The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages

Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering

Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data

Attention Sinks in Massively Multilingual Neural Machine Translation: Discovery, Analysis, and Mitigation

Models 1

thiomi/bantumorph-v7

Datasets 1

thiomi/thiomi-5k

Team members 6
