lid-lite-25
lid-lite-25 is a lightweight, CPU-only language identification model covering 25 languages, trained with fastText: a linear classifier over character n-grams, with no pretrained base and no GPU required.
- Sub-millisecond inference, ~5-10 MB per checkpoint: built for high-volume pre-routing, not as a final accuracy-critical classifier.
- Two checkpoints for two input lengths: one trained on long-form passages, one on short questions/queries, since a single model calibrates poorly across both.
- 99.9% accuracy on long text, 97.3% on short text across all 25 languages (see Benchmarks).
For higher accuracy, at the cost of needing a GPU for training (inference still runs fine on CPU), see the sibling model lid-neural-25.1 (XLM-RoBERTa based).
Model Details
| Checkpoint | Trained on | Use for |
|---|---|---|
| passages.bin | Long-form text (paragraph-length) | Documents, articles, passages |
| questions.bin | Short text (sentence/question-length) | Search queries, chat messages, short user input |

| Property | Value |
|---|---|
| Architecture | fastText linear classifier, character n-grams (wordNgrams=2, dim=100) |
| Parameters | none in the neural-network sense: a linear model over n-gram hash buckets |
| File size | ~5-10 MB per checkpoint |
| Languages | 25 (see below) |
| Training data | olaverse/qg-passages-multi |
Languages: af am de en fr ha hi ig id it ja ko nl pl pt ru sn so es sw tr vi xh yo zu (ISO 639-1; ISO 639-3: afr amh deu eng fra hau hin ibo ind ita jpn kor nld pol por rus sna som spa swh tur vie xho yor zul)
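If downstream code expects two-letter codes, the two lists above can be zipped into a lookup table. This is a minimal sketch that assumes the model emits labels of the form `__label__<ISO 639-3 code>`, as the `__label__eng` output in the Usage section suggests; the names `ISO3_TO_ISO1` and `label_to_iso1` are illustrative and not part of the model or of fastText.

```python
# Both lists are copied from the card, in the same order.
ISO1 = ("af am de en fr ha hi ig id it ja ko nl pl pt ru sn so es sw "
        "tr vi xh yo zu").split()
ISO3 = ("afr amh deu eng fra hau hin ibo ind ita jpn kor nld pol por rus "
        "sna som spa swh tur vie xho yor zul").split()

ISO3_TO_ISO1 = dict(zip(ISO3, ISO1))

# Assumption: labels look like "__label__eng" (fastText's default prefix).
LABEL_PREFIX = "__label__"


def label_to_iso1(label: str) -> str:
    """Convert a fastText label (or a bare ISO 639-3 code) to ISO 639-1."""
    code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
    return ISO3_TO_ISO1[code]  # raises KeyError for codes outside the 25


print(label_to_iso1("__label__eng"))
```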
Usage
```python
import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download("olaverse/lid-lite-25", "questions.bin")  # or "passages.bin"
model = fasttext.load_model(model_path)

labels, probs = model.predict("What causes ocean tides?")
print(labels, probs)  # (('__label__eng',), array([0.999...]))
```
Use passages.bin for long-form text and questions.bin for short queries (see
Model Details above). Both checkpoints share the same label set and interface.
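When input length varies, the checkpoint choice can be automated. The sketch below is one possible approach, not something shipped with the model: the 200-character cutoff is a hypothetical value (the card does not define where "short" ends), and `pick_checkpoint`, `load_checkpoint`, and `identify` are illustrative names.

```python
from functools import lru_cache
from typing import Tuple

REPO_ID = "olaverse/lid-lite-25"

# Hypothetical cutoff between "short" and "long" input; tune on your data.
SHORT_TEXT_MAX_CHARS = 200


def pick_checkpoint(text: str, max_short_chars: int = SHORT_TEXT_MAX_CHARS) -> str:
    """Return the checkpoint filename suited to the input length."""
    if len(text.strip()) <= max_short_chars:
        return "questions.bin"
    return "passages.bin"


@lru_cache(maxsize=2)
def load_checkpoint(filename: str):
    """Download and load a checkpoint once, then reuse it.

    Imports are deferred so pick_checkpoint works without fastText installed.
    """
    import fasttext
    from huggingface_hub import hf_hub_download

    return fasttext.load_model(hf_hub_download(REPO_ID, filename))


def identify(text: str) -> Tuple[str, float]:
    """Return the top label and its probability for one piece of text."""
    # fastText's predict() rejects newlines, so collapse all whitespace first.
    flat = " ".join(text.split())
    labels, probs = load_checkpoint(pick_checkpoint(flat)).predict(flat)
    return labels[0], float(probs[0])
```

Calling `identify("What causes ocean tides?")` would route to questions.bin, while a multi-paragraph document would route to passages.bin.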
Benchmarks
Held-out validation split (5% of training data, not seen during training).
passages.bin: overall accuracy 99.9% (n=2,498)
All 25 languages score F1 ≥ 0.994. Lowest: hau 0.995, xho 0.994, zul 0.994.
questions.bin: overall accuracy 97.3% (n=7,419)
| Language | Precision | Recall | F1 |
|---|---|---|---|
| jpn | 0.836 | 1.000 | 0.911 |
| xho | 0.814 | 0.748 | 0.780 |
| zul | 0.763 | 0.774 | 0.769 |
| amh | 1.000 | 0.946 | 0.972 |
| (remaining 21 languages) | ≥0.986 | ≥0.980 | ≥0.983 |
Full per-language table available in the linked training notebook.
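For readers checking the table, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the four rows listed above; because the published precision and recall are already rounded to three decimals, a recomputed value can differ from the reported one in the last digit. The function name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for questions.bin, copied from the table.
reported = {
    "jpn": (0.836, 1.000, 0.911),
    "xho": (0.814, 0.748, 0.780),
    "zul": (0.763, 0.774, 0.769),
    "amh": (1.000, 0.946, 0.972),
}

for lang, (p, r, f1) in reported.items():
    print(f"{lang}: recomputed {f1_score(p, r):.3f}, reported {f1:.3f}")
```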
Known limitations
- Zulu/Xhosa confusion on short text: both score noticeably lower on questions.bin (F1 ~0.77-0.78) than every other language (all ≥0.91). Zulu and Xhosa are closely related Nguni languages with substantial shared vocabulary, and short text provides limited disambiguating signal. The confusion is not observed on long-form passages (both ≥0.994 F1 there). The sibling lid-neural-25.2 shows the same pattern, so this looks like a genuine linguistic difficulty rather than a model-specific weakness.
- Japanese precision on short text (0.836): some short queries in other languages are misclassified as Japanese (high recall, lower precision).
- Trained on machine/teacher-generated text (Aya-Collection-derived passages and LLM-generated questions), not human-authored social media or transliterated/code-switched text, so it is untested on those input styles.
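One way a caller might guard against the first two limitations is to request the top two predictions and fall back when the result is ambiguous. The following is a sketch only: the margin and probability thresholds are hypothetical and untuned, the `__label__xho`, `__label__zul`, and `__label__jpn` label names are assumed by analogy with the `__label__eng` output shown in Usage, and `resolve` with its fallback strings is an illustrative helper rather than part of the model.

```python
# Assumed label names for the closely related Nguni pair.
NGUNI_PAIR = {"__label__xho", "__label__zul"}


def resolve(labels, probs, min_margin=0.30, min_jpn_prob=0.90):
    """Post-process top-k fastText output into a label or a fallback marker.

    labels/probs are the two sequences returned by model.predict(text, k=2).
    Both thresholds are hypothetical defaults; tune them on your own data.
    """
    top, top_p = labels[0], float(probs[0])

    # Zulu vs. Xhosa: if both lead and the gap is small, do not pick one.
    if len(labels) > 1:
        second, second_p = labels[1], float(probs[1])
        if {top, second} == NGUNI_PAIR and top_p - second_p < min_margin:
            return "xho_or_zul"

    # Japanese has lower precision on short text: require higher confidence.
    if top == "__label__jpn" and top_p < min_jpn_prob:
        return "uncertain"

    return top
```

With a loaded model this would be used as `resolve(*model.predict(text, k=2))`; ambiguous inputs could then be passed to a slower, more accurate classifier.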
Training data & licensing
Trained from scratch (no pretrained base) on
olaverse/qg-passages-multi
(Apache-2.0). Released under Apache-2.0.
Citation
@misc{lid-lite-25,
title = {lid-lite-25},
author = {Olaverse},
year = {2026},
url = {https://huggingface.co/olaverse/lid-lite-25}
}
