lid-lite-25

lid-lite-25 is a lightweight, CPU-only language identification model covering 25 languages, trained with fastText — a linear classifier over character n-grams, no pretrained base, no GPU required.

Sub-millisecond inference, ~5-10MB per checkpoint — built for high-volume pre-routing, not as a final accuracy-critical classifier.
Two checkpoints for two input lengths: one trained on long-form passages, one on short questions/queries, since a single model calibrates poorly across both.
99.9% accuracy on long text, 97.3% on short text across all 25 languages (see Benchmarks).

For higher accuracy, see the sibling model lid-neural-25.1 (XLM-RoBERTa based). It needs a GPU to train, but inference runs fine on CPU.

🗒️ Model Details

Checkpoint	Trained on	Use for
`passages.bin`	Long-form text (paragraph-length)	Documents, articles, passages
`questions.bin`	Short text (sentence/question-length)	Search queries, chat messages, short user input


Architecture	fastText linear classifier, character n-grams (`wordNgrams=2`, `dim=100`)
Parameters	no deep network layers: an embedding table over hashed n-gram buckets feeding a single linear output layer
File size	~5-10MB per checkpoint
Languages	25 (see below)
Training data	`olaverse/qg-passages-multi`
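
To make "a linear model over n-gram hash buckets" concrete, here is a minimal sketch of how character n-grams can be extracted and hashed into a fixed number of buckets. It is an illustration, not the library's internal code: the function names are hypothetical, the hash is plain 32-bit FNV-1a, and the `minn`, `maxn` and `num_buckets` values are assumptions chosen for the example. In fastText, `wordNgrams` sets the word n-gram length while character n-gram lengths come from `minn`/`maxn`; this card lists only `wordNgrams=2` and `dim=100`.

```python
from typing import List

FNV_OFFSET = 2166136261  # 32-bit FNV-1a offset basis
FNV_PRIME = 16777619     # 32-bit FNV-1a prime


def char_ngrams(word: str, minn: int, maxn: int) -> List[str]:
    """Return character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def fnv1a_32(text: str) -> int:
    """Plain 32-bit FNV-1a over the UTF-8 bytes of the text."""
    h = FNV_OFFSET
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


def bucket_ids(word: str, minn: int = 2, maxn: int = 4,
               num_buckets: int = 2_000_000) -> List[int]:
    """Map each n-gram to a bucket index (all defaults here are illustrative)."""
    return [fnv1a_32(g) % num_buckets for g in char_ngrams(word, minn, maxn)]


if __name__ == "__main__":
    print(char_ngrams("tide", 2, 3))
    print(bucket_ids("tide"))
```

Each bucket index selects one row of the embedding table; the rows for a text are averaged and passed to the linear output layer. Because distinct n-grams can collide in the same bucket, the file size stays bounded regardless of vocabulary size.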

Languages: af am de en fr ha hi ig id it ja ko nl pl pt ru sn so es sw tr vi xh yo zu (ISO 639-1; ISO 639-3: afr amh deu eng fra hau hin ibo ind ita jpn kor nld pol por rus sna som spa swh tur vie xho yor zul)
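
If downstream code expects two-letter codes, the two lists above can be zipped into a lookup table. This is a small sketch: the helper name `label_to_iso639_1` is hypothetical, and it assumes the model emits labels of the form `__label__<ISO 639-3 code>`, as the `__label__eng` output in the Usage section suggests.

```python
# Both lists are copied from this card and are in matching order.
ISO_639_1 = (
    "af am de en fr ha hi ig id it ja ko nl pl pt ru sn so es sw tr vi xh yo zu"
).split()
ISO_639_3 = (
    "afr amh deu eng fra hau hin ibo ind ita jpn kor nld pol por rus sna som "
    "spa swh tur vie xho yor zul"
).split()

TO_ISO_639_1 = dict(zip(ISO_639_3, ISO_639_1))
LABEL_PREFIX = "__label__"  # fastText's default label prefix


def label_to_iso639_1(label: str) -> str:
    """Convert a label such as '__label__eng' to its ISO 639-1 code ('en')."""
    code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
    return TO_ISO_639_1[code]  # raises KeyError for codes outside the 25


if __name__ == "__main__":
    print(label_to_iso639_1("__label__eng"))
```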

🏃 Usage

import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download("olaverse/lid-lite-25", "questions.bin")  # or "passages.bin"
model = fasttext.load_model(model_path)

labels, probs = model.predict("What causes ocean tides?")
print(labels, probs)  # (('__label__eng',), array([0.999...]))

Use passages.bin for long-form text, questions.bin for short queries — see Model Details above. Both checkpoints share the same label set and interface.
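
When inputs of mixed length arrive on the same path, the checkpoint choice can be automated. The sketch below is one possible approach, not part of the released model: the helper names are hypothetical, and the 200-character cutoff is an arbitrary assumption that should be tuned on your own traffic. It also flattens whitespace first, because fastText's `predict` raises an error on strings that contain a newline.

```python
from typing import Tuple


def pick_checkpoint(text: str, max_short_chars: int = 200) -> str:
    """Choose a checkpoint file name by input length (cutoff is an assumption)."""
    return "questions.bin" if len(text) <= max_short_chars else "passages.bin"


def predict_language(model, text: str) -> Tuple[str, float]:
    """Return (language code, probability) for the top prediction.

    `model` is any object with a fastText-style predict(text) method
    that returns (labels, probabilities).
    """
    clean = " ".join(text.split())  # collapse newlines and repeated spaces
    labels, probs = model.predict(clean)
    return labels[0].replace("__label__", "", 1), float(probs[0])
```

In practice you would load both checkpoints once at startup, keep them in a dict keyed by file name, and call `predict_language(models[pick_checkpoint(text)], text)` per request.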

📊 Benchmarks

Held-out validation split (5% of the dataset, set aside before training and never seen by the model).

`passages.bin` — overall accuracy: 99.9% (n=2,498)

All 25 languages score F1 ≥ 0.994. Lowest: hau 0.995, xho 0.994, zul 0.994.

`questions.bin` — overall accuracy: 97.3% (n=7,419)

Language	Precision	Recall	F1
jpn	0.836	1.000	0.911
xho	0.814	0.748	0.780
zul	0.763	0.774	0.769
amh	1.000	0.946	0.972
(remaining 21 languages)	≥0.986	≥0.980	≥0.983

Full per-language table available in the linked training notebook.
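
The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick sanity check, the rows above can be recomputed from the rounded precision and recall values. The recomputed figure can differ from the reported one in the last digit, since the reported F1 was presumably derived from unrounded inputs.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the questions.bin table
ROWS = {
    "jpn": (0.836, 1.000, 0.911),
    "xho": (0.814, 0.748, 0.780),
    "zul": (0.763, 0.774, 0.769),
    "amh": (1.000, 0.946, 0.972),
}

if __name__ == "__main__":
    for lang, (p, r, reported) in ROWS.items():
        print(f"{lang}: recomputed F1 = {f1_score(p, r):.3f}, reported = {reported:.3f}")
```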

⚠️ Known limitations

Zulu/Xhosa confusion on short text: both score noticeably lower on questions.bin (F1 0.77-0.78) than every other language (all ≥0.91). Zulu and Xhosa are closely related Nguni languages with substantial shared vocabulary, so short text provides limited disambiguating signal. The confusion does not appear on long-form passages (both ≥0.994 F1 there). The sibling lid-neural-25.2 shows the same pattern, which suggests a genuine linguistic difficulty rather than a model-specific weakness.
Japanese precision on short text (0.836) — some other-language short queries are misclassified as Japanese (high recall, lower precision).
Trained on machine/teacher-generated text (Aya-Collection-derived passages and LLM-generated questions), not human-authored social media or transliterated/code-switched text — untested on those input styles.
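
One way to work around the short-text weaknesses above in a pre-routing pipeline is to demand higher confidence for the affected labels and hand uncertain cases to a slower second pass. The sketch below illustrates that idea only: the function name, the label set, and both thresholds are hypothetical assumptions, not values validated against this model.

```python
from typing import Optional, Tuple

# Labels the benchmarks flag as weaker on short text (xho/zul F1, jpn precision)
WEAK_ON_SHORT_TEXT = {"xho", "zul", "jpn"}


def needs_second_opinion(
    model,
    text: str,
    min_prob: float = 0.5,         # assumed default threshold
    strict_min_prob: float = 0.9,  # assumed stricter threshold for weak labels
) -> Tuple[Optional[str], float, bool]:
    """Return (language, probability, escalate) for a short input.

    `model` is any object with a fastText-style predict(text) method.
    `escalate` is True when the prediction should go to a second pass.
    """
    clean = " ".join(text.split())  # fastText rejects newlines in input
    labels, probs = model.predict(clean)
    if not labels:
        return None, 0.0, True
    lang = labels[0].replace("__label__", "", 1)
    prob = float(probs[0])
    threshold = strict_min_prob if lang in WEAK_ON_SHORT_TEXT else min_prob
    return lang, prob, prob < threshold
```

Whether the escalated inputs go to a larger model, a longer context window, or a human reviewer depends on the pipeline. Either way, thresholds like these are best chosen from a validation set that resembles production input.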

Training data & licensing

Trained from scratch (no pretrained base) on olaverse/qg-passages-multi (Apache-2.0). Released under Apache-2.0.

Citation

@misc{lid-lite-25,
  title  = {lid-lite-25},
  author = {Olaverse},
  year   = {2026},
  url    = {https://huggingface.co/olaverse/lid-lite-25}
}
