lid-lite-25

lid-lite-25 is a lightweight, CPU-only language identification model covering 25 languages, trained with fastText: a linear classifier over character n-grams, with no pretrained base and no GPU required.

  • Sub-millisecond inference and ~5-10 MB per checkpoint: built for high-volume pre-routing, not as a final accuracy-critical classifier.
  • Two checkpoints for two input lengths: one trained on long-form passages, one on short questions/queries, since a single model calibrates poorly across both.
  • 99.9% accuracy on long text, 97.3% on short text across all 25 languages (see Benchmarks).

For higher accuracy, see the sibling model lid-neural-25 (XLM-RoBERTa based). It needs a GPU to train, though inference runs fine on CPU.

πŸ—’οΈ Model Details

| Checkpoint | Trained on | Use for |
| --- | --- | --- |
| passages.bin | Long-form text (paragraph-length) | Documents, articles, passages |
| questions.bin | Short text (sentence/question-length) | Search queries, chat messages, short user input |

| Property | Value |
| --- | --- |
| Architecture | fastText linear classifier over character n-grams plus word bigrams (wordNgrams=2), dim=100 |
| Parameters | None in the neural-network sense: a linear model over n-gram hash buckets |
| File size | ~5-10 MB per checkpoint |
| Languages | 25 (see below) |
| Training data | olaverse/qg-passages-multi |
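
To make "a linear model over n-gram hash buckets" concrete, here is a minimal sketch of how character n-grams can be extracted and hashed into a fixed number of buckets. It is illustrative only: `minn`, `maxn`, and `num_buckets` are assumed values that this card does not state, and the hash is a plain 32-bit FNV-1a rather than a byte-exact copy of fastText's internal hashing.

```python
def char_ngrams(word, minn=2, maxn=4):
    """Return character n-grams of a word wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def fnv1a_32(text):
    """Plain 32-bit FNV-1a hash over the UTF-8 bytes of text."""
    h = 2166136261  # FNV offset basis
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, truncated to 32 bits
    return h


def bucket_ids(word, num_buckets=2_000_000, minn=2, maxn=4):
    """Map each n-gram to a bucket index; each bucket owns one embedding row."""
    return [fnv1a_32(g) % num_buckets for g in char_ngrams(word, minn, maxn)]


print(char_ngrams("tide", minn=3, maxn=3))  # ['<ti', 'tid', 'ide', 'de>']
```

Because distinct n-grams can land in the same bucket, model size is bounded by the bucket count rather than by the vocabulary, which is consistent with the small checkpoint sizes listed above.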

Languages: af am de en fr ha hi ig id it ja ko nl pl pt ru sn so es sw tr vi xh yo zu (ISO 639-1; ISO 639-3: afr amh deu eng fra hau hin ibo ind ita jpn kor nld pol por rus sna som spa swh tur vie xho yor zul)
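
Since the two code lists above are given in the same order, a lookup between them can be built by zipping them. The sketch below assumes predictions arrive as `__label__` plus an ISO 639-3 code, as in the usage output shown on this card; the helper name `to_iso1` is hypothetical.

```python
ISO_639_1 = (
    "af am de en fr ha hi ig id it ja ko nl pl pt ru sn so es sw tr vi xh yo zu"
).split()
ISO_639_3 = (
    "afr amh deu eng fra hau hin ibo ind ita jpn kor nld pol por rus sna som "
    "spa swh tur vie xho yor zul"
).split()

# Both lists are in the same order, so zip pairs each 639-3 code with its 639-1 code.
ISO3_TO_ISO1 = dict(zip(ISO_639_3, ISO_639_1))

LABEL_PREFIX = "__label__"


def to_iso1(label):
    """Convert a label such as '__label__eng' (or a bare 'eng') to 'en'."""
    code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
    return ISO3_TO_ISO1[code]  # raises KeyError for codes outside the 25 languages


print(to_iso1("__label__eng"))  # en
```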

πŸƒ Usage

import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download("olaverse/lid-lite-25", "questions.bin")  # or "passages.bin"
model = fasttext.load_model(model_path)

labels, probs = model.predict("What causes ocean tides?")
print(labels, probs)  # (('__label__eng',), array([0.999...]))

Use passages.bin for long-form text and questions.bin for short queries (see Model Details above). Both checkpoints share the same label set and interface.
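
When input length varies, one option is to load both checkpoints and choose between them per input. The sketch below is an assumption-laden example, not a recommended configuration: the 200-character cutoff and the helper names `pick_checkpoint` and `identify` are hypothetical. It also collapses whitespace first, because fastText's `predict` raises a `ValueError` on text containing a newline.

```python
LABEL_PREFIX = "__label__"


def pick_checkpoint(text, short_max_chars=200):
    """Choose a checkpoint by length; the 200-character cutoff is an assumption."""
    return "questions.bin" if len(text) <= short_max_chars else "passages.bin"


def identify(models, text):
    """Predict a language code for text.

    models: dict mapping checkpoint filename -> loaded fastText model.
    Returns (language_code, probability, checkpoint_used).
    """
    flat = " ".join(text.split())  # removes newlines, which predict() rejects
    name = pick_checkpoint(flat)
    labels, probs = models[name].predict(flat)
    label = labels[0]
    code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
    return code, float(probs[0]), name
```

Here `models` would be built by calling `fasttext.load_model` on each downloaded file, for example `{"questions.bin": ..., "passages.bin": ...}`.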

📊 Benchmarks

Held-out validation split (5% of training data, not seen during training).

passages.bin: overall accuracy 99.9% (n=2,498)

All 25 languages score F1 ≥ 0.994. Lowest: hau 0.995, xho 0.994, zul 0.994.

questions.bin: overall accuracy 97.3% (n=7,419)

| Language | Precision | Recall | F1 |
| --- | --- | --- | --- |
| jpn | 0.836 | 1.000 | 0.911 |
| xho | 0.814 | 0.748 | 0.780 |
| zul | 0.763 | 0.774 | 0.769 |
| amh | 1.000 | 0.946 | 0.972 |
| (remaining 21 languages) | ≥0.986 | ≥0.980 | ≥0.983 |

Full per-language table available in the linked training notebook.
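
As a sanity check on the questions.bin figures, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the four listed rows. Because the published precision and recall are already rounded to three decimals, the recomputed value can differ from the reported one in the last digit, so the comparison allows a tolerance of 0.001.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed for questions.bin
reported = {
    "jpn": (0.836, 1.000, 0.911),
    "xho": (0.814, 0.748, 0.780),
    "zul": (0.763, 0.774, 0.769),
    "amh": (1.000, 0.946, 0.972),
}

for lang, (p, r, f) in reported.items():
    computed = f1(p, r)
    # Inputs are rounded, so only agreement within 0.001 can be expected.
    assert abs(computed - f) <= 0.001, lang
    print(f"{lang}: reported {f:.3f}, recomputed {computed:.3f}")
```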

⚠️ Known limitations

  • Zulu/Xhosa confusion on short text: both score noticeably lower on questions.bin (F1 ~0.77-0.78) than every other language (all ≥0.91). Zulu and Xhosa are closely related Nguni languages with substantial shared vocabulary, and short text provides limited disambiguating signal. The confusion is not observed on long-form passages (both ≥0.994 F1 there). The sibling lid-neural-25 shows the same pattern, so this looks like a genuine linguistic difficulty rather than a model-specific weakness.
  • Japanese precision on short text (0.836): some short queries in other languages are misclassified as Japanese, giving high recall but lower precision.
  • Trained on machine/teacher-generated text (Aya-Collection-derived passages and LLM-generated questions), not human-authored social media or transliterated/code-switched text, so it is untested on those input styles.
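
One way a downstream router might guard against the first two limitations is to flag suspicious predictions for a second look instead of trusting them outright. The sketch below is a heuristic under stated assumptions, not part of the model: the thresholds (0.90 and 0.98), the helper names, and the script check for Japanese are all hypothetical, and the script check would wrongly flag romanized Japanese.

```python
LABEL_PREFIX = "__label__"
CONFUSABLE = {"zul", "xho"}  # the pair reported as confused on short text


def has_japanese_script(text):
    """True if text contains Hiragana, Katakana, or CJK Unified Ideographs."""
    return any(
        "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff" for ch in text
    )


def needs_review(text, labels, probs, min_prob=0.90, confusable_min_prob=0.98):
    """Decide whether a top-1 prediction should be double-checked.

    labels, probs: the pair returned by model.predict(text).
    Both thresholds are illustrative assumptions, not tuned values.
    """
    label = labels[0]
    top = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
    top_prob = float(probs[0])

    # Hold Zulu and Xhosa predictions to a stricter confidence bar.
    threshold = confusable_min_prob if top in CONFUSABLE else min_prob
    if top_prob < threshold:
        return True
    # A "Japanese" prediction with no Japanese script is suspicious.
    if top == "jpn" and not has_japanese_script(text):
        return True
    return False
```

Flagged inputs could then be sent to a slower, more accurate classifier, which fits the pre-routing role described at the top of this card.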

Training data & licensing

Trained from scratch (no pretrained base) on olaverse/qg-passages-multi (Apache-2.0). Released under Apache-2.0.
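
For readers who want to train a comparable classifier, fastText's supervised mode expects one example per line, with the label written as a `__label__` prefix followed by the text. The sketch below shows that formatting step and a possible training call. It is not the recipe used for this model: only `wordNgrams=2` and `dim=100` come from this card, while `minn=2`, `maxn=5`, the helper names, and the shape of `examples` are assumptions.

```python
def to_fasttext_line(lang, text):
    """Format one example as '__label__<lang> <text>' on a single line."""
    flat = " ".join(text.split())  # each example must occupy exactly one line
    return f"__label__{lang} {flat}"


def write_training_file(examples, path):
    """Write (lang, text) pairs to a fastText supervised training file."""
    with open(path, "w", encoding="utf-8") as f:
        for lang, text in examples:
            f.write(to_fasttext_line(lang, text) + "\n")


def train(path):
    """Train a classifier; minn and maxn here are assumed, not from the card."""
    import fasttext  # imported lazily so the helpers above work without it

    return fasttext.train_supervised(
        input=path, wordNgrams=2, dim=100, minn=2, maxn=5
    )


print(to_fasttext_line("eng", "What causes\nocean tides?"))
# __label__eng What causes ocean tides?
```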

Citation

@misc{lid-lite-25,
  title  = {lid-lite-25},
  author = {Olaverse},
  year   = {2026},
  url    = {https://huggingface.co/olaverse/lid-lite-25}
}