A Unigram Language Model tokenizer trained from scratch on the French (fr) subset
of the facebook/xnli dataset.
| Parameter | Value |
|---|---|
| Algorithm | Unigram LM (SentencePiece-style) |
| Vocabulary size | 8,000 |
| Special tokens | `<s>`, `<pad>`, `</s>`, `<unk>`, `<mask>` |
| Corpus | facebook/xnli / fr, all splits |
| Corpus size | 800,404 sentences |
| Normalizer | Nmt + NFC Unicode |
| Pre-tokenizer | Metaspace (▁ prefix) |
| Shrinking factor | 0.75 |
| Max piece length | 16 |
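The parameters above map fairly directly onto the Hugging Face `tokenizers` library. The card does not say which library or script produced this tokenizer, so the following is only a minimal sketch under that assumption. The `train` helper and the `SPECIAL_TOKENS` name are hypothetical, and loading the XNLI sentences is left to the caller.

```python
from typing import Iterable

from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

# Special tokens in the order listed in the table (assumed to be the training order).
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]

# Start from an empty Unigram model; training fills in the vocabulary.
tokenizer = Tokenizer(Unigram())
# Nmt cleanup followed by NFC Unicode normalization.
tokenizer.normalizer = normalizers.Sequence([normalizers.Nmt(), normalizers.NFC()])
# Metaspace marks word boundaries with the "▁" prefix.
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
tokenizer.decoder = decoders.Metaspace()

trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=SPECIAL_TOKENS,
    unk_token="<unk>",
    shrinking_factor=0.75,
    max_piece_length=16,
)


def train(sentences: Iterable[str]) -> Tokenizer:
    """Train the Unigram model on an iterable of raw sentences (hypothetical helper)."""
    tokenizer.train_from_iterator(sentences, trainer=trainer)
    return tokenizer
```

Calling `train(...)` with the French XNLI sentences would approximate the setup in the table, though the exact vocabulary depends on details the card does not specify, such as sentence order and deduplication.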

| Metric | Value |
|---|---|
| Tokens / char | 0.3037 |
| Fertility (tokens / word) | 1.8279 |
| Avg sequence length | 27.91 tokens |
| Vocabulary coverage | 1.0000 |
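The card does not define how these metrics were computed. The sketch below shows one plausible set of definitions, all of which are assumptions: tokens per character over the raw text, fertility as tokens per whitespace-separated word, mean tokens per sentence, and coverage as the share of tokens that are not `<unk>`. The `tokenizer_metrics` function and the toy `chunk_tokenize` tokenizer are hypothetical and exist only to keep the example self-contained.

```python
from typing import Callable, Dict, Iterable, List


def tokenizer_metrics(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
    unk_token: str = "<unk>",
) -> Dict[str, float]:
    """Compute corpus-level tokenizer statistics under the assumed definitions."""
    n_sentences = n_tokens = n_chars = n_words = n_unk = 0
    for sentence in sentences:
        pieces = tokenize(sentence)
        n_sentences += 1
        n_tokens += len(pieces)
        n_chars += len(sentence)
        n_words += len(sentence.split())
        n_unk += sum(1 for piece in pieces if piece == unk_token)
    return {
        "tokens_per_char": n_tokens / n_chars,
        "fertility": n_tokens / n_words,
        "avg_sequence_length": n_tokens / n_sentences,
        "vocabulary_coverage": 1.0 - n_unk / n_tokens,
    }


def chunk_tokenize(text: str, size: int = 4) -> List[str]:
    """Toy stand-in tokenizer: split each word into chunks of up to `size` characters."""
    return [
        word[i:i + size]
        for word in text.split()
        for i in range(0, len(word), size)
    ]


metrics = tokenizer_metrics(["Bonjour le monde", "Merci beaucoup"], chunk_tokenize)
print(metrics)
```

To measure the real tokenizer, `tokenizer.tokenize` from a loaded `transformers` tokenizer could be passed in place of `chunk_tokenize`. The results would match the table only if the card used the same definitions and corpus.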
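```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("NIRVLab/xnli-unigram-fr")
# return_tensors="pt" returns PyTorch tensors, so torch must be installed.
tokens = tokenizer("Bonjour le monde!", return_tensors="pt")
```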