NIRVLab: Unigram Tokenizer for French XNLI

A Unigram Language Model tokenizer trained from scratch on the French (fr) subset of the facebook/xnli dataset.
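
For readers new to the algorithm: a Unigram LM tokenizer gives every vocabulary piece a probability and, at inference time, typically selects the segmentation with the highest total log-probability via a Viterbi search. The sketch below illustrates that search in plain Python. The vocabulary and probabilities in `TOY_LOG_PROBS` are hypothetical and are not taken from this tokenizer; the real implementation lives in the `tokenizers` library.

```python
import math


def viterbi_segment(text, log_probs):
    """Return the highest-scoring split of `text` into pieces found in `log_probs`."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-probability of text[:i]
    best_score[0] = 0.0
    back = [0] * (n + 1)                # start index of the last piece ending at i
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece not in log_probs:
                continue
            score = best_score[start] + log_probs[piece]
            if score > best_score[end]:
                best_score[end] = score
                back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Hypothetical toy vocabulary (illustration only).
TOY_LOG_PROBS = {
    "▁bonjour": math.log(0.02),
    "▁bon": math.log(0.1),
    "jour": math.log(0.1),
    "▁": math.log(0.05),
    "s": math.log(0.05),
}

print(viterbi_segment("▁bonjours", TOY_LOG_PROBS))  # ['▁bonjour', 's']
```

Here the single piece `▁bonjour` wins over `▁bon` + `jour` because its log-probability is higher than the sum of the two shorter pieces.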

Training Details

| Parameter | Value |
| --- | --- |
| Algorithm | Unigram LM (SentencePiece-style) |
| Vocabulary size | 8,000 |
| Special tokens | `<s>`, `<pad>`, `</s>`, `<unk>`, `<mask>` |
| Corpus | facebook/xnli, `fr` config, all splits |
| Corpus size | 800,404 sentences |
| Normalizer | Nmt + NFC Unicode |
| Pre-tokenizer | Metaspace (▁ prefix) |
| Shrinking factor | 0.75 |
| Max piece length | 16 |
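
The settings above map fairly directly onto the Hugging Face `tokenizers` API. The following is a minimal sketch of how such a tokenizer could be trained, not the script actually used for this model. It assumes that the corpus consists of the `premise` and `hypothesis` fields of every row in every split, a reading that is consistent with the reported 800,404 sentences but is not confirmed here. The helper names `iter_sentences`, `train_unigram`, and `train_on_xnli_fr`, and the output file name, are hypothetical.

```python
from typing import Iterable, Iterator

# Values copied from the Training Details table.
VOCAB_SIZE = 8000
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]
SHRINKING_FACTOR = 0.75
MAX_PIECE_LENGTH = 16


def iter_sentences(splits: Iterable[Iterable[dict]]) -> Iterator[str]:
    """Yield premise and hypothesis of every row (assumed corpus layout)."""
    for split in splits:
        for row in split:
            yield row["premise"]
            yield row["hypothesis"]


def train_unigram(sentences: Iterable[str]):
    """Train a Unigram tokenizer with the settings above (needs `tokenizers`)."""
    # Imported lazily so the constants and iterator work without the library.
    from tokenizers import (
        Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers,
    )

    tokenizer = Tokenizer(models.Unigram())
    tokenizer.normalizer = normalizers.Sequence(
        [normalizers.Nmt(), normalizers.NFC()]
    )
    tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
    tokenizer.decoder = decoders.Metaspace()
    trainer = trainers.UnigramTrainer(
        vocab_size=VOCAB_SIZE,
        special_tokens=SPECIAL_TOKENS,
        unk_token="<unk>",
        shrinking_factor=SHRINKING_FACTOR,
        max_piece_length=MAX_PIECE_LENGTH,
    )
    tokenizer.train_from_iterator(sentences, trainer=trainer)
    return tokenizer


def train_on_xnli_fr(output_path: str = "tokenizer.json"):
    """Download facebook/xnli (fr) and train on all splits (needs `datasets`)."""
    from datasets import load_dataset

    xnli_fr = load_dataset("facebook/xnli", "fr")
    tokenizer = train_unigram(iter_sentences(xnli_fr.values()))
    tokenizer.save(output_path)  # hypothetical output file name
    return tokenizer
```

Calling `train_on_xnli_fr()` downloads the dataset and writes a `tokenizer.json` file. Whether the result matches the published vocabulary exactly depends on details not stated in this card, such as library version and sentence order.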

Evaluation Metrics

| Metric | Value |
| --- | --- |
| Tokens / char | 0.3037 |
| Fertility (tokens / word) | 1.8279 |
| Avg sequence length | 27.91 tokens |
| Vocabulary coverage | 1.0000 |
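
The card does not state how these metrics were computed. The sketch below shows one plausible way to compute them, under several assumptions: words are whitespace-separated, all ratios are corpus-level totals rather than per-sentence averages, and coverage is the share of tokens that are not `<unk>`. The function name `tokenizer_metrics` is hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def tokenizer_metrics(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
    unk_token: str = "<unk>",
) -> Dict[str, float]:
    """Compute corpus-level tokenizer statistics (assumed definitions)."""
    n_sentences = n_chars = n_words = n_tokens = n_unk = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        n_sentences += 1
        n_chars += len(sentence)
        n_words += len(sentence.split())  # whitespace-separated words
        n_tokens += len(tokens)
        n_unk += sum(1 for token in tokens if token == unk_token)
    if n_chars == 0 or n_words == 0 or n_tokens == 0:
        raise ValueError("corpus is empty or produced no tokens")
    return {
        "tokens_per_char": n_tokens / n_chars,
        "fertility": n_tokens / n_words,
        "avg_sequence_length": n_tokens / n_sentences,
        "coverage": 1.0 - n_unk / n_tokens,
    }
```

With the published tokenizer loaded through `transformers`, `tokenizer.tokenize` can be passed as the `tokenize` argument. Results may differ slightly from the table if the original evaluation counted special tokens or used a different word definition.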

Usage

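from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("NIRVLab/xnli-unigram-fr")
# return_tensors="pt" returns PyTorch tensors, so torch must be installed.
tokens = tokenizer("Bonjour le monde!", return_tensors="pt")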
