NIRVLab - MorpheL Tokenizer for Turkish XNLI
MorpheL is an MI-guided stochastic segmentation tokenizer for morphologically rich, low-resource languages (proposal: "MorpheL: Morphology-Aware Tokenizer Adaptation for Pretrained Models in Low-Resource Languages").
Trained on the Turkish (tr) subset of facebook/xnli, all splits.
Algorithm
MorpheL scores candidate intra-word boundaries via pointwise mutual information (MI) between prefix and suffix substrings, then stochastically selects the number of cuts via Gumbel perturbation (Eqs. 6-8). Key distinctions from BPE-Dropout:
- Randomness is MI-informed: only high-MI boundaries enter the candidate pool
- Vowel-consonant transition heuristic pre-filters linguistically implausible positions
- Global MI table aggregated over full corpus (not per-sentence) for stability
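
The scoring and sampling steps above can be sketched roughly as follows. This is a minimal illustration, not the MorpheL implementation: the function names (`build_counts`, `pmi`, `segment`), the count-based PMI estimator, and the exact way Gumbel noise decides how many cuts survive are all assumptions, since Eqs. 6-8 are not reproduced in this card. The vowel-consonant pre-filter is left out for brevity.

```python
import math
import random
from collections import Counter


def build_counts(words):
    """Aggregate prefix, suffix, and split counts over the whole corpus (global table)."""
    prefix, suffix, split = Counter(), Counter(), Counter()
    total = 0
    for w in words:
        for i in range(1, len(w)):
            prefix[w[:i]] += 1
            suffix[w[i:]] += 1
            split[(w[:i], w[i:])] += 1
            total += 1
    return prefix, suffix, split, total


def pmi(word, i, counts):
    """Pointwise MI between word[:i] and word[i:] (assumed count-based estimator)."""
    prefix, suffix, split, total = counts
    p, s = word[:i], word[i:]
    joint = split[(p, s)]
    if joint == 0:
        return float("-inf")  # unseen split: never a candidate
    return math.log(joint * total / (prefix[p] * suffix[s]))


def gumbel(rng):
    """Draw one standard Gumbel sample."""
    u = max(rng.random(), 1e-12)  # avoid log(0)
    return -math.log(-math.log(u))


def segment(word, counts, top_k=4, temperature=1.0, mi_threshold=0.0, rng=random):
    """Cut a word at up to top_k boundaries whose perturbed MI exceeds the threshold."""
    scored = []
    for i in range(1, len(word)):
        score = pmi(word, i, counts)
        if score > mi_threshold:  # only high-MI boundaries enter the pool
            noise = temperature * gumbel(rng) if temperature > 0 else 0.0
            scored.append((score + noise, i))
    # Assumption: noise may push a candidate below the threshold, so the
    # number of cuts varies between calls when temperature > 0.
    kept = sorted((s, i) for s, i in scored if s > mi_threshold)[::-1][:top_k]
    cuts = sorted(i for _, i in kept)
    pieces, start = [], 0
    for i in cuts:
        pieces.append(word[start:i])
        start = i
    pieces.append(word[start:])
    return pieces
```

With `temperature=0` the noise term vanishes and the same word always yields the same pieces, which matches the deterministic setting described for inference.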
Training Config
| Parameter | Value |
|---|---|
| Algorithm | MorpheL (MI + Gumbel) |
| Vocabulary size | 8,084 |
| top_k | 4 |
| temperature (T) | 1.0 (vocab induction: T=0) |
| mi_threshold | 0.0 (keep MI > 0) |
| min_frequency | 2 |
| Special tokens | <s>, <pad>, </s>, <unk>, <mask> |
| Corpus | facebook/xnli/tr, all splits (800,404 sentences) |
| Vowel set | Turkish (ı İ ö ü â î û + ASCII) |
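
For scripting experiments, the table above could be mirrored in a small config object. This is only a sketch: `MorphelConfig` and its field names are hypothetical and chosen to match the table's parameter names, not taken from the MorpheL code. The vowel set contains exactly the characters listed in the table plus the ASCII vowels in both cases.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MorphelConfig:  # hypothetical container, not part of any released API
    vocab_size: int = 8084
    top_k: int = 4
    temperature: float = 1.0
    mi_threshold: float = 0.0  # keep boundaries with MI > 0
    min_frequency: int = 2
    special_tokens: tuple = ("<s>", "<pad>", "</s>", "<unk>", "<mask>")
    # Vowels from the table; other uppercase forms are not listed there,
    # so they are not added here.
    vowels: frozenset = frozenset("aeiouAEIOU" "ıİöüâîû")

    def is_vowel(self, ch: str) -> bool:
        return ch in self.vowels


TRAIN = MorphelConfig()
# Vocabulary induction uses T=0 according to the table.
VOCAB_INDUCTION = replace(TRAIN, temperature=0.0)
```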
Evaluation Metrics (vs Baselines, vocab_size=8000, same corpus)
| Metric | BPE | WordPiece | Unigram | MorpheL |
|---|---|---|---|---|
| Fertility ↓ | - | - | - | 3.4047 |
| Tokens/char ↓ | - | - | - | 0.4546 |
| Avg seq len ↓ | - | - | - | 37.49 |
| Vocab coverage ↑ | - | - | - | 1.0000 |
| OOV rate ↓ | - | - | - | 0.0000 |
Fill baseline columns after running baseline notebooks.
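
To fill the baseline columns, the same metrics have to be computed the same way for every tokenizer. The sketch below uses common definitions: fertility as tokens per whitespace-separated word, tokens/char over raw characters including spaces, and OOV rate as the share of `<unk>` tokens. These definitions, and the function name `tokenizer_metrics`, are assumptions; this card does not state how the reported numbers were computed.

```python
def tokenizer_metrics(sentences, tokenize, unk_token="<unk>"):
    """Compute corpus-level tokenizer statistics.

    `tokenize` is any callable mapping a sentence string to a list of tokens.
    """
    n_tokens = n_words = n_chars = n_unk = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        n_tokens += len(tokens)
        n_words += len(sentence.split())
        n_chars += len(sentence)
        n_unk += sum(1 for t in tokens if t == unk_token)
    return {
        "fertility": n_tokens / n_words,        # tokens per word
        "tokens_per_char": n_tokens / n_chars,  # tokens per character
        "avg_seq_len": n_tokens / len(sentences),
        "oov_rate": n_unk / n_tokens,
    }
```

Any Hugging Face tokenizer's `tokenize` method can be passed as the `tokenize` argument, so the four columns can share one evaluation loop.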
Usage
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("NIRVLab/xnli-morphel-tr-beta")
Note: MorpheL segments words before passing them to the tokenizer. At downstream training time, use temperature=1.0 for stochastic segmentation. For inference, use temperature=0 (deterministic).