NIRVLab — MorpheL Tokenizer for Turkish XNLI

MorpheL: MI-Guided Stochastic Segmentation tokenizer for morphologically rich low-resource languages (Proposal: MorpheL: Morphology-Aware Tokenizer Adaptation for Pretrained Models in Low-Resource Languages).

Trained on the Turkish (tr) subset of facebook/xnli — all splits.

Algorithm

MorpheL scores candidate intra-word boundaries via pointwise mutual information (MI) between prefix and suffix substrings, then stochastically selects the number of cuts via Gumbel perturbation (Eq. 6–8). Key distinctions from BPE-Dropout:

Randomness is MI-informed — only high-MI boundaries enter the candidate pool
Vowel-consonant transition heuristic pre-filters linguistically implausible positions
Global MI table aggregated over full corpus (not per-sentence) for stability

Training Config

Parameter	Value
Algorithm	MorpheL (MI + Gumbel)
Vocabulary size	8,084
top_k	4
temperature (T)	1.0 (vocab induction: T=0)
mi_threshold	0.0 (keep MI > 0)
min_frequency	2
Special tokens	`<s>, <pad>, </s>, <unk>, <mask>`
Corpus	`facebook/xnli/tr` — all splits (800,404 sentences)
Vowel set	Turkish (ı İ ö ü â î û + ASCII)

Evaluation Metrics (vs Baselines, vocab_size=8000, same corpus)

Metric	BPE	WordPiece	Unigram	MorpheL
Fertility ↓	—	—	—	3.4047
Tokens/char ↓	—	—	—	0.4546
Avg seq len ↓	—	—	—	37.49
Vocab coverage ↑	—	—	—	1.0000
OOV rate ↓	—	—	—	0.0000

Fill baseline columns after running baseline notebooks.

Usage

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("NIRVLab/xnli-morphel-tr-beta")

Note: MorpheL segments words before passing to the tokenizer. At downstream training time, use temperature=1.0 for stochastic segmentation. For inference, use temperature=0 (deterministic).

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

NIRVLab
/

xnli-morphel-tr-beta

NIRVLab — MorpheL Tokenizer for Turkish XNLI

Algorithm

Training Config

Evaluation Metrics (vs Baselines, vocab_size=8000, same corpus)

Usage

Dataset used to train NIRVLab/xnli-morphel-tr-beta