NIRVLab: MorpheL Tokenizer for Turkish XNLI

MorpheL: MI-Guided Stochastic Segmentation tokenizer for morphologically rich low-resource languages (Proposal: MorpheL: Morphology-Aware Tokenizer Adaptation for Pretrained Models in Low-Resource Languages).

Trained on the Turkish (tr) subset of facebook/xnli, using all splits.

Algorithm

MorpheL scores candidate intra-word boundaries via pointwise mutual information (MI) between prefix and suffix substrings, then stochastically selects the number of cuts via Gumbel perturbation (Eq. 6–8). Key distinctions from BPE-Dropout:

  • Randomness is MI-informed: only high-MI boundaries enter the candidate pool
  • Vowel-consonant transition heuristic pre-filters linguistically implausible positions
  • Global MI table aggregated over full corpus (not per-sentence) for stability
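
The scoring and selection steps above can be sketched as follows. This is a minimal illustration, not the reference implementation: Eq. 6–8 are not reproduced in this card, so the exact selection rule is an assumption. Here the logit for making n cuts is taken to be the summed MI of the n best boundaries, and Gumbel noise scaled by the temperature perturbs those logits. The names build_tables, pmi, gumbel and select_cuts are hypothetical, and the vowel-consonant pre-filter is left out.

```python
import math
import random
from collections import Counter


def build_tables(words):
    """Count prefixes, suffixes and (prefix, suffix) pairs over the whole corpus."""
    prefix, suffix, joint = Counter(), Counter(), Counter()
    total = 0
    for word in words:
        for i in range(1, len(word)):  # every intra-word boundary
            left, right = word[:i], word[i:]
            prefix[left] += 1
            suffix[right] += 1
            joint[(left, right)] += 1
            total += 1
    return prefix, suffix, joint, total


def pmi(tables, left, right):
    """PMI = log(p(left, right) / (p(left) * p(right))); -inf for unseen pairs."""
    prefix, suffix, joint, total = tables
    if joint[(left, right)] == 0:
        return float("-inf")
    return math.log(joint[(left, right)] * total / (prefix[left] * suffix[right]))


def gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def select_cuts(scores, top_k=4, temperature=1.0, mi_threshold=0.0, rng=random):
    """Pick cut positions from {position: MI score} (assumed selection rule)."""
    # Candidate pool: boundaries above the threshold, best top_k only.
    pool = sorted(
        ((s, p) for p, s in scores.items() if s > mi_threshold), reverse=True
    )[:top_k]
    # Assumption: logit for "make n cuts" = summed MI of the n best boundaries.
    logits = [0.0]
    for s, _ in pool:
        logits.append(logits[-1] + s)
    if temperature > 0:  # temperature = 0 skips the noise and is deterministic
        logits = [l + temperature * gumbel(rng) for l in logits]
    n_cuts = max(range(len(logits)), key=logits.__getitem__)
    return sorted(p for _, p in pool[:n_cuts])


# Toy corpus for illustration only; real scores come from the full corpus.
tables = build_tables(["evler", "evde", "okullar", "okulda"])
word = "evler"
scores = {i: pmi(tables, word[:i], word[i:]) for i in range(1, len(word))}
train_cuts = select_cuts(scores, temperature=1.0, rng=random.Random(0))
infer_cuts = select_cuts(scores, temperature=0.0)
```

On a corpus this small the MI values are not linguistically meaningful; the point is only the data flow from a global count table to per-word boundary scores to a sampled set of cuts.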

Training Config

Parameter | Value
Algorithm | MorpheL (MI + Gumbel)
Vocabulary size | 8,084
top_k | 4
temperature (T) | 1.0 (vocab induction: T=0)
mi_threshold | 0.0 (keep MI > 0)
min_frequency | 2
Special tokens | <s>, <pad>, </s>, <unk>, <mask>
Corpus | facebook/xnli/tr, all splits (800,404 sentences)
Vowel set | Turkish (ı İ ö ü â î û + ASCII)

Evaluation Metrics (vs Baselines, vocab_size=8000, same corpus)

Metric | BPE | WordPiece | Unigram | MorpheL
Fertility ↓ | - | - | - | 3.4047
Tokens/char ↓ | - | - | - | 0.4546
Avg seq len ↓ | - | - | - | 37.49
Vocab coverage ↑ | - | - | - | 1.0000
OOV rate ↓ | - | - | - | 0.0000

Baseline columns are pending; fill them in after running the baseline notebooks.
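
For filling in the baseline columns, the metrics could be computed along the following lines. This is a sketch under assumptions, since the card does not define them: fertility is taken as tokens per whitespace-separated word, tokens/char as tokens per character of raw text, average sequence length as tokens per sentence, and OOV rate as the share of tokens equal to <unk>. Vocab coverage is omitted because its definition is not stated here. The function name tokenizer_metrics is hypothetical.

```python
def tokenizer_metrics(sentences, tokenize, unk_token="<unk>"):
    """Compute corpus-level tokenizer metrics.

    `tokenize` is any callable mapping a string to a list of token strings.
    """
    n_tokens = n_words = n_chars = n_unk = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        n_tokens += len(tokens)
        n_unk += sum(tok == unk_token for tok in tokens)
        n_words += len(sentence.split())  # assumption: whitespace-delimited words
        n_chars += len(sentence)          # assumption: spaces count as characters
    return {
        "fertility": n_tokens / n_words,
        "tokens_per_char": n_tokens / n_chars,
        "avg_seq_len": n_tokens / len(sentences),
        "oov_rate": n_unk / n_tokens,
    }


# Toy check with a whitespace tokenizer standing in for a real one.
toy_metrics = tokenizer_metrics(["evler güzel", "okul"], str.split)
```

With a Hugging Face tokenizer, its tokenize method can be passed as the tokenize argument, which lets every tokenizer in the comparison be scored by the same function on the same sentences.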

Usage

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("NIRVLab/xnli-morphel-tr-beta")

Note: MorpheL segments words before they are passed to the tokenizer. During downstream training, use temperature=1.0 so that segmentation is stochastic. For inference, use temperature=0 so that segmentation is deterministic.
