This tokenizer extends google/mt5-base by incorporating full Dhivehi language support, while preserving the original multilingual capabilities of mT5.
It is based on a SentencePiece model with a 300k-entry vocabulary, trained on a combined corpus of multilingual samples and Dhivehi text.
- Base tokenizer: google/mt5-base
- Vocabulary size: 300,000
- Normalization: NFKC

| Detail | Value |
|---|---|
| Tokenizer | SentencePieceTrainer |
| Model type | unigram |
| Vocab size | 300000 |
| Character coverage | 0.9995 |
| Input sentence size | 5,000,000 |
| Special tokens | `<pad>`, `<unk>`, `<s>`, `</s>` |
| Byte fallback | Enabled |
| Normalization | NFKC |
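
The settings in the table correspond to keyword options of the `sentencepiece` trainer. The following is a minimal sketch of how such a run could be configured, not the author's actual training script. The corpus path `combined_corpus.txt` and the model prefix `spiece` are hypothetical, and special-token IDs are left at the library defaults because the card does not state them.

```python
# Sketch only: option values come from the table above; everything else is assumed.
TRAINER_OPTIONS = {
    "model_type": "unigram",
    "vocab_size": 300_000,
    "character_coverage": 0.9995,
    "input_sentence_size": 5_000_000,
    "byte_fallback": True,
    "normalization_rule_name": "nfkc",
}


def train_tokenizer(corpus_path="combined_corpus.txt", model_prefix="spiece"):
    """Train a SentencePiece model; both default arguments are hypothetical names."""
    # Imported lazily so the options can be inspected without the package installed.
    import sentencepiece as spm

    # Special-token IDs (pad_id, unk_id, bos_id, eos_id) are not given in the card,
    # so they are not set here and fall back to the sentencepiece defaults.
    spm.SentencePieceTrainer.train(
        input=corpus_path,
        model_prefix=model_prefix,
        **TRAINER_OPTIONS,
    )
    return f"{model_prefix}.model"
```

Calling `train_tokenizer()` would write `spiece.model` and `spiece.vocab` next to the script, provided the corpus file exists and contains enough text to support the requested vocabulary size.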
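```python
from transformers import MT5Tokenizer

# Load the extended tokenizer from this repository
tokenizer = MT5Tokenizer.from_pretrained("alakxender/mt5-dhivehi-tokenizer-extended")

# Tokenize a Dhivehi sentence and inspect the resulting pieces
text = "ރިޔާސީ އިންތިހާބުގައި ވާދަކުރަށްވަން ނަޝީދު ހިޔާލު ހޯއްދަވަނީ"
tokens = tokenizer.tokenize(text)
print(tokens)
```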
| Language | Tokenization | Result |
|---|---|---|
| Dhivehi | Fragmented in base | Improved |
| Multilingual | Supported as before | Retained |
| Mixed sentences | Fragile in base | Improved |
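
One way to put a number on "fragmented" versus "improved" is the average count of tokens per whitespace-separated word, where lower means less fragmentation. The sketch below is an illustration rather than the evaluation used for the table; the helper names `tokens_per_word` and `compare_tokenizers` are hypothetical, and the comparison function assumes both repositories can be loaded with `AutoTokenizer`.

```python
def tokens_per_word(tokenize, text):
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)


def compare_tokenizers(text,
                       base_id="google/mt5-base",
                       extended_id="alakxender/mt5-dhivehi-tokenizer-extended"):
    """Return tokens-per-word for the base and extended tokenizers (needs network access)."""
    # Imported lazily so tokens_per_word can be used without transformers installed.
    from transformers import AutoTokenizer

    base = AutoTokenizer.from_pretrained(base_id)
    extended = AutoTokenizer.from_pretrained(extended_id)
    return {
        "base": tokens_per_word(base.tokenize, text),
        "extended": tokens_per_word(extended.tokenize, text),
    }
```

Running `compare_tokenizers` on a sample of Dhivehi, non-Dhivehi, and mixed sentences would give one ratio per tokenizer for each row of the table.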
The tokenizer supports round-trip decoding for Dhivehi:
```python
ids = tokenizer.encode("ރިޔާސީ", add_special_tokens=False)
decoded = tokenizer.decode(ids)
assert decoded == "ރިޔާސީ"
```
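
To extend the single-word check to a batch of strings, a small helper along the following lines could be used. It is a sketch under one assumption: because the tokenizer applies NFKC normalization, the decoded text is compared with the NFKC-normalized input rather than the raw input. The function name `round_trip_failures` is hypothetical, and whitespace handling in the real tokenizer may require further adjustment.

```python
import unicodedata


def round_trip_failures(encode, decode, texts):
    """Return the texts whose decoded form differs from their NFKC-normalized input."""
    failures = []
    for text in texts:
        expected = unicodedata.normalize("NFKC", text)
        if decode(encode(text)) != expected:
            failures.append(text)
    return failures
```

With the tokenizer loaded as above, it could be called as `round_trip_failures(lambda t: tokenizer.encode(t, add_special_tokens=False), tokenizer.decode, samples)`, where `samples` is a list of test strings; an empty result means every sample survived the round trip.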
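- spiece.model: The new SentencePiece model (300k vocab)
- tokenizer_config.json: Updated for extended vocab
- special_tokens_map.json: Preserved from mt5-base

To use the tokenizer with the google/mt5-base model, resize the model's token embeddings to the extended vocabulary size:

```python
from transformers import MT5ForConditionalGeneration

# `tokenizer` is the extended tokenizer loaded earlier
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
# Resize the embedding matrices to match the extended vocabulary
model.resize_token_embeddings(len(tokenizer))
```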
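
Resizing adds embedding rows that are not pretrained, so it also increases the parameter count. The arithmetic below is a rough estimate, assuming the published mt5-base configuration (`vocab_size` 250112, `d_model` 768, separate input and output embedding matrices) and a tokenizer length of exactly 300,000; the actual `len(tokenizer)` may differ. The function name `added_embedding_params` is hypothetical.

```python
def added_embedding_params(old_vocab, new_vocab, d_model, tied=False):
    """Count parameters added when the vocabulary grows from old_vocab to new_vocab."""
    new_rows = max(new_vocab - old_vocab, 0)
    # Untied models keep separate input and output embedding matrices.
    matrices = 1 if tied else 2
    return new_rows * d_model * matrices


# Assumed values: mt5-base config (vocab_size=250112, d_model=768), 300,000-entry tokenizer.
extra = added_embedding_params(old_vocab=250_112, new_vocab=300_000, d_model=768)
print(f"{extra:,} additional parameters")
```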
Base model: google/mt5-base