A production-ready byte-level BPE tokenizer trained specifically on Amharic biblical and religious texts. It accurately reconstructs complex Ge'ez script, ancient punctuation, and liturgical content.
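To illustrate why a byte-level vocabulary suits Ge'ez script: every character in the Ethiopic Unicode block takes three bytes in UTF-8, so a byte-level BPE model starts from those bytes and learns merges that recombine them. The snippet below is a small standard-library sketch of that starting point, not part of the tokenizer's own code; the sample characters are chosen only for illustration.

```python
# Each Ethiopic code point (U+1200 to U+137F) encodes to 3 bytes in UTF-8.
samples = ["ሰ", "።", "፠"]
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, n in byte_lengths.items():
    print(f"U+{ord(ch):04X} {ch!r}: {n} bytes -> {list(ch.encode('utf-8'))}")

word = "ሰላም"
word_bytes = word.encode("utf-8")
print(len(word), "characters,", len(word_bytes), "bytes")

# Decoding the full byte sequence restores the original text.
assert word_bytes.decode("utf-8") == word
```

Because the base alphabet is the 256 byte values, no input can fall outside the vocabulary, which is what makes lossless round trips possible in principle.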
| Dataset | Source | Texts | Description |
|---|---|---|---|
| Synaxarium | Nexuss0781/synaxarium | 366 | Daily synaxarium readings in Amharic |
| Canon Biblical | Nexuss0781/conon-biblical-am-en | 61,403 | Amharic-English biblical texts |
| Total | - | 61,769 | 15.43 MB combined corpus |
{
  "vocab_size": 16000,
  "min_frequency": 2,
  "special_tokens": ["<pad>", "<unk>", "<s>", "</s>", "<mask>"],
  "lowercase": false,
  "compression": "gzip (level 9)",
  "checkpointing": true
}
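As a rough illustration of what `vocab_size` and `min_frequency` control, the following toy trainer merges the most frequent adjacent byte pair until either limit is reached. It is a simplified sketch, not the `tokenizers` library implementation: the helper name `train_toy_bpe` is hypothetical, and it ignores special tokens and pre-tokenization.

```python
from collections import Counter


def train_toy_bpe(texts, vocab_size, min_frequency):
    """Toy byte-level BPE trainer (hypothetical helper, for illustration only)."""
    # Start from the 256 possible byte values, as byte-level BPE does.
    vocab = {bytes([b]) for b in range(256)}
    # Represent each text as a list of single-byte symbols.
    sequences = [[bytes([b]) for b in t.encode("utf-8")] for t in texts]
    merges = []
    while len(vocab) < vocab_size:
        # Count every adjacent symbol pair across all sequences.
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        # Stop once the best pair is rarer than min_frequency.
        if count < min_frequency:
            break
        merged = left + right
        merges.append((left, right))
        vocab.add(merged)
        # Replace every occurrence of the pair with the merged symbol.
        new_sequences = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_sequences.append(out)
        sequences = new_sequences
    return vocab, merges


vocab, merges = train_toy_bpe(["፠፠፠፠", "፠፠"], vocab_size=260, min_frequency=2)
print(len(vocab), "symbols,", len(merges), "merges")
```

In this toy run, training stops before reaching `vocab_size` because the remaining candidate pair occurs only once, which is below `min_frequency=2`.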
| Metric | Result |
|---|---|
| Accurate Reconstruction | ✅ High accuracy |
| Ge'ez Punctuation | ✅ Accurate (1 token for ፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠) |
| Synaxarium Text | ✅ Accurate (66 tokens) |
| Biblical Text | ✅ Accurate (82 tokens) |
| Compression Ratio | 89.8% (1.3MB → 136KB) |
| Training Time | ~17 seconds |
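The compression ratio in the table can be reproduced from the two reported file sizes. The sketch below assumes 1 MB = 1024 KB and uses the rounded sizes from the table, so it is only an approximate check.

```python
# Sizes as reported in the metrics table; 1 MB = 1024 KB is an assumption.
tokenizer_json_kb = 1.3 * 1024
vocab_gz_kb = 136

# Space saving = 1 - compressed size / original size.
saving = 1 - vocab_gz_kb / tokenizer_json_kb
print(f"Space saving: {saving:.1%}")
```

With 1 MB taken as 1000 KB instead, the same calculation gives roughly 89.5%, so the reported figure depends on the unit convention and on rounding of the sizes.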
pip install tokenizers huggingface_hub
from tokenizers import Tokenizer
from huggingface_hub import hf_hub_download
# Download and load tokenizer
tokenizer_path = hf_hub_download("Nexuss0781/Ethio-BBPE", "tokenizer.json")
tokenizer = Tokenizer.from_file(tokenizer_path)
# Encode Amharic text
text = "ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።"
encoded = tokenizer.encode(text)
print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")
print(f"Decoded: {tokenizer.decode(encoded.ids)}")
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("models/EthioBBPE/tokenizer.json")
# Test with ancient Ge'ez punctuation
text = "፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠"
encoded = tokenizer.encode(text)
print(f"Encoded {len(text)} chars into {len(encoded.ids)} token(s)")
# Output: Encoded 16 chars into 1 token(s)
import gzip
import json

# Load the gzip-compressed vocabulary
with gzip.open("models/EthioBBPE/vocab.json.gz", "rt", encoding="utf-8") as f:
    vocab = json.load(f)

print(f"Vocabulary size: {len(vocab)}")
print("Storage saved: ~89.8%")
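For reference, a compressed vocabulary file of this kind could be written and read back as sketched below. This uses only the standard library; the toy vocabulary and the token-to-id layout are assumptions for illustration, and the training script's actual export code is not shown in this document.

```python
import gzip
import json
import os
import tempfile

# Toy vocabulary standing in for the real one (assumption: token -> id mapping).
toy_vocab = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3, "<mask>": 4, "፠": 5}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.json.gz")
    # compresslevel=9 mirrors the "gzip (level 9)" setting in the config.
    with gzip.open(path, "wt", encoding="utf-8", compresslevel=9) as f:
        # ensure_ascii=False keeps Ge'ez characters readable in the JSON text.
        json.dump(toy_vocab, f, ensure_ascii=False)
    with gzip.open(path, "rt", encoding="utf-8") as f:
        restored = json.load(f)

print(f"Round trip intact: {restored == toy_vocab}")
```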
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("models/EthioBBPE/tokenizer.json")
# Synaxarium text
synaxarium = """ሰላም ለኢዮብ ዘኢነበበ ከንቶ ። አመ አኀዞ አበቅ ወአመ አህጎለ ጥሪቶ ።"""
encoded = tokenizer.encode(synaxarium)
print(f"Original: {synaxarium}")
print(f"Tokens: {encoded.tokens}")
print(f"Token count: {len(encoded.ids)}")
print(f"Reconstructed: {tokenizer.decode(encoded.ids)}")
print(f"Perfect match: {synaxarium == tokenizer.decode(encoded.ids)}")
# Batch encoding (reuses the tokenizer loaded above)
texts = [
    "በመዠመሪያ፡እግዚአብሔር፡ሰማይንና፡ምድርን፡ፈጠረ።",
    "ወደ ቍስጥንጥንያ አገርም በደረሰች ጊዜ",
    "፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠",
]
encodings = tokenizer.encode_batch(texts)
for i, enc in enumerate(encodings):
    print(f"Text {i+1}: {len(enc.ids)} tokens")
| File | Size | Description |
|---|---|---|
| `tokenizer.json` | 1.3 MB | Standard tokenizer format |
| `vocab.json.gz` | 136 KB | Compressed vocabulary (89.8% smaller) |
| `config.json` | 431 B | Training configuration |
| `training_metrics.json` | 1.2 KB | Comprehensive training metrics |
| `README.md` | - | This documentation |
Special tokens: `<pad>`, `<unk>`, `<s>`, `</s>`, `<mask>`

All test cases achieve accurate reconstruction:
# Assumes `tokenizer` has been loaded as in the earlier examples
test_cases = [
    ("Ge'ez Punctuation", "፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠"),
    ("Synaxarium", "ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።"),
    ("Biblical", "ወደ ቍስጥንጥንያ አገርም በደረሰች ጊዜ"),
]
for name, text in test_cases:
    encoded = tokenizer.encode(text)
    decoded = tokenizer.decode(encoded.ids)
    # Round trip must reproduce the input exactly
    assert text == decoded, f"{name} failed!"
    print(f"✅ {name}: Accurate ({len(encoded.ids)} tokens)")
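This tokenizer was trained on two specialized Amharic biblical datasets:
- Synaxarium (`Nexuss0781/synaxarium`): 366 daily synaxarium readings in Amharic
- Canon Biblical (`Nexuss0781/conon-biblical-am-en`): 61,403 Amharic-English biblical texts

Both datasets are available on Hugging Face under the Nexuss0781 organization.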
Automatic checkpointing during training allows resumption from interruptions:
python scripts/train_tokenizer.py --data_dir ./data --use_checkpoint
To train with a different vocabulary size:
python scripts/train_tokenizer.py --data_dir ./data --vocab_size 32000
To save a compressed vocabulary (supports gzip, bz2, lzma):
python scripts/train_tokenizer.py --data_dir ./data --save_compressed
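The three supported codecs all have standard-library modules in Python. How the training script selects between them is not shown here, so the sketch below only compares them on a toy payload; the payload and the `codecs` mapping are hypothetical and stand in for a serialized vocabulary.

```python
import bz2
import gzip
import json
import lzma

# Toy payload standing in for a serialized vocabulary (hypothetical data).
payload = json.dumps(
    {f"token_{i}": i for i in range(2000)}, ensure_ascii=False
).encode("utf-8")

# Map each codec name to a (compress, decompress) pair.
codecs = {
    "gzip": (lambda data: gzip.compress(data, compresslevel=9), gzip.decompress),
    "bz2": (lambda data: bz2.compress(data, compresslevel=9), bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

sizes = {}
for name, (compress, decompress) in codecs.items():
    blob = compress(payload)
    # Every codec must restore the exact original bytes.
    assert decompress(blob) == payload
    sizes[name] = len(blob)
    print(f"{name}: {len(payload)} -> {len(blob)} bytes")
```

Relative sizes depend heavily on the data, so results on this synthetic payload should not be read as predictions for the real vocabulary.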
Apache License 2.0 - See LICENSE for details.
Made with ❤️ for the Amharic NLP Community
Last Updated: May 2026