diacnet-1.0
diacnet-1.0 restores diacritics and accents to text that has been typed or scraped
without them, across 10 languages. It is fine-tuned from google/byt5-small and works
at the character/byte level rather than the word level, so it handles Yoruba tone marks,
Vietnamese combining diacritics, and Polish/Turkish special characters through
the same mechanism, with no per-language vocabulary needed.
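To make the byte-level point concrete, the sketch below shows what a byte-level model sees for a plain and a diacritized string. The helper name `utf8_bytes` is hypothetical and only illustrates UTF-8 encoding; it does not call the model or its tokenizer.

```python
def utf8_bytes(text: str) -> list[int]:
    """Return the UTF-8 byte values of a string (illustrative helper)."""
    return list(text.encode("utf-8"))

plain = "se"
marked = "\u1e63\u00e9"  # "s with dot below" + "e with acute", as in Yoruba

# The diacritized form is the same two letters but more bytes,
# so restoring diacritics lengthens the byte sequence.
print(len(utf8_bytes(plain)))   # 2
print(len(utf8_bytes(marked)))  # 5
```

Because every language is reduced to the same 256 possible byte values, no language-specific vocabulary is involved, and the byte count (not the character count) is what matters for the sequence-length limit.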
- Single joint model for all 10 languages: a language tag prefix (<yor>, <vie>, etc.) tells the model which diacritic inventory to apply, so no separate models or upstream language-ID step are required.
- Median per-language CER of about 0.026, with 8 of the 10 languages at or below 0.038 (see Benchmarks), which is near-perfect restoration on well-formed input.
- Fully self-supervised training with no manual annotation. Clean, already-diacritized text is the target; diacritics are deterministically stripped to create the training input.
Model Details
| Property | Value |
|---|---|
| Base model | google/byt5-small |
| Architecture | Byte-level seq2seq (T5) |
| Max sequence length | 256 bytes (trained on sentence-level examples) |
| Languages | 10 (see below) |
| Training data | olaverse/qg-passages-multi, split into sentences |
| Training | 3 epochs, batch size 16 × grad-accum 2, lr 1e-4 |
Languages: yo vi ig ha pl tr pt es fr it (ISO 639-1; ISO 639-3: yor vie ibo hau pol tur por spa fra ita)
Scoped deliberately to languages where diacritics are lexically meaningful. The model is not applied to the other 15 languages in the source corpus (e.g. Swahili, Zulu, Amharic, Japanese), where diacritic restoration either doesn't apply or isn't the right frame for the script.
Usage
from transformers import AutoTokenizer, T5ForConditionalGeneration
tok = AutoTokenizer.from_pretrained("olaverse/diacnet-1.0")
model = T5ForConditionalGeneration.from_pretrained("olaverse/diacnet-1.0")
text = "<yor> se eranko naa si gbo o?"
inputs = tok(text, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(output_ids[0], skip_special_tokens=True))
# ṣé ẹranko náà sì gbọ́ ọ?
Prefix input text with the target language tag (<yor>, <vie>, <ibo>,
<hau>, <pol>, <tur>, <por>, <spa>, <fra>, <ita>). The model works best on
single sentences or short passages; see Known limitations for longer text.
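A small helper can keep the tag prefix consistent and reject unsupported languages early. The sketch below is a minimal example: the function name `add_language_tag` is hypothetical, and the code mapping is copied from the Languages line of this card.

```python
# ISO 639-1 -> ISO 639-3 codes, taken from the Languages line of this card.
ISO1_TO_ISO3 = {
    "yo": "yor", "vi": "vie", "ig": "ibo", "ha": "hau", "pl": "pol",
    "tr": "tur", "pt": "por", "es": "spa", "fr": "fra", "it": "ita",
}
SUPPORTED_TAGS = set(ISO1_TO_ISO3.values())

def add_language_tag(text: str, lang: str) -> str:
    """Prefix text with a tag such as <yor>; accepts 2- or 3-letter codes."""
    code = ISO1_TO_ISO3.get(lang.lower(), lang.lower())
    if code not in SUPPORTED_TAGS:
        raise ValueError(f"unsupported language: {lang!r}")
    return f"<{code}> {text}"

print(add_language_tag("se eranko naa si gbo o?", "yo"))
# <yor> se eranko naa si gbo o?
```

The returned string can be passed to the tokenizer in place of the hand-written `text` value in the usage snippet.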
Benchmarks
Character error rate (CER, lower is better) on a held-out validation split (5% of training data, not seen during training). Sentence-level examples.
| Language | CER | n |
|---|---|---|
| por | 0.013 | 750 |
| spa | 0.013 | 845 |
| fra | 0.016 | 587 |
| pol | 0.016 | 1,137 |
| ita | 0.022 | 269 |
| ibo | 0.030 | 1,672 |
| tur | 0.033 | 1,206 |
| hau | 0.038 | 15 |
| vie | 0.063 | 1,832 |
| yor | 0.110 | 1,583 |
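For readers who want to compute a comparable metric on their own data, the sketch below uses the common definition of CER: character-level edit distance divided by reference length. This card does not state the exact formula or Unicode normalization behind the table, so the NFC normalization step and the per-sentence scoring here are assumptions.

```python
import unicodedata

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, prediction: str) -> float:
    """Character error rate of a prediction against a reference."""
    # Assumption: normalize to NFC so precomposed and combining
    # spellings of the same letter are not counted as errors.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", prediction)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a prediction that misses two of three diacritized characters scores 2/3, and an exact match scores 0.0.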
Known limitations
- Yoruba CER is notably higher than the other 9 languages, about 1.7x the next-highest score (Vietnamese, 0.063). Qualitative inspection shows this is driven almost entirely by genuine tonal ambiguity, not model weakness: Yoruba diacritics mark tone (low/mid/high pitch), and the same base letter sequence can correspond to multiple valid tone patterns depending on word sense or context, sometimes unrecoverable from text alone. Example from validation: target ọ́fíìsì (high tone) vs. predicted ọ̀fíìsì (low tone) on an English loanword ("office"), while everything else in the same sentence, including several other tone-marked words, was restored correctly. Most Yoruba errors are single-diacritic misses like this on an otherwise correctly restored sentence, not systematic failure.
- Hausa's 0.038 CER is based on only 15 validation examples, too small a sample to treat as a reliable estimate. Hausa was underrepresented in the source corpus relative to the other 9 languages; treat this number as indicative at best until evaluated on more data.
- Trained on machine/teacher-generated text (Aya-Collection-derived passages), not human-authored or casually typed text, so accuracy on real-world messy input (mixed scripts, typos, non-standard spelling) is untested.
- Trained and evaluated on sentence-length input (median 58 bytes, p90 162 bytes). Longer multi-sentence passages should be split into sentences before inference for best results, rather than passed in as one long input.
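The sentence-splitting advice above can be sketched as a small preprocessing step. The regex splitter is deliberately naive (it breaks on `.`, `!`, `?` followed by whitespace) and the function names are hypothetical; a language-aware sentence segmenter would be a better choice in practice.

```python
import re

MAX_BYTES = 256  # max sequence length from the Model Details table

def split_sentences(text: str) -> list[str]:
    """Naive split after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def chunk_by_bytes(sentence: str, max_bytes: int) -> list[str]:
    """Greedily pack words into chunks of at most max_bytes UTF-8 bytes.

    A single word longer than max_bytes is kept whole, not cut.
    """
    chunks, current = [], ""
    for word in sentence.split():
        candidate = f"{current} {word}" if current else word
        if current and len(candidate.encode("utf-8")) > max_bytes:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def prepare_inputs(text: str, lang_tag: str, max_bytes: int = MAX_BYTES) -> list[str]:
    """Split a passage into tagged, length-limited model inputs."""
    prefix = f"<{lang_tag}> "
    budget = max_bytes - len(prefix.encode("utf-8"))
    return [
        prefix + chunk
        for sentence in split_sentences(text)
        for chunk in chunk_by_bytes(sentence, budget)
    ]
```

Each returned string can be run through the model separately and the outputs joined with spaces. Note that the restored output is longer in bytes than the stripped input, so a tighter budget than 256 may be safer for heavily diacritized languages.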
Training data & licensing
Fine-tuned from google/byt5-small (Apache-2.0) on
olaverse/qg-passages-multi
(Apache-2.0), split into sentences and paired with a diacritic-stripped copy
of each sentence as the self-supervised training input. Released under
Apache-2.0.
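The pairing described above can be illustrated as follows. This is a sketch under stated assumptions, not the actual training pipeline: the card does not specify how diacritics were stripped, so the NFD-based approach, the `EXTRA_MAP` table for letters that do not decompose, and the function names are all hypothetical.

```python
import unicodedata

# Letters whose plain form is not reachable by Unicode decomposition.
# Illustrative assumption only; the real mapping is not documented here.
EXTRA_MAP = str.maketrans({"ł": "l", "Ł": "L", "đ": "d", "Đ": "D", "ı": "i"})

def strip_diacritics(text: str) -> str:
    """Deterministically remove combining marks, then map special letters."""
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_marks.translate(EXTRA_MAP)

def make_pair(sentence: str, lang_tag: str) -> tuple[str, str]:
    """Build a (model input, target) pair from one clean sentence."""
    source = f"<{lang_tag}> {strip_diacritics(sentence)}"
    return source, sentence

source, target = make_pair("ṣé ẹranko náà sì gbọ́ ọ?", "yor")
print(source)  # <yor> se eranko naa si gbo o?
```

Because the stripping is deterministic, every clean sentence yields exactly one training input, and no human labeling is needed.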
Citation
@misc{diacnet-1.0,
title = {diacnet-1.0},
author = {Olaverse},
year = {2026},
url = {https://huggingface.co/olaverse/diacnet-1.0}
}