GaloNMT — English → Galo Neural Machine Translation
GaloNMT is a vanilla Transformer-based neural machine translation model that translates English text into Galo, a Tibeto-Burman language spoken by the Galo community in Arunachal Pradesh, India. Galo is classified as a low-resource language with very limited digital representation, making this one of the first dedicated NMT systems for the language.
Model Details
| Property | Value |
|---|---|
| Architecture | Vanilla Transformer (from scratch) |
| Translation Direction | English → Galo |
| Framework | PyTorch |
| Model Size | ~34.7 MB (model.pt) |
| Tokenizer | Byte-Pair Encoding (BPE) via HuggingFace tokenizers |
| Source Vocab Size | 5,000 |
| Target Vocab Size | 5,000 |
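The 5,000-token source and target vocabularies were built with the HuggingFace tokenizers library. Below is a minimal sketch of how BPE tokenizers of this kind can be trained; the corpus file names and the presence of an [UNK] special token are assumptions, not details taken from the released training code.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_bpe(corpus_file, vocab_size=5000):
    # BPE model with simple whitespace pre-tokenization (assumed setup).
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[PAD]", "[SOS]", "[EOS]", "[UNK]"],
    )
    tokenizer.train([corpus_file], trainer)
    return tokenizer

# Hypothetical corpus file names; the released tokenizers ship with the model.
train_bpe("train.en").save("en_tokenizer.json")
train_bpe("train.galo").save("galo_tokenizer.json")
```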
Architecture Hyperparameters
| Hyperparameter | Value |
|---|---|
| d_model | 128 |
| n_heads | 4 |
| n_layers | 2 |
| d_ff | 256 |
| dropout | 0.3 |
| max_seq_length | 64 |
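For illustration only: the released model is a from-scratch implementation, but the hyperparameters above map onto PyTorch's built-in torch.nn.Transformer as shown below. The embedding and output-projection layers are assumptions added to tie the core to the 5,000-token vocabularies; they are not taken from the actual code.

```python
import torch.nn as nn

# Illustrative stand-in for the from-scratch Transformer: how the
# hyperparameters above would parameterize torch.nn.Transformer.
core = nn.Transformer(
    d_model=128,            # d_model
    nhead=4,                # n_heads
    num_encoder_layers=2,   # n_layers
    num_decoder_layers=2,   # n_layers
    dim_feedforward=256,    # d_ff
    dropout=0.3,            # dropout
    batch_first=True,
)

# Assumed companion layers for a 5,000-token source and target vocabulary.
src_embedding = nn.Embedding(5000, 128)
trg_embedding = nn.Embedding(5000, 128)
output_projection = nn.Linear(128, 5000)
```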
Training Configuration
| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Learning Rate | 1e-4 |
| Batch Size | 16 |
| Epochs | 30 |
| Loss Function | CrossEntropyLoss (ignoring PAD) |
| Hardware | Apple M4 Silicon (MPS) |
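A minimal sketch of the corresponding training loop is given below, assuming the model follows the (src, trg, src_mask, trg_mask) interface used in the inference code further down. The function signature, data loader, and batch format are illustrative; this is not the actual training script.

```python
import torch
import torch.nn as nn

def train_model(model, train_loader, pad_idx, device, epochs=30, lr=1e-4):
    """Teacher-forced training loop matching the configuration above (sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)   # ignore PAD positions
    model.train()
    for epoch in range(epochs):
        total_loss = 0.0
        for src, trg in train_loader:                # batches of 16 padded ID tensors
            src, trg = src.to(device), trg.to(device)
            trg_in, trg_gold = trg[:, :-1], trg[:, 1:]   # shifted decoder input / targets

            src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
            trg_len = trg_in.size(1)
            # Causal mask so each target position sees only earlier positions.
            trg_mask = torch.tril(
                torch.ones((1, 1, trg_len, trg_len), device=device)
            ).bool()

            optimizer.zero_grad()
            output = model(src, trg_in, src_mask, trg_mask)   # (batch, trg_len, vocab)
            loss = criterion(output.reshape(-1, output.size(-1)), trg_gold.reshape(-1))
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch + 1}: loss {total_loss / len(train_loader):.4f}")
```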
Training Data
The model was trained on the Galo Bible Parallel Corpus, a sentence-aligned English–Galo parallel corpus derived from Bible translations.
| Split | Sentences |
|---|---|
| Train | 6,144 |
| Validation | 768 |
| Test | 768 |
| Total | 7,680 |
The dataset was split using an 80 : 10 : 10 ratio (train / validation / test) with a fixed random seed of 42 for reproducibility.
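A sketch of an equivalent split is shown below; only the ratio and the seed come from the description above, while the loading of the sentence pairs is left to the caller.

```python
import random

def split_corpus(pairs, seed=42):
    """Shuffle and split sentence pairs 80/10/10 (train/validation/test)."""
    random.seed(seed)
    pairs = list(pairs)            # copy so the caller's list is untouched
    random.shuffle(pairs)
    n_train = int(0.8 * len(pairs))
    n_val = int(0.1 * len(pairs))
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]
    return train, val, test

# With the 7,680 pairs of the corpus this yields 6,144 / 768 / 768.
```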
Evaluation Results
Evaluation was performed on 100 randomly sampled sentences from the held-out test set using SacreBLEU.
| Metric | Score |
|---|---|
| BLEU | 16.61 |
| chrF | 15.26 |
| TER | 150.04 |
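These scores can be reproduced with the sacrebleu Python API given the model's hypotheses and the reference translations (placeholder lists below). Note that a TER above 100 means that, on average, more edit operations are needed than there are tokens in the reference.

```python
import sacrebleu

# Placeholders: in practice, the 100 translated test sentences and their references.
hypotheses = ["Ngo nonnuëm mendu ."]
references = ["Ngo nonnuëm mendu ."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
ter = sacrebleu.corpus_ter(hypotheses, [references])
print(f"BLEU {bleu.score:.2f}  chrF {chrf.score:.2f}  TER {ter.score:.2f}")
```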
Sample Translations
| English Input | Galo Output |
|---|---|
| The elder to Gaius the beloved, | Yo lëga ëmrëm nyi gaddë nyi gaddë , yo go mendudü ëgum nyi gaddë yo go mendudü dü ? |
| Beloved, I personally am praying for you, | Ngo nonnuëm mendu , ngo nonnuëm mendu , ngo nonnuëm mendu , |
| Do not love the world, nor the things that are in the world. | Ëmbë rünamë , tani mooko sokë tani mooko sokë tani mooko sokë nyi ë , okkë tani mooko sokë nyi ë tani mooko sokë aken ë . |
Note: The model shows signs of repetition in some outputs, a common phenomenon in low-resource NMT settings. See Limitations for details.
How to Use
Requirements
```bash
pip install torch tokenizers
```
Inference
```python
import json

import torch
from tokenizers import Tokenizer

# Pick the best available device (the model was trained on Apple Silicon / MPS).
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# config.json holds the architecture hyperparameters needed to rebuild the model.
with open("GaloNMT/config.json", "r") as f:
    config = json.load(f)

en_tokenizer = Tokenizer.from_file("GaloNMT/en_tokenizer.json")
galo_tokenizer = Tokenizer.from_file("GaloNMT/galo_tokenizer.json")

PAD_IDX = en_tokenizer.token_to_id("[PAD]")
SOS_IDX = en_tokenizer.token_to_id("[SOS]")
EOS_IDX = en_tokenizer.token_to_id("[EOS]")

# Rebuild the Transformer class from the training code using `config`, then load
# the trained weights from GaloNMT/model.pt (e.g. with torch.load); not shown here.
# model = ...

def translate(sentence, model, max_len=64):
    """Greedily decode a Galo translation of an English sentence."""
    model.eval()
    tokens = [SOS_IDX] + en_tokenizer.encode(sentence).ids + [EOS_IDX]
    src = torch.tensor(tokens).unsqueeze(0).to(device)
    src_mask = (src != PAD_IDX).unsqueeze(1).unsqueeze(2)

    trg_indexes = [SOS_IDX]
    for _ in range(max_len):
        trg_tensor = torch.tensor(trg_indexes).unsqueeze(0).to(device)
        # Causal mask: each target position attends only to earlier positions.
        trg_mask = torch.tril(
            torch.ones((1, 1, len(trg_indexes), len(trg_indexes)), device=device)
        ).bool()
        with torch.no_grad():
            output = model(src, trg_tensor, src_mask, trg_mask)
        pred_token = output.argmax(2)[:, -1].item()
        trg_indexes.append(pred_token)
        if pred_token == EOS_IDX:
            break

    return galo_tokenizer.decode(trg_indexes)
```
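Example usage, once the Transformer has been instantiated and its weights loaded as noted in the comments above:

```python
print(translate("Beloved, I personally am praying for you,", model))
```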
Intended Use
- Primary use: Research and experimentation in low-resource neural machine translation for the Galo language.
- Secondary use: Supporting language documentation and digital preservation efforts for the Galo community.
- Not intended for: Production-grade translation systems, legal or medical translation, or any high-stakes application where translation accuracy is critical.
Limitations
- Small training corpus: The model is trained on only ~7,700 sentence pairs from a single domain (Bible text), which limits its vocabulary coverage and generalization to other domains.
- Repetitive outputs: Due to the low-resource setting and small model size, the decoder occasionally produces repetitive n-grams — a well-known issue in autoregressive NMT.
- Single domain: Performance on out-of-domain text (news, conversational, technical) is expected to be significantly lower than the reported metrics.
- No beam search: The current inference uses greedy decoding; beam search or sampling strategies may improve output quality (see the sketch after this list).
- No back-translation or data augmentation: The model was trained on parallel data only, without synthetic data augmentation techniques.
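As a starting point for the beam-search improvement mentioned above, the sketch below reuses the tokenizers, special-token indices, device, and model from the Inference section. It is illustrative only and omits refinements such as length normalization.

```python
import torch

def beam_translate(sentence, model, beam_size=4, max_len=64):
    """Beam-search decoding sketch; reuses SOS_IDX, EOS_IDX, PAD_IDX, the
    tokenizers, and `device` defined in the Inference section."""
    model.eval()
    tokens = [SOS_IDX] + en_tokenizer.encode(sentence).ids + [EOS_IDX]
    src = torch.tensor(tokens).unsqueeze(0).to(device)
    src_mask = (src != PAD_IDX).unsqueeze(1).unsqueeze(2)

    beams = [([SOS_IDX], 0.0)]                   # (token ids, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == EOS_IDX:               # keep finished hypotheses as-is
                candidates.append((seq, score))
                continue
            trg = torch.tensor(seq).unsqueeze(0).to(device)
            trg_mask = torch.tril(
                torch.ones((1, 1, len(seq), len(seq)), device=device)
            ).bool()
            with torch.no_grad():
                output = model(src, trg, src_mask, trg_mask)
            log_probs = torch.log_softmax(output[0, -1], dim=-1)
            top = torch.topk(log_probs, beam_size)
            for lp, idx in zip(top.values.tolist(), top.indices.tolist()):
                candidates.append((seq + [idx], score + lp))
        # Keep only the highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == EOS_IDX for seq, _ in beams):
            break

    return galo_tokenizer.decode(beams[0][0])
```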
Ethical Considerations
- The training data is derived from publicly available Bible translations. Care should be taken when using the model in culturally sensitive contexts.
- Galo is a language spoken by an indigenous community. Any deployment or public-facing use of this model should involve community consultation and respect for indigenous language rights.
- This model should not be used to generate content that misrepresents the Galo language or culture.
Training Loss Curve
The model trained for 30 epochs with the following loss progression:
| Epoch | Loss | Epoch | Loss | Epoch | Loss |
|---|---|---|---|---|---|
| 1 | 7.0211 | 11 | 5.3566 | 21 | 4.8699 |
| 2 | 6.3616 | 12 | 5.2930 | 22 | 4.8339 |
| 3 | 6.1726 | 13 | 5.2337 | 23 | 4.7986 |
| 4 | 6.0124 | 14 | 5.1815 | 24 | 4.7632 |
| 5 | 5.8844 | 15 | 5.1299 | 25 | 4.7345 |
| 6 | 5.7708 | 16 | 5.0777 | 26 | 4.7034 |
| 7 | 5.6739 | 17 | 5.0343 | 27 | 4.6699 |
| 8 | 5.5823 | 18 | 4.9872 | 28 | 4.6412 |
| 9 | 5.5018 | 19 | 4.9482 | 29 | 4.6122 |
| 10 | 5.4271 | 20 | 4.9081 | 30 | 4.5867 |
Model Files
```
GaloNMT/
├── config.json          # Model architecture configuration
├── model.pt             # Trained model weights (~34.7 MB)
├── en_tokenizer.json    # English BPE tokenizer
├── galo_tokenizer.json  # Galo BPE tokenizer
└── README.md            # This model card
```
Citation
If you use this model in your research, please cite:
```bibtex
@misc{galonmt2026,
  title        = {GaloNMT: Neural Machine Translation for English to Galo},
  author       = {Jurist Dupit},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/GaloNMT}},
  note         = {Vanilla Transformer trained on the Galo Bible Parallel Corpus},
  institution  = {Rajiv Gandhi University, Rono Hills, Doimukh}
}
```
Acknowledgements
This work contributes to the digital preservation and computational linguistic support for the Galo language. We thank the Galo-speaking community for the linguistic resources that made this project possible.