GaloNMT — English → Galo Neural Machine Translation
GaloNMT is a vanilla Transformer-based neural machine translation model that translates English text into Galo, a Tibeto-Burman language spoken by the Galo community in Arunachal Pradesh, India. Galo is classified as a low-resource language with very limited digital representation, making this one of the first dedicated NMT systems for the language.
Model Details
| Property | Value |
|---|---|
| Architecture | Vanilla Transformer (from scratch) |
| Translation Direction | English → Galo |
| Framework | PyTorch |
| Model Size | ~34.7 MB (model.pt) |
| Tokenizer | Byte-Pair Encoding (BPE) via HuggingFace tokenizers |
| Source Vocab Size | 5,000 |
| Target Vocab Size | 5,000 |
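The 5,000-token source and target vocabularies were built with the HuggingFace tokenizers library. Below is a minimal sketch of how BPE tokenizers of this kind can be trained; the corpus file names and the presence of an [UNK] special token are assumptions, not details taken from the released training code.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_bpe(corpus_file, vocab_size=5000):
    # BPE model with simple whitespace pre-tokenization (assumed setup).
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[PAD]", "[SOS]", "[EOS]", "[UNK]"],
    )
    tokenizer.train([corpus_file], trainer)
    return tokenizer

# Hypothetical corpus file names; the released tokenizers ship with the model.
train_bpe("train.en").save("en_tokenizer.json")
train_bpe("train.galo").save("galo_tokenizer.json")
```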
Architecture Hyperparameters
| Hyperparameter | Value |
|---|---|
| d_model | 128 |
| n_heads | 4 |
| n_layers | 2 |
| d_ff | 256 |
| dropout | 0.3 |
| max_seq_length | 64 |
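For illustration only: the released model is a from-scratch implementation, but the hyperparameters above map onto PyTorch's built-in torch.nn.Transformer as shown below. The embedding and output-projection layers are assumptions added to tie the core to the 5,000-token vocabularies; they are not taken from the actual code.

```python
import torch.nn as nn

# Illustrative stand-in for the from-scratch Transformer: how the
# hyperparameters above would parameterize torch.nn.Transformer.
core = nn.Transformer(
    d_model=128,            # d_model
    nhead=4,                # n_heads
    num_encoder_layers=2,   # n_layers
    num_decoder_layers=2,   # n_layers
    dim_feedforward=256,    # d_ff
    dropout=0.3,            # dropout
    batch_first=True,
)

# Assumed companion layers for a 5,000-token source and target vocabulary.
src_embedding = nn.Embedding(5000, 128)
trg_embedding = nn.Embedding(5000, 128)
output_projection = nn.Linear(128, 5000)
```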
Training Configuration
| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Learning Rate | 1e-4 |
| Batch Size | 16 |
| Epochs | 30 |
| Loss Function | CrossEntropyLoss (ignoring PAD) |
| Hardware | Apple M4 Silicon (MPS) |
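A minimal sketch of the corresponding training loop is given below, assuming the model follows the (src, trg, src_mask, trg_mask) interface used in the inference code further down. The function signature, data loader, and batch format are illustrative; this is not the actual training script.

```python
import torch
import torch.nn as nn

def train_model(model, train_loader, pad_idx, device, epochs=30, lr=1e-4):
    """Teacher-forced training loop matching the configuration above (sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)   # ignore PAD positions
    model.train()
    for epoch in range(epochs):
        total_loss = 0.0
        for src, trg in train_loader:                # batches of 16 padded ID tensors
            src, trg = src.to(device), trg.to(device)
            trg_in, trg_gold = trg[:, :-1], trg[:, 1:]   # shifted decoder input / targets

            src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
            trg_len = trg_in.size(1)
            # Causal mask so each target position sees only earlier positions.
            trg_mask = torch.tril(
                torch.ones((1, 1, trg_len, trg_len), device=device)
            ).bool()

            optimizer.zero_grad()
            output = model(src, trg_in, src_mask, trg_mask)   # (batch, trg_len, vocab)
            loss = criterion(output.reshape(-1, output.size(-1)), trg_gold.reshape(-1))
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch + 1}: loss {total_loss / len(train_loader):.4f}")
```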
Training Data
The model was trained on the Galo Bible Parallel Corpus, a sentence-aligned English–Galo parallel corpus derived from Bible translations.
| Split | Sentences |
|---|---|
| Train | 6,144 |
| Validation | 768 |
| Test | 768 |
| Total | 7,680 |
The dataset was split using an 80 : 10 : 10 ratio (train / validation / test) with a fixed random seed of 42 for reproducibility.
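A sketch of an equivalent split is shown below; only the ratio and the seed come from the description above, while the loading of the sentence pairs is left to the caller.

```python
import random

def split_corpus(pairs, seed=42):
    """Shuffle and split sentence pairs 80/10/10 (train/validation/test)."""
    random.seed(seed)
    pairs = list(pairs)            # copy so the caller's list is untouched
    random.shuffle(pairs)
    n_train = int(0.8 * len(pairs))
    n_val = int(0.1 * len(pairs))
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]
    return train, val, test

# With the 7,680 pairs of the corpus this yields 6,144 / 768 / 768.
```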
Evaluation Results
Evaluation was performed on 100 randomly sampled sentences from the held-out test set using SacreBLEU.
| Metric | Score |
|---|---|
| BLEU | 16.61 |
| chrF | 15.26 |
| TER | 150.04 |
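These scores can be reproduced with the sacrebleu Python API given the model's hypotheses and the reference translations (placeholder lists below). Note that a TER above 100 means that, on average, more edit operations are needed than there are tokens in the reference.

```python
import sacrebleu

# Placeholders: in practice, the 100 translated test sentences and their references.
hypotheses = ["Ngo nonnuëm mendu ."]
references = ["Ngo nonnuëm mendu ."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
ter = sacrebleu.corpus_ter(hypotheses, [references])
print(f"BLEU {bleu.score:.2f}  chrF {chrf.score:.2f}  TER {ter.score:.2f}")
```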
Sample Translations
| English Input | Galo Output |
|---|---|
| The elder to Gaius the beloved, | Yo lëga ëmrëm nyi gaddë nyi gaddë , yo go mendudü ëgum nyi gaddë yo go mendudü dü ? |
| Beloved, I personally am praying for you, | Ngo nonnuëm mendu , ngo nonnuëm mendu , ngo nonnuëm mendu , |
| Do not love the world, nor the things that are in the world. | Ëmbë rünamë , tani mooko sokë tani mooko sokë tani mooko sokë nyi ë , okkë tani mooko sokë nyi ë tani mooko sokë aken ë . |
Note: The model shows signs of repetition in some outputs, a common phenomenon in low-resource NMT settings. See Limitations for details.
How to Use
Requirements
```bash
pip install torch tokenizers
```
Inference
```python
import json

import torch
from tokenizers import Tokenizer

# Pick the best available device (the model was trained on Apple Silicon / MPS).
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# config.json holds the architecture hyperparameters needed to rebuild the model.
with open("GaloNMT/config.json", "r") as f:
    config = json.load(f)

en_tokenizer = Tokenizer.from_file("GaloNMT/en_tokenizer.json")
galo_tokenizer = Tokenizer.from_file("GaloNMT/galo_tokenizer.json")

PAD_IDX = en_tokenizer.token_to_id("[PAD]")
SOS_IDX = en_tokenizer.token_to_id("[SOS]")
EOS_IDX = en_tokenizer.token_to_id("[EOS]")

# Rebuild the Transformer class from the training code using `config`, then load
# the trained weights from GaloNMT/model.pt (e.g. with torch.load); not shown here.
# model = ...

def translate(sentence, model, max_len=64):
    """Greedily decode a Galo translation of an English sentence."""
    model.eval()
    tokens = [SOS_IDX] + en_tokenizer.encode(sentence).ids + [EOS_IDX]
    src = torch.tensor(tokens).unsqueeze(0).to(device)
    src_mask = (src != PAD_IDX).unsqueeze(1).unsqueeze(2)

    trg_indexes = [SOS_IDX]
    for _ in range(max_len):
        trg_tensor = torch.tensor(trg_indexes).unsqueeze(0).to(device)
        # Causal mask: each target position attends only to earlier positions.
        trg_mask = torch.tril(
            torch.ones((1, 1, len(trg_indexes), len(trg_indexes)), device=device)
        ).bool()
        with torch.no_grad():
            output = model(src, trg_tensor, src_mask, trg_mask)
        pred_token = output.argmax(2)[:, -1].item()
        trg_indexes.append(pred_token)
        if pred_token == EOS_IDX:
            break

    return galo_tokenizer.decode(trg_indexes)
```
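Example usage, once the Transformer has been instantiated and its weights loaded as noted in the comments above:

```python
print(translate("Beloved, I personally am praying for you,", model))
```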
Intended Use
- Primary use: Research and experimentation in low-resource neural machine translation for the Galo language.
- Secondary use: Supporting language documentation and digital preservation efforts for the Galo community.
- Not intended for: Production-grade translation systems, legal or medical translation, or any high-stakes application where translation accuracy is critical.
Limitations
- Small training corpus: The model is trained on only ~7,700 sentence pairs from a single domain (Bible text), which limits its vocabulary coverage and generalization to other domains.
- Repetitive outputs: Due to the low-resource setting and small model size, the decoder occasionally produces repetitive n-grams — a well-known issue in autoregressive NMT.
- Single domain: Performance on out-of-domain text (news, conversational, technical) is expected to be significantly lower than the reported metrics.
- No beam search: The current inference uses greedy decoding; beam search or sampling strategies may improve output quality (see the sketch after this list).
- No back-translation or data augmentation: The model was trained on parallel data only, without synthetic data augmentation techniques.
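As a starting point for the beam-search improvement mentioned above, the sketch below reuses the tokenizers, special-token indices, device, and model from the Inference section. It is illustrative only and omits refinements such as length normalization.

```python
import torch

def beam_translate(sentence, model, beam_size=4, max_len=64):
    """Beam-search decoding sketch; reuses SOS_IDX, EOS_IDX, PAD_IDX, the
    tokenizers, and `device` defined in the Inference section."""
    model.eval()
    tokens = [SOS_IDX] + en_tokenizer.encode(sentence).ids + [EOS_IDX]
    src = torch.tensor(tokens).unsqueeze(0).to(device)
    src_mask = (src != PAD_IDX).unsqueeze(1).unsqueeze(2)

    beams = [([SOS_IDX], 0.0)]                   # (token ids, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == EOS_IDX:               # keep finished hypotheses as-is
                candidates.append((seq, score))
                continue
            trg = torch.tensor(seq).unsqueeze(0).to(device)
            trg_mask = torch.tril(
                torch.ones((1, 1, len(seq), len(seq)), device=device)
            ).bool()
            with torch.no_grad():
                output = model(src, trg, src_mask, trg_mask)
            log_probs = torch.log_softmax(output[0, -1], dim=-1)
            top = torch.topk(log_probs, beam_size)
            for lp, idx in zip(top.values.tolist(), top.indices.tolist()):
                candidates.append((seq + [idx], score + lp))
        # Keep only the highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == EOS_IDX for seq, _ in beams):
            break

    return galo_tokenizer.decode(beams[0][0])
```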
Ethical Considerations
- The training data is derived from publicly available Bible translations. Care should be taken when using the model in culturally sensitive contexts.
- Galo is a language spoken by an indigenous community. Any deployment or public-facing use of this model should involve community consultation and respect for indigenous language rights.
- This model should not be used to generate content that misrepresents the Galo language or culture.
Training Loss Curve
The model trained for 30 epochs with the following loss progression:
| Epoch | Loss | Epoch | Loss | Epoch | Loss |
|---|---|---|---|---|---|
| 1 | 7.0211 | 11 | 5.3566 | 21 | 4.8699 |
| 2 | 6.3616 | 12 | 5.2930 | 22 | 4.8339 |
| 3 | 6.1726 | 13 | 5.2337 | 23 | 4.7986 |
| 4 | 6.0124 | 14 | 5.1815 | 24 | 4.7632 |
| 5 | 5.8844 | 15 | 5.1299 | 25 | 4.7345 |
| 6 | 5.7708 | 16 | 5.0777 | 26 | 4.7034 |
| 7 | 5.6739 | 17 | 5.0343 | 27 | 4.6699 |
| 8 | 5.5823 | 18 | 4.9872 | 28 | 4.6412 |
| 9 | 5.5018 | 19 | 4.9482 | 29 | 4.6122 |
| 10 | 5.4271 | 20 | 4.9081 | 30 | 4.5867 |
Model Files
```
GaloNMT/
├── config.json          # Model architecture configuration
├── model.pt             # Trained model weights (~34.7 MB)
├── en_tokenizer.json    # English BPE tokenizer
├── galo_tokenizer.json  # Galo BPE tokenizer
└── README.md            # This model card
```
Citation
If you use this model in your research, please cite:
```bibtex
@misc{galonmt2026,
  title        = {GaloNMT: Neural Machine Translation for English to Galo},
  author       = {Jurist Dupit},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/GaloNMT}},
  note         = {Vanilla Transformer trained on the Galo Bible Parallel Corpus},
  institution  = {Rajiv Gandhi University, Rono Hills, Doimukh}
}
```
Acknowledgements
This work contributes to the digital preservation and computational linguistic support for the Galo language. We thank the Galo-speaking community for the linguistic resources that made this project possible.