
GaloNMT — English → Galo Neural Machine Translation

GaloNMT is a vanilla Transformer-based neural machine translation model that translates English text into Galo, a Tibeto-Burman language spoken by the Galo community in Arunachal Pradesh, India. Galo is classified as a low-resource language with very limited digital representation, making this one of the first dedicated NMT systems for the language.

Model Details

| Property | Value |
|---|---|
| Architecture | Vanilla Transformer (from scratch) |
| Translation Direction | English → Galo |
| Framework | PyTorch |
| Model Size | ~34.7 MB (`model.pt`) |
| Tokenizer | Byte-Pair Encoding (BPE) via HuggingFace `tokenizers` |
| Source Vocab Size | 5,000 |
| Target Vocab Size | 5,000 |

Architecture Hyperparameters

| Hyperparameter | Value |
|---|---|
| `d_model` | 128 |
| `n_heads` | 4 |
| `n_layers` | 2 |
| `d_ff` | 256 |
| `dropout` | 0.3 |
| `max_seq_length` | 64 |
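
For scale, a comparable configuration can be sketched with PyTorch's built-in nn.Transformer. This is illustrative only: the released model is a from-scratch implementation, and the embedding/projection names below are assumptions, not part of the released code.

import torch.nn as nn

# Illustrative stand-in for the from-scratch model at the same parameter scale.
model = nn.Transformer(
    d_model=128,
    nhead=4,
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=256,
    dropout=0.3,
    batch_first=True,
)
src_embed = nn.Embedding(5000, 128)  # English vocabulary
tgt_embed = nn.Embedding(5000, 128)  # Galo vocabulary
generator = nn.Linear(128, 5000)     # decoder states -> Galo vocab logits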

Training Configuration

| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Learning Rate | 1e-4 |
| Batch Size | 16 |
| Epochs | 30 |
| Loss Function | CrossEntropyLoss (ignoring PAD) |
| Hardware | Apple M4 Silicon (MPS) |
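
A minimal training step consistent with this configuration might look like the sketch below; model, train_loader, and PAD_IDX are assumed names, and the forward signature is simplified (the from-scratch model also takes attention masks).

import torch
import torch.nn as nn

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)  # PAD excluded from the loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(30):
    model.train()
    for src, trg in train_loader:  # batches of 16 padded token-id tensors
        src, trg = src.to(device), trg.to(device)
        optimizer.zero_grad()
        # Teacher forcing: condition on trg[:, :-1], predict trg[:, 1:]
        output = model(src, trg[:, :-1])
        loss = criterion(output.reshape(-1, output.size(-1)),
                         trg[:, 1:].reshape(-1))
        loss.backward()
        optimizer.step()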

Training Data

The model was trained on the Galo Bible Parallel Corpus, a sentence-aligned English–Galo parallel corpus derived from Bible translations.

| Split | Sentences |
|---|---|
| Train | 6,144 |
| Validation | 768 |
| Test | 768 |
| Total | 7,680 |

The dataset was split using an 80:10:10 ratio (train/validation/test) with a fixed random seed of 42 for reproducibility.
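
The split itself is straightforward to reproduce; pairs below is an assumed name for the list of aligned (English, Galo) sentence pairs.

import random

random.seed(42)        # fixed seed from this card
random.shuffle(pairs)
n = len(pairs)         # 7,680 sentence pairs
train = pairs[:int(0.8 * n)]              # 6,144
val   = pairs[int(0.8 * n):int(0.9 * n)]  # 768
test  = pairs[int(0.9 * n):]              # 768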

Evaluation Results

Evaluation was performed on 100 randomly sampled sentences from the held-out test set using SacreBLEU.

| Metric | Score |
|---|---|
| BLEU | 16.61 |
| chrF | 15.26 |
| TER | 150.04 |
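
These scores can be reproduced with the sacrebleu Python API; hypotheses and references below are assumed names for the 100 model outputs and their gold Galo translations.

import sacrebleu

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
ter  = sacrebleu.corpus_ter(hypotheses, [references])
print(f"BLEU {bleu.score:.2f}  chrF {chrf.score:.2f}  TER {ter.score:.2f}")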

Sample Translations

| English Input | Galo Output |
|---|---|
| The elder to Gaius the beloved, | Yo lëga ëmrëm nyi gaddë nyi gaddë , yo go mendudü ëgum nyi gaddë yo go mendudü dü ? |
| Beloved, I personally am praying for you, | Ngo nonnuëm mendu , ngo nonnuëm mendu , ngo nonnuëm mendu , |
| Do not love the world, nor the things that are in the world. | Ëmbë rünamë , tani mooko sokë tani mooko sokë tani mooko sokë nyi ë , okkë tani mooko sokë nyi ë tani mooko sokë aken ë . |

Note: The model shows signs of repetition in some outputs, a common phenomenon in low-resource NMT settings. See Limitations for details.

How to Use

Requirements

pip install torch tokenizers

Inference

import torch
import json
from tokenizers import Tokenizer

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

with open("GaloNMT/config.json", "r") as f:
    config = json.load(f)

en_tokenizer = Tokenizer.from_file("GaloNMT/en_tokenizer.json")
galo_tokenizer = Tokenizer.from_file("GaloNMT/galo_tokenizer.json")

# Special-token ids; the two tokenizers are assumed to share the same ids
PAD_IDX = en_tokenizer.token_to_id("[PAD]")
SOS_IDX = en_tokenizer.token_to_id("[SOS]")
EOS_IDX = en_tokenizer.token_to_id("[EOS]")

# Build the Transformer from `config` and restore the trained weights before
# translating; the model class ships with the training code and is not shown here:
# model.load_state_dict(torch.load("GaloNMT/model.pt", map_location=device))

def translate(sentence, model, max_len=64):
    """Greedy-decode a single English sentence into Galo."""
    model.eval()
    tokens = [SOS_IDX] + en_tokenizer.encode(sentence).ids + [EOS_IDX]
    src = torch.tensor(tokens).unsqueeze(0).to(device)
    src_mask = (src != PAD_IDX).unsqueeze(1).unsqueeze(2)  # hide padding positions

    trg_indexes = [SOS_IDX]
    for _ in range(max_len):
        trg_tensor = torch.tensor(trg_indexes).unsqueeze(0).to(device)
        # Causal mask: each target position attends only to earlier positions
        trg_mask = torch.tril(
            torch.ones((1, 1, len(trg_indexes), len(trg_indexes)), device=device)
        ).bool()
        with torch.no_grad():
            output = model(src, trg_tensor, src_mask, trg_mask)
        pred_token = output.argmax(2)[:, -1].item()  # greedy: most likely next token
        trg_indexes.append(pred_token)
        if pred_token == EOS_IDX:
            break

    return galo_tokenizer.decode(trg_indexes)  # special tokens are skipped by default
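
Once the model object has been restored, translation is a single call:

print(translate("Do not love the world.", model))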

Intended Use

  • Primary use: Research and experimentation in low-resource neural machine translation for the Galo language.
  • Secondary use: Supporting language documentation and digital preservation efforts for the Galo community.
  • Not intended for: Production-grade translation systems, legal or medical translation, or any high-stakes application where translation accuracy is critical.

Limitations

  • Small training corpus: The model is trained on only ~7,700 sentence pairs from a single domain (Bible text), which limits its vocabulary coverage and generalization to other domains.
  • Repetitive outputs: Due to the low-resource setting and small model size, the decoder occasionally produces repetitive n-grams — a well-known issue in autoregressive NMT.
  • Single domain: Performance on out-of-domain text (news, conversational, technical) is expected to be significantly lower than the reported metrics.
  • No beam search: The current inference uses greedy decoding. Beam search or sampling strategies may improve output quality; a minimal beam-search sketch follows this list.
  • No back-translation or data augmentation: The model was trained on parallel data only, without synthetic data augmentation techniques.
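
As referenced above, a beam-search decoder can be sketched by extending the greedy loop; translate_beam and beam_size are assumed names, and length normalization is omitted for brevity.

def translate_beam(sentence, model, beam_size=4, max_len=64):
    """Beam-search variant of translate(); keeps the beam_size best partial outputs."""
    model.eval()
    tokens = [SOS_IDX] + en_tokenizer.encode(sentence).ids + [EOS_IDX]
    src = torch.tensor(tokens).unsqueeze(0).to(device)
    src_mask = (src != PAD_IDX).unsqueeze(1).unsqueeze(2)

    beams = [([SOS_IDX], 0.0)]  # (token ids, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == EOS_IDX:  # finished hypotheses carry over unchanged
                candidates.append((seq, score))
                continue
            trg = torch.tensor(seq).unsqueeze(0).to(device)
            trg_mask = torch.tril(
                torch.ones((1, 1, len(seq), len(seq)), device=device)
            ).bool()
            with torch.no_grad():
                output = model(src, trg, src_mask, trg_mask)
            log_probs = torch.log_softmax(output[0, -1], dim=-1)
            top_lp, top_ids = log_probs.topk(beam_size)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((seq + [idx], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == EOS_IDX for seq, _ in beams):
            break
    return galo_tokenizer.decode(beams[0][0])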

Ethical Considerations

  • The training data is derived from publicly available Bible translations. Care should be taken when using the model in culturally sensitive contexts.
  • Galo is a language spoken by an indigenous community. Any deployment or public-facing use of this model should involve community consultation and respect for indigenous language rights.
  • This model should not be used to generate content that misrepresents the Galo language or culture.

Training Loss Curve

The model trained for 30 epochs with the following loss progression:

| Epoch | Loss | Epoch | Loss | Epoch | Loss |
|---|---|---|---|---|---|
| 1 | 7.0211 | 11 | 5.3566 | 21 | 4.8699 |
| 2 | 6.3616 | 12 | 5.2930 | 22 | 4.8339 |
| 3 | 6.1726 | 13 | 5.2337 | 23 | 4.7986 |
| 4 | 6.0124 | 14 | 5.1815 | 24 | 4.7632 |
| 5 | 5.8844 | 15 | 5.1299 | 25 | 4.7345 |
| 6 | 5.7708 | 16 | 5.0777 | 26 | 4.7034 |
| 7 | 5.6739 | 17 | 5.0343 | 27 | 4.6699 |
| 8 | 5.5823 | 18 | 4.9872 | 28 | 4.6412 |
| 9 | 5.5018 | 19 | 4.9482 | 29 | 4.6122 |
| 10 | 5.4271 | 20 | 4.9081 | 30 | 4.5867 |

Model Files

GaloNMT/
├── config.json            # Model architecture configuration
├── model.pt               # Trained model weights (~34.7 MB)
├── en_tokenizer.json      # English BPE tokenizer
├── galo_tokenizer.json    # Galo BPE tokenizer
└── README.md              # This model card

Citation

If you use this model in your research, please cite:

@misc{galonmt2026,
  title        = {GaloNMT: Neural Machine Translation for English to Galo},
  author       = {Jurist Dupit},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/GaloNMT}},
  note         = {Vanilla Transformer trained on the Galo Bible Parallel Corpus},
  institution  = {Rajiv Gandhi University, Rono Hills, Doimukh}
}

Acknowledgements

This work contributes to the digital preservation and computational linguistic support for the Galo language. We thank the Galo-speaking community for the linguistic resources that made this project possible.
