Tibetan Script Classifier (DINOv3)

This repository contains Tibetan script classification checkpoints for 18 classes, fine-tuned from the DINOv3 ViT-S backbone:

  • Backbone: facebook/dinov3-vits16-pretrain-lvd1689m
  • Task: 18-way script classification
  • Training script included: finetune_dinov3.py

Hugging Face access: DINOv3 requires access approval at huggingface.co/facebook/dinov3-vits16-pretrain-lvd1689m before from_pretrained / downloads will work. Anyone cloning this repo will see the same gated-model error until their HF account is granted access and they are logged in (huggingface-cli login or HF_TOKEN).

Label Set

dhumri, difficult, drathung, drudring, druring, druthung, khyuyig, multi_scripts, non_tibetan, peri, petsuk, trinyig, tsegdrig, tsugchung, tsumachug, uchen_sugdring, uchen_sugthung, yigchung.
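For reference, a minimal sketch of an index mapping over these labels, assuming they are taken in the alphabetical order listed above (the authoritative mapping is whatever is stored in each checkpoint's label_to_idx; always prefer that):

```python
# Hypothetical sketch only: the real mapping ships inside each checkpoint
# payload as payload["label_to_idx"] and should be read from there.
LABELS = [
    "dhumri", "difficult", "drathung", "drudring", "druring", "druthung",
    "khyuyig", "multi_scripts", "non_tibetan", "peri", "petsuk", "trinyig",
    "tsegdrig", "tsugchung", "tsumachug", "uchen_sugdring", "uchen_sugthung",
    "yigchung",
]
label_to_idx = {name: i for i, name in enumerate(LABELS)}
```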

Preprocessing (per experiment)

Images for training were produced as follows (see preprocess.py in the parent project):

  • whole_page: resize so the short edge is 224 px, then center crop to 224×224 (one crop per source page).
  • patches_color: same short-edge resize to 224, then sliding-window 224×224 patches with 25% overlap between windows (multiple crops per page).
  • patches_clahe: identical patch layout as patches_color; each patch is converted to grayscale and CLAHE contrast normalization is applied (clipLimit=2.0, tileGridSize=(8,8)), then saved as BGR/RGB for training.
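A 25% overlap between 224 px windows implies a stride of 168 px along each axis. A minimal sketch of the window-origin computation, assuming a final window is shifted back to touch the image edge (patch_origins is an illustrative name; the actual logic lives in preprocess.py and may handle edges differently):

```python
def patch_origins(length, patch=224, overlap=0.25):
    """Top-left origins of sliding windows along one image axis."""
    stride = int(patch * (1 - overlap))  # 224 * 0.75 = 168 px
    origins = list(range(0, max(length - patch, 0) + 1, stride))
    if origins[-1] + patch < length:
        # Shift one final window back so the image edge is covered.
        origins.append(length - patch)
    return origins

# A 560 px edge yields three windows at 0, 168 and 336.
print(patch_origins(560))  # → [0, 168, 336]
```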

Training recipe

  • Progressive unfreezing (defaults in finetune_dinov3.py):
    • Stage A — head only: 20 epochs, backbone frozen, classifier head LR 1e-3 (backbone LR 0).
    • Stage B — last 2 blocks: 10 epochs, backbone LR 1e-5, head LR 1e-3.
    • Stage C — last 4 blocks: 10 epochs, backbone LR 5e-6, head LR 5e-4.
  • Loss: class-weighted cross-entropy with inverse-frequency weights over the training split (nn.CrossEntropyLoss(weight=...)).
  • Sampling: the published runs use a standard DataLoader with shuffle=True. The script also provides a get_weighted_sampler helper (built on WeightedRandomSampler) if you want to switch the train loader to explicit class-balanced sampling.
  • Document-aware augmentations (train only): RandomRotation ±5° (fill white), ColorJitter brightness/contrast ±20% (0.2), plus RandomResizedCrop and light RandomErasing as in ScriptDataset; no horizontal flip.
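The inverse-frequency class weighting above can be sketched as follows (inverse_frequency_weights is a hypothetical helper; the exact normalization used in finetune_dinov3.py may differ):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by 1 / its frequency, scaled so weights average to 1."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * counts[cls]) for cls in counts}

# A 3:1 imbalanced toy split: the rare class gets three times the weight.
weights = inverse_frequency_weights(["a", "a", "a", "b"])
```

The resulting per-class weights would then be ordered by class index and passed to nn.CrossEntropyLoss(weight=...).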

Class distribution (whole_page split totals)

The whole-page split has 5,684 samples in total (train 3,996 / val 844 / test 844). 88 benchmark images (5 per class) were excluded from training and evaluation.

| Class | Samples |
| --- | --- |
| dhumri | 98 |
| difficult | 170 |
| drathung | 129 |
| drudring | 132 |
| druring | 119 |
| druthung | 207 |
| khyuyig | 113 |
| multi_scripts | 235 |
| non_tibetan | 192 |
| peri | 614 |
| petsuk | 1388 |
| trinyig | 42 |
| tsegdrig | 749 |
| tsugchung | 77 |
| tsumachug | 178 |
| uchen_sugdring | 835 |
| uchen_sugthung | 240 |
| yigchung | 166 |

Experiments Included

1) whole_page

  • Files: whole_page/final_model.pt, results.json, confusion_matrix.png, confusion_matrix.csv, splits.json
  • Test (image-level) macro-F1: 0.5124
  • Test accuracy: 0.5711

2) patches_color

  • Files: patches_color/final_model.pt, results.json, confusion_matrix.png, checkpoint_page_eval.json, splits.json
  • Test (patch-level) macro-F1: 0.4899
  • Re-eval page-level macro-F1 for shipped final_model.pt (checkpoint_page_eval.json): 0.5017
  • Best page-level macro-F1 among stage checkpoints on the same grid: 0.5043 (Stage A)

3) patches_clahe

  • Files: patches_clahe/final_model.pt, results.json, confusion_matrix.png, checkpoint_page_eval.json, splits.json
  • Test (patch-level) macro-F1: 0.4911
  • Re-eval page-level macro-F1 for shipped final_model.pt: 0.5261
  • Best page-level macro-F1 among stage checkpoints: 0.529 (Stage B)

Which stage produced which checkpoint?

  • final_model.pt in each folder is the stage with the highest validation macro-F1 among best_stage_*.pt checkpoints (see best_val_checkpoint in each results.json): Stage B for whole_page, Stage C for both patches_color and patches_clahe.
  • For page-level quality on the patch runs, the best single stage on the re-eval grid differs: Stage A (patches_color) and Stage B (patches_clahe) beat their respective final_model.pt page scores. Use checkpoint_page_eval.json if you want to deploy a stage checkpoint instead of the validation-selected default.

Which experiment won?

CLAHE patches achieved the highest page-level macro-F1 (0.529 with its best stage checkpoint), while whole_page achieved the best image-level macro-F1 (0.512). whole_page is recommended for production because its single-crop inference is simpler.

How To Load a Checkpoint

import torch
from pathlib import Path
from finetune_dinov3 import DINOv3Classifier, DINOV3_MODEL_ID

ckpt_path = Path("whole_page/final_model.pt")
payload = torch.load(ckpt_path, map_location="cpu")

label_to_idx = payload["label_to_idx"]
idx_to_label = {v: k for k, v in label_to_idx.items()}
num_classes = len(label_to_idx)

model = DINOv3Classifier(DINOV3_MODEL_ID, num_classes)
model.load_state_dict(payload["model_state_dict"])
model.eval()

Inference (Single Image)

import torch
from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained(DINOV3_MODEL_ID)
img = Image.open("example.png").convert("RGB")
inputs = processor(images=img, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs["pixel_values"])
    probs = torch.softmax(logits, dim=1)[0].cpu().numpy()
pred_idx = int(probs.argmax())
pred_label = idx_to_label[pred_idx]
print(pred_label, float(probs[pred_idx]))

Page-Level Inference (Patch Aggregation)

For patch experiments (patches_color, patches_clahe), aggregate by page stem:

  1. group patch probabilities by page ID (strip _pN suffix),
  2. average probabilities per page,
  3. take argmax of averaged probabilities.

This is the same logic used in the re-evaluation script output (checkpoint_page_eval.json).
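The three steps can be sketched in pure Python (aggregate_pages is an illustrative name; the shipped re-evaluation script may differ in details such as suffix parsing):

```python
import re
from collections import defaultdict

def aggregate_pages(patch_probs):
    """patch_probs: {patch_stem: probability vector}, e.g. {"page12_p3": [...]}.
    Returns {page_id: predicted class index} via mean-probability voting."""
    by_page = defaultdict(list)
    for stem, probs in patch_probs.items():
        page_id = re.sub(r"_p\d+$", "", stem)  # strip the _pN patch suffix
        by_page[page_id].append(probs)
    preds = {}
    for page_id, vecs in by_page.items():
        avg = [sum(col) / len(vecs) for col in zip(*vecs)]  # mean per class
        preds[page_id] = max(range(len(avg)), key=avg.__getitem__)  # argmax
    return preds
```

Averaging probabilities (rather than majority-voting patch labels) lets confident patches outweigh ambiguous ones.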

Known Limitations

  • Class imbalance is high (for example petsuk and uchen_sugdring dominate, while trinyig is small).
  • Results can vary by preprocessing variant and by patch vs page-level evaluation protocol.
  • Patch-level metrics and page-level metrics are not directly interchangeable.
  • The model expects Tibetan manuscript-style inputs; performance can drop on out-of-domain scans or mixed/noisy pages.
  • Checkpoints are tied to the exact label mapping saved in each payload (label_to_idx).

Reproducibility Notes

  • Exclusion manifest support is enabled in training (benchmark_page_ids.json).
  • Full training code used for these artifacts is included at finetune_dinov3.py.
Model repository: karma689/tibetan-script-classifier