Tibetan Script Classifier (DINOv3)

This repository contains Tibetan script classification checkpoints for 18 classes, fine-tuned from the DINOv3 ViT-S backbone:

  • Backbone: facebook/dinov3-vits16-pretrain-lvd1689m
  • Task: 18-way script classification
  • Training script included: finetune_dinov3.py

Hugging Face access: DINOv3 requires access approval at huggingface.co/facebook/dinov3-vits16-pretrain-lvd1689m before from_pretrained / downloads will work. Anyone cloning this repo will see the same gated-model error until their HF account is granted access and they are logged in (huggingface-cli login or HF_TOKEN).

Label Set

dhumri, difficult, drathung, drudring, druring, druthung, khyuyig, multi_scripts, non_tibetan, peri, petsuk, trinyig, tsegdrig, tsugchung, tsumachug, uchen_sugdring, uchen_sugthung, yigchung.
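For reference, a minimal sketch of an index mapping over these labels, assuming they are taken in the alphabetical order listed above (the authoritative mapping is whatever is stored in each checkpoint's label_to_idx; always prefer that):

```python
# Hypothetical sketch only: the real mapping ships inside each checkpoint
# payload as payload["label_to_idx"] and should be read from there.
LABELS = [
    "dhumri", "difficult", "drathung", "drudring", "druring", "druthung",
    "khyuyig", "multi_scripts", "non_tibetan", "peri", "petsuk", "trinyig",
    "tsegdrig", "tsugchung", "tsumachug", "uchen_sugdring", "uchen_sugthung",
    "yigchung",
]
label_to_idx = {name: i for i, name in enumerate(LABELS)}
```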

Preprocessing (per experiment)

Images for training were produced as follows (see preprocess.py in the parent project):

  • whole_page: resize so the short edge is 224 px, then center crop to 224×224 (one crop per source page).
  • patches_color: same short-edge resize to 224, then sliding-window 224×224 patches with 25% overlap between windows (multiple crops per page).
  • patches_clahe: identical patch layout as patches_color; each patch is converted to grayscale and CLAHE contrast normalization is applied (clipLimit=2.0, tileGridSize=(8,8)), then saved as BGR/RGB for training.
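A 25% overlap between 224 px windows implies a stride of 168 px along each axis. A minimal sketch of the window-origin computation, assuming a final window is shifted back to touch the image edge (patch_origins is an illustrative name; the actual logic lives in preprocess.py and may handle edges differently):

```python
def patch_origins(length, patch=224, overlap=0.25):
    """Top-left origins of sliding windows along one image axis."""
    stride = int(patch * (1 - overlap))  # 224 * 0.75 = 168 px
    origins = list(range(0, max(length - patch, 0) + 1, stride))
    if origins[-1] + patch < length:
        # Shift one final window back so the image edge is covered.
        origins.append(length - patch)
    return origins

# A 560 px edge yields three windows at 0, 168 and 336.
print(patch_origins(560))  # → [0, 168, 336]
```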

Training recipe

  • Progressive unfreezing (defaults in finetune_dinov3.py):
    • Stage A — head only: 20 epochs, backbone frozen, classifier head LR 1e-3 (backbone LR 0).
    • Stage B — last 2 blocks: 10 epochs, backbone LR 1e-5, head LR 1e-3.
    • Stage C — last 4 blocks: 10 epochs, backbone LR 5e-6, head LR 5e-4.
  • Loss: class-weighted cross-entropy with inverse-frequency weights over the training split (nn.CrossEntropyLoss(weight=...)).
  • Sampling: the published runs use a standard DataLoader with shuffle=True. The script also provides a get_weighted_sampler helper (built on WeightedRandomSampler) if you want to switch the train loader to explicit class-balanced sampling.
  • Document-aware augmentations (train only): RandomRotation ±5° (fill white), ColorJitter brightness/contrast ±20% (0.2), plus RandomResizedCrop and light RandomErasing as in ScriptDataset; no horizontal flip.
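The inverse-frequency class weighting above can be sketched as follows (inverse_frequency_weights is a hypothetical helper; the exact normalization used in finetune_dinov3.py may differ):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by 1 / its frequency, scaled so weights average to 1."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * counts[cls]) for cls in counts}

# A 3:1 imbalanced toy split: the rare class gets three times the weight.
weights = inverse_frequency_weights(["a", "a", "a", "b"])
```

The resulting per-class weights would then be ordered by class index and passed to nn.CrossEntropyLoss(weight=...).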

Class distribution (whole_page split totals)

The whole-page split has 5,684 samples in total (train 3,996 / val 844 / test 844). 88 benchmark images (5 per class) were excluded from training and evaluation.

| Class | Samples |
| --- | --- |
| dhumri | 98 |
| difficult | 170 |
| drathung | 129 |
| drudring | 132 |
| druring | 119 |
| druthung | 207 |
| khyuyig | 113 |
| multi_scripts | 235 |
| non_tibetan | 192 |
| peri | 614 |
| petsuk | 1388 |
| trinyig | 42 |
| tsegdrig | 749 |
| tsugchung | 77 |
| tsumachug | 178 |
| uchen_sugdring | 835 |
| uchen_sugthung | 240 |
| yigchung | 166 |

Experiments Included

1) whole_page

  • Files: whole_page/final_model.pt, results.json, confusion_matrix.png, confusion_matrix.csv, splits.json
  • Test (image-level) macro-F1: 0.5124
  • Test accuracy: 0.5711

2) patches_color

  • Files: patches_color/final_model.pt, results.json, confusion_matrix.png, checkpoint_page_eval.json, splits.json
  • Test (patch-level) macro-F1: 0.4899
  • Re-eval page-level macro-F1 for shipped final_model.pt (checkpoint_page_eval.json): 0.5017
  • Best page-level macro-F1 among stage checkpoints on the same grid: 0.5043 (Stage A)

3) patches_clahe

  • Files: patches_clahe/final_model.pt, results.json, confusion_matrix.png, checkpoint_page_eval.json, splits.json
  • Test (patch-level) macro-F1: 0.4911
  • Re-eval page-level macro-F1 for shipped final_model.pt: 0.5261
  • Best page-level macro-F1 among stage checkpoints: 0.529 (Stage B)

Which stage produced which checkpoint?

  • final_model.pt in each folder is the stage with the highest validation macro-F1 among best_stage_*.pt checkpoints (see best_val_checkpoint in each results.json): Stage B for whole_page, Stage C for both patches_color and patches_clahe.
  • For page-level quality on the patch runs, the best single stage on the re-eval grid differs: Stage A (patches_color) and Stage B (patches_clahe) beat their respective final_model.pt page scores. Use checkpoint_page_eval.json if you want to deploy a stage checkpoint instead of the validation-selected default.

Which experiment won?

CLAHE patches achieved the highest page-level macro-F1 (0.529 with its best stage checkpoint), while whole_page achieved the best image-level macro-F1 (0.512). whole_page is recommended for production because its single-crop inference is simpler.

How To Load a Checkpoint

import torch
from pathlib import Path
from finetune_dinov3 import DINOv3Classifier, DINOV3_MODEL_ID

ckpt_path = Path("whole_page/final_model.pt")
payload = torch.load(ckpt_path, map_location="cpu")

label_to_idx = payload["label_to_idx"]
idx_to_label = {v: k for k, v in label_to_idx.items()}
num_classes = len(label_to_idx)

model = DINOv3Classifier(DINOV3_MODEL_ID, num_classes)
model.load_state_dict(payload["model_state_dict"])
model.eval()

Inference (Single Image)

import torch
from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained(DINOV3_MODEL_ID)
img = Image.open("example.png").convert("RGB")
inputs = processor(images=img, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs["pixel_values"])
    probs = torch.softmax(logits, dim=1)[0].cpu().numpy()
pred_idx = int(probs.argmax())
pred_label = idx_to_label[pred_idx]
print(pred_label, float(probs[pred_idx]))

Page-Level Inference (Patch Aggregation)

For patch experiments (patches_color, patches_clahe), aggregate by page stem:

  1. group patch probabilities by page ID (strip _pN suffix),
  2. average probabilities per page,
  3. take argmax of averaged probabilities.

This is the same logic used in the re-evaluation script output (checkpoint_page_eval.json).
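The three steps can be sketched in pure Python (aggregate_pages is an illustrative name; the shipped re-evaluation script may differ in details such as suffix parsing):

```python
import re
from collections import defaultdict

def aggregate_pages(patch_probs):
    """patch_probs: {patch_stem: probability vector}, e.g. {"page12_p3": [...]}.
    Returns {page_id: predicted class index} via mean-probability voting."""
    by_page = defaultdict(list)
    for stem, probs in patch_probs.items():
        page_id = re.sub(r"_p\d+$", "", stem)  # strip the _pN patch suffix
        by_page[page_id].append(probs)
    preds = {}
    for page_id, vecs in by_page.items():
        avg = [sum(col) / len(vecs) for col in zip(*vecs)]  # mean per class
        preds[page_id] = max(range(len(avg)), key=avg.__getitem__)  # argmax
    return preds
```

Averaging probabilities (rather than majority-voting patch labels) lets confident patches outweigh ambiguous ones.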

Known Limitations

  • Class imbalance is high (for example petsuk and uchen_sugdring dominate, while trinyig is small).
  • Results can vary by preprocessing variant and by patch vs page-level evaluation protocol.
  • Patch-level metrics and page-level metrics are not directly interchangeable.
  • The model expects Tibetan manuscript-style inputs; performance can drop on out-of-domain scans or mixed/noisy pages.
  • Checkpoints are tied to the exact label mapping saved in each payload (label_to_idx).

Reproducibility Notes

  • Exclusion manifest support is enabled in training (benchmark_page_ids.json).
  • Full training code used for these artifacts is included at finetune_dinov3.py.
Model repository: karma689/tibetan-script-classifier