# Tibetan Script Classifier (DINOv3)
This repository contains fine-tuned Tibetan script classification checkpoints for 18 classes, trained from the DINOv3 ViT-S backbone:

- Backbone: `facebook/dinov3-vits16-pretrain-lvd1689m`
- Task: 18-way script classification
- Training script included: `finetune_dinov3.py`
**Hugging Face access:** DINOv3 is a gated model; request access at huggingface.co/facebook/dinov3-vits16-pretrain-lvd1689m before `from_pretrained` / downloads will work. Anyone cloning this repo will see the same gated-model error until their HF account is granted access and they are logged in (`huggingface-cli login` or `HF_TOKEN`).
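A minimal sketch of authenticating from Python before touching the gated backbone (assumes a valid token is available in the `HF_TOKEN` environment variable):

```python
import os

from huggingface_hub import login

# Equivalent to running `huggingface-cli login` once in the shell.
login(token=os.environ["HF_TOKEN"])
```

Once access has been granted, subsequent `from_pretrained` calls for the backbone succeed with this session's credentials.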
## Label Set
dhumri, difficult, drathung, drudring, druring, druthung, khyuyig, multi_scripts, non_tibetan, peri, petsuk, trinyig, tsegdrig, tsugchung, tsumachug, uchen_sugdring, uchen_sugthung, yigchung.
## Preprocessing (per experiment)

Images for training were produced as follows (see `preprocess.py` in the parent project; a sketch follows the list):

- `whole_page`: resize so the short edge is 224 px, then center crop to 224×224 (one crop per source page).
- `patches_color`: same short-edge resize to 224, then sliding-window 224×224 patches with 25% overlap between windows (multiple crops per page).
- `patches_clahe`: identical patch layout to `patches_color`; each patch is converted to grayscale, CLAHE contrast normalization is applied (`clipLimit=2.0`, `tileGridSize=(8, 8)`), and the result is saved as BGR/RGB for training.
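A minimal sketch of the patch variants using OpenCV; the helper names here are illustrative, not the actual `preprocess.py` API, and the real script may handle the trailing image edge differently:

```python
import cv2
import numpy as np

PATCH = 224
STRIDE = int(PATCH * 0.75)  # 25% overlap between adjacent windows

def resize_short_edge(img: np.ndarray, target: int = PATCH) -> np.ndarray:
    """Resize so the shorter side equals `target`, preserving aspect ratio."""
    h, w = img.shape[:2]
    scale = target / min(h, w)
    return cv2.resize(img, (round(w * scale), round(h * scale)))

def sliding_patches(img: np.ndarray):
    """Yield 224x224 windows with 25% overlap (the patches_color layout)."""
    h, w = img.shape[:2]
    for y in range(0, max(h - PATCH, 0) + 1, STRIDE):
        for x in range(0, max(w - PATCH, 0) + 1, STRIDE):
            yield img[y:y + PATCH, x:x + PATCH]

def apply_clahe(patch: np.ndarray) -> np.ndarray:
    """Grayscale + CLAHE (clipLimit=2.0, tileGridSize=(8, 8)), back to 3 channels."""
    gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return cv2.cvtColor(clahe.apply(gray), cv2.COLOR_GRAY2BGR)

page = resize_short_edge(cv2.imread("page.jpg"))
color_patches = list(sliding_patches(page))               # patches_color
clahe_patches = [apply_clahe(p) for p in color_patches]   # patches_clahe
```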
## Training recipe

- Progressive unfreezing (defaults in `finetune_dinov3.py`; see the sketch after this list):
  - Stage A, head only: 20 epochs, backbone frozen, classifier head LR 1e-3 (backbone LR 0).
  - Stage B, last 2 blocks: 10 epochs, backbone LR 1e-5, head LR 1e-3.
  - Stage C, last 4 blocks: 10 epochs, backbone LR 5e-6, head LR 5e-4.
- Loss: class-weighted cross-entropy with inverse-frequency weights over the training split (`nn.CrossEntropyLoss(weight=...)`).
- Sampling: the published runs use a standard `DataLoader` with `shuffle=True`. The script also defines `get_weighted_sampler` → `WeightedRandomSampler` if you want to switch the train loader to explicit class-balanced sampling.
- Document-aware augmentations (train only): `RandomRotation` ±5° (fill white), `ColorJitter` brightness/contrast ±20% (0.2), plus `RandomResizedCrop` and light `RandomErasing` as in `ScriptDataset`; no horizontal flip.
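A minimal sketch of the staged unfreezing, loss weighting, and optional sampler described above. The attribute paths `model.backbone.encoder.layer` and `model.classifier` are assumptions about `DINOv3Classifier` internals, and the labels are dummy data:

```python
from collections import Counter

import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

def make_stage_optimizer(model, num_unfrozen_blocks, backbone_lr, head_lr):
    """Freeze the whole backbone, then unfreeze only its last N blocks."""
    for p in model.backbone.parameters():
        p.requires_grad = False
    if num_unfrozen_blocks > 0:
        blocks = model.backbone.encoder.layer  # assumed attribute path
        for block in blocks[-num_unfrozen_blocks:]:
            for p in block.parameters():
                p.requires_grad = True
    return torch.optim.AdamW([
        {"params": [p for p in model.backbone.parameters() if p.requires_grad],
         "lr": backbone_lr},
        {"params": model.classifier.parameters(), "lr": head_lr},
    ])

# Stage A/B/C: (unfrozen blocks, backbone LR, head LR); epochs are 20/10/10.
stages = [(0, 0.0, 1e-3), (2, 1e-5, 1e-3), (4, 5e-6, 5e-4)]

# Inverse-frequency class weights for the weighted cross-entropy loss.
train_labels = [0, 0, 0, 1, 2, 2]  # dummy; real labels come from the train split
counts = Counter(train_labels)
weights = torch.tensor([1.0 / counts[c] for c in range(len(counts))])
criterion = nn.CrossEntropyLoss(weight=weights)

# Optional class-balanced sampling (mirrors get_weighted_sampler).
sample_weights = [1.0 / counts[y] for y in train_labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights),
                                replacement=True)
```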
## Class distribution (whole_page split totals)
The whole-page split has 5,684 samples in total (train 3,996 / val 844 / test 844). 88 benchmark images (5 per class) were excluded from training and evaluation.
| Class | Samples |
|---|---|
| dhumri | 98 |
| difficult | 170 |
| drathung | 129 |
| drudring | 132 |
| druring | 119 |
| druthung | 207 |
| khyuyig | 113 |
| multi_scripts | 235 |
| non_tibetan | 192 |
| peri | 614 |
| petsuk | 1388 |
| trinyig | 42 |
| tsegdrig | 749 |
| tsugchung | 77 |
| tsumachug | 178 |
| uchen_sugdring | 835 |
| uchen_sugthung | 240 |
| yigchung | 166 |
## Experiments Included

### 1) whole_page

- Files: `whole_page/final_model.pt`, `results.json`, `confusion_matrix.png`, `confusion_matrix.csv`, `splits.json`
- Test (image-level) macro-F1: 0.5124
- Test accuracy: 0.5711
### 2) patches_color

- Files: `patches_color/final_model.pt`, `results.json`, `confusion_matrix.png`, `checkpoint_page_eval.json`, `splits.json`
- Test (patch-level) macro-F1: 0.4899
- Re-eval page-level macro-F1 for the shipped `final_model.pt` (`checkpoint_page_eval.json`): 0.5017
- Best page-level macro-F1 among stage checkpoints on the same grid: 0.5043 (Stage A)
### 3) patches_clahe

- Files: `patches_clahe/final_model.pt`, `results.json`, `confusion_matrix.png`, `checkpoint_page_eval.json`, `splits.json`
- Test (patch-level) macro-F1: 0.4911
- Re-eval page-level macro-F1 for the shipped `final_model.pt`: 0.5261
- Best page-level macro-F1 among stage checkpoints: 0.529 (Stage B)
## Which stage produced which checkpoint?

- `final_model.pt` in each folder is the stage with the highest validation macro-F1 among the `best_stage_*.pt` checkpoints (see `best_val_checkpoint` in each `results.json`): Stage B for `whole_page`, Stage C for both `patches_color` and `patches_clahe`.
- For page-level quality on the patch runs, the best single stage on the re-eval grid differs: Stage A (`patches_color`) and Stage B (`patches_clahe`) beat their respective `final_model.pt` page scores. Consult `checkpoint_page_eval.json` if you want to deploy a stage checkpoint instead of the val-selected default.
## Which experiment won?

CLAHE patches achieved the highest page-level macro-F1 (0.529 on the best stage checkpoint), while whole page achieved the best image-level macro-F1 (0.512). Whole page is recommended for production because its inference is simpler: one forward pass per page, with no patch extraction or aggregation step.
## How To Load a Checkpoint

```python
import torch
from pathlib import Path

from finetune_dinov3 import DINOv3Classifier, DINOV3_MODEL_ID

ckpt_path = Path("whole_page/final_model.pt")
# The payload is a dict, not bare weights. On PyTorch >= 2.6, torch.load
# defaults to weights_only=True, so you may need weights_only=False here.
payload = torch.load(ckpt_path, map_location="cpu")

label_to_idx = payload["label_to_idx"]
idx_to_label = {v: k for k, v in label_to_idx.items()}
num_classes = len(label_to_idx)

model = DINOv3Classifier(DINOV3_MODEL_ID, num_classes)
model.load_state_dict(payload["model_state_dict"])
model.eval()
```
## Inference (Single Image)

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor

# `model`, `idx_to_label`, and DINOV3_MODEL_ID come from the loading snippet above.
processor = AutoImageProcessor.from_pretrained(DINOV3_MODEL_ID)

img = Image.open("example.png").convert("RGB")
inputs = processor(images=img, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs["pixel_values"])
probs = torch.softmax(logits, dim=1)[0].cpu().numpy()

pred_idx = int(probs.argmax())
pred_label = idx_to_label[pred_idx]
print(pred_label, float(probs[pred_idx]))
```
## Page-Level Inference (Patch Aggregation)

For the patch experiments (`patches_color`, `patches_clahe`), aggregate by page stem:

- group patch probabilities by page ID (strip the `_pN` suffix),
- average the probabilities per page,
- take the `argmax` of the averaged probabilities.

This is the same logic used in the re-evaluation script output (`checkpoint_page_eval.json`) and is sketched below.
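A minimal sketch of this aggregation, assuming patch files are named `<page_id>_p<N>.<ext>` and that `patch_probs` maps each patch filename to its softmax vector:

```python
import re
from collections import defaultdict

import numpy as np

def aggregate_pages(patch_probs: dict[str, np.ndarray]) -> dict[str, int]:
    """Average patch probabilities per page, then argmax per page."""
    by_page = defaultdict(list)
    for fname, probs in patch_probs.items():
        # Strip the extension and the trailing _pN suffix to recover the page ID.
        page_id = re.sub(r"_p\d+$", "", fname.rsplit(".", 1)[0])
        by_page[page_id].append(probs)
    return {page: int(np.mean(np.stack(ps), axis=0).argmax())
            for page, ps in by_page.items()}
```

The returned class indices can be mapped back to script names with `idx_to_label` from the loading snippet.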
## Known Limitations

- Class imbalance is high (for example, `petsuk` and `uchen_sugdring` dominate, while `trinyig` is small).
- Results can vary by preprocessing variant and by patch-level vs page-level evaluation protocol.
- Patch-level metrics and page-level metrics are not directly interchangeable.
- The model expects Tibetan manuscript-style inputs; performance can drop on out-of-domain scans or mixed/noisy pages.
- Checkpoints are tied to the exact label mapping saved in each payload (`label_to_idx`).
## Reproducibility Notes

- Exclusion manifest support is enabled in training (`benchmark_page_ids.json`; see the sketch below).
- The full training code used for these artifacts is included as `finetune_dinov3.py`.
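A minimal sketch of applying the exclusion manifest, assuming `benchmark_page_ids.json` holds a flat JSON list of page IDs (the actual schema is whatever `finetune_dinov3.py` expects):

```python
import json
from pathlib import Path

excluded = set(json.loads(Path("benchmark_page_ids.json").read_text()))

# Illustrative data location; keep only files whose page stem is not benchmarked.
all_files = list(Path("data/whole_page/train").glob("*.png"))
train_files = [f for f in all_files if f.stem not in excluded]
```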