trocr-kgb-archive

A TrOCR model fine-tuned for line-level Cyrillic handwriting recognition on Soviet-era archival documents.

Model description

A two-stage LoRA fine-tune of kazars24/trocr-base-handwritten-ru for handwritten Russian (and occasional Ukrainian) text in 20th-century archival material.

  • Stage 1 — domain-prep LoRA on UkrainianCatholicUniversity/rukopys (~6 000 Ukrainian-handwritten line-level crops), merged into the base.
  • Stage 2 — target-domain LoRA on ~2 500 hand-annotated archival crops, merged into the Stage 1 base to produce the published checkpoint.

The gain comes from the two-stage recipe as a whole: measured zero-shot on the target domain, Stage 1 alone is roughly neutral, but it provides a strong handwriting prior that the small Stage 2 fine-tune builds on.

This repository hosts the fully merged checkpoint — there is no PEFT adapter at inference time. from_pretrained returns an ordinary VisionEncoderDecoderModel.

Intended uses

  • Line-level OCR of handwritten Cyrillic strips — single-line crops similar in shape to the training inputs.
  • Russian primarily, with occasional Ukrainian and mixed-language content tolerated.
  • Reasonable performance on handwritten dates containing mixed Cyrillic / Latin / Arabic / Roman numeral conventions (see Evaluation).

Limitations

  • Small target-domain training set (~2 000 unique handwritten lines in the Stage 2 train split). The model has limited language-modeling power and will degrade on out-of-distribution writing styles, highly stylized hands, or handwriting from before the 20th century.
  • Handwritten only. Typewriter-printed Cyrillic was not part of training; for that text class, use a typewriter/print-oriented Cyrillic OCR model (e.g. PaddleOCR with the East-Slavic recognizer).
  • Date-format dependency. Date crops may write the month as a Roman numeral in Cyrillic letters (Х), as an ASCII Latin Roman numeral (XI), as an Arabic numeral (12), or in a mix of these conventions. The model emits its own convention; downstream users who need a specific target convention should normalize predictions.
  • Confident errors. Beam search does not help in failure cases: the second-best beam is typically identical to top-1. Do not use the top-1 sequence score as a confidence signal.
  • Greedy ≈ beam-4 on the validation corpus (Δ < 0.001 CER). Greedy decoding is recommended for speed; the saved generation_config defaults to it.
  • Length cap. max_new_tokens=64 was the training cap; longer lines may be truncated.
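
The date-format limitation above suggests normalizing predictions downstream. The following is a minimal sketch, assuming the target convention is ASCII Latin Roman numerals and that the month sits between a day and a year, each separated by a single `.`, `/`, `-`, or space. The function name and the look-alike table are hypothetical and not part of this repository.

```python
import re

# Cyrillic capitals that look identical to the Roman-numeral letters X and I.
# This table is an illustrative assumption, not something shipped with the model.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0425": "X",  # Cyrillic capital Ha
    "\u0406": "I",  # Cyrillic capital Byelorussian-Ukrainian I
})

# A run of Roman-numeral letters (either alphabet) between "digit + separator"
# on the left and "separator + digit" on the right, e.g. the month in 12.X.1937.
MONTH_TOKEN = re.compile(r"(?<=\d[.\s/-])[IVX\u0425\u0406]+(?=[.\s/-]\d)")


def normalize_roman_months(text: str) -> str:
    """Rewrite Cyrillic look-alike letters in month numerals to ASCII Latin."""
    return MONTH_TOKEN.sub(
        lambda match: match.group(0).translate(CYRILLIC_TO_LATIN), text
    )
```

Restricting the substitution to a digit-delimited context keeps ordinary words that start with a Cyrillic Х or І untouched. Dates with other layouts (for example a space after each dot) would need a looser pattern.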

How to use

from transformers import VisionEncoderDecoderModel, TrOCRProcessor
from PIL import Image
import torch

model = VisionEncoderDecoderModel.from_pretrained("Hukyl/trocr-kgb-archive")
processor = TrOCRProcessor.from_pretrained("Hukyl/trocr-kgb-archive")
model.eval()

# Single-line crop, RGB
img = Image.open("line.jpg").convert("RGB")
pixel_values = processor(images=img, return_tensors="pt").pixel_values

with torch.inference_mode():
    generated_ids = model.generate(pixel_values, max_new_tokens=64, num_beams=1)

text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)

For GPU inference, call model.to("cuda") and move the inputs with pixel_values = pixel_values.to("cuda") before calling generate.
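
The same calls extend to several crops at once. The sketch below is one possible way to batch them, assuming a list of image paths. The helper names `chunked` and `transcribe` and the default batch size are hypothetical. The heavy imports are deferred to call time so that the batching helper can be used without loading the model.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def transcribe(paths: Sequence[str], batch_size: int = 8) -> List[str]:
    """Transcribe single-line crops in batches, on GPU when one is available."""
    import torch
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    device = "cuda" if torch.cuda.is_available() else "cpu"
    repo = "Hukyl/trocr-kgb-archive"
    model = VisionEncoderDecoderModel.from_pretrained(repo).to(device).eval()
    processor = TrOCRProcessor.from_pretrained(repo)

    texts: List[str] = []
    for batch in chunked(paths, batch_size):
        images = [Image.open(path).convert("RGB") for path in batch]
        inputs = processor(images=images, return_tensors="pt")
        pixel_values = inputs.pixel_values.to(device)
        with torch.inference_mode():
            # Greedy decoding with the same length cap as the example above.
            ids = model.generate(pixel_values, max_new_tokens=64, num_beams=1)
        texts.extend(processor.batch_decode(ids, skip_special_tokens=True))
    return texts
```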

Training data

  • Stage 1: UkrainianCatholicUniversity/rukopys, train split — gold-annotated Ukrainian handwritten line crops.
  • Stage 2: 2 564 hand-annotated line-level crops from an internal Cyrillic handwritten archival corpus, split 80 / 20 into train / val with a seeded split stratified by label × length bucket.

Per-label distribution of the full Stage 2 corpus (train + val combined):

Label                      Total crops   Share
General handwritten text   2 364         92 %
Handwritten dates          200           8 %
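
A stratified 80 / 20 split of the kind described for Stage 2 could look roughly like the sketch below. It assumes rows are `(label, text)` pairs. The bucket edges, the seed value, and the function names are hypothetical, since the card does not state them.

```python
import random
from collections import defaultdict


def length_bucket(text, edges=(10, 25, 50)):
    """Index of the first edge the text length does not exceed (edges are made up)."""
    for index, edge in enumerate(edges):
        if len(text) <= edge:
            return index
    return len(edges)


def stratified_split(rows, val_fraction=0.2, seed=0):
    """Split (label, text) rows into train/val, stratified by label x length bucket."""
    strata = defaultdict(list)
    for row in rows:
        label, text = row
        strata[(label, length_bucket(text))].append(row)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, val = [], []
    for key in sorted(strata):  # sorted keys keep the shuffle order deterministic
        group = strata[key]
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

Splitting inside each stratum keeps the share of dates and of short or long lines roughly equal between train and val, which matters when one label makes up only about 8 % of the corpus.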

Training procedure

Stage 1 — Рукопис domain prep

Hyperparameter   Value
Adapter          LoRA, r=32, α=64, dropout=0.10, target=attn_ffn, no DoRA
Optimizer        AdamW
Learning rate    2e-4
Warmup ratio     0.1
Batch size       4
Epochs           10
Patience         6
Eval decoding    beam=4

The Stage 1 LoRA is merged into the base via merge_and_unload() to produce a self-contained VisionEncoderDecoderModel checkpoint, which is then used as the base for Stage 2.
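
For intuition, merging a LoRA adapter folds the low-rank update into each adapted dense weight, W' = W + (α / r) · B · A. The toy sketch below shows that arithmetic in plain Python, assuming the standard LoRA scaling of α / r (the card does not say whether a different scaling such as rsLoRA was used). With r=32 and α=64 the scale is 2, which the toy example mirrors. The function names are hypothetical, and this is not the PEFT implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the merged dense weight."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 weight with a rank-1 update; alpha / rank = 2, as with 64 / 32.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]    # shape (out_features, rank)
A = [[0.5, -0.5]]     # shape (rank, in_features)
merged = merge_lora(W, B, A, alpha=2, rank=1)
print(merged)  # [[2.0, -1.0], [2.0, -1.0]]
```

After the merge the adapter matrices are no longer needed, which is why the resulting checkpoint loads as an ordinary model.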

Stage 2 — Target-domain fine-tune

Hyperparameter          Value
Adapter                 LoRA, r=32, α=64, dropout=0.10, target=attn_ffn, no DoRA
Optimizer               AdamW
Learning rate           5e-5
Warmup ratio            0.1
Batch size              8
Epochs                  8
Patience                8 (effectively no early stop)
metric_for_best_model   CER
load_best_model_at_end  True
Eval decoding           greedy (beams=1)

After training, the Stage 2 LoRA is merged into the Stage 1 base. The published artifact is the result of that merge — a single self-contained VisionEncoderDecoderModel.

Augmentation (Stage 2)

Each source crop in the train split produced 10 augmented variants (generated offline and written to disk; no online augmentation). Each variant applies 1 geometric transform plus 1–2 photometric transforms, sampled uniformly from per-label registries.

Group         Transforms
Geometric     MarginManipulation, Rotation (±1–5°), ElasticDistortion, BaselineWarp
Photometric   PaperColourShift (LAB), GaussianNoise, JpegCompression, ContrastGamma, MorphologicalOp (erode/dilate)
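
The sampling scheme can be sketched as follows. This only plans which transforms each variant would receive; the transforms themselves are not implemented here. It assumes a single shared registry (the card describes per-label registries) and photometric transforms drawn without replacement, and the function names are hypothetical.

```python
import random

GEOMETRIC = ["MarginManipulation", "Rotation", "ElasticDistortion", "BaselineWarp"]
PHOTOMETRIC = [
    "PaperColourShift",
    "GaussianNoise",
    "JpegCompression",
    "ContrastGamma",
    "MorphologicalOp",
]


def sample_recipe(rng):
    """Pick 1 geometric transform and 1-2 distinct photometric transforms."""
    geometric = rng.choice(GEOMETRIC)
    photometric = rng.sample(PHOTOMETRIC, rng.randint(1, 2))
    return [geometric] + photometric


def plan_variants(n_variants=10, seed=0):
    """Return one transform recipe per offline variant of a source crop."""
    rng = random.Random(seed)
    return [sample_recipe(rng) for _ in range(n_variants)]
```

Writing the variants to disk once, as the card describes, makes every epoch see the same augmented images, which trades some diversity for reproducibility.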

Evaluation

Held-out validation set, 513 rows (greedy decoding):

Metric                             Value
Overall CER                        0.152
Overall WER                        0.439
CER on general handwritten lines   0.150
CER on handwritten dates           0.168
Exact-match accuracy               ~36 %

Greedy decoding was confirmed equivalent to beam=4 (Δ < 0.001 CER) on this corpus, so the published generation_config defaults to num_beams=1 for inference speed.
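
For readers who want to reproduce metrics of this kind on their own data, the sketch below computes CER, WER, and exact-match rate from paired reference and predicted lines. It assumes corpus-level aggregation (total edits divided by total reference length) and no text normalization; the card does not state which aggregation or normalization was used, so numbers may differ slightly from those reported.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution or match
            ))
        prev = curr
    return prev[-1]


def corpus_metrics(refs, hyps):
    """Corpus-level CER, WER and exact-match rate over paired lines."""
    pairs = list(zip(refs, hyps))
    char_errors = sum(edit_distance(r, h) for r, h in pairs)
    word_errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    n_chars = sum(len(r) for r in refs)
    n_words = sum(len(r.split()) for r in refs)
    exact = sum(r == h for r, h in pairs)
    return {
        "cer": char_errors / n_chars,
        "wer": word_errors / n_words,
        "exact_match": exact / len(refs),
    }
```

Because a single wrong character makes a whole word wrong, WER is typically much higher than CER on short handwritten lines, which is consistent with the gap between 0.152 and 0.439 above.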

Train-set memorization (sanity check)

Reported as a sanity check, not a production-relevant number: across the full corpus (train + val mixed, 2 564 crops), CER drops to ~0.04 because the model has memorized its training portion. The honest, generalization-relevant number is the val-only 0.152 above.

Authors

Acknowledgements

License

Released under Apache-2.0.
