trocr-kgb-archive

A TrOCR model fine-tuned for line-level Cyrillic handwriting recognition on Soviet-era archival documents.

Model description

A two-stage LoRA fine-tune of kazars24/trocr-base-handwritten-ru for handwritten Russian (and occasional Ukrainian) text in 20th-century archival material.

Stage 1 — domain-prep LoRA on UkrainianCatholicUniversity/rukopys (~6 000 Ukrainian-handwritten line-level crops), merged into the base.
Stage 2 — target-domain LoRA on ~2 500 hand-annotated archival crops, merged into the Stage 1 base to produce the published checkpoint.

The gain comes from combining the two stages. On its own, Stage 1 is roughly neutral on the target domain when evaluated zero-shot, but it provides a strong handwriting prior that the small target-domain Stage 2 fine-tune builds on.

This repository hosts the fully merged checkpoint — there is no PEFT adapter at inference time. from_pretrained returns an ordinary VisionEncoderDecoderModel.

Intended uses

Line-level OCR of handwritten Cyrillic strips — single-line crops similar in shape to the training inputs.
Russian primarily, with occasional Ukrainian and mixed-language content tolerated.
Handwritten dates with mixed Cyrillic / Latin / Arabic / Roman numeral conventions, at CER 0.168 on the held-out val set (see Evaluation).

Limitations

Small training set (~2 000 unique handwritten lines). The model has limited language-modeling power and will degrade on out-of-distribution writing styles, highly stylized hands, or scripts from earlier than the 20th century.
Handwritten only. Typewriter-printed Cyrillic was not part of training; for that text class, use a typewriter/print-oriented Cyrillic OCR model (e.g. PaddleOCR with the East-Slavic recognizer).
Date-format dependency. Date crops may write the month as a Roman numeral in Cyrillic look-alike letters (Х), as an ASCII Roman numeral (XI), as an Arabic numeral (12), or in a mix of these conventions. The model emits whichever convention it predicts; downstream users who need a specific convention should normalize predictions.
Confident errors. Beam search does not buy headroom in failure cases (the second-best beam is typically identical to top-1). Don't use top-1 sequence score as a confidence signal.
Greedy ≈ beam-4 on the validation corpus (Δ < 0.001 CER). Greedy decoding is recommended for speed; the saved generation_config defaults to it.
Length cap. max_new_tokens=64 was the training cap; longer lines may be truncated.
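
The date-format limitation above can be handled with a small post-processing step. The following is a minimal sketch, not part of the released model: the function name normalize_roman_month and the look-alike table are assumptions made for illustration. It maps Cyrillic capitals that look like Roman-numeral characters to ASCII, and only inside standalone tokens that form a valid month numeral (I to XII) after mapping.

```python
import re

# Cyrillic capitals that look like Roman-numeral characters.
# Illustrative table (an assumption); extend it for your own corpus.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0425": "X",  # CYRILLIC CAPITAL LETTER HA
    "\u0406": "I",  # CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
})

ROMAN_MONTHS = {"I", "II", "III", "IV", "V", "VI",
                "VII", "VIII", "IX", "X", "XI", "XII"}

# A standalone token built only from Latin X/I/V or their Cyrillic look-alikes.
ROMAN_TOKEN = re.compile(r"\b[XIV\u0425\u0406]+\b")


def normalize_roman_month(text: str) -> str:
    """Rewrite look-alike month numerals to ASCII; leave everything else as is."""
    def fix(match: re.Match) -> str:
        candidate = match.group(0).translate(CYRILLIC_TO_LATIN)
        # Only rewrite when the result is a valid month numeral.
        return candidate if candidate in ROMAN_MONTHS else match.group(0)

    return ROMAN_TOKEN.sub(fix, text)
```

This is intended for crops already known to be dates. On general text it would also rewrite a standalone Cyrillic capital such as a sentence-initial Ukrainian "І", so it should not be applied blindly.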

How to use

from transformers import VisionEncoderDecoderModel, TrOCRProcessor
from PIL import Image
import torch

model = VisionEncoderDecoderModel.from_pretrained("Hukyl/trocr-kgb-archive")
processor = TrOCRProcessor.from_pretrained("Hukyl/trocr-kgb-archive")
model.eval()

# Single-line crop, RGB
img = Image.open("line.jpg").convert("RGB")
pixel_values = processor(images=img, return_tensors="pt").pixel_values

with torch.inference_mode():
    generated_ids = model.generate(pixel_values, max_new_tokens=64, num_beams=1)

text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)

For GPU inference, call model.to("cuda") and move the input with pixel_values = pixel_values.to("cuda") before calling generate.
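
Because max_new_tokens=64 can cut long lines short, it may be useful to flag outputs that stopped at the cap rather than at an end-of-sequence token. The helper below is a hypothetical sketch, not a transformers API. It assumes the usual generate output layout, where position 0 holds the decoder start token and the newly generated tokens follow.

```python
from typing import Sequence


def hit_length_cap(token_ids: Sequence[int], eos_token_id: int) -> bool:
    """Return True if a generated row never emitted EOS after the start token.

    Position 0 is skipped because it holds the decoder start token, which in
    some TrOCR configs shares its id with EOS (an assumption; check your config).
    """
    return eos_token_id not in list(token_ids)[1:]
```

With the snippet above, this could be applied per row as hit_length_cap(row, processor.tokenizer.eos_token_id) for each row in generated_ids.tolist(), assuming the tokenizer's EOS id is the one used during generation.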

Training data

Stage 1: UkrainianCatholicUniversity/rukopys, train split — gold-annotated Ukrainian handwritten line crops.
Stage 2: 2 564 hand-annotated line-level crops from an internal Cyrillic handwritten archival corpus, split 80 / 20 into train / val with a fixed seed, stratified by label × length bucket.
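
A seeded split stratified by label × length bucket can be sketched as follows. This is an illustration under stated assumptions, not the author's script: the bucket edges, the row keys "label" and "text", and the seed value are all hypothetical.

```python
import random
from collections import defaultdict


def length_bucket(text, edges=(10, 25, 40)):
    """Map a transcription to a coarse length bucket (edges are illustrative)."""
    return sum(len(text) > e for e in edges)


def stratified_split(rows, val_fraction=0.2, seed=0):
    """Split rows into (train, val), stratifying on (label, length bucket).

    Each row is a dict with the hypothetical keys "label" and "text".
    """
    strata = defaultdict(list)
    for row in rows:
        strata[(row["label"], length_bucket(row["text"]))].append(row)

    rng = random.Random(seed)
    train, val = [], []
    for key in sorted(strata):  # fixed iteration order keeps the split reproducible
        group = strata[key]
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

Stratifying this way keeps the rare date label and the long-line buckets represented in both splits at roughly the same ratio.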

Per-label distribution of the Stage 2 corpus (train and val combined):

Label	Total crops	Share
General handwritten text	2 364	92 %
Handwritten dates	200	8 %

Training procedure

Stage 1 — Рукопис domain prep

Hyperparameter	Value
Adapter	LoRA, r=32, α=64, dropout=0.10, target=`attn_ffn`, no DoRA
Optimizer	AdamW
Learning rate	2e-4
Warmup ratio	0.1
Batch size	4
Epochs	10
Patience	6
Eval decoding	beam=4

The Stage 1 LoRA is merged into the base via merge_and_unload() to produce a self-contained VisionEncoderDecoderModel checkpoint, which is then used as the base for Stage 2.
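
For reference, merging a LoRA adapter folds the low-rank update into each adapted weight matrix. Assuming the standard LoRA scaling of α/r (an assumption about the training setup, though it is the common default), the merged weight for the hyperparameters above would be:

```latex
W_{\text{merged}} = W_0 + \frac{\alpha}{r}\, B A
                  = W_0 + \frac{64}{32}\, B A
                  = W_0 + 2\, B A,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad
       A \in \mathbb{R}^{r \times d_{\text{in}}},\quad r = 32 .
```

Once the update is folded in, the adapter matrices A and B are no longer needed, which is why the merged checkpoint loads without PEFT.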

Stage 2 — Target-domain fine-tune

Hyperparameter	Value
Adapter	LoRA, r=32, α=64, dropout=0.10, target=`attn_ffn`, no DoRA
Optimizer	AdamW
Learning rate	5e-5
Warmup ratio	0.1
Batch size	8
Epochs	8
Patience	8 (effectively no early stop)
`metric_for_best_model`	CER
`load_best_model_at_end`	True
Eval decoding	greedy (beams=1)

After training, the Stage 2 LoRA is merged into the Stage 1 base. The published artifact is the result of that merge — a single self-contained VisionEncoderDecoderModel.

Augmentation (Stage 2)

Each source crop in the train split produced 10 augmented variants, generated offline and written to disk (no online augmentation). Each variant applies 1 geometric transform and 1–2 photometric transforms, sampled uniformly from per-label registries.

Group	Transforms
Geometric	MarginManipulation, Rotation (±1–5°), ElasticDistortion, BaselineWarp
Photometric	PaperColourShift (LAB), GaussianNoise, JpegCompression, ContrastGamma, MorphologicalOp (erode/dilate)
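
The sampling scheme described above (1 geometric plus 1–2 photometric transforms per variant) can be sketched as below. This only plans which transforms to apply and does not implement them. It uses a single registry per group, whereas the card mentions per-label registries. Sampling photometric transforms without replacement and the seed value are assumptions.

```python
import random

GEOMETRIC = ["MarginManipulation", "Rotation", "ElasticDistortion", "BaselineWarp"]
PHOTOMETRIC = ["PaperColourShift", "GaussianNoise", "JpegCompression",
               "ContrastGamma", "MorphologicalOp"]


def sample_variant_plan(rng):
    """Pick 1 geometric and 1-2 distinct photometric transform names."""
    geometric = rng.choice(GEOMETRIC)
    n_photo = rng.randint(1, 2)  # randint is inclusive on both ends
    photometric = rng.sample(PHOTOMETRIC, n_photo)  # no repeats within a variant
    return [geometric] + photometric


def plan_augmentations(n_variants=10, seed=0):
    """Return one transform plan per augmented variant of a source crop."""
    rng = random.Random(seed)
    return [sample_variant_plan(rng) for _ in range(n_variants)]
```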

Evaluation

Held-out 513-row val (greedy decoding):

Metric	Value
Overall CER	0.152
Overall WER	0.439
CER on general handwritten lines	0.150
CER on handwritten dates	0.168
Exact-match accuracy	~36 %

Greedy decoding was confirmed equivalent to beam=4 (Δ < 0.001 CER) on this corpus, so the published generation_config defaults to num_beams=1 for inference speed.
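
For readers who want to reproduce metrics of this kind on their own data, CER and WER are edit-distance ratios over characters and whitespace-separated words respectively. The sketch below uses a micro-average (total edits over total reference length). Whether the figures in this card were micro- or macro-averaged is not stated, so treat that choice as an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def corpus_error_rate(refs, hyps, unit="char"):
    """Total edits divided by total reference length (micro-average)."""
    split = (lambda s: list(s)) if unit == "char" else (lambda s: s.split())
    edits = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return edits / total
```

Calling corpus_error_rate(refs, hyps) gives a CER-style figure and corpus_error_rate(refs, hyps, unit="word") a WER-style figure.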

Train-set memorization (sanity check)

Reported as a sanity check, not a production-relevant number: across the full corpus (train + val mixed, 2 564 crops), CER drops to ~0.04 because the model has memorized its training portion. The honest, generalization-relevant number is the val-only 0.152 above.

Authors

Andrii Shalaiev (@Hukyl)

Acknowledgements

kazars24/trocr-base-handwritten-ru — the off-the-shelf TrOCR base.
UkrainianCatholicUniversity/rukopys — the gold-annotated Ukrainian handwriting corpus used for Stage 1 domain prep.

License

Released under Apache-2.0.


Model size: 0.3B params, F32 tensors, Safetensors format.
Model tree: microsoft/trocr-base-handwritten (base model), kazars24/trocr-base-handwritten-ru (fine-tuned from it), this model.
