trocr-kgb-archive
A TrOCR model fine-tuned for line-level Cyrillic handwriting recognition on Soviet-era archival documents.
Model description
A two-stage LoRA fine-tune of kazars24/trocr-base-handwritten-ru for handwritten Russian (and occasional Ukrainian) text in 20th-century archival material.
- Stage 1: domain-prep LoRA on `UkrainianCatholicUniversity/rukopys` (~6 000 Ukrainian handwritten line-level crops), merged into the base.
- Stage 2: target-domain LoRA on ~2 500 hand-annotated archival crops, merged into the Stage 1 base to produce the published checkpoint.
The two-stage recipe is what produces the gain: Stage 1 is roughly a wash on the target domain when measured zero-shot, but it provides a strong handwriting prior that the small target-domain Stage 2 fine-tune builds on.
This repository hosts the fully merged checkpoint, so no PEFT adapter is needed at inference time: `from_pretrained` returns an ordinary `VisionEncoderDecoderModel`.
Intended uses
- Line-level OCR of handwritten Cyrillic strips — single-line crops similar in shape to the training inputs.
- Russian primarily, with occasional Ukrainian and mixed-language content tolerated.
- Reasonable performance on handwritten dates containing mixed Cyrillic / Latin / Arabic / Roman numeral conventions (see Evaluation).
Limitations
- Small training set (~2 000 unique handwritten lines). The model has limited language-modeling power and will degrade on out-of-distribution writing styles, highly stylized hands, or scripts from earlier than the 20th century.
- Handwritten only. Typewriter-printed Cyrillic was not part of training; for that text class, use a typewriter/print-oriented Cyrillic OCR model (e.g. PaddleOCR with the East-Slavic recognizer).
- Date-format dependency. Date crops may use Cyrillic month numerals (`Х`), ASCII Latin Roman (`XI`), Arabic (`12`), or mixed conventions. The model emits its own convention; downstream users with a specific target convention should normalize predictions.
- Confident errors. Beam search does not buy headroom in failure cases (the second-best beam is typically identical to top-1). Don't use top-1 sequence score as a confidence signal.
- Greedy ≈ beam-4 on the validation corpus (Δ < 0.001 CER). Greedy decoding is recommended for speed; the saved `generation_config` defaults to it.
- Length cap. `max_new_tokens=64` was the training cap; longer lines may be truncated.
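The normalization of date predictions mentioned in the list above could look like the following minimal sketch. It is not part of the model or its repository: the function name `normalize_month_numerals`, the set of Cyrillic look-alike letters, and the choice of zero-padded Arabic months as the target convention are all assumptions made for illustration.

```python
import re

# Cyrillic capitals that look like Roman-numeral letters (assumed set):
# Cyrillic "Х" (U+0425) -> Latin "X", Cyrillic "І" (U+0406) -> Latin "I".
_LOOKALIKES = str.maketrans({"\u0425": "X", "\u0406": "I"})

_ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}

# A standalone token of 1-4 Roman-numeral letters, Latin or Cyrillic look-alike.
_ROMAN_TOKEN = re.compile("\\b[IVX\u0425\u0406]{1,4}\\b")


def normalize_month_numerals(text: str) -> str:
    """Rewrite Roman month numerals (Latin or Cyrillic look-alikes) as two-digit Arabic."""
    def replace(match: re.Match) -> str:
        token = match.group(0).translate(_LOOKALIKES)
        month = _ROMAN_MONTHS.get(token)
        # Leave anything that is not a valid month numeral untouched.
        return f"{month:02d}" if month is not None else match.group(0)

    return _ROMAN_TOKEN.sub(replace, text)


print(normalize_month_numerals("12.XI.1937"))       # 12.11.1937
print(normalize_month_numerals("3.\u0425.1941"))    # 3.10.1941
```

This is only appropriate for crops already known to be dates; on general text a standalone `I`, `V`, or `Х` would be rewritten too.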
How to use
from transformers import VisionEncoderDecoderModel, TrOCRProcessor
from PIL import Image
import torch
model = VisionEncoderDecoderModel.from_pretrained("Hukyl/trocr-kgb-archive")
processor = TrOCRProcessor.from_pretrained("Hukyl/trocr-kgb-archive")
model.eval()
# Single-line crop, RGB
img = Image.open("line.jpg").convert("RGB")
pixel_values = processor(images=img, return_tensors="pt").pixel_values
with torch.inference_mode():
    generated_ids = model.generate(pixel_values, max_new_tokens=64, num_beams=1)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
For GPU inference, call `model.to("cuda")` and set `pixel_values = pixel_values.to("cuda")` before calling `generate`.
Training data
- Stage 1: `UkrainianCatholicUniversity/rukopys`, `train` split (gold-annotated Ukrainian handwritten line crops).
- Stage 2: 2 564 hand-annotated line-level crops from an internal Cyrillic handwritten archival corpus, split 80 / 20 train / val, stratified by `label × length-bucket` with a fixed seed.
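A split stratified on label × length-bucket can be sketched as below. The card does not publish the bucket boundaries, the seed, or the splitting code, so the `edges` values, `seed=0`, and both function names are hypothetical; only the 80 / 20 ratio and the stratification key come from the description above.

```python
import random
from collections import defaultdict


def length_bucket(text: str, edges=(10, 25, 50)) -> int:
    """Return the index of the first edge the text length fits under (edges are assumed)."""
    for i, edge in enumerate(edges):
        if len(text) <= edge:
            return i
    return len(edges)


def stratified_split(rows, val_fraction=0.2, seed=0):
    """Split (label, text) rows into train / val, stratified by (label, length bucket)."""
    strata = defaultdict(list)
    for label, text in rows:
        strata[(label, length_bucket(text))].append((label, text))

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for key in sorted(strata):  # sorted so iteration order is deterministic
        group = strata[key]
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

Splitting each stratum separately keeps the rare date label represented in validation at roughly the same share as in training.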
Per-label distribution of the Stage 2 corpus (train and val combined):
| Label | Total crops | Share |
|---|---|---|
| General handwritten text | 2 364 | 92 % |
| Handwritten dates | 200 | 8 % |
Training procedure
Stage 1 — Рукопис domain prep
| Hyperparameter | Value |
|---|---|
| Adapter | LoRA, r=32, α=64, dropout=0.10, target=attn_ffn, no DoRA |
| Optimizer | AdamW |
| Learning rate | 2e-4 |
| Warmup ratio | 0.1 |
| Batch size | 4 |
| Epochs | 10 |
| Patience | 6 |
| Eval decoding | beam=4 |
The Stage 1 LoRA is merged into the base via merge_and_unload() to produce a self-contained VisionEncoderDecoderModel checkpoint, which is then used as the base for Stage 2.
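For reference, merging folds the low-rank update into each targeted weight matrix. The expression below assumes the standard LoRA scaling of α / r; the card does not say whether a variant such as rank-stabilized scaling was used. The values of r and α are taken from the table above.

```latex
W_{\text{merged}} = W + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\;
A \in \mathbb{R}^{r \times d_{\text{in}}},
\qquad \frac{\alpha}{r} = \frac{64}{32} = 2
```

After this step the adapter matrices A and B are no longer needed, which is why the result loads as a plain checkpoint.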
Stage 2 — Target-domain fine-tune
| Hyperparameter | Value |
|---|---|
| Adapter | LoRA, r=32, α=64, dropout=0.10, target=attn_ffn, no DoRA |
| Optimizer | AdamW |
| Learning rate | 5e-5 |
| Warmup ratio | 0.1 |
| Batch size | 8 |
| Epochs | 8 |
| Patience | 8 (effectively no early stop) |
| `metric_for_best_model` | CER |
| `load_best_model_at_end` | True |
| Eval decoding | greedy (beams=1) |
After training, the Stage 2 LoRA is merged into the Stage 1 base. The published artifact is the result of that merge — a single self-contained VisionEncoderDecoderModel.
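A merge of this kind can be sketched with the PEFT library as follows. This is an illustration, not the author's script: the function name and the three directory arguments are hypothetical, and the heavy imports are placed inside the function so the snippet can be imported without `peft` or `transformers` installed.

```python
def merge_lora_into_base(base_dir: str, adapter_dir: str, out_dir: str) -> None:
    """Fold a LoRA adapter into its base model and save a self-contained checkpoint."""
    # Lazy imports: only needed when the function is actually called.
    from peft import PeftModel
    from transformers import VisionEncoderDecoderModel

    base = VisionEncoderDecoderModel.from_pretrained(base_dir)
    # Attach the trained adapter weights to the base model.
    peft_model = PeftModel.from_pretrained(base, adapter_dir)
    # merge_and_unload() adds the LoRA update to the base weights and
    # returns the underlying model with the adapter wrappers removed.
    merged = peft_model.merge_and_unload()
    merged.save_pretrained(out_dir)
```

For the two-stage recipe this would be called twice: once with the original base and the Stage 1 adapter, then with that output as the base for the Stage 2 adapter.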
Augmentation (Stage 2)
Each training source crop produced 10 augmented variants, generated offline and written to disk (no online augmentation). Each variant applies one geometric transform plus one or two photometric transforms, sampled uniformly from per-label registries.
| Group | Transforms |
|---|---|
| Geometric | MarginManipulation, Rotation (±1–5°), ElasticDistortion, BaselineWarp |
| Photometric | PaperColourShift (LAB), GaussianNoise, JpegCompression, ContrastGamma, MorphologicalOp (erode/dilate) |
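The sampling scheme can be sketched as below. Only the transform names and the "1 geometric + 1-2 photometric, 10 variants" rule come from the card. The sketch uses one shared registry instead of per-label registries, samples photometric transforms without replacement, and returns names only instead of applying image operations; all three are simplifying assumptions, as is the function name.

```python
import random

GEOMETRIC = ["MarginManipulation", "Rotation", "ElasticDistortion", "BaselineWarp"]
PHOTOMETRIC = [
    "PaperColourShift", "GaussianNoise", "JpegCompression",
    "ContrastGamma", "MorphologicalOp",
]


def sample_recipes(n_variants: int = 10, seed: int = 0) -> list[list[str]]:
    """Return one transform recipe (list of transform names) per augmented variant."""
    rng = random.Random(seed)
    recipes = []
    for _ in range(n_variants):
        geometric = rng.choice(GEOMETRIC)      # exactly one geometric transform
        k = rng.randint(1, 2)                  # one or two photometric transforms
        photometric = rng.sample(PHOTOMETRIC, k)  # distinct picks (assumption)
        recipes.append([geometric, *photometric])
    return recipes


for recipe in sample_recipes(n_variants=3):
    print(" -> ".join(recipe))
```

Generating the recipes from a seeded generator makes the offline augmented set reproducible.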
Evaluation
Held-out validation set, 513 rows (greedy decoding):
| Metric | Value |
|---|---|
| Overall CER | 0.152 |
| Overall WER | 0.439 |
| CER on general handwritten lines | 0.150 |
| CER on handwritten dates | 0.168 |
| Exact-match accuracy | ~36 % |
Greedy decoding was confirmed equivalent to beam=4 (Δ < 0.001 CER) on this corpus, so the published generation_config defaults to num_beams=1 for inference speed.
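For readers who want to reproduce these metrics on their own data, here is a minimal pure-Python sketch of CER, WER, and exact-match accuracy. The card does not state which implementation or averaging was used; this version pools edit distances over the whole corpus (micro-average), which is an assumption, and the function names are illustrative.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_error_rate(refs, hyps, unit="char") -> float:
    """Total edit distance divided by total reference length, in chars or words."""
    split = list if unit == "char" else str.split
    errors = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return errors / total


def exact_match(refs, hyps) -> float:
    """Share of predictions identical to their reference."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)


refs = ["привет мир"]
hyps = ["привет мар"]
print(corpus_error_rate(refs, hyps, unit="char"))  # 0.1
print(corpus_error_rate(refs, hyps, unit="word"))  # 0.5
```

One wrong character in a ten-character line gives a CER of 0.1 but a WER of 0.5, which illustrates why the WER above is much higher than the CER.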
Train-set memorization (sanity check)
Reported as a sanity check, not a production-relevant number: across the full corpus (train + val mixed, 2 564 crops), CER drops to ~0.04 because the model has memorized its training portion. The honest, generalization-relevant number is the val-only 0.152 above.
Authors
- Andrii Shalaiev (@Hukyl)
Acknowledgements
- `kazars24/trocr-base-handwritten-ru`: the off-the-shelf TrOCR base.
- `UkrainianCatholicUniversity/rukopys`: the gold-annotated Ukrainian handwriting corpus used for Stage 1 domain prep.
License
Released under Apache-2.0.