mini-kh-OCR — Khmer & English Document OCR Pipeline

An end-to-end OCR pipeline that combines two models to detect, classify, and recognise Khmer and English text from document images.

Input Image
    │
    ▼
┌─────────────────────────────┐
│  Text Detection             │  phonsobon/mini-text-detection (YOLO11n)
│  → subject / reference /    │
│    content bounding boxes   │
└─────────────┬───────────────┘
              │  crop each region
              ▼
┌─────────────────────────────┐
│  Text Recognition           │  phonsobon/mini-ocr (CRNN + CTC)
│  → Khmer & English text     │
└─────────────┬───────────────┘
              │
              ▼
     Structured output
     grouped by class
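
The flow in the diagram can be sketched in a few lines. This is a minimal illustration, not the code in `mini_kh_ocr.py`: `detect` and `recognise` are hypothetical stand-ins for the two models, and the image is a plain 2D list so the example runs without any dependencies.

```python
def crop(image, box):
    """Cut the rectangle described by `box` out of a row-major 2D image."""
    return [row[box["x1"]:box["x2"]] for row in image[box["y1"]:box["y2"]]]

def run_pipeline(image, detect, recognise):
    """Detect regions, crop each one, and recognise the text inside it."""
    regions = []
    for det in detect(image):               # stage 1: text detection
        patch = crop(image, det["box"])     # crop each region
        text = recognise(patch)             # stage 2: text recognition
        regions.append({**det, "text": text})
    return regions

# Hypothetical stand-ins for the two models, for illustration only.
def fake_detect(image):
    return [{"class": "subject", "conf": 0.9,
             "box": {"x1": 0, "y1": 0, "x2": 2, "y2": 1}}]

def fake_recognise(patch):
    return f"{len(patch)}x{len(patch[0])} patch"

image = [[0, 1, 2], [3, 4, 5]]
regions = run_pipeline(image, fake_detect, fake_recognise)
print(regions[0]["text"])  # 1x2 patch
```

Passing the two stages in as callables keeps the sketch independent of how the real models are loaded.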

Detection Classes

ID	Class	Khmer	Description
`0`	`subject`	កម្មវត្ថុ	Title or subject heading
`1`	`reference`	យោង	Reference or citation
`2`	`content`	អត្ថបទ	Main body / paragraph text
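
For post-processing, the table can be mirrored as a small lookup. The IDs, names, and Khmer labels below are copied from the table; the dictionary and function names are illustrative and are not claimed to exist in `mini_kh_ocr.py`.

```python
# Values copied from the class table; the names CLASS_NAMES, KHMER_LABELS,
# and describe are hypothetical helpers for this example.
CLASS_NAMES = {0: "subject", 1: "reference", 2: "content"}
KHMER_LABELS = {"subject": "កម្មវត្ថុ", "reference": "យោង", "content": "អត្ថបទ"}

def describe(class_id):
    """Return a readable 'id: name (Khmer label)' string for a class ID."""
    name = CLASS_NAMES[class_id]
    return f"{class_id}: {name} ({KHMER_LABELS[name]})"

print(describe(1))  # 1: reference (យោង)
```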

Models Used

Role	Repository
Text Detection	phonsobon/mini-text-detection
Text Recognition	phonsobon/mini-ocr
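
The recognition model is described as CRNN + CTC. A CTC-trained network emits one symbol ID per time step, and the usual greedy decoding collapses consecutive repeats and then drops the blank symbol. The sketch below shows that rule on a toy example; the charset and the blank index of `phonsobon/mini-ocr` are not documented here, so both are assumptions.

```python
def ctc_greedy_decode(frame_ids, charset, blank=0):
    """Collapse consecutive repeats, drop blanks, and map IDs to characters."""
    chars = []
    previous = None
    for idx in frame_ids:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank.
        if idx != previous and idx != blank:
            chars.append(charset[idx])
        previous = idx
    return "".join(chars)

# Hypothetical 4-symbol charset; index 0 is assumed to be the CTC blank.
charset = ["<blank>", "a", "b", "c"]
decoded = ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0, 0, 3], charset)
print(decoded)  # aabc
```

The blank between the two runs of `1` is what lets the decoder output a doubled character.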

Files

File	Description
`mini_kh_ocr.py`	Defines the `MiniKhOCR` pipeline class; download and import this file

Installation

pip install torch torchvision ultralytics huggingface_hub pillow numpy

Quick Start

from huggingface_hub import hf_hub_download

# Download pipeline script
pipeline_path = hf_hub_download(
    repo_id="phonsobon/mini-kh-OCR",
    filename="mini_kh_ocr.py",
)

import importlib.util
spec = importlib.util.spec_from_file_location("mini_kh_ocr", pipeline_path)
mod  = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

MiniKhOCR = mod.MiniKhOCR

# ── Load pipeline ─────────────────────────────────────────────────────────────
ocr = MiniKhOCR()

# ── Run on an image ───────────────────────────────────────────────────────────
result = ocr("your_document.jpg")

Output Format

`result` is a dictionary with one list of recognised strings per class, plus a `regions` list holding every detection:

{
    "subject":   ["កម្មវត្ថុ: សំណើរសុំច្បាប់"],   # កម្មវត្ថុ — subject/heading texts
    "reference": ["យោង: លេខ ០០១/២៤"],           # យោង — reference texts
    "content":   ["អត្ថបទ...", "..."],            # អត្ថបទ — body paragraph texts

    "regions": [                                  # all detections sorted top → bottom
        {
            "class": "subject",
            "conf":  0.91,
            "box":   {"x1": 10, "y1": 5, "x2": 320, "y2": 40},
            "text":  "កម្មវត្ថុ: សំណើរសុំច្បាប់",
        },
        {
            "class": "reference",
            "conf":  0.87,
            "box":   {"x1": 10, "y1": 50, "x2": 200, "y2": 75},
            "text":  "យោង: លេខ ០០១/២៤",
        },
        ...
    ]
}
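
A dictionary of this shape could be assembled from a flat list of detections roughly as follows. This is a sketch under assumptions: the source only says regions are sorted top to bottom, so sorting by `y1` is a guess at the key, and `build_result` is a hypothetical name.

```python
CLASSES = ("subject", "reference", "content")

def build_result(regions):
    """Sort regions top to bottom and collect their text per class."""
    ordered = sorted(regions, key=lambda r: r["box"]["y1"])  # assumed sort key
    result = {name: [] for name in CLASSES}
    for region in ordered:
        result[region["class"]].append(region["text"])
    result["regions"] = ordered
    return result

regions = [
    {"class": "content", "conf": 0.93,
     "box": {"x1": 10, "y1": 90, "x2": 600, "y2": 120}, "text": "body"},
    {"class": "subject", "conf": 0.91,
     "box": {"x1": 10, "y1": 5, "x2": 320, "y2": 40}, "text": "title"},
]
result = build_result(regions)
print([r["class"] for r in result["regions"]])  # ['subject', 'content']
```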

Usage Examples

Access text by class

result = ocr("document.jpg")

print("=== SUBJECT ===")
for text in result["subject"]:
    print(text)

print("=== REFERENCE ===")
for text in result["reference"]:
    print(text)

print("=== CONTENT ===")
for text in result["content"]:
    print(text)

Format as a structured document

document = ocr.to_document(result)
print(document)

# Output:
# [SUBJECT]
# កម្មវត្ថុ: សំណើរសុំច្បាប់
#
# [REFERENCE]
# យោង: លេខ ០០១/២៤
#
# [CONTENT]
# អត្ថបទដំបូង
# អត្ថបទទីពីរ
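
To customise the layout, an equivalent formatter is easy to write against the result dictionary. The function below is a hypothetical re-creation based only on the sample output above; whether `to_document` skips empty classes is not stated, so that behaviour is an assumption.

```python
def format_document(result):
    """Render each non-empty class as a '[CLASS]' block, separated by blank lines."""
    blocks = []
    for name in ("subject", "reference", "content"):
        texts = result.get(name, [])
        if texts:  # assumption: classes with no text are omitted
            blocks.append("\n".join([f"[{name.upper()}]", *texts]))
    return "\n\n".join(blocks)

sample = {"subject": ["A"], "reference": [], "content": ["B", "C"]}
formatted = format_document(sample)
print(formatted)
```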

Verbose mode — print each region as it is processed

result = ocr("document.jpg", verbose=True)

# [subject]   (10,5)→(320,40)   conf=0.91  →  'កម្មវត្ថុ: សំណើរសុំច្បាប់'
# [reference] (10,50)→(200,75)  conf=0.87  →  'យោង: លេខ ០០១/២៤'
# [content]   (10,90)→(600,120) conf=0.93  →  'អត្ថបទដំបូង'

Get cropped images alongside text

result = ocr("document.jpg", return_crops=True)

for region in result["regions"]:
    print(region["class"], "→", region["text"])
    region["crop"].show()   # PIL Image of the cropped region
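
Instead of opening a viewer per region, the crops can be written to disk. The sketch below relies only on the `save` method that PIL images provide; `save_crops` and the file-naming scheme are illustrative, and `FakeCrop` is a stand-in so the example runs without Pillow.

```python
import tempfile
from pathlib import Path

def save_crops(regions, out_dir):
    """Save each region's crop as '<index>_<class>.png' and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, region in enumerate(regions):
        path = out / f"{i:03d}_{region['class']}.png"
        region["crop"].save(path)  # PIL images accept a path here
        paths.append(path)
    return paths

# Stand-in for a PIL image; crops returned with return_crops=True
# already provide .save().
class FakeCrop:
    def save(self, path):
        Path(path).write_bytes(b"")

with tempfile.TemporaryDirectory() as tmp:
    saved = save_crops([{"class": "subject", "crop": FakeCrop()}], tmp)
    saved_names = [p.name for p in saved]
print(saved_names)  # ['000_subject.png']
```

With a real result, the call would be `save_crops(result["regions"], "crops")`.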

Batch processing

import os

folder = "path/to/documents/"
all_results = {}

for fname in os.listdir(folder):
    if fname.lower().endswith((".jpg", ".jpeg", ".png")):
        path = os.path.join(folder, fname)
        result = ocr(path)
        all_results[fname] = {
            "subject":   result["subject"],
            "reference": result["reference"],
            "content":   result["content"],
        }
        print(f"✅ {fname} — {len(result['regions'])} regions detected")
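
For larger folders it can help to process files in a stable order and to keep going when one image fails. The variant below is a sketch: `list_images` and `run_batch` are hypothetical helpers, and the demo passes a stand-in callable where the real `ocr` object would go.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def list_images(folder):
    """Return image files in `folder`, sorted by name for a stable order."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )

def run_batch(folder, ocr):
    """Run `ocr` on every image and collect failures instead of stopping."""
    results, errors = {}, {}
    for path in list_images(folder):
        try:
            results[path.name] = ocr(str(path))
        except Exception as exc:  # one unreadable file should not end the batch
            errors[path.name] = repr(exc)
    return results, errors

# Demo with empty placeholder files and a stand-in for the pipeline.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.PNG", "a.jpg", "notes.txt"):
        (Path(tmp) / name).touch()
    results, errors = run_batch(tmp, ocr=lambda p: {"regions": []})
print(sorted(results))  # ['a.jpg', 'b.PNG']
```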

Export to JSON

import json

result = ocr("document.jpg")

# Drop PIL crops (present only with return_crops=True); they are not JSON-serialisable
exportable = {
    "subject":   result["subject"],
    "reference": result["reference"],
    "content":   result["content"],
    "regions": [
        {k: v for k, v in r.items() if k != "crop"}
        for r in result["regions"]
    ],
}

with open("output.json", "w", encoding="utf-8") as f:
    json.dump(exportable, f, ensure_ascii=False, indent=2)
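
The `ensure_ascii=False` argument is what keeps the Khmer text readable in the file. The short comparison below uses only the standard `json` module and a sample string from the class table; both forms load back to the same text.

```python
import json

text = "យោង"  # sample Khmer string from the class table

escaped = json.dumps(text)                       # default: ASCII-only \uXXXX escapes
readable = json.dumps(text, ensure_ascii=False)  # keeps the Khmer characters as-is

# Both forms decode to the same string; only readability of the file differs.
round_trip_ok = json.loads(escaped) == json.loads(readable) == text
print(escaped)
print(readable)
```

When writing with `ensure_ascii=False`, open the file with `encoding="utf-8"` as in the export example, so the output does not depend on the platform default encoding.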

Configuration

ocr = MiniKhOCR(
    det_conf  = 0.25,   # lower → more detections, higher → fewer but more confident
    det_iou   = 0.45,   # NMS IoU threshold
    det_imgsz = 640,    # detection image size
    device    = "auto", # "auto" | "cuda" | "cpu"
)
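
To make `det_iou` concrete: non-maximum suppression compares overlapping boxes by intersection over union (IoU) and, in the standard formulation, drops the lower-confidence box when the IoU exceeds the threshold. The function below is a generic IoU sketch over the box format used in `regions`; it is not taken from the pipeline code.

```python
def iou(a, b):
    """Intersection over union of two {"x1", "y1", "x2", "y2"} boxes."""
    inter_w = max(0, min(a["x2"], b["x2"]) - max(a["x1"], b["x1"]))
    inter_h = max(0, min(a["y2"], b["y2"]) - max(a["y1"], b["y1"]))
    inter = inter_w * inter_h
    area_a = (a["x2"] - a["x1"]) * (a["y2"] - a["y1"])
    area_b = (b["x2"] - b["x1"]) * (b["y2"] - b["y1"])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

a = {"x1": 0, "y1": 0, "x2": 100, "y2": 40}
b = {"x1": 50, "y1": 0, "x2": 150, "y2": 40}
overlap = iou(a, b)
print(round(overlap, 3))  # 0.333
```

With `det_iou = 0.45`, these two boxes (IoU of about 0.33) would both be kept; raising their overlap above 0.45 would leave only the more confident one.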

Limitations

Designed for document-style images (printed text, clear layout).
Text recognition works best on single-line crops. A very tall content region that spans several lines may have its lines merged in the output.
Handwritten text is not supported.
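
One way to spot regions that may be affected by the multi-line limitation is to compare each box height with the typical height on the page. This is a heuristic sketch, not part of the pipeline: `flag_tall_regions` is a hypothetical helper and the `ratio` threshold of 1.8 is an arbitrary starting point to tune on your own documents.

```python
from statistics import median

def flag_tall_regions(regions, ratio=1.8):
    """Return regions whose box height exceeds `ratio` times the median height."""
    heights = [r["box"]["y2"] - r["box"]["y1"] for r in regions]
    if not heights:
        return []
    limit = ratio * median(heights)
    return [r for r, h in zip(regions, heights) if h > limit]

regions = [
    {"class": "subject",   "box": {"x1": 10, "y1": 5,   "x2": 320, "y2": 40}},
    {"class": "reference", "box": {"x1": 10, "y1": 50,  "x2": 200, "y2": 75}},
    {"class": "content",   "box": {"x1": 10, "y1": 90,  "x2": 600, "y2": 120}},
    {"class": "content",   "box": {"x1": 10, "y1": 130, "x2": 600, "y2": 230}},
]
tall = flag_tall_regions(regions)
print([r["box"]["y1"] for r in tall])  # [130]
```

Flagged regions can then be reviewed by hand or split into lines before recognition.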

License

MIT
