mini-kh-OCR: Khmer & English Document OCR Pipeline

An end-to-end OCR pipeline that combines two models to detect, classify, and recognise Khmer and English text from document images.

Input Image
    │
    ▼
┌─────────────────────────────┐
│  Text Detection             │  phonsobon/mini-text-detection (YOLO11n)
│  → subject / reference /    │
│    content bounding boxes   │
└─────────────┬───────────────┘
              │  crop each region
              ▼
┌─────────────────────────────┐
│  Text Recognition           │  phonsobon/mini-ocr (CRNN + CTC)
│  → Khmer & English text     │
└─────────────┬───────────────┘
              │
              ▼
     Structured output
     grouped by class
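
The flow in the diagram can be sketched in a few lines. This is a minimal illustration, not the code in mini_kh_ocr.py: `detect` and `recognise` are hypothetical stand-ins for the two models, and the region layout follows the output format documented in this README.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]          # (x1, y1, x2, y2) in pixels
Detection = Tuple[str, float, Box]       # (class name, confidence, box)

CLASSES = ("subject", "reference", "content")


def run_pipeline(
    image,
    detect: Callable[[object], List[Detection]],
    recognise: Callable[[object, Box], str],
) -> Dict[str, list]:
    """Detect regions, read each one, and group the text by class.

    `detect` and `recognise` are hypothetical callables standing in for
    the detection and recognition models.
    """
    result: Dict[str, list] = {name: [] for name in CLASSES}
    result["regions"] = []

    # Sort by the top edge (y1) so regions come out in reading order.
    for cls, conf, box in sorted(detect(image), key=lambda d: d[2][1]):
        text = recognise(image, box)     # crop + OCR, delegated to the stub
        result[cls].append(text)
        x1, y1, x2, y2 = box
        result["regions"].append(
            {"class": cls, "conf": conf,
             "box": {"x1": x1, "y1": y1, "x2": x2, "y2": y2},
             "text": text}
        )
    return result


# Toy stand-ins so the sketch runs without any model weights.
def fake_detect(image):
    return [("content", 0.93, (10, 90, 600, 120)),
            ("subject", 0.91, (10, 5, 320, 40))]


def fake_recognise(image, box):
    return f"text at y={box[1]}"


demo = run_pipeline(None, fake_detect, fake_recognise)
print(demo["subject"], demo["content"])
```

Passing the two models in as callables keeps the grouping logic independent of how detection and recognition are actually implemented.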

Detection Classes

| ID | Class | Khmer | Description |
|----|-------|-------|-------------|
| 0 | subject | កម្មវត្ថុ | Title or subject heading |
| 1 | reference | យោង | Reference or citation |
| 2 | content | អត្ថបទ | Main body / paragraph text |

Models Used

| Role | Repository |
|------|------------|
| Text Detection | phonsobon/mini-text-detection |
| Text Recognition | phonsobon/mini-ocr |
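
The recognition model is described as CRNN + CTC. For readers unfamiliar with CTC, the sketch below shows greedy CTC decoding in general: take the best class per time step, collapse consecutive repeats, then drop the blank symbol. The blank index of 0 and the toy alphabet are assumptions for illustration; the actual decoding and character set in phonsobon/mini-ocr may differ.

```python
from typing import List, Sequence

BLANK = 0  # assumed blank index; the real model may use a different one


def ctc_greedy_decode(step_ids: Sequence[int], alphabet: Sequence[str]) -> str:
    """Collapse repeated ids, then remove blanks.

    `step_ids` holds the argmax class id for each time step.
    `alphabet[i - 1]` is the character for class id `i` (id 0 is blank).
    """
    chars: List[str] = []
    previous = BLANK
    for idx in step_ids:
        # Emit only when the id changes and is not the blank symbol.
        if idx != previous and idx != BLANK:
            chars.append(alphabet[idx - 1])
        previous = idx
    return "".join(chars)


toy_alphabet = ["a", "b", "c"]
# Time steps: a a _ a b b _ c  ->  "aabc" (the blank separates the two a's)
decoded = ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0, 3], toy_alphabet)
print(decoded)
```

The blank symbol is what lets CTC output a doubled character: without the blank between them, the two runs of `a` would collapse into one.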

Files

| File | Description |
|------|-------------|
| mini_kh_ocr.py | Pipeline class; load and import this file |

Installation

pip install torch torchvision ultralytics huggingface_hub pillow numpy

Quick Start

from huggingface_hub import hf_hub_download

# Download pipeline script
pipeline_path = hf_hub_download(
    repo_id="phonsobon/mini-kh-OCR",
    filename="mini_kh_ocr.py",
)

import importlib.util
spec = importlib.util.spec_from_file_location("mini_kh_ocr", pipeline_path)
mod  = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

MiniKhOCR = mod.MiniKhOCR

# ── Load pipeline ──────────────────────────────────────────
ocr = MiniKhOCR()

# ── Run on an image ────────────────────────────────────────
result = ocr("your_document.jpg")

Output Format

result is a dictionary with the following structure:

{
    "subject":   ["កម្មវត្ថុ: សំណើរសុំច្បាប់"],   # កម្មវត្ថុ: subject/heading texts
    "reference": ["យោង: លេខ ០០១/២៤"],           # យោង: reference texts
    "content":   ["អត្ថបទ...", "..."],            # អត្ថបទ: body paragraph texts

    "regions": [                                  # all detections sorted top → bottom
        {
            "class": "subject",
            "conf":  0.91,
            "box":   {"x1": 10, "y1": 5, "x2": 320, "y2": 40},
            "text":  "កម្មវត្ថុ: សំណើរសុំច្បាប់",
        },
        {
            "class": "reference",
            "conf":  0.87,
            "box":   {"x1": 10, "y1": 50, "x2": 200, "y2": 75},
            "text":  "យោង: លេខ ០០១/២៤",
        },
        ...
    ]
}
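
Because every region carries its class, confidence, and box, post-filtering is straightforward. The helper below is a hypothetical example (not part of mini_kh_ocr.py) that keeps only regions at or above a confidence threshold and reports each box's size; it runs on a hand-written dictionary in the documented shape, with placeholder text.

```python
from typing import Dict, List


def confident_regions(result: Dict, min_conf: float = 0.9) -> List[Dict]:
    """Return regions with conf >= min_conf, adding width and height."""
    kept = []
    for region in result["regions"]:
        if region["conf"] < min_conf:
            continue
        box = region["box"]
        kept.append({
            "class": region["class"],
            "text": region["text"],
            "width": box["x2"] - box["x1"],
            "height": box["y2"] - box["y1"],
        })
    return kept


# Hand-written sample in the documented output shape (placeholder text).
sample = {
    "regions": [
        {"class": "subject", "conf": 0.91,
         "box": {"x1": 10, "y1": 5, "x2": 320, "y2": 40}, "text": "heading"},
        {"class": "reference", "conf": 0.87,
         "box": {"x1": 10, "y1": 50, "x2": 200, "y2": 75}, "text": "ref"},
    ]
}

strong = confident_regions(sample)
print(strong)
```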

Usage Examples

Access text by class

result = ocr("document.jpg")

print("=== SUBJECT ===")
for text in result["subject"]:
    print(text)

print("=== REFERENCE ===")
for text in result["reference"]:
    print(text)

print("=== CONTENT ===")
for text in result["content"]:
    print(text)

Format as a structured document

document = ocr.to_document(result)
print(document)

# Output:
# [SUBJECT]
# កម្មវត្ថុ: សំណើរសុំច្បាប់
#
# [REFERENCE]
# យោង: លេខ ០០១/២៤
#
# [CONTENT]
# អត្ថបទដំបូង
# អត្ថបទទីពីរ

Verbose mode: print each region as it is processed

result = ocr("document.jpg", verbose=True)

# [subject]   (10,5)→(320,40)   conf=0.91  →  'កម្មវត្ថុ: សំណើរសុំច្បាប់'
# [reference] (10,50)→(200,75)  conf=0.87  →  'យោង: លេខ ០០១/២៤'
# [content]   (10,90)→(600,120) conf=0.93  →  'អត្ថបទដំបូង'

Get cropped images alongside text

result = ocr("document.jpg", return_crops=True)

for region in result["regions"]:
    print(region["class"], "→", region["text"])
    region["crop"].show()   # PIL Image of the cropped region

Batch processing

import os

folder = "path/to/documents/"
all_results = {}

for fname in os.listdir(folder):
    if fname.lower().endswith((".jpg", ".jpeg", ".png")):
        path = os.path.join(folder, fname)
        result = ocr(path)
        all_results[fname] = {
            "subject":   result["subject"],
            "reference": result["reference"],
            "content":   result["content"],
        }
        print(f"✅ {fname}: {len(result['regions'])} regions detected")

Export to JSON

import json

result = ocr("document.jpg")

# Remove PIL crops before serialising (not JSON-serialisable)
exportable = {
    "subject":   result["subject"],
    "reference": result["reference"],
    "content":   result["content"],
    "regions": [
        {k: v for k, v in r.items() if k != "crop"}
        for r in result["regions"]
    ],
}

with open("output.json", "w", encoding="utf-8") as f:
    json.dump(exportable, f, ensure_ascii=False, indent=2)

Configuration

ocr = MiniKhOCR(
    det_conf  = 0.25,   # lower → more detections, higher → fewer but more confident
    det_iou   = 0.45,   # NMS IoU threshold
    det_imgsz = 640,    # detection image size
    device    = "auto", # "auto" | "cuda" | "cpu"
)
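
For context on `det_iou`: non-maximum suppression compares overlapping boxes by intersection over union (IoU) and, in its usual formulation, discards the lower-scoring box when the IoU exceeds the threshold. The function below is a generic IoU sketch using the same x1/y1/x2/y2 keys as the output format; it is not taken from the pipeline or from Ultralytics.

```python
from typing import Dict


def iou(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Overlap rectangle; width/height clamp to 0 when boxes are disjoint.
    inter_w = max(0.0, min(a["x2"], b["x2"]) - max(a["x1"], b["x1"]))
    inter_h = max(0.0, min(a["y2"], b["y2"]) - max(a["y1"], b["y1"]))
    inter = inter_w * inter_h

    area_a = (a["x2"] - a["x1"]) * (a["y2"] - a["y1"])
    area_b = (b["x2"] - b["x1"]) * (b["y2"] - b["y1"])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


box_a = {"x1": 0, "y1": 0, "x2": 100, "y2": 100}
box_b = {"x1": 50, "y1": 0, "x2": 150, "y2": 100}

overlap = iou(box_a, box_b)
print(round(overlap, 3))   # 0.333
```

Under that usual formulation, two boxes with an IoU of 0.333 would both be kept at `det_iou = 0.45`, since the overlap is below the threshold.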

Limitations

  • Designed for document-style images (printed text, clear layout).
  • Text recognition works best on single-line crops; very tall content regions spanning multiple lines may merge lines together.
  • Handwritten text is not supported.
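
One possible workaround for the multi-line limitation is to split a tall crop into line strips before recognition, using a horizontal projection profile: count dark pixels per row and cut where rows are empty. The sketch below is an illustration under stated assumptions, not a feature of the pipeline. It works on a plain list of per-row ink counts (which could be obtained from a binarised crop, for example by summing each row of a NumPy array), and `min_ink` is a hypothetical parameter.

```python
from typing import List, Tuple


def split_lines(row_ink: List[int], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, for each text line.

    `row_ink[y]` is the number of dark pixels in row `y` of a crop.
    Rows with fewer than `min_ink` dark pixels count as gaps between lines.
    """
    bands = []
    start = None
    for y, ink in enumerate(row_ink):
        if ink >= min_ink and start is None:
            start = y                      # a text line begins
        elif ink < min_ink and start is not None:
            bands.append((start, y))       # the line ended on the row above
            start = None
    if start is not None:                  # line touching the bottom edge
        bands.append((start, len(row_ink)))
    return bands


# Two lines of text separated by two blank rows.
profile = [0, 5, 9, 7, 0, 0, 6, 8, 4]
bands = split_lines(profile)
print(bands)   # [(1, 4), (6, 9)]
```

Each band could then be cut out with PIL's `Image.crop((0, top, width, bottom))`. How a strip would be passed to the recogniser depends on the pipeline's internals, which this README does not document, and Khmer subscripts and diacritics may narrow the gaps between lines, so a threshold like `min_ink` would likely need tuning.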

License

MIT
