RT-DETRv2 Document Layout Detection (GGUF)

Document layout analysis model for CrispEmbed. Detects 17 region types in document images.

Architecture

  • Backbone: ResNet-50-D (BN-folded Conv2d)
  • Encoder: HybridEncoder (AIFI self-attention + FPN/PAN with CSP-RepVGG)
  • Decoder: 6-layer transformer with deformable multi-scale cross-attention (300 queries)
  • Classes: 17 (text, title, table, figure, formula, caption, section_header, list_item, footnote, page_header, page_footer, code, document_index, checkbox_selected, checkbox_unselected, form, key_value_region)
  • Parameters: 42M
  • Source: docling-project/docling-layout-heron (Apache-2.0)
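
The decoder emits one class-logit vector and one box per query (300 queries, 17 classes). As a rough illustration of how such outputs are commonly turned into regions in DETR-style detectors, here is a minimal sketch. It assumes sigmoid class scores and boxes given as normalized (cx, cy, w, h); the function name, the threshold, and the output dictionary keys are hypothetical, and CrispEmbed's actual post-processing may differ. Class ids are returned as integers because the order of the class list above is not guaranteed to be the id order.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def decode_queries(logits, boxes, score_threshold=0.5):
    """Turn per-query outputs into a list of detections (hypothetical helper).

    logits: one list of class logits per query, e.g. 300 lists of 17 floats
    boxes:  one (cx, cy, w, h) tuple per query, normalized to [0, 1] (assumed)
    """
    detections = []
    for query_logits, (cx, cy, w, h) in zip(logits, boxes):
        scores = [sigmoid(v) for v in query_logits]
        # Best class for this query.
        class_id = max(range(len(scores)), key=scores.__getitem__)
        score = scores[class_id]
        if score < score_threshold:
            continue
        # Convert center/size to corner coordinates (x0, y0, x1, y1).
        box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
        detections.append({"class_id": class_id, "score": score, "box": box})
    # Highest-confidence regions first.
    detections.sort(key=lambda d: d["score"], reverse=True)
    return detections
```

Multiply the corner coordinates by the page width and height to get pixel boxes.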

Variants

File                      Size     Format   Notes
layout-heron-f32.gguf     161 MB   F32      Full precision, development
layout-heron-q8_0.gguf    43 MB    Q8_0     Recommended for inference
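
The two file sizes follow closely from the parameter count. The sketch below is a back-of-the-envelope check, assuming every weight is stored in the named format and ignoring GGUF metadata. It uses the standard Q8_0 block layout (32 int8 values plus one 2-byte fp16 scale per block) and assumes the listed sizes are in binary megabytes (MiB). The function name is illustrative only.

```python
# Back-of-the-envelope GGUF size estimate from the parameter count.
# Assumption: every weight is stored in the named format; metadata is ignored.

PARAMS = 42.1e6  # parameter count stated on this model card

# Bytes per weight for each format.
BYTES_PER_WEIGHT = {
    "F32": 4.0,                 # one 32-bit float per weight
    "Q8_0": (32 * 1 + 2) / 32,  # 34 bytes per block of 32 weights
}

MIB = 1024 ** 2

def estimated_size_mib(params, fmt):
    """Return the estimated tensor payload in MiB for the given format."""
    return params * BYTES_PER_WEIGHT[fmt] / MIB

if __name__ == "__main__":
    for fmt in BYTES_PER_WEIGHT:
        print(f"{fmt}: ~{estimated_size_mib(PARAMS, fmt):.0f} MiB")
```

Under these assumptions the estimates round to 161 MiB for F32 and 43 MiB for Q8_0, in line with the sizes listed above. In practice some tensors may be kept at higher precision, so treat the Q8_0 figure as approximate.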

Usage

# CLI
./build/crispembed -m layout-heron --layout document.png --json

# Server
./build/crispembed-server --layout layout-heron-q8_0.gguf
curl -X POST http://localhost:8080/layout/detect -d '{"image": "page.png"}'

# Python
from crispembed import CrispLayout
layout = CrispLayout("layout-heron-q8_0.gguf")
regions = layout.detect("document.png")
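
The server endpoint shown in the curl call can also be reached from Python with only the standard library. This is a minimal sketch, not part of CrispEmbed: the helper names are hypothetical, the `Content-Type: application/json` header is an assumption (the curl command above does not set it), and the response is assumed to be JSON whose schema is not documented here.

```python
import json
import urllib.request

def build_layout_request(image_path, base_url="http://localhost:8080"):
    """Build the POST request for /layout/detect (hypothetical helper)."""
    payload = json.dumps({"image": image_path}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/layout/detect",
        data=payload,
        # Assumption: the server accepts a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def detect_layout(image_path, base_url="http://localhost:8080", timeout=30.0):
    """Send the request and return the parsed body (assumed to be JSON)."""
    request = build_layout_request(image_path, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With crispembed-server running as shown above, `detect_layout("page.png")` would return whatever the server sends back. As in the curl example, the image path is resolved by the server, not by the client.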

Parity

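  • Encoder: every stage reaches cosine similarity 1.0 against the Hugging Face (HF) reference when both are fed the exact same input tensor
  • Detection score: 0.934 (HF reference: 0.955)
  • 14 parity bugs were found and fixed through a systematic layer-by-layer diff against the reference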
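
A layer-by-layer diff of this kind can be sketched as follows. This is an illustrative outline, not the tooling actually used: the function names, the stage names, and the 0.9999 threshold are assumptions, and activations are taken as flat lists of floats dumped from each implementation in execution order.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat sequences of floats."""
    if len(a) != len(b):
        raise ValueError("activation shapes differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero tensor as a mismatch
    return dot / (norm_a * norm_b)

def first_divergence(reference, candidate, threshold=0.9999):
    """Return (stage, cosine) for the first stage below threshold, else None.

    Both arguments map stage names to flattened activations; the dict order
    of `reference` is taken as the execution order.
    """
    for stage, ref_values in reference.items():
        cos = cosine_similarity(ref_values, candidate[stage])
        if cos < threshold:
            return stage, cos
    return None

if __name__ == "__main__":
    # Toy data with hypothetical stage names.
    ref = {"stem": [1.0, 2.0, 3.0], "stage1": [1.0, 0.0, 0.0]}
    out = {"stem": [1.0, 2.0, 3.0], "stage1": [0.0, 1.0, 0.0]}
    print(first_divergence(ref, out))
```

Stopping at the first divergent stage matters because every later stage inherits the error, so only the earliest mismatch points at the bug.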

License

Apache-2.0 (same as upstream docling-layout-heron).
