RT-DETRv2 Document Layout Detection (GGUF)

Document layout analysis model for CrispEmbed. Detects 17 region types in document images.

Architecture

Backbone: ResNet-50-D (BN-folded Conv2d)
Encoder: HybridEncoder (AIFI self-attention + FPN/PAN with CSP-RepVGG)
Decoder: 6-layer transformer with deformable multi-scale cross-attention (300 queries)
Classes: 17 (text, title, table, figure, formula, caption, section_header, list_item, footnote, page_header, page_footer, code, document_index, checkbox_selected, checkbox_unselected, form, key_value_region)
Parameters: 42M
Source: docling-project/docling-layout-heron (Apache-2.0)
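
As a rough illustration of how the decoder output described above turns into regions, the sketch below decodes 300 queries x 17 class logits into labelled boxes. It assumes raw logits of shape [queries][classes] and boxes in normalized (cx, cy, w, h) form, which is the usual RT-DETR convention but is not confirmed here for CrispEmbed. The function name `decode`, the 0.5 threshold, and the per-query argmax are illustrative choices; reference implementations often use a top-k over all query/class pairs instead. Class ids are left numeric because the id-to-name order is defined by the upstream model config.

```python
import math

NUM_QUERIES = 300  # decoder queries, from the list above
NUM_CLASSES = 17   # region classes, from the list above


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode(logits, boxes, width, height, threshold=0.5):
    """Hypothetical post-processing: per-query argmax plus score threshold.

    logits: [queries][classes] raw scores (assumed layout)
    boxes:  [queries][4] normalized (cx, cy, w, h) (assumed layout)
    """
    detections = []
    for row, (cx, cy, w, h) in zip(logits, boxes):
        scores = [sigmoid(v) for v in row]
        class_id = max(range(len(scores)), key=scores.__getitem__)
        if scores[class_id] < threshold:
            continue
        # Convert center format to pixel corner coordinates.
        detections.append({
            "class_id": class_id,
            "score": scores[class_id],
            "box": ((cx - w / 2) * width, (cy - h / 2) * height,
                    (cx + w / 2) * width, (cy + h / 2) * height),
        })
    # Highest-confidence regions first.
    return sorted(detections, key=lambda d: d["score"], reverse=True)
```

In practice `logits` would hold NUM_QUERIES rows of NUM_CLASSES values each; the function itself works for any consistent shape.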

Variants

File	Size	Format	Notes
layout-heron-f32.gguf	161 MB	F32	Full precision, development
layout-heron-q8_0.gguf	43 MB	Q8_0	Recommended for inference
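
The file sizes above can be sanity-checked with a little arithmetic. The sketch below assumes that the listed sizes are in MiB, that every weight is stored in the named format, and that metadata overhead is negligible; none of this is stated in the table. The Q8_0 figure relies on the ggml block layout of 32 int8 weights plus one 16-bit scale (34 bytes per 32 weights).

```python
PARAMS = 42_000_000  # approximate parameter count from the Architecture section
MIB = 1024 ** 2

F32_BYTES_PER_WEIGHT = 4.0
# Q8_0 block: 32 int8 weights + one 16-bit scale = 34 bytes per 32 weights.
Q8_0_BYTES_PER_WEIGHT = 34 / 32


def estimate_mib(params, bytes_per_weight):
    # Ignores GGUF metadata and any tensors kept in a different format.
    return params * bytes_per_weight / MIB


f32_mib = estimate_mib(PARAMS, F32_BYTES_PER_WEIGHT)
q8_mib = estimate_mib(PARAMS, Q8_0_BYTES_PER_WEIGHT)
print(f"F32 estimate:  {f32_mib:.1f} MiB")
print(f"Q8_0 estimate: {q8_mib:.1f} MiB")
```

Both estimates land within about 1 MiB of the listed sizes, which is consistent with Q8_0 costing roughly 8.5 bits per weight rather than exactly 8.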

Usage

# CLI
./build/crispembed -m layout-heron --layout document.png --json

# Server
./build/crispembed-server --layout layout-heron-q8_0.gguf
curl -X POST http://localhost:8080/layout/detect -d '{"image": "page.png"}'

# Python
from crispembed import CrispLayout
layout = CrispLayout("layout-heron-q8_0.gguf")
regions = layout.detect("document.png")
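
For callers that prefer Python over curl, the server request above might be reproduced with the standard library as sketched below. The endpoint and payload come from the curl example; the `Content-Type` header, the JSON response format, and the helper names `build_layout_request` and `detect_layout` are assumptions, not documented CrispEmbed behavior.

```python
import json
import urllib.request


def build_layout_request(image, base_url="http://localhost:8080"):
    # Same JSON body as the curl example: {"image": "<path>"}.
    payload = json.dumps({"image": image}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/layout/detect",
        data=payload,
        headers={"Content-Type": "application/json"},  # assumption
        method="POST",
    )


def detect_layout(image, base_url="http://localhost:8080", timeout=30):
    # Assumes the server replies with a JSON document describing the regions.
    request = build_layout_request(image, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `detect_layout("page.png")` requires a running crispembed-server; the structure of the returned regions is not specified here, so inspect the response before relying on particular keys.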

Parity

Encoder: cosine similarity of 1.0 against the Hugging Face (HF) reference at every stage, given identical input
Detection score: 0.934 (HF reference: 0.955)
14 parity bugs found and fixed via a systematic layer-by-layer diff
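
A layer-by-layer diff of this kind can be sketched as follows: dump the activations of each stage from both implementations, compare them with cosine similarity, and stop at the first stage that diverges. This is a minimal illustration, not the tooling actually used; the stage names, the flat-list activation format, and the 1e-4 tolerance are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def first_divergence(reference, candidate, tolerance=1e-4):
    """Return (stage_name, cos) for the first stage whose similarity drops
    below 1 - tolerance, or None if every stage matches.

    reference, candidate: ordered lists of (stage_name, flat list of floats).
    """
    for (name, ref), (_, cand) in zip(reference, candidate):
        cos = cosine_similarity(ref, cand)
        if 1.0 - cos > tolerance:
            return name, cos
    return None
```

Checking stages in execution order matters: once one stage diverges, every later stage inherits the error, so only the first mismatch points at the actual bug.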

License

Apache-2.0 (same as upstream docling-layout-heron).
