RT-DETRv2 Document Layout Detection (GGUF)
Document layout analysis model for CrispEmbed. Detects 17 region types in document images.
Architecture
- Backbone: ResNet-50-D (BN-folded Conv2d)
- Encoder: HybridEncoder (AIFI self-attention + FPN/PAN with CSP-RepVGG)
- Decoder: 6-layer transformer with deformable multi-scale cross-attention (300 queries)
- Classes: 17 (text, title, table, figure, formula, caption, section_header, list_item, footnote, page_header, page_footer, code, document_index, checkbox_selected, checkbox_unselected, form, key_value_region)
- Parameters: 42M
- Source: docling-project/docling-layout-heron (Apache-2.0)
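The "BN-folded Conv2d" note on the backbone refers to the common inference-time trick of merging each BatchNorm layer into the convolution that precedes it. The following is a minimal sketch of that arithmetic in plain Python, not the converter's actual code: the function names, the flat per-channel weight layout, and the `eps` default are assumptions made for illustration.

```python
import math


def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel BatchNorm statistics into conv weights and bias.

    weights: one flat list of kernel values per output channel (hypothetical layout).
    bias, gamma, beta, mean, var: one float per output channel.
    """
    folded_w, folded_b = [], []
    for w, b, g, bt, m, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)          # per-channel BN scale
        folded_w.append([x * scale for x in w])  # scale every kernel value
        folded_b.append((b - m) * scale + bt)    # shift the bias accordingly
    return folded_w, folded_b


def conv_at_one_position(weights, bias, patch):
    """Evaluate every output channel of a conv at a single spatial position."""
    return [sum(x * p for x, p in zip(w, patch)) + b for w, b in zip(weights, bias)]


def batchnorm(values, gamma, beta, mean, var, eps=1e-5):
    """Apply inference-mode BatchNorm to one value per channel."""
    return [
        (y - m) / math.sqrt(v + eps) * g + bt
        for y, g, bt, m, v in zip(values, gamma, beta, mean, var)
    ]
```

Running `conv_at_one_position` with the folded weights should match conv followed by `batchnorm` up to floating-point rounding, which is why the folded model needs no separate BatchNorm tensors.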
Variants
| File | Size | Format | Notes |
|---|---|---|---|
| layout-heron-f32.gguf | 161 MB | F32 | Full precision, development |
| layout-heron-q8_0.gguf | 43 MB | Q8_0 | Recommended for inference |
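As a rough sanity check on the sizes above, the sketch below estimates file size from the parameter count. It assumes the standard ggml Q8_0 layout (blocks of 32 int8 weights plus one 2-byte fp16 scale, 34 bytes per block), treats the rounded 42M figure as exact, and ignores GGUF metadata and any tensors kept at higher precision, so the results are only approximate.

```python
Q8_0_BLOCK_WEIGHTS = 32  # weights per Q8_0 block
Q8_0_BLOCK_BYTES = 34    # 32 int8 values + one 2-byte fp16 scale


def f32_bytes(n_params):
    """Size of n_params weights stored as 4-byte floats."""
    return n_params * 4


def q8_0_bytes(n_params):
    """Size of n_params weights stored as Q8_0 blocks (rounded up to whole blocks)."""
    blocks = -(-n_params // Q8_0_BLOCK_WEIGHTS)  # ceiling division
    return blocks * Q8_0_BLOCK_BYTES


def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / (1024 * 1024)


if __name__ == "__main__":
    n_params = 42_000_000  # rounded parameter count from the model card
    print(f"F32 : {to_mib(f32_bytes(n_params)):.1f} MiB")
    print(f"Q8_0: {to_mib(q8_0_bytes(n_params)):.1f} MiB")
```

Under these assumptions both estimates land within a few percent of the listed file sizes, consistent with Q8_0 costing about 1.06 bytes per weight.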
Usage
# CLI
./build/crispembed -m layout-heron --layout document.png --json
# Server
./build/crispembed-server --layout layout-heron-q8_0.gguf
curl -X POST http://localhost:8080/layout/detect -d '{"image": "page.png"}'
# Python
from crispembed import CrispLayout
layout = CrispLayout("layout-heron-q8_0.gguf")
regions = layout.detect("document.png")
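The server call above can also be issued from Python with only the standard library. This is a sketch under assumptions: the endpoint and payload mirror the curl example, while the `Content-Type: application/json` header and the helper name `build_detect_request` are not taken from the CrispEmbed documentation, and the response format is not documented here.

```python
import json
import urllib.request


def build_detect_request(image_path, base_url="http://localhost:8080"):
    """Build a POST request for the /layout/detect endpoint shown in the curl example.

    The JSON Content-Type header is an assumption; the curl example sets none.
    """
    payload = json.dumps({"image": image_path}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/layout/detect",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_detect_request("page.png")
    # Requires a running crispembed-server; prints the raw response body.
    with urllib.request.urlopen(request) as response:
        print(response.read().decode("utf-8"))
```

Keeping request construction separate from the network call makes the payload easy to inspect without a running server.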
Parity
- Encoder: cosine similarity of 1.0 against the Hugging Face (HF) reference at every stage, given identical input
- Detection score: 0.934 (HF reference: 0.955)
- 14 parity bugs found and fixed via systematic layer-by-layer diff
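A layer-by-layer diff of the kind described above can be sketched as follows. This is an illustrative outline, not the tooling actually used: the stage names, the `(name, activations)` list format, and the 0.999 threshold are hypothetical, and activations are assumed to be flattened into plain lists of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def first_divergent_stage(reference, candidate, threshold=0.999):
    """Return the name of the first stage whose similarity drops below threshold.

    reference, candidate: lists of (stage_name, flat_activations) in execution order.
    Returns None when every stage passes.
    """
    for (name, ref), (_, cand) in zip(reference, candidate):
        if cosine_similarity(ref, cand) < threshold:
            return name
    return None
```

Because cosine similarity ignores overall scale, a check like this would typically be paired with a maximum absolute difference to catch scaling errors.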
License
Apache-2.0 (same as upstream docling-layout-heron).