PP-DocLayoutV3 ONNX export

ONNX export of PaddlePaddle/PP-DocLayoutV3_safetensors, the layout-detection model used in the PaddleOCR-VL-1.5 pipeline.

This export preserves all four model heads: classification logits, bounding boxes, instance-segmentation masks, and reading-order logits. The original PaddlePaddle release outputs polygon points and reading order via a postprocessor that consumes these four tensors.

Files

file                      size     purpose
PP-DocLayoutV3.onnx       ~5 MB    model graph (references external weights)
PP-DocLayoutV3.onnx.data  ~137 MB  weight tensors (must sit alongside the .onnx file)
config.json               -        original model config (HuggingFace-style)
preprocessor_config.json  -        image preprocessing parameters (800×800 resize, normalize)
inference.yml             -        original PaddlePaddle inference config (reference only)

Inputs / outputs

Input (single tensor):

name          shape             dtype    notes
pixel_values  (B, 3, 800, 800)  float32  resize image to 800×800, rescale by 1/255, mean=[0,0,0], std=[1,1,1] (matches preprocessor_config.json)
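The preprocessing above can be sketched in plain numpy. The nearest-neighbor resize below is an approximation chosen to stay dependency-free; swap in a PIL or cv2 bilinear resize to match the original preprocessor's interpolation exactly.

```python
import numpy as np

def preprocess(image: np.ndarray, size: int = 800) -> np.ndarray:
    """Turn an (H, W, 3) uint8 RGB image into a (1, 3, size, size) float32 tensor."""
    h, w, _ = image.shape
    # nearest-neighbor resize via index sampling (approximation; see lead-in)
    rows = (np.arange(size) * h / size).astype(np.int64)
    cols = (np.arange(size) * w / size).astype(np.int64)
    resized = image[rows][:, cols]                    # (size, size, 3)
    # rescale by 1/255; mean=[0,0,0], std=[1,1,1] means no further shift/scale
    x = resized.astype(np.float32) / 255.0
    return x.transpose(2, 0, 1)[None]                 # (1, 3, size, size), NCHW
```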

Outputs (four tensors):

name          shape               notes
logits        (B, 300, 25)        per-query class logits over 25 layout classes
pred_boxes    (B, 300, 4)         normalized (cx, cy, w, h); convert via standard DETR decoding
out_masks     (B, 300, 200, 200)  per-query instance-segmentation masks; cv2 contour extraction yields polygon points
order_logits  (B, 300, 300)       per-query permutation logits for reading order; argmax or a Sinkhorn pass for ordering
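The standard DETR box decoding mentioned for pred_boxes is a fixed transform and can be sketched directly; target_h and target_w are the original image dimensions you want pixel coordinates in.

```python
import numpy as np

def decode_boxes(pred_boxes: np.ndarray, target_h: int, target_w: int) -> np.ndarray:
    """Convert normalized (cx, cy, w, h) boxes to (x1, y1, x2, y2) in pixels.

    pred_boxes: (B, 300, 4) as emitted by the model.
    """
    cx, cy, w, h = np.split(pred_boxes, 4, axis=-1)
    # center/size -> corner form, still in [0, 1] coordinates
    boxes = np.concatenate([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)
    # scale to the target image size
    scale = np.array([target_w, target_h, target_w, target_h], dtype=pred_boxes.dtype)
    return boxes * scale
```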

Postprocessing

The official postprocessor lives in transformers.models.pp_doclayout_v3.image_processing_pp_doclayout_v3.PPDocLayoutV3ImageProcessor.post_process_object_detection. It takes the four output tensors plus a target_sizes tensor and returns:

{
  "scores":         (N,)      float32
  "labels":         (N,)      int64
  "boxes":          (N, 4)    float32, axis-aligned (x1, y1, x2, y2) in target coords
  "polygon_points": list[N]   each (P, 2) int polygon vertices in target coords
  "order_seq":      (N,)      int64, reading-order index
}

You can use that postprocessor directly (transformers >= 5.4, requires torch and cv2) or port it to numpy + cv2 for a torch-free runtime.
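For a torch-free port, the detection-selection step can be sketched as follows. This assumes DETR-style sigmoid scoring over the class logits (an assumption about the official postprocessor's scoring); box, polygon, and order decoding from the other three tensors are separate steps.

```python
import numpy as np

def select_detections(logits: np.ndarray, threshold: float = 0.5):
    """Keep queries whose best sigmoid class score clears a threshold.

    logits: (B, 300, 25). Returns (scores, labels, kept query indices)
    for batch element 0. Sigmoid scoring is an assumption; see lead-in.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))   # per-class sigmoid probabilities
    scores = probs.max(axis=-1)             # best class score per query
    labels = probs.argmax(axis=-1)          # best class index per query
    keep = scores[0] > threshold
    return scores[0][keep], labels[0][keep], np.nonzero(keep)[0]
```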

Loading

import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("PP-DocLayoutV3.onnx", providers=["CPUExecutionProvider"])
# preprocess to 800x800 RGB float32, normalize per preprocessor_config.json
pixel_values = ...  # shape (1, 3, 800, 800), float32
logits, pred_boxes, out_masks, order_logits = sess.run(
    ["logits", "pred_boxes", "out_masks", "order_logits"],
    {"pixel_values": pixel_values},
)

The .onnx.data sidecar is loaded automatically by onnxruntime via the relative location reference embedded in the graph. Both files must sit in the same directory.
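Given the four outputs, the argmax path for reading order mentioned above can be sketched like this; the greedy argmax is the simple variant, and a Sinkhorn pass over order_logits is the relaxation to reach for when positions collide.

```python
import numpy as np

def reading_order(order_logits: np.ndarray, keep_idx: np.ndarray) -> np.ndarray:
    """Assign a 0..N-1 reading-order rank to each kept query via argmax.

    order_logits: (B, 300, 300) permutation logits; keep_idx: indices of
    the queries that survived detection filtering.
    """
    pos = order_logits[0].argmax(axis=-1)   # most likely position per query
    order = pos[keep_idx]                   # positions of the kept queries
    return order.argsort().argsort()        # re-rank kept queries densely
```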

How this was exported

  1. pip install transformers==5.6.2 torch==2.11 onnx==1.21 onnxscript
  2. model = AutoModelForObjectDetection.from_pretrained("PaddlePaddle/PP-DocLayoutV3_safetensors").eval()
  3. Wrap the model so forward(pixel_values) returns (logits, pred_boxes, out_masks, order_logits).
  4. torch.onnx.export(wrapped, (pixel_values,), "PP-DocLayoutV3.onnx", opset_version=18, dynamo=True, dynamic_axes={"pixel_values": {0: "batch"}})
  5. Re-save with onnx.save(..., save_as_external_data=True, location="PP-DocLayoutV3.onnx.data") to standardize the sidecar filename.

Numerical parity vs torch (random (1, 3, 800, 800) input):

output        max absolute diff
logits        1.32e-4
pred_boxes    1.57e-5
out_masks     1.62e-3
order_logits  3.96e-2

The order_logits deviation reflects accumulated floating-point drift in the decoder's attention; argmax-based reading order is unaffected on the test images we checked.
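The parity metric in the table is a max absolute elementwise difference, which for two runs of the same input reduces to a one-line helper:

```python
import numpy as np

def max_abs_diff(a: np.ndarray, b: np.ndarray) -> float:
    """Max absolute elementwise difference, computed in float64 for stability."""
    return float(np.max(np.abs(a.astype(np.float64) - b.astype(np.float64))))
```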

Inference speed

CPU (Apple M-series, single page, 800×800 input): ~480 ms/page with CPUExecutionProvider.

License

Apache-2.0 (inherited from PaddlePaddle/PP-DocLayoutV3_safetensors).
