PP-DocLayoutV3 ONNX export

ONNX export of PaddlePaddle/PP-DocLayoutV3_safetensors, the layout-detection model used in the PaddleOCR-VL-1.5 pipeline.

This export preserves all four model heads: classification logits, bounding boxes, instance-segmentation masks, and reading-order logits. The original PaddlePaddle release outputs polygon points and reading order via a postprocessor that consumes these four tensors.

Files

file                      size     purpose
PP-DocLayoutV3.onnx       ~5 MB    model graph (references external weights)
PP-DocLayoutV3.onnx.data  ~137 MB  weight tensors (must sit alongside the .onnx file)
config.json               -        original model config (HuggingFace-style)
preprocessor_config.json  -        image preprocessing parameters (800×800 resize, normalize)
inference.yml             -        original PaddlePaddle inference config (reference only)

Inputs / outputs

Input (single tensor):

name          shape             dtype    notes
pixel_values  (B, 3, 800, 800)  float32  resize image to 800×800, rescale by 1/255, mean=[0,0,0], std=[1,1,1] (matches preprocessor_config.json)
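The preprocessing above can be sketched in plain numpy. The nearest-neighbor resize below is an approximation chosen to stay dependency-free; swap in a PIL or cv2 bilinear resize to match the original preprocessor's interpolation exactly.

```python
import numpy as np

def preprocess(image: np.ndarray, size: int = 800) -> np.ndarray:
    """Turn an (H, W, 3) uint8 RGB image into a (1, 3, size, size) float32 tensor."""
    h, w, _ = image.shape
    # nearest-neighbor resize via index sampling (approximation; see lead-in)
    rows = (np.arange(size) * h / size).astype(np.int64)
    cols = (np.arange(size) * w / size).astype(np.int64)
    resized = image[rows][:, cols]                    # (size, size, 3)
    # rescale by 1/255; mean=[0,0,0], std=[1,1,1] means no further shift/scale
    x = resized.astype(np.float32) / 255.0
    return x.transpose(2, 0, 1)[None]                 # (1, 3, size, size), NCHW
```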

Outputs (four tensors):

name          shape               notes
logits        (B, 300, 25)        per-query class logits over 25 layout classes
pred_boxes    (B, 300, 4)         normalized (cx, cy, w, h); convert via standard DETR decoding
out_masks     (B, 300, 200, 200)  per-query instance-segmentation masks; cv2 contour extraction yields polygon points
order_logits  (B, 300, 300)       per-query permutation logits for reading order; argmax or a Sinkhorn pass for ordering
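The standard DETR box decoding mentioned for pred_boxes is a fixed transform and can be sketched directly; target_h and target_w are the original image dimensions you want pixel coordinates in.

```python
import numpy as np

def decode_boxes(pred_boxes: np.ndarray, target_h: int, target_w: int) -> np.ndarray:
    """Convert normalized (cx, cy, w, h) boxes to (x1, y1, x2, y2) in pixels.

    pred_boxes: (B, 300, 4) as emitted by the model.
    """
    cx, cy, w, h = np.split(pred_boxes, 4, axis=-1)
    # center/size -> corner form, still in [0, 1] coordinates
    boxes = np.concatenate([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)
    # scale to the target image size
    scale = np.array([target_w, target_h, target_w, target_h], dtype=pred_boxes.dtype)
    return boxes * scale
```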

Postprocessing

The official postprocessor lives in transformers.models.pp_doclayout_v3.image_processing_pp_doclayout_v3.PPDocLayoutV3ImageProcessor.post_process_object_detection. It takes the four output tensors plus a target_sizes tensor and returns:

{
  "scores":         (N,)      float32
  "labels":         (N,)      int64
  "boxes":          (N, 4)    float32, axis-aligned (x1, y1, x2, y2) in target coords
  "polygon_points": list[N]   each (P, 2) int polygon vertices in target coords
  "order_seq":      (N,)      int64, reading-order index
}

You can use that postprocessor directly (transformers >= 5.4, requires torch and cv2) or port it to numpy + cv2 for a torch-free runtime.
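For a torch-free port, the detection-selection step can be sketched as follows. This assumes DETR-style sigmoid scoring over the class logits (an assumption about the official postprocessor's scoring); box, polygon, and order decoding from the other three tensors are separate steps.

```python
import numpy as np

def select_detections(logits: np.ndarray, threshold: float = 0.5):
    """Keep queries whose best sigmoid class score clears a threshold.

    logits: (B, 300, 25). Returns (scores, labels, kept query indices)
    for batch element 0. Sigmoid scoring is an assumption; see lead-in.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))   # per-class sigmoid probabilities
    scores = probs.max(axis=-1)             # best class score per query
    labels = probs.argmax(axis=-1)          # best class index per query
    keep = scores[0] > threshold
    return scores[0][keep], labels[0][keep], np.nonzero(keep)[0]
```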

Loading

import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("PP-DocLayoutV3.onnx", providers=["CPUExecutionProvider"])
# preprocess to 800x800 RGB float32, normalize per preprocessor_config.json
pixel_values = ...  # shape (1, 3, 800, 800), float32
logits, pred_boxes, out_masks, order_logits = sess.run(
    ["logits", "pred_boxes", "out_masks", "order_logits"],
    {"pixel_values": pixel_values},
)

The .onnx.data sidecar is loaded automatically by onnxruntime via the relative location reference embedded in the graph. Both files must sit in the same directory.
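Given the four outputs, the argmax path for reading order mentioned above can be sketched like this; the greedy argmax is the simple variant, and a Sinkhorn pass over order_logits is the relaxation to reach for when positions collide.

```python
import numpy as np

def reading_order(order_logits: np.ndarray, keep_idx: np.ndarray) -> np.ndarray:
    """Assign a 0..N-1 reading-order rank to each kept query via argmax.

    order_logits: (B, 300, 300) permutation logits; keep_idx: indices of
    the queries that survived detection filtering.
    """
    pos = order_logits[0].argmax(axis=-1)   # most likely position per query
    order = pos[keep_idx]                   # positions of the kept queries
    return order.argsort().argsort()        # re-rank kept queries densely
```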

How this was exported

  1. pip install transformers==5.6.2 torch==2.11 onnx==1.21 onnxscript
  2. model = AutoModelForObjectDetection.from_pretrained("PaddlePaddle/PP-DocLayoutV3_safetensors").eval()
  3. Wrap the model so forward(pixel_values) returns (logits, pred_boxes, out_masks, order_logits).
  4. torch.onnx.export(wrapped, (pixel_values,), "PP-DocLayoutV3.onnx", opset_version=18, dynamo=True, dynamic_axes={"pixel_values": {0: "batch"}})
  5. Re-save with onnx.save(..., save_as_external_data=True, location="PP-DocLayoutV3.onnx.data") to standardize the sidecar filename.

Numerical parity vs torch (random (1, 3, 800, 800) input):

output        max absolute diff
logits        1.32e-4
pred_boxes    1.57e-5
out_masks     1.62e-3
order_logits  3.96e-2

The order_logits deviation reflects accumulated floating-point drift in the decoder's attention; argmax-based reading order is unaffected on the test images we checked.
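The parity metric in the table is a max absolute elementwise difference, which for two runs of the same input reduces to a one-line helper:

```python
import numpy as np

def max_abs_diff(a: np.ndarray, b: np.ndarray) -> float:
    """Max absolute elementwise difference, computed in float64 for stability."""
    return float(np.max(np.abs(a.astype(np.float64) - b.astype(np.float64))))
```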

Inference speed

CPU (Apple M-series, single page, 800×800 input): ~480 ms/page with CPUExecutionProvider.

License

Apache-2.0 (inherited from PaddlePaddle/PP-DocLayoutV3_safetensors).
