---
license: apache-2.0
base_model: roboflow/rf-detr
tags:
- coreai
- aimodel
- object-detection
- rf-detr
- detr
- apple
- ios
- macos
pipeline_tag: object-detection
---

# RF-DETR: Core AI (`.aimodel`)

[RF-DETR](https://github.com/roboflow/rf-detr) (Roboflow's real-time detection
transformer, COCO-pretrained) converted to Apple **Core AI** for iOS 27 / macOS 27,
in answer to [apple/coreai-models#14](https://github.com/apple/coreai-models/issues/14).
**DETR family = no NMS**: post-processing is one sigmoid + top-k.

<p align="center"><img src="demo_coco_cats.jpg" width="440" alt="RF-DETR medium on Core AI"></p>

## Files

| file | input | params | M4 Max GPU | iPhone 17 Pro GPU |
|---|---|---|---|---|
| `rfdetr-nano_float32.aimodel` | 384×384 | 30.5M | **8.6 ms** (~116 FPS) | **~25 ms (33–39 FPS live)** |
| `rfdetr-small_float32.aimodel` | 512×512 | 32.1M | **12.0 ms** (~83 FPS) | n/a |
| `rfdetr-medium_float32.aimodel` | 576×576 | 33.7M | **14.8 ms** (~68 FPS) | **56–63 ms (15–17 FPS live)** |
| `rfdetr-large_float32.aimodel` | 704×704 | 33.9M | **19.1 ms** (~52 FPS) | n/a |

iPhone numbers are end-to-end live-camera measurements from the
[CoreAIKit DetectCamera example](https://github.com/john-rocky/coreai-kit)
(Release build; zero-copy capture pipeline: AVCaptureVideoPreviewLayer display,
hardware-scaled 32BGRA buffers, vImage preprocessing overlapped with GPU
inference). The measured peak of 39.6 FPS ≈ the nano model ceiling; sustained
max-load throughput drops once the chassis heats up (thermal).
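
The FPS figures in the M4 Max column are consistent with the reciprocal of the per-frame latency. The short check below is only an illustration of that arithmetic (the helper name `fps_from_latency_ms` is hypothetical); it also shows why ~25 ms per frame puts the nano ceiling at about 40 FPS on the iPhone.

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Throughput if frames are processed back to back with no overlap."""
    return 1000.0 / latency_ms


# M4 Max GPU latencies copied from the table above (milliseconds).
m4_max_latency_ms = {"nano": 8.6, "small": 12.0, "medium": 14.8, "large": 19.1}
m4_max_fps = {name: round(fps_from_latency_ms(ms)) for name, ms in m4_max_latency_ms.items()}

# ~25 ms per frame on iPhone 17 Pro bounds the nano model at about 40 FPS.
iphone_nano_ceiling_fps = fps_from_latency_ms(25.0)
```

The live iPhone range (33–39 FPS) sits below that bound because it is an end-to-end camera measurement rather than inference time alone.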

fp32 is the ship dtype: it gates **detection-set exact** vs the PyTorch fp32 reference on
CPU and GPU (per confident detection: same class, IoU ≥ 0.999 measured, score within 2e-3),
and fp16 only bought ~7% latency on M4 Max while adding near-tie ranking noise.
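
To make the gate criteria concrete, here is a minimal sketch of a detection-set comparison using the three tolerances quoted above. It is not the project's actual gate code; the function names and the `(coco_id, score, box)` tuple layout are assumptions made for illustration.

```python
def iou_cxcywh(a, b):
    """IoU of two boxes given as (cx, cy, w, h)."""
    ax0, ay0, ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx0, by0, bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def detection_sets_match(reference, candidate, min_iou=0.999, max_score_diff=2e-3):
    """True if every reference detection has a distinct counterpart in candidate.

    Both arguments are lists of (coco_id, score, box) for confident detections
    only (hypothetical layout).
    """
    if len(reference) != len(candidate):
        return False
    unmatched = list(candidate)
    for coco_id, score, box in reference:
        hit = next(
            (c for c in unmatched
             if c[0] == coco_id
             and abs(c[1] - score) <= max_score_diff
             and iou_cxcywh(box, c[2]) >= min_iou),
            None,
        )
        if hit is None:
            return False
        unmatched.remove(hit)  # each candidate may satisfy only one reference
    return True
```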

## Graph contract

```
input  "image"  [1, 3, R, R]  float32, RGB in [0, 1]  (ImageNet mean/std folded in-graph)
output "dets"   [1, 300, 4]   boxes, cxcywh normalized to [0, 1]
output "labels" [1, 300, 91]  raw class logits; column index = ORIGINAL COCO id (0 unused, 1=person … 17=cat … 90)
```
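
Because the column index is the original COCO id, the 91 columns are sparse: COCO detection defines 80 categories across ids 1 to 90 and leaves ten ids unassigned. The sketch below illustrates that layout with a deliberately partial name table. `label_for` is a hypothetical helper, and the card does not say how the model scores the unassigned columns, so ignoring them on the host is an assumption.

```python
# Ids in 1..90 that the COCO detection label set leaves unassigned.
UNUSED_COCO_IDS = {12, 26, 29, 30, 45, 66, 68, 69, 71, 83}

# The 80 ids that correspond to real categories.
VALID_COCO_IDS = [i for i in range(1, 91) if i not in UNUSED_COCO_IDS]

# Partial name table, for illustration only; a real app would carry all 80.
COCO_NAMES = {1: "person", 2: "bicycle", 3: "car", 17: "cat", 18: "dog"}


def label_for(column_index: int) -> str:
    """Map a "labels" column index straight to a display name."""
    return COCO_NAMES.get(column_index, f"coco_id_{column_index}")
```

No remapping from a dense 0..79 index is needed, which is the usual source of off-by-N label bugs in COCO pipelines.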

Python decode sketch (Swift is the same three steps):

```python
import asyncio
import numpy as np
import coreai.runtime as rt

async def detect(path, rgb01, threshold=0.5):
    # rgb01: float32 array [1, 3, R, R], RGB in [0, 1]
    model = await rt.AIModel.load(path, rt.SpecializationOptions.default())
    fn = model.load_function("main")
    out = await fn({"image": rt.NDArray(rgb01)})
    prob = 1 / (1 + np.exp(-out["labels"].numpy()[0]))    # sigmoid -> [300, 91]
    scores, classes = prob.max(-1), prob.argmax(-1)       # column index IS the COCO id
    boxes = out["dets"].numpy()[0]                        # cxcywh in [0, 1]; scale by image W/H
    keep = scores > threshold                             # done, no NMS
    return boxes[keep], classes[keep], scores[keep]

# Usage from a script: asyncio.run(detect(path, rgb01))
```
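
The sketch above keeps every query whose best class clears a threshold. The "sigmoid + top-k" post-processing mentioned in the introduction is commonly done in DETR-style decoders by ranking all (query, class) pairs together; the following is a dependency-free sketch of that variant, assuming plain nested lists in place of the runtime's arrays. `decode_topk` and `cxcywh_to_xyxy` are hypothetical names, not part of any library.

```python
import heapq
import math


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def cxcywh_to_xyxy(box, width, height):
    """Normalized (cx, cy, w, h) -> pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return ((cx - w / 2) * width, (cy - h / 2) * height,
            (cx + w / 2) * width, (cy + h / 2) * height)


def decode_topk(logits, boxes, width, height, k=100, threshold=0.5):
    """logits: [Q][91] raw class logits, boxes: [Q][4] normalized cxcywh."""
    candidates = (
        (sigmoid(value), query, coco_id)
        for query, row in enumerate(logits)
        for coco_id, value in enumerate(row)
    )
    top = heapq.nlargest(k, candidates)  # highest scores first
    return [
        {"score": s, "coco_id": c, "box": cxcywh_to_xyxy(boxes[q], width, height)}
        for s, q, c in top
        if s > threshold
    ]
```

One difference from the per-query argmax: ranking pairs jointly lets a single query contribute two detections if two of its classes both score highly. Which behavior is preferable depends on the application.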

## RF-DETR-Seg (instance segmentation)

`rfdetr-seg-{nano,small,medium,large,xlarge,2xlarge}_float32.aimodel` use the same
contract plus `masks [1, Q, R/4, R/4]`: per-query FULL-FRAME logit planes at
stride 4 (host: sigmoid > 0.5; no ROI plumbing, no NMS). All six gate on CPU
and GPU with binary-mask IoU 1.000 on stable scenes. M4 Max GPU latency ranges
from seg-nano at 312² (**10.7 ms**) to seg-2xlarge at 768² (**59.1 ms**).
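
The host-side mask step can be sketched as follows. Since the sigmoid is monotonic and maps 0 to 0.5, `sigmoid(x) > 0.5` is equivalent to `x > 0`, so no exponential is needed. The nearest-neighbor upsampling from stride 4 is an assumption for illustration (the card does not say how masks are scaled for display), and all function names are hypothetical.

```python
def binarize(logit_plane):
    """Threshold one [R/4][R/4] logit plane; x > 0 is the same as sigmoid(x) > 0.5."""
    return [[value > 0.0 for value in row] for row in logit_plane]


def upsample_nearest(mask, stride=4):
    """Expand a stride-4 mask to full resolution by repeating each cell."""
    height, width = len(mask) * stride, len(mask[0]) * stride
    return [[mask[y // stride][x // stride] for x in range(width)] for y in range(height)]


def mask_iou(a, b):
    """Binary-mask IoU, the metric quoted for the CPU/GPU gate."""
    inter = sum(p and q for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))
    union = sum(p or q for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))
    return inter / union if union else 1.0
```

Interpolating the logits bilinearly before thresholding would give smoother mask edges than repeating cells, at the cost of more host work.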

<p align="center"><img src="demo_seg_coco_cats.jpg" width="440" alt="RF-DETR-Seg nano on Core AI"></p>

## Split deployment (`split/`)

`split/rfdetr-{nano,medium}_{backbone,head}.aimodel` separate the pure-ViT
backbone (image → features) from the deformable head (features → dets/labels;
position encodings baked in). The chain is bit-exact vs the monolith. Purpose:
per-stage compute-unit preferences, e.g. backbone on the Neural Engine.
Measured honestly: on iOS 27 beta the runtime still executes the backbone on
the GPU delegate even under `.neuralEngine` preference (identical detection
fingerprint, no ANE-compile pause), so today the monolith on GPU is the
fastest config; the split exists so ANE placement can be adopted the moment
the runtime honors it. Regenerate with `export_rf_detr.py --variant <v> --split`.

## Conversion

Exported with
[`conversion/export_rf_detr.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_rf_detr.py)
from `rfdetr==1.7.1` weights. The port surfaced four Core AI converter/runtime bugs
(float-arg `arange` abort, int64-comparison buffer clobber, GPU-delegate
floor/trunc/ceil = identity, cast-pair cancellation). Each is worked around with
numerically identical results; details and minimal repros are in
[zoo/rf-detr.md](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/rf-detr.md).

License: Apache-2.0 (upstream RF-DETR code and COCO-pretrained weights are Apache-2.0).