YuNet — LiteRT (on-device face detection, fully-GPU)

YuNet (ShiqiYu/libfacedetection) is a tiny, fast face detector (face boxes plus 5 landmarks), converted to LiteRT and running entirely on the GPU through the CompiledModel API (ML Drift) on Android. 0.076 M params / 0.3 MB in fp16.

On-device (Pixel 8a, Tensor G3 — verified)


nodes on GPU	146 / 146 LITERT_CL (full residency)
inference	~4 ms (640×640)
size	0.3 MB (fp16)
accuracy	device-vs-PyTorch corr 0.9999 (all 12 outputs)

image[1,3,640,640] (BGR, 0-255) →[GPU: YuNet]→ 12 outputs: cls/obj/bbox/kps × strides {8,16,32}
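
As a sanity check on the output layout, the expected shapes can be enumerated. This is a minimal sketch that assumes the output order and channel widths given in the usage comments of this card (cls and obj with 1 channel, bbox with 4, kps with 10, each over strides 8, 16, 32). The helper name `expected_output_shapes` is illustrative and not part of any library.

```python
def expected_output_shapes(size=640, strides=(8, 16, 32)):
    """List (name, stride, shape) for the 12 outputs, in the documented order."""
    # Channel width per output group, as stated in this card's usage comments.
    widths = {"cls": 1, "obj": 1, "bbox": 4, "kps": 10}
    shapes = []
    for name, channels in widths.items():          # cls x3, obj x3, bbox x3, kps x3
        for s in strides:
            n = (size // s) ** 2                   # one prior per feature-map cell
            shapes.append((name, s, (1, n, channels)))
    return shapes


if __name__ == "__main__":
    for name, s, shape in expected_output_shapes():
        print(f"{name:>4} stride {s:>2}: {shape}")
```

Comparing these shapes against the tensors actually returned by the runtime is a cheap way to catch a reordered or mis-sized output before decoding.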

How it converts (litert-torch) — clean, no re-authoring

The network is a pure CNN built from depthwise-separable ConvDPUnit blocks, with a nearest-upsample neck (F.interpolate(mode="nearest") lowers to RESIZE_NEAREST_NEIGHBOR, so no transposed conv is involved) and non-padded MaxPool2d (so no PADV2 is emitted). The head's per-stride permute/reshape/sigmoid is baked into the graph, which yields 12 decode-ready outputs. Banned ops: none. All tensors are ≤4D. tflite-vs-torch corr 1.0, device-vs-torch corr 0.9999.
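
The "corr" figures above can be reproduced with a correlation check between two runs of the same input. The sketch below assumes "corr" means the Pearson correlation of the flattened output tensors; the card does not state the exact metric, so treat that as an assumption. The function `pearson` is a hypothetical helper written without dependencies.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences of numbers."""
    a = [float(v) for v in a]
    b = [float(v) for v in b]
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = math.fsum(a) / len(a)
    mean_b = math.fsum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    num = math.fsum(x * y for x, y in zip(da, db))
    den = math.sqrt(math.fsum(x * x for x in da) * math.fsum(y * y for y in db))
    if den == 0:
        raise ValueError("correlation is undefined for constant input")
    return num / den
```

With NumPy arrays, each pair of outputs can be compared as `pearson(ref.ravel(), test.ravel())`, once per output tensor.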

Decode (host-side) & preprocessing

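Preprocessing: letterbox to 640×640, BGR, 0-255, no normalization. Decoding uses anchor-free priors with offset 0: for the cell at (row, col) of a stride-s output, px = col·s and py = row·s. Then score = cls·obj, box center = bbox[0:2]·s + prior, box size = exp(bbox[2:4])·s, and each of the 5 landmarks = kps·s + prior. Threshold the scores, then apply NMS.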
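
The letterbox step and the mapping of decoded coordinates back to the original image can be sketched as follows. The card does not specify the padding color or placement, so black fill and centered padding are assumptions here, and all three function names are hypothetical. Pillow is imported lazily so the geometry helpers work on their own.

```python
def letterbox_params(w, h, size=640):
    """Scale and padding that fit a w x h image inside a size x size square."""
    scale = min(size / w, size / h)
    new_w = max(1, round(w * scale))
    new_h = max(1, round(h * scale))
    pad_x = (size - new_w) // 2                    # assumption: centered padding
    pad_y = (size - new_h) // 2
    return scale, new_w, new_h, pad_x, pad_y


def letterbox_image(im, size=640, fill=(0, 0, 0)):
    """Resize a PIL image with aspect ratio preserved onto a padded square canvas."""
    from PIL import Image                          # lazy import: only needed here
    scale, new_w, new_h, pad_x, pad_y = letterbox_params(im.width, im.height, size)
    canvas = Image.new("RGB", (size, size), fill)  # assumption: black padding
    canvas.paste(im.convert("RGB").resize((new_w, new_h)), (pad_x, pad_y))
    return canvas, scale, pad_x, pad_y


def unletterbox_point(x, y, scale, pad_x, pad_y):
    """Map a point from model (letterboxed) space back to original image pixels."""
    return (x - pad_x) / scale, (y - pad_y) / scale
```

For a 1280×720 photo, the image is scaled by 0.5 and padded by 140 px at the top and bottom, so a box corner or landmark decoded at (320, 320) maps back to (640, 360) in the original image. Apply `unletterbox_point` to both box corners and every landmark after NMS.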

Minimal usage

Android (Kotlin, CompiledModel GPU)

val model = CompiledModel.create(context.assets, "yunet_fp16.tflite",
    CompiledModel.Options(Accelerator.GPU), null)
val inputs = model.createInputBuffers()
val outputs = model.createOutputBuffers()
// bgr: FloatArray of 1*3*640*640 values, NCHW, BGR, 0-255 (no normalization)
inputs[0].writeFloat(bgr)
model.run(inputs, outputs)
// 12 outputs in order: cls x3 [1,N,1], obj x3 [1,N,1], bbox x3 [1,N,4], kps x3 [1,N,10]
// for strides {8,16,32}, N = (640/stride)^2 = 6400/1600/400. Decode = Python below.
val cls8 = outputs[0].readFloat()

Python (desktop verification)

import math, numpy as np
from PIL import Image
from ai_edge_litert.interpreter import Interpreter

# Plain resize keeps the example short but distorts aspect ratio; letterbox for real use.
im = Image.open("faces.jpg").convert("RGB").resize((640, 640))
bgr = np.asarray(im, np.float32)[:, :, ::-1]                      # BGR, 0-255
x = bgr.transpose(2, 0, 1)[None].copy()                           # [1,3,640,640]

it = Interpreter(model_path="yunet_fp16.tflite")
it.allocate_tensors()
it.set_tensor(it.get_input_details()[0]["index"], x)
it.invoke()
o = [it.get_tensor(d["index"])[0] for d in it.get_output_details()]
# output order: cls x3, obj x3, bbox x3, kps x3 (strides 8, 16, 32)
dets = []
for li, s in enumerate([8, 16, 32]):
    cls, obj, bb, kp = o[li][:, 0], o[3 + li][:, 0], o[6 + li], o[9 + li]
    fw = 640 // s
    for i in np.where(cls * obj > 0.6)[0]:                        # score threshold
        px, py = (i % fw) * s, (i // fw) * s
        cx, cy = bb[i, 0] * s + px, bb[i, 1] * s + py
        w, h = math.exp(bb[i, 2]) * s, math.exp(bb[i, 3]) * s
        lm = [(kp[i, 2 * j] * s + px, kp[i, 2 * j + 1] * s + py) for j in range(5)]
        dets.append(([cx - w/2, cy - h/2, cx + w/2, cy + h/2], float(cls[i] * obj[i]), lm))

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    u = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - ix*iy
    return ix * iy / u if u > 0 else 0

# greedy NMS, IoU 0.45: keep a box only if it does not overlap a higher-scoring kept box
dets.sort(key=lambda d: -d[1])
faces = []
for d in dets:
    if all(iou(d[0], f[0]) < 0.45 for f in faces):
        faces.append(d)
for box, score, lm in faces:
    print(f"face {score:.2f}", np.round(box, 1), "landmarks", np.round(lm, 1))

License

BSD-3-Clause. Upstream: ShiqiYu/libfacedetection.
