FaceDet — Production Face Detection for Video

SCRFD-family detectors + ByteTrack tracking + temporal smoothing. Optimized for tiny faces, blur, occlusion, poor lighting, compression artifacts, and temporal stability.

Architecture Survey & Design Decisions

Ranked Candidate Models (WiderFace Hard AP)

Rank	Model	Easy	Medium	Hard	GFLOPs	FPS (V100)	Year	Selected?
1	ASFD-D6	97.2	96.5	92.5	~500	~7	2022	✗ (too slow)
2	TinaFace-R50+TTA	96.1	95.5	92.4	~42K (MS)	~3	2020	✗ (TTA-dependent)
3	TinaFace-R50 (single)	95.9	95.2	92.1	508	~15	2020	✗ (not efficient)
4	RetinaFace-R152+MS	96.9	96.1	91.8	High	13	2019	✗ (heavy backbone)
5	MOS-L (R152)	96.9	96.1	92.1	Multi-scale	~16	2021	✗ (heavy)
6	DSFD	96.6	95.7	90.4	~1532	—	2018	✗ (outdated)
7	SCRFD-34GF	96.1	95.0	85.2	34	~80	2021	✓ Flagship
8	SCRFD-10GF	95.2	93.9	83.1	10	~140	2021	✓ Balanced
9	SCRFD-2.5GF	93.8	92.2	77.9	2.5	~400	2021	✓ Real-time
10	SCRFD-0.5GF	90.6	88.1	68.5	0.5	~1000	2021	✓ Mobile
11	RetinaFace-MN0.25	91.4	90.1	78.2	~1	60 CPU	2019	✗ (SCRFD-2.5G better)
12	YuNet	85.6	84.2	72.7	2.5	77 CPU	2021	✗ (lower AP)

Why SCRFD?

The SCRFD family achieves the best accuracy-efficiency Pareto frontier for face detection. The key findings:

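3.86% better Hard AP than TinaFace at more than 3× the speed when both run on VGA-resolution input (SCRFD-34G vs TinaFace-R50, as reported in the SCRFD paper); the TinaFace rows in the ranking table use TinaFace's own larger test scales, which is why it ranks higher there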
No ImageNet pretraining needed — trains from scratch in 640 epochs
Scalable family — same architecture principles from 0.5 to 34 GFLOPs
Two orthogonal innovations: Sample Redistribution (augmentation) + Computation Redistribution (NAS)

Higher-ranked models (ASFD-D6, TinaFace+TTA) reach higher Hard AP, but at 10-100× the compute cost, which makes them impractical for video.

Key Technical Insights From Literature

Finding	Source	Impact
Large-scale crops [0.3–2.0] increase stride-8 positives from 72K→118K	SCRFD §3.2	+5-8% Hard AP
GFL jointly trains quality + classification → better score calibration	SCRFD §3.1	+1-2% Hard AP
Lower IoU matching threshold (0.35 vs 0.5) helps tiny faces	TinaFace §3.3	+2% Hard AP
GroupNorm > BatchNorm at small batch sizes	TinaFace §3.2	Stable training
5-point landmark auxiliary loss improves detection by ~1%	RetinaFace §4.2	+1% Hard AP
WiderFace Hard is near-saturated (~92% AP ceiling)	Survey '21	Focus on efficiency
No published improvements to WiderFace Hard since 2022	Literature scan	Benchmark mature

Model Zoo

Model	WiderFace (E/M/H)	GFLOPs	Params	FPS (V100 VGA)	Use Case
`scrfd_34g`	96.1/95.0/85.2	34	9.80M	~80	Flagship quality
`scrfd_10g`	95.2/93.9/83.1	10	3.86M	~140	Balanced
`scrfd_2.5g`	93.8/92.2/77.9	2.5	0.67M	~400	Real-time video
`scrfd_0.5g`	90.6/88.1/68.5	0.5	0.57M	~1000	Mobile/edge

Architecture

Input Image (640×640)
    │
    ▼
┌───────────────────────────────────────────────┐
│  BACKBONE (NAS-searched ResNet-style)         │
│  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐  │
│  │Stem │→ │ S1  │→ │ S2  │→ │ S3  │→ │ S4  │  │
│  │s=4  │  │s=4  │  │s=8  │  │s=16 │  │s=32 │  │
│  └─────┘  └─────┘  └──┬──┘  └──┬──┘  └──┬──┘  │
│                       │ C3     │ C4     │ C5  │
└───────────────────────┼────────┼────────┼─────┘
                        │        │        │
    ┌───────────────────▼────────▼────────▼─────┐
    │  PAFPN (Path Aggregation FPN)             │
    │  Top-down (FPN) + Bottom-up (PAN)         │
    │  ┌────┐    ┌────┐    ┌────┐               │
    │  │ P3 │ ← │ P4 │ ← │ P5 │  (top-down)   │
    │  │ P3 │ → │ P4 │ → │ P5 │  (bottom-up)   │
    │  │s=8 │    │s=16│    │s=32│               │
    │  └──┬─┘    └──┬─┘    └──┬─┘               │
    └─────┼─────────┼─────────┼─────────────────┘
          │         │         │
    ┌─────▼─────────▼─────────▼─────────────────┐
    │  SHARED HEAD (per level, weight-shared)    │
    │  ┌──────────┐  ┌──────────┐               │
    │  │ CLS (GFL)│  │ REG(DIoU)│ [LMK (opt)]  │
    │  │ A×1      │  │ A×4      │ [A×10]        │
    │  └──────────┘  └──────────┘               │
    └───────────────────────────────────────────┘
          │                   │
          ▼                   ▼
    ┌─────────────┐    ┌──────────────┐
    │ ATSS Match  │    │ NMS (θ=0.4)  │
    │ (training)  │    │ (inference)  │
    └─────────────┘    └──────────────┘
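
The inference-side NMS step in the diagram can be illustrated with a minimal greedy sketch in plain Python. The threshold of 0.4 comes from the diagram; the function names and the box format `(x1, y1, x2, y2)` are assumptions made for this example and are not taken from the repository code.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.4):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    overlaps every already-kept box by at most iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections of one face plus a separate face.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of the surviving boxes
```

A production implementation would normally use a vectorized or GPU NMS; the quadratic loop here is only meant to show the rule.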

Anchors (per level):

Stride 8: [16, 32] — small faces (≥16px)
Stride 16: [64, 128] — medium faces
Stride 32: [256, 512] — large faces
Aspect ratio: 1.0 (square — faces are roughly square)
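
As a rough illustration of the anchor layout listed above, the sketch below enumerates square anchors for a 640×640 input. The stride-to-size mapping is taken from the list; placing each anchor at the center of its feature-map cell is an assumption (implementations differ on the half-cell offset), and `generate_anchors` is a hypothetical helper, not the function in models/anchor.py.

```python
# stride -> square anchor side lengths in pixels (from the list above)
ANCHOR_SIZES = {8: (16, 32), 16: (64, 128), 32: (256, 512)}


def generate_anchors(input_size=640, anchor_sizes=ANCHOR_SIZES):
    """Return anchors as (cx, cy, w, h) tuples, level by level."""
    anchors = []
    for stride, sizes in anchor_sizes.items():
        cells = input_size // stride  # feature-map side length at this stride
        for row in range(cells):
            for col in range(cells):
                # Assumed convention: anchor sits at the cell center.
                cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
                for size in sizes:
                    anchors.append((cx, cy, size, size))  # aspect ratio 1.0
    return anchors


anchors = generate_anchors()
per_level = {s: (640 // s) ** 2 * len(sz) for s, sz in ANCHOR_SIZES.items()}
```

With two anchors per cell, the stride-8 level contributes most of the anchors, which is why small-face training signal is dominated by that level.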

Video Pipeline

Frame → Detector (SCRFD) → ByteTrack Tracker → Temporal Smoother → Output
         ↓                    ↓                    ↓
   Per-frame boxes      Track IDs (stable)   Jitter-free boxes
   + scores             + Kalman prediction   + Score momentum
   + landmarks          + 2-stage matching    + Adaptive EMA

ByteTrack (Zhang et al., 2022): Uses ALL detections — high + low confidence — for two-stage association. Low-confidence detections handle partially occluded faces that would be lost by traditional trackers.
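
The two-stage association can be sketched as follows. This is a simplified illustration, not the code in engine/tracker.py: it uses greedy IoU matching in place of the Hungarian assignment used by ByteTrack, it takes already-predicted track boxes as input instead of running a Kalman filter, and the thresholds (`high=0.5`, `low=0.1`, `iou_thresh=0.3`) are assumed values chosen for the example.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(track_boxes, det_boxes, iou_thresh):
    """Pair tracks and detections by descending IoU (stand-in for Hungarian matching)."""
    ranked = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(track_boxes)
         for di, d in enumerate(det_boxes)),
        reverse=True,
    )
    pairs, used_t, used_d = [], set(), set()
    for overlap, ti, di in ranked:
        if overlap < iou_thresh:
            break
        if ti not in used_t and di not in used_d:
            pairs.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    free_t = [ti for ti in range(len(track_boxes)) if ti not in used_t]
    free_d = [di for di in range(len(det_boxes)) if di not in used_d]
    return pairs, free_t, free_d


def associate(track_boxes, dets, high=0.5, low=0.1, iou_thresh=0.3):
    """dets is a list of (box, score). Returns (matches, new_tracks, lost):
    matches maps track index -> detection index."""
    high_idx = [i for i, (_, s) in enumerate(dets) if s >= high]
    low_idx = [i for i, (_, s) in enumerate(dets) if low <= s < high]

    # Stage 1: high-confidence detections against all tracks.
    pairs, left_t, left_high = greedy_match(
        track_boxes, [dets[i][0] for i in high_idx], iou_thresh)
    matches = {ti: high_idx[di] for ti, di in pairs}

    # Stage 2: low-confidence detections against the tracks still unmatched.
    pairs2, still_free, _ = greedy_match(
        [track_boxes[t] for t in left_t], [dets[i][0] for i in low_idx], iou_thresh)
    matches.update({left_t[ti]: low_idx[di] for ti, di in pairs2})

    new_tracks = [high_idx[di] for di in left_high]  # only confident boxes start tracks
    lost = [left_t[ti] for ti in still_free]         # tracks with no detection this frame
    return matches, new_tracks, lost


tracks = [(10, 10, 50, 50), (100, 100, 140, 140)]
dets = [((12, 12, 52, 52), 0.9),        # clear face, matches track 0 in stage 1
        ((102, 102, 142, 142), 0.2),    # occluded face, recovered in stage 2
        ((300, 300, 340, 340), 0.8),    # new face
        ((400, 400, 420, 420), 0.15)]   # unmatched low score, discarded
matches, new_tracks, lost = associate(tracks, dets)
```

Note that unmatched low-confidence detections never create tracks; they are only allowed to extend tracks that already exist.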

Temporal Smoother: Adaptive EMA whose update weight α (the share given to the newest detection) grows with motion magnitude:

Static faces → heavy smoothing (α≈0.3) → no jitter
Fast-moving faces → light smoothing (α≈0.9) → no lag
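
A minimal sketch of such an adaptive EMA is shown below. The α range (0.3 to 0.9) comes from the two cases above; measuring motion as center displacement relative to box size, the linear ramp, and the `motion_ref` parameter are assumptions for illustration and may differ from engine/temporal.py.

```python
import math


def adaptive_ema(prev_box, new_box, alpha_min=0.3, alpha_max=0.9, motion_ref=0.5):
    """Blend new_box into prev_box, both (x1, y1, x2, y2).

    alpha is the weight of the new observation: low alpha = heavy smoothing.
    motion_ref (hypothetical) is the relative displacement at which alpha saturates.
    """
    pcx, pcy = (prev_box[0] + prev_box[2]) / 2, (prev_box[1] + prev_box[3]) / 2
    ncx, ncy = (new_box[0] + new_box[2]) / 2, (new_box[1] + new_box[3]) / 2
    size = max(prev_box[2] - prev_box[0], prev_box[3] - prev_box[1], 1e-6)
    motion = math.hypot(ncx - pcx, ncy - pcy) / size  # displacement in box widths
    alpha = alpha_min + (alpha_max - alpha_min) * min(1.0, motion / motion_ref)
    smoothed = tuple(alpha * n + (1 - alpha) * p for p, n in zip(prev_box, new_box))
    return smoothed, alpha


prev = (100.0, 100.0, 200.0, 200.0)
_, alpha_static = adaptive_ema(prev, prev)                           # no motion
box_fast, alpha_fast = adaptive_ema(prev, (200.0, 100.0, 300.0, 200.0))  # one box width
```

Normalizing by box size keeps the behavior comparable for near and far faces: a 2 px shift is jitter on a 200 px face but real motion on a 10 px one.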

Quick Start

Installation

pip install -r requirements.txt

Detect faces in a video

from facedet import VideoFaceDetector

detector = VideoFaceDetector(
    model_path='checkpoints/scrfd_34g_best.pth',
    model_name='scrfd_34g',
    device='cuda',
    use_tracking=True,
    use_smoothing=True,
)

# Process video file
stats = detector.process_video(
    source='input.mp4',
    output_path='output.mp4',
    show=True,
)
# → e.g. {'total_frames': 1200, 'avg_fps': 75.3, 'avg_faces_per_frame': 4.2}

Detect faces in a single image

from facedet import build_detector
import cv2, torch

model = build_detector('scrfd_34g').cuda().eval()
# Load checkpoint...

img = cv2.imread('photo.jpg')
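# Preprocess img into a batched input `tensor`... (see scripts/evaluate.py for full example)
with torch.no_grad():  # inference only, skip autograd bookkeeping
    results = model(tensor)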
# → [{'boxes': tensor([...]), 'scores': tensor([...])}]

Real-time webcam

python scripts/detect_video.py \
    --model scrfd_2.5g \
    --checkpoint checkpoints/scrfd_2.5g_best.pth \
    --input 0 --show

Training

Dataset Setup

Download WIDER FACE and arrange:

data/wider_face/
├── WIDER_train/images/
├── WIDER_val/images/
├── wider_face_split/
│   ├── wider_face_train_bbx_gt.txt
│   └── wider_face_val_bbx_gt.txt
└── retinaface_gt/  (optional, for landmark training)
    ├── train/label.txt
    └── val/label.txt

Training Commands

# Single GPU — SCRFD-34G (flagship)
python scripts/train.py \
    --model scrfd_34g \
    --data-root data/wider_face \
    --epochs 640 \
    --batch-size 8 \
    --lr 0.01

# Multi-GPU — 4× V100
torchrun --nproc_per_node=4 scripts/train.py \
    --model scrfd_34g \
    --data-root data/wider_face \
    --epochs 640 \
    --batch-size 8 \
    --lr 0.01

# Real-time variant
python scripts/train.py \
    --model scrfd_2.5g \
    --data-root data/wider_face \
    --epochs 640 \
    --batch-size 16 \
    --lr 0.02

Training Recipe (from SCRFD paper)

Parameter	Value	Rationale
Optimizer	SGD (m=0.9, wd=5e-4)	Standard for detection
Base LR	0.01 (8 imgs/GPU)	Linear scaling rule
LR Schedule	MultiStep [440, 544] ×0.1	Long training, late decay
Warmup	3 epochs, linear from 1e-5	Prevent early divergence
Total Epochs	640	Train from scratch
Input Size	640×640	Random crop from larger
Crop Scales	[0.3, 0.45, ..., 2.0]	Sample Redistribution
Augmentation	Crop + flip + photometric + robustness	See data/augmentations.py
Normalization	GroupNorm	Batch-size independent
Matching	ATSS (k=9)	Adaptive thresholds
Cls Loss	GFL (β=2)	Joint quality score
Reg Loss	DIoU	Better for tiny faces
Mixed Precision	✓	2× training speed
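
The warmup and step-decay rows of the recipe can be written as one small function. This is a sketch of the schedule as described in the table; treating warmup as a function of (possibly fractional) epoch rather than iteration count is an assumption, and `learning_rate` is a hypothetical helper, not part of scripts/train.py.

```python
def learning_rate(epoch, base_lr=0.01, warmup_epochs=3, warmup_start=1e-5,
                  milestones=(440, 544), gamma=0.1):
    """LR at a given epoch: linear warmup, then multiply by gamma at each milestone."""
    if epoch < warmup_epochs:
        # Linear ramp from warmup_start to base_lr over the warmup period.
        return warmup_start + (base_lr - warmup_start) * epoch / warmup_epochs
    decays = sum(epoch >= m for m in milestones)  # number of milestones passed
    return base_lr * gamma ** decays


schedule = {e: learning_rate(e) for e in (0, 1.5, 3, 439, 440, 544, 639)}
```

In PyTorch the same shape is usually obtained by combining a warmup scheduler with `torch.optim.lr_scheduler.MultiStepLR`.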

Evaluation

python scripts/evaluate.py \
    --model scrfd_34g \
    --checkpoint checkpoints/scrfd_34g_best.pth \
    --data-root data/wider_face \
    --output-dir results/scrfd_34g \
    --benchmark

Generates:

WiderFace Easy/Medium/Hard AP scores
Predictions in WiderFace submission format
Speed benchmark table (320/480/640/960px)

Deployment

ONNX Export

python scripts/export.py \
    --model scrfd_34g \
    --checkpoint checkpoints/scrfd_34g_best.pth \
    --output deploy/scrfd_34g.onnx \
    --input-size 640

TensorRT (FP16)

# Note: newer TensorRT releases deprecate --workspace in favor of --memPoolSize=workspace:4096
trtexec --onnx=deploy/scrfd_34g.onnx \
        --saveEngine=deploy/scrfd_34g_fp16.engine \
        --fp16 --workspace=4096

Expected Deployment Speedups

Model	PyTorch FP32	ONNX RT	TensorRT FP16	TensorRT INT8
SCRFD-34G	~80 FPS	~100 FPS	~200 FPS	~350 FPS
SCRFD-2.5G	~400 FPS	~500 FPS	~800 FPS	~1200 FPS
SCRFD-0.5G	~1000 FPS	~1200 FPS	~2000 FPS	~3000 FPS

PyTorch Quantization (CPU)

from facedet.deploy import quantize_model
quantized = quantize_model(model, method='dynamic')

Ablation Studies

Configured in configs/ablations.yaml. Each ablation isolates one variable:

Ablation	Variables	Expected Finding
Sample Redistribution	Crop scales [0.3–1.0] vs [0.3–2.0]	+5-8% Hard AP from large crops
Loss Functions	GFL+DIoU vs Focal+SmoothL1	GFL: +1-2% from quality scores
Matching Strategy	ATSS(k=9) vs IoU(0.35) vs IoU(0.5)	ATSS: best for mixed scales
Robustness Augmentation	None / blur / JPEG / all	All: +1-3% on degraded inputs
Normalization	GroupNorm vs BatchNorm	GN: stable at batch<8
Input Resolution	320 / 480 / 640 / 960	960: +5-10% Hard AP, 4× slower
Landmarks	With/without 5-point landmarks	+~1% Hard AP (RetinaFace finding)
Tracker Config	None / conservative / aggressive	Aggressive: more tracks, more FP

Handling Challenging Conditions

Tiny Faces (<16px)

Sample Redistribution (crop scale up to 2.0×) generates small face training samples
Stride-8 feature maps with anchors [16, 32]px
Higher inference resolution (960px) trades speed for +5-10% small face recall
ATSS matching gives tiny faces lower IoU thresholds automatically
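
To make the last point concrete, the sketch below follows the ATSS rule for a single ground-truth box: take the k anchors nearest to the face center on each level, then use mean + standard deviation of their IoUs as the positive threshold. It is a simplified illustration with one anchor size per level and the population standard deviation (implementations may use the sample estimate); `grid_anchors` and `atss_assign` are hypothetical names, not the functions in models/anchor.py.

```python
import math
import statistics


def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center(box):
    return (box[0] + box[2]) / 2, (box[1] + box[3]) / 2


def grid_anchors(stride, size, extent=640):
    """Square anchors of side `size` centered on every cell of a stride grid."""
    half = size / 2
    centers = [(i + 0.5) * stride for i in range(extent // stride)]
    return [(cx - half, cy - half, cx + half, cy + half) for cy in centers for cx in centers]


def atss_assign(gt, anchors_per_level, k=9):
    """Return (positive_anchors, threshold) for one ground-truth box."""
    gcx, gcy = center(gt)
    candidates = []
    for anchors in anchors_per_level:
        by_distance = sorted(
            anchors, key=lambda a: math.hypot(center(a)[0] - gcx, center(a)[1] - gcy))
        candidates.extend(by_distance[:k])  # k nearest anchors on this level
    overlaps = [iou(a, gt) for a in candidates]
    threshold = statistics.mean(overlaps) + statistics.pstdev(overlaps)
    positives = [
        a for a, v in zip(candidates, overlaps)
        if v >= threshold
        and gt[0] < center(a)[0] < gt[2] and gt[1] < center(a)[1] < gt[3]
    ]
    return positives, threshold


levels = [grid_anchors(8, 16), grid_anchors(16, 64), grid_anchors(32, 256)]
tiny_pos, tiny_thr = atss_assign((100, 100, 112, 112), levels)     # 12 px face
medium_pos, medium_thr = atss_assign((232, 232, 296, 296), levels)  # 64 px face
```

The 12 px face ends up with a lower threshold than the 64 px face because none of its candidate anchors overlap it strongly, so it still receives at least one positive anchor instead of being dropped by a fixed IoU cutoff.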

Blur / Motion Blur

Training augmentation: Gaussian blur σ∈[0.5, 3.0] applied with p=0.2
Model learns blur-invariant features
ByteTrack Kalman filter predicts through blurred frames

Occlusion

Random erasing (Cutout) during training simulates partial occlusion
ATSS assigns multiple anchors per GT → partial detection still gets signal
ByteTrack 2nd-stage matching recovers occluded faces with low-confidence detections

Poor Lighting

Gamma darkening augmentation (γ∈[1.5, 3.0]) simulates low-light
Photometric distortion (brightness, contrast jitter)
For extreme cases: pair with CLAHE preprocessing
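
The gamma darkening step can be sketched with a 256-entry lookup table, which is how the mapping is usually applied to 8-bit images. The γ range is the one quoted above; the function names and the flat pixel list are assumptions made to keep the example dependency-free, and the actual implementation in data/augmentations.py may differ.

```python
import random


def gamma_lut(gamma):
    """Lookup table mapping an 8-bit intensity v to round(255 * (v / 255) ** gamma)."""
    return [round(255 * (v / 255) ** gamma) for v in range(256)]


def darken(pixels, rng=random, gamma_range=(1.5, 3.0)):
    """Apply a randomly drawn gamma > 1 to a flat sequence of 8-bit values."""
    gamma = rng.uniform(*gamma_range)
    lut = gamma_lut(gamma)
    return [lut[v] for v in pixels], gamma


lut2 = gamma_lut(2.0)                       # fixed gamma for inspection
dark, used_gamma = darken([0, 64, 128, 255], rng=random.Random(0))
```

With NumPy or OpenCV the same table can be applied to a whole image at once (for example by indexing the table with the image array). Black and white stay fixed while mid-tones are pushed down, which is what makes γ > 1 resemble under-exposure.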

Compression Artifacts

JPEG quality degradation (Q=20-80) during training
We are not aware of a published face detector that trains against compression artifacts explicitly; JPEG degradation is included here as a robustness augmentation

Temporal Stability

ByteTrack: stable track IDs across frames, handles occlusion
Kalman filter: smooth trajectory prediction
Temporal EMA: adaptive smoothing eliminates box jitter
Keyframe strategy: full detection every N frames, tracker-only in between
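
The keyframe strategy in the last bullet amounts to a simple control loop. The sketch below uses toy stand-ins for the detector and tracker so that it runs without a model; `run_with_keyframes` and the three callables are hypothetical and do not mirror the interface of engine/video_detector.py.

```python
def run_with_keyframes(frames, detect, tracker_update, tracker_predict, interval=5):
    """Run the detector on every `interval`-th frame and rely on the tracker in between."""
    outputs = []
    for index, frame in enumerate(frames):
        if index % interval == 0:
            boxes = tracker_update(detect(frame))  # keyframe: detections correct the tracks
        else:
            boxes = tracker_predict()              # other frames: motion prediction only
        outputs.append(boxes)
    return outputs


# Toy stand-ins: one face drifting 2 px to the right per frame.
detector_calls = []


def fake_detect(frame_index):
    detector_calls.append(frame_index)
    x = 100 + 2 * frame_index
    return [(x, 100, x + 40, 140)]


state = {"boxes": []}


def fake_update(detections):
    state["boxes"] = detections
    return state["boxes"]


def fake_predict():
    # Constant-velocity guess standing in for a Kalman prediction.
    state["boxes"] = [(x1 + 2, y1, x2 + 2, y2) for x1, y1, x2, y2 in state["boxes"]]
    return state["boxes"]


outputs = run_with_keyframes(range(12), fake_detect, fake_update, fake_predict, interval=5)
```

The trade-off is that faces entering the scene between keyframes are not picked up until the next detector run, so the interval bounds the worst-case detection delay.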

Repository Structure

facedet/
├── README.md                 # This file
├── setup.py                  # Package installation
├── requirements.txt          # Dependencies
│
├── models/                   # Model architectures
│   ├── backbone.py           # NAS-searched ResNet backbones
│   ├── neck.py               # PAFPN feature pyramid
│   ├── head.py               # Shared detection head (cls/reg/lmk)
│   ├── anchor.py             # Anchor generation + ATSS matching
│   ├── losses.py             # GFL, DIoU, Focal, Landmark losses
│   └── detector.py           # Full SCRFD detector (train + inference)
│
├── data/                     # Data pipeline
│   ├── widerface.py          # WiderFace dataset loader
│   ├── augmentations.py      # Training/val/robustness augmentations
│   └── dataloader.py         # DataLoader builders
│
├── engine/                   # Video inference engine
│   ├── video_detector.py     # End-to-end video processing
│   ├── tracker.py            # ByteTrack face tracker
│   └── temporal.py           # Temporal EMA smoother
│
├── evaluation/               # Evaluation suite
│   ├── widerface_eval.py     # WiderFace protocol (Easy/Med/Hard AP)
│   ├── speed_benchmark.py    # Latency/throughput benchmarks
│   └── metrics.py            # Core metrics (AP, IoU, recall)
│
├── deploy/                   # Deployment
│   ├── export_onnx.py        # ONNX export + verification
│   └── optimize.py           # Quantization, TensorRT guide
│
├── configs/                  # Configuration files
│   ├── scrfd_34g.yaml        # Flagship (quality)
│   ├── scrfd_10g.yaml        # Balanced
│   ├── scrfd_2.5g.yaml       # Real-time
│   ├── scrfd_0.5g.yaml       # Mobile
│   └── ablations.yaml        # Ablation study configs
│
├── scripts/                  # Entry points
│   ├── train.py              # Training (single/multi-GPU)
│   ├── evaluate.py           # WiderFace evaluation + speed bench
│   ├── detect_video.py       # Video inference CLI
│   └── export.py             # ONNX export CLI
│
└── utils/                    # Helpers
    ├── visualization.py      # Drawing utilities
    └── io.py                 # Checkpoint I/O

References

SCRFD: Guo et al., "Sample and Computation Redistribution for Efficient Face Detection", arXiv:2105.04714 (2021)
RetinaFace: Deng et al., "RetinaFace: Single-shot Multi-level Face Localisation in the Wild", arXiv:1905.00641 (2019)
TinaFace: Zhu et al., "TinaFace: Strong but Simple Baseline for Face Detection", arXiv:2011.13183 (2020)
ByteTrack: Zhang et al., "ByteTrack: Multi-Object Tracking by Associating Every Detection Box", ECCV 2022
ATSS: Zhang et al., "Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection", CVPR 2020
GFL: Li et al., "Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection", NeurIPS 2020
DIoU: Zheng et al., "Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression", AAAI 2020
ASFD: Zhang et al., "ASFD: Automatic and Scalable Face Detector", arXiv:2201.10781 (2022)
DSFD: Li et al., "DSFD: Dual Shot Face Detector", CVPR 2019
WiderFace: Yang et al., "WIDER FACE: A Face Detection Benchmark", CVPR 2016

License

Apache 2.0
