FaceDet: Production Face Detection for Video

SCRFD-family detectors + ByteTrack tracking + temporal smoothing. Optimized for tiny faces, blur, occlusion, poor lighting, compression, and temporal stability.

Architecture Survey & Design Decisions

Ranked Candidate Models (WiderFace Hard AP)

| Rank | Model | Easy | Medium | Hard | GFLOPs | FPS (V100) | Year | Selected? |
|---|---|---|---|---|---|---|---|---|
| 1 | ASFD-D6 | 97.2 | 96.5 | 92.5 | ~500 | ~7 | 2022 | ✗ (too slow) |
| 2 | TinaFace-R50+TTA | 96.1 | 95.5 | 92.4 | ~42K (MS) | ~3 | 2020 | ✗ (TTA-dependent) |
| 3 | TinaFace-R50 (single) | 95.9 | 95.2 | 92.1 | 508 | ~15 | 2020 | ✗ (not efficient) |
| 4 | RetinaFace-R152+MS | 96.9 | 96.1 | 91.8 | High | 13 | 2019 | ✗ (heavy backbone) |
| 5 | MOS-L (R152) | 96.9 | 96.1 | 92.1 | Multi-scale | ~16 | 2021 | ✗ (heavy) |
| 6 | DSFD | 96.6 | 95.7 | 90.4 | ~1532 | n/a | 2018 | ✗ (outdated) |
| 7 | SCRFD-34GF | 96.1 | 95.0 | 85.2 | 34 | ~80 | 2021 | ✓ Flagship |
| 8 | SCRFD-10GF | 95.2 | 93.9 | 83.1 | 10 | ~140 | 2021 | ✓ Balanced |
| 9 | SCRFD-2.5GF | 93.8 | 92.2 | 77.9 | 2.5 | ~400 | 2021 | ✓ Real-time |
| 10 | SCRFD-0.5GF | 90.6 | 88.1 | 68.5 | 0.5 | ~1000 | 2021 | ✓ Mobile |
| 11 | RetinaFace-MN0.25 | 91.4 | 90.1 | 78.2 | ~1 | 60 CPU | 2019 | ✗ (SCRFD-2.5G better) |
| 12 | YuNet | 85.6 | 84.2 | 72.7 | 2.5 | 77 CPU | 2021 | ✗ (lower AP) |

Why SCRFD?

The SCRFD family offers the best accuracy-efficiency trade-off among the surveyed face detectors. Key findings:

  1. 3.86% higher Hard AP than TinaFace at more than 3× the speed (SCRFD-34G vs TinaFace-R50), as reported in the SCRFD paper's VGA-resolution comparison; the TinaFace rows in the table above come from a different test protocol, so the two sets of numbers are not directly comparable
  2. No ImageNet pretraining needed: trains from scratch in 640 epochs
  3. Scalable family: same architecture principles from 0.5 to 34 GFLOPs
  4. Two orthogonal innovations: Sample Redistribution (augmentation) + Computation Redistribution (NAS)

Higher-ranked models (ASFD-D6, TinaFace+TTA) report higher Hard AP, but at 10-100× the compute cost, which makes them impractical for video.

Key Technical Insights From Literature

| Finding | Source | Impact |
|---|---|---|
| Large-scale crops [0.3–2.0] increase stride-8 positives from 72K→118K | SCRFD §3.2 | +5-8% Hard AP |
| GFL jointly trains quality + classification → better score calibration | SCRFD §3.1 | +1-2% Hard AP |
| Lower IoU matching threshold (0.35 vs 0.5) helps tiny faces | TinaFace §3.3 | +2% Hard AP |
| GroupNorm > BatchNorm at small batch sizes | TinaFace §3.2 | Stable training |
| 5-point landmark auxiliary loss improves detection by ~1% | RetinaFace §4.2 | +1% Hard AP |
| WiderFace Hard is near-saturated (~92% AP ceiling) | Survey '21 | Focus on efficiency |
| No published improvements to WiderFace Hard since 2022 | Literature scan | Benchmark mature |

Model Zoo

| Model | WiderFace (E/M/H) | GFLOPs | Params | FPS (V100 VGA) | Use Case |
|---|---|---|---|---|---|
| scrfd_34g | 96.1/95.0/85.2 | 34 | 9.80M | ~80 | Flagship quality |
| scrfd_10g | 95.2/93.9/83.1 | 10 | 3.86M | ~140 | Balanced |
| scrfd_2.5g | 93.8/92.2/77.9 | 2.5 | 0.67M | ~400 | Real-time video |
| scrfd_0.5g | 90.6/88.1/68.5 | 0.5 | 0.57M | ~1000 | Mobile/edge |

Architecture

Input Image (640×640)
    │
    ▼
┌────────────────────────────────────────────────┐
│  BACKBONE (NAS-searched ResNet-style)          │
│  ┌─────┐  ┌─────┐  ┌──────┐  ┌─────┐  ┌─────┐  │
│  │Stem │→ │ S1  │→ │  S2  │→ │ S3  │→ │ S4  │  │
│  │s=4  │  │s=4  │  │ s=8  │  │s=16 │  │s=32 │  │
│  └─────┘  └─────┘  └──┬───┘  └──┬──┘  └──┬──┘  │
│                       │ C3      │ C4     │ C5  │
└───────────────────────┼─────────┼────────┼─────┘
                        │         │        │
    ┌───────────────────▼─────────▼────────▼─────┐
    │  PAFPN (Path Aggregation FPN)              │
    │  Top-down (FPN) + Bottom-up (PAN)          │
    │  ┌────┐    ┌────┐    ┌────┐                │
    │  │ P3 │ ←  │ P4 │ ←  │ P5 │  (top-down)    │
    │  │ P3 │ →  │ P4 │ →  │ P5 │  (bottom-up)   │
    │  │s=8 │    │s=16│    │s=32│                │
    │  └──┬─┘    └──┬─┘    └──┬─┘                │
    └─────┼─────────┼─────────┼──────────────────┘
          │         │         │
    ┌─────▼─────────▼─────────▼──────────────────┐
    │  SHARED HEAD (per level, weight-shared)    │
    │  ┌──────────┐  ┌──────────┐                │
    │  │ CLS (GFL)│  │ REG(DIoU)│ [LMK (opt)]    │
    │  │ A×1      │  │ A×4      │ [A×10]         │
    │  └──────────┘  └──────────┘                │
    └─────┬───────────────────┬──────────────────┘
          │                   │
          ▼                   ▼
    ┌─────────────┐    ┌──────────────┐
    │ ATSS Match  │    │ NMS (θ=0.4)  │
    │ (training)  │    │ (inference)  │
    └─────────────┘    └──────────────┘
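
The inference-time NMS step in the diagram can be illustrated with greedy IoU suppression. This is a minimal pure-Python sketch, not the repository's implementation: the function names `iou` and `nms` are hypothetical, and only the 0.4 threshold is taken from the diagram.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.4):
    """Return indices of the boxes that survive, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with IoU above 0.4 and is suppressed, while the distant third box is kept. A production pipeline would typically use a vectorized or GPU implementation instead.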

Anchors (per level):

  • Stride 8: [16, 32], small faces (≥16px)
  • Stride 16: [64, 128], medium faces
  • Stride 32: [256, 512], large faces
  • Aspect ratio: 1.0 (square, since faces are roughly square)
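
The anchor layout listed above can be sketched as follows for a 640×640 input. The strides and sizes come from the list; the function name `generate_anchors` and the choice to center each anchor on its feature-map cell are assumptions made for illustration, and the repository's `anchor.py` may use a different convention.

```python
ANCHOR_SIZES = {8: (16, 32), 16: (64, 128), 32: (256, 512)}  # stride -> sizes in px


def generate_anchors(input_size=640, anchor_sizes=ANCHOR_SIZES):
    """Return {stride: [(cx, cy, w, h), ...]} for a square input image."""
    anchors = {}
    for stride, sizes in anchor_sizes.items():
        cells = input_size // stride  # feature-map side length at this level
        level = []
        for row in range(cells):
            for col in range(cells):
                # Assumption: anchors are centered on the feature-map cell.
                cx = (col + 0.5) * stride
                cy = (row + 0.5) * stride
                for size in sizes:
                    level.append((cx, cy, float(size), float(size)))  # aspect ratio 1.0
        anchors[stride] = level
    return anchors


anchors = generate_anchors()
counts = {stride: len(level) for stride, level in anchors.items()}
```

With two square anchors per location, the stride-8 level contributes 80×80×2 anchors, which is why most anchors (and most of the capacity for small faces) sit on the highest-resolution level.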

Video Pipeline

Frame → Detector (SCRFD) → ByteTrack Tracker → Temporal Smoother → Output
         ↓                    ↓                    ↓
   Per-frame boxes      Track IDs (stable)   Jitter-free boxes
   + scores             + Kalman prediction   + Score momentum
   + landmarks          + 2-stage matching    + Adaptive EMA

ByteTrack (Zhang et al., 2022): Uses all detections, both high- and low-confidence, for two-stage association. Low-confidence detections recover partially occluded faces that a tracker with a single score threshold would drop.
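
The two-stage association can be sketched as below. This is a simplified illustration, not the code in `engine/tracker.py`: it uses greedy IoU matching in place of the Hungarian assignment, treats the track boxes as already predicted for the current frame, and the thresholds (`high`, `low`, `min_iou`) and all function names are assumed values chosen for the example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(track_boxes, det_boxes, min_iou):
    """Pair tracks with detections in descending IoU order (stand-in for Hungarian)."""
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(track_boxes)
         for di, d in enumerate(det_boxes)),
        reverse=True,
    )
    matches, used_tracks, used_dets = [], set(), set()
    for overlap, ti, di in pairs:
        if overlap < min_iou:
            break  # all remaining pairs overlap even less
        if ti not in used_tracks and di not in used_dets:
            matches.append((ti, di))
            used_tracks.add(ti)
            used_dets.add(di)
    return matches


def associate(track_boxes, detections, high=0.5, low=0.1, min_iou=0.3):
    """detections: list of (box, score). Returns (matches, unmatched_track_ids)."""
    high_ids = [i for i, (_, s) in enumerate(detections) if s >= high]
    low_ids = [i for i, (_, s) in enumerate(detections) if low <= s < high]

    # Stage 1: all tracks against high-confidence detections.
    stage1 = greedy_match(track_boxes, [detections[i][0] for i in high_ids], min_iou)
    matches = [(ti, high_ids[di]) for ti, di in stage1]

    # Stage 2: leftover tracks against low-confidence detections.
    matched = {ti for ti, _ in matches}
    leftover = [ti for ti in range(len(track_boxes)) if ti not in matched]
    stage2 = greedy_match(
        [track_boxes[ti] for ti in leftover],
        [detections[i][0] for i in low_ids],
        min_iou,
    )
    matches += [(leftover[ti], low_ids[di]) for ti, di in stage2]

    matched = {ti for ti, _ in matches}
    unmatched = [ti for ti in range(len(track_boxes)) if ti not in matched]
    return matches, unmatched


tracks = [(10, 10, 50, 50), (100, 100, 140, 140)]
detections = [((12, 12, 52, 52), 0.9), ((102, 102, 142, 142), 0.3)]
matches, unmatched = associate(tracks, detections)
```

In this example the second face is detected with a score of only 0.3, so it is ignored in stage 1 but still keeps its track alive in stage 2.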

Temporal Smoother: Adaptive EMA where the smoothing factor α scales with motion magnitude:

  • Static faces → heavy smoothing (α≈0.3) → no jitter
  • Fast-moving faces → light smoothing (α≈0.9) → no lag
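
One way to realize this adaptive EMA is sketched below. The α range of 0.3 to 0.9 comes from the bullets above; the motion measure (center displacement divided by box size), the `motion_scale` parameter, and the function names are assumptions for illustration and may differ from `engine/temporal.py`.

```python
import math


def adaptive_alpha(prev_box, new_box, alpha_min=0.3, alpha_max=0.9, motion_scale=0.5):
    """Map relative motion between two (x1, y1, x2, y2) boxes to an EMA factor."""
    prev_cx, prev_cy = (prev_box[0] + prev_box[2]) / 2, (prev_box[1] + prev_box[3]) / 2
    new_cx, new_cy = (new_box[0] + new_box[2]) / 2, (new_box[1] + new_box[3]) / 2
    size = max(1e-6, ((prev_box[2] - prev_box[0]) + (prev_box[3] - prev_box[1])) / 2)
    # Assumption: motion is the center shift as a fraction of the box size.
    motion = math.hypot(new_cx - prev_cx, new_cy - prev_cy) / size
    t = min(1.0, motion / motion_scale)
    return alpha_min + (alpha_max - alpha_min) * t


def smooth_box(prev_box, new_box, **kwargs):
    """EMA update: alpha weights the new box, (1 - alpha) the previous estimate."""
    alpha = adaptive_alpha(prev_box, new_box, **kwargs)
    return tuple(alpha * n + (1 - alpha) * p for p, n in zip(prev_box, new_box))


previous = (100.0, 100.0, 140.0, 140.0)
jittered = (101.0, 100.0, 141.0, 140.0)  # 1 px of jitter on a 40 px face
moved = (140.0, 100.0, 180.0, 140.0)     # a full box width of motion

alpha_static = adaptive_alpha(previous, jittered)
alpha_fast = adaptive_alpha(previous, moved)
smoothed = smooth_box(previous, moved)
```

A small α keeps most of the previous estimate, which suppresses jitter on static faces; a large α follows the new detection closely, which avoids lag on fast motion.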

Quick Start

Installation

pip install -r requirements.txt

Detect faces in a video

from facedet import VideoFaceDetector

detector = VideoFaceDetector(
    model_path='checkpoints/scrfd_34g_best.pth',
    model_name='scrfd_34g',
    device='cuda',
    use_tracking=True,
    use_smoothing=True,
)

# Process video file
stats = detector.process_video(
    source='input.mp4',
    output_path='output.mp4',
    show=True,
)
# → {total_frames: 1200, avg_fps: 75.3, avg_faces_per_frame: 4.2}

Detect faces in a single image

import cv2
import torch

from facedet import build_detector

model = build_detector('scrfd_34g').cuda().eval()
# Load checkpoint...

img = cv2.imread('photo.jpg')
# Preprocess `img` into the input batch `tensor`... (see scripts/evaluate.py for full example)
with torch.no_grad():  # inference only, so skip gradient tracking
    results = model(tensor)
# → [{'boxes': tensor([...]), 'scores': tensor([...])}]

Real-time webcam

python scripts/detect_video.py \
    --model scrfd_2.5g \
    --checkpoint checkpoints/scrfd_2.5g_best.pth \
    --input 0 --show

Training

Dataset Setup

Download WIDER FACE and arrange:

data/wider_face/
├── WIDER_train/images/
├── WIDER_val/images/
├── wider_face_split/
│   ├── wider_face_train_bbx_gt.txt
│   └── wider_face_val_bbx_gt.txt
└── retinaface_gt/  (optional, for landmark training)
    ├── train/label.txt
    └── val/label.txt

Training Commands

# Single GPU: SCRFD-34G (flagship)
python scripts/train.py \
    --model scrfd_34g \
    --data-root data/wider_face \
    --epochs 640 \
    --batch-size 8 \
    --lr 0.01

# Multi-GPU: 4× V100
torchrun --nproc_per_node=4 scripts/train.py \
    --model scrfd_34g \
    --data-root data/wider_face \
    --epochs 640 \
    --batch-size 8 \
    --lr 0.01

# Real-time variant
python scripts/train.py \
    --model scrfd_2.5g \
    --data-root data/wider_face \
    --epochs 640 \
    --batch-size 16 \
    --lr 0.02

Training Recipe (from SCRFD paper)

| Parameter | Value | Rationale |
|---|---|---|
| Optimizer | SGD (m=0.9, wd=5e-4) | Standard for detection |
| Base LR | 0.01 (8 imgs/GPU) | Linear scaling rule |
| LR Schedule | MultiStep [440, 544] ×0.1 | Long training, late decay |
| Warmup | 3 epochs, linear from 1e-5 | Prevent early divergence |
| Total Epochs | 640 | Train from scratch |
| Input Size | 640×640 | Random crop from larger |
| Crop Scales | [0.3, 0.45, ..., 2.0] | Sample Redistribution |
| Augmentation | Crop + flip + photometric + robustness | See data/augmentations.py |
| Normalization | GroupNorm | Batch-size independent |
| Matching | ATSS (k=9) | Adaptive thresholds |
| Cls Loss | GFL (β=2) | Joint quality score |
| Reg Loss | DIoU | Better for tiny faces |
| Mixed Precision | ✓ | 2× training speed |
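
The warmup and MultiStep rows of the recipe combine into the schedule sketched below. The numbers (base LR 0.01, 3 warmup epochs starting from 1e-5, decay ×0.1 at epochs 440 and 544) come from the table; the function name `learning_rate` and the per-epoch granularity are assumptions, since warmup is often applied per iteration in practice.

```python
def learning_rate(epoch, base_lr=0.01, warmup_epochs=3, warmup_start=1e-5,
                  milestones=(440, 544), gamma=0.1):
    """Learning rate at a (possibly fractional) epoch index."""
    if epoch < warmup_epochs:
        # Linear ramp from warmup_start up to base_lr.
        fraction = epoch / warmup_epochs
        return warmup_start + (base_lr - warmup_start) * fraction
    # After warmup: multiply by gamma once for every milestone already passed.
    decays = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** decays


schedule = {epoch: learning_rate(epoch) for epoch in (0, 1.5, 3, 439, 440, 544, 639)}
```

The rate therefore stays at 0.01 for most of training and only drops in the last 200 epochs, which matches the "long training, late decay" rationale in the table.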

Evaluation

python scripts/evaluate.py \
    --model scrfd_34g \
    --checkpoint checkpoints/scrfd_34g_best.pth \
    --data-root data/wider_face \
    --output-dir results/scrfd_34g \
    --benchmark

Generates:

  • WiderFace Easy/Medium/Hard AP scores
  • Predictions in WiderFace submission format
  • Speed benchmark table (320/480/640/960px)

Deployment

ONNX Export

python scripts/export.py \
    --model scrfd_34g \
    --checkpoint checkpoints/scrfd_34g_best.pth \
    --output deploy/scrfd_34g.onnx \
    --input-size 640

TensorRT (FP16)

# Note: newer TensorRT releases deprecate --workspace in favor of --memPoolSize=workspace:4096
trtexec --onnx=deploy/scrfd_34g.onnx \
        --saveEngine=deploy/scrfd_34g_fp16.engine \
        --fp16 --workspace=4096

Expected Deployment Speedups

| Model | PyTorch FP32 | ONNX RT | TensorRT FP16 | TensorRT INT8 |
|---|---|---|---|---|
| SCRFD-34G | ~80 FPS | ~100 FPS | ~200 FPS | ~350 FPS |
| SCRFD-2.5G | ~400 FPS | ~500 FPS | ~800 FPS | ~1200 FPS |
| SCRFD-0.5G | ~1000 FPS | ~1200 FPS | ~2000 FPS | ~3000 FPS |

PyTorch Quantization (CPU)

from facedet.deploy import quantize_model
quantized = quantize_model(model, method='dynamic')

Ablation Studies

Configured in configs/ablations.yaml. Each ablation isolates one variable:

| Ablation | Variables | Expected Finding |
|---|---|---|
| Sample Redistribution | Crop scales [0.3–1.0] vs [0.3–2.0] | +5-8% Hard AP from large crops |
| Loss Functions | GFL+DIoU vs Focal+SmoothL1 | GFL: +1-2% from quality scores |
| Matching Strategy | ATSS(k=9) vs IoU(0.35) vs IoU(0.5) | ATSS: best for mixed scales |
| Robustness Augmentation | None / blur / JPEG / all | All: +1-3% on degraded inputs |
| Normalization | GroupNorm vs BatchNorm | GN: stable at batch<8 |
| Input Resolution | 320 / 480 / 640 / 960 | 960: +5-10% Hard AP, 4× slower |
| Landmarks | With/without 5-point landmarks | +~1% Hard AP (RetinaFace finding) |
| Tracker Config | None / conservative / aggressive | Aggressive: more tracks, more FP |
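
For the Loss Functions ablation, the DIoU regression term can be written as 1 − IoU + ρ²/c², where ρ is the distance between the two box centers and c is the diagonal of the smallest box enclosing both (Zheng et al., 2020). The sketch below evaluates it for a single box pair in plain Python; the function name `diou_loss` is hypothetical, and the version in `models/losses.py` presumably operates on batched tensors.

```python
def diou_loss(pred, target, eps=1e-9):
    """DIoU loss for two boxes given as (x1, y1, x2, y2)."""
    # Plain IoU term.
    iw = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    ih = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = iw * ih
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_target = (target[2] - target[0]) * (target[3] - target[1])
    iou = inter / (area_pred + area_target - inter + eps)

    # Squared distance between the two box centers.
    dx = (pred[0] + pred[2]) / 2 - (target[0] + target[2]) / 2
    dy = (pred[1] + pred[3]) / 2 - (target[1] + target[3]) / 2
    rho2 = dx ** 2 + dy ** 2

    # Squared diagonal of the smallest box enclosing both.
    cw = max(pred[2], target[2]) - min(pred[0], target[0])
    ch = max(pred[3], target[3]) - min(pred[1], target[1])
    c2 = cw ** 2 + ch ** 2

    return 1.0 - iou + rho2 / (c2 + eps)


perfect = diou_loss((0, 0, 10, 10), (0, 0, 10, 10))
disjoint = diou_loss((0, 0, 10, 10), (20, 0, 30, 10))
```

For the disjoint pair the IoU term alone is flat at 1, but the center-distance term still shrinks as the prediction moves toward the target. That is a plausible reason the penalty helps on tiny faces, where early predictions often do not overlap the ground truth at all.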

Handling Challenging Conditions

Tiny Faces (<16px)

  • Sample Redistribution (crop scale up to 2.0×) generates small face training samples
  • Stride-8 feature maps with anchors [16, 32]px
  • Higher inference resolution (960px) trades speed for +5-10% small face recall
  • ATSS matching gives tiny faces lower IoU thresholds automatically
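
The last point follows from how ATSS sets its positive threshold: per ground-truth box, it takes the mean plus the standard deviation of the IoUs of the candidate anchors. The sketch below shows only that statistic. The IoU values are made-up illustrative inputs, not measurements; the function names are hypothetical; and the full algorithm also selects the k nearest anchors per level and requires the anchor center to lie inside the ground-truth box.

```python
import statistics


def atss_threshold(candidate_ious):
    """Per-GT positive threshold: mean + standard deviation of candidate IoUs."""
    return statistics.mean(candidate_ious) + statistics.stdev(candidate_ious)


def atss_positives(candidate_ious):
    """Indices of candidates whose IoU reaches the adaptive threshold."""
    threshold = atss_threshold(candidate_ious)
    return [i for i, value in enumerate(candidate_ious) if value >= threshold]


# Illustrative IoUs only: anchors rarely overlap a tiny face well.
tiny_face_ious = [0.05, 0.10, 0.12, 0.20, 0.30]
large_face_ious = [0.40, 0.55, 0.60, 0.70, 0.85]

tiny_threshold = atss_threshold(tiny_face_ious)
large_threshold = atss_threshold(large_face_ious)
tiny_positives = atss_positives(tiny_face_ious)
```

With a fixed threshold of 0.35 the tiny face in this toy example would get no positive anchors, whereas the adaptive threshold still assigns it its best candidate.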

Blur / Motion Blur

  • Training augmentation: Gaussian blur σ∈[0.5, 3.0] applied with p=0.2
  • Model learns blur-invariant features
  • ByteTrack Kalman filter predicts through blurred frames

Occlusion

  • Random erasing (Cutout) during training simulates partial occlusion
  • ATSS assigns multiple anchors per GT → partial detection still gets signal
  • ByteTrack 2nd-stage matching recovers occluded faces with low-confidence detections

Poor Lighting

  • Gamma darkening augmentation (γ∈[1.5, 3.0]) simulates low-light
  • Photometric distortion (brightness, contrast jitter)
  • For extreme cases: pair with CLAHE preprocessing
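
Gamma darkening maps each normalized intensity v to v^γ, so γ > 1 pushes mid-tones toward black while leaving pure black and pure white unchanged. The sketch below builds the corresponding 8-bit lookup table in plain Python; the γ range comes from the bullet above, while the function names and the uniform sampling of γ are assumptions rather than the behavior of `data/augmentations.py`.

```python
import random


def sample_gamma(rng, low=1.5, high=3.0):
    """Draw a darkening exponent; uniform sampling is an assumption."""
    return rng.uniform(low, high)


def gamma_lut(gamma):
    """256-entry table mapping an 8-bit value v to 255 * (v / 255) ** gamma."""
    return [round(255 * (v / 255) ** gamma) for v in range(256)]


def darken(pixels, gamma):
    """Apply the lookup table to a flat sequence of 8-bit intensities."""
    lut = gamma_lut(gamma)
    return [lut[v] for v in pixels]


rng = random.Random(0)
gamma = sample_gamma(rng)
row = darken([0, 64, 128, 255], gamma=2.0)
```

On real images the same table can be applied in one call with an array lookup, for example OpenCV's `cv2.LUT` with the table converted to a `uint8` array.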

Compression Artifacts

  • JPEG quality degradation (Q=20-80) during training
  • To our knowledge, no published face detection method targets compression artifacts directly, so this augmentation addresses a gap in the literature

Temporal Stability

  • ByteTrack: stable track IDs across frames, handles occlusion
  • Kalman filter: smooth trajectory prediction
  • Temporal EMA: adaptive smoothing eliminates box jitter
  • Keyframe strategy: full detection every N frames, tracker-only in between
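
The keyframe strategy in the last bullet amounts to a simple scheduling loop. The sketch below uses stub callables so that it runs on its own; `run_keyframe_pipeline`, `detect`, `update_tracks`, `predict_tracks`, and the default interval of 5 are all hypothetical names and values, not the interface of `engine/video_detector.py`.

```python
def run_keyframe_pipeline(frames, detect, update_tracks, predict_tracks, interval=5):
    """Run the detector on every `interval`-th frame; use the tracker in between."""
    outputs = []
    for index, frame in enumerate(frames):
        if index % interval == 0:
            outputs.append(update_tracks(detect(frame)))  # keyframe: full detection
        else:
            outputs.append(predict_tracks())  # tracker-only frame
    return outputs


# Stub callables standing in for the detector and tracker.
detector_calls = []
state = {"boxes": []}


def detect(frame):
    detector_calls.append(frame)
    return [(0, 0, 10, 10)]


def update_tracks(detections):
    state["boxes"] = list(detections)
    return state["boxes"]


def predict_tracks():
    # A real tracker would return Kalman-predicted boxes here.
    return state["boxes"]


outputs = run_keyframe_pipeline(range(12), detect, update_tracks, predict_tracks)
```

With an interval of 5, only 3 of the 12 frames reach the detector, so the per-frame cost is dominated by the much cheaper tracker prediction. The trade-off is that new faces can go unnoticed for up to N − 1 frames.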

Repository Structure

facedet/
├── README.md                 # This file
├── setup.py                  # Package installation
├── requirements.txt          # Dependencies
│
├── models/                   # Model architectures
│   ├── backbone.py           # NAS-searched ResNet backbones
│   ├── neck.py               # PAFPN feature pyramid
│   ├── head.py               # Shared detection head (cls/reg/lmk)
│   ├── anchor.py             # Anchor generation + ATSS matching
│   ├── losses.py             # GFL, DIoU, Focal, Landmark losses
│   └── detector.py           # Full SCRFD detector (train + inference)
│
├── data/                     # Data pipeline
│   ├── widerface.py          # WiderFace dataset loader
│   ├── augmentations.py      # Training/val/robustness augmentations
│   └── dataloader.py         # DataLoader builders
│
├── engine/                   # Video inference engine
│   ├── video_detector.py     # End-to-end video processing
│   ├── tracker.py            # ByteTrack face tracker
│   └── temporal.py           # Temporal EMA smoother
│
├── evaluation/               # Evaluation suite
│   ├── widerface_eval.py     # WiderFace protocol (Easy/Med/Hard AP)
│   ├── speed_benchmark.py    # Latency/throughput benchmarks
│   └── metrics.py            # Core metrics (AP, IoU, recall)
│
├── deploy/                   # Deployment
│   ├── export_onnx.py        # ONNX export + verification
│   └── optimize.py           # Quantization, TensorRT guide
│
├── configs/                  # Configuration files
│   ├── scrfd_34g.yaml        # Flagship (quality)
│   ├── scrfd_10g.yaml        # Balanced
│   ├── scrfd_2.5g.yaml       # Real-time
│   ├── scrfd_0.5g.yaml       # Mobile
│   └── ablations.yaml        # Ablation study configs
│
├── scripts/                  # Entry points
│   ├── train.py              # Training (single/multi-GPU)
│   ├── evaluate.py           # WiderFace evaluation + speed bench
│   ├── detect_video.py       # Video inference CLI
│   └── export.py             # ONNX export CLI
│
└── utils/                    # Helpers
    ├── visualization.py      # Drawing utilities
    └── io.py                 # Checkpoint I/O

References

  1. SCRFD: Guo et al., "Sample and Computation Redistribution for Efficient Face Detection", arXiv:2105.04714 (2021)
  2. RetinaFace: Deng et al., "RetinaFace: Single-shot Multi-level Face Localisation in the Wild", arXiv:1905.00641 (2019)
  3. TinaFace: Zhu et al., "TinaFace: Strong but Simple Baseline for Face Detection", arXiv:2011.13183 (2020)
  4. ByteTrack: Zhang et al., "ByteTrack: Multi-Object Tracking by Associating Every Detection Box", ECCV 2022
  5. ATSS: Zhang et al., "Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection", CVPR 2020
  6. GFL: Li et al., "Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection", NeurIPS 2020
  7. DIoU: Zheng et al., "Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression", AAAI 2020
  8. ASFD: Zhang et al., "ASFD: Automatic and Scalable Face Detector", arXiv:2201.10781 (2022)
  9. DSFD: Li et al., "DSFD: Dual Shot Face Detector", CVPR 2019
  10. WiderFace: Yang et al., "WIDER FACE: A Face Detection Benchmark", CVPR 2016

License

Apache 2.0
