
Deepfake Detector

An image-based deepfake detection web app that classifies face images as Real or Deepfake with a two-branch fusion model: a DoRA-fine-tuned CLIP ViT-L/14 backbone extracts spatial features, and a lightweight CNN over the 2D FFT magnitude spectrum captures frequency-domain artifacts. A Flask backend serves predictions, attention-rollout heatmaps, and uncanny-valley heuristics; a React/Vite frontend provides the UI.

Setup

pip install -r requirements.txt
python train.py       # fine-tune the model (downloads datasets via kagglehub)
python evaluate.py    # evaluate on held-out test sets
python app.py         # start the Flask API at http://127.0.0.1:5001

# Frontend (optional)
cd frontend
npm install
npm run dev

How It Works

  1. User uploads a face image.
  2. MediaPipe Face Landmarker detects and crops faces.
  3. Each face is passed through the two-branch detector:
    • Branch 1: CLIP ViT-L/14 vision encoder (spatial semantics).
    • Branch 2: 2D FFT magnitude spectrum β†’ small CNN (frequency artifacts).
    • A fusion MLP head produces the Real / Deepfake logits.
  4. Attention rollout over the ViT layers produces a heatmap.
  5. OpenCV-based heuristics compute an "uncanny valley" score (symmetry, eye consistency, skin texture, edge naturalness, lighting consistency, sensor-noise pattern).
  6. Results are returned to the frontend for display.
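The fusion in step 3 can be sketched as below. The class name `TwoBranchDetector`, the CNN layer sizes, and the 128-d FFT feature width are illustrative assumptions rather than the exact architecture in `train.py`; the 1024-d CLIP input matches ViT-L/14's hidden size.

```python
import torch
import torch.nn as nn

class TwoBranchDetector(nn.Module):
    """Sketch of the fusion head: CLIP features + FFT-CNN features -> logits."""

    def __init__(self, clip_dim=1024, fft_dim=128, num_classes=2):
        super().__init__()
        # Branch 2: small CNN over the (1, H, W) log-magnitude spectrum.
        self.fft_cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, fft_dim),
        )
        # Fusion MLP over the concatenated branch features.
        self.head = nn.Sequential(
            nn.Linear(clip_dim + fft_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, clip_feats, spectrum):
        # clip_feats: [B, clip_dim] pooled ViT output; spectrum: [B, 1, H, W].
        fft_feats = self.fft_cnn(spectrum)
        return self.head(torch.cat([clip_feats, fft_feats], dim=1))
```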

Training Recipe

  • Backbone: openai/clip-vit-large-patch14
  • Adapter: DoRA (r=16, Ξ±=32) on all CLIP attention projections (q_proj, k_proj, v_proj, out_proj), with LoRA dropout 0.1.
  • Loss: 0.7 Γ— Supervised Contrastive + 0.3 Γ— Cross-Entropy (label smoothing 0.1). Projection head (128-d, L2-normalized) used only during training.
  • Optimizer: AdamW (lr=2e-4, weight decay 0.01), gradient clipping at 1.0.
  • Schedule: CosineAnnealingWarmRestarts (T_0=5, Ξ·_min=1e-6).
  • Precision: AMP (float16) on MPS / CUDA.
  • Augmentation: Albumentations pipeline β€” horizontal flip, rotate, random resized crop, color jitter, simulated social-media degradation (downscale β†’ JPEG 20–70 β†’ upscale β†’ blur), JPEG compression, Gaussian blur, downscale, and a stochastic high-pass filter (p=0.15).
  • Batch sampling: Balanced sampler with 4 groups (2 datasets Γ— 2 classes), equal representation per batch.
  • Validation metric: ROC-AUC; early stopping with patience 3; per-epoch versioned DoRA snapshots.

Optimizations

Training Efficiency

  • DoRA adapter instead of full fine-tuning β€” only the decomposed low-rank updates on CLIP's q/k/v/out_proj layers are trained; the 300M+ backbone weights stay frozen. Massive VRAM savings, and DoRA typically closes the gap to full fine-tuning that plain LoRA leaves on the table.
  • Automatic Mixed Precision (float16 autocast) β€” roughly 2Γ— memory reduction on MPS/CUDA and faster matmuls on tensor-core hardware.
  • Gradient accumulation (ACCUM_STEPS) β€” lets a small per-step batch simulate a much larger effective batch without the memory cost.
  • Gradient clipping at max_norm=1.0 β€” stabilizes DoRA + SupCon updates, which can spike early in training.
  • CosineAnnealingWarmRestarts β€” resume-friendly LR schedule; periodic restarts help escape flat regions without manual LR tuning.
  • Early stopping (patience=3 on val AUC) β€” avoids wasted epochs once the model plateaus.
  • Device auto-selection β€” CUDA β†’ MPS β†’ CPU, with num_workers=4 and pin_memory=True enabled automatically on CUDA only (MPS / CPU get num_workers=0 to avoid Python multiprocessing stalls on macOS).
  • CPU-side Albumentations pipeline β€” all augmentation runs in NumPy/OpenCV so it doesn't contend with the MPS GPU mid-step.

Data Pipeline

  • Balanced batch sampler β€” each batch draws equally from 4 groups (2 datasets Γ— 2 classes). Prevents the larger dataset / majority class from dominating gradients and ensures every step sees every generator type.
  • Stochastic high-pass filter augmentation (p=0.15) β€” forces the network to learn frequency-domain cues even when the FFT branch alone would not be enough.
  • Simulated social-media degradation β€” downscale β†’ JPEG 20–70 β†’ upscale β†’ blur, approximating the Instagram/TikTok transcode pipeline so the model generalizes to "in-the-wild" deepfakes rather than pristine dataset images.
  • Two datasets combined (manjilkarki deepfakes + xhlulu 140k StyleGAN) β€” broader generator coverage (diffusion, face-swap, StyleGAN) than any single source.
  • Label smoothing (0.1) β€” prevents the CE head from producing overconfident logits, which also improves calibration of the softmax score shown in the UI.
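The balanced sampler above can be sketched in plain Python; the function name and the `(dataset_id, class_id)` key format are illustrative assumptions, not the exact sampler in the repo.

```python
import random
from collections import defaultdict

def balanced_batches(labels, batch_size, seed=0):
    """Yield index batches drawing equally from each (dataset, class) group.

    `labels` is a list of (dataset_id, class_id) tuples, one per sample.
    With 2 datasets x 2 classes, each batch holds batch_size // 4 samples
    per group, so neither the larger dataset nor the majority class dominates.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, key in enumerate(labels):
        groups[key].append(idx)
    per_group = batch_size // len(groups)
    pools = {k: list(v) for k, v in groups.items()}
    for pool in pools.values():
        rng.shuffle(pool)
    # Stop once any group is exhausted (drop the unbalanced remainder).
    while all(len(p) >= per_group for p in pools.values()):
        batch = []
        for pool in pools.values():
            batch.extend(pool.pop() for _ in range(per_group))
        rng.shuffle(batch)
        yield batch
```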

Loss & Representation

  • SupCon (0.7) + CE (0.3) β€” SupCon shapes the fused embedding space so same-class samples cluster together regardless of generator, while CE maintains a clean decision boundary. The projection head is training-only and discarded for inference.
  • Two-branch fusion (CLIP + FFT) β€” CLIP handles spatial semantics; the FFT CNN captures spectral peaks and blending artifacts CLIP cannot see in pixel space. Concatenated before the classifier.
  • log1p + fftshift on FFT magnitude β€” compresses the dynamic range of the spectrum and centers the DC component, making the distribution easier for a small CNN to learn.
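The log1p + fftshift preprocessing is small enough to show in full. A minimal NumPy sketch, assuming the FFT branch consumes a single-channel grayscale crop:

```python
import numpy as np

def fft_magnitude(gray):
    """Log-compressed, centered FFT magnitude of a grayscale face crop."""
    spectrum = np.fft.fft2(gray)
    spectrum = np.fft.fftshift(spectrum)  # move the DC component to the center
    return np.log1p(np.abs(spectrum))     # compress the dynamic range
```

Without `log1p`, the DC term dwarfs every other coefficient by orders of magnitude; after compression the spectral peaks a small CNN needs to see are within a learnable range.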

Checkpointing & Reproducibility

  • Best-only saving, tracked by val AUC β€” head_weights.pt and the DoRA adapter are overwritten only when validation AUC improves.
  • Versioned per-epoch snapshots (dora_epoch{N}_auc{X}) β€” any prior epoch can be rolled back to without re-training.
  • train_state.pt β€” stores optimizer, scheduler, epoch, best-AUC, and patience counter in sync with the best weights, so python train.py can safely resume from interruption.
  • DoRA merge at end of training (merge_and_unload) β€” the final inference checkpoint is a plain CLIPVisionModel with the adapter folded in, so app.py does not need PEFT at serve time.
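A minimal sketch of what `train_state.pt` would hold and how resume could work; the helper names and field names are assumptions matching the list above, not the repo's exact code.

```python
import os
import torch

def save_state(path, optimizer, scheduler, epoch, best_auc, patience):
    """Persist everything needed to resume in sync with the best weights."""
    torch.save({
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "epoch": epoch,
        "best_auc": best_auc,
        "patience": patience,
    }, path)

def load_state(path, optimizer, scheduler):
    """Restore a prior run, or return fresh-run defaults if no state exists."""
    if not os.path.exists(path):
        return 0, 0.0, 0
    state = torch.load(path)
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["epoch"], state["best_auc"], state["patience"]
```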

Inference & Serving

  • MediaPipe Face Landmarker with 20% bbox padding β€” crops each face before classification so the model sees a consistent face-centered input instead of wide scenes.
  • Multi-face support β€” every detected face is classified independently and returned as its own result.
  • Attention rollout heatmap via class-swap patch β€” CLIP's CLIPSdpaAttention ignores output_attentions=True; the code temporarily swaps each layer's class back to the eager CLIPAttention parent so attention matrices can be read without monkey-patching forward. Rollout adds residual and re-normalizes per layer.
  • torch.no_grad() everywhere in /predict β€” no autograd graph is built for inference.
  • Base64 PNG encoding of heatmaps β€” avoids a second HTTP round-trip; the frontend can render the overlay immediately.
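The rollout computation itself (independent of the class-swap trick used to obtain the attention matrices) can be sketched as follows, assuming per-layer attention tensors of shape [heads, tokens, tokens] with the CLS token first:

```python
import torch

def attention_rollout(attentions):
    """Roll attention forward through the ViT layers.

    For each layer: average over heads, add the identity for the residual
    connection, re-normalize rows, then compose with the running product.
    Returns the CLS token's accumulated attention to the image patches.
    """
    tokens = attentions[0].shape[-1]
    rollout = torch.eye(tokens)
    for attn in attentions:
        a = attn.mean(dim=0)                 # average over heads
        a = a + torch.eye(tokens)            # residual connection
        a = a / a.sum(dim=-1, keepdim=True)  # re-normalize per layer
        rollout = a @ rollout
    return rollout[0, 1:]  # CLS -> patch attention, reshape to a 2D heatmap
```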


License

This repository is for research and educational use. Please respect the upstream licenses of the datasets and pretrained weights:

  • CLIP ViT-L/14 weights β€” MIT License (OpenAI).
  • Kaggle datasets β€” see each dataset page for terms of use.