Multi-Modal Deepfake Detection: trained weights
Trained weights for Multi-Model_Deepfake_Detection, a CLI that scores a video for manipulation by fusing three independent modalities (visual frames, audio track, spoken transcript).
Do not download these files by hand. Instead, clone the GitHub repo and run
python download_weights.py, which places every file where config.yaml
expects it.
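After the download script finishes, a quick sanity check can confirm that nothing is missing. The sketch below is not part of the project: the entry names come from the Contents table, while the `weights` root directory and the `missing_weights` helper are hypothetical. Adjust the root to wherever config.yaml points.

```python
from pathlib import Path

# Entry names are taken from the Contents table. Where download_weights.py
# actually places them is an assumption; match the root to config.yaml.
EXPECTED = [
    "video_model_colab_ft",        # directory holding the video model
    "clf_ep6_torchscript.pt",      # audio classifier
    "text_model/head_best.pt",     # transcript head
    "fusion_model.pkl",            # fusion meta-classifier
]


def missing_weights(root):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    # "weights" is a hypothetical directory name used only for illustration.
    missing = missing_weights("weights")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All weight files found.")
```

An empty list means every file in the table is present under the chosen root.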
Contents
| File | Role | Val metric |
|---|---|---|
| video_model_colab_ft/ | SigLIP ViT, full fine-tune on FaceForensics++ (c23) face crops | AUC 0.861 / frame, 0.939 / video |
| clf_ep6_torchscript.pt | MLP over WavLM-base-plus embeddings (WaveFake + JSUT) | AUC 0.999 |
| text_model/head_best.pt | Linear head over DistilBERT for transcript claim scoring (FibVID) | AUC 0.909 |
| fusion_model.pkl | HistGradientBoosting meta-classifier over the three probabilities | AUC 0.998 (simulated joint data) |
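To illustrate how the per-modality outputs could feed the fusion model, here is a minimal sketch under stated assumptions. It assumes that frame probabilities are mean-pooled into one video score and that the meta-classifier takes its inputs in the order video, audio, text. Neither detail is confirmed here, and `video_score`, `fusion_features`, and `FEATURE_ORDER` are hypothetical names. Check the project code for the actual aggregation and column order.

```python
from statistics import fmean

# Assumed column order for the meta-classifier input; not confirmed by the project.
FEATURE_ORDER = ("video", "audio", "text")


def video_score(frame_probs):
    """Mean-pool per-frame fake probabilities into one video-level score.

    Mean pooling is an assumption; the project may aggregate differently.
    """
    if not frame_probs:
        raise ValueError("no frames were scored")
    return fmean(frame_probs)


def fusion_features(frame_probs, audio_prob, text_prob):
    """Build a single-row 2D feature list from the three modality probabilities."""
    scores = {
        "video": video_score(frame_probs),
        "audio": audio_prob,
        "text": text_prob,
    }
    for name, p in scores.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} probability {p} is outside [0, 1]")
    # One row, three columns, in FEATURE_ORDER.
    return [[scores[name] for name in FEATURE_ORDER]]


if __name__ == "__main__":
    row = fusion_features([0.25, 0.75], audio_prob=0.1, text_prob=0.6)
    print(row)
```

A row shaped like this (one sample, three columns) is the kind of input a scikit-learn classifier's `predict_proba` accepts, which is presumably how a pickled HistGradientBoosting model would be queried.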
Validation methodology and known limitations (audio language confound, simulated fusion training data) are documented in the project README.
Licensing
Code and these trained heads: MIT. The video model is a fine-tune of prithivMLmods/deepfake-detector-model-v1 and retains its upstream terms. Training datasets are not redistributed here: FaceForensics++ requires a signed EULA; WaveFake is CC BY-SA; JSUT is for research use.
Model tree for AUA27/deepfake-multimodal-weights
Base model
google/siglip2-base-patch16-512