
Simplifying Traffic Anomaly Detection with Video Foundation Models

Svetlana Orlova, Tommie Kerssies, Brunó B. Englert, Gijs Dubbelman
Eindhoven University of Technology

arXiv · Hugging Face Models · Code

Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Therefore, in this work, we investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance. We find that: (i) advanced pre-training enables simple encoder-only models to match or even surpass the performance of specialized state-of-the-art TAD methods, while also being significantly more efficient; (ii) although weakly- and fully-supervised pre-training are advantageous on standard benchmarks, they prove less effective for TAD; instead, self-supervised Masked Video Modeling (MVM) provides the strongest signal; and (iii) Domain-Adaptive Pre-Training (DAPT) on unlabeled driving videos further improves downstream performance, without requiring anomalous examples. Our findings highlight the importance of pre-training and show that effective, efficient, and scalable TAD models can be built with minimal architectural complexity.
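As a concrete illustration of the encoder-only setup, the minimal PyTorch sketch below builds a plain Video ViT (tubelet embedding, Transformer encoder, mean pooling) with a single linear head that outputs one anomaly logit per clip. All module choices and dimensions here are illustrative stand-ins, not the released implementation; the actual models are built with the VideoMAE codebase (see below).

```python
# Minimal sketch of the encoder-only idea: a plain Video ViT encoder with one
# linear head producing a per-clip anomaly score. Hyperparameters roughly
# follow ViT-S but are illustrative only.
import torch
import torch.nn as nn


class TinyVideoViTForTAD(nn.Module):
    def __init__(self, img_size=224, patch=16, frames=16, tubelet=2,
                 dim=384, depth=12, heads=6):
        super().__init__()
        # Tubelet embedding: split the clip into (tubelet x patch x patch) tokens.
        self.embed = nn.Conv3d(3, dim,
                               kernel_size=(tubelet, patch, patch),
                               stride=(tubelet, patch, patch))
        num_tokens = (frames // tubelet) * (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, 1)  # single anomaly logit per clip

    def forward(self, video):            # video: (B, 3, T, H, W)
        tokens = self.embed(video)       # (B, dim, T', H', W')
        tokens = tokens.flatten(2).transpose(1, 2) + self.pos
        feats = self.encoder(tokens)
        clip_feat = self.norm(feats.mean(dim=1))   # mean-pool all tokens
        return self.head(clip_feat).squeeze(-1)    # (B,) anomaly logits


model = TinyVideoViTForTAD()
scores = torch.sigmoid(model(torch.randn(2, 3, 16, 224, 224)))
print(scores.shape)  # torch.Size([2])
```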


✨ DoTA and DADA-2000 results


Video ViT-based encoder-only models set a new state of the art on both datasets, while being significantly more efficient than the top-performing specialized methods. FPS measured on an NVIDIA A100 MIG instance (1/2 GPU). † From prior work. ‡ Optimistic estimates using publicly available components of the model. “A→B”: trained on A, tested on B; D2K: DADA-2000.

🧩 Code

Check out our GitHub repo: simple-tad

📍Model Zoo

DAPT (adapted) models

| Method | Backbone | Initialized with | DAPT epochs | DAPT data | Checkpoint |
|---|---|---|---|---|---|
| VideoMAE | ViT-S | Kinetics-400, 1600 ep | 12 | Kinetics-700 | simpletad_dapt-k700_videomae-s_ep12.pth |
| VideoMAE | ViT-S | Kinetics-400, 1600 ep | 12 | BDD100K | simpletad_dapt-onlybdd_videomae-s_ep12.pth |
| VideoMAE | ViT-S | Kinetics-400, 1600 ep | 12 | BDD100K + CAP-DATA | simpletad_dapt_videomae-s_ep12.pth |
| VideoMAE | ViT-B | Kinetics-400, 1600 ep | 12 | BDD100K + CAP-DATA | simpletad_dapt_videomae-b_ep12.pth |
| VideoMAE | ViT-L | Kinetics-400, 1600 ep | 12 | BDD100K + CAP-DATA | simpletad_dapt_videomae-l_ep12.pth |
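To warm-start fine-tuning from one of these DAPT checkpoints, something along the lines of the sketch below should work. It assumes the .pth files follow the VideoMAE-style convention of wrapping the weights under a "model" (or "module") key and prefixing pre-training-only tensors with "decoder"/"mask_token"; inspect the checkpoint and adjust the key handling if it differs.

```python
# Hedged sketch: extract encoder weights from a DAPT checkpoint for fine-tuning.
import torch

ckpt = torch.load("simpletad_dapt_videomae-b_ep12.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt.get("module", ckpt))

# Drop pre-training-only parts (MAE decoder, mask token); keep encoder weights.
encoder_sd = {k: v for k, v in state_dict.items()
              if not k.startswith(("decoder", "mask_token"))}
print(f"{len(encoder_sd)} encoder tensors, e.g. {next(iter(encoder_sd))}")

# Then warm-start a Video ViT-B encoder built with the VideoMAE codebase:
# missing, unexpected = encoder.load_state_dict(encoder_sd, strict=False)
```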

Fine-tuned on DoTA

| Method | Backbone | Initialized with | Best AUC-ROC checkpoint | Best AUC-MCC checkpoint | AUC-ROC | AUC-MCC |
|---|---|---|---|---|---|---|
| VideoMAE | ViT-S | VideoMAE (Kinetics-400, 1600 ep) | simpletad_ft-dota_vm1-s_auroc.pth | simpletad_ft-dota_vm1-s_aumcc.pth | 83.7 | 46.9 |
| VideoMAE | ViT-B | VideoMAE (Kinetics-400, 1600 ep) | simpletad_ft-dota_vm1-b-1600_auroc.pth | simpletad_ft-dota_vm1-b-1600_aumcc.pth | 86.3 | 54.8 |
| VideoMAE | ViT-L | VideoMAE (Kinetics-400, 1600 ep) | simpletad_ft-dota_vm1-l_auroc.pth | simpletad_ft-dota_vm1-l_aumcc.pth | 88.2 | 58.7 |
| VideoMAE2 | ViT-S | VideoMAE2 (vit_s_k710_dl_from_giant.pth) | simpletad_ft-dota_vm2-s_auroc.pth | simpletad_ft-dota_vm2-s_aumcc.pth | 86.0 | 54.1 |
| VideoMAE2 | ViT-B | VideoMAE2 (vit_b_k710_dl_from_giant.pth) | simpletad_ft-dota_vm2-b_auroc.pth | simpletad_ft-dota_vm2-b_aumcc.pth | 86.9 | 55.4 |
| MVD_fromL | ViT-S | MVD (Kinetics-400, teacher ViT-L) | simpletad_ft-dota_mvd-s-fromL_auroc.pth | simpletad_ft-dota_mvd-s-fromL_aumcc.pth | 85.3 | 53.8 |
| MVD_fromB | ViT-B | MVD (Kinetics-400, teacher ViT-B) | simpletad_ft-dota_mvd-b-fromB_auroc.pth | simpletad_ft-dota_mvd-b-fromB_aumcc.pth | 86.1 | 54.7 |
| MVD_fromL | ViT-L | MVD (Kinetics-400, teacher ViT-L) | simpletad_ft-dota_mvd-l-fromL_auroc.pth | simpletad_ft-dota_mvd-l-fromL_aumcc.pth | 87.2 | 58.1 |
| DAPT-VideoMAE | ViT-S | DAPT (BDD100K + CAP-DATA) | simpletad_ft-dota_dapt-vm1-s_auroc.pth | simpletad_ft-dota_dapt-vm1-s_aumcc.pth | 86.4 | 54.0 |
| DAPT-VideoMAE | ViT-B | DAPT (BDD100K + CAP-DATA) | simpletad_ft-dota_dapt-vm1-b_auroc.pth | simpletad_ft-dota_dapt-vm1-b_aumcc.pth | 87.9 | 57.5 |
| DAPT-VideoMAE | ViT-L | DAPT (BDD100K + CAP-DATA) | simpletad_ft-dota_dapt-vm1-l_auroc.pth | simpletad_ft-dota_dapt-vm1-l_aumcc.pth | 88.4 | 58.9 |
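At inference time, a fine-tuned encoder scores short clips, and clip scores are mapped back to per-frame anomaly scores for evaluation. The sketch below shows one simple sliding-window scheme; the window length, stride, and frame-assignment rule are assumptions, and the exact evaluation protocol is defined in the simple-tad repository.

```python
# Illustrative sliding-window scoring: average overlapping clip scores per frame.
import torch


@torch.no_grad()
def frame_scores(frames, clip_scorer, clip_len=16, stride=1):
    """frames: (T, 3, H, W) float tensor; clip_scorer: clip -> anomaly logit."""
    T = frames.shape[0]
    scores = torch.zeros(T)
    counts = torch.zeros(T)
    for start in range(0, T - clip_len + 1, stride):
        clip = frames[start:start + clip_len]            # (clip_len, 3, H, W)
        clip = clip.permute(1, 0, 2, 3).unsqueeze(0)     # (1, 3, clip_len, H, W)
        s = torch.sigmoid(clip_scorer(clip)).item()
        # Assign the clip score to every frame it covers; average overlaps.
        scores[start:start + clip_len] += s
        counts[start:start + clip_len] += 1
    return scores / counts.clamp(min=1)


# Example with a stand-in scorer (replace with a fine-tuned model from above):
dummy_scorer = lambda clip: clip.mean().unsqueeze(0)
print(frame_scores(torch.randn(40, 3, 224, 224), dummy_scorer).shape)  # torch.Size([40])
```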

Fine-tuned on DADA-2000

| Method | Backbone | Initialized with | Best AUC-ROC checkpoint | Best AUC-MCC checkpoint | AUC-ROC | AUC-MCC |
|---|---|---|---|---|---|---|
| VideoMAE | ViT-S | VideoMAE (Kinetics-400, 1600 ep) | simpletad_ft-dada_vm1-s_auroc.pth | simpletad_ft-dada_vm1-s_aumcc.pth | 83.0 | 48.2 |
| VideoMAE | ViT-B | VideoMAE (Kinetics-400, 1600 ep) | simpletad_ft-dada_vm1-b-1600_auroc.pth | simpletad_ft-dada_vm1-b-1600_aumcc.pth | 85.4 | 52.2 |
| VideoMAE | ViT-L | VideoMAE (Kinetics-400, 1600 ep) | simpletad_ft-dada_vm1-l_auroc.pth | simpletad_ft-dada_vm1-l_aumcc.pth | 87.2 | 55.4 |
| VideoMAE2 | ViT-S | VideoMAE2 (vit_s_k710_dl_from_giant.pth) | simpletad_ft-dada_vm2-s_auroc.pth | simpletad_ft-dada_vm2-s_aumcc.pth | 84.8 | 50.3 |
| VideoMAE2 | ViT-B | VideoMAE2 (vit_b_k710_dl_from_giant.pth) | simpletad_ft-dada_vm2-b_auroc.pth | simpletad_ft-dada_vm2-b_aumcc.pth | 86.3 | 53.3 |
| MVD_fromL | ViT-S | MVD (Kinetics-400, teacher ViT-L) | simpletad_ft-dada_mvd-s-fromL_auroc.pth | simpletad_ft-dada_mvd-s-fromL_aumcc.pth | 82.2 | 50.2 |
| MVD_fromB | ViT-B | MVD (Kinetics-400, teacher ViT-B) | simpletad_ft-dada_mvd-b-fromB_auroc.pth | simpletad_ft-dada_mvd-b-fromB_aumcc.pth | 84.7 | 50.9 |
| MVD_fromL | ViT-L | MVD (Kinetics-400, teacher ViT-L) | simpletad_ft-dada_mvd-l-fromL_auroc.pth | simpletad_ft-dada_mvd-l-fromL_aumcc.pth | 86.1 | 53.7 |
| DAPT-VideoMAE | ViT-S | DAPT (BDD100K + CAP-DATA) | simpletad_ft-dada_dapt-vm1-s_auroc.pth | simpletad_ft-dada_dapt-vm1-s_aumcc.pth | 85.6 | 52.0 |
| DAPT-VideoMAE | ViT-B | DAPT (BDD100K + CAP-DATA) | simpletad_ft-dada_dapt-vm1-b_auroc.pth | simpletad_ft-dada_dapt-vm1-b_aumcc.pth | 87.6 | 55.2 |
| DAPT-VideoMAE | ViT-L | DAPT (BDD100K + CAP-DATA) | simpletad_ft-dada_dapt-vm1-l_auroc.pth | simpletad_ft-dada_dapt-vm1-l_aumcc.pth | 88.5 | 56.8 |
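The tables report AUC-ROC and AUC-MCC over frame-level anomaly labels. The sketch below computes both from per-frame scores: AUC-ROC via scikit-learn, and AUC-MCC as the area under the MCC-vs-threshold curve, which is an assumed reading of the metric; consult the paper and the simple-tad repository for the authoritative protocol.

```python
# Hedged metric sketch: AUC-ROC and an assumed AUC-MCC (area under MCC(threshold)).
import numpy as np
from sklearn.metrics import roc_auc_score, matthews_corrcoef


def auc_roc_and_mcc(scores, labels, num_thresholds=100):
    # scores: per-frame anomaly scores in [0, 1]; labels: binary ground truth.
    scores, labels = np.asarray(scores), np.asarray(labels)
    auc_roc = roc_auc_score(labels, scores)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    mcc = [matthews_corrcoef(labels, (scores >= t).astype(int)) for t in thresholds]
    auc_mcc = np.trapz(mcc, thresholds)  # integrate MCC over the threshold range
    return auc_roc, auc_mcc


# Toy example with synthetic scores and labels:
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
scores = np.clip(labels * 0.3 + rng.random(500) * 0.7, 0, 1)
print(auc_roc_and_mcc(scores, labels))
```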

☎️ Contact

Svetlana Orlova: [email protected], [email protected]

👍 Acknowledgements

Our code is mainly based on the VideoMAE codebase. For Video ViTs with an identical architecture, we only used their released weights: ViViT, VideoMAE2, SMILE, SIGMA, MME, MGMAE.
We used fragments of the original implementations of MVD, InternVideo2, and UMT to integrate these models into our codebase.

✏️ Citation

If you find this project helpful, please feel free to leave a like ❤️ and cite our paper:

@inproceedings{orlova2025simplifying,
  title={Simplifying Traffic Anomaly Detection with Video Foundation Models},
  author={Orlova, Svetlana and Kerssies, Tommie and Englert, Brun{\'o} B and Dubbelman, Gijs},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}

@article{orlova2025simplifying,
  title={Simplifying Traffic Anomaly Detection with Video Foundation Models},
  author={Orlova, Svetlana and Kerssies, Tommie and Englert, Brun{\'o} B and Dubbelman, Gijs},
  journal={arXiv preprint arXiv:2507.09338},
  year={2025}
}
