Simplifying Traffic Anomaly Detection with Video Foundation Models
Svetlana Orlova, Tommie Kerssies, Brunó B. Englert, Gijs Dubbelman
Eindhoven University of Technology
Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Therefore, in this work, we investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance. We find that: (i) advanced pre-training enables simple encoder-only models to match or even surpass the performance of specialized state-of-the-art TAD methods, while also being significantly more efficient; (ii) although weakly- and fully-supervised pre-training are advantageous on standard benchmarks, they are less effective for TAD; instead, self-supervised Masked Video Modeling (MVM) provides the strongest signal; and (iii) Domain-Adaptive Pre-Training (DAPT) on unlabeled driving videos further improves downstream performance, without requiring anomalous examples. Our findings highlight the importance of pre-training and show that effective, efficient, and scalable TAD models can be built with minimal architectural complexity.
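To make the encoder-only design concrete, below is a minimal sketch: a plain Video ViT backbone followed by a linear head that produces per-clip anomaly logits. The wrapper, shapes, and defaults are illustrative assumptions, not the repo's exact code.

```python
import torch
import torch.nn as nn

class EncoderOnlyTAD(nn.Module):
    """Plain Video ViT encoder + linear head, in the spirit of the
    encoder-only setup described above. Illustrative sketch only;
    the real model builders live in the simple-tad repo."""

    def __init__(self, backbone: nn.Module, embed_dim: int = 384, num_classes: int = 2):
        super().__init__()
        self.backbone = backbone            # e.g. a VideoMAE ViT-S encoder (assumed)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, C, T, H, W); the backbone is assumed to return
        # patch tokens of shape (B, N, D)
        tokens = self.backbone(clips)
        pooled = self.norm(tokens.mean(dim=1))  # mean-pool over tokens
        return self.head(pooled)                # per-clip anomaly logits
```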
✨ DoTA and DADA-2000 results
Video ViT-based encoder-only models set a new state of the art on both datasets, while being significantly more efficient than top-performing specialized methods. FPS measured on an NVIDIA A100 MIG slice (1/2 GPU). † From prior work. ‡ Optimistic estimates using publicly available components of the model. “A→B”: trained on A, tested on B; D2K: DADA-2000.
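The exact benchmarking script is not reproduced here, but a throughput measurement of this kind typically looks like the sketch below; the function name, clip shape, and iteration counts are illustrative assumptions, and absolute numbers depend on the GPU or MIG slice used.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, clip_shape=(1, 3, 16, 224, 224), warmup=10, iters=100):
    """Rough FPS estimate for a video model; illustrative sketch only."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(clip_shape, device=device)
    for _ in range(warmup):           # warm-up to stabilize clocks and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()      # wait for queued kernels before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    frames = iters * clip_shape[0] * clip_shape[2]  # batch * clip length
    return frames / (time.perf_counter() - start)
```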
🧩 Code
Check out our GitHub repo: simple-tad
📍 Model Zoo
DAPT (adapted) models
| Method | Backbone | Initialized with | DAPT epochs | DAPT data | Checkpoint |
|---|---|---|---|---|---|
| VideoMAE | ViT-S | Kinetics-400, 1600 ep | 12 | Kinetics-700 | simpletad_dapt-k700_videomae-s_ep12.pth |
| VideoMAE | ViT-S | Kinetics-400, 1600 ep | 12 | BDD100K | simpletad_dapt-onlybdd_videomae-s_ep12.pth |
| VideoMAE | ViT-S | Kinetics-400, 1600 ep | 12 | BDD100K + CAP-DATA | simpletad_dapt_videomae-s_ep12.pth |
| VideoMAE | ViT-B | Kinetics-400, 1600 ep | 12 | BDD100K + CAP-DATA | simpletad_dapt_videomae-b_ep12.pth |
| VideoMAE | ViT-L | Kinetics-400, 1600 ep | 12 | BDD100K + CAP-DATA | simpletad_dapt_videomae-l_ep12.pth |
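A minimal sketch of loading one of the DAPT checkpoints above into a compatible backbone follows; `build_videomae_vit_small` is a hypothetical constructor standing in for the repo's actual model builders, and the checkpoint key layout is an assumption.

```python
import torch

# Hypothetical loading sketch; see the repo for the real entry points.
ckpt = torch.load("simpletad_dapt_videomae-s_ep12.pth", map_location="cpu")
state = ckpt.get("model", ckpt)     # weights are often nested under "model" (assumed)

model = build_videomae_vit_small()  # assumed constructor, not a real API
missing, unexpected = model.load_state_dict(state, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```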
Fine-tuned on DoTA
Fine-tuned on DADA-2000
☎️ Contact
Svetlana Orlova: [email protected], [email protected]
👍 Acknowledgements
Our code is mainly based on the VideoMAE codebase.
For Video ViTs that share an identical architecture, we used only their released weights:
ViViT,
VideoMAE2,
SMILE,
SIGMA,
MME,
MGMAE.
We used fragments of the original implementations of
MVD,
InternVideo2,
and UMT to integrate these models with our codebase.
✏️ Citation
If you find this project helpful, please feel free to like it ❤️ and cite our paper:
@inproceedings{orlova2025simplifying,
title={Simplifying Traffic Anomaly Detection with Video Foundation Models},
author={Orlova, Svetlana and Kerssies, Tommie and Englert, Brun{\'o} B and Dubbelman, Gijs},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2025}
}
@article{orlova2025simplifying_arxiv,
title={Simplifying Traffic Anomaly Detection with Video Foundation Models},
author={Orlova, Svetlana and Kerssies, Tommie and Englert, Brun{\'o} B and Dubbelman, Gijs},
journal={arXiv preprint arXiv:2507.09338},
year={2025}
}