Simplifying Traffic Anomaly Detection with Video Foundation Models
Svetlana Orlova, Tommie Kerssies, Brunó B. Englert, Gijs Dubbelman
Eindhoven University of Technology
Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Therefore, in this work, we investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance. We find that: (i) advanced pre-training enables simple encoder-only models to match or even surpass the performance of specialized state-of-the-art TAD methods, while also being significantly more efficient; (ii) although weakly- and fully-supervised pre-training are advantageous on standard benchmarks, they are less effective for TAD; instead, self-supervised Masked Video Modeling (MVM) provides the strongest signal; and (iii) Domain-Adaptive Pre-Training (DAPT) on unlabeled driving videos further improves downstream performance, without requiring anomalous examples. Our findings highlight the importance of pre-training and show that effective, efficient, and scalable TAD models can be built with minimal architectural complexity.
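To make the encoder-only design concrete, below is a minimal sketch: a plain Video ViT backbone followed by a linear head that produces per-clip anomaly logits. The wrapper, shapes, and defaults are illustrative assumptions, not the repo's exact code.

```python
import torch
import torch.nn as nn

class EncoderOnlyTAD(nn.Module):
    """Plain Video ViT encoder + linear head, in the spirit of the
    encoder-only setup described above. Illustrative sketch only;
    the real model builders live in the simple-tad repo."""

    def __init__(self, backbone: nn.Module, embed_dim: int = 384, num_classes: int = 2):
        super().__init__()
        self.backbone = backbone            # e.g. a VideoMAE ViT-S encoder (assumed)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, C, T, H, W); the backbone is assumed to return
        # patch tokens of shape (B, N, D)
        tokens = self.backbone(clips)
        pooled = self.norm(tokens.mean(dim=1))  # mean-pool over tokens
        return self.head(pooled)                # per-clip anomaly logits
```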
✨ DoTA and DADA-2000 results
Video ViT-based encoder-only models set a new state of the art on both datasets, while being significantly more efficient than top-performing specialized methods. FPS measured on an NVIDIA A100 MIG slice (1/2 GPU). † From prior work. ‡ Optimistic estimates using publicly available components of the model. “A→B”: trained on A, tested on B; D2K: DADA-2000.
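The exact benchmarking script is not reproduced here, but a throughput measurement of this kind typically looks like the sketch below; the function name, clip shape, and iteration counts are illustrative assumptions, and absolute numbers depend on the GPU or MIG slice used.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, clip_shape=(1, 3, 16, 224, 224), warmup=10, iters=100):
    """Rough FPS estimate for a video model; illustrative sketch only."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(clip_shape, device=device)
    for _ in range(warmup):           # warm-up to stabilize clocks and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()      # wait for queued kernels before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    frames = iters * clip_shape[0] * clip_shape[2]  # batch * clip length
    return frames / (time.perf_counter() - start)
```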
🧩 Code
Check out our GitHub repo: simple-tad
📍 Model Zoo
DAPT (adapted) models
| Method | Backbone | Initialized with | DAPT epochs | DAPT data | Checkpoint |
|---|---|---|---|---|---|
| VideoMAE | ViT-S | Kinetics-400, 1600 ep | 12 | Kinetics-700 | simpletad_dapt-k700_videomae-s_ep12.pth |
| VideoMAE | ViT-S | Kinetics-400, 1600 ep | 12 | BDD100K | simpletad_dapt-onlybdd_videomae-s_ep12.pth |
| VideoMAE | ViT-S | Kinetics-400, 1600 ep | 12 | BDD100K + CAP-DATA | simpletad_dapt_videomae-s_ep12.pth |
| VideoMAE | ViT-B | Kinetics-400, 1600 ep | 12 | BDD100K + CAP-DATA | simpletad_dapt_videomae-b_ep12.pth |
| VideoMAE | ViT-L | Kinetics-400, 1600 ep | 12 | BDD100K + CAP-DATA | simpletad_dapt_videomae-l_ep12.pth |
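A minimal sketch of loading one of the DAPT checkpoints above into a compatible backbone follows; `build_videomae_vit_small` is a hypothetical constructor standing in for the repo's actual model builders, and the checkpoint key layout is an assumption.

```python
import torch

# Hypothetical loading sketch; see the repo for the real entry points.
ckpt = torch.load("simpletad_dapt_videomae-s_ep12.pth", map_location="cpu")
state = ckpt.get("model", ckpt)     # weights are often nested under "model" (assumed)

model = build_videomae_vit_small()  # assumed constructor, not a real API
missing, unexpected = model.load_state_dict(state, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```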
Fine-tuned on DoTA
Fine-tuned on DADA-2000
☎️ Contact
Svetlana Orlova: [email protected], [email protected]
👍 Acknowledgements
Our code is mainly based on the VideoMAE codebase.
For Video ViTs that share an identical architecture, we used only their released weights:
ViViT,
VideoMAE2,
SMILE,
SIGMA,
MME,
MGMAE.
We used fragments of the original implementations of
MVD,
InternVideo2,
and UMT to integrate these models with our codebase.
✏️ Citation
If you find this project helpful, please feel free to like it ❤️ and cite our paper:
@inproceedings{orlova2025simplifying,
title={Simplifying Traffic Anomaly Detection with Video Foundation Models},
author={Orlova, Svetlana and Kerssies, Tommie and Englert, Brun{\'o} B and Dubbelman, Gijs},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2025}
}
@article{orlova2025simplifying_arxiv,
title={Simplifying Traffic Anomaly Detection with Video Foundation Models},
author={Orlova, Svetlana and Kerssies, Tommie and Englert, Brun{\'o} B and Dubbelman, Gijs},
journal={arXiv preprint arXiv:2507.09338},
year={2025}
}