---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- andaba/TEMPURA-VER
library_name: transformers
tags:
- text-generation-inference
pipeline_tag: video-text-to-text
---

# TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action

TEMPURA enhances video temporal understanding by integrating causal reasoning with fine-grained temporal segmentation. It uses a two-stage training framework: first, masked event prediction reasoning reconstructs missing events and generates causal explanations; second, it learns video segmentation and dense captioning to decompose videos into non-overlapping events with detailed, timestamp-aligned descriptions.

TEMPURA is trained on VER, a large-scale dataset of 1M training instances drawn from 500K videos, with temporally aligned event descriptions and structured reasoning steps. It outperforms strong baseline models on temporal grounding and highlight detection benchmarks.

[Project Page](https://andy-cheng.github.io/TEMPURA/) | [arXiv Preprint](https://arxiv.org/abs/2505.01583) | [VER Dataset](https://huggingface.co/datasets/andaba/TEMPURA-VER) | [Github Repo](https://github.com/TH14/TEMPURA/)

## Model Weights

- [TEMPURA-Qwen2.5-VL-3B-s1](https://huggingface.co/andaba/TEMPURA-Qwen2.5-VL-3B-s1)
- [TEMPURA-Qwen2.5-VL-3B-s2](https://huggingface.co/andaba/TEMPURA-Qwen2.5-VL-3B-s2)

## Citing TEMPURA

If you find our paper or dataset useful, please consider citing our work!

```tex
@article{tempura,
  title={TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action},
  author={Jen-Hao Cheng and Vivian Wang and Huayu Wang and Huapeng Zhou and Yi-Hao Peng and Hou-I Liu and Hsiang-Wei Huang and Kuang-Ming Chen and Cheng-Yen Yang and Wenhao Chai and Yi-Ling Chen and Vibhav Vineet and Qin Cai and Jenq-Neng Hwang},
  journal={arXiv preprint arXiv:2505.01583},
  year={2025}
}
```
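
## Inference Example

Below is a minimal inference sketch, not an official script from the TEMPURA repo. It assumes the s2 checkpoint loads with the stock Qwen2.5-VL classes in `transformers` and the `qwen_vl_utils` helper used by the base model (reasonable since the base model is Qwen/Qwen2.5-VL-3B-Instruct); the video path, sampling rate, and prompt are placeholders.

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper shipped with the Qwen2.5-VL base model

# Assumption: the TEMPURA checkpoint loads with the standard Qwen2.5-VL classes.
model_id = "andaba/TEMPURA-Qwen2.5-VL-3B-s2"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder video and prompt; replace with your own.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},
            {"type": "text", "text": "Segment the video into non-overlapping events and describe each with start/end timestamps."},
        ],
    }
]

# Build the chat prompt and extract the sampled video frames.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate and strip the prompt tokens from the output before decoding.
output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

For the exact prompts and settings used for dense captioning and masked event prediction, refer to the GitHub repo linked above.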