---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- andaba/TEMPURA-VER
library_name: transformers
tags:
- text-generation-inference
pipeline_tag: video-text-to-text
---

# TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action

TEMPURA enhances video temporal understanding by integrating causal reasoning with fine-grained temporal segmentation. It uses a two-stage training framework: first, masked event prediction reasoning reconstructs missing events and generates causal explanations; second, it learns video segmentation and dense captioning to decompose videos into non-overlapping events with detailed, timestamp-aligned descriptions.

TEMPURA is trained on VER, a large-scale dataset of 1M training instances drawn from 500K videos, with temporally aligned event descriptions and structured reasoning steps. It outperforms strong baseline models on temporal grounding and highlight detection benchmarks.

[Project Page](https://andy-cheng.github.io/TEMPURA/) | [arXiv Preprint](https://arxiv.org/abs/2505.01583) | [VER Dataset](https://huggingface.co/datasets/andaba/TEMPURA-VER) | [Github Repo](https://github.com/TH14/TEMPURA/)

## Model Weights

- [TEMPURA-Qwen2.5-VL-3B-s1](https://huggingface.co/andaba/TEMPURA-Qwen2.5-VL-3B-s1)
- [TEMPURA-Qwen2.5-VL-3B-s2](https://huggingface.co/andaba/TEMPURA-Qwen2.5-VL-3B-s2)

## Citing TEMPURA

If you find our paper or dataset useful, please consider citing our work!

```tex
@article{tempura,
  title={TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action},
  author={Jen-Hao Cheng and Vivian Wang and Huayu Wang and Huapeng Zhou and Yi-Hao Peng and Hou-I Liu and Hsiang-Wei Huang and Kuang-Ming Chen and Cheng-Yen Yang and Wenhao Chai and Yi-Ling Chen and Vibhav Vineet and Qin Cai and Jenq-Neng Hwang},
  journal={arXiv preprint arXiv:2505.01583},
  year={2025}
}
```
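
## Inference Example

Below is a minimal inference sketch, not an official script from the TEMPURA repo. It assumes the s2 checkpoint loads with the stock Qwen2.5-VL classes in `transformers` and the `qwen_vl_utils` helper used by the base model (reasonable since the base model is Qwen/Qwen2.5-VL-3B-Instruct); the video path, sampling rate, and prompt are placeholders.

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper shipped with the Qwen2.5-VL base model

# Assumption: the TEMPURA checkpoint loads with the standard Qwen2.5-VL classes.
model_id = "andaba/TEMPURA-Qwen2.5-VL-3B-s2"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder video and prompt; replace with your own.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},
            {"type": "text", "text": "Segment the video into non-overlapping events and describe each with start/end timestamps."},
        ],
    }
]

# Build the chat prompt and extract the sampled video frames.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate and strip the prompt tokens from the output before decoding.
output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

For the exact prompts and settings used for dense captioning and masked event prediction, refer to the GitHub repo linked above.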