---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- andaba/TEMPURA-VER
library_name: transformers
license: cc-by-4.0
tags:
- text-generation-inference
pipeline_tag: video-text-to-text
---

# Model Card for TEMPURA

This model card describes TEMPURA, a vision-language model that reasons about causal event relationships and generates fine-grained, timestamped descriptions of untrimmed videos.

## Model Details

### Model Description

TEMPURA enhances video temporal understanding by integrating causal reasoning with fine-grained temporal segmentation. More details can be found on the [project page](https://andy-cheng.github.io/TEMPURA/).

- **Developed by:** Jen-Hao Cheng, Vivian Wang, Huayu Wang, Huapeng Zhou, Yi-Hao Peng, Hou-I Liu, Hsiang-Wei Huang, Kuang-Ming Chen, Cheng-Yen Yang, Wenhao Chai, Yi-Ling Chen, Vibhav Vineet, Qin Cai, Jenq-Neng Hwang
- **Model type:** Video-Language Model
- **Language(s) (NLP):** English
- **License:** cc-by-4.0
- **Finetuned from model:** Qwen/Qwen2.5-VL-3B-Instruct

### Model Sources

- **Repository:** [https://github.com/andy-cheng/TEMPURA](https://github.com/andy-cheng/TEMPURA)
- **Paper:** [TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action](https://huggingface.co/papers/2505.01583)
- **Project Page:** [https://andy-cheng.github.io/TEMPURA/](https://andy-cheng.github.io/TEMPURA/)

## Uses

### Direct Use

The model can be used directly for temporal grounding and highlight detection in videos.

### Downstream Use [optional]

The model can be fine-tuned for applications that require temporal video understanding, such as video summarization, event extraction, and question answering.

### Out-of-Scope Use

The model may not perform well on videos whose visual style or language differs significantly from the training data.

## Bias, Risks, and Limitations

The model's performance is influenced by biases present in the VER dataset. Further analysis is needed to fully characterize these biases.
### Recommendations

Users should be aware of potential biases in the model's outputs.

## How to Get Started with the Model

Inference: Please check the [inference example](https://github.com/Andy-Cheng/TEMPURA?tab=readme-ov-file#inference).

Training: Please check the [model training script](https://github.com/Andy-Cheng/TEMPURA?tab=readme-ov-file#training).

## Training Details

### Training Data

The model was trained on the VER dataset ([https://huggingface.co/datasets/andaba/TEMPURA-VER](https://huggingface.co/datasets/andaba/TEMPURA-VER)).

### Training Procedure

The training procedure involves masked event prediction and video event segmentation with temporal dense captioning. See the training scripts in the repository for details.

#### Training Hyperparameters

- **Training regime:** [More Information Needed]

#### Speeds, Sizes, Times

[More Information Needed]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

[More Information Needed]

#### Factors

[More Information Needed]

#### Metrics

[More Information Needed]

### Results

[More Information Needed]

#### Summary

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

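
As a rough illustration of the kind of estimate such a calculator produces, the sketch below assumes emissions scale as average hardware power × training hours × the carbon intensity of the local grid. The function name `estimate_co2_kg`, its parameters, and all numeric inputs are hypothetical placeholders, not TEMPURA's actual training figures, and the calculator's own method may account for additional factors.

```python
def estimate_co2_kg(power_watts: float, hours: float, grid_kg_per_kwh: float) -> float:
    """Estimate emissions in kg CO2eq as energy used (kWh) times grid carbon intensity.

    This is a simplified model under the stated assumption; it ignores
    data-center overhead and any carbon offsets.
    """
    if power_watts < 0 or hours < 0 or grid_kg_per_kwh < 0:
        raise ValueError("all inputs must be non-negative")
    # Convert average power draw (W) and runtime (h) into energy (kWh).
    energy_kwh = power_watts / 1000.0 * hours
    # Scale energy by the grid's carbon intensity (kg CO2eq per kWh).
    return energy_kwh * grid_kg_per_kwh


if __name__ == "__main__":
    # Placeholder inputs for illustration only (hypothetical values).
    estimate = estimate_co2_kg(power_watts=300.0, hours=10.0, grid_kg_per_kwh=0.4)
    print(f"{estimate:.2f} kg CO2eq")
```

Once the hardware type, hours used, and compute region are known, the same arithmetic can be repeated with the real values to fill in the fields of this section.
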
- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation

**BibTeX:**

```tex
@article{tempura,
  title={TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action},
  author={Jen-Hao Cheng and Vivian Wang and Huayu Wang and Huapeng Zhou and Yi-Hao Peng and Hou-I Liu and Hsiang-Wei Huang and Kuang-Ming Chen and Cheng-Yen Yang and Wenhao Chai and Yi-Ling Chen and Vibhav Vineet and Qin Cai and Jenq-Neng Hwang},
  journal={arXiv preprint arXiv:2505.01583},
  year={2025}
}
```

**APA:**

Cheng, J.-H., Wang, V., Wang, H., Zhou, H., Peng, Y.-H., Liu, H.-I., Huang, H.-W., Chen, K.-M., Yang, C.-Y., Chai, W., Chen, Y.-L., Vineet, V., Cai, Q., & Hwang, J.-N. (2025). *TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action*. arXiv preprint arXiv:2505.01583.

## Model Card Contact

Jen-Hao Cheng, andyhci@uw.edu