---
license: apache-2.0
datasets:
- Yongxin-Guo/VTG-IT
tags:
- dense-video-caption
- video-highlight-detection
- video-summarization
- moment-retrieval
---

[VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding](https://arxiv.org/abs/2405.13382)

## Overview

We introduce:

- VTG-IT-120K, a high-quality and comprehensive instruction tuning dataset that covers video temporal grounding (VTG) tasks such as moment retrieval (63.2K), dense video captioning (37.2K), video summarization (15.2K), and video highlight detection (3.9K).
- VTG-LLM, which (1) effectively integrates timestamp knowledge into visual tokens; (2) incorporates absolute-time tokens that specifically handle timestamp knowledge, thereby avoiding concept shifts; and (3) introduces a lightweight, high-performance slot-based token compression method to facilitate the sampling of more video frames.
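
The per-task counts listed for VTG-IT-120K sum to roughly 119.5K, which the dataset name rounds to 120K. The following minimal sketch works through that arithmetic and the share of each task. The counts are copied from the list above; the names `task_counts_k` and `task_shares` are illustrative and are not part of the VTG-LLM codebase.

```python
# Per-task sample counts in thousands, as listed in the overview.
task_counts_k = {
    "moment retrieval": 63.2,
    "dense video captioning": 37.2,
    "video summarization": 15.2,
    "video highlight detection": 3.9,
}


def task_shares(counts):
    """Return each task's fraction of the total sample count."""
    total = sum(counts.values())
    return {task: count / total for task, count in counts.items()}


total_k = sum(task_counts_k.values())
shares = task_shares(task_counts_k)

# Print one row per task: name, count in thousands, share of the total.
for task, share in shares.items():
    print(f"{task:<28}{task_counts_k[task]:>6.1f}K  {share:6.1%}")
print(f"{'total':<28}{total_k:>6.1f}K")
```

Moment retrieval accounts for just over half of the listed samples, while video highlight detection is the smallest component.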

## How to Use

Please refer to the [GitHub repo](https://github.com/gyxxyg/VTG-LLM) for details.

## Citation

If you find this repository helpful for your project, please consider citing:

```
@article{guo2024vtg,
  title={VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding},
  author={Guo, Yongxin and Liu, Jingyu and Li, Mingda and Tang, Xiaoying and Chen, Xi and Zhao, Bo},
  journal={arXiv preprint arXiv:2405.13382},
  year={2024}
}
```