---
license: openrail
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---
# VIDEO-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

This is the official implementation of Video-RTS.

[![Project Website](https://img.shields.io/badge/Project-Website-blue)](https://sites.google.com/cs.unc.edu/videorts2025/) [![arXiv](https://img.shields.io/badge/arXiv-2507.06485-b31b1b.svg)](https://arxiv.org/abs/2507.06485) [![HuggingFace](https://img.shields.io/badge/🤗-HuggingFace%20-cyan.svg)](https://huggingface.co/Ted412/Video-RTS)

### Authors: [Ziyang Wang*](https://ziyangw2000.github.io/), [Jaehong Yoon*](https://jaehong31.github.io/), [Shoubin Yu](https://yui010206.github.io/), [Md Mohaiminul Islam](https://md-mohaiminul.github.io/), [Gedas Bertasius](https://www.gedasbertasius.com/), [Mohit Bansal](https://www.cs.unc.edu/~mbansal/)

### University of North Carolina at Chapel Hill

We introduce Video-RTS, a new approach that improves video reasoning capability with drastically better data efficiency by combining data-efficient reinforcement learning (RL) with a video-adaptive test-time scaling (TTS) strategy.
## **Installation**

```bash
git clone https://github.com/Ziyang412/Video-RTS.git
cd Video-RTS

# Build the environment
conda create -n video-rts python=3.11
conda activate video-rts
bash setup.sh

# Qwen video extraction settings (e.g., max frames, resolution);
# install with the [decord] extra for faster video decoding
cd src/qwen-vl-utils
pip install -e .[decord]
cd ../..
```
Following Video-R1, please install the provided version of `transformers`:

```bash
unzip transformers-main.zip
cd ./transformers-main
pip install .
```
## **Download Dataset**
Please refer to each dataset's official GitHub repository to download the videos.

For evaluation, we provide the annotation files in `./src/r1-v/Evaluation`; please use `./src/r1-v/Evaluation/path_coversion.py` to update the video paths.
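The path update performed by the conversion script can be sketched roughly as follows. This is a minimal illustration only: the `update_video_paths` helper, the `video_path` key, and the example paths are assumptions, not the script's actual code.

```python
def update_video_paths(annotations, old_prefix, new_prefix):
    """Rewrite the video path prefix in each annotation record.

    Assumes each record stores its clip location under a "video_path"
    key; the real annotation schema may differ.
    """
    for record in annotations:
        path = record["video_path"]
        if path.startswith(old_prefix):
            record["video_path"] = new_prefix + path[len(old_prefix):]
    return annotations

# Point the annotations at your local video directory.
anns = [{"question": "What happens first?", "video_path": "/data/videos/clip_001.mp4"}]
anns = update_video_paths(anns, "/data/videos", "/my/local/videos")
print(anns[0]["video_path"])  # /my/local/videos/clip_001.mp4
```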
For training, we provide the training data annotations in `./src/training_data`; please refer to the [CG-Bench](https://huggingface.co/datasets/CG-Bench/CG-Bench) repo for the video data.
## **Download Video-RTS model checkpoint**
We provide the model checkpoint on [Hugging Face](https://huggingface.co/Ted412/Video-RTS). Note that the model is trained on only about 2k samples, yet it yields performance similar to training on 6k samples.
## **Video-RTS Training**

We use [Open-R1-Video](https://github.com/Wang-Xiaodong1899/Open-R1-Video) as the training codebase. We provide our modified files in `./src/training_files`; please replace the corresponding files in the original repo. You can also use [Video-R1](https://github.com/tulerfeng/Video-R1/tree/main) as the training codebase; we find the results are similar.
## **Inference with S2D Video TTS**

Please update the input model, file names, and output file in the provided bash script. After running the inference code, update the `json_path` in `cal_results_acc.py` to compute the final video reasoning accuracy.

```bash
bash src/video_rts_eval.sh
python src/cal_results_acc.py
```
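Conceptually, sparse-to-dense (S2D) video TTS samples several answers from sparsely sampled frames first and escalates to denser frame sampling only when those answers disagree. The control flow can be sketched as below; this is a heavily simplified illustration, and `s2d_infer`, the frame budgets, and the toy model are placeholders rather than the repo's actual implementation:

```python
from collections import Counter

def s2d_infer(generate_answer, frame_budgets=(16, 32, 64), num_votes=4):
    """Sparse-to-dense inference: sample answers at each frame budget and
    stop as soon as all sampled answers agree (self-consistency)."""
    answer = None
    for num_frames in frame_budgets:
        votes = [generate_answer(num_frames) for _ in range(num_votes)]
        answer, count = Counter(votes).most_common(1)[0]
        if count == num_votes:  # unanimous -> confident, stop escalating
            return answer
    return answer  # otherwise fall back to the majority vote at the densest budget

# Toy stand-in model: inconsistent below 32 frames, consistent at 32 and above.
calls = {"n": 0}
def toy_model(num_frames):
    calls["n"] += 1
    if num_frames < 32:
        return "A" if calls["n"] % 2 else "B"  # flip-flops between answers
    return "A"

print(s2d_infer(toy_model))  # A (disagreement at 16 frames, agreement at 32)
```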
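The accuracy computation in `cal_results_acc.py` presumably amounts to comparing predictions against ground-truth answers over the saved results. A minimal sketch, assuming hypothetical `pred` and `answer` fields in the results JSON (the actual field names and matching logic may differ):

```python
def compute_accuracy(results):
    """Fraction of records whose prediction matches the ground truth
    (case-insensitive exact match)."""
    if not results:
        return 0.0
    correct = sum(
        1 for r in results if r["pred"].strip().lower() == r["answer"].strip().lower()
    )
    return correct / len(results)

results = [
    {"pred": "A", "answer": "A"},
    {"pred": "B", "answer": "C"},
    {"pred": "d", "answer": "D"},
]
print(f"Accuracy: {compute_accuracy(results):.2%}")  # Accuracy: 66.67%
```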
## Acknowledgments
We thank the developers of [Open-R1-Video](https://github.com/Wang-Xiaodong1899/Open-R1-Video), [Video-R1](https://github.com/tulerfeng/Video-R1/tree/main), [Qwen-2.5-VL](https://github.com/QwenLM/Qwen2.5-VL/tree/main) and [TRL](https://github.com/huggingface/trl) for their public code release.
## Reference
Please cite our paper if you use our models in your work:

```bibtex
@misc{wang2025videortsrethinkingreinforcementlearning,
      title={Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning},
      author={Ziyang Wang and Jaehong Yoon and Shoubin Yu and Md Mohaiminul Islam and Gedas Bertasius and Mohit Bansal},
      year={2025},
      eprint={2507.06485},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.06485},
}
```