# VIDEO-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
This is the official implementation of Video-RTS.

[Project Website](https://sites.google.com/cs.unc.edu/videorts2025/) [arXiv Paper](https://arxiv.org/abs/2507.06485) [Model Checkpoint](https://huggingface.co/Ted412/Video-RTS)

### Authors: [Ziyang Wang*](https://ziyangw2000.github.io/), [Jaehong Yoon*](https://jaehong31.github.io/), [Shoubin Yu](https://yui010206.github.io/), [Md Mohaiminul Islam](https://md-mohaiminul.github.io/), [Gedas Bertasius](https://www.gedasbertasius.com/), [Mohit Bansal](https://www.cs.unc.edu/~mbansal/)

### University of North Carolina at Chapel Hill

We introduce Video-RTS, a new approach that improves video reasoning capability with drastically better data efficiency by combining data-efficient reinforcement learning (RL) with a video-adaptive test-time scaling (TTS) strategy.
## **Installation**

```bash
git clone https://github.com/Ziyang412/Video-RTS.git
cd Video-RTS

# Build the environment
conda create -n video-rts python=3.11
conda activate video-rts
bash setup.sh

# Configure Qwen video extraction (e.g., max frames, resolution).
# Install with the [decord] extra for faster video decoding.
cd src/qwen-vl-utils
pip install -e .[decord]
cd ../..
```
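For reference, the max-frame and resolution settings are passed per video message and consumed by `qwen_vl_utils`. Below is a minimal sketch following the public Qwen2.5-VL usage examples; the video path, pixel budget, and fps value are placeholders, not settings prescribed by Video-RTS.

```python
# Minimal sketch of per-video extraction settings, following the public
# Qwen2.5-VL examples; the path and limits below are placeholders.
from qwen_vl_utils import process_vision_info

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",  # placeholder path
                "max_pixels": 360 * 420,  # cap on per-frame resolution
                "fps": 1.0,  # frame sampling rate ("nframes" caps total frames)
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# With the [decord] extra installed above, decoding uses decord for speed.
image_inputs, video_inputs = process_vision_info(messages)
```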
Following Video-R1, please install the provided version of `transformers`:

```bash
unzip transformers-main.zip
cd ./transformers-main
pip install .
cd ..
```
## **Download Dataset**

Please refer to the official GitHub repository of each dataset to download the videos.

For evaluation, we provide the annotation files in `./src/r1-v/Evaluation`; please use `./src/r1-v/Evaluation/path_coversion.py` to update the video paths.
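For illustration, updating the video paths boils down to rewriting a path prefix in each annotation entry. The sketch below is hypothetical: the annotation file name, the `video` field, and both prefixes are assumptions, so treat the provided `path_coversion.py` as the source of truth.

```python
# Hypothetical sketch of the video-path update; the file name, the "video"
# field, and both prefixes are assumptions -- path_coversion.py is the
# source of truth.
import json

OLD_PREFIX = "/data/videos"      # assumed prefix used in the annotations
NEW_PREFIX = "/your/video/root"  # your local video directory

with open("annotations.json") as f:
    items = json.load(f)

for item in items:
    item["video"] = item["video"].replace(OLD_PREFIX, NEW_PREFIX, 1)

with open("annotations_local.json", "w") as f:
    json.dump(items, f, indent=2)
```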
For training, we provide the training data annotations in `./src/training_data`; please refer to the [CG-Bench](https://huggingface.co/datasets/CG-Bench/CG-Bench) repo for the video data.
## **Download Video-RTS model checkpoint**

We provide the model checkpoint on [Hugging Face](https://huggingface.co/Ted412/Video-RTS). Note that this model is trained on only about 2k samples, yet it yields performance similar to training on 6k samples.
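As a quick sanity check, the checkpoint should load with the standard Qwen2.5-VL classes from the pinned `transformers` install; this is a minimal sketch assuming the checkpoint keeps the Qwen2.5-VL format.

```python
# Minimal loading sketch, assuming the checkpoint keeps the standard
# Qwen2.5-VL format supported by the pinned transformers version.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Ted412/Video-RTS",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Ted412/Video-RTS")
```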
## **Video-RTS Training**

We use [Open-R1-Video](https://github.com/Wang-Xiaodong1899/Open-R1-Video) as the training codebase. We provide our modified files in `./src/training_files`; please replace the corresponding files in the original repo with them. You can also use [Video-R1](https://github.com/tulerfeng/Video-R1/tree/main) as the training codebase; we find the results are similar.
## **Inference with S2D Video TTS**

Please update the input model, annotation file name, and output file in the provided bash script. After running the inference code, please update `json_path` in `cal_results_acc.py` to compute the final video reasoning accuracy.

```bash
bash src/video_rts_eval.sh
python src/cal_results_acc.py
```
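For intuition, sparse-to-dense (S2D) video TTS starts from sparsely sampled frames and only re-runs with denser sampling when the sampled reasoning traces fail to agree. The sketch below is a conceptual reading of that loop, not the logic of `video_rts_eval.sh`; `generate_answers` and the frame schedule are hypothetical placeholders.

```python
# Conceptual sketch of sparse-to-dense (S2D) test-time scaling; the
# generate_answers callable and the frame schedule are hypothetical,
# not the implementation behind video_rts_eval.sh.
from collections import Counter

def s2d_infer(video, question, generate_answers, frame_schedule=(16, 32, 64)):
    best = None
    for num_frames in frame_schedule:
        # Sample several reasoning traces at the current frame budget.
        answers = generate_answers(video, question, num_frames, num_samples=5)
        best, votes = Counter(answers).most_common(1)[0]
        # Stop once the traces reach a consensus; otherwise densify frames.
        if votes == len(answers):
            return best
    return best  # fall back to the majority vote at the densest budget
```

The appeal of this design is that easy questions terminate with few frames, so extra computation is spent only on videos where sparse sampling is insufficient.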
## Acknowledgments

We thank the developers of [Open-R1-Video](https://github.com/Wang-Xiaodong1899/Open-R1-Video), [Video-R1](https://github.com/tulerfeng/Video-R1/tree/main), [Qwen-2.5-VL](https://github.com/QwenLM/Qwen2.5-VL/tree/main), and [TRL](https://github.com/huggingface/trl) for their public code release.

# Reference

Please cite our paper if you use our models in your work:
```bibtex
@misc{wang2025videortsrethinkingreinforcementlearning,
      title={Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning},
      author={Ziyang Wang and Jaehong Yoon and Shoubin Yu and Md Mohaiminul Islam and Gedas Bertasius and Mohit Bansal},
      year={2025},
      eprint={2507.06485},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.06485},
}
```