Add pipeline tag, links to paper, code, and project page
#1
opened by nielsr (HF Staff)

README.md CHANGED
@@ -1,19 +1,29 @@
 ---
-
+base_model:
+- google/siglip-so400m-patch14-384
+- Qwen/Qwen2.5-0.5B-Instruct
 datasets:
 - THUdyh/Oryx-SFT-Data
 language:
 - en
 - zh
+library_name: transformers
+license: cc-by-nc-4.0
 metrics:
 - accuracy
-
-
-
-
+pipeline_tag: video-text-to-text
+tags:
+- video-understanding
+- multimodal
 ---
+
 # LLaVA-Scissor-baseline-0.5B
 
+The model was presented in the paper [LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs](https://huggingface.co/papers/2506.21862).
+
+Project page: [https://humanmllm.github.io/LLaVA-Scissor](https://humanmllm.github.io/LLaVA-Scissor)
+Code: [https://github.com/HumanMLLM/LLaVA-Scissor](https://github.com/HumanMLLM/LLaVA-Scissor)
+
 ## Model Summary
 This repository contains the baseline model used in LLaVA-Scissor.
 This model is an enhanced version of [LLaVA-OneVision](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov) model with [SIGLIP](https://huggingface.co/google/siglip-so400m-patch14-384) vision encoder and [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) large language model and is finetuned with [Oryx](https://huggingface.co/datasets/THUdyh/Oryx-SFT-Data) data.
@@ -70,7 +80,8 @@ image_tensors.append(frames)
 
 # Prepare conversation input
 conv_template = "qwen_2"
-question = f"{DEFAULT_IMAGE_TOKEN}\nDescribe this video."
+question = f"{DEFAULT_IMAGE_TOKEN}
+Describe this video."
 conv = copy.deepcopy(conv_templates[conv_template])
 conv.append_message(conv.roles[0], question)
 conv.append_message(conv.roles[1], None)