Add pipeline tag, links to paper, code, and project page

#1 by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +17 -6
README.md CHANGED
@@ -1,19 +1,29 @@
  ---
- license: cc-by-nc-4.0
+ base_model:
+ - google/siglip-so400m-patch14-384
+ - Qwen/Qwen2.5-0.5B-Instruct
  datasets:
  - THUdyh/Oryx-SFT-Data
  language:
  - en
  - zh
+ library_name: transformers
+ license: cc-by-nc-4.0
  metrics:
  - accuracy
- base_model:
- - google/siglip-so400m-patch14-384
- - Qwen/Qwen2.5-0.5B-Instruct
- library_name: transformers
+ pipeline_tag: video-text-to-text
+ tags:
+ - video-understanding
+ - multimodal
  ---
+
  # LLaVA-Scissor-baseline-0.5B

+ The model was presented in the paper [LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs](https://huggingface.co/papers/2506.21862).
+
+ Project page: [https://humanmllm.github.io/LLaVA-Scissor](https://humanmllm.github.io/LLaVA-Scissor)
+ Code: [https://github.com/HumanMLLM/LLaVA-Scissor](https://github.com/HumanMLLM/LLaVA-Scissor)
+
  ## Model Summary
  This repository contains the baseline model used in LLaVA-Scissor.
  This model is an enhanced version of [LLaVA-OneVision](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov) model with [SIGLIP](https://huggingface.co/google/siglip-so400m-patch14-384) vision encoder and [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) large language model and is finetuned with [Oryx](https://huggingface.co/datasets/THUdyh/Oryx-SFT-Data) data.
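To sanity-check the card metadata once this change is merged, here is a minimal sketch using `huggingface_hub` (the repo id below is an assumption for illustration; it is not stated in this diff):

```python
# Minimal sketch: read back the model-card metadata this PR adds.
# Assumption: the repo id is not stated in the diff; adjust as needed.
from huggingface_hub import ModelCard

card = ModelCard.load("HumanMLLM/LLaVA-Scissor-baseline-0.5B")
meta = card.data  # ModelCardData parsed from the YAML front matter

print(meta.pipeline_tag)   # expected per the diff: "video-text-to-text"
print(meta.library_name)   # expected per the diff: "transformers"
print(meta.base_model)     # expected per the diff: the SigLIP and Qwen2.5-0.5B-Instruct repos
print(meta.tags)           # expected per the diff to include "video-understanding" and "multimodal"
```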
@@ -70,7 +80,8 @@ image_tensors.append(frames)

  # Prepare conversation input
  conv_template = "qwen_2"
- question = f"{DEFAULT_IMAGE_TOKEN}\nDescribe this video."
+ question = f"{DEFAULT_IMAGE_TOKEN}
+ Describe this video."
  conv = copy.deepcopy(conv_templates[conv_template])
  conv.append_message(conv.roles[0], question)
  conv.append_message(conv.roles[1], None)