BBBBCHAN nielsr (HF Staff) committed
Commit 444cc11 · verified · parent: 1b452c4

Improve model card: Add pipeline tag, paper link, and GitHub repository link (#1)

- Improve model card: Add pipeline tag, paper link, and GitHub repository link (b8e0608de4d13fc308a7f7eb0dd913957341729c)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1):
  1. README.md +18 -12
README.md CHANGED
@@ -1,16 +1,23 @@
 ---
-license: cc-by-nc-4.0
+base_model:
+- google/siglip-so400m-patch14-384
+- Qwen/Qwen2.5-7B-Instruct
 datasets:
 - THUdyh/Oryx-SFT-Data
 language:
 - en
 - zh
+library_name: transformers
+license: cc-by-nc-4.0
 metrics:
 - accuracy
-base_model:
-- google/siglip-so400m-patch14-384
-- Qwen/Qwen2.5-7B-Instruct
-library_name: transformers
+pipeline_tag: video-text-to-text
+tags:
+- llava
+- llava-scissor
+- llava-onevision
+- llava-ov
+- token-compression
 model-index:
 - name: llava-onevision-qwen-7b-ov
   results:
@@ -74,16 +81,14 @@ model-index:
       value: 40.55
       name: accuracy
       verified: true
-tags:
-- llava
-- llava-scissor
-- llava-onevision
-- llava-ov
-- token-compression
 ---
 
 # LLaVA-Scissor-baseline-7B
 
+This repository contains the baseline model for [LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs](https://huggingface.co/papers/2506.21862).
+
+Code: https://github.com/HumanMLLM/LLaVA-Scissor
+
 ## Model Summary
 This repository contains the baseline model used in LLaVA-Scissor.
 This model is an enhanced version of [LLaVA-OneVision](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) model with [SIGLIP](https://huggingface.co/google/siglip-so400m-patch14-384) vision encoder and [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) large language model and is finetuned with [Oryx](https://huggingface.co/datasets/THUdyh/Oryx-SFT-Data) data.
@@ -140,7 +145,8 @@ image_tensors.append(frames)
 
 # Prepare conversation input
 conv_template = "qwen_2"
-question = f"{DEFAULT_IMAGE_TOKEN}\nDescribe this video."
+question = f"{DEFAULT_IMAGE_TOKEN}
+Describe this video."
 conv = copy.deepcopy(conv_templates[conv_template])
 conv.append_message(conv.roles[0], question)
 conv.append_message(conv.roles[1], None)
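
The third hunk ends partway through the README's video inference example. For orientation, here is a minimal sketch of how that conversation prompt typically continues into tokenization and generation, assuming the LLaVA-OneVision `llava` package that the example appears to use. The names `tokenizer`, `model`, `image_tensors`, and `video_frames` are assumed to come from the earlier, unshown part of the example, and the `\n` is kept escaped inside the f-string so the snippet stays valid Python.

```python
# Minimal sketch, not the card's exact code: continues the conversation-template
# snippet from the diff, assuming the LLaVA-OneVision `llava` package and that
# `tokenizer`, `model`, `image_tensors`, and `video_frames` (PIL frames) were
# prepared in the earlier, unshown part of the README example.
import copy

import torch
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token

conv_template = "qwen_2"
# Keep the newline escaped; a literal line break inside the f-string is a syntax error.
question = f"{DEFAULT_IMAGE_TOKEN}\nDescribe this video."

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Tokenize the prompt, replacing the image placeholder with IMAGE_TOKEN_INDEX.
input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .to(model.device)
)
image_sizes = [frame.size for frame in video_frames]  # assumed: sizes of the loaded frames

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensors,
        image_sizes=image_sizes,
        modalities=["video"],
        do_sample=False,
        max_new_tokens=256,
    )

print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```

Treat the exact keyword arguments (such as `modalities=["video"]`) as assumptions to verify against the full example in the repository linked above.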