BBBBCHAN nielsr (HF Staff) committed
Commit 444cc11 · verified · parent: 1b452c4

Improve model card: Add pipeline tag, paper link, and GitHub repository link (#1)

- Improve model card: Add pipeline tag, paper link, and GitHub repository link (b8e0608de4d13fc308a7f7eb0dd913957341729c)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1):
  1. README.md +18 -12
README.md CHANGED
@@ -1,16 +1,23 @@
 ---
-license: cc-by-nc-4.0
+base_model:
+- google/siglip-so400m-patch14-384
+- Qwen/Qwen2.5-7B-Instruct
 datasets:
 - THUdyh/Oryx-SFT-Data
 language:
 - en
 - zh
+library_name: transformers
+license: cc-by-nc-4.0
 metrics:
 - accuracy
-base_model:
-- google/siglip-so400m-patch14-384
-- Qwen/Qwen2.5-7B-Instruct
-library_name: transformers
+pipeline_tag: video-text-to-text
+tags:
+- llava
+- llava-scissor
+- llava-onevision
+- llava-ov
+- token-compression
 model-index:
 - name: llava-onevision-qwen-7b-ov
   results:
@@ -74,16 +81,14 @@ model-index:
       value: 40.55
       name: accuracy
       verified: true
-tags:
-- llava
-- llava-scissor
-- llava-onevision
-- llava-ov
-- token-compression
 ---
 
 # LLaVA-Scissor-baseline-7B
 
+This repository contains the baseline model for [LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs](https://huggingface.co/papers/2506.21862).
+
+Code: https://github.com/HumanMLLM/LLaVA-Scissor
+
 ## Model Summary
 This repository contains the baseline model used in LLaVA-Scissor.
 This model is an enhanced version of [LLaVA-OneVision](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) model with [SIGLIP](https://huggingface.co/google/siglip-so400m-patch14-384) vision encoder and [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) large language model and is finetuned with [Oryx](https://huggingface.co/datasets/THUdyh/Oryx-SFT-Data) data.
@@ -140,7 +145,8 @@ image_tensors.append(frames)
 
 # Prepare conversation input
 conv_template = "qwen_2"
-question = f"{DEFAULT_IMAGE_TOKEN}\nDescribe this video."
+question = f"{DEFAULT_IMAGE_TOKEN}
+Describe this video."
 conv = copy.deepcopy(conv_templates[conv_template])
 conv.append_message(conv.roles[0], question)
 conv.append_message(conv.roles[1], None)
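
The third hunk ends partway through the README's video inference example. For orientation, here is a minimal sketch of how that conversation prompt typically continues into tokenization and generation, assuming the LLaVA-OneVision `llava` package that the example appears to use. The names `tokenizer`, `model`, `image_tensors`, and `video_frames` are assumed to come from the earlier, unshown part of the example, and the `\n` is kept escaped inside the f-string so the snippet stays valid Python.

```python
# Minimal sketch, not the card's exact code: continues the conversation-template
# snippet from the diff, assuming the LLaVA-OneVision `llava` package and that
# `tokenizer`, `model`, `image_tensors`, and `video_frames` (PIL frames) were
# prepared in the earlier, unshown part of the README example.
import copy

import torch
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token

conv_template = "qwen_2"
# Keep the newline escaped; a literal line break inside the f-string is a syntax error.
question = f"{DEFAULT_IMAGE_TOKEN}\nDescribe this video."

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Tokenize the prompt, replacing the image placeholder with IMAGE_TOKEN_INDEX.
input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .to(model.device)
)
image_sizes = [frame.size for frame in video_frames]  # assumed: sizes of the loaded frames

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensors,
        image_sizes=image_sizes,
        modalities=["video"],
        do_sample=False,
        max_new_tokens=256,
    )

print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```

Treat the exact keyword arguments (such as `modalities=["video"]`) as assumptions to verify against the full example in the repository linked above.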