Add pipeline tag, links to paper, code, and project page
#1
opened by nielsr (HF Staff)

README.md CHANGED
@@ -1,19 +1,29 @@
 ---
-
+base_model:
+- google/siglip-so400m-patch14-384
+- Qwen/Qwen2.5-0.5B-Instruct
 datasets:
 - THUdyh/Oryx-SFT-Data
 language:
 - en
 - zh
+library_name: transformers
+license: cc-by-nc-4.0
 metrics:
 - accuracy
-
-
-
-
+pipeline_tag: video-text-to-text
+tags:
+- video-understanding
+- multimodal
 ---
+
 # LLaVA-Scissor-baseline-0.5B
 
+The model was presented in the paper [LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs](https://huggingface.co/papers/2506.21862).
+
+Project page: [https://humanmllm.github.io/LLaVA-Scissor](https://humanmllm.github.io/LLaVA-Scissor)
+Code: [https://github.com/HumanMLLM/LLaVA-Scissor](https://github.com/HumanMLLM/LLaVA-Scissor)
+
 ## Model Summary
 This repository contains the baseline model used in LLaVA-Scissor.
 This model is an enhanced version of [LLaVA-OneVision](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov) model with [SIGLIP](https://huggingface.co/google/siglip-so400m-patch14-384) vision encoder and [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) large language model and is finetuned with [Oryx](https://huggingface.co/datasets/THUdyh/Oryx-SFT-Data) data.
@@ -70,7 +80,8 @@ image_tensors.append(frames)
 
 # Prepare conversation input
 conv_template = "qwen_2"
-question = f"{DEFAULT_IMAGE_TOKEN}\nDescribe this video."
+question = f"{DEFAULT_IMAGE_TOKEN}
+Describe this video."
 conv = copy.deepcopy(conv_templates[conv_template])
 conv.append_message(conv.roles[0], question)
 conv.append_message(conv.roles[1], None)