---
base_model:
- google/siglip-so400m-patch14-384
- Qwen/Qwen2.5-0.5B-Instruct
datasets:
- THUdyh/Oryx-SFT-Data
language:
- en
- zh
library_name: transformers
license: cc-by-nc-4.0
metrics:
- accuracy
pipeline_tag: video-text-to-text
tags:
- video-understanding
- multimodal
---

# LLaVA-Scissor-baseline-0.5B

The model was presented in the paper [LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs](https://huggingface.co/papers/2506.21862).

Code: [https://github.com/HumanMLLM/LLaVA-Scissor](https://github.com/HumanMLLM/LLaVA-Scissor)

## Model Summary

This repository contains the baseline model used in LLaVA-Scissor. It is an enhanced version of the [LLaVA-OneVision](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov) model that combines the [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) vision encoder with the [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) large language model, and is fine-tuned on the [Oryx](https://huggingface.co/datasets/THUdyh/Oryx-SFT-Data) SFT data.

## Quick Start

Here we provide a script for LLaVA-Scissor full-token inference (without token compression).

```python
from operator import attrgetter
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
import torch
import cv2
import numpy as np
from PIL import Image
import requests
import copy
import warnings
from decord import VideoReader, cpu

warnings.filterwarnings("ignore")

# Load the OneVision model
pretrained = "model_zoo/BBBBCHAN/LLaVA-Scissor-baseline-0.5B"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, attn_implementation="sdpa")
model.eval()

# Function to extract uniformly sampled frames from a video
def load_video(video_path, max_frames_num):
    if type(video_path) == str:
        vr = VideoReader(video_path, ctx=cpu(0))
    else:
        vr = VideoReader(video_path[0], ctx=cpu(0))
    total_frame_num = len(vr)
    uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
    frame_idx = uniform_sampled_frames.tolist()
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames  # (frames, height, width, channels)

# Load and process video
video_path = "Your/path/to/the/video"
video_frames = load_video(video_path, 16)
print(video_frames.shape)
image_tensors = []
frames = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"].half().cuda()
image_tensors.append(frames)

# Prepare conversation input
conv_template = "qwen_2"
question = f"{DEFAULT_IMAGE_TOKEN} Describe this video."
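# Build the chat prompt with the Qwen conversation template. DEFAULT_IMAGE_TOKEN
# ("<image>") is only a placeholder: tokenizer_image_token below splits the prompt
# on it and splices IMAGE_TOKEN_INDEX into the token ids, which is where the model
# inserts the visual features at generation time.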
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [frame.size for frame in video_frames]

# Generate response
cont = model.generate(
    input_ids,
    images=image_tensors,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
    modalities=["video"],
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs[0])
```

## Citation

If you find our repo useful for your research, please consider citing our paper:

```bibtex
@article{sun2025llava,
  title={LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs},
  author={Sun, Boyuan and Zhao, Jiaxing and Wei, Xihan and Hou, Qibin},
  journal={arXiv preprint arXiv:2506.21862},
  year={2025}
}
```