---
base_model:
- google/siglip-so400m-patch14-384
- Qwen/Qwen2.5-0.5B-Instruct
datasets:
- THUdyh/Oryx-SFT-Data
language:
- en
- zh
library_name: transformers
license: cc-by-nc-4.0
metrics:
- accuracy
pipeline_tag: video-text-to-text
tags:
- video-understanding
- multimodal
---
|
|
|
# LLaVA-Scissor-baseline-0.5B

The model was presented in the paper [LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs](https://huggingface.co/papers/2506.21862).

Project page: [https://humanmllm.github.io/LLaVA-Scissor](https://humanmllm.github.io/LLaVA-Scissor)

Code: [https://github.com/HumanMLLM/LLaVA-Scissor](https://github.com/HumanMLLM/LLaVA-Scissor)

## Model Summary
|
This repository contains the baseline model used in LLaVA-Scissor.

It is an enhanced version of [LLaVA-OneVision](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov), built on the [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) vision encoder and the [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) language model, and fine-tuned on the [Oryx](https://huggingface.co/datasets/THUdyh/Oryx-SFT-Data) SFT data.
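
For reference, the two base components named above can be inspected on their own with plain `transformers`. The snippet below is purely illustrative; the LLaVA-Scissor checkpoint itself should be loaded with the project's `load_pretrained_model` helper shown in Quick Start below.

```python
# Illustrative only: loading the two base components with vanilla transformers.
# The LLaVA-Scissor model itself is loaded via the project's loader (see Quick Start).
from transformers import AutoModel, AutoProcessor, AutoModelForCausalLM, AutoTokenizer

vision_tower = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")          # SigLIP vision encoder
image_processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")              # Qwen2.5-0.5B LLM
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
```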
|
|
|
## Quick Start

Here we provide a script for full-token inference with LLaVA-Scissor (i.e., without token compression); a conceptual sketch of the token-compression idea follows the script.
|
```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

import torch
import numpy as np
import copy
import warnings
from decord import VideoReader, cpu

warnings.filterwarnings("ignore")

# Load the LLaVA-Scissor baseline model
pretrained = "model_zoo/BBBBCHAN/LLaVA-Scissor-baseline-0.5B"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, attn_implementation="sdpa")
model.eval()


# Uniformly sample `max_frames_num` frames from the video
def load_video(video_path, max_frames_num):
    if type(video_path) == str:
        vr = VideoReader(video_path, ctx=cpu(0))
    else:
        vr = VideoReader(video_path[0], ctx=cpu(0))
    total_frame_num = len(vr)
    uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
    frame_idx = uniform_sampled_frames.tolist()
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames  # (frames, height, width, channels)


# Load and preprocess the video frames
video_path = "Your/path/to/the/video"
video_frames = load_video(video_path, 16)
print(video_frames.shape)
image_tensors = []
frames = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"].half().cuda()
image_tensors.append(frames)

# Prepare conversation input
conv_template = "qwen_2"
question = f"{DEFAULT_IMAGE_TOKEN}\nDescribe this video."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [frame.size for frame in video_frames]

# Generate the response
cont = model.generate(
    input_ids,
    images=image_tensors,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
    modalities=["video"],
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs[0])
```
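
The script above feeds every visual token to the language model. For intuition only, the sketch below illustrates the semantic-connected-components idea behind LLaVA-Scissor's token compression: tokens whose pairwise similarity exceeds a threshold fall into the same connected component and are merged into one representative token. This is a simplified, hypothetical illustration rather than the repository's implementation; the function name and the similarity threshold are assumptions, and the actual method applies the two-step spatio-temporal compression described in the paper.

```python
# Illustrative sketch of connected-component token merging (NOT the repo implementation).
# Assumption: `tokens` is an (N, D) tensor of visual token features; `threshold` is a
# hypothetical cosine-similarity cutoff.
import torch


def merge_tokens_by_connected_components(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    # Pairwise cosine similarity between all tokens.
    feats = torch.nn.functional.normalize(tokens, dim=-1)
    sim = feats @ feats.T  # (N, N)

    # Tokens are connected when their similarity exceeds the threshold.
    adj = sim >= threshold

    # Find connected components with a simple union-find.
    n = tokens.shape[0]
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if adj[i, j]:
                parent[find(i)] = find(j)

    # Average the tokens of each component into a single representative token.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    merged = torch.stack([tokens[idx].mean(dim=0) for idx in groups.values()])
    return merged  # (num_components, D)
```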
|
|
|
## Citation

If you find our repo useful for your research, please consider citing our paper:
|
|
|
```bibtex
@article{sun2025llava,
  title={LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs},
  author={Sun, Boyuan and Zhao, Jiaxing and Wei, Xihan and Hou, Qibin},
  journal={arXiv preprint arXiv:2506.21862},
  year={2025}
}
```