---
base_model:
- google/siglip-so400m-patch14-384
- Qwen/Qwen2.5-7B-Instruct
datasets:
- THUdyh/Oryx-SFT-Data
language:
- en
- zh
library_name: transformers
license: cc-by-nc-4.0
metrics:
- accuracy
pipeline_tag: video-text-to-text
tags:
- llava
- llava-scissor
- llava-onevision
- llava-ov
- token-compression
- video-understanding
- multimodal
model-index:
- name: llava-onevision-qwen-7b-ov
  results:
  - task:
      type: multimodal
    dataset:
      name: MVBench
      type: mvbench
    metrics:
    - type: accuracy
      value: 62.425
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: NextQA
      type: nextqa
    metrics:
    - type: accuracy
      value: 81.33
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: EgoSchema
      type: egoschema
    metrics:
    - type: accuracy
      value: 58.08
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoMME
      type: videomme
    metrics:
    - type: accuracy
      value: 57.96
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MLVU
      type: mlvu
    metrics:
    - type: accuracy
      value: 62.48
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoMMMU
      type: videommmu
    metrics:
    - type: accuracy
      value: 40.55
      name: accuracy
      verified: true
---
# LLaVA-Scissor-baseline-7B
This repository contains the baseline model for [LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs](https://huggingface.co/papers/2506.21862).
Code: https://github.com/HumanMLLM/LLaVA-Scissor
## Model Summary
This is the baseline model used in LLaVA-Scissor. It is an enhanced version of the [LLaVA-OneVision](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) model, built on the [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) vision encoder and the [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) large language model, and fine-tuned on the [Oryx](https://huggingface.co/datasets/THUdyh/Oryx-SFT-Data) SFT data.
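For a quick sanity check of the components listed above, you can read the checkpoint's `config.json` directly. The sketch below is a minimal, unofficial example: the Hub repo id and the `mm_vision_tower` key are assumptions based on typical LLaVA-OneVision checkpoints, not guaranteed by this card.
```python
import json
from huggingface_hub import hf_hub_download

# Hypothetical Hub repo id; replace with the actual repo id or a local path.
cfg_path = hf_hub_download("BBBBCHAN/LLaVA-Scissor-baseline-7B", "config.json")
with open(cfg_path) as f:
    cfg = json.load(f)

# LLaVA-OneVision-style checkpoints usually record the vision tower here
# (assumption; the key may differ for this checkpoint).
print(cfg.get("model_type"))       # expected: a Qwen-based LLaVA variant
print(cfg.get("mm_vision_tower"))  # expected: google/siglip-so400m-patch14-384
```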
## Quick Start
Below is a script for full-token inference with LLaVA-Scissor (i.e., without token compression).
```python
from operator import attrgetter
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
import torch
import cv2
import numpy as np
from PIL import Image
import requests
import copy
import warnings
from decord import VideoReader, cpu
warnings.filterwarnings("ignore")
# Load the OneVision model
pretrained = "model_zoo/BBBBCHAN/LLaVA-Scissor-baseline-7B"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, attn_implementation="sdpa")
model.eval()
# Function to extract uniformly sampled frames from a video
def load_video(video_path, max_frames_num):
    if isinstance(video_path, str):
        vr = VideoReader(video_path, ctx=cpu(0))
    else:
        vr = VideoReader(video_path[0], ctx=cpu(0))
    total_frame_num = len(vr)
    uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
    frame_idx = uniform_sampled_frames.tolist()
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames  # (frames, height, width, channels)
# Load and process video
video_path = "Your/path/to/the/video"
video_frames = load_video(video_path, 16)
print(video_frames.shape)
image_tensors = []
frames = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"].half().cuda()
image_tensors.append(frames)
# Prepare conversation input
conv_template = "qwen_2"
question = f"{DEFAULT_IMAGE_TOKEN}
Describe this video."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [frame.size for frame in video_frames]
# Generate response
cont = model.generate(
    input_ids,
    images=image_tensors,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
    modalities=["video"],
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs[0])
```
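The script above feeds all visual tokens from 16 uniformly sampled frames to the language model; the token-compression variants described in the paper live in the GitHub repository linked above. As a small illustration of the sampling step inside `load_video`, the sketch below (plain NumPy, using a hypothetical 900-frame video) shows which frame indices get selected:
```python
import numpy as np

# Same uniform sampling as in load_video: pick 16 evenly spaced indices
# between the first and the last frame of a hypothetical 900-frame video.
total_frame_num, max_frames_num = 900, 16
frame_idx = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int).tolist()
print(frame_idx)  # [0, 59, 119, 179, ..., 839, 899]
```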
## Citation
If you find our repo useful for your research, please consider citing our paper:
```bibtex
@article{sun2025llava,
title={LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs},
author={Sun, Boyuan and Zhao, Jiaxing and Wei, Xihan and Hou, Qibin},
journal={arXiv preprint arXiv:2506.21862},
year={2025}
}
```