Sophia
Sophia is a vision-language model for hour-scale long video understanding.
- In this work, we are the first to propose the two-stage Shot-adaptive Frame Pruning technique, which prunes noisy shots and redundant frames so the model can more sharply identify and focus on the frames relevant to a specific query.
- We also introduce the Hierarchical Attention mechanism, which models long-term temporal dependencies between video frames with $\mathcal{O}(N)$ time and space complexity w.r.t. the input sequence length $N$, while theoretically maintaining global modeling efficiency (measured by IPD).
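The exact design of the Hierarchical Attention is not spelled out in this README. As a rough illustration of how attention can reach $\mathcal{O}(N)$, the sketch below combines a fixed-size local window with a fixed number of mean-pooled global summary tokens; the window size, summary count, and pooling choice are all assumptions for illustration, not the model's actual mechanism.

```python
import numpy as np

def hierarchical_attention(x, window=4, n_summary=2):
    """Illustrative linear-complexity attention: each token attends to a
    local window plus a constant number of pooled global summary tokens,
    so total cost is O(N * (window + n_summary)) = O(N)."""
    n, d = x.shape
    # Global level: mean-pool the sequence into n_summary summary tokens.
    chunks = np.array_split(x, n_summary)
    summaries = np.stack([c.mean(axis=0) for c in chunks])
    out = np.empty_like(x)
    for i in range(n):
        # Local level: a constant-size window around token i.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        keys = np.concatenate([x[lo:hi], summaries])  # (<= 2*window+1+n_summary, d)
        scores = keys @ x[i] / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ keys
    return out
```

Because every token attends to a bounded number of keys, memory and compute grow linearly in $N$ rather than quadratically.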
Method

The overall architecture of Sophia is shown in the figure above.
- First, we employ a shot detector that segments the video into shots based on frame-level semantics, which allows the model to handle temporally uneven events and scenes in long videos more naturally.
- Then, each frame is fed into a vision encoder and a projector to obtain a set of visual tokens.
- During the Inter-shot Pruning stage, we measure the correlation between each shot's visual embeddings and the textual embedding of the user's query, and prune the shots unrelated to the query.
- Considering that shots often contain continuous actions or identical scenes, the Intra-shot Filtering stage removes redundant frames within a shot.
- Finally, at the architectural level, we employ sparse Hierarchical Attention in our LLM to perform attention with $\mathcal{O}(N)$ complexity over video tokens while preserving competitive model capability.
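The two pruning stages above can be sketched as follows. This is a minimal illustration only: the similarity measure (cosine), the keep ratio, and the redundancy threshold are assumptions made here for concreteness, not the values or criteria used by Sophia.

```python
import numpy as np

def prune_shots(shot_embs, query_emb, keep_ratio=0.5):
    """Inter-shot Pruning (sketch): keep the shots whose pooled visual
    embedding is most similar to the query embedding (cosine similarity)."""
    sims = shot_embs @ query_emb / (
        np.linalg.norm(shot_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    k = max(1, int(len(shot_embs) * keep_ratio))
    keep = np.argsort(sims)[-k:]          # indices of the top-k shots
    return sorted(keep.tolist())

def filter_frames(frame_embs, sim_threshold=0.95):
    """Intra-shot Filtering (sketch): drop a frame when it is nearly
    identical to the last kept frame, removing redundancy within a shot."""
    kept = [0]
    for i in range(1, len(frame_embs)):
        a, b = frame_embs[kept[-1]], frame_embs[i]
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if sim < sim_threshold:
            kept.append(i)
    return kept
```

For example, with a query embedding pointing along one axis, `prune_shots` keeps the shots aligned with it, and `filter_frames` collapses a run of near-duplicate frames to a single representative.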
Model Weights
We release the model code and weights of Sophia at `./Sophia_ckpt`. The checkpoint conforms to the `transformers` model storage format, so it can be loaded directly with `transformers`.
How to Use
First, install the requirements of Sophia:

```
python==3.10.0
safetensors==0.4.5
sentence-transformers==2.2.2
sentencepiece==0.2.0
tokenizers==0.19.1
torch==2.4.0+cu121
torchaudio==2.5.1+cu121
torchvision==0.19.0+cu121
tqdm==4.67.1
transformers==4.45.0
triton==3.0.0
```
Model loading
```python
import torch
from modelscope import AutoTokenizer, AutoModel

# Load the tokenizer and model from the local checkpoint directory.
tokenizer = AutoTokenizer.from_pretrained('./Sophia_ckpt', trust_remote_code=True)
model = AutoModel.from_pretrained(
    './Sophia_ckpt',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
```
Model tree for Tao-tse/Sophia
Base model: OpenGVLab/InternVL2_5-8B