Sophia
Sophia is a vision-language model for hour-scale long video understanding.
- In this work, we are the first to propose the two-stage Shot-adaptive Frame Pruning technique, which prunes noisy shots and redundant frames so the model can more sharply identify and focus on the frames relevant to a specific query.
- We also introduce the Hierarchical Attention mechanism, which models long-term temporal dependencies between video frames with $\mathcal{O}(N)$ time and space complexity w.r.t. the input sequence length $N$, while theoretically maintaining global modeling efficiency (measured by IPD).
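The exact design of the Hierarchical Attention is not spelled out in this README. As a rough illustration of how attention can reach $\mathcal{O}(N)$, the sketch below combines a fixed-size local window with a fixed number of mean-pooled global summary tokens; the window size, summary count, and pooling choice are all assumptions for illustration, not the model's actual mechanism.

```python
import numpy as np

def hierarchical_attention(x, window=4, n_summary=2):
    """Illustrative linear-complexity attention: each token attends to a
    local window plus a constant number of pooled global summary tokens,
    so total cost is O(N * (window + n_summary)) = O(N)."""
    n, d = x.shape
    # Global level: mean-pool the sequence into n_summary summary tokens.
    chunks = np.array_split(x, n_summary)
    summaries = np.stack([c.mean(axis=0) for c in chunks])
    out = np.empty_like(x)
    for i in range(n):
        # Local level: a constant-size window around token i.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        keys = np.concatenate([x[lo:hi], summaries])  # (<= 2*window+1+n_summary, d)
        scores = keys @ x[i] / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ keys
    return out
```

Because every token attends to a bounded number of keys, memory and compute grow linearly in $N$ rather than quadratically.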
Method

The overall architecture of Sophia is shown in the figure above.
- First, we employ a shot detector that segments the video into shots based on frame-level semantics, which allows the model to handle temporally uneven events and scenes in long videos more naturally.
- Then, each frame is fed into a vision encoder and a projector to obtain a set of visual tokens.
- During the Inter-shot Pruning stage, we measure the correlation between each shot's visual embeddings and the textual embedding of the user's query, and prune the shots unrelated to the query.
- Considering that shots often contain continuous actions or identical scenes, the Intra-shot Filtering stage removes redundant frames within a shot.
- Finally, at the architectural level, we employ sparse Hierarchical Attention in our LLM to perform attention with $\mathcal{O}(N)$ complexity over video tokens while preserving competitive model capability.
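The two pruning stages above can be sketched as follows. This is a minimal illustration only: the similarity measure (cosine), the keep ratio, and the redundancy threshold are assumptions made here for concreteness, not the values or criteria used by Sophia.

```python
import numpy as np

def prune_shots(shot_embs, query_emb, keep_ratio=0.5):
    """Inter-shot Pruning (sketch): keep the shots whose pooled visual
    embedding is most similar to the query embedding (cosine similarity)."""
    sims = shot_embs @ query_emb / (
        np.linalg.norm(shot_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    k = max(1, int(len(shot_embs) * keep_ratio))
    keep = np.argsort(sims)[-k:]          # indices of the top-k shots
    return sorted(keep.tolist())

def filter_frames(frame_embs, sim_threshold=0.95):
    """Intra-shot Filtering (sketch): drop a frame when it is nearly
    identical to the last kept frame, removing redundancy within a shot."""
    kept = [0]
    for i in range(1, len(frame_embs)):
        a, b = frame_embs[kept[-1]], frame_embs[i]
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if sim < sim_threshold:
            kept.append(i)
    return kept
```

For example, with a query embedding pointing along one axis, `prune_shots` keeps the shots aligned with it, and `filter_frames` collapses a run of near-duplicate frames to a single representative.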
Model Weights
We release the model code and weights of Sophia at `./Sophia_ckpt`. The checkpoint conforms to the `transformers` model storage format, so it can be loaded directly with `transformers`.
How to Use
First, install the requirements of Sophia:

```
python==3.10.0
safetensors==0.4.5
sentence-transformers==2.2.2
sentencepiece==0.2.0
tokenizers==0.19.1
torch==2.4.0+cu121
torchaudio==2.5.1+cu121
torchvision==0.19.0+cu121
tqdm==4.67.1
transformers==4.45.0
triton==3.0.0
```
Model loading
```python
import torch
from modelscope import AutoTokenizer, AutoModel

# Load the tokenizer and model from the local checkpoint directory.
tokenizer = AutoTokenizer.from_pretrained('./Sophia_ckpt', trust_remote_code=True)
model = AutoModel.from_pretrained(
    './Sophia_ckpt',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
```
Model tree for Tao-tse/Sophia
Base model: OpenGVLab/InternVL2_5-8B