---
license: "cc-by-nc-4.0"
tags:
- vision
- video-classification
---
# StreamFormer (base-sized model)
StreamFormer backbone model pre-trained at *global*, *temporal*, and *spatial* granularities. It was introduced in the paper [Learning Streaming Video Representation via Multitask Training](https://arxiv.org/abs/2504.20041) and first released in [this repository](https://github.com/Go2Heart/StreamFormer).
## Intended uses & limitations
StreamFormer is a streaming video representation backbone that encodes a stream of video frames. It is designed for multiple downstream applications, such as Online Action Detection, Online Video Instance Segmentation, and Video Question Answering.
### Installation
```bash
git clone https://github.com/Go2Heart/StreamFormer.git
cd StreamFormer
conda create -n streamformer python=3.10
conda activate streamformer
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia
pip install -r requirements.txt
```
### How to use
To extract the multi-granularity features:
```python
import torch

# `models` is expected to be the package inside the cloned StreamFormer repository
from models import TimesformerMultiTaskingModelSigLIP

model = TimesformerMultiTaskingModelSigLIP.from_pretrained("StreamFormer/streamformer-timesformer").eval()

# Dummy clip with layout [batch, frames, channels, height, width]
fake_frames = torch.randn(1, 16, 3, 224, 224).to(model.device)

# Inference only, so gradients are not tracked
with torch.no_grad():
    output = model(fake_frames)

# global representation [B, D]: the temporal feature at the last frame
print(output.pooler_output[:, -1].shape, output.pooler_output[:, -1])
# temporal representation [B, T, D]
print(output.pooler_output.shape, output.pooler_output)
# spatial representation [B, T, HxW, D]
print(output.last_hidden_state.shape, output.last_hidden_state)
```
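Downstream heads often need these outputs in a different layout. The following is a minimal sketch of how the three tensors relate by shape, using random stand-in tensors instead of the real model so that it runs without downloading the checkpoint. The feature width `D = 768` and the `14 x 14` patch grid are assumptions made for illustration (not values read from the released weights), and the sketch assumes `last_hidden_state` holds only patch tokens, as the `[B, T, HxW, D]` shape above suggests.

```python
import torch

# Illustrative sizes only: D and the patch grid are assumptions,
# not values taken from the released checkpoint.
B, T, D = 1, 16, 768   # batch, frames, feature width
H = W = 14             # assumed patch grid per frame

# Stand-ins for output.pooler_output and output.last_hidden_state
temporal = torch.randn(B, T, D)         # [B, T, D]
spatial = torch.randn(B, T, H * W, D)   # [B, T, HxW, D]

# Global feature: the temporal feature at the most recent frame
global_feat = temporal[:, -1]           # [B, D]

# Unflatten the token axis into a 2D grid, one feature map per frame
spatial_grid = spatial.reshape(B, T, H, W, D)        # [B, T, H, W, D]

# Channels-first layout, which convolutional heads typically expect
spatial_chw = spatial_grid.permute(0, 1, 4, 2, 3)    # [B, T, D, H, W]

print(global_feat.shape, spatial_grid.shape, spatial_chw.shape)
```

With a real forward pass, replace `temporal` and `spatial` with `output.pooler_output` and `output.last_hidden_state`, and read `D` and the grid size from the tensor shapes rather than hard-coding them.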
### BibTeX entry and citation info
```bibtex
@misc{yan2025learning,
title={Learning Streaming Video Representation via Multitask Training},
author={Yibin Yan and Jilan Xu and Shangzhe Di and Yikun Liu and Yudi Shi and Qirui Chen and Zeqian Li and Yifei Huang and Weidi Xie},
year={2025},
eprint={2504.20041},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```