---
license: "cc-by-nc-4.0"
tags:
- vision
- video-classification
---
# StreamFormer (base-sized model)
StreamFormer backbone model pre-trained at *global*, *temporal*, and *spatial* granularities. It was introduced in the paper [Learning Streaming Video Representation via Multitask Training](https://arxiv.org/abs/2504.20041) and first released in [this repository](https://github.com/Go2Heart/StreamFormer).
## Intended uses & limitations
StreamFormer is a streaming video representation backbone that encodes a stream of video input. It is designed for downstream applications such as Online Action Detection, Online Video Instance Segmentation, and Video Question Answering.
### Installation
```bash
git clone https://github.com/Go2Heart/StreamFormer.git
cd StreamFormer
conda create -n streamformer python=3.10
conda activate streamformer
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia
pip install -r requirements.txt
```
### How to use
To extract multi-granularity features:
```python
import torch

# `models` is expected to come from the cloned StreamFormer repository.
from models import TimesformerMultiTaskingModelSigLIP

# Load the pre-trained backbone and switch to inference mode.
model = TimesformerMultiTaskingModelSigLIP.from_pretrained("StreamFormer/streamformer-timesformer").eval()

with torch.no_grad():
    # Dummy clip shaped [batch, frames, channels, height, width].
    fake_frames = torch.randn(1, 16, 3, 224, 224)
    fake_frames = fake_frames.to(model.device)
    output = model(fake_frames)

# global representation [B, D]
print(output.pooler_output[:, -1].shape, output.pooler_output[:, -1])
# temporal representation [B, T, D]
print(output.pooler_output.shape, output.pooler_output)
# spatial representation [B, T, HxW, D]
print(output.last_hidden_state.shape, output.last_hidden_state)
```
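The snippet above feeds a fixed 16-frame clip. With a live stream, frames arrive one at a time, so something has to assemble them into clips first. The following is a minimal sketch of one way to do that with a rolling buffer. The helper `rolling_clips` and its `clip_len` parameter are hypothetical and not part of the StreamFormer repository, and the windowing strategy is an assumption rather than the authors' inference procedure.

```python
from collections import deque
from typing import Iterable, Iterator, TypeVar

Frame = TypeVar("Frame")


def rolling_clips(frames: Iterable[Frame], clip_len: int = 16) -> Iterator[list[Frame]]:
    """Yield the most recent `clip_len` frames each time a new frame arrives.

    Hypothetical helper: nothing is yielded until the buffer is full.
    """
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    window: deque = deque(maxlen=clip_len)  # oldest frame is dropped automatically
    for frame in frames:
        window.append(frame)
        if len(window) == clip_len:
            yield list(window)


# Integers stand in for decoded frames here.
clips = list(rolling_clips(range(20), clip_len=16))
print(len(clips), clips[0][0], clips[-1][-1])
```

Each yielded clip could then be stacked into a `[1, 16, 3, 224, 224]` tensor and passed to the model as in the example above. Re-encoding the full window for every new frame is simple but redundant. Whether the repository offers a more efficient streaming path is not covered here.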
### BibTeX entry and citation info
```bibtex
@misc{yan2025learning,
title={Learning Streaming Video Representation via Multitask Training},
author={Yibin Yan and Jilan Xu and Shangzhe Di and Yikun Liu and Yudi Shi and Qirui Chen and Zeqian Li and Yifei Huang and Weidi Xie},
year={2025},
eprint={2504.20041},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```