TRASER:

TRASER is the video scene graph generation model introduced in Synthetic Visual Genome 2 (SVG2). Given a video and per-object segmentation trajectories, it generates a structured spatio-temporal scene graph describing objects, attributes, and their relations across time.

Paper: Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

Authors: Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Quan Kong, Rajat Saini, Ranjay Krishna. (Allen Institute for AI · University of Washington · Woven by Toyota)

Model Architecture

TRASER extends Qwen2.5-VL-3B-Instruct with two trainable Perceiver Resampler modules that implement Trajectory-Aligned Token Arrangement:

Module	Abbrev.	Role
Object-Trajectory Resampler	OTR	Aggregates all cross-frame tokens for one object into a global summary
Temporal-Windows Resampler	TWR	Compresses per-object tokens within each temporal window into a fixed set of latents
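
To make the resampling idea concrete, here is a minimal, dependency-free sketch of the core operation a Perceiver-style resampler relies on: a fixed number of latent queries cross-attend over a variable number of input tokens, so the output length does not depend on the input length. This is an illustration only, not TRASER's implementation. The function name `cross_attend` and the latent count are hypothetical, and the learned projections, multiple heads, and feed-forward layers of a real resampler are omitted.

```python
import math


def cross_attend(latents, tokens):
    """Single-head cross-attention without learned projections (illustrative).

    latents: list of query vectors (fixed count)
    tokens:  list of input vectors (any count >= 1)
    Returns one output vector per latent.
    """
    dim = len(latents[0])
    outputs = []
    for query in latents:
        # Scaled dot-product score of this latent against every token.
        scores = [
            sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim)
            for tok in tokens
        ]
        # Numerically stable softmax over the token axis.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Each output is a weighted average of the input tokens.
        outputs.append([
            sum(w * tok[d] for w, tok in zip(weights, tokens)) / total
            for d in range(dim)
        ])
    return outputs


latents = [[1.0, 0.0], [0.0, 1.0]]        # 2 latents (count is an assumption)
short_track = [[0.5, 0.1], [0.2, 0.9]]    # object covered by 2 tokens
long_track = short_track * 10             # same object covered by 20 tokens

# Both calls return exactly len(latents) vectors.
print(len(cross_attend(latents, short_track)), len(cross_attend(latents, long_track)))
```

Under this reading, the OTR would apply such pooling to all of an object's tokens across the video, while the TWR would apply it separately to the tokens inside each temporal window.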

For each tracked object the LLM sees a structured token block: `<obj_traj_start> Object N: <|vision_start|> [OTR: N latents] <t1-t2> [TWR: N latents] <t2-t3> [TWR: N latents] ... <|vision_end|> <obj_traj_end>`
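
The layout above can be sketched as a small string builder. This is only a readable mock-up of the ordering: in the actual model the bracketed slots hold latent embeddings rather than text, and the helper name `build_object_block`, the window boundaries, and the latent counts below are all hypothetical.

```python
def build_object_block(obj_id, windows, n_otr, n_twr):
    """Render the per-object token layout as a plain string (illustrative).

    obj_id:  object index shown after "Object"
    windows: list of (start, end) temporal windows
    n_otr:   number of OTR latents (placeholder value, not from the model config)
    n_twr:   number of TWR latents per window (placeholder value)
    """
    parts = ["<obj_traj_start>", f"Object {obj_id}:", "<|vision_start|>"]
    # One global summary for the whole trajectory comes first.
    parts.append(f"[OTR: {n_otr} latents]")
    # Then each temporal window: its time tag followed by its latents.
    for start, end in windows:
        parts.append(f"<{start}-{end}>")
        parts.append(f"[TWR: {n_twr} latents]")
    parts += ["<|vision_end|>", "<obj_traj_end>"]
    return " ".join(parts)


print(build_object_block(1, [(0, 2), (2, 4)], n_otr=8, n_twr=4))
```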

How to Get Started

Installation

pip install "transformers>=4.54.0" torch pycocotools

Prepare Inputs

Two inputs are required alongside the video:

Video — any format supported by qwen_vl_utils (e.g. .mp4)
Mask JSON — per-frame, per-object RLE segmentation masks in COCO pycocotools format:

[
  // frame 0
  [{"size": [H, W], "counts": "..."}, {"size": [H, W], "counts": "..."}, ...],
  // frame 1
  [...]
]

See example/2401075277_rle.json for a complete example.
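
Before running inference it can help to sanity-check a mask file against the schema above. The following is a minimal sketch using only the standard library. The helper name `summarize_mask_frames` is hypothetical, the `counts` strings in the demo are placeholders rather than valid RLE data, and the `//` frame comments in the schema are explanatory only, since strict JSON parsers reject comments.

```python
import json


def summarize_mask_frames(frames):
    """Check the frame -> object -> RLE nesting and return objects per frame."""
    if not isinstance(frames, list):
        raise ValueError("top level must be a list of frames")
    objects_per_frame = []
    for f_idx, frame in enumerate(frames):
        if not isinstance(frame, list):
            raise ValueError(f"frame {f_idx} must be a list of RLE objects")
        for o_idx, rle in enumerate(frame):
            if not isinstance(rle, dict):
                raise ValueError(f"frame {f_idx}, object {o_idx}: expected a dict")
            size = rle.get("size")
            # "size" is [H, W]: two integers.
            if not (isinstance(size, list) and len(size) == 2
                    and all(isinstance(v, int) for v in size)):
                raise ValueError(f"frame {f_idx}, object {o_idx}: bad 'size'")
            # "counts" is the compressed RLE string.
            if not isinstance(rle.get("counts"), str):
                raise ValueError(f"frame {f_idx}, object {o_idx}: bad 'counts'")
        objects_per_frame.append(len(frame))
    return objects_per_frame


# Placeholder data in the documented shape (the counts are NOT real RLE).
demo = json.loads("""
[
  [{"size": [4, 6], "counts": "placeholder"}, {"size": [4, 6], "counts": "placeholder"}],
  [{"size": [4, 6], "counts": "placeholder"}, {"size": [4, 6], "counts": "placeholder"}]
]
""")
print(summarize_mask_frames(demo))
```

For a real file, replace the demo with `json.load(open(path))`. Decoding the RLE entries into binary masks is the job of pycocotools and is not shown here.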

Run Inference

python inference.py \
    --model_path /path/to/vsg_release_model \
    --video_path /path/to/video.mp4 \
    --mask_path /path/to/masks.json \
    --out_dir ./output

CLI Arguments

Argument	Default	Description
`--model_path`	required	Path to this model directory
`--video_path`	required	Input video file
`--mask_path`	required	Per-object RLE mask JSON
`--out_dir`	`./output`	Directory to write `output.txt`
`--max_objects`	`40`	Maximum number of objects to process per video
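
When processing many videos, the same arguments can be assembled programmatically. The sketch below only builds the argument list from the flags in the table; the helper name `build_inference_command` is hypothetical, and it assumes `inference.py` is in the current working directory.

```python
import sys


def build_inference_command(model_path, video_path, mask_path,
                            out_dir="./output", max_objects=40):
    """Return an argv list for inference.py using the documented flags."""
    return [
        sys.executable, "inference.py",
        "--model_path", str(model_path),
        "--video_path", str(video_path),
        "--mask_path", str(mask_path),
        "--out_dir", str(out_dir),
        "--max_objects", str(max_objects),
    ]


cmd = build_inference_command(".", "example/2401075277.mp4",
                              "example/2401075277_rle.json")
print(" ".join(cmd[1:]))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.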

Quickstart with the Bundled Example

python inference.py \
    --model_path . \
    --video_path example/2401075277.mp4 \
    --mask_path example/2401075277_rle.json \
    --out_dir ./output

Python API

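import torch
from transformers import AutoProcessor, AutoTokenizer
from modeling_traser import TRASER  # modeling_traser.py ships in this repository

model_path = "/path/to/vsg_release_model"
device = "cuda"

# Load the TRASER weights in bfloat16 and move them to the GPU.
model = TRASER.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
# The processor comes from the base model; the tokenizer is loaded from this model directory.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
processor.tokenizer = AutoTokenizer.from_pretrained(model_path)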

Then follow the preprocessing steps in inference.py: load masks → build object mask tensors → select_tokens → rearrange_token → model.generate.
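
As intuition for the mask-based selection step, here is one plausible reading of a coverage-threshold rule: a visual patch is assigned to an object when the object's mask covers at least a given fraction of that patch. This is an assumption for illustration, not the code in `resampler_utils/token_selection.py`. The function name `select_patches_by_coverage`, the patch size, and the threshold values are all hypothetical.

```python
def select_patches_by_coverage(mask, patch, threshold):
    """Return (row, col) indices of patches whose mask coverage >= threshold.

    mask:      2D list of 0/1 values for one object in one frame
    patch:     patch side length in pixels (illustrative value)
    threshold: minimum covered fraction in [0, 1] (illustrative value)
    """
    height, width = len(mask), len(mask[0])
    selected = []
    for pr in range(height // patch):
        for pc in range(width // patch):
            # Count mask pixels that fall inside this patch.
            covered = sum(
                mask[r][c]
                for r in range(pr * patch, (pr + 1) * patch)
                for c in range(pc * patch, (pc + 1) * patch)
            )
            if covered / (patch * patch) >= threshold:
                selected.append((pr, pc))
    return selected


mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
print(select_patches_by_coverage(mask, patch=2, threshold=0.5))  # [(0, 0), (1, 1)]
print(select_patches_by_coverage(mask, patch=2, threshold=1.0))  # [(1, 1)]
```

A lower threshold keeps more boundary patches for each object, while a higher one keeps only patches that lie mostly inside the mask.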

Repository Structure

├── modeling_traser.py           # TRASER model class
├── inference.py                 # End-to-end inference script
├── config.json                  # Model configuration
├── generation_config.json       # Default generation hyperparameters
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── tokenizer_config.json
├── vocab.json
├── merges.txt
├── added_tokens.json
├── special_tokens_map.json
├── chat_template.jinja
├── resampler_utils/
│   ├── token_selection.py       # Mask-based visual token selection (coverage threshold)
│   └── token_arrangement.py     # Token sequence rearrangement with OTR/TWR injection
├── qwen_vl_vsg_utils/           # Adapted Qwen-VL video processing utilities
├── static/
│   └── image.png                # Architecture diagram
└── example/
    ├── 2401075277.mp4           # Example video
    └── 2401075277_rle.json      # Example RLE segmentation masks

Training Data

TRASER is trained on SVG2, a large-scale automatically annotated video scene graph dataset:

~636K videos with dense panoptic, per-frame annotations
~6.6M objects · ~52M attributes · ~6.7M relations

Citation

@article{gao2026svg2,
  author = {Gao, Ziqi and Zhang, Jieyu and others},
  title  = {Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos},
  year   = {2026}
}
