TRASER:

TRASER is the video scene graph generation model introduced in Synthetic Visual Genome 2 (SVG2). Given a video and per-object segmentation trajectories, it generates a structured spatio-temporal scene graph describing objects, attributes, and their relations across time.

Paper: Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

Authors: Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Quan Kong, Rajat Saini, Ranjay Krishna. (Allen Institute for AI · University of Washington · Woven by Toyota)


Model Architecture

Figure: TRASER architecture (static/image.png).

TRASER extends Qwen2.5-VL-3B-Instruct with two trainable Perceiver Resampler modules that implement Trajectory-Aligned Token Arrangement:

| Module | Abbrev. | Role |
|---|---|---|
| Object-Trajectory Resampler | OTR | Aggregates all cross-frame tokens for one object into a global summary |
| Temporal-Windows Resampler | TWR | Compresses per-object tokens within each temporal window into a fixed set of latents |
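
To make the role of the two resamplers more concrete, here is a minimal, dependency-free sketch of the idea behind a Perceiver-style resampler: a fixed number of latent queries cross-attend over a variable-length token list, so the output size does not depend on how many tokens an object covers. This is not TRASER's implementation. A real resampler uses learned projections, multiple heads, and stacked layers; the `resample` helper, the identity projections, and the toy vectors are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def resample(latents, tokens):
    """One single-head cross-attention step (hypothetical helper).

    latents: list of query vectors (fixed count; learned in a real model)
    tokens:  list of visual token vectors (variable count)
    Returns one output vector per latent. Projections are the identity here,
    a simplification made only for illustration.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in latents:
        # Scaled dot-product score between this latent and every token.
        scores = [scale * sum(q * k for q, k in zip(query, tok)) for tok in tokens]
        weights = softmax(scores)
        # Attention-weighted average of the token vectors.
        outputs.append([
            sum(w * tok[d] for w, tok in zip(weights, tokens))
            for d in range(dim)
        ])
    return outputs

# Toy data: 2 latents, and objects covering 3 or 5 tokens of dimension 4.
latents = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
short_track = [[0.1 * i, 0.2, 0.3, 0.4] for i in range(3)]
long_track = [[0.1 * i, 0.2, 0.3, 0.4] for i in range(5)]

short_out = resample(latents, short_track)
long_out = resample(latents, long_track)
print(len(short_out), len(long_out))  # prints "2 2"
```

In this sketch the OTR would correspond to calling `resample` once over all of an object's tokens, and the TWR to calling it once per temporal window.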

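For each tracked object, the LLM sees a structured token block:

<obj_traj_start> Object N: <|vision_start|> [OTR: N latents] <t1-t2> [TWR: N latents] <t2-t3> [TWR: N latents] ... <|vision_end|> <obj_traj_end>

In this template, the N in "Object N" is the object's index, while "N latents" denotes the number of latent tokens a resampler emits; each <t1-t2> tag marks the temporal window that the following TWR latents summarize.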
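
The block layout above can be rendered as plain text with a small helper. This is only a sketch of the ordering. `build_object_block`, the `<latent>` placeholder, the numeric form of the window tags, and the use of one latent count for both OTR and TWR are all assumptions, not the model's actual tokenization.

```python
def build_object_block(obj_index, windows, num_latents, placeholder="<latent>"):
    """Render the textual layout of one object's token block (hypothetical).

    obj_index:   object number shown after "Object"
    windows:     list of (start, end) pairs, one per temporal window
    num_latents: latents per resampler output (assumed equal for OTR and TWR)
    placeholder: stand-in string for one latent slot
    """
    latents = " ".join([placeholder] * num_latents)
    # The OTR summary comes first, directly after <|vision_start|>.
    parts = ["<obj_traj_start>", f"Object {obj_index}:", "<|vision_start|>", latents]
    # Each temporal window contributes a time tag followed by its TWR latents.
    for start, end in windows:
        parts.append(f"<{start}-{end}>")
        parts.append(latents)
    parts += ["<|vision_end|>", "<obj_traj_end>"]
    return " ".join(parts)

block = build_object_block(0, [(0, 2), (2, 4)], num_latents=2)
print(block)
```

With two windows and two latents per resampler, the block contains six latent slots: two for the OTR summary and two for each window.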

How to Get Started

Installation

pip install "transformers>=4.54.0" torch pycocotools

Prepare Inputs

Two inputs are required alongside the video:

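  • Video: any format supported by qwen_vl_utils (e.g. .mp4)
  • Mask JSON: per-frame, per-object RLE segmentation masks in COCO (pycocotools) format. The outer list indexes frames and each inner list indexes objects; the // comments below are annotations only, since JSON itself does not allow comments: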
[
  // frame 0
  [{"size": [H, W], "counts": "..."}, {"size": [H, W], "counts": "..."}, ...],
  // frame 1
  [...]
]

See example/2401075277_rle.json for a complete example.
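
Before running inference, it can help to sanity-check a mask file against the layout described above. The following is a minimal sketch using only the standard library. `validate_masks` is a hypothetical helper, the `"<rle>"` strings are placeholders rather than real RLE data, and the checks cover only the nesting and field types shown in this section.

```python
import json

def validate_masks(frames):
    """Check the nested list layout of a mask file (hypothetical helper).

    frames: list over frames; each frame is a list over objects; each object
            is a dict with "size" ([H, W]) and "counts" (RLE string).
    Returns (num_frames, max_objects_in_any_frame).
    Raises ValueError on a malformed entry.
    """
    if not isinstance(frames, list) or not frames:
        raise ValueError("expected a non-empty list of frames")
    max_objects = 0
    for f, frame in enumerate(frames):
        if not isinstance(frame, list):
            raise ValueError(f"frame {f}: expected a list of objects")
        for o, rle in enumerate(frame):
            size = rle.get("size") if isinstance(rle, dict) else None
            if (not isinstance(size, list) or len(size) != 2
                    or not all(isinstance(v, int) and v > 0 for v in size)):
                raise ValueError(f"frame {f}, object {o}: bad 'size'")
            if not isinstance(rle.get("counts"), str):
                raise ValueError(f"frame {f}, object {o}: bad 'counts'")
        max_objects = max(max_objects, len(frame))
    return len(frames), max_objects

# Placeholder data: two frames, with two and one objects respectively.
raw = """
[
  [{"size": [4, 6], "counts": "<rle>"}, {"size": [4, 6], "counts": "<rle>"}],
  [{"size": [4, 6], "counts": "<rle>"}]
]
"""
shape = validate_masks(json.loads(raw))
print(shape)  # prints "(2, 2)"
```

Decoding the RLE strings into binary masks is left to pycocotools (`pycocotools.mask.decode`), which the installation step already pulls in.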

Run Inference

python inference.py \
    --model_path /path/to/vsg_release_model \
    --video_path /path/to/video.mp4 \
    --mask_path /path/to/masks.json \
    --out_dir ./output

CLI Arguments

| Argument | Default | Description |
|---|---|---|
| --model_path | required | Path to this model directory |
| --video_path | required | Input video file |
| --mask_path | required | Per-object RLE mask JSON |
| --out_dir | ./output | Directory in which output.txt is written |
| --max_objects | 40 | Maximum number of objects to process per video |
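
For readers wrapping the script in their own tooling, the argument table can be mirrored with `argparse`. This is a sketch reconstructed from the table alone; the parser inside inference.py may be defined differently, and `build_parser` is a hypothetical name.

```python
import argparse

def build_parser():
    """Build a parser that mirrors the documented CLI arguments (sketch)."""
    parser = argparse.ArgumentParser(description="TRASER inference (sketch)")
    parser.add_argument("--model_path", required=True,
                        help="Path to this model directory")
    parser.add_argument("--video_path", required=True,
                        help="Input video file")
    parser.add_argument("--mask_path", required=True,
                        help="Per-object RLE mask JSON")
    parser.add_argument("--out_dir", default="./output",
                        help="Directory in which output.txt is written")
    parser.add_argument("--max_objects", type=int, default=40,
                        help="Maximum number of objects to process per video")
    return parser

# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args([
    "--model_path", ".",
    "--video_path", "example/2401075277.mp4",
    "--mask_path", "example/2401075277_rle.json",
])
print(args.out_dir, args.max_objects)  # prints "./output 40"
```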

Quickstart with the Bundled Example

python inference.py \
    --model_path . \
    --video_path example/2401075277.mp4 \
    --mask_path example/2401075277_rle.json \
    --out_dir ./output

Python API

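import torch
from transformers import AutoProcessor, AutoTokenizer
# modeling_traser.py lives in the repository root, so run from there
# (or add that directory to PYTHONPATH) for this import to resolve.
from modeling_traser import TRASER

model_path = "/path/to/vsg_release_model"
device = "cuda"

# Load the TRASER weights in bfloat16 and move the model to the GPU.
model = TRASER.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
# Start from the base Qwen2.5-VL processor, then swap in the tokenizer shipped
# with this model so that the tokens in added_tokens.json are available.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
processor.tokenizer = AutoTokenizer.from_pretrained(model_path)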

Then follow the preprocessing steps in inference.py: load masks → build object mask tensors → select_tokens → rearrange_token → model.generate.
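
As an illustration of the token selection step, the sketch below keeps the grid cells whose mask coverage meets a threshold. It is a simplified stand-in, not the repository's `select_tokens`: the function name `select_patches`, its signature, the patch size, and the threshold value are all assumptions, and the real code operates on tensors rather than nested lists.

```python
def select_patches(mask, patch, threshold):
    """Return (row, col) indices of patches whose mask coverage >= threshold.

    mask:      2D list of 0/1 values; height and width divisible by `patch`
    patch:     side length in pixels of one visual token's footprint (assumed)
    threshold: minimum fraction of covered pixels, in [0, 1] (assumed)
    """
    height, width = len(mask), len(mask[0])
    selected = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            # Count object pixels inside this patch.
            covered = sum(
                mask[y][x]
                for y in range(r, r + patch)
                for x in range(c, c + patch)
            )
            if covered / (patch * patch) >= threshold:
                selected.append((r // patch, c // patch))
    return selected

# Toy 4x4 mask split into four 2x2 patches.
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
strict = select_patches(mask, patch=2, threshold=0.5)
loose = select_patches(mask, patch=2, threshold=0.25)
print(strict)  # prints "[(0, 0)]"
print(loose)   # prints "[(0, 0), (1, 0), (1, 1)]"
```

A higher threshold keeps only patches that lie mostly inside the object, which reduces background leakage at the cost of dropping thin or boundary regions.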


Repository Structure

├── modeling_traser.py           # TRASER model class
├── inference.py                 # End-to-end inference script
├── config.json                  # Model configuration
├── generation_config.json       # Default generation hyperparameters
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── tokenizer_config.json
├── vocab.json
├── merges.txt
├── added_tokens.json
├── special_tokens_map.json
├── chat_template.jinja
├── resampler_utils/
│   ├── token_selection.py       # Mask-based visual token selection (coverage threshold)
│   └── token_arrangement.py     # Token sequence rearrangement with OTR/TWR injection
├── qwen_vl_vsg_utils/           # Adapted Qwen-VL video processing utilities
├── static/
│   └── image.png                # Architecture diagram
└── example/
    ├── 2401075277.mp4           # Example video
    └── 2401075277_rle.json      # Example RLE segmentation masks

Training Data

TRASER is trained on SVG2, a large-scale automatically annotated video scene graph dataset:

  • ~636K videos with dense panoptic, per-frame annotations
  • ~6.6M objects · ~52M attributes · ~6.7M relations

Citation

@article{gao2026svg2,
  author = {Gao, Ziqi and Zhang, Jieyu and others},
  title  = {Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos},
  year   = {2026}
}