---
license: mit
pipeline_tag: keypoint-detection
---

# Track-On2: Enhancing Online Point Tracking with Memory

📚 Paper - 🌐 Project Page - 💻 Code

## Overview

Track-On2 is an efficient online point tracking model that processes videos frame by frame with a compact transformer memory, using no future frames and no temporal windows. It builds on Track-On with improved accuracy and efficiency.

Track-On Overview

## Pretrained models

We provide two pretrained Track-On2 checkpoints, each using a different backbone:

- **Track-On2 with DINOv3** (download here). This checkpoint uses the DINOv3 visual backbone.
  - To use it, you must separately obtain the official pretrained DINOv3 weights of `dinov3-vits16plus` by requesting access through Hugging Face (see the sketch after this list).
  - Our released checkpoints do not include backbone weights in order to comply with DINOv3's licensing and distribution policy.
- **Track-On2 with DINOv2** (download here). No additional permissions or downloads are needed.
  - It offers performance comparable to, and often stronger than, the DINOv3 variant.
  - Recommended if you want a quick setup without external dependencies.
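If you go with the DINOv3 variant, one way to fetch the gated backbone weights after your access request is approved is via `huggingface_hub`. This is a minimal sketch: the repository id and the config field that should point to the downloaded weights are assumptions, so verify them against the official DINOv3 release and this repo's config files.

```python
# Minimal sketch for fetching gated DINOv3 backbone weights.
# The repo id below is an assumption; verify it against the official DINOv3 release.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="facebook/dinov3-vits16plus-pretrain-lvd1689m",  # assumed id, may differ
    # token="hf_...",  # or authenticate beforehand with `huggingface-cli login`
)
print("DINOv3 weights downloaded to:", local_dir)
# Point the backbone weight path in your Track-On2 config (e.g., ./config/test.yaml)
# to this directory.
```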

## Usage

You can track points on a video using the `Predictor` class.

### Minimal example

```python
import torch
from model.trackon_predictor import Predictor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize
# `args` is the model configuration loaded from the YAML config file
# (e.g., ./config/test.yaml); see demo.py for how it is constructed.
model = Predictor(args, checkpoint_path="path/to/checkpoint.pth").to(device).eval()

# Inputs
# video:   (1, T, 3, H, W) with values in range 0-255
# queries: (1, N, 3) with rows = (t, x, y) in pixel coordinates,
#          or None to enable the model's uniform grid querying
video = ...          # e.g., torchvision.io.read_video -> (T, H, W, 3) -> (T, 3, H, W) -> add batch dim
queries = ...        # e.g., torch.tensor([[0, 190, 190], [0, 200, 190], ...]).unsqueeze(0).to(device)

# Inference
traj, vis = model(video, queries)

# Outputs
# traj: (1, T, N, 2)  -> per-point (x, y) in pixels
# vis:  (1, T, N)     -> per-point visibility in {0, 1}
```
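As a concrete illustration of the input preparation hinted at in the comments above, the sketch below loads a clip with `torchvision.io.read_video` and builds a small query tensor, reusing `model` and `device` from the minimal example. The video path and query coordinates are placeholders; adapt them to your data.

```python
# Sketch of input preparation for the minimal example above.
import torch
import torchvision

# Load the clip: read_video returns uint8 frames of shape (T, H, W, 3).
frames, _, _ = torchvision.io.read_video("path/to/video.mp4", pts_unit="sec")

# Reorder to (T, 3, H, W), add a batch dimension, and keep values in 0-255.
video = frames.permute(0, 3, 1, 2).float().unsqueeze(0).to(device)

# Three example queries, all starting at frame 0, given as (t, x, y) in pixels.
queries = torch.tensor(
    [[0, 190, 190], [0, 200, 190], [0, 210, 190]], dtype=torch.float32
).unsqueeze(0).to(device)

traj, vis = model(video, queries)

# Example use of the outputs: positions of the points visible in the last frame.
visible = vis[0, -1].bool()
print(traj[0, -1][visible])
```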

### Using `demo.py`

A ready-to-run script (demo.py) handles loading, preprocessing, inference, and visualization.

Given:

- `$video_path`: Path to the input video file (e.g., `.mp4`)
- `$config_path`: Path to the model's YAML config file (default: `./config/test.yaml`)
- `$ckpt_path`: Path to the Track-On2 checkpoint (`.pth`)
- `$output_path`: Path to save the rendered tracking video (e.g., `demo_output.mp4`)
- `$use_grid`: Whether to use a uniform grid of queries (`true` or `false`)

you can run the demo with:

```bash
python demo.py \
    --video $video_path \
    --config $config_path \
    --ckpt $ckpt_path \
    --output $output_path \
    --use-grid $use_grid
```
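For instance, a concrete invocation on the bundled sample clip with uniform grid queries might look like the following; the checkpoint path is a placeholder for wherever you saved the downloaded `.pth` file.

```bash
# Example run on the sample video with uniform grid queries
# (checkpoint path is a placeholder).
python demo.py \
    --video media/sample.mp4 \
    --config ./config/test.yaml \
    --ckpt checkpoints/trackon2.pth \
    --output demo_output.mp4 \
    --use-grid true
```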

Running the model with uniform grid queries on the video at media/sample.mp4 produces the visualization shown below.

Sample Tracking

## Citation

If you find this work useful, please cite:

```bibtex
@article{Aydemir2025TrackOn2,
  title   = {{Track-On2}: Enhancing Online Point Tracking with Memory},
  author  = {Aydemir, G\"orkay and Xie, Weidi and G\"uney, Fatma},
  journal = {arXiv preprint arXiv:2509.19115},
  year    = {2025}
}

@InProceedings{Aydemir2025TrackOn,
  title     = {{Track-On}: Transformer-based Online Point Tracking with Memory},
  author    = {Aydemir, G\"orkay and Cai, Xiongyi and Xie, Weidi and G\"uney, Fatma},
  booktitle = {The Thirteenth International Conference on Learning Representations},
  year      = {2025}
}
```

## Acknowledgments

This repository incorporates code from public works including CoTracker, TAPNet, DINOv2, ViT-Adapter, and SPINO. We thank the authors for making their code available.