Multi-View 3D Point Tracking
Frano Rajič¹ · Haofei Xu¹ · Marko Mihajlovic¹ · Siyuan Li¹ · Irem Demir¹ · Emircan Gündoğdu¹ · Lei Ke² · Sergey Prokudin¹,³ · Marc Pollefeys¹,⁴ · Siyu Tang¹
¹ETH Zürich · ²Carnegie Mellon University · ³Balgrist University Hospital · ⁴Microsoft
MVTracker is the first data-driven multi-view 3D point tracker for tracking arbitrary 3D points across multiple cameras. It fuses multi-view features into a unified 3D feature point cloud, within which it leverages kNN-based correlation to capture spatiotemporal relationships across views. A transformer then iteratively refines the point tracks, handling occlusions and adapting to varying camera setups without per-sequence optimization.
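The fusion and kNN-correlation steps described above can be illustrated with a minimal NumPy sketch. It is not MVTracker's implementation: it assumes extrinsics are 3×4 world-to-camera [R | t] matrices, uses per-pixel features at full resolution instead of learned feature maps, and finds neighbours by brute force. The function names unproject_depth, fuse_views, and knn_correlation are hypothetical.

```python
import numpy as np


def unproject_depth(depth, K, extr):
    """Lift one depth map to world-space points.

    depth: [H, W] depth along the camera z-axis (0 = missing)
    K:     [3, 3] pinhole intrinsics
    extr:  [3, 4] world-to-camera [R | t] (assumed convention)
    returns [H*W, 3] world-space points in row-major pixel order
    """
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # camera-space rays with z = 1
    cam_pts = rays * depth.reshape(-1, 1)     # scale each ray by its depth
    R, t = extr[:, :3], extr[:, 3]
    return (cam_pts - t) @ R                  # x_world = R^T (x_cam - t)


def fuse_views(depths, feats, intrs, extrs):
    """Merge all views into one point cloud with per-point features.

    depths: [V, H, W], feats: [V, H, W, C], intrs: [V, 3, 3], extrs: [V, 3, 4]
    """
    points, features = [], []
    for d, f, K, E in zip(depths, feats, intrs, extrs):
        valid = d.reshape(-1) > 0             # skip pixels without depth
        points.append(unproject_depth(d, K, E)[valid])
        features.append(f.reshape(-1, f.shape[-1])[valid])
    return np.concatenate(points), np.concatenate(features)


def knn_correlation(query_xyz, query_feat, cloud_xyz, cloud_feat, k=4):
    """Correlate each query feature with its k nearest neighbours in the cloud.

    query_xyz: [N, 3], query_feat: [N, C], cloud_xyz: [M, 3], cloud_feat: [M, C]
    returns corr [N, k] (scaled dot products) and offsets [N, k, 3]
    """
    d2 = ((query_xyz[:, None, :] - cloud_xyz[None, :, :]) ** 2).sum(-1)  # [N, M] squared distances
    idx = np.argsort(d2, axis=1)[:, :k]                                   # brute-force kNN indices
    corr = np.einsum("nc,nkc->nk", query_feat, cloud_feat[idx]) / np.sqrt(query_feat.shape[-1])
    offsets = cloud_xyz[idx] - query_xyz[:, None, :]                      # neighbour positions relative to query
    return corr, offsets
```

Returning the relative offsets alongside the correlation values lets a downstream update module reason about where similar features lie around the current estimate; how MVTracker actually encodes this is not shown here.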
Abstract
We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike existing monocular trackers, which struggle with depth ambiguities and occlusion, or prior multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feed-forward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Given known camera poses and either sensor-based or estimated multi-view depth, our tracker fuses multi-view features into a unified point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks: Panoptic Studio and DexYCB, achieving median trajectory errors of 3.1 cm and 2.0 cm, respectively. Our method generalizes well to diverse camera setups of 1-8 views with varying vantage points and video lengths of 24-150 frames. By releasing our tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for real-world applications.
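To make the reported numbers concrete, a median trajectory error can be computed as below. This sketch assumes the metric is the median Euclidean distance over all scored (frame, point) pairs; the benchmarks' exact protocol (which points and frames are scored) may differ. The function name median_trajectory_error is hypothetical.

```python
import numpy as np


def median_trajectory_error(pred, gt, mask=None):
    """Median Euclidean error between predicted and ground-truth tracks.

    pred, gt: [T, N, 3] positions in metres
    mask:     optional [T, N] boolean array selecting which entries to score
    """
    err = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=-1)  # [T, N]
    if mask is not None:
        err = err[np.asarray(mask, dtype=bool)]  # keep only scored entries
    return float(np.median(err))
```

With positions in metres, a return value of 0.031 corresponds to the 3.1 cm reported on Panoptic Studio.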
Quick Start
This repo was validated on Python 3.10.12, PyTorch 2.3.0 (CUDA 12.1), cuDNN 8903, and gcc 11.3.0. To create a fresh minimal environment that runs the Hub demo and demo.py:
conda create -n 3dpt python=3.10.12 -y
conda activate 3dpt
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install -r https://raw.githubusercontent.com/ethz-vlg/mvtracker/refs/heads/main/requirements.txt
# Optional, speeds up the model
pip install --upgrade --no-build-isolation flash-attn==2.5.8 # Speeds up attention
# Speeds up kNN search; building it may require gcc 11.3.0:
#   conda install -c conda-forge gcc_linux-64=11.3.0 gxx_linux-64=11.3.0 gcc=11.3.0 gxx=11.3.0
pip install "git+https://github.com/ethz-vlg/pointcept.git@2082918#subdirectory=libs/pointops"
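Since both speed-ups are optional, it can help to check which ones are importable before running the demo. This sketch assumes the packages install the modules flash_attn and pointops; the helper name available_accelerators is hypothetical. It uses importlib.util.find_spec, so nothing is actually imported.

```python
import importlib.util

# Assumed module names: flash-attn installs `flash_attn`, Pointcept's libs/pointops installs `pointops`.
OPTIONAL_ACCELERATORS = {"flash_attn": "faster attention", "pointops": "faster kNN search"}


def available_accelerators(modules=OPTIONAL_ACCELERATORS):
    """Return {module_name: True/False} depending on whether the module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, ok in available_accelerators().items():
        print(f"{name:12s} {'found' if ok else 'missing'} ({OPTIONAL_ACCELERATORS[name]})")
```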
With the minimal dependencies in place, you can try MVTracker directly via PyTorch Hub:
import torch
import numpy as np
from huggingface_hub import hf_hub_download
device = "cuda" if torch.cuda.is_available() else "cpu"
mvtracker = torch.hub.load("ethz-vlg/mvtracker", "mvtracker", pretrained=True, device=device)
# Example input from demo sample (downloaded automatically)
sample = np.load(hf_hub_download("ethz-vlg/mvtracker", "data_sample.npz"))
rgbs = torch.from_numpy(sample["rgbs"]).float()
depths = torch.from_numpy(sample["depths"]).float()
intrs = torch.from_numpy(sample["intrs"]).float()
extrs = torch.from_numpy(sample["extrs"]).float()
query_points = torch.from_numpy(sample["query_points"]).float()
with torch.no_grad():
    # [None] adds a batch dimension; RGB values are scaled to [0, 1]
    results = mvtracker(
        rgbs=rgbs[None].to(device) / 255.0,
        depths=depths[None].to(device),
        intrs=intrs[None].to(device),
        extrs=extrs[None].to(device),
        query_points_3d=query_points[None].to(device),
    )
pred_tracks = results["traj_e"].cpu() # [T,N,3]
pred_vis = results["vis_e"].cpu() # [T,N]
print(pred_tracks.shape, pred_vis.shape)
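A quick way to inspect the predictions is to summarize each track. This sketch assumes the [T,N,3] and [T,N] layouts given in the comments of the Hub example and that higher visibility values mean "more likely visible"; the helper summarize_tracks and its vis_threshold parameter are hypothetical.

```python
import numpy as np


def summarize_tracks(tracks, vis, vis_threshold=0.5):
    """Per-track summary of predicted trajectories.

    tracks: [T, N, 3] predicted 3D positions
    vis:    [T, N] visibility values (assumed: higher means more likely visible)
    returns (displacement [N], visible_fraction [N])
    """
    tracks = np.asarray(tracks, dtype=np.float64)
    visible = np.asarray(vis, dtype=np.float64) > vis_threshold      # [T, N] boolean mask
    displacement = np.linalg.norm(tracks[-1] - tracks[0], axis=-1)  # first-to-last frame distance
    visible_fraction = visible.mean(axis=0)                         # share of frames marked visible
    return displacement, visible_fraction
```

For example, summarize_tracks(pred_tracks.numpy(), pred_vis.numpy()) returns how far each queried point moved over the clip and how often it was predicted visible.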
Citation
If you find our repository useful, please consider giving it a star ⭐ and citing our work:
@inproceedings{rajic2025mvtracker,
title = {Multi-View 3D Point Tracking},
author = {Raji{\v{c}}, Frano and Xu, Haofei and Mihajlovic, Marko and Li, Siyuan and Demir, Irem and G{\"u}ndo{\u{g}}du, Emircan and Ke, Lei and Prokudin, Sergey and Pollefeys, Marc and Tang, Siyu},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2025}
}