Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
Abstract
Recent advances in DUSt3R have enabled robust estimation of dense point clouds and camera parameters for static scenes, leveraging Transformer architectures and direct supervision on large-scale 3D datasets. In contrast, the limited scale and diversity of available 4D datasets remain a major bottleneck for training a highly generalizable 4D model. This constraint has driven conventional 4D methods to fine-tune 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depth maps. In this work, we take the opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. We find that the attention layers in DUSt3R inherently encode rich information about camera and object motion. By carefully disentangling these attention maps, we achieve accurate dynamic-region segmentation, camera pose estimation, and 4D dense point map reconstruction. Extensive experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods trained or fine-tuned on extensive dynamic datasets. Our code is publicly available for research purposes at https://easi3r.github.io/
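The core idea, repurposing attention maps to separate moving objects from the static background at inference time, can be illustrated with a minimal sketch. Note this is not DUSt3R's actual attention layout: the array shapes, the temporal-variance heuristic, and the threshold `tau` are all illustrative assumptions, standing in for the paper's disentangling procedure.

```python
import numpy as np

def dynamic_region_mask(attn, h, w, tau=0.5):
    """Sketch: derive a dynamic-region mask from aggregated attention.

    attn: (T, N) array -- hypothetical per-token attention strength,
          aggregated over heads, for T frames and N = h*w spatial tokens.
    Heuristic: the static background receives temporally consistent
    attention, while moving objects cause large frame-to-frame variation.
    """
    # Normalize attention to [0, 1] so the variance scale is comparable.
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
    var = attn.var(axis=0)                              # (N,) temporal variance
    score = (var - var.min()) / (var.max() - var.min() + 1e-8)
    mask = (score > tau).astype(np.uint8)               # 1 = dynamic token
    return mask.reshape(h, w)

# Toy example: an 8x8 token grid over 4 frames with one inconsistent token.
rng = np.random.default_rng(0)
attn = np.full((4, 64), 0.8) + 0.01 * rng.standard_normal((4, 64))
attn[:, 27] = [0.1, 0.9, 0.2, 0.8]                      # "moving object" token
mask = dynamic_region_mask(attn, 8, 8)
print(mask.sum())  # -> 1: only the inconsistent token is flagged dynamic
```

Once such a mask is available, it can gate which tokens contribute to camera pose estimation (static regions) versus object reconstruction (dynamic regions), which is the disentanglement the abstract describes.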
Community
🦣Easi3R: 4D Reconstruction Without Training!
Limited 4D datasets? No problem, we can easily adapt #DUSt3R for 4D reconstruction → no training needed!
#Easi3R - By disentangling and repurposing DUSt3R’s attention maps for robust dynamic segmentation, Easi3R makes 4D reconstruction easier than ever!
🔗Page: https://easi3r.github.io
📄Paper: https://arxiv.org/abs/2503.2439
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SIRE: SE(3) Intrinsic Rigidity Embeddings (2025)
- Can Video Diffusion Model Reconstruct 4D Geometry? (2025)
- GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors (2025)
- AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos (2025)
- Distilling Monocular Foundation Model for Fine-grained Depth Completion (2025)
- HORT: Monocular Hand-held Objects Reconstruction with Transformers (2025)
- VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation (2025)