Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning
Abstract
Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. However, despite their success in 2D comprehension, their ability to grasp 3D spatial relationships remains unclear. In this work, we evaluate and enhance the 3D awareness of ViT-based models. We begin by systematically assessing their ability to learn 3D equivariant features, specifically examining the consistency of semantic embeddings across different viewpoints. Our findings indicate that improved 3D equivariance leads to better performance on various downstream tasks, including pose estimation, tracking, and semantic transfer. Building on this insight, we propose a simple yet effective finetuning strategy based on 3D correspondences, which significantly enhances the 3D correspondence understanding of existing vision models. Remarkably, even finetuning on a single object for just one iteration results in substantial performance gains. To support further advances in 3D-aware vision models, our code and resources are publicly available at https://github.com/qq456cvb/3DCorrEnhance.
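To make the notion of "consistency of semantic embeddings across different viewpoints" concrete, the following is a minimal sketch of one way such consistency could be scored. It is not the paper's evaluation protocol, which the abstract does not specify. The function names `cosine` and `correspondence_accuracy`, the nearest-neighbor criterion, and the toy feature vectors are all assumptions made for illustration. Given per-point features from two views and ground-truth pairs of points that project from the same 3D surface location, it reports how often the nearest feature in the second view is the true match.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def correspondence_accuracy(feats_a, feats_b, matches):
    """Fraction of matched points in view A whose most similar feature
    in view B (by cosine similarity) is the ground-truth partner.

    feats_a, feats_b: lists of per-point feature vectors, one list per view.
    matches: list of (i, j) pairs meaning point i in view A and point j in
             view B are projections of the same 3D surface point
             (hypothetical input format, assumed for this sketch).
    """
    if not matches:
        return 0.0
    correct = 0
    for i, j in matches:
        sims = [cosine(feats_a[i], fb) for fb in feats_b]
        best = max(range(len(sims)), key=sims.__getitem__)
        correct += int(best == j)
    return correct / len(matches)

# Toy data: two points seen from two views, with slightly perturbed features.
feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[0.9, 0.1], [0.2, 0.8]]
matches = [(0, 0), (1, 1)]
score = correspondence_accuracy(feats_a, feats_b, matches)
```

A higher score means features of the same 3D point stay closer to each other than to features of other points as the viewpoint changes, which is the property the abstract refers to as 3D equivariance.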
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding (2024)
- Can Generative Video Models Help Pose Estimation? (2024)
- 3D$^2$-Actor: Learning Pose-Conditioned 3D-Aware Denoiser for Realistic Gaussian Avatar Modeling (2024)
- Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs (2025)
- GAGS: Granularity-Aware Feature Distillation for Language Gaussian Splatting (2024)
- DenseMatcher: Learning 3D Semantic Correspondence for Category-Level Manipulation from a Single Demo (2024)
- TSGaussian: Semantic and Depth-Guided Target-Specific Gaussian Splatting from Sparse Views (2024)