Aligning Text, Images, and 3D Structure Token-by-Token
Abstract
A unified framework that aligns language, images, and structured 3D scenes in a single autoregressive model is proposed and evaluated across four 3D tasks and four datasets.
Creating machines capable of understanding the world in 3D is essential for assisting designers who build and edit 3D environments and robots that navigate and interact within three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes, and we provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We evaluate performance across four core 3D tasks (rendering, recognition, instruction-following, and question-answering) and four 3D datasets, both synthetic and real-world. We extend our approach to reconstruct complex 3D object shapes by enriching our 3D modality with quantized shape encodings, and we show our model's effectiveness on real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/
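To make the idea of treating a structured 3D scene as a token sequence more concrete, here is a minimal sketch of one way such a scene could be serialized for an autoregressive model. This is not the paper's actual tokenization: the `SceneObject` schema, the special-token names, the coordinate range, and the number of bins are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneObject:
    """One object in a structured 3D scene (hypothetical schema)."""
    shape: str
    color: str
    position: Tuple[float, float, float]  # (x, y, z) in scene units


def quantize(value: float, lo: float = -3.0, hi: float = 3.0, bins: int = 120) -> int:
    """Map a continuous coordinate to an integer bin index in [0, bins - 1].

    The range [lo, hi] and the bin count are assumed values, not taken
    from the paper.
    """
    clipped = min(max(value, lo), hi)
    idx = int((clipped - lo) / (hi - lo) * bins)
    return min(idx, bins - 1)  # keep value == hi inside the last bin


def scene_to_tokens(objects: List[SceneObject]) -> List[str]:
    """Flatten a list of objects into a sequence of string tokens.

    Special tokens such as <scene> and <object> are illustrative names.
    """
    tokens = ["<scene>"]
    for obj in objects:
        tokens.append("<object>")
        tokens += ["<shape>", obj.shape, "<color>", obj.color, "<position>"]
        tokens += [f"<coord_{quantize(c)}>" for c in obj.position]
        tokens.append("</object>")
    tokens.append("</scene>")
    return tokens


if __name__ == "__main__":
    scene = [SceneObject("cube", "red", (0.0, -3.0, 3.0))]
    print(" ".join(scene_to_tokens(scene)))
```

A sequence like this can be concatenated with text and image tokens so that a single decoder-only model predicts all three modalities, which is the general setting the abstract describes. How finely coordinates are discretized is one example of the data-representation questions the abstract says the cookbook addresses.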
Community
Introducing Kyvo – a decoder-only LLM that aligns text, images & structured 3D scenes token-by-token.
From a single image, it reconstructs individual 3D shapes and their locations, renders & edits scenes, answers spatial questions, and more.
Project webpage: https://glab-caltech.github.io/kyvo/
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- MLLMs Need 3D-Aware Representation Supervision for Scene Understanding (2025)
- ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding (2025)
- MetaScenes: Towards Automated Replica Creation for Real-world 3D Scans (2025)
- VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction (2025)
- Pts3D-LLM: Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models (2025)
- 3D CoCa: Contrastive Learners are 3D Captioners (2025)
- OpenMaskDINO3D : Reasoning 3D Segmentation via Large Language Model (2025)