arxiv:2506.08002

Aligning Text, Images, and 3D Structure Token-by-Token

Published on Jun 9
Submitted by aadarsh99 on Jun 11

Abstract

AI-generated summary: A unified language, image, and 3D scene model framework is proposed, achieving optimal training and performance across various 3D tasks and datasets.

Creating machines capable of understanding the world in 3D is essential for assisting designers who build and edit 3D environments and for robots navigating and interacting within three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes, and provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We evaluate performance across four core 3D tasks -- rendering, recognition, instruction-following, and question-answering -- and four 3D datasets, both synthetic and real-world. We extend our approach to reconstruct complex 3D object shapes by enriching our 3D modality with quantized shape encodings, and show our model's effectiveness on real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/
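The abstract describes structured 3D scenes as a token modality alongside text and image tokens, with quantized shape encodings. The paper's exact tokenization is not reproduced on this page, so the following is a minimal Python sketch under assumed conventions: a hypothetical object-list scene representation and a uniform coordinate quantizer. The special tokens and value ranges are illustrative, not Kyvo's actual vocabulary.

```python
# Minimal sketch (assumed conventions, not Kyvo's actual scheme): flattening a
# structured 3D scene into discrete tokens so a decoder-only LLM can model it
# autoregressively alongside text and image tokens.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneObject:
    category: str                          # e.g. "chair"
    position: Tuple[float, float, float]   # (x, y, z) in scene coordinates
    scale: float


def quantize(value: float, low: float, high: float, bins: int = 256) -> int:
    """Clamp a continuous value to [low, high] and map it to one of `bins` levels."""
    value = min(max(value, low), high)
    return round((value - low) / (high - low) * (bins - 1))


def scene_to_tokens(objects: List[SceneObject]) -> List[str]:
    """Serialize a scene as <scene> (<obj> category <pos_*> x3 <scale_*> </obj>)* </scene>."""
    tokens = ["<scene>"]
    for obj in objects:
        tokens.append("<obj>")
        tokens.append(obj.category)
        tokens += [f"<pos_{quantize(c, -4.0, 4.0)}>" for c in obj.position]
        tokens.append(f"<scale_{quantize(obj.scale, 0.0, 2.0)}>")
        tokens.append("</obj>")
    tokens.append("</scene>")
    return tokens


if __name__ == "__main__":
    scene = [
        SceneObject("chair", (0.5, 0.0, -1.2), 1.0),
        SceneObject("table", (-0.3, 0.0, 0.8), 1.4),
    ]
    print(scene_to_tokens(scene))
```

Discretizing scene attributes this way would let them share the same next-token prediction setup as text, which is presumably what "token-by-token" alignment refers to, though the abstract notes that modality-specific objectives are also studied.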

Community

Paper submitter

Introducing Kyvo – a decoder-only LLM that aligns text, images & structured 3D scenes token-by-token.
From a single image, it reconstructs individual 3D shapes and their locations, renders & edits scenes, answers spatial questions, and more.
Project webpage: https://glab-caltech.github.io/kyvo/
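As a rough illustration of how a single decoder-only model could serve several of these tasks, the sketch below frames recognition (image to scene), rendering (scene to image), and question-answering as next-token prediction over one mixed token stream. The segment markers and task layouts are assumptions for illustration, not Kyvo's actual interface.

```python
# Illustrative only: composing mixed-modality sequences for a decoder-only model.
# The model would be trained to generate the final segment given the preceding ones.
from typing import List


def build_sequence(task: str, image_tokens: List[str], scene_tokens: List[str],
                   question: str = "", answer: str = "") -> List[str]:
    if task == "recognition":   # image -> structured 3D scene
        return ["<image>"] + image_tokens + ["</image>", "<scene>"] + scene_tokens + ["</scene>"]
    if task == "rendering":     # structured 3D scene -> image
        return ["<scene>"] + scene_tokens + ["</scene>", "<image>"] + image_tokens + ["</image>"]
    if task == "qa":            # image + question -> answer
        return (["<image>"] + image_tokens + ["</image>", "<question>"] + question.split()
                + ["</question>", "<answer>"] + answer.split() + ["</answer>"])
    raise ValueError(f"unknown task: {task}")


# Example: a recognition sequence; at inference the model would generate the
# scene segment after seeing the image tokens (e.g. from a VQ image tokenizer).
seq = build_sequence(
    task="recognition",
    image_tokens=["<img_17>", "<img_203>", "<img_5>"],
    scene_tokens=["<obj>", "chair", "<pos_120>", "<pos_128>", "<pos_96>", "</obj>"],
)
print(seq)
```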

