FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens
Abstract
FuseLIP is a transformer-based architecture that operates on a shared vocabulary of text and image tokens to produce multimodal embeddings, outperforming existing models on tasks such as VQA and text-guided image retrieval.
Contrastive language-image pre-training aligns the features of text-image pairs in a common latent space via distinct encoders for each modality. While this approach achieves impressive performance in several zero-shot tasks, it cannot natively handle multimodal inputs, i.e., encode image and text into a single feature vector. As a remedy, it is common practice to use additional modules to merge the features extracted by the unimodal encoders. In this work, we present FuseLIP, an alternative architecture for multimodal embedding. Leveraging recent progress in discrete image tokenizers, we propose to use a single transformer model which operates on an extended vocabulary of text and image tokens. This early fusion approach allows the different modalities to interact at every depth of encoding and yields richer representations than common late fusion. We collect new datasets for multimodal pre-training and evaluation, designing challenging tasks for multimodal encoder models. We show that FuseLIP outperforms other approaches in multimodal embedding tasks such as VQA and text-guided image transformation retrieval, while being comparable to baselines on unimodal tasks.
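To make the early-fusion idea concrete, the sketch below shows how a single transformer can encode text tokens and discrete image tokens over one extended vocabulary: image-token ids are shifted past the text vocabulary, both sequences are concatenated, and the fused sequence is encoded jointly. All module names, dimensions, the pooling choice, and the tokenizer interface are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of early fusion over an extended text+image vocabulary.
# Assumptions (not from the paper): vocabulary sizes, model width/depth,
# learned positional embeddings, and mean pooling into the embedding space.
import torch
import torch.nn as nn


class EarlyFusionEncoder(nn.Module):
    def __init__(self, text_vocab_size=32000, image_vocab_size=8192,
                 dim=512, depth=8, heads=8, max_len=256):
        super().__init__()
        # One shared embedding table over the extended vocabulary:
        # ids [0, text_vocab_size) are text tokens,
        # ids [text_vocab_size, text_vocab_size + image_vocab_size) are image tokens.
        self.text_vocab_size = text_vocab_size
        self.token_emb = nn.Embedding(text_vocab_size + image_vocab_size, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Linear(dim, dim)  # projection into the contrastive space

    def forward(self, text_ids, image_ids=None):
        # text_ids:  (B, T_text) indices from a text tokenizer
        # image_ids: (B, T_img)  indices from a discrete image tokenizer (optional)
        if image_ids is not None:
            image_ids = image_ids + self.text_vocab_size   # shift into extended vocab
            ids = torch.cat([text_ids, image_ids], dim=1)   # early fusion: one sequence
        else:
            ids = text_ids
        x = self.token_emb(ids) + self.pos_emb[:, : ids.size(1)]
        x = self.encoder(x)    # text and image tokens attend to each other at every layer
        x = x.mean(dim=1)      # simple mean pooling (an assumption)
        return nn.functional.normalize(self.proj(x), dim=-1)


if __name__ == "__main__":
    model = EarlyFusionEncoder()
    text = torch.randint(0, 32000, (2, 16))   # dummy text token ids
    image = torch.randint(0, 8192, (2, 64))   # dummy image token ids (e.g. an 8x8 grid)
    emb = model(text, image)                  # one embedding per multimodal input
    print(emb.shape)                          # torch.Size([2, 512])
```

Because image and text tokens share a single sequence, attention mixes the modalities at every layer; in late fusion, by contrast, separate unimodal encoders interact only through a merging module applied to their final features.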
Community
Related papers recommended by the Semantic Scholar API (via Librarian Bot):
- Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs (2025)
- Learning Joint ID-Textual Representation for ID-Preserving Image Synthesis (2025)
- UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings (2025)
- TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation (2025)
- DeepMLF: Multimodal language model with learnable tokens for deep fusion in sentiment analysis (2025)
- Emerging Properties in Unified Multimodal Pretraining (2025)
- CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment (2025)