Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective
Abstract
Diffusion language models, owing to their bidirectional architecture, outperform autoregressive LLM embeddings on long-document and reasoning-intensive text retrieval tasks.
Large language model (LLM)-based embedding models, benefiting from large-scale pre-training and post-training, have begun to surpass BERT- and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which is misaligned with the bidirectional nature of text embedding tasks. To this end, we propose adopting diffusion language models for text embeddings, motivated by their inherently bidirectional architecture and recent success in matching or surpassing LLMs, especially on reasoning tasks. We present the first systematic study of diffusion language models as embedding models; our model outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, and 2% on instruction-following retrieval, and achieves competitive performance on traditional text embedding benchmarks. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.
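To make the attention-direction argument concrete, here is a minimal, self-contained PyTorch sketch (illustrative only, not the paper's released code; the toy `attention` helper and dimensions are ours). With a causal mask, perturbing the last token leaves every earlier token state untouched, so a mean-pooled embedding built from causal states carries little information from late tokens back into the rest of the sequence; bidirectional attention propagates the change to all positions.

```python
# Illustrative sketch only (not the paper's code): single-head self-attention
# with mean pooling, comparing a causal mask against bidirectional attention.
import torch
import torch.nn.functional as F

def attention(x: torch.Tensor, causal: bool) -> torch.Tensor:
    """Toy self-attention over token states x of shape (seq_len, dim)."""
    scores = x @ x.T / x.shape[-1] ** 0.5
    if causal:
        n = x.shape[0]
        future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))  # hide tokens j > i
    return F.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
x = torch.randn(6, 16)           # six toy token states
x2 = x.clone(); x2[-1] += 1.0    # perturb only the final token

# Causal: positions 0..4 never attend to token 5, so their states (and most
# of the mean-pooled embedding) are unchanged by the perturbation.
print(torch.allclose(attention(x2, causal=True)[:-1],
                     attention(x, causal=True)[:-1]))    # True
# Bidirectional: every position attends to token 5, so the whole sequence,
# and hence the pooled embedding, reflects the change.
print(torch.allclose(attention(x2, causal=False)[:-1],
                     attention(x, causal=False)[:-1]))   # False (almost surely)

embedding = attention(x, causal=False).mean(dim=0)  # one mean-pooled embedding
```

This is the intuition behind the long-document results: under causal attention, early positions cannot encode global context that appears later in the text, while a bidirectional encoder lets every token state summarize the full input.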
Community
Wondering how embeddings from Text Diffusion Models ✨ compare to those from autoregressive LLMs 🦙? Introducing DiffEmbed — a diffusion-based embedding model that excels in long-document and reasoning-intensive retrieval.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Hakim: Farsi Text Embedding Model (2025)
- Unleashing the Power of LLMs in Dense Retrieval with Query Likelihood Modeling (2025)
- llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length (2025)
- CSPLADE: Learned Sparse Retrieval with Causal Language Models (2025)
- MedEIR: A Specialized Medical Embedding Model for Enhanced Information Retrieval (2025)
- ModernGBERT: German-only 1B Encoder Model Trained from Scratch (2025)
- From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models (2025)