arxiv:2511.03276

Diffusion Language Models are Super Data Learners

Published on Nov 5
· Submitted by Jinjie Ni on Nov 6
#1 Paper of the day
Abstract

Diffusion language models outperform autoregressive models in low-data settings due to any-order modeling, iterative bidirectional denoising, and Monte Carlo augmentation, and maintain advantages even at scale.

AI-generated summary

Under strictly controlled pre-training settings, we observe a Crossover: when unique data is limited, diffusion language models (DLMs) consistently surpass autoregressive (AR) models by training for more epochs. The crossover shifts later with more or higher-quality data, earlier with larger models, and persists across dense and sparse architectures. We attribute the gains to three compounding factors: (1) any-order modeling, (2) super-dense compute from iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation; input or parameter noise improves AR under data constraints but cannot close the gap. At scale, a 1.7B DLM trained with a ~1.5T-token compute budget on 10B unique Python tokens overtakes an AR coder trained with strictly matched settings. In addition, a 1B-parameter DLM achieves >56% accuracy on HellaSwag and >33% on MMLU using only 1B tokens, without any special tricks, just by repeating standard pre-training data. We also show that rising validation cross-entropy does not imply degraded downstream performance in this regime.
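
For readers unfamiliar with how these factors enter training, here is a minimal sketch of a masked (absorbing-state) diffusion LM training step. This is not the authors' code: `model` is assumed to be any bidirectional Transformer returning per-position logits, and `MASK_ID` is a placeholder. Resampling the masking ratio and mask pattern on every pass is the built-in Monte Carlo augmentation, and predicting arbitrary subsets of positions from the rest is the any-order modeling.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # placeholder id for the absorbing [MASK] token (assumption)

def diffusion_lm_loss(model, tokens):
    """One training step of an absorbing-state masked-diffusion objective (sketch)."""
    b, n = tokens.shape
    # Fresh per-sequence masking ratio t ~ U(0, 1) every step (Monte Carlo augmentation).
    t = torch.rand(b, 1, device=tokens.device).clamp(min=1e-3)
    # Mask each position independently with probability t (any-order modeling).
    mask = torch.rand(b, n, device=tokens.device) < t
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)

    logits = model(corrupted)  # bidirectional: every position attends to all others
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), tokens.view(-1), reduction="none"
    ).view(b, n)
    # Loss only on masked positions, reweighted by 1/t (standard ELBO weighting
    # for absorbing-state discrete diffusion).
    return ((ce * mask) / t).sum() / mask.sum().clamp(min=1)
```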

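The "super-dense compute from iterative bidirectional denoising" refers to the sampler re-reading the whole sequence at every generation step. A common confidence-based unmasking loop (MaskGIT/LLaDA-style, not necessarily this paper's exact decoder) looks roughly like this:

```python
import torch

@torch.no_grad()
def iterative_denoise(model, seq_len, num_steps, mask_id, device="cpu"):
    """Generate by starting from an all-[MASK] sequence and unmasking a few
    positions per step, always with full bidirectional context (sketch)."""
    x = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(num_steps):
        still_masked = x.eq(mask_id)
        remaining = int(still_masked.sum())
        if remaining == 0:
            break
        logits = model(x)                              # [1, seq_len, vocab]
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # Commit the most confident predictions among still-masked positions.
        k = max(1, remaining // (num_steps - step))
        conf = conf.masked_fill(~still_masked, float("-inf"))
        idx = conf.topk(k, dim=-1).indices[0]
        x[0, idx] = pred[0, idx]
    # Running the model num_steps times over the full sequence is the "dense
    # compute" relative to a single left-to-right AR pass.
    return x
```
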
Community

Paper author · Paper submitter

The first work to empirically show that diffusion language models have much higher data potential than autoregressive ones at scale (up to 8B parameters, 1.5T tokens, 480 epochs). Clear crossovers appear across model sizes, data budgets, data qualities, model sparsities, etc.
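
For concreteness, here is the repetition arithmetic and the data loop implied by these runs, a sketch under the numbers quoted in the abstract; `repeated_batches` is a hypothetical helper, not the training code:

```python
from itertools import cycle, islice

def num_epochs(total_training_tokens: float, unique_tokens: float) -> float:
    """Epochs over the unique pool implied by a fixed token/compute budget."""
    return total_training_tokens / unique_tokens

# ~1.5T training tokens over 10B unique Python tokens, as in the coding run:
print(num_epochs(1.5e12, 10e9))  # -> 150.0 epochs over the same unique data

def repeated_batches(unique_batches, total_steps):
    """Yield total_steps batches by cycling a finite pool of unique batches.
    The AR baseline sees the same repeated stream; per the paper's argument,
    the diffusion objective re-corrupts each repeat with a fresh mask, so
    repetition keeps supplying new training signal."""
    yield from islice(cycle(unique_batches), total_steps)
```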

Work from the same series:

Quokka (large-scale DLM scaling law): https://github.com/JinjieNi/Quokka
OpenMoE 2 (MoE DLM): https://github.com/JinjieNi/OpenMoE2


High starting point, zero gains from added diversity—yet they’re hyping "Super Data Learners".

Paper author

I think you are describing the AR model, dude?


