Ambient Diffusion Omni (Ambient-o): Training Good Models with Bad Data

Model Description

Ambient Diffusion Omni (Ambient-o) is a framework for using low-quality, synthetic, and out-of-distribution images to improve the quality of diffusion models. Unlike traditional approaches that rely on highly curated datasets, Ambient-o extracts valuable signal from all available images during training, including data typically discarded as "low-quality."

This model card is for a text-to-image diffusion model trained for only two days on 8 H100 GPUs. The key innovation is the use of synthetic data as "noisy" samples.

Architecture

Ambient-o builds upon the MicroDiffusion codebase -- we use a Mixture-of-Experts Diffusion Transformer totaling ~1.1B parameters.

Text-to-Image Results

Ambient-o demonstrates improvements in text-to-image generation. Compared to the two baselines of 1) filtering out low-quality samples and 2) treating all data as equally clean, Ambient-o achieves greater diversity than 1) and higher quality than 2). In other words, it achieves visual improvements without sacrificing diversity.

Training Data Composition

The model was trained on a diverse mixture of datasets:

  • Conceptual Captions (CC12M): 12M image-caption pairs
  • Segment Anything (SA1B): 11.1M high-resolution images with LLaVA-generated captions
  • JourneyDB: 4.4M synthetic image-caption pairs from Midjourney
  • DiffusionDB: 10.7M quality-filtered synthetic image-caption pairs from Stable Diffusion

Data from DiffusionDB were treated as noisy samples.
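
As a rough illustration of how this mixture could be organized (the names and fields below are hypothetical, not the repository's actual data-loading API), each source can carry a flag marking whether its samples are treated as clean or noisy:

# Hypothetical sketch: tag each data source so the training loop can
# restrict noisy sources to high diffusion times (see Technical Approach).
DATA_SOURCES = [
    {"name": "cc12m",       "noisy": False},  # Conceptual Captions
    {"name": "sa1b",        "noisy": False},  # Segment Anything
    {"name": "journeydb",   "noisy": False},  # Midjourney synthetic data
    {"name": "diffusiondb", "noisy": True},   # treated as noisy samples
]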

Technical Approach

High Noise Regime

At high diffusion times, the model leverages the theoretical insight that added noise contracts distributional differences, reducing the mismatch between the high-quality target distribution and the mixed-quality training data. This creates a beneficial bias-variance trade-off: including low-quality samples increases the effective sample size and reduces estimator variance, at the cost of only a small bias.
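
A minimal sketch of how this idea could look in a training objective is shown below. The threshold t_min, the interpolation schedule, and the model call signature are illustrative assumptions, not the exact implementation.

import torch

def masked_diffusion_loss(model, x0, t, is_noisy, t_min=0.6):
    """Denoising loss that uses noisy-source samples only at high diffusion times.

    x0:       clean latents, shape (B, C, H, W)
    t:        sampled diffusion times in [0, 1], shape (B,)
    is_noisy: bool mask marking samples from low-quality sources, shape (B,)
    t_min:    illustrative threshold above which noisy samples contribute
    """
    noise = torch.randn_like(x0)
    # Simple linear interpolation toward noise; the actual schedule may differ.
    xt = (1 - t.view(-1, 1, 1, 1)) * x0 + t.view(-1, 1, 1, 1) * noise
    pred = model(xt, t)
    per_sample = ((pred - noise) ** 2).flatten(1).mean(dim=1)
    # Clean samples always contribute; noisy samples only when t >= t_min.
    mask = (~is_noisy) | (t >= t_min)
    return (per_sample * mask.float()).sum() / mask.float().sum().clamp(min=1)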

Low Noise Regime

At low diffusion times, the model exploits locality properties of natural images, using small image crops that allow borrowing high-frequency details from out-of-distribution or synthetic images whenever their crop-level marginal distributions match those of the target data.
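
A rough sketch of the crop-based idea follows; the crop size and sampling logic here are illustrative assumptions rather than the repository's data pipeline.

import torch

def random_crops(images, crop_size=64, num_crops=4):
    """Extract random square crops so that only local statistics are used.

    At low diffusion times, small crops from out-of-distribution or synthetic
    images can supply high-frequency detail as long as their local (patch-level)
    statistics match the target distribution.
    """
    b, c, h, w = images.shape
    crops = []
    for _ in range(num_crops):
        top = torch.randint(0, h - crop_size + 1, (1,)).item()
        left = torch.randint(0, w - crop_size + 1, (1,)).item()
        crops.append(images[:, :, top:top + crop_size, left:left + crop_size])
    return torch.cat(crops, dim=0)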

Usage

from micro_diffusion.models.model import create_latent_diffusion
import torch

# Instantiate the latent diffusion model (MicroDiffusion architecture).
params = {
    'latent_res': 64,
    'in_channels': 4,
    'pos_interp_scale': 2.0,
}
model = create_latent_diffusion(**params).to('cuda')

# Path to the downloaded checkpoint (set this to your local file).
ckpt_path = 'path/to/checkpoint.pt'
checkpoint = torch.load(ckpt_path, map_location='cuda', weights_only=False)
model_dict = checkpoint['state']['model']

# Strip the 'dit.' prefix and convert the DiT parameters to float32 before loading.
float_model_params = {
    k.replace('dit.', ''): v.to(torch.float32) for k, v in model_dict.items() if 'dit' in k
}
model.dit.load_state_dict(float_model_params)

prompts = [
    "Pirate ship trapped in a cosmic maelstrom nebula, rendered in cosmic beach whirlpool engine, volumet",
    "A illustration from a graphic novel. A bustling city street under the shine of a full moon.",
]

model = model.eval()
gen_images = model.generate(prompt=prompts, num_inference_steps=30, 
                           guidance_scale=5.0, seed=42)
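
The generated images can then be written to disk. The return format of generate depends on the MicroDiffusion codebase; the snippet below assumes a batch of image tensors with values in [0, 1] and should be adjusted if PIL images are returned instead.

# Save each generated image to disk (assumes tensor outputs in [0, 1]).
from torchvision.utils import save_image

for i, img in enumerate(gen_images):
    save_image(img, f'sample_{i}.png')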

Citation

@article{daras2025ambient,
  title={Ambient Diffusion Omni: Training Good Models with Bad Data},
  author={Daras, Giannis and Rodriguez-Munoz, Adrian and Klivans, Adam and Torralba, Antonio and Daskalakis, Constantinos},
  journal={arXiv preprint},
  year={2025},
}

License

The model follows the license of the MicroDiffusion repo.
