We’re open-sourcing our text-to-image model and the process behind it

Published November 12, 2025
[Animation: very early steps in the model training]

We’ve been training a text-to-image model from scratch and are excited to share the first results: PRX, our open-source model, now available in 🤗Diffusers under an Apache 2.0 license.

This release is just the beginning. The idea is to make the whole process open, not just the final weights: how we trained, what worked, what didn’t, and all the details that usually stay hidden. We want this to become both a strong open model and a practical resource for anyone interested in training text-to-image models from scratch.

Over the next few weeks, we’ll share a series of posts going deeper into each part of the journey, from design experiments and architecture benchmarks to acceleration tricks and post-training methods. The first part of this series is already available, with more coming soon.

See examples ↓

Try it out

You can explore PRX in our demo, or load it directly in Diffusers:

import torch
from diffusers.pipelines.prx import PRXPipeline

pipe = PRXPipeline.from_pretrained(
    "Photoroom/prx-1024-t2i-beta",
    torch_dtype=torch.bfloat16
).to("cuda")

prompt = "A front-facing portrait of a lion in the golden savanna at sunset"
image = pipe(prompt, num_inference_steps=28, guidance_scale=5.0).images[0]
image.save("lion.png")
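
If you want reproducible outputs when comparing prompts or settings, you can pass a seeded generator. This assumes PRXPipeline follows the usual Diffusers convention of accepting a generator argument; the seed value below is arbitrary.

generator = torch.Generator(device="cuda").manual_seed(42)  # fixed seed for reproducibility
image = pipe(
    prompt,
    num_inference_steps=28,  # more steps trade speed for fidelity
    guidance_scale=5.0,      # higher values follow the prompt more literally
    generator=generator,
).images[0]
image.save("lion_seed42.png")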

You can browse the full collection, which includes different variants of the model: base, SFT, and distilled checkpoints, with multiple VAEs. Most of these checkpoints are 256- and 512-pixel models, but we've also included a preview of the upcoming 1024-pixel model.


👉 Try the PRX 1024-pixel preview

Early results

Here are a few examples from a 1024-pixel PRX checkpoint (model card), a preview version of the upcoming high-resolution models.

These images come from a 1.3B-parameter PRX model trained for 1.7M steps at 1024-pixel resolution in under 10 days on 32 H200 GPUs. This particular checkpoint uses REPA [8] with DINOv2 features [17], Flux VAE [5], and T5-Gemma [7] as the text embedder.
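
For readers unfamiliar with REPA: it adds an auxiliary loss that aligns intermediate transformer features (computed on noisy latents) with frozen features from a self-supervised encoder, here DINOv2. Below is a minimal conceptual sketch of such a loss, assuming matched token counts and illustrative module names; it is not the exact PRX implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RepaProjector(nn.Module):
    """Maps DiT hidden states into the DINOv2 feature space."""
    def __init__(self, dit_dim, dino_dim, hidden_dim=2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dit_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, dino_dim),
        )

    def forward(self, h):
        return self.mlp(h)

def repa_loss(hidden, dino_feats, projector):
    # hidden:     (B, N, dit_dim)  intermediate DiT features for noisy latents
    # dino_feats: (B, N, dino_dim) frozen DINOv2 patch features of the clean image
    proj = F.normalize(projector(hidden), dim=-1)
    target = F.normalize(dino_feats.detach(), dim=-1)
    return -(proj * target).sum(dim=-1).mean()  # negative cosine similarity

# total_loss = denoising_loss + repa_lambda * repa_loss(hidden, dino_feats, projector)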

Below you can see animations from very early checkpoints during training, all generated from the same prompt and seed, showing how the model evolves from scratch.

[Animation: samples from early training checkpoints]
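
These animations come from sampling the same prompt and seed at successive checkpoints. Here is a minimal sketch of that kind of sweep; the revision names are placeholders for intermediate checkpoints and are not part of this release.

import torch
from diffusers.pipelines.prx import PRXPipeline

prompt = "A front-facing portrait of a lion in the golden savanna at sunset"
checkpoints = ["step-10000", "step-50000", "step-200000"]  # hypothetical revision names

for i, rev in enumerate(checkpoints):
    pipe = PRXPipeline.from_pretrained(
        "Photoroom/prx-1024-t2i-beta",
        revision=rev,                # placeholder; intermediate checkpoints are not published
        torch_dtype=torch.bfloat16,
    ).to("cuda")
    generator = torch.Generator(device="cuda").manual_seed(0)  # same seed for every frame
    frame = pipe(prompt, num_inference_steps=28, guidance_scale=5.0, generator=generator).images[0]
    frame.save(f"frame_{i:02d}.png")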

What we’ve done so far

These first weights are the result of weeks of experimentation, trying to refine a training recipe that is both efficient and high quality. Here’s a summary of what we’ve explored so far.

  • Architectures: DiT [1], UViT [2], MMDiT [3], DiT-Air [4], and PRX (Photoroom eXperimental), our own, more efficient MMDiT-like variant.
  • VAEs and text embedders: Flux’s [5] and DC-AE [6] VAEs, and T5-Gemma [7] for text encoding.
  • Training techniques: REPA [8], REPA-E [9], contrastive flow matching [10] (see the sketch after this list), TREAD [11], Uniform ROPE [12], Immiscible [13], and the Muon optimizer [14].
  • Post-pretraining: distillation with LADD [15], supervised fine-tuning, and DPO [16].
  • Implementation details: EMA, precision settings, and extensive hyperparameter sweeps.
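
As a concrete example of one of the training techniques above, here is a schematic reading of the contrastive flow matching objective [10] on top of a rectified-flow setup. The model signature, tensor names, and negative-pair construction are illustrative, not the exact PRX training code.

import torch
import torch.nn.functional as F

def contrastive_fm_loss(model, x1, x0, cond, t, lam=0.05):
    # x1: clean latents (B, ...), x0: Gaussian noise, cond: text embeddings, t: (B,) in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1 - t_) * x0 + t_ * x1            # linear (rectified-flow) interpolation
    v_target = x1 - x0                      # velocity target for the matched pair

    v_pred = model(xt, t, cond)             # illustrative signature

    perm = torch.randperm(x1.size(0), device=x1.device)
    v_neg = (x1[perm] - x0[perm]).detach()  # velocity of a mismatched sample from the batch

    fm = F.mse_loss(v_pred, v_target)       # pull the prediction toward its own flow
    contrast = F.mse_loss(v_pred, v_neg)    # push it away from other samples' flows
    return fm - lam * contrast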

We’ve run extensive experiments to understand how these design choices affect convergence, visual quality, and efficiency. We’ll go into more detail about these experiments and what we learned from them in this series of posts.

Deep dive: our training & research series

We’ve also started publishing a detailed blog series that breaks down the full training pipeline and all the experiments behind PRX. This is meant to complement the release and document the process in an open and reproducible way.

We’ll continue adding to this series as we refine the training recipe and push toward the next generation of PRX models.

What's next?

This post marks the first of a series of research updates and releases we’re planning. There’s still plenty in the pipeline:

  • We’ll continue expanding the research series with more experiments, ablations, and model variants.
  • We’re continuing training and preparing the release of the 1024-pixel resolution model.
  • We’ve started exploring preference alignment through supervised finetuning, DPO [16], and GRPO-based methods such as Pref-GRPO [18]. We’re also looking into other recent approaches like Representation Autoencoders (RAE) [19].

We’ll keep iterating, releasing more weights, and documenting the process along the way.

Interested in contributing?

We’ve set up a Discord server (join here!) for more regular updates and discussion with the community. Join us there if you’d like to follow progress more closely or talk through details.

If you have ideas you’d like to explore or contributions you’d like to make, you can either message us on Discord or email [email protected]. We’d be glad to have more people involved.

The team

This project is the result of contributions from across the team in engineering, data, and research: David Bertoin, Roman Frigg, Simona Maggio, Lucas Gestin, Marco Forte, David Briand, Thomas Bordier, Matthieu Toulemont, and Jon Almazán, with earlier contributions from Quentin Desreumaux, Tarek Ayed, Antoine d’Andigné, and Benjamin Lefaudeux. We’re hiring for senior roles!

References

[1] Peebles et al., Scalable Diffusion Models with Transformers

[2] Bao et al., All are Worth Words: A ViT Backbone for Diffusion Models

[3] Esser et al., Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

[4] Chen et al., DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation

[5] Black Forest Labs, FLUX

[6] Chen et al., Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

[7] Dua et al., EmbeddingGemma: Powerful and Lightweight Text Representations

[8] Yu et al., Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

[9] Leng et al., REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers

[10] Stoica et al., Contrastive Flow Matching

[11] Krause et al., TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training

[12] Jerry Xiong, On N-dimensional Rotary Positional Embeddings

[13] Li et al., Immiscible Diffusion: Accelerating Diffusion Training with Noise Assignment

[14] Jordan et al., Muon: An optimizer for hidden layers in neural networks

[15] Sauer et al., Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

[16] Rafailov et al., Direct Preference Optimization: Your Language Model is Secretly a Reward Model

[17] Oquab et al., DINOv2: Learning Robust Visual Features without Supervision

[18] Wang et al., Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

[19] Zheng et al., Diffusion Transformers with Representation Autoencoders
