PixArt-Σ

Overview

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation is Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li.

Some notes about this pipeline:

It uses a Transformer backbone (instead of a UNet) for denoising. As such it has a similar architecture as DiT.
It was trained using text conditions computed from T5. This aspect makes the pipeline better at following complex text prompts with intricate details.
It is good at producing high-resolution images at different aspect ratios. To get the best results, the authors recommend some size brackets which can be found here.
It rivals the quality of state-of-the-art text-to-image generation systems (as of this writing) such as PixArt-α, Stable Diffusion XL, Playground V2.0 and DALL-E 3, while being more efficient than them.
It shows the ability of generating super high resolution images, such as 2048px or even 4K.
It shows that text-to-image models can grow from a weak model to a stronger one through several improvements (VAEs, datasets, and so on.)

🤗 Optimum extends Diffusers to support inference on the second generation of Neuron devices(powering Trainium and Inferentia 2). It aims at inheriting the ease of Diffusers on Neuron.

Export to Neuron

To deploy models in the PixArt-Σ pipeline, you will need to compile them to TorchScript optimized for AWS Neuron. There are four components which need to be exported to the .neuron format to boost the performance:

Text encoder
Transformer
VAE encoder
VAE decoder

You can either compile and export a PixArt-Σ Checkpoint via CLI or NeuronPixArtSigmaPipeline class.

Option 1: CLI

optimum-cli export neuron --model Jingya/pixart_sigma_pipe_xl_2_512_ms --batch_size 1 --height 512 --width 512 --num_images_per_prompt 1 --torch_dtype bfloat16 --sequence_length 120 pixart_sigma_neuron_512/

We recommend using a inf2.8xlarge or a larger instance for the model compilation. You will also be able to compile the model with the Optimum CLI on a CPU-only instance (needs ~35 GB memory), and then run the pre-compiled model on inf2.xlarge to reduce the expenses. In this case, don’t forget to disable validation of inference by adding the --disable-validation argument.

Option 2: Python API

import torch
from optimum.neuron import NeuronPixArtSigmaPipeline

# Compile
compiler_args = {"auto_cast": "none"}
input_shapes = {"batch_size": 1, "height": 512, "width": 512, "sequence_length": 120}

neuron_model = NeuronPixArtSigmaPipeline.from_pretrained("Jingya/pixart_sigma_pipe_xl_2_512_ms", torch_dtype=torch.bfloat16, export=True, disable_neuron_cache=True, **compiler_args, **input_shapes)

# Save locally
neuron_model.save_pretrained("pixart_sigma_neuron_512/")

# Upload to the HuggingFace Hub
neuron_model.push_to_hub(
    "pixart_sigma_neuron_512/", repository_id="optimum/pixart_sigma_pipe_xl_2_512_ms_neuronx"  # Replace with your HF Hub repo id
)

Text-to-Image

NeuronPixArtSigmaPipeline class allows you to generate images from a text prompt on neuron devices similar to the experience with Diffusers.

With pre-compiled PixArt-Σ models, now generate an image with a prompt on Neuron:

from optimum.neuron import NeuronPixArtSigmaPipeline

neuron_model = NeuronPixArtSigmaPipeline.from_pretrained("pixart_sigma_neuron_512/")
prompt = "Oppenheimer sits on the beach on a chair, watching a nuclear exposition with a huge mushroom cloud, 120mm."
image = neuron_model(prompt=prompt).images[0]

NeuronPixArtSigmaPipeline

Pipeline for text-to-image generation using PixArt-Σ.

class optimum.neuron.NeuronPixArtSigmaPipeline

< source >

( **kwargs )

call