ParaAttention
Large image and video generation models, such as FLUX.1-dev and HunyuanVideo, can be an inference challenge for real-time applications and deployment because of their size.
ParaAttention is a library that implements context parallelism and first block cache, and it can be combined with other techniques such as torch.compile and fp8 dynamic quantization to accelerate inference.
This guide will show you how to apply ParaAttention to FLUX.1-dev and HunyuanVideo on NVIDIA L20 GPUs. No optimizations are applied for our baseline benchmark, except for HunyuanVideo to avoid out-of-memory errors.
Our baseline benchmark shows that FLUX.1-dev is able to generate a 1024x1024 resolution image in 28 steps in 26.36 seconds, and HunyuanVideo is able to generate 129 frames at 720p resolution in 30 steps in 3675.71 seconds.
For even faster inference with context parallelism, try using NVIDIA A100 or H100 GPUs (if available) with NVLink support, especially when there is a large number of GPUs.
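ParaAttention needs to be installed before following along. At the time of writing it is published on PyPI as para-attn; treat the package name as an assumption and check the ParaAttention repository for the current installation instructions.
pip3 install para-attn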
First Block Cache
Caching the outputs of the transformer blocks in the model and reusing them in subsequent inference steps reduces computation and makes inference faster.
However, it is hard to decide when to reuse the cache to ensure quality generated images or videos. ParaAttention directly uses the residual difference of the first transformer block output to approximate the difference among model outputs. When the difference is small enough, the residual difference of previous inference steps is reused. In other words, the denoising step is skipped.
This achieves a 2x speedup on FLUX.1-dev and HunyuanVideo inference with very good quality.
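In pseudocode, the caching decision is a threshold check on the first block's residual. The sketch below only illustrates the idea and is not ParaAttention's actual implementation; the function name and the relative mean-difference metric are assumptions.

import torch

def should_reuse_cache(first_block_residual, prev_first_block_residual, threshold=0.08):
    # On the first step there is nothing cached yet, so the full model must run.
    if prev_first_block_residual is None:
        return False
    # Relative mean difference between the current and previous first-block residuals.
    diff = (first_block_residual - prev_first_block_residual).abs().mean()
    rel_diff = diff / prev_first_block_residual.abs().mean()
    # A small change suggests the remaining blocks would also change little, so the
    # cached residual is reused and the rest of the denoising step is skipped.
    return rel_diff.item() < threshold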
To apply first block cache on FLUX.1-dev, call apply_cache_on_pipe
as shown below. 0.08 is the default residual difference value for FLUX models.
import time

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

apply_cache_on_pipe(pipe, residual_diff_threshold=0.08)

# Enable memory savings
# pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()

begin = time.time()
image = pipe(
    "A cat holding a sign that says hello world",
    num_inference_steps=28,
).images[0]
end = time.time()
print(f"Time: {end - begin:.2f}s")

print("Saving image to flux.png")
image.save("flux.png")
Optimizations | Original | FBCache rdt=0.06 | FBCache rdt=0.08 | FBCache rdt=0.10 | FBCache rdt=0.12 |
---|---|---|---|---|---|
Wall Time (s) | 26.36 | 21.83 | 17.01 | 16.00 | 13.78 |
First Block Cache reduced the inference time from the 26.36-second baseline to 17.01 seconds, or 1.55x faster, with almost no loss in quality.
fp8 quantization
fp8 with dynamic quantization further speeds up inference and reduces memory usage. Both the activations and weights must be quantized in order to use the 8-bit NVIDIA Tensor Cores.
Use float8_weight_only
and float8_dynamic_activation_float8_weight
to quantize the text encoder and transformer model.
The default quantization method is per tensor quantization, but if your GPU supports row-wise quantization, you can also try it for better accuracy.
Install torchao with the command below.
pip3 install -U torch torchao
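As an example of the row-wise option mentioned above, recent torchao releases accept a granularity argument. Treat the exact import and argument names as version-dependent assumptions and check the torchao documentation for your release.

import torch
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, PerRow

# Stand-in module; in the example below this would be pipe.transformer.
model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).to("cuda", torch.bfloat16)

# Row-wise scales are usually more accurate than a single per-tensor scale, but
# they require hardware support (roughly compute capability 8.9 and newer).
quantize_(model, float8_dynamic_activation_float8_weight(granularity=PerRow()))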
torch.compile with mode="max-autotune-no-cudagraphs"
or mode="max-autotune"
selects the best kernel for performance. Compilation can take a long time if it’s the first time the model is called, but it is worth it once the model has been compiled.
The example below applies dynamic fp8 quantization to the transformer and weight-only fp8 quantization to the text encoder to reduce memory usage even more.
Dynamic quantization can significantly change the distribution of the model output, so you need to change the residual_diff_threshold
to a larger value for it to take effect.
import time

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

apply_cache_on_pipe(
    pipe,
    residual_diff_threshold=0.12,  # Use a larger value to make the cache take effect
)

from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only

quantize_(pipe.text_encoder, float8_weight_only())
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())

pipe.transformer = torch.compile(
    pipe.transformer, mode="max-autotune-no-cudagraphs",
)

# Enable memory savings
# pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()

for i in range(2):
    begin = time.time()
    image = pipe(
        "A cat holding a sign that says hello world",
        num_inference_steps=28,
    ).images[0]
    end = time.time()
    if i == 0:
        print(f"Warm up time: {end - begin:.2f}s")
    else:
        print(f"Time: {end - begin:.2f}s")

print("Saving image to flux.png")
image.save("flux.png")
Combined with First Block Cache, fp8 dynamic quantization and torch.compile reduced the inference time to 7.56 seconds, or 3.48x faster than the baseline.
Context Parallelism
Context Parallelism parallelizes inference and scales with multiple GPUs. The ParaAttention compositional design allows you to combine Context Parallelism with First Block Cache and dynamic quantization.
Refer to the ParaAttention repository for detailed instructions and examples of how to scale inference with multiple GPUs.
If the inference process needs to be persistent and serviceable, consider using torch.multiprocessing to write your own inference processor. This eliminates the overhead of repeatedly launching the process and of loading and recompiling the model for every request.
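For illustration, below is a minimal single-GPU sketch of such a persistent worker built on torch.multiprocessing queues. It loads the pipeline once and then serves prompts from a queue; it deliberately omits context parallelism, caching, and quantization, and the queue-based protocol is an assumption rather than a ParaAttention API.

import torch
import torch.multiprocessing as mp

def worker(request_queue, result_queue):
    # Load the pipeline once; it stays warm (and, if compiled, stays compiled)
    # for every subsequent request.
    from diffusers import FluxPipeline
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        torch_dtype=torch.bfloat16,
    ).to("cuda")
    while True:
        prompt = request_queue.get()
        if prompt is None:  # Sentinel value shuts the worker down.
            break
        image = pipe(prompt, num_inference_steps=28).images[0]
        result_queue.put(image)

if __name__ == "__main__":
    mp.set_start_method("spawn")  # Required for CUDA in child processes.
    request_queue, result_queue = mp.Queue(), mp.Queue()
    p = mp.Process(target=worker, args=(request_queue, result_queue))
    p.start()
    # Each request reuses the already-loaded model instead of paying the startup cost again.
    request_queue.put("A cat holding a sign that says hello world")
    result_queue.get().save("flux.png")
    request_queue.put(None)
    p.join()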
The code sample below combines First Block Cache, fp8 dynamic quantization, torch.compile, and Context Parallelism for the fastest inference speed.
import time

import torch
import torch.distributed as dist
from diffusers import FluxPipeline

dist.init_process_group()

torch.cuda.set_device(dist.get_rank())

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

from para_attn.context_parallel import init_context_parallel_mesh
from para_attn.context_parallel.diffusers_adapters import parallelize_pipe
from para_attn.parallel_vae.diffusers_adapters import parallelize_vae

mesh = init_context_parallel_mesh(
    pipe.device.type,
    max_ring_dim_size=2,
)
parallelize_pipe(
    pipe,
    mesh=mesh,
)
parallelize_vae(pipe.vae, mesh=mesh._flatten())

from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

apply_cache_on_pipe(
    pipe,
    residual_diff_threshold=0.12,  # Use a larger value to make the cache take effect
)

from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only

quantize_(pipe.text_encoder, float8_weight_only())
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())

torch._inductor.config.reorder_for_compute_comm_overlap = True

pipe.transformer = torch.compile(
    pipe.transformer, mode="max-autotune-no-cudagraphs",
)

# Enable memory savings
# pipe.enable_model_cpu_offload(gpu_id=dist.get_rank())
# pipe.enable_sequential_cpu_offload(gpu_id=dist.get_rank())

for i in range(2):
    begin = time.time()
    image = pipe(
        "A cat holding a sign that says hello world",
        num_inference_steps=28,
        output_type="pil" if dist.get_rank() == 0 else "pt",
    ).images[0]
    end = time.time()
    if dist.get_rank() == 0:
        if i == 0:
            print(f"Warm up time: {end - begin:.2f}s")
        else:
            print(f"Time: {end - begin:.2f}s")

if dist.get_rank() == 0:
    print("Saving image to flux.png")
    image.save("flux.png")

dist.destroy_process_group()
Save the script to run_flux.py and launch it with torchrun.
# Use --nproc_per_node to specify the number of GPUs
torchrun --nproc_per_node=2 run_flux.py
With 2 NVIDIA L20 GPUs, the inference time is reduced to 8.20 seconds compared to the baseline, or 3.21x faster. With 4 L20s, the inference time drops to 3.90 seconds, or 6.75x faster.
Benchmarks
GPU Type | Number of GPUs | Optimizations | Wall Time (s) | Speedup |
---|---|---|---|---|
NVIDIA L20 | 1 | Baseline | 26.36 | 1.00x |
NVIDIA L20 | 1 | FBCache (rdt=0.08) | 17.01 | 1.55x |
NVIDIA L20 | 1 | FP8 DQ | 13.40 | 1.96x |
NVIDIA L20 | 1 | FBCache (rdt=0.12) + FP8 DQ | 7.56 | 3.48x |
NVIDIA L20 | 2 | FBCache (rdt=0.12) + FP8 DQ + CP | 4.92 | 5.35x |
NVIDIA L20 | 4 | FBCache (rdt=0.12) + FP8 DQ + CP | 3.90 | 6.75x |