Fast LoRA inference for Flux with Diffusers and PEFT π¨
There are great materials that demonstrate how to optimize inference for popular image generation models, such as Flux. However, very few cover how to serve LoRAs fast, despite LoRAs being an inseparable part of their adoption.
In our latest post, @BenjaminB and I show different techniques to optimize LoRA inference for the Flux family of models for image generation. Our recipe includes the use of:
1. torch.compile 2. Flash Attention 3 (when compatible) 3. Dynamic FP8 weight quantization (when compatible) 4. Hotswapping for avoiding recompilation during swapping new LoRAs π€―
We have tested our recipe with Flux.1-Dev on both H100 and RTX 4090. We achieve at least a *2x speedup* in either of the GPUs. We believe our recipe is grounded in the reality of how LoRA-based use cases are generally served. So, we hope this will be beneficial to the community π€
Even though our recipe was tested primarily with NVIDIA GPUs, it should also work with AMD GPUs.
Today in Privacy & AI Tooling - introducing a nifty new tool to examine where data goes in open-source apps on π€
HF Spaces have tons (100Ks!) of cool demos leveraging or examining AI systems - and because most of them are OSS we can see exactly how they handle user data ππ
That requires actually reading the code though, which isn't always easy or quick! Good news: code LMs have gotten pretty good at automatic review, so we can offload some of the work - here I'm using Qwen/Qwen2.5-Coder-32B-Instruct to generate reports and it works pretty OK π
The app works in three stages: 1. Download all code files 2. Use the Code LM to generate a detailed report pointing to code where data is transferred/(AI-)processed (screen 1) 3. Summarize the app's main functionality and data journeys (screen 2) 4. Build a Privacy TLDR with those inputs
It comes with a bunch of pre-reviewed apps/Spaces, great to see how many process data locally or through (private) HF endpoints π€