Sayak Paul (sayakpaul)

AI & ML interests

Diffusion models, representation learning

Organizations

Hugging Face, 🧨Diffusers, TensorFlow TPU, Hugging Face Internal Testing Organization, All Things ViTs, Probing ViTs, Evaluation on the Hub, Instruction-tuned Diffusion Models, JAX ♥️ Diffusers 🧨, (De)fusing, Huggingface Projects, Keras Dreambooth Event, Hugging Face OSS Metrics, Deploy HF TF ViTs, Open Generative AI, UniDiffuser Testing, Personal Coding Assistant, Diffusers Demo at ICCV 2023, huggingPartyParis, Latent Consistency, ZeroGPU Explorers, SPRIGHT, PEFT, NYU VisionX, Social Post Explorers, MaPO, diffusers-internal-dev, AuraFlow, lawrence, Optimum Internal Testing, Diffusion Guidance, syn-t2i, Hugging Face FineVideo, DDUF, HunyuanVideo Community, Finetrainers, Diffusion CoT, Cinematic T2V

sayakpaul's activity

upvoted an article 10 days ago
How to train a new language model from scratch using Transformers and Tokenizers

By julien-c
posted an update 13 days ago
Diffusers supports a good variety of quantization backends, and it can be challenging to navigate them, given the generally complex nature of diffusion pipelines.

So, @derekl35 set out to write a comprehensive guide that puts users in the front seat. Explore the different backends we support, learn the trade-offs they offer, and finally, check out the cool space we built that lets you compare quantization results.

Give it a go here:
https://lnkd.in/gf8Pi4-2
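To get a feel for what those backends do under the hood, here is a minimal, illustrative sketch of absmax int8 quantization, the basic idea behind backends like bitsandbytes (this is a self-contained toy, not the actual Diffusers implementation):

```python
def quantize_int8(xs):
    """Absmax int8 quantization: scale values so the largest
    magnitude maps to 127, then round each value to an integer."""
    scale = max(abs(x) for x in xs) / 127 or 1.0  # fall back to 1.0 for all-zero input
    return [round(x / scale) for x in xs], scale

def dequantize_int8(qs, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in qs]

weights = [0.5, -1.0, 0.25, 0.0]
qs, scale = quantize_int8(weights)
recovered = dequantize_int8(qs, scale)
# Each recovered value differs from the original by at most scale / 2,
# which is the memory-vs-fidelity trade-off the guide discusses.
```

The real backends differ in how they pick scales (per-tensor vs. per-channel vs. per-block) and in bit width, which is exactly the trade-off space the guide walks through.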
upvoted an article 14 days ago
Exploring Quantization Backends in Diffusers

By derekl35 and 2 others
published an article 14 days ago
Exploring Quantization Backends in Diffusers

By derekl35 and 2 others
posted an update 15 days ago
Despite the emergence of architectures that combine an LLM with a DiT for T2I synthesis, this design space remains severely understudied.

This work was completed a while ago and was accepted to CVPR 2025 -- super excited to finally share it, along with the data and code ♥️

We explore several architectural choices that affect this design. We provide an open & reproducible training recipe that works at scale.

Works like Playground v3 have already explored a deep fusion between an LLM and a DiT, sharing their representations through layerwise attention. They exhibit excellent performance on T2I.

Despite these compelling results and other performance virtues, the design remains underexplored, which is what we address in our work. Specifically, we take a pre-trained LLM (Gemma-2B) and a trainable DiT, and set out to explore what makes a "good deep fusion" between the two for T2I.

We explore several key questions in the work, such as:

Q1: How should we do attention? We considered several alternatives; PixArt-Alpha-style attention (cross-attention) proved very promising.
Q2: Should we incorporate additional text modulation?
Q3: Can we eliminate timestep conditioning?
Q4: How do we do positional encodings?
Q5: Do instruction-tuned LLMs help deep fusion?
Q6: Would using a decoder LLM from a multimodal model be helpful?
Q7: Does using a better variant of Gemma help?

Based on these findings, we arrive at FuseDiT, with the following components on top of the base architecture:

* No AdaLN-Zero modules
* 1D + 2D-RoPE
* Gemma 2 2B, adjusting DiT configurations accordingly
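To illustrate the "1D + 2D-RoPE" component above, here is a stdlib-only sketch of how 1D rotary embeddings commonly extend to 2D image tokens, by rotating half the dimensions with the row index and the other half with the column index (an illustrative assumption about the scheme, not the paper's actual code):

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Apply 1D rotary position embedding to a vector of even length.
    Each pair (vec[2i], vec[2i+1]) is rotated by angle pos * base**(-2i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rope_2d(vec, row, col):
    """2D RoPE for image tokens: rotate one half of the dimensions
    by the row index and the other half by the column index."""
    half = len(vec) // 2
    return rope_1d(vec[:half], row) + rope_1d(vec[half:], col)
```

Because every pair is a pure rotation, the token's norm is preserved, and relative offsets in either axis show up as phase differences in attention.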

We trained FuseDiT on a mixture of CC12M, JourneyDB, & SA (~26M image-text pairs) for 800 steps. While it isn't the best model out there, it's encouraging that we could develop it in a guided manner using open datasets.

To learn more (code and models are all available), please check out the paper:
https://lnkd.in/gg6qyqZX.
New activity in ooutlierr/fuse-dit 16 days ago
New activity in ooutlierr/fuse-dit 19 days ago
updated a collection 21 days ago
updated a collection 22 days ago