Probing ViTs
non-profit
AI & ML interests
We are interested in studying the representations learned by Vision Transformers.
Recent Activity
probing-vits's activity
Post
2359
Diffusers supports a good variety of quantization backends. It can be challenging to navigate through them, given the complex nature of diffusion pipelines in general.
So, @derekl35 set out to write a comprehensive guide that puts users in the front seat. Explore the different backends we support, learn the trade-offs they offer, and finally, check out the cool space we built that lets you compare quantization results.
Give it a go here:
https://lnkd.in/gf8Pi4-2
Post
1678
Despite the emergence of architectures that combine LLMs and DiTs for T2I synthesis, this design space remains severely understudied.
This was done long ago and got into CVPR25 -- super excited to finally share it now, along with the data and code ♥️
We explore several architectural choices that affect this design. We provide an open & reproducible training recipe that works at scale.
Works like Playground v3 have already explored a deep fusion between an LLM and a DiT, sharing their representations through layerwise attention. They exhibit excellent performance on T2I.
Despite its compelling results and other performance virtues, this deep-fusion design remains underexplored, which is what we set out to improve in our work. Specifically, we take a pre-trained LLM (Gemma-2B) and a trainable DiT, and explore what makes a "good deep fusion" between the two for T2I.
We explore several key questions in the work, such as:
Q1: How should we do attention? We considered several alternatives; PixArt-Alpha-style attention (cross-attention) is very promising.
Q2: Should we incorporate additional text modulation?
Q3: Can we eliminate timestep conditioning?
Q4: How do we do positional encodings?
Q5: Do instruction-tuned LLMs help deep fusion?
Q6: Would using a decoder LLM from a multimodal model be helpful?
Q7: Does using a better variant of Gemma help?
Based on the findings of our experiments, we arrive at FuseDiT, with the following components on top of the base architecture:
* No AdaLN-Zero modules
* 1D + 2D-RoPE
* Gemma 2 2B, adjusting DiT configurations accordingly
We trained FuseDiT on a mixture from CC12M, JourneyDB, & SA (~26M image-text pairs) for 800 steps. While not the best model, it's encouraging to develop something in a guided manner using open datasets.
To learn more (code and models are all available), please check out the paper:
https://lnkd.in/gg6qyqZX.
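To make Q1 a bit more concrete, here is a minimal PyTorch sketch of the PixArt-Alpha-style option: DiT image tokens cross-attending to frozen LLM hidden states. This is an illustrative sketch, not the paper's code; the module name, dimensions, and layer choices are assumptions.

```python
# Minimal sketch (not the paper's code): a DiT block whose image tokens
# cross-attend to frozen LLM hidden states, i.e. the PixArt-Alpha-style
# option discussed in Q1. Names and dimensions are illustrative.
import torch
import torch.nn as nn

class CrossAttnDiTBlock(nn.Module):
    def __init__(self, dim: int = 1152, llm_dim: int = 2048, num_heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Project the LLM hidden states into the DiT width before cross-attention.
        self.text_proj = nn.Linear(llm_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, img_tokens, llm_hidden_states):
        # Self-attention over the image (latent) tokens.
        h = self.norm1(img_tokens)
        img_tokens = img_tokens + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention: image tokens query the frozen LLM representations.
        text = self.text_proj(llm_hidden_states)
        h = self.norm2(img_tokens)
        img_tokens = img_tokens + self.cross_attn(h, text, text, need_weights=False)[0]
        # Feed-forward.
        return img_tokens + self.mlp(self.norm3(img_tokens))

# Shapes standing in for a DiT patch embedder and Gemma's last hidden states.
block = CrossAttnDiTBlock()
img = torch.randn(2, 256, 1152)   # (batch, image tokens, dit width)
txt = torch.randn(2, 77, 2048)    # (batch, text tokens, llm width)
out = block(img, txt)             # -> (2, 256, 1152)
```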

sayakpaul authored a paper 20 days ago

sayakpaul authored a paper about 1 month ago

sayakpaul authored a paper 3 months ago
Post
3858
Inference-time scaling meets Flux.1-Dev (and others) 🔥
Presenting a simple re-implementation of "Inference-time scaling diffusion models beyond denoising steps" by Ma et al.
I did the simplest random search strategy, but results can potentially be improved with better-guided search methods.
Supports Gemini 2 Flash & Qwen2.5 as verifiers for "LLMGrading" 🤗
The steps are simple:
For each round:
1> Start by sampling 2 starting noises with different seeds.
2> Score the generations w.r.t. a metric.
3> Keep the best generation from the current round.
If you have more compute budget, go to the next search round: scale the noise pool (2 ** search_round) and repeat 1-3. This constitutes the random search method as done in the paper by Google DeepMind.
Code, more results, and a bunch of other stuff are in the repository. Check it out here: https://github.com/sayakpaul/tt-scale-flux/ 🤗
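For illustration, here is a minimal sketch of the random-search loop described above. `generate_image` and `verifier_score` are hypothetical stand-ins for the diffusion sampler (e.g. Flux.1-Dev) and the verifier; they are not functions from the repository.

```python
# Sketch of random search over starting noises, with a growing noise pool.
# The two stubs below are hypothetical placeholders, not repo functions.
import random

def generate_image(prompt: str, seed: int):
    raise NotImplementedError  # e.g. run the diffusion pipeline with this seed

def verifier_score(prompt: str, image) -> float:
    raise NotImplementedError  # e.g. Gemini 2 Flash / Qwen2.5 "LLMGrading"

def random_search(prompt: str, num_rounds: int = 4):
    best_image, best_score = None, float("-inf")
    for search_round in range(1, num_rounds + 1):
        noise_pool = 2 ** search_round                 # pool doubles each round
        seeds = random.sample(range(10**6), noise_pool)
        for seed in seeds:                             # 1> sample starting noises
            image = generate_image(prompt, seed)
            score = verifier_score(prompt, image)      # 2> score w.r.t. a metric
            if score > best_score:                     # 3> keep the best so far
                best_image, best_score = image, score
    return best_image, best_score
```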
Post
2108
We have been cooking a couple of fine-tuning runs on CogVideoX with finetrainers, smol datasets, and LoRA to generate cool video effects like crushing, dissolving, etc.
We are also releasing a LoRA extraction utility for pulling a LoRA out of a fully fine-tuned checkpoint. I know this kind of tooling has existed for ages, but the quality on video models was nothing short of spectacular. Below are some links, followed by a conceptual sketch of the extraction idea:
* Models and datasets: finetrainers
* finetrainers: https://github.com/a-r-r-o-w/finetrainers
* LoRA extraction: https://github.com/huggingface/diffusers/blob/main/scripts/extract_lora_from_model.py
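As a rough illustration of the idea (this is not the linked diffusers script), one common recipe factorizes the difference between the fine-tuned and base weights with a truncated SVD:

```python
# Conceptual LoRA extraction: low-rank factorization of the weight delta.
import torch

def extract_lora(base_weight: torch.Tensor, tuned_weight: torch.Tensor, rank: int = 64):
    delta = (tuned_weight - base_weight).float()       # what fine-tuning changed
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank, :]     # keep the top-`rank` components
    lora_up = U * S.sqrt()                             # (out_features, rank)
    lora_down = S.sqrt()[:, None] * Vh                 # (rank, in_features)
    return lora_down, lora_up                          # lora_up @ lora_down ≈ delta

# Random weights standing in for a transformer linear layer.
base = torch.randn(3072, 3072)
tuned = base + 0.01 * torch.randn(3072, 3072)
down, up = extract_lora(base, tuned, rank=16)
print(torch.dist(up @ down, tuned - base))  # reconstruction error of the low-rank fit
```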
Post
2049
We have authored a post to go over the state of video generation in the Diffusers ecosystem 🧨
We cover the supported models, the optimization knobs our users can turn, fine-tuning, and more 🔥
5-6 GB for HunyuanVideo; the sky is the limit 🤗
https://huggingface.co/blog/video_gen
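To give a flavor of the kind of memory knobs the post covers, here is a hedged sketch assuming the HunyuanVideo pipeline in diffusers; the checkpoint id and settings are illustrative assumptions, and the blog post has the exact, tested recipes.

```python
# Rough sketch of memory-saving knobs in diffusers; checkpoint id and
# parameters are assumptions for illustration, not the blog's exact recipe.
import torch
from diffusers import HunyuanVideoPipeline

pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",  # assumed diffusers-format repo id
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # keep only the active component on the GPU
pipe.vae.enable_tiling()         # decode the video latents tile by tile

video = pipe(prompt="a cat walking on grass", num_frames=61).frames[0]
```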
Post
2823
Tried my hand at simplifying the derivations of Direct Preference Optimization.
I cover how one can reformulate RLHF into DPO. The idea of implicit reward modeling is chef's kiss.
Blog: https://huggingface.co/blog/ariG23498/rlhf-to-dpo
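For reference, the standard result the blog walks through (following Rafailov et al., 2023), reproduced here from memory rather than copied from the post:

```latex
% DPO objective on preference pairs (y_w preferred over y_l):
\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
      \right)\right]

% The implicit reward recovered from the policy (up to a prompt-only term):
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)
```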
Post
2016
Timm ❤️ Transformers
With the latest version of transformers, you can now use any timm model with the familiar transformers API.
Blog Post: https://huggingface.co/blog/timm-transformers
Repository with examples: https://github.com/ariG23498/timm-wrapper-examples
Collection: ariG23498/timmwrapper-6777b85f1e8d085d3f1374a1
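A minimal sketch of what this looks like, assuming the `timm/resnet18.a1_in1k` checkpoint and the image-classification pipeline; the blog post and examples repo have the canonical usage.

```python
# Sketch: a timm hub checkpoint driven through the regular transformers pipeline.
# The checkpoint id and test image URL are assumptions for illustration.
from transformers import pipeline

classifier = pipeline("image-classification", model="timm/resnet18.a1_in1k")
preds = classifier("http://images.cocodataset.org/val2017/000000039769.jpg")
print(preds[:3])  # top predicted labels with scores
```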
Post
4452
Commits speak louder than words 🤪
* 4 new video models
* Multiple image models, including SANA & Flux Control
* New quantizers -> GGUF & TorchAO
* New training scripts
Enjoy this holiday-special Diffusers release 🤗
Notes: https://github.com/huggingface/diffusers/releases/tag/v0.32.0
Post
2194
Introducing a high-quality open-preference dataset to further this line of research for image generation.
Despite being such an inseparable component for modern image generation, open preference datasets are a rarity!
So, we decided to work on one with the community!
Check it out here:
https://huggingface.co/blog/image-preferences
Post
2216
The Control family of Flux from @black-forest-labs should be discussed more!
It enables structural controls like ControlNets while being significantly less expensive to run!
So, we're working on a Control LoRA training script 🤗
It's still WIP, so go easy:
https://github.com/huggingface/diffusers/pull/10130

sayakpaul authored a paper 6 months ago
Post
1445
We are blessed with another iteration of PaliGemma: Google launches PaliGemma 2.
google/paligemma-2-release-67500e1e1dbfdd4dee27ba48
merve/paligemma2-vqav2
Post
1576
Let 2024 be the year of video model fine-tunes!
Check it out here:
https://github.com/a-r-r-o-w/cogvideox-factory/tree/main/training/mochi-1
Post
2730
It's been a while since we shipped native quantization support in diffusers 🧨
We currently support bitsandbytes as the official backend, but using others like torchao is already very simple. This post is just a reminder of what's possible:
1. Loading a model with a quantization config
2. Saving a model with a quantization config
3. Loading a pre-quantized model
4. enable_model_cpu_offload()
5. Training and loading LoRAs into quantized checkpoints
Docs: https://huggingface.co/docs/diffusers/main/en/quantization/bitsandbytes
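As a quick illustration of points 1 and 4, here is a sketch along the lines of the bitsandbytes docs; the Flux checkpoint and the 4-bit settings are just one possible combination, not the only supported one.

```python
# Sketch: load the Flux transformer with a 4-bit bitsandbytes config, plug it
# into the pipeline, and offload idle components to the CPU.
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,   # point 1: quantize while loading
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()         # point 4: offload idle components
image = pipe("a photo of a corgi astronaut", num_inference_steps=28).images[0]
```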