Loubna Ben Allal

loubnabnl

AI & ML interests

SmolLMs, ML for code, data

Organizations

Hugging Face, BigScience Workshop, BigScience Catalogue Data, BigScience Data, HuggingFaceBR4, Team 8, CodeParrot, BigCode, Hugging Face H4, Hugging Face OSS Metrics, CompVis Community, BigCode Data, LocalCodeLLMs, Need4Speed, EPFL Machine Learning and Optimization Laboratory, Code Llama, Hugging Face Smol Models Research, Hugging Face Smol Cluster, Nt3awnou, huggingPartyParis, Qwen, ZeroGPU Explorers, HF AFAIK, gg-hf, Nanotron Research, Women on Hugging Face, Hugging Face SMOL, FineData, bigcode nvidia, Social Post Explorers, Dev Mode Explorers, Cosmopedia Stories Collab, HuggingFaceFW-Dev, StarCoder2 Data, Data Agents, Argilla Warehouse, smol-explorers, swissai-hf-data, Hugging Face Science, Open R1, smol-ablations, SmolEvalData

loubnabnl's activity

published an article 1 day ago

SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data
By danaaubakirova and 8 others • 47

upvoted an article 7 days ago

CodeAgents + Structure: A Better Way to Execute Actions
By akseljoonas and 1 other • 37

reacted to clem's post with 🚀🔥 9 days ago

Playing with Veo3 this morning. Share your prompt if you want me to create videos for you (bonus points if they funnily reference HF/open-source). These videos are "a cat on the moon rapping 'I love Hugging Face'"!

reacted to nyuuzyou's post with 🔥 9 days ago

I recently updated the nyuuzyou/pxhere dataset, and it now contains approximately 1.1M CC0 high-resolution images.
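
(For context, a minimal sketch of how such a Hub dataset could be inspected with the datasets library; the split name and the streaming usage are assumptions for illustration, not taken from the post.)

```python
from datasets import load_dataset

# Stream the dataset rather than downloading ~1.1M images up front.
# The "train" split name is an assumption for illustration.
ds = load_dataset("nyuuzyou/pxhere", split="train", streaming=True)

# Peek at the first few records lazily to see the available fields.
for i, example in enumerate(ds):
    print(example.keys())
    if i == 2:
        break
```
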
upvoted 3 changelogs 12 days ago

Static Spaces can now have a build step • 82

Xet is now the default storage option for new users and organizations • 53

AI-generated Abstract summaries on Hugging Face Papers • 65

reacted to merve's post with 🔥 12 days ago

Google released MedGemma on I/O'25 👏 google/medgemma-release-680aade845f90bec6a3f60c4

> 4B and 27B instruction fine-tuned vision LMs and a 4B pre-trained vision LM for medicine
> available with transformers from the get-go 🤗

they also released a cool demo for scan reading ➡️ google/rad_explain

use with transformers ⬇️
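
(A minimal sketch of what such transformers usage could look like; the checkpoint id google/medgemma-4b-it, the image URL, and the chosen pipeline task are assumptions for illustration, not taken from the post.)

```python
import torch
from transformers import pipeline

# Hypothetical checkpoint id; see the linked MedGemma collection for the exact models.
pipe = pipeline(
    "image-text-to-text",
    model="google/medgemma-4b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Chat-style multimodal prompt: one image plus a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chest_xray.png"},
            {"type": "text", "text": "Describe the findings in this scan."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=200)
print(out[0]["generated_text"][-1]["content"])
```
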
replied to their post 14 days ago
reacted to AdinaY's post with 🔥🚀 14 days ago

ByteDance is absolutely cooking lately 🔥

BAGEL 🥯: a 7B-active-parameter open multimodal foundation model from the ByteDance Seed team.

ByteDance-Seed/BAGEL-7B-MoT

✨ Apache 2.0
✨ Outperforms top VLMs (Qwen2.5-VL & InternVL-2.5)
✨ Mixture-of-Transformer-Experts + dual encoders
✨ Trained on trillions of interleaved tokens

reacted to sayakpaul's post with 🔥 14 days ago

Despite the emergence of approaches combining LLM and DiT architectures for T2I synthesis, this design space remains severely understudied.

This work was done a while ago and got into CVPR'25 -- super excited to finally share it now, along with the data and code ♥️

We explore several architectural choices that affect this design and provide an open & reproducible training recipe that works at scale.

Works like Playground v3 have already explored a deep fusion between an LLM and a DiT, sharing their representations through layerwise attention, and they exhibit excellent performance on T2I.

Despite its compelling results and other performance virtues, this design remains underexplored, which is what we want to address in our work. Specifically, we take a pre-trained LLM (Gemma-2B) and a trainable DiT, and set out to explore what makes a "good deep fusion" between the two for T2I.
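
(To make the layerwise "deep fusion" idea concrete, here is a hypothetical PyTorch sketch, not the paper's implementation: at each layer, DiT image tokens attend jointly over the frozen LLM's text hidden states and themselves. Module names, dimensions, and the omission of timestep/AdaLN conditioning are illustrative simplifications.)

```python
import torch
import torch.nn as nn

class DeepFusionBlock(nn.Module):
    """One DiT block fused with a frozen LLM via joint, layerwise attention."""

    def __init__(self, dim: int = 1024, num_heads: int = 16):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp_norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, img_tokens: torch.Tensor, text_hidden: torch.Tensor) -> torch.Tensor:
        # img_tokens: DiT image tokens [B, T_img, D]
        # text_hidden: hidden states from the matching (frozen) LLM layer [B, T_txt, D]
        x = self.attn_norm(img_tokens)
        # Joint attention: queries are image tokens, keys/values span text + image tokens.
        kv = torch.cat([text_hidden, x], dim=1)
        attn_out, _ = self.attn(query=x, key=kv, value=kv, need_weights=False)
        img_tokens = img_tokens + attn_out
        img_tokens = img_tokens + self.mlp(self.mlp_norm(img_tokens))
        return img_tokens

# Usage sketch: fuse layer by layer with the LLM's per-layer hidden states.
blocks = nn.ModuleList([DeepFusionBlock() for _ in range(4)])
img = torch.randn(2, 256, 1024)                           # noisy image latents as tokens
llm_states = [torch.randn(2, 77, 1024) for _ in blocks]   # one text hidden state per layer
for block, txt in zip(blocks, llm_states):
    img = block(img, txt)
```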

We explore several key questions in the work, such as:

Q1: How should we do attention? We considered several alternatives; PixArt-Alpha-like attention (cross-attention) is very promising.
Q2: Should we incorporate additional text modulation?
Q3: Can we eliminate timestep conditioning?
Q4: How do we do positional encodings?
Q5: Do instruction-tuned LLMs help deep fusion?
Q6: Would using a decoder LLM from a multimodal model be helpful?
Q7: Does using a better variant of Gemma help?

Based on the findings of our experiments, we arrive at FuseDiT, with the following components on top of the base architecture:

* No AdaLN-Zero modules
* 1D + 2D-RoPE
* Gemma 2 2B, adjusting DiT configurations accordingly

We trained FuseDiT on a mixture of CC12M, JourneyDB, & SA (~26M image-text pairs) for 800 steps. While it isn't the best model, it's encouraging that we can develop something in a guided manner using open datasets.

To learn more (code and models are all available), please check out the paper:
https://lnkd.in/gg6qyqZX