SmolVLM-2 and SigLIP-2 are now part of transformers in dedicated releases!
They're added on top of the v4.49.0 release, and can be installed from the following tags: v4.49.0-SmolVLM-2 and v4.49.0-SigLIP-2.
This marks a new beginning for the release process of transformers. For the past five years, we've been doing monthly releases featuring many models (v4.49.0, the latest release, features 9 new architectures).
Starting with SmolVLM-2 & SigLIP-2, we'll now additionally release tags supporting new models on a stable branch. These models are therefore directly available for use by installing from the tag itself, and the tags will keep receiving fixes applied to these models.
Going forward, you can continue to expect software releases following semantic versioning: v4.50.0 will have ~10 new architectures compared to v4.49.0, as well as a myriad of new features, improvements, and bug fixes. Alongside these software releases, we'll publish tags offering brand-new models as fast as possible, to make them immediately accessible to all.
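As a sketch of what the new workflow could look like (the checkpoint id and exact snippet are assumptions; check the SmolVLM-2 model cards for the official usage):

```python
# Install transformers from the model-specific tag:
#   pip install git+https://github.com/huggingface/transformers@v4.49.0-SmolVLM-2
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

ckpt = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForImageTextToText.from_pretrained(ckpt, torch_dtype=torch.bfloat16)

messages = [{"role": "user", "content": [
    {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
    {"type": "text", "text": "Describe this image."},
]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```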
Google just released PaliGemma 2 Mix: new versatile instruction-tuned vision language models 🔥
> Three new models: 3B, 10B, 28B, at resolutions 224 and 448 💙
> Can do vision language tasks with open-ended prompts, understand documents, and segment or detect anything 🤯
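A minimal transformers sketch (the checkpoint id is assumed from the naming scheme; see the release collection for exact ids):

```python
import requests
from PIL import Image
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration

ckpt = "google/paligemma2-3b-mix-224"  # assumed checkpoint id
processor = PaliGemmaProcessor.from_pretrained(ckpt)
model = PaliGemmaForConditionalGeneration.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Open-ended prompt; "detect <object>" / "segment <object>" prompts work too.
inputs = processor(text="describe en", images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```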
Less is More for Reasoning (LIMO): a 32B model fine-tuned with 817 examples can beat o1-preview on math reasoning! 🤯
Do we really need o1's huge RL procedure to see reasoning emerge? It seems not. Researchers from Shanghai Jiao Tong University just demonstrated that carefully selected examples can boost math performance in large language models using plain SFT: no huge datasets or RL procedures needed.
Their procedure lets Qwen2.5-32B-Instruct jump from 6.5% to 57% on AIME and from 59% to 95% on MATH, while using only 1% of the data used in previous approaches.
⚡ The Less-is-More Reasoning Hypothesis:
‣ Minimal but precise examples that showcase optimal reasoning patterns matter more than sheer quantity
‣ Pre-training knowledge plus sufficient computational resources at inference time level up math skills
➡️ Core techniques (see the SFT sketch after this list):
‣ High-quality reasoning chains with self-verification steps
‣ 817 handpicked problems that encourage deeper reasoning
‣ Enough inference-time computation to allow extended reasoning
💪 Efficiency gains:
‣ Only 817 examples instead of 100k+
‣ 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data
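For intuition, here is a minimal SFT sketch with TRL; the `GAIR/LIMO` dataset id and its column names are assumptions, and the paper's actual run fine-tuned a 32B model, so this is a scaled-down illustration rather than their exact setup:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# ~817 curated problem/solution pairs (assumed dataset id and columns).
dataset = load_dataset("GAIR/LIMO", split="train")

def to_text(example):
    # Train on question + full reasoning chain as a single text.
    return {"text": example["question"] + "\n" + example["solution"]}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",  # smaller stand-in for Qwen2.5-32B-Instruct
    train_dataset=dataset,
    args=SFTConfig(output_dir="limo-sft", max_seq_length=4096, num_train_epochs=3),
)
trainer.train()
```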
This really challenges the notion that SFT leads to memorization rather than generalization! And opens up reasoning to GPU-poor researchers 🚀
👀 Multimodal
> OpenGVLab released InternVideo 2.5 Chat models, new video LMs with long context
> AIDC released the Ovis2 model family along with the Ovis dataset, new vision LMs in different sizes (1B, 2B, 4B, 8B, 16B, 34B) with video and OCR support
> ColQwenStella-2b is a multilingual visual retrieval model that is SOTA for its size
> Hoags-2B-Exp is a new multilingual vision LM with contextual reasoning and long-context video understanding
💬 LLMs
A lot of math models!
> The Open-R1 team released OpenR1-Math-220k, a large-scale math reasoning dataset, along with OpenR1-Qwen-7B, a Qwen2.5-Math fine-tune trained on it
> Nomic AI released a new multilingual Nomic Embed retrieval model, a MoE with 500M params (305M active), outperforming other models
> DeepScaleR-1.5B-Preview is a new DeepSeek-R1-Distill fine-tune using distributed RL on math
> LIMO is a new fine-tune of Qwen2.5-32B-Instruct on math
🗣️ Audio
> Zonos-v0.1 is a new family of text-to-speech models; the release includes the model weights and speaker embeddings
🖼️ Vision and Image Generation
> We have ported Apple's DepthPro to transformers for your convenience!
> illustrious-xl-v1.0 is a new illustration generation model
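Using the DepthPro port could look like this minimal sketch (the checkpoint id is an assumption; check the Hub for the converted weights):

```python
import requests
from PIL import Image
from transformers import pipeline

# Assumed checkpoint id for the transformers-converted DepthPro weights.
depth = pipeline("depth-estimation", model="apple/DepthPro-hf")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

result = depth(image)
result["depth"].save("depth_map.png")  # predicted depth rendered as a PIL image
```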
𝗚𝗿𝗲𝗮𝘁 𝗳𝗲𝗮𝘁𝘂𝗿𝗲 𝗮𝗹𝗲𝗿𝘁: you can now share agents to the Hub! 🥳🥳
And any agent pushed to the Hub gets a cool Space interface to chat with it directly.
This was a real technical challenge: for instance, serializing tools to export them meant that you needed to get all the source code for a tool, verify that it was standalone (not relying on external variables), and gather all the packages required to make it run.
I created the Tools gallery, which makes tools specifically developed by/for smolagents searchable and visible. This will help with:
- inspiration
- best practices
- finding cool tools
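Sharing an agent is then a one-liner; a minimal sketch (the repo id is hypothetical):

```python
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=HfApiModel())

# Serializes the agent (tool source code, requirements, config) to a Hub repo.
agent.push_to_hub("your-username/my-web-agent")  # hypothetical repo id

# Anyone can then load it back:
# agent = CodeAgent.from_hub("your-username/my-web-agent", trust_remote_code=True)
```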
🤖 Robotics
> Pi0, the first open-source foundation vision-language-action model, was released in LeRobot (Apache 2.0)
💬 LLMs
> Groundbreaking: s1 is a simpler approach to test-time scaling. The release comes with s1K, a small dataset of 1k question-reasoning-trace pairs (distilled from Gemini-Thinking Exp), on which they fine-tune Qwen2.5-32B-Instruct to get s1-32B, outperforming o1-preview on math 🤯 s1-32B and s1K are out!
> Adyen released DABstep, a new benchmark along with its leaderboard demo for agents doing data analysis
> Krutrim released Krutrim-2 instruct, a new 12B model based on NeMo12B trained and aligned on Indic languages, a new multilingual sentence embedding model (based on STSB-XLM-R), and a translation model for Indic languages
👀 Multimodal
> PKU released Align-DS-V, a model aligned across all modalities (image-text-audio) using their new technique called LLF, along with the Align Anything dataset
> OLA-7B is a new any-to-any model by Tencent that takes text, image, video, and audio input with a context window of 32k tokens, and outputs text and speech in English and Chinese
> Krutrim released Chitrarth, a new vision language model for Indic languages and English
🖼️ Vision
> BiRefNet_HR is a new higher-resolution BiRefNet for background removal
🗣️ Audio
> kyutai released Hibiki, a real-time speech-to-speech translation model 🤯 currently available for French-English translation
> Krutrim released Dhwani, a new STT model for Indic languages
> They also released a new dataset for STT-TTS
🖼️ Image Generation
> Lumina released Lumina-Image-2.0, a 2B-parameter flow-based DiT for text-to-image generation
> Tencent released Hunyuan3D-2, a 3D asset generation model built on Hunyuan3D-DiT and Hunyuan3D-Paint
> boreal-hl-v1 is a new boring photorealistic image generation LoRA based on Hunyuan
➡️ How well do reasoning models perform on agentic tasks? Until now, all indicators seemed to show that they worked really well: on our recent reproduction of Deep Research, OpenAI's o1 was by far the best model to power an agentic system.
So when our partner Adyen built a huge benchmark of 450 data science tasks and built data agents with smolagents to test different models, I expected reasoning models like o1 or DeepSeek-R1 to destroy the tasks at hand.
👎 But they really missed the mark: DeepSeek-R1 got only 1 or 2 questions correct out of every 10. Similarly, o1 reached only ~13% correct answers.
🧐 These results really surprised us. We thoroughly checked them; we even thought our APIs for DeepSeek were broken, and colleagues Leandro and Anton helped me spin up custom instances of R1 on our own H100s to make sure everything worked. But there was no mistake: reasoning LLMs actually did not seem that smart. Often, these models made basic mistakes, like forgetting the contents of a folder they had just explored, misspelling file names, or hallucinating data. Even though they do great at exploring webpages over several steps, the same level of multi-step planning seemed much harder to achieve when reasoning over files and data.
It seems like there's still lots of work to do in the Agents x Data space. Congrats to Adyen for this great benchmark, and looking forward to seeing people propose better agents! 🚀
OpenAI's latest agentic app Deep Research seems really good... But it's closed, as usual.
⏱️ So with a team of cracked colleagues, we set ourselves a 24-hour deadline to replicate and open-source Deep Research! ⏱️
➡️ We built open-Deep-Research, an entirely open agent that can: navigate the web autonomously, scroll and search through pages, download and manipulate files, run calculations on data...
We aimed for the best performance: are the agent's answers actually correct?
On the GAIA benchmark, Deep Research scores 67% accuracy on the validation set.
➡️ open Deep Research is at 55% (powered by o1), making it:
- the best pass@1 solution submitted
- the best open solution 💪💪
And it's only getting started! Please jump in, drop PRs, and let's bring it to the top!
This week in open AI was 🔥 Let's recap! 🤗 merve/january-31-releases-679a10669bd4030090c5de4d
LLMs 💬
> Huge: AllenAI released new Tülu models that outperform DeepSeek R1 using Reinforcement Learning with Verifiable Rewards (RLVR), based on Llama 3.1 405B 🔥
> Mistral AI is back to open-source with their "small" 24B models (base & SFT), with an Apache 2.0 license 😱
> Alibaba Qwen released their 1M-context-length Qwen2.5-Instruct-1M models, great for agentic use, with an Apache 2.0 license 🔥
> Arcee AI released Virtuoso-medium, a 32.8B LLM distilled from DeepSeek V3 with a dataset of 5B+ tokens
> Velvet-14B is a new family of 14B Italian LLMs trained on 10T tokens in six languages
> OpenThinker-7B is a fine-tuned version of Qwen2.5-7B-Instruct on the OpenThoughts dataset
VLMs & vision 👀
> Alibaba Qwen is back with Qwen2.5-VL, with amazing new capabilities ranging from agentic computer use to zero-shot localization 🔥
> NVIDIA released a new series of Eagle2 models in 1B and 9B sizes
> DeepSeek released Janus-Pro, a new any-to-any model (image-text generation from image-text input) with an MIT license
> BEN2 is a new background removal model with an MIT license!
Audio 🗣️
> YuE is a new open-source music generation foundation model for lyrics-to-song generation
Now you can launch a code agent directly from your terminal! ✨
`smolagent "Your task"` directly launches a CodeAgent
▶️ This also works with web agents (replace `smolagent` with `webagent`), thanks to @merve!
💾 Another treat from the smolagents 1.7.0 release: agents now have a memory mechanism, enabling many possibilities like replaying the last run with `agent.replay()`. Thank you @clefourrier!
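A minimal sketch of the memory replay (the model and task are arbitrary examples):

```python
from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(tools=[], model=HfApiModel())
agent.run("What is the 10th Fibonacci number?")

# The run's steps are stored in the agent's memory, so they can be replayed:
agent.replay()  # pretty-prints the steps of the last run
```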
Datasets on the Hugging Face Hub rely on Parquet files. We can interact with these files using DuckDB as a fast in-memory database system. One of DuckDB's features is vector similarity search, which can be used with or without an index.
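As a minimal sketch of the no-index variant (the dataset path, column names, and embedding size are hypothetical; `hf://` paths are resolved through DuckDB's httpfs support):

```python
import duckdb

# Hypothetical layout: a 'text' column plus a 384-dim 'embedding' column.
con = duckdb.connect()
query_vec = [0.1] * 384  # stand-in for a real query embedding

results = con.execute(
    """
    SELECT text,
           array_cosine_similarity(embedding::FLOAT[384], ?::FLOAT[384]) AS score
    FROM 'hf://datasets/my-user/my-dataset/**/*.parquet'
    ORDER BY score DESC
    LIMIT 5
    """,
    [query_vec],
).fetchall()
print(results)
```

For larger datasets, DuckDB's VSS extension can build an HNSW index to speed up the same search.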
✅ Hosting our own inference was not enough: the Hub now has 4 new inference providers: fal, Replicate, SambaNova Systems, & Together AI.
Check out model cards on the Hub: you can now, in one click, use inference from various providers (cf. video demo).
Their inference can also be used through our Inference API client. There, you can use either your custom provider key or your HF token; in the latter case, billing is handled directly on your HF account, as a way to centralize all expenses.
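A minimal sketch with the `huggingface_hub` client (the model id is just an example; it must be deployed on the chosen provider):

```python
from huggingface_hub import InferenceClient

# With an HF token, requests are routed to the provider and billed on your HF account;
# passing a provider-specific key bills you directly with that provider instead.
client = InferenceClient(provider="together", api_key="hf_xxx")

response = client.chat_completion(
    model="deepseek-ai/DeepSeek-R1",  # example model served by this provider
    messages=[{"role": "user", "content": "Say hello in French."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```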
💸 Also, PRO users get $2 of inference credits per month!