Diffusion LLMs are coming for autoregressive LLMs ⚡️⚡️ Inception Labs' new diffusion model demolishes all leading LLMs on generation speed, with equal quality!
Inception Labs was founded a few months ago, and they're not sleeping: after dropping a code model, they just published Mercury Chat, a diffusion-based chat model that reaches 1,000 tokens/second on an H100, i.e. 10× faster than models of equivalent performance on the same hardware!
What's the breakthrough? Well, instead of generating tokens left-to-right like the more common autoregressive LLMs, diffusion models generate whole blocks of text at once, and successive steps refine the entire text.
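A toy sketch of that idea (purely illustrative; Mercury's actual architecture isn't public): start from a fully masked block and fill in several positions per refinement step, instead of one token per step like an autoregressive model.

```python
import random

MASK = "<mask>"

def toy_denoise_step(tokens, model_propose, k):
    """Fill in up to k masked positions at once, chosen at random."""
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    for i in random.sample(masked, min(k, len(masked))):
        tokens[i] = model_propose(tokens, i)
    return tokens

def generate(length, model_propose, tokens_per_step=4):
    """Refine the whole block until no masks remain."""
    tokens = [MASK] * length
    steps = 0
    while MASK in tokens:
        tokens = toy_denoise_step(tokens, model_propose, tokens_per_step)
        steps += 1
    return tokens, steps

# Dummy "model": proposes a placeholder token for any position.
out, steps = generate(16, lambda toks, i: f"tok{i}")
print(steps)  # 4 refinement steps for 16 tokens, vs 16 autoregressive steps
```

The speedup intuition is right there: tokens are produced in parallel per step, so the number of model calls scales with refinement steps, not sequence length.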
Diffusion models being really fast isn't new: Google already showed promising results back in May with Gemini Diffusion, and Inception Labs had already published their Mercury coding model a few months ago.
But reaching this quality is new, and now Inception Labs just proved that their models work well in chat too, which could have been challenging given that streaming generation is naturally suited to left-to-right generation.
They have a playground available at chat.inceptionlabs.ai, I recommend giving it a try!
If you're using any HF libraries, you should enable the Hub MCP in your agentic coding tool!
The brand new Docs Semantic Search tool is an intravenous caffeine supply for Cursor: it lets you correct API errors in a few seconds, gj @mishig ⚡️⚡️
👉 To enable Hub MCP, head to your account settings, under MCP, and it will give you everything you need!
AMD summer hackathons are here! A chance to get hands-on with MI300X GPUs and accelerate models.
🇫🇷 Paris - Station F - July 5-6
🇮🇳 Mumbai - July 12-13
🇮🇳 Bengaluru - July 19-20
Hugging Face and GPU Mode will be on site, and on July 6 in Paris, @ror will share lessons learned while building new kernels to accelerate Llama 3.1 405B on ROCm.
If you haven't yet, you should read the technical report for SmolVLA, published yesterday by the Hugging Face robotics team! ⚡️ Among other ideas, it introduces "Async inference" to boost their robot actions.
Robots have a problem: performing actions takes time (unlike agents, where action executions are near-instant!). Most often, robots wait until they've finished performing their current actions before starting to think about the next steps. This is a huge latency cost!
So the team decided to have the PolicyServer (aka the "thinking" part) restart early: instead of waiting for all n actions they just sent to be completed, they gather observations after k < n steps and start preparing the next actions based on those, while steps k+1 to n are still running, so the next chunk is ready to send right away.
⚡️ This boosted robot throughput by ~30%! (nearly 2× tasks per time window)
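The k < n trick can be sketched in plain Python with a background thread (names like `plan_actions` and the exact protocol here are illustrative, not SmolVLA's real API): the policy call for the next chunk overlaps with executing the tail of the current chunk.

```python
import queue
import threading
import time

def plan_actions(observation, n):
    """Stand-in for the PolicyServer: returns a chunk of n actions."""
    time.sleep(0.01)  # simulated model inference latency
    return [f"action_{observation}_{i}" for i in range(n)]

def run_episode(n=8, k=4, chunks=3):
    next_chunk = queue.Queue()
    executed = []
    actions = plan_actions("obs0", n)
    for c in range(chunks):
        worker = None
        for i, action in enumerate(actions):
            executed.append(action)  # "execute" the action on the robot
            if i + 1 == k:
                # After k < n steps, send observations and start planning
                # the next chunk while the remaining n - k steps run.
                worker = threading.Thread(
                    target=lambda c=c: next_chunk.put(
                        plan_actions(f"obs{c + 1}", n)))
                worker.start()
        worker.join()
        actions = next_chunk.get()  # ready (or nearly ready) by step n
    return executed

print(len(run_episode()))  # 3 chunks of 8 actions, with planning overlapped
```

With synchronous inference, each chunk would pay the full planning latency up front; here that latency hides behind the execution of steps k+1..n.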
A new research paper from KAIST builds on smolagents to push the boundaries of distillation 🥳 ⚡️ "Distilling LLM Agent into Small Models with Retrieval and Code Tools" shows that, when trying to distil reasoning capability from a strong LLM ("teacher") into a smaller one ("student"), it's much better to use agent traces than CoT traces.
Advantages are:
1. Improved generalization: intuitively, this is because your agent can encounter more "surprising" results by interacting with its environment. For example, a web search called by the LLM teacher in agent mode can bring back results that the teacher would not have generated in CoT.
2. Reduced hallucinations: the trace won't contain hallucinated tool call outputs!
It's just become easier to share your apps on the biggest AI app store (aka HF Spaces), for unlimited storage, more visibility, and community interactions.
Just pick a React, Svelte, or Vue template when you create your Space, or add app_build_command: npm run build and app_file: build/index.html to your README's YAML block.
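Put together, the README front matter might look like this (a sketch: only app_build_command and app_file come from the post itself; title and sdk: static are assumed surrounding fields, check the Spaces config docs for your template):

```yaml
---
title: My App                      # illustrative value
sdk: static                        # assumed: static SDK serves the built files
app_build_command: npm run build   # from the post: build step for your app
app_file: build/index.html         # from the post: entry point after build
---
```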
Wrapping up a week of shipping and announcements with Dell Enterprise Hub now featuring AI Applications, on-device models for AI PCs, a new CLI and Python SDK... all you need for building AI on premises!
Playing with Veo3 this morning. Share your prompt if you want me to create videos for you (bonus points if they funnily reference HF/open-source). These videos are 'a cat on the moon rapping "I love Hugging Face"'!
Recent RL paradigms often rely on a set of questions and answers that needs to be manually curated. Researchers from Tsinghua University went like "why though?"
🤔 Indeed, why learn from questions designed by a human teacher, when the model can start from its base knowledge and learn by experimenting in a code environment, proposing coding tasks itself and trying to solve them?
Thus they created "Absolute Zero Reasoning" (AZR), an approach that removes any need for human-curated data.
🎭 𝗗𝘂𝗮𝗹 𝗿𝗼𝗹𝗲𝘀:
‣ Proposer: generates challenging but solvable coding tasks
‣ Solver: attempts to solve those self-proposed tasks
🧪 𝗧𝗵𝗿𝗲𝗲 𝘁𝗮𝘀𝗸 𝘁𝘆𝗽𝗲𝘀: all types are defined as triplets of program, input, and output
‣ Deduction: give the model an input and a program, it must deduce the output
‣ Abduction: give the model a program and an output, it must find the input that produced said output
‣ Induction: synthesize a program from input/output pairs
Btw, this reminded me of my long-forgotten philosophy classes: Aristotle was more on the induction side, learning from real-world analogies, while Plato was more on the deduction side, trying to progress quite far with just one input and his reasoning.
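The three task types over a (program, input, output) triplet can be illustrated with a toy example (function names here are my own, not from the paper's code):

```python
def program(x):
    """The shared toy program of the triplet."""
    return x * 2 + 1

triplet = (program, 3, 7)  # (program, input, output)

# Deduction: given program and input, predict the output.
assert program(3) == 7

# Abduction: given program and output, search for an input producing it.
found = next(x for x in range(-100, 100) if program(x) == 7)
assert found == 3

# Induction: given input/output pairs, synthesize a matching program.
pairs = [(0, 1), (1, 3), (2, 5)]
synthesized = lambda x: 2 * x + 1  # candidate the "solver" might propose
assert all(synthesized(i) == o for i, o in pairs)
```

Crucially, every self-proposed task is verifiable by just executing the program in the code environment, which is what makes learning without human-curated data possible.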
📊 𝗥𝗲𝘀𝘂𝗹𝘁𝘀:
‣ AZR post-training creates a nice improvement on known models like Qwen2.5-7B
‣ Shows strong cross-domain transfer: coding ↔️ math reasoning
🧠 𝗢𝘁𝗵𝗲𝗿 𝗳𝗶𝗻𝗱𝗶𝗻𝗴𝘀:
‣ Having better base performance (general or code-specific) amplifies the gains from Absolute Zero Reasoning
‣ Researchers warn about "Uh-oh moments" (a wink at the "aha moments" of DeepSeek), where the model generates concerning goals like "make an extremely convoluted code to outsmart all these humans": so supervision is still needed!