anakin87 (Stefano Fiorucci)

posted an update 9 days ago

Post

355

🛡️ AI Guardrails with Open Language Models - Tutorial

📓 https://haystack.deepset.ai/cookbook/safety_moderation_open_lms

How do you ensure your AI application is safe from harmful or inappropriate user inputs?

This is a core requirement for real-world AI deployments. Luckily, several open Language Models are built specifically for safety moderation.

I've been exploring them and put together a hands-on tutorial using the Haystack framework to build your own AI guardrails.

In the notebook, you'll learn how to use and customize:
🔹 Meta Llama Guard (via Hugging Face API)
🔹 IBM Granite Guardian (via Ollama), which can also evaluate RAG specific risk dimensions
🔹 Google ShieldGemma (via Ollama)
🔹 Nvidia NemoGuard models family, including a model for topic control

You'll also see how to integrate content moderation into a 🔎 RAG pipeline.

reacted to andito's post with 👀 9 days ago

Post

3857

🧠👁️ Can AI visualize solutions?

Humans often solve visual problems by sketching ideas in our minds. What if Vision-Language Models (VLMs) could do something similar, not by generating full images, but by using internal “mental sketches”?

That’s the idea behind Mirage, a new framework that empowers VLMs to reason using latent visual tokens. Instead of just thinking in words, Mirage mixes in abstract visual representations that help the model solve complex tasks.

These aren't photorealistic images. They're compact, internal representations optimized purely to support reasoning.

🔧 Mirage is trained in two phases:

1) Grounding: It learns to produce latent tokens anchored in real images.
2) Refinement: The model drops the images and learns to generate visual tokens on its own.

📈 And yes, it works!
On challenging benchmarks like Visual Spatial Planning, Jigsaw puzzles, and Spatial Attention Tasks, Mirage clearly outperforms GPT-4o and other strong baselines.
Smart sketches > empty words.

By mimicking the way humans visualize solutions, Mirage gives AI a new kind of imagination, one that’s faster, more efficient, and more human-like.
Kudos to the teams at UMass Amherst and MIT behind this exciting work.
Check the paper: Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (2506.17218)

4 replies

·

posted an update 12 days ago

Post

1101

🧰 Free up space on the Hub with super_squash_history 🧹

As you may know, Hugging Face Hub has storage limits on private repos (100 GB for free users, 1 TB for PROs).

This weekend I did some cleanup on my private repos
I went 1.58 TB down to 1 GB. 😅

Besides deleting old, unused models, the main tool I used was a lesser-known command:
super_squash_history.

When you train a model, you often push multiple checkpoints to the Hub.
Each checkpoint = a commit.
A 2.6B model in BF16 is ~5 GB.
So 10 checkpoints = 50 GB. That adds up fast.

While full commit history can be useful for rollbacks, it's often unnecessary for older experiments where only the final model matters.

In these cases, you can use super_squash_history: it reduces your entire repo history to a single commit.

https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.super_squash_history

⚠️ super_squash_history is a non-revertible operation. Once squashed, the commit history cannot be retrieved.

Hope this is useful to others.

2 replies

·

reacted to as-cle-bert's post with ❤️ 2 months ago

Post

1914

One of the biggest challenges I've been facing since I started developing [𝐏𝐝𝐟𝐈𝐭𝐃𝐨𝐰𝐧](https://github.com/AstraBert/PdfItDown) was handling correctly the conversion of files like Excel sheets and CSVs: table conversion was bad and messy, almost unusable for downstream tasks🫣

That's why today I'm excited to introduce 𝐫𝐞𝐚𝐝𝐞𝐫𝐬, the new feature of PdfItDown v1.4.0!🎉

With 𝘳𝘦𝘢𝘥𝘦𝘳𝘴, you can choose among three (for now👀) flavors of text extraction and conversion to PDF:

- 𝗗𝗼𝗰𝗹𝗶𝗻𝗴, which does a fantastic work with presentations, spreadsheets and word documents🦆

- 𝗟𝗹𝗮𝗺𝗮𝗣𝗮𝗿𝘀𝗲 by LlamaIndex, suitable for more complex and articulated documents, with mixture of texts, images and tables🦙

- 𝗠𝗮𝗿𝗸𝗜𝘁𝗗𝗼𝘄𝗻 by Microsoft, not the best at handling highly structured documents, by extremly flexible in terms of input file format (it can even convert XML, JSON and ZIP files!)✒️

You can use this new feature in your python scripts (check the attached code snippet!😉) and in the command line interface as well!🐍

Have fun and don't forget to star the repo on GitHub ➡️ https://github.com/AstraBert/PdfItDown

posted an update 2 months ago

Post

3483

𝗜 𝘁𝗿𝗮𝗶𝗻𝗲𝗱 𝗮 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹 𝘁𝗼 𝘀𝗰𝗵𝗲𝗱𝘂𝗹𝗲 𝗲𝘃𝗲𝗻𝘁𝘀 𝘄𝗶𝘁𝗵 𝗚𝗥𝗣𝗢! 👑 🗓️

✍️ Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo

I experimented with GRPO lately.

I am fascinated by models learning from prompts and rewards - no example answers needed like in Supervised Fine-Tuning.

After the DeepSeek boom, everyone is trying GRPO with GSM8K or the Countdown Game...

I wanted a different challenge, like 𝘁𝗲𝗮𝗰𝗵𝗶𝗻𝗴 𝗮 𝗺𝗼𝗱𝗲𝗹 𝘁𝗼 𝗰𝗿𝗲𝗮𝘁𝗲 𝗮 𝘀𝗰𝗵𝗲𝗱𝘂𝗹𝗲 𝗳𝗿𝗼𝗺 𝗮 𝗹𝗶𝘀𝘁 𝗼𝗳 𝗲𝘃𝗲𝗻𝘁𝘀 𝗮𝗻𝗱 𝗽𝗿𝗶𝗼𝗿𝗶𝘁𝗶𝗲𝘀.

Choosing an original problem forced me to:
🤔 Think about the problem setting
🧬 Generate data
🤏 Choose the right base model
🏆 Design reward functions (and experiencing reward hacking)
🔄 Run multiple rounds of training, hoping that my model would learn something.

A fun and rewarding 😄 experience.

I learned a lot of things, that I want to share with you. 👇
✍️ Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo
💻 Code: https://github.com/anakin87/qwen-scheduler-grpo
🤗 Hugging Face collection (dataset and model): anakin87/qwen-scheduler-grpo-680bcc583e817390525a8837

2 replies

·

reacted to as-cle-bert's post with ❤️ 3 months ago

Post

2931

Ever dreamt of ingesting into a vector DB that pile of CSVs, Word documents and presentations laying in some remote folders on your PC?🗂️
What if I told you that you can do it within three to six lines of code?🤯
Well, with my latest open-source project, 𝐢𝐧𝐠𝐞𝐬𝐭-𝐚𝐧𝐲𝐭𝐡𝐢𝐧𝐠 (https://github.com/AstraBert/ingest-anything), you can take all your non-PDF files, convert them to PDF, extract their text, chunk, embed and load them into a vector database, all in one go!🚀
How? It's pretty simple!
📁 The input files are converted into PDF by PdfItDown (https://github.com/AstraBert/PdfItDown)
📑 The PDF text is extracted using LlamaIndex readers
🦛 The text is chunked exploiting Chonkie
🧮 The chunks are embedded thanks to Sentence Transformers models
🗄️ The embeddings are loaded into a Qdrant vector database

And you're done!✅
Curious of trying it? Install it by running:

𝘱𝘪𝘱 𝘪𝘯𝘴𝘵𝘢𝘭𝘭 𝘪𝘯𝘨𝘦𝘴𝘵-𝘢𝘯𝘺𝘵𝘩𝘪𝘯𝘨

And you can start using it in your python scripts!🐍
Don't forget to star it on GitHub and let me know if you have any feedback! ➡️ https://github.com/AstraBert/ingest-anything

5 replies

·

reacted to giux78's post with ❤️ 4 months ago

Post

3216

This is truly an inspirational story please help us spread the word, @clem , @thomwolf and everyone who supports open source AI.

A few weeks ago, @mmuffo94 and @cittiberto from indigo_ai launched the Chatbot Arena for the Italian language: https://indigo.ai/it/chatbot-arena-italia/.

To our surprise, among the top-ranked models is mii-llm/maestrale-chat-v0.4-beta a carefully fine-tuned version of mistralai/Mistral-7B-v0.1, developed by @efederici and @mferraretto from

mii-llm , and released nearly a year ago.

At this very moment, as shown in the screenshot, mii-llm/maestrale-chat-v0.4-beta is ranked 8th right between ChatGPT-4.5 and ChatGPT-4o.

It's likely that for several months, the best Italian speaking LLM has been an open source 7B model created by open source contributors and hardly anyone knew it.

2 replies

·

replied to their post 6 months ago

Ok, I understand...

In the past, I've also fine-tuned models with different licenses.
You may be interested in https://huggingface.co/anakin87/Phi-3.5-mini-ITA (MIT license).

posted an update 6 months ago

Post

1745

𝐍𝐞𝐰 𝐈𝐭𝐚𝐥𝐢𝐚𝐧 𝐒𝐦𝐚𝐥𝐥 𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞 𝐌𝐨𝐝𝐞𝐥𝐬: 𝐆𝐞𝐦𝐦𝐚 𝐍𝐞𝐨𝐠𝐞𝐧𝐞𝐬𝐢𝐬 𝐜𝐨𝐥𝐥𝐞𝐜𝐭𝐢𝐨𝐧 💎🌍🇮🇹

I am happy to release two new language models for the Italian Language!

💪 Gemma 2 9B Neogenesis ITA
anakin87/gemma-2-9b-neogenesis-ita
Building on the impressive work by VAGO Solutions, I applied Direct Preference Optimization with a mix of Italian and English data.
Using Spectrum, I trained 20% of model layers.

📊 Evaluated on the Open ITA LLM leaderboard ( mii-llm/open_ita_llm_leaderboard), this model achieves strong performance.
To beat it on this benchmark, you'd need a 27B model 😎

🤏 Gemma 2 2B Neogenesis ITA
anakin87/gemma-2-2b-neogenesis-ita
This smaller variant is fine-tuned from the original Gemma 2 2B it by Google.
Through a combination of Supervised Fine-Tuning and Direct Preference Optimization, I trained 25% of the layers using Spectrum.

📈 Compared to the original model, it shows improved Italian proficiency, good for its small size.

Both models were developed during the recent #gemma competition on Kaggle.
📓 Training code: https://www.kaggle.com/code/anakin87/post-training-gemma-for-italian-and-beyond

🙏 Thanks @FinancialSupport and mii-llm for the help during evaluation.

3 replies

·

reacted to tomaarsen's post with ❤️ 6 months ago

Post

4796

🏎️ Today I'm introducing a method to train static embedding models that run 100x to 400x faster on CPU than common embedding models, while retaining 85%+ of the quality! Including 2 fully open models: training scripts, datasets, metrics.

We apply our recipe to train 2 Static Embedding models that we release today! We release:
2️⃣ an English Retrieval model and a general-purpose Multilingual similarity model (e.g. classification, clustering, etc.), both Apache 2.0
🧠 my modern training strategy: ideation -> dataset choice -> implementation -> evaluation
📜 my training scripts, using the Sentence Transformers library
📊 my Weights & Biases reports with losses & metrics
📕 my list of 30 training and 13 evaluation datasets

The 2 Static Embedding models have the following properties:
🏎️ Extremely fast, e.g. 107500 sentences per second on a consumer CPU, compared to 270 for 'all-mpnet-base-v2' and 56 for 'gte-large-en-v1.5'
0️⃣ Zero active parameters: No Transformer blocks, no attention, not even a matrix multiplication. Super speed!
📏 No maximum sequence length! Embed texts at any length (note: longer texts may embed worse)
📐 Linear instead of exponential complexity: 2x longer text takes 2x longer, instead of 2.5x or more.
🪆 Matryoshka support: allow you to truncate embeddings with minimal performance loss (e.g. 4x smaller with a 0.56% perf. decrease for English Similarity tasks)

Check out the full blogpost if you'd like to 1) use these lightning-fast models or 2) learn how to train them with consumer-level hardware: https://huggingface.co/blog/static-embeddings

The blogpost contains a lengthy list of possible advancements; I'm very confident that our 2 models are only the tip of the iceberg, and we may be able to get even better performance.

Alternatively, check out the models:
* sentence-transformers/static-retrieval-mrl-en-v1
* sentence-transformers/static-similarity-mrl-multilingual-v1

1 reply

·

posted an update 6 months ago

Post

569

Hey, it has been a while... I was busy participating in 💎 𝐆𝐞𝐦𝐦𝐚 𝐜𝐨𝐦𝐩𝐞𝐭𝐢𝐭𝐢𝐨𝐧!

Here's the idea: Gemma open models have a large vocabulary size (256K), so improving them for a specific language or cultural context should be pretty affordable - no need for continued pre-training.

My submission: 💎🌍🇮🇹 𝐍𝐞𝐨𝐠𝐞𝐧𝐞𝐬𝐢𝐬 - 𝐏𝐨𝐬𝐭-𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐆𝐞𝐦𝐦𝐚 𝐟𝐨𝐫 𝐈𝐭𝐚𝐥𝐢𝐚𝐧 𝐚𝐧𝐝 𝐛𝐞𝐲𝐨𝐧𝐝
📓 Kaggle notebook: https://www.kaggle.com/code/anakin87/post-training-gemma-for-italian-and-beyond

In this notebook, I show how I improve the performance of Gemma 2 2B on Italian via Post-Training.
I believe this method is adaptable to other languages and model sizes.

𝘒𝘦𝘺 𝘚𝘵𝘦𝘱𝘴
📊 Choose reference metrics
🧑‍🔬 Data curation for Instruction Fine Tuning: identify existing datasets + generate synthetic data
🏋️‍♂️ Efficient Instruction Fine Tuning with Spectrum
🧑‍🔬 Data curation for Preference Tuning: identify existing datasets + generate synthetic data
👍👎 Efficient Direct Preference Optimization with Spectrum
📈 Evaluation

🤗 Hugging Face collection (with models and datasets): anakin87/gemma-neogenesis-67824b7bf13ac9cfe091fe2e

I'm also planning a 🎁 Gemma Giveaway (on LinkedIn - https://www.linkedin.com/in/stefano-fiorucci) in the next few days - sharing techniques, datasets, and models I used for my project... so stay tuned! 📻

reacted to tomaarsen's post with ❤️ 6 months ago

Post

3106

That didn't take long! Nomic AI has finetuned the new ModernBERT-base encoder model into a strong embedding model for search, classification, clustering and more!

Details:
🤖 Based on ModernBERT-base with 149M parameters.
📊 Outperforms both nomic-embed-text-v1 and nomic-embed-text-v1.5 on MTEB!
🏎️ Immediate FA2 and unpacking support for super efficient inference.
🪆 Trained with Matryoshka support, i.e. 2 valid output dimensionalities: 768 and 256.
➡️ Maximum sequence length of 8192 tokens!
2️⃣ Trained in 2 stages: unsupervised contrastive data -> high quality labeled datasets.
➕ Integrated in Sentence Transformers, Transformers, LangChain, LlamaIndex, Haystack, etc.
🏛️ Apache 2.0 licensed: fully commercially permissible

Try it out here: nomic-ai/modernbert-embed-base

Very nice work by Zach Nussbaum and colleagues at Nomic AI.

reacted to anton-l's post with 🔥 7 months ago

Post

3059

Introducing 📐𝐅𝐢𝐧𝐞𝐌𝐚𝐭𝐡: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath

Math remains challenging for LLMs and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

We build the dataset by:
🛠️ carefully extracting math data from Common Crawl;
🔎 iteratively filtering and recalling high quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.

We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observe notable gains compared to the baseline model and other public math datasets.

We hope this helps advance the performance of LLMs on math and reasoning! 🚀
We’re also releasing all the ablation models as well as the evaluation code.

HuggingFaceTB/finemath-6763fb8f71b6439b653482c2

reacted to lewtun's post with 🔥 7 months ago

Post

7015

We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute 🔥

How? By combining step-wise reward models with tree search algorithms :)

We show that smol models can match or exceed the performance of their much larger siblings when given enough "time to think"

We're open sourcing the full recipe and sharing a detailed blog post.

In our blog post we cover:

📈 Compute-optimal scaling: How we implemented DeepMind's recipe to boost the mathematical capabilities of open models at test-time.

🎄 Diverse Verifier Tree Search (DVTS): An unpublished extension we developed to the verifier-guided tree search technique. This simple yet effective method improves diversity and delivers better performance, particularly at large test-time compute budgets.

🧭 Search and Learn: A lightweight toolkit for implementing search strategies with LLMs and built for speed with vLLM

Here's the links:

- Blog post: HuggingFaceH4/blogpost-scaling-test-time-compute

- Code: https://github.com/huggingface/search-and-learn

Enjoy!

2 replies

·

reacted to DawnC's post with 👍 7 months ago

Post

1438

💡 Curious about dog breeds? 🐕 Meet PawMatchAI!
I've created this fun and interactive project to help you recognize dog breeds, find the perfect pup for your lifestyle, and even compare different breeds! Recently upgraded with smarter AI detection - it can now better distinguish between dogs and non-dogs (no more confusing cats for huskies! 😺➡️🐕).

🐾 What's cool about it?
Smart breed recognition powered by AI
Lifestyle-based breed recommendations
Detailed breed comparisons
And now with enhanced non-dog filtering!

🌟 Why try it?
Whether you're a dog lover, considering a new furry friend, or just curious, PawMatchAI makes discovering breeds fun and informative! As someone passionate about both AI and pets, I'm combining my two loves while working toward my goal of contributing to the AI industry.

🔎 Got feedback?
While it's not perfect, your input helps make it better! I'd love to hear your thoughts as I continue improving this project on my journey into AI development.

👉 Try it now: DawnC/PawMatchAI

🎯 Your support matters!
Every like 👍 or comment 📝 helps fuel my passion for AI development and keeps me motivated to create more helpful tools. Let's make the AI journey fun and impactful together!

#AI #MachineLearning #DeepLearning #Pytorch #ComputerVision

reacted to Narsil's post with ❤️ 7 months ago

Post

1725

Performance leap: TGI v3 is out. Processes 3x more tokens, 13x faster than vLLM on long prompts. Zero config !

3x more tokens.

By reducing our memory footprint, we’re able to ingest many more tokens and more dynamically than before. A single L4 (24GB) can handle 30k tokens on llama 3.1-8B, while vLLM gets barely 10k. A lot of work went into reducing the footprint of the runtime and its effect are best seen on smaller constrained environments.
13x faster

On long prompts (200k+ tokens) conversation replies take 27.5s in vLLM, while it takes only 2s in TGI. How so ? We keep the initial conversation around, so when a new reply comes in, we can answer almost instantly. The overhead of the lookup is ~5us. Thanks @Dani ël de Kok for the beast data structure.
Zero config

That’s it. Remove all the flags your are using and you’re likely to get the best performance. By evaluating the hardware and model, TGI carefully selects automatic values to give best performance. In production, we don’t have any flags anymore in our deployments. We kept all existing flags around, they may come in handy in niche scenarios.

Read more: https://huggingface.co/docs/text-generation-inference/conceptual/chunking

replied to their post 7 months ago

@Mollel created another dataset using Glot for language detection instead of fastText.

https://huggingface.co/datasets/sartifyllc/tulu-3-sft-mixture-language-glot

Good work!

posted an update 7 months ago

Post

1669

Tulu 3 SFT Mixture by AllenAI is a massive, good, multilingual dataset for fine-tuning Language Models.

Unfortunately, it was missing the "language" column.

I added it using the good old fastText.

Check out the dataset here 👉 anakin87/tulu-3-sft-mixture-with-language

1 reply

·

reacted to dvilasuero's post with ❤️ 7 months ago

Post

2769

🌐 Announcing Global-MMLU: an improved MMLU Open dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.

Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior Técnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.

🏷️ +200 contributors used Argilla MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions required Western-centric knowledge!

Thanks to this annotation process, the open dataset contains two subsets:

1. 🗽 Culturally Agnostic: no specific regional, cultural knowledge is required.
2. ⚖️ Culturally Sensitive: requires dialect, cultural knowledge or geographic knowledge to answer correctly.

Moreover, we provide high quality translations of 25 out of 42 languages, thanks again to the community and professional annotators leveraging Argilla on the Hub.

I hope this will ensure a better understanding of the limitations and challenges for making open AI useful for many languages.

Dataset: https://huggingface.co/datasets/CohereForAI/Global-MMLU

posted an update 8 months ago

Post

443

🐝🐝🐝 𝐀 𝐒𝐰𝐚𝐫𝐦 𝐨𝐟 𝐀𝐠𝐞𝐧𝐭𝐬 𝐰𝐢𝐭𝐡 𝐋𝐥𝐚𝐦𝐚 3.2, 𝐆𝐏𝐓-4𝐨 𝐦𝐢𝐧𝐢 𝐚𝐧𝐝 𝐂𝐥𝐚𝐮𝐝𝐞 3.5 𝐒𝐨𝐧𝐧𝐞𝐭

𝐓𝐋;𝐃𝐑: I reimplemented the Swarm concept using Haystack, but made it work with both open and proprietary models 💫

✍️ blog article: https://haystack.deepset.ai/blog/swarm-of-agents
📓 notebook: https://haystack.deepset.ai/cookbook/swarm

Some time ago OpenAI published Swarm: an educational framework for building multi-agent systems.

Their approach focuses on two main concepts:
・ 𝐑𝐨𝐮𝐭𝐢𝐧𝐞𝐬: Each agent follows specific 📜 instructions and uses 🛠️ tools to execute them.
・ 𝐇𝐚𝐧𝐝𝐨𝐟𝐟𝐬 🤝: Agents can transfer control to one another using tool/function calling.

When I first read these ideas, I thought: 𝘴𝘪𝘮𝘱𝘭𝘦 𝘣𝘶𝘵 𝘱𝘰𝘸𝘦𝘳𝘧𝘶𝘭! And they pair well with the recent unified tool support in Haystack.

🧑‍💻 So, I decided to re-implement these concepts using Haystack, and in just a few lines of code, I had a working prototype.

🆒 Bonus feature: this implementation isn't tied to a single model provider - different agents can be powered by different models!

I replicated the ACME customer service example from the original article, with 3 Agents:
🐝 Triage Agent - Llama 3.2 running on Ollama
🐝 Sales Agent - Anthropic Claude 3.5 Sonnet
🐝 Issues and Repairs Agent - OpenAI GPT-4o mini

Want to see the full implementation and give it a try? Check out the blog post and notebook! ✨

Stefano Fiorucci PRO

AI & ML interests

Recent Activity

Organizations

Stefano Fiorucci PRO

AI & ML interests

Recent Activity

Organizations

anakin87's activity