Stefano Fiorucci PRO

anakin87

AI & ML interests

Contributing to the Haystack LLM framework. Language Models: orchestration, post-training, synthetic data...

Recent Activity

liked a dataset about 8 hours ago
open-ita-llms/OpenSFT-ita
liked a dataset 7 days ago
anakin87/evol-dpo-ita-reranked
liked a dataset 7 days ago
anakin87/fine-instructions-ita-70k

Articles

Organizations

deepset, Blog-explorers, ZeroGPU Explorers, Hugging Face Discord Community

anakin87's activity

replied to their post 9 days ago
posted an update 9 days ago
New Italian Small Language Models: Gemma Neogenesis collection

I am happy to release two new language models for the Italian language!

Gemma 2 9B Neogenesis ITA
anakin87/gemma-2-9b-neogenesis-ita
Building on the impressive work by VAGO Solutions, I applied Direct Preference Optimization with a mix of Italian and English data.
Using Spectrum, I trained 20% of model layers.

Evaluated on the Open ITA LLM leaderboard (mii-llm/open_ita_llm_leaderboard), this model achieves strong performance.
To beat it on this benchmark, you'd need a 27B model.

Gemma 2 2B Neogenesis ITA
anakin87/gemma-2-2b-neogenesis-ita
This smaller variant is fine-tuned from the original Gemma 2 2B it by Google.
Through a combination of Supervised Fine-Tuning and Direct Preference Optimization, I trained 25% of the layers using Spectrum.

Compared to the original model, it shows improved Italian proficiency, good for its small size.

Both models were developed during the recent #gemma competition on Kaggle.
Training code: https://www.kaggle.com/code/anakin87/post-training-gemma-for-italian-and-beyond

Thanks @FinancialSupport and mii-llm for the help during evaluation.
reacted to tomaarsen's post with ❤️ 14 days ago
Today I'm introducing a method to train static embedding models that run 100x to 400x faster on CPU than common embedding models, while retaining 85%+ of the quality! Including 2 fully open models: training scripts, datasets, metrics.

We apply our recipe to train 2 Static Embedding models that we release today! We release:
- an English Retrieval model and a general-purpose Multilingual similarity model (e.g. classification, clustering, etc.), both Apache 2.0
- my modern training strategy: ideation -> dataset choice -> implementation -> evaluation
- my training scripts, using the Sentence Transformers library
- my Weights & Biases reports with losses & metrics
- my list of 30 training and 13 evaluation datasets

The 2 Static Embedding models have the following properties:
- Extremely fast, e.g. 107500 sentences per second on a consumer CPU, compared to 270 for 'all-mpnet-base-v2' and 56 for 'gte-large-en-v1.5'
- Zero active parameters: No Transformer blocks, no attention, not even a matrix multiplication. Super speed!
- No maximum sequence length! Embed texts at any length (note: longer texts may embed worse)
- Linear instead of exponential complexity: 2x longer text takes 2x longer, instead of 2.5x or more.
- Matryoshka support: allows you to truncate embeddings with minimal performance loss (e.g. 4x smaller with a 0.56% perf. decrease for English Similarity tasks)
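
To make the Matryoshka property above concrete: truncating an embedding means keeping its first dimensions and rescaling to unit length before comparing vectors. The following is a minimal sketch in plain Python, not code from the post. The helper names `truncate_embedding` and `cosine` are illustrative, and the 8-dimensional vectors are toy stand-ins for real model output.

```python
import math

def truncate_embedding(vector: list[float], dim: int) -> list[float]:
    """Keep the first `dim` values and rescale the result to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))

# Toy 8-dimensional vectors standing in for real embeddings (assumption).
doc = [0.9, 0.1, 0.3, 0.2, 0.05, 0.01, 0.02, 0.03]
query = [0.8, 0.2, 0.25, 0.1, 0.0, 0.04, 0.01, 0.02]

full_score = cosine(truncate_embedding(doc, 8), truncate_embedding(query, 8))
small_score = cosine(truncate_embedding(doc, 2), truncate_embedding(query, 2))  # 4x smaller
```

With the Sentence Transformers library, the same effect is typically obtained by passing `truncate_dim` when loading a model, so no manual slicing is needed.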

Check out the full blogpost if you'd like to 1) use these lightning-fast models or 2) learn how to train them with consumer-level hardware: https://huggingface.co/blog/static-embeddings

The blogpost contains a lengthy list of possible advancements; I'm very confident that our 2 models are only the tip of the iceberg, and we may be able to get even better performance.

Alternatively, check out the models:
* sentence-transformers/static-retrieval-mrl-en-v1
* sentence-transformers/static-similarity-mrl-multilingual-v1
posted an update 14 days ago
Hey, it has been a while... I was busy participating in the Gemma competition!

Here's the idea: Gemma open models have a large vocabulary size (256K), so improving them for a specific language or cultural context should be pretty affordable - no need for continued pre-training.

My submission: Neogenesis - Post-Training Gemma for Italian and beyond
Kaggle notebook: https://www.kaggle.com/code/anakin87/post-training-gemma-for-italian-and-beyond

In this notebook, I show how I improved the performance of Gemma 2 2B on Italian via Post-Training.
I believe this method is adaptable to other languages and model sizes.

Key Steps
- Choose reference metrics
- Data curation for Instruction Fine Tuning: identify existing datasets + generate synthetic data
- Efficient Instruction Fine Tuning with Spectrum
- Data curation for Preference Tuning: identify existing datasets + generate synthetic data
- Efficient Direct Preference Optimization with Spectrum
- Evaluation

Hugging Face collection (with models and datasets): anakin87/gemma-neogenesis-67824b7bf13ac9cfe091fe2e

I'm also planning a Gemma Giveaway (on LinkedIn - https://www.linkedin.com/in/stefano-fiorucci) in the next few days - sharing techniques, datasets, and models I used for my project... so stay tuned!
reacted to tomaarsen's post with ❤️ 29 days ago
That didn't take long! Nomic AI has finetuned the new ModernBERT-base encoder model into a strong embedding model for search, classification, clustering and more!

Details:
- Based on ModernBERT-base with 149M parameters.
- Outperforms both nomic-embed-text-v1 and nomic-embed-text-v1.5 on MTEB!
- Immediate FA2 and unpacking support for super efficient inference.
- Trained with Matryoshka support, i.e. 2 valid output dimensionalities: 768 and 256.
- Maximum sequence length of 8192 tokens!
- Trained in 2 stages: unsupervised contrastive data -> high quality labeled datasets.
- Integrated in Sentence Transformers, Transformers, LangChain, LlamaIndex, Haystack, etc.
- Apache 2.0 licensed: fully commercially permissible

Try it out here: nomic-ai/modernbert-embed-base

Very nice work by Zach Nussbaum and colleagues at Nomic AI.
reacted to anton-l's post with 🔥 about 1 month ago
Introducing FineMath: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath

Math remains challenging for LLMs and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

We build the dataset by:
- carefully extracting math data from Common Crawl;
- iteratively filtering and recalling high quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.

We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains compared to the baseline model and other public math datasets.

We hope this helps advance the performance of LLMs on math and reasoning!
We're also releasing all the ablation models as well as the evaluation code.

HuggingFaceTB/finemath-6763fb8f71b6439b653482c2
reacted to lewtun's post with 🔥 about 1 month ago
We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute

How? By combining step-wise reward models with tree search algorithms :)

We show that smol models can match or exceed the performance of their much larger siblings when given enough "time to think"

We're open sourcing the full recipe and sharing a detailed blog post.

In our blog post we cover:

- Compute-optimal scaling: How we implemented DeepMind's recipe to boost the mathematical capabilities of open models at test-time.

- Diverse Verifier Tree Search (DVTS): An unpublished extension we developed to the verifier-guided tree search technique. This simple yet effective method improves diversity and delivers better performance, particularly at large test-time compute budgets.

- Search and Learn: A lightweight toolkit for implementing search strategies with LLMs, built for speed with vLLM
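
As a rough illustration of how a step-wise reward model can guide tree search, here is a minimal beam-search sketch in plain Python. It is not the search-and-learn API and not the DVTS variant. `propose_steps` and `score_partial` are hypothetical callables standing in for LLM sampling and a reward model, and the toy functions exist only so the example runs.

```python
from typing import Callable

def beam_search(
    question: str,
    propose_steps: Callable[[str], list[str]],  # stand-in for LLM sampling
    score_partial: Callable[[str], float],      # stand-in for a step-wise reward model
    beam_width: int = 2,
    max_depth: int = 3,
) -> str:
    """Grow partial solutions step by step, keeping only the best-scored ones."""
    beams = [question]
    for _ in range(max_depth):
        candidates = [b + "\n" + step for b in beams for step in propose_steps(b)]
        if not candidates:
            break
        # Rank every extended solution with the verifier and prune the rest.
        candidates.sort(key=score_partial, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Toy problem (assumption): each step adds 1, 2 or 3 and the "reward" is closeness to 6.
def toy_propose(partial: str) -> list[str]:
    return ["1", "2", "3"]

def toy_score(partial: str) -> float:
    total = sum(int(line) for line in partial.split("\n")[1:])
    return -abs(6 - total)

best = beam_search("Reach 6", toy_propose, toy_score, beam_width=2, max_depth=2)
```

More test-time compute here means a larger `beam_width` or more proposals per step: the verifier sees more candidates and has more chances to find a good path.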

Here are the links:

- Blog post: HuggingFaceH4/blogpost-scaling-test-time-compute

- Code: https://github.com/huggingface/search-and-learn

Enjoy!
reacted to DawnC's post with 👍 about 2 months ago
Curious about dog breeds? Meet PawMatchAI!
I've created this fun and interactive project to help you recognize dog breeds, find the perfect pup for your lifestyle, and even compare different breeds! Recently upgraded with smarter AI detection - it can now better distinguish between dogs and non-dogs (no more confusing cats for huskies!).

What's cool about it?
- Smart breed recognition powered by AI
- Lifestyle-based breed recommendations
- Detailed breed comparisons
- And now with enhanced non-dog filtering!

Why try it?
Whether you're a dog lover, considering a new furry friend, or just curious, PawMatchAI makes discovering breeds fun and informative! As someone passionate about both AI and pets, I'm combining my two loves while working toward my goal of contributing to the AI industry.

Got feedback?
While it's not perfect, your input helps make it better! I'd love to hear your thoughts as I continue improving this project on my journey into AI development.

Try it now: DawnC/PawMatchAI

Your support matters!
Every like or comment helps fuel my passion for AI development and keeps me motivated to create more helpful tools. Let's make the AI journey fun and impactful together!

#AI #MachineLearning #DeepLearning #Pytorch #ComputerVision
reacted to Narsil's post with ❤️ about 2 months ago
Performance leap: TGI v3 is out. Processes 3x more tokens, 13x faster than vLLM on long prompts. Zero config!

3x more tokens

By reducing our memory footprint, we're able to ingest many more tokens and more dynamically than before. A single L4 (24GB) can handle 30k tokens on Llama 3.1-8B, while vLLM gets barely 10k. A lot of work went into reducing the footprint of the runtime, and its effects are best seen on smaller, constrained environments.

13x faster

On long prompts (200k+ tokens), conversation replies take 27.5s in vLLM, while they take only 2s in TGI. How so? We keep the initial conversation around, so when a new reply comes in, we can answer almost instantly. The overhead of the lookup is ~5us. Thanks @Daniël de Kok for the beast data structure.
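
The idea of keeping the initial conversation around can be pictured as a prefix lookup: if the tokens of a new request start with tokens that were already processed, only the new tail needs computing. The toy sketch below shows that lookup in plain Python. It is not TGI's actual data structure, which the post does not describe. `PrefixCache` and the placeholder state string are assumptions for illustration.

```python
class PrefixCache:
    """Toy cache: remembers which token prefixes already have computed state."""

    def __init__(self) -> None:
        self._cached: dict[tuple[int, ...], str] = {}

    def store(self, tokens: list[int], state: str) -> None:
        self._cached[tuple(tokens)] = state

    def longest_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens are already covered by the cache."""
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self._cached:
                return end
        return 0

cache = PrefixCache()
conversation = [101, 7, 42, 9, 13]            # tokens of the conversation so far
cache.store(conversation, state="kv-state-placeholder")

follow_up = conversation + [55, 60]           # same conversation plus a new message
reused = cache.longest_prefix(follow_up)      # tokens whose work can be skipped
to_compute = len(follow_up) - reused          # only the new tail is processed
```

This linear scan is only for readability. A lookup in the microsecond range on 200k-token prompts would need a tree-like structure rather than rehashing every prefix.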

Zero config

That's it. Remove all the flags you are using and you're likely to get the best performance. By evaluating the hardware and model, TGI carefully selects automatic values to give the best performance. In production, we don't have any flags anymore in our deployments. We kept all existing flags around; they may come in handy in niche scenarios.

Read more: https://huggingface.co/docs/text-generation-inference/conceptual/chunking
replied to their post about 2 months ago
posted an update about 2 months ago
Tulu 3 SFT Mixture by AllenAI is a massive, good, multilingual dataset for fine-tuning Language Models.

Unfortunately, it was missing the "language" column.

I added it using the good old fastText.

Check out the dataset here: anakin87/tulu-3-sft-mixture-with-language
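
Adding a language column with fastText boils down to running a language-identification model on each example and storing the predicted label. The sketch below is not the author's script. It assumes each row has a `messages` list of `{"role", "content"}` dicts, and it uses a stub in place of the real model so it runs without any download. `add_language_column` and `fake_predict` are illustrative names.

```python
from typing import Callable

def add_language_column(rows: list[dict], predict: Callable) -> list[dict]:
    """Return copies of `rows` with a "language" key added.

    `predict` is expected to behave like fastText's `model.predict(text)`:
    it returns (labels, probabilities), with labels such as "__label__it".
    """
    enriched = []
    for row in rows:
        # Assumption: each row holds a "messages" list of {"role", "content"} dicts.
        text = " ".join(m["content"] for m in row["messages"])
        # fastText predicts one line at a time, so newlines are replaced first.
        labels, _probs = predict(text.replace("\n", " "))
        enriched.append({**row, "language": labels[0].removeprefix("__label__")})
    return enriched

# Stub predictor (assumption) so the sketch runs without a model file.
def fake_predict(text: str):
    label = "__label__it" if "ciao" in text.lower() else "__label__en"
    return (label,), [0.99]

rows = [
    {"messages": [{"role": "user", "content": "Ciao!\nCome stai?"}]},
    {"messages": [{"role": "user", "content": "Hello, how are you?"}]},
]
tagged = add_language_column(rows, fake_predict)
```

With the real library, `predict` would be the `predict` method of a model loaded through `fasttext.load_model`, pointed at a downloaded language-identification model such as `lid.176.bin`.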

reacted to dvilasuero's post with ❤️ about 2 months ago
Announcing Global-MMLU: an improved MMLU Open dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.

Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior Técnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.

+200 contributors used Argilla to annotate MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions required Western-centric knowledge!

Thanks to this annotation process, the open dataset contains two subsets:

1. Culturally Agnostic: no specific regional or cultural knowledge is required.
2. Culturally Sensitive: requires dialect, cultural or geographic knowledge to answer correctly.

Moreover, we provide high quality translations for 25 of the 42 languages, thanks again to the community and professional annotators leveraging Argilla on the Hub.

I hope this will ensure a better understanding of the limitations and challenges for making open AI useful for many languages.

Dataset: CohereForAI/Global-MMLU
posted an update 2 months ago
A Swarm of Agents with Llama 3.2, GPT-4o mini and Claude 3.5 Sonnet

TL;DR: I reimplemented the Swarm concept using Haystack, but made it work with both open and proprietary models

blog article: https://haystack.deepset.ai/blog/swarm-of-agents
notebook: https://haystack.deepset.ai/cookbook/swarm

Some time ago OpenAI published Swarm: an educational framework for building multi-agent systems.

Their approach focuses on two main concepts:
- Routines: Each agent follows specific instructions and uses tools to execute them.
- Handoffs: Agents can transfer control to one another using tool/function calling.
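
The two concepts above fit in a few lines: a routine is an agent's instructions plus its tools, and a handoff is a tool whose return value is another agent. The sketch below is a toy illustration in plain Python, not the Haystack implementation from the blog post and not the OpenAI Swarm API. The `Agent` class and `run_tool_call` helper are assumptions, and the tool choice an LLM would normally make is hard-coded.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    name: str
    instructions: str                                  # the "routine"
    tools: dict[str, Callable] = field(default_factory=dict)

sales = Agent("Sales Agent", "Sell ACME products.")
triage = Agent(
    "Triage Agent",
    "Route the customer to the right agent.",
    # A handoff is just a tool whose return value is another agent.
    tools={"transfer_to_sales": lambda: sales},
)

def run_tool_call(current: Agent, tool_name: str) -> Agent:
    """Execute a tool call; switch the active agent if the tool returns one."""
    result = current.tools[tool_name]()
    return result if isinstance(result, Agent) else current

# In a real system the LLM would pick `tool_name`; here it is hard-coded.
active = run_tool_call(triage, "transfer_to_sales")
```

Because the handoff is an ordinary tool call, any model with tool/function calling support can drive it, which is what makes mixing providers possible.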


When I first read these ideas, I thought: simple but powerful! And they pair well with the recent unified tool support in Haystack.

So, I decided to re-implement these concepts using Haystack, and in just a few lines of code, I had a working prototype.

Bonus feature: this implementation isn't tied to a single model provider - different agents can be powered by different models!

I replicated the ACME customer service example from the original article, with 3 Agents:
- Triage Agent - Llama 3.2 running on Ollama
- Sales Agent - Anthropic Claude 3.5 Sonnet
- Issues and Repairs Agent - OpenAI GPT-4o mini

Want to see the full implementation and give it a try? Check out the blog post and notebook!
reacted to davanstrien's post with ❤️ 2 months ago
replied to their post 3 months ago
Magpie with system message

I had another idea: use the system message to steer generation towards a specific language.

The system message should be in the target language, like:
"You are an artificial intelligence that answers users' questions in TARGET_LANGUAGE in a useful and detailed way. The user asks complex questions in TARGET_LANGUAGE."

It is a simple approach, but it might work...
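
Here is a small sketch of what such a prompt could look like for a Llama 3 style chat template. Two assumptions are involved: the `"\n\n"` after each header follows the Llama 3 chat format and may differ from the exact template used in the article, and the Italian system message is an illustrative translation of the text above. `magpie_prefix` is a made-up helper name.

```python
from typing import Optional

def header(role: str) -> str:
    # Llama 3 style role header; the trailing "\n\n" is an assumption (see above).
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n"

def magpie_prefix(system_message: Optional[str] = None) -> str:
    """Build a pre-query template, optionally preceded by a system message."""
    prefix = "<|begin_of_text|>"
    if system_message:
        prefix += header("system") + system_message + "<|eot_id|>"
    # Stop right after the user header so the model writes the user turn itself.
    return prefix + header("user")

# Illustrative Italian version of the system message (my own translation).
italian_system = (
    "Sei un'intelligenza artificiale che risponde alle domande degli utenti "
    "in italiano in modo utile e dettagliato. "
    "L'utente pone domande complesse in italiano."
)
prompt = magpie_prefix(italian_system)
```

Sampling a completion from `prompt` should then yield a user instruction that is more likely to be in the target language.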

It turns out the authors had a similar idea, which they included in the latest revision of their paper.

Resources

Magpie paper and repository: https://huggingface.co/papers/2406.08464 https://github.com/magpie-align/magpie

Magpie demo by @davanstrien : https://huggingface.co/spaces/davanstrien/magpie

Magpie Ollama Datagen by @mrm8488 : https://github.com/mrm8488/magpie-ollama-datagen

magpie-ultra dataset - massive dataset built with Magpie by Argilla: https://huggingface.co/datasets/argilla/magpie-ultra-v0.1

distilabel framework - framework for synthetic data generation and AI feedback at scale: https://distilabel.argilla.io/latest/

posted an update 3 months ago
Ok, you're finally convinced that synthetic data works...

Now you want to generate an instruction dataset for fine-tuning in a language other than English.
But how do you get started?

I explore how to do this with Magpie in my new article:
https://huggingface.co/blog/anakin87/multilingual-magpie

---

What is Magpie?

It's a recent technique for creating synthetic instruction datasets.

Magpie is based on a simple but ingenious idea:
if you prompt an instruction-tuned model with a pre-query template, you can make it generate a plausible user query/instruction.

Here's an example:
model: Llama-3-8B-Instruct
pre-query template: "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"
generated user instruction: "What are some of the responsibilities of a commercial pilot?"

You can then feed this instruction back into the same model to get the assistant response.

By repeating this process, it's possible to generate large synthetic datasets with relatively little effort.
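
The two-step loop described above can be sketched as follows. This is a simplified illustration, not the Magpie reference implementation. `generate` is a hypothetical callable that returns a model completion for a raw prompt, the `"\n\n"` after each header follows the Llama 3 chat format and is an assumption, and `fake_generate` is a stub with placeholder text so the example runs without a model.

```python
from typing import Callable

USER_HEADER = "<|start_header_id|>user<|end_header_id|>\n\n"
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"
PRE_QUERY = "<|begin_of_text|>" + USER_HEADER

def magpie_pairs(generate: Callable[[str], str], n: int) -> list[dict]:
    """Generate `n` instruction/response pairs with a two-step loop."""
    pairs = []
    for _ in range(n):
        # Step 1: the model completes the empty user turn with an instruction.
        instruction = generate(PRE_QUERY).strip()
        # Step 2: the instruction goes back in as a finished user turn.
        prompt = PRE_QUERY + instruction + "<|eot_id|>" + ASSISTANT_HEADER
        response = generate(prompt).strip()
        pairs.append({"instruction": instruction, "response": response})
    return pairs

# Stub model (assumption): returns fixed placeholder text depending on the turn.
def fake_generate(prompt: str) -> str:
    if prompt.endswith(ASSISTANT_HEADER):
        return "A placeholder assistant answer."
    return "What are some of the responsibilities of a commercial pilot?"

dataset = magpie_pairs(fake_generate, n=2)
```

In practice, sampling with a non-zero temperature in step 1 is what makes each generated instruction different.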

The authors demonstrate that using these datasets for Supervised Fine Tuning (SFT) can yield strong performance, even competitive with the original instruct model.

Generating non-English data

Most Language Models are primarily trained on English texts, so they tend to produce data in English.

How can we overcome this?

Earlier approaches were complex or costly.

Then @mrm8488 found a simple solution: add the target language to the pre-query template.
For Spanish, the template becomes "<|begin_of_text|><|start_header_id|>user<|end_header_id|>spanish:".

This method works for Spanish and German!

Unfortunately, it does not work well for other languages (Italian, Dutch, ...)
posted an update 4 months ago
Agentic RAG with Llama 3.2

I was excited to explore Llama 3.2, but as a simple EU guy, I don't have access to Meta's multimodal models.

So I thought: why not challenge the small 3B text model with Agentic RAG?

The plan:
- Build a system that tries to answer questions using a knowledge base.
- If the documents don't contain the answer, use Web search for additional context.
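
The plan above can be sketched as a simple fallback: answer from the knowledge base first, and only call web search when the model signals it cannot answer. The sketch below is a toy illustration in plain Python, not the Haystack pipeline from the notebook. The `no_answer` sentinel and all function names are assumptions, and the stubs replace the retriever, the search API and the LLM.

```python
from typing import Callable

def answer_with_fallback(
    question: str,
    retrieve: Callable[[str], list[str]],        # knowledge-base lookup
    web_search: Callable[[str], list[str]],      # e.g. a search API wrapper
    generate: Callable[[str, list[str]], str],   # LLM call
    no_answer: str = "no_answer",
) -> tuple[str, str]:
    """Try the knowledge base first; fall back to web search if the LLM gives up."""
    reply = generate(question, retrieve(question))
    if no_answer not in reply:
        return reply, "knowledge_base"
    return generate(question, web_search(question)), "web"

# Stubs (assumptions) so the sketch runs without a model or network access.
def kb(question: str) -> list[str]:
    return ["Haystack is an open-source LLM orchestration framework."]

def web(question: str) -> list[str]:
    return ["Llama 3.2 was released by Meta."]

def fake_llm(question: str, context: list[str]) -> str:
    topic = question.split()[-1].rstrip("?").lower()   # last word as the "topic"
    matches = [doc for doc in context if topic in doc.lower()]
    return matches[0] if matches else "no_answer"

reply_1, source_1 = answer_with_fallback("What is Haystack?", kb, web, fake_llm)
reply_2, source_2 = answer_with_fallback("Who released Llama?", kb, web, fake_llm)
```

The routing relies on the model reliably emitting the sentinel when the context is insufficient, so instruction-following ability matters more here than raw knowledge.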


Check out my experimental notebook here: https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/llama32_agentic_rag.ipynb

My stack:
- haystack (https://haystack.deepset.ai/): open-source LLM orchestration framework
- meta-llama/Llama-3.2-3B-Instruct
- free DuckDuckGo API, integrated with Haystack

The results? Encouraging - a few months ago, this level of performance from a small model would've been unthinkable!
This probably reflects the impressive IFEval score of the model (comparable to Llama 3.1 8B).
posted an update 5 months ago
My first community article! Selective fine-tuning with Spectrum

Full walkthrough on how to get started with Spectrum and TRL for efficient fine-tuning.
https://huggingface.co/blog/anakin87/spectrum

---

Looking to fine-tune Language Models efficiently and save on computational resources?

One popular method is QLoRA, which quantizes the original model and trains low-rank adapters on top.
It's quite effective and uses less GPU memory than full fine-tuning.

However, QLoRA applies Low-Rank Adaptation uniformly across the entire model.

What if we could identify the most informative layers and only fine-tune those?

This is exactly what Spectrum does!

Spectrum analyzes the weight matrices for all layers in a Language Model and calculates a Signal to Noise Ratio (SNR) for each one.
(It uses Random Matrix Theory and the Marchenko-Pastur distribution to distinguish signal from noise.)

Based on a chosen percentage (say, 25%), Spectrum selects the most informative layers of each type (mlp.down_proj, self_attn.o_proj, etc.).

You can then freeze the rest of the model and focus your training on the chosen layers.
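
The selection step can be illustrated in isolation: given one SNR value per weight matrix, keep the top fraction within each layer type and treat everything else as frozen. This sketch is not Spectrum's code and does not compute SNR. The values below are made up, and `select_top_layers` is an illustrative name.

```python
from collections import defaultdict

def select_top_layers(snr: dict[str, float], fraction: float) -> set[str]:
    """Pick the highest-SNR `fraction` of layers within each layer type."""
    by_type: dict[str, list[str]] = defaultdict(list)
    for name in snr:
        # "model.layers.3.mlp.down_proj" -> layer type "mlp.down_proj"
        layer_type = ".".join(name.split(".")[-2:])
        by_type[layer_type].append(name)

    selected: set[str] = set()
    for names in by_type.values():
        keep = max(1, round(len(names) * fraction))
        selected.update(sorted(names, key=snr.get, reverse=True)[:keep])
    return selected

# Made-up SNR values for a 4-block toy model (assumption, not real measurements).
snr = {
    "model.layers.0.mlp.down_proj": 0.2, "model.layers.1.mlp.down_proj": 0.9,
    "model.layers.2.mlp.down_proj": 0.5, "model.layers.3.mlp.down_proj": 0.1,
    "model.layers.0.self_attn.o_proj": 0.7, "model.layers.1.self_attn.o_proj": 0.3,
    "model.layers.2.self_attn.o_proj": 0.4, "model.layers.3.self_attn.o_proj": 0.8,
}
trainable = select_top_layers(snr, fraction=0.25)
frozen = set(snr) - trainable
```

In a PyTorch training script, the `frozen` names would then be used to set `requires_grad = False` on the matching parameters before training starts.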


Results/Evaluation
- Spectrum is competitive with full fine-tuning and beats QLoRA on benchmarks.
- While QLoRA is more memory-efficient on a single GPU, Spectrum shines in distributed training setups.
- Great models trained with Spectrum: Dolphin models, Llama 3.1 Storm, numerous models by VAGO Solutions...

---

For a practical guide, check out the article above.
reacted to grimjim's post with 👀 5 months ago
I found this paper to be thought-provoking: "Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling" by Bansal, Hosseini, Agarwal, Tran, and Kazemi.
https://arxiv.org/abs/2408.16737
The direct implication is that smaller models could be used to create cost-effective synthetic datasets. And on that note, in the Gemma terms of use, Google explicitly claims no rights on outputs generated from those models, which means one is free to synthgen from the Gemma line. Meta's Llama 3 licence forbids synthetic generation of outputs if used to improve other models. Relevant Mistral, Qwen, and Yi models under the Apache 2.0 license are unrestricted for this purpose.
replied to their post 5 months ago