Stefano Fiorucci PRO

anakin87

AI & ML interests

Contributing to Haystack LLM framework 🏗️. Language Models: orchestration, post-training, synthetic data...

Recent Activity

updated a Space 1 day ago
anakin87/fact-checking-rocks
updated a Space 1 day ago
anakin87/Phi-3.5-mini-ITA
updated a Space 1 day ago
anakin87/gemma-2-2b-neogenesis-ita

Articles

Organizations

deepset · Blog-explorers · ZeroGPU Explorers · Hugging Face Discord Community

Posts 12

Hey, it has been a while... I was busy participating in the 💎 Gemma competition!

Here's the idea: Gemma open models have a large vocabulary (256K tokens), so improving them for a specific language or cultural context should be pretty affordable - no need for continued pre-training.

My submission: šŸ’ŽšŸŒšŸ‡®šŸ‡¹ ššžšØš šžš§šžš¬š¢š¬ - ššØš¬š­-š“š«ššš¢š§š¢š§š  š†šžš¦š¦šš šŸšØš« šˆš­ššš„š¢ššš§ ššš§š š›šžš²šØš§š
šŸ““ Kaggle notebook: https://www.kaggle.com/code/anakin87/post-training-gemma-for-italian-and-beyond

In this notebook, I show how I improved the performance of Gemma 2 2B on Italian via post-training.
I believe this method is adaptable to other languages and model sizes.

š˜’š˜¦š˜ŗ š˜šš˜µš˜¦š˜±š˜“
📊 Choose reference metrics
🧑‍🔬 Data curation for Instruction Fine-Tuning: identify existing datasets + generate synthetic data
🏋️‍♂️ Efficient Instruction Fine-Tuning with Spectrum
🧑‍🔬 Data curation for Preference Tuning: identify existing datasets + generate synthetic data
👍👎 Efficient Direct Preference Optimization with Spectrum
📈 Evaluation
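The two Spectrum steps above rely on the same idea: rank a model's modules by a signal-to-noise-ratio (SNR) score and train only the top-scoring ones, keeping the rest frozen. Here is a minimal pure-Python sketch of that selection logic; the per-module SNR values are made up for illustration, whereas the real Spectrum tool computes them from the model's weight matrices.

```python
# Sketch of Spectrum-style selective fine-tuning: rank modules by
# a signal-to-noise-ratio (SNR) score and unfreeze only the top
# fraction; everything else stays frozen during training.
# The SNR values below are illustrative, not measured.

def select_trainable(snr_by_module: dict, top_fraction: float = 0.25) -> set:
    """Return the module names whose SNR score is in the top fraction."""
    ranked = sorted(snr_by_module, key=snr_by_module.get, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return set(ranked[:k])

# Hypothetical per-module SNR scores for a tiny 4-layer model.
snr = {
    "layers.0.mlp": 0.9,
    "layers.1.mlp": 0.2,
    "layers.2.mlp": 0.7,
    "layers.3.mlp": 0.1,
}

trainable = select_trainable(snr, top_fraction=0.5)
# With these scores, layers 0 and 2 are unfrozen; layers 1 and 3 stay frozen.
```

In a real run you would then set `requires_grad = False` on every parameter whose module is not in the selected set, which is what makes both the SFT and the DPO stages cheap enough for a 2B model on modest hardware.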


🤗 Hugging Face collection (with models and datasets): anakin87/gemma-neogenesis-67824b7bf13ac9cfe091fe2e

I'm also planning a 🎁 Gemma Giveaway (on LinkedIn - https://www.linkedin.com/in/stefano-fiorucci) in the next few days - sharing techniques, datasets, and models I used for my project... so stay tuned! 📻
Tulu 3 SFT Mixture by AllenAI is a massive, high-quality, multilingual dataset for fine-tuning Language Models.

Unfortunately, it was missing the "language" column.

I added it using the good old fastText.

Check out the dataset here 👉 anakin87/tulu-3-sft-mixture-with-language
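Adding a language column like this can be sketched as below. This is a hedged sketch: the `detect_language` stub stands in for fastText's `lid.176` language-identification model (loaded with `fasttext.load_model("lid.176.bin")` and queried with `model.predict(text)` in the real `fasttext` package), and the tiny in-memory dataset stands in for the actual Tulu 3 mixture.

```python
# Sketch of adding a "language" column to an SFT dataset.
# In the real workflow this stub is replaced by fastText's lid.176
# language-identification model; the keyword matching here is a toy.

def detect_language(text: str) -> str:
    """Toy stand-in for a language-identification model."""
    markers = {"it": ("ciao", "grazie"), "es": ("hola", "gracias")}
    lowered = text.lower()
    for lang, words in markers.items():
        if any(word in lowered for word in words):
            return lang
    return "en"  # default guess

def add_language_column(rows: list) -> list:
    """Classify the first user turn of each example and store the label."""
    for row in rows:
        first_user = next(m["content"] for m in row["messages"] if m["role"] == "user")
        row["language"] = detect_language(first_user)
    return rows

dataset = [
    {"messages": [{"role": "user", "content": "Ciao, come stai?"}]},
    {"messages": [{"role": "user", "content": "What is an LLM?"}]},
]
dataset = add_language_column(dataset)
```

Classifying only the first user turn keeps the pass cheap on a dataset of this size, and it is usually a reliable signal of the conversation's language.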