3C3H AraGen Leaderboard welcomes today deepseek-ai/DeepSeek-V3 and 12 other models (including the late gpt-3.5 ๐) to the ranking of best LLMs in Arabic !
Observations: - DeepSeek-v3 ranked 3rd and only Open model among the top 5 !
- A 14B open model (Qwen/Qwen2.5-14B-Instruct) outperforms gpt-3.5-turbo-0125 (from last year). This shows how much we came in advancing and supporting Arabic presence within the LLM ecosystem !
- Contrary to what observed in likelihood-acc leaderboards (like OALL/Open-Arabic-LLM-Leaderboard) further finetuned models like maldv/Qwentile2.5-32B-Instruct actually decreased the performance compared to the original model Qwen/Qwen2.5-32B-Instruct. It's worth to note that the decrease is statiscally insignificant which imply that at best, the out-domain finetuning do not really hurts the model original capabilities acquired during pretraining. Previous work addressed this (finetuning VS pretraining) but more investigation in this regard is required (any PhDs here ? This could be your question ...)
๐ฏFine-tuning SmolLM2 on a lightweight synthetic reasoning dataset for reasoning-specific tasks. Future updates will focus on lightweight, blazing-fast reasoning models. Until then, check out the blog for fine-tuning details.
๐ Introducing ๐ ๐ข๐ซ๐ฌ๐ญ ๐๐ฎ๐ ๐ ๐ข๐ง๐ ๐ ๐๐๐ ๐๐ง๐ญ๐๐ ๐ซ๐๐ญ๐ข๐จ๐ง ๐จ๐ ๐ฆ๐ข๐ง๐๐๐ ๐๐จ๐๐๐ฅ๐ฌ from the paper ๐๐๐ซ๐ ๐๐๐๐ฌ ๐๐ฅ๐ฅ ๐๐ ๐๐๐๐๐๐?
๐ฅ I have integrated ๐ง๐๐ฑ๐ญ-๐ ๐๐ง๐๐ซ๐๐ญ๐ข๐จ๐ง ๐๐๐๐ฌ, specifically minGRU, which offer faster performance compared to Transformer architectures, into HuggingFace. This allows users to leverage the lighter and more efficient minGRU models with the "๐ญ๐ซ๐๐ง๐ฌ๐๐จ๐ซ๐ฆ๐๐ซ๐ฌ" ๐ฅ๐ข๐๐ซ๐๐ซ๐ฒ for both usage and training.
๐ป I integrated two main tasks: ๐๐ข๐ง๐๐๐๐ ๐จ๐ซ๐๐๐ช๐ฎ๐๐ง๐๐๐๐ฅ๐๐ฌ๐ฌ๐ข๐๐ข๐๐๐ญ๐ข๐จ๐ง and ๐๐ข๐ง๐๐๐๐ ๐จ๐ซ๐๐๐ฎ๐ฌ๐๐ฅ๐๐.
๐๐ข๐ง๐๐๐๐ ๐จ๐ซ๐๐๐ช๐ฎ๐๐ง๐๐๐๐ฅ๐๐ฌ๐ฌ๐ข๐๐ข๐๐๐ญ๐ข๐จ๐ง: You can use this class for ๐๐๐ช๐ฎ๐๐ง๐๐ ๐๐ฅ๐๐ฌ๐ฌ๐ข๐๐ข๐๐๐ญ๐ข๐จ๐ง tasks. I also trained a Sentiment Analysis model with stanfordnlp/imdb dataset.
๐๐ข๐ง๐๐๐๐ ๐จ๐ซ๐๐๐ฎ๐ฌ๐๐ฅ๐๐: You can use this class for ๐๐๐ฎ๐ฌ๐๐ฅ ๐๐๐ง๐ ๐ฎ๐๐ ๐ ๐๐จ๐๐๐ฅ tasks such as GPT, Llama. I also trained an example model with roneneldan/TinyStories dataset. You can fine-tune and use it!
~75% on the challenging GPQA with only 40M parameters ๐ฅ๐ฅณ
GREAT ACHIEVEMENT ! Or is it ?
This new Work, "Data Laundering: Artificially Boosting Benchmark Results through Knowledge Distillation", take out the mystery about many models i personally suspected their results. Speacially on leaderboards other than the english one, Like the Open Arabic LLM Leaderbaord OALL/Open-Arabic-LLM-Leaderboard.
The authors of this work, first started by training a model on the GPQA data, which, unsurprisingly, led to the model achieving 100% performance.
Afterward, they trained what they referred to as a 'legitimate' model on legitimate data (MedMCQA). However, they introduced a distillation loss from the earlier, 'cheated' model.
What they discovered was fascinating: the knowledge of GPQA leaked through this distillation loss, even though the legitimate model was never explicitly trained on GPQA during this stage.
This raises important questions about the careful use of distillation in model training, especially when the training data is opaque. As they demonstrated, itโs apparently possible to (intentionally or unintentionally) leak test data through this method.
Unpopular opinion: Open Source takes courage to do !
Not everyone is brave enough to release what they have done (the way they've done it) to the wild to be judged ! It really requires a high level of "knowing wth are you doing" ! It's kind of a super power !
Cheers to the heroes here who see this!
5 replies
ยท
reacted to takarajordan's
post with ๐ฅโค๏ธabout 1 month ago
I'm super excited to release my first open-source text dataset:
WorldScenario 20K is a novel dataset of 20,000 synthetically generated multi-stakeholder scenarios designed to simulate real-world decision-making processes. Each scenario explores a unique environmental, societal, or economic issue.
I used the brand new meta-llama/Llama-3.3-70B-Instruct model to generate this dataset and I put the dataset through some post processing to clean and evaluate the dataset for diversity.
I'd appreciate some feedback and thoughts on my new release! Thanks!
Well, this is a bit late but consider given our recent blog a read if you are interested in Evaluation.
You don't have to be into Arabic NLP in order to read it, the main contribution we are introducing is a new evaluation measure for NLG. We made the fisrt application of this measure on Arabic for now and we will be working with colleagues from the community to expand it to other languages.
๐ Announcing Global-MMLU: an improved MMLU Open dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.
Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior Tรฉcnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.
๐ท๏ธ +200 contributors used Argilla MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions required Western-centric knowledge!
Thanks to this annotation process, the open dataset contains two subsets:
1. ๐ฝ Culturally Agnostic: no specific regional, cultural knowledge is required. 2. โ๏ธ Culturally Sensitive: requires dialect, cultural knowledge or geographic knowledge to answer correctly.
Moreover, we provide high quality translations of 25 out of 42 languages, thanks again to the community and professional annotators leveraging Argilla on the Hub.
I hope this will ensure a better understanding of the limitations and challenges for making open AI useful for many languages.
If you remember my work on MAMF - to find the realistic TFLOPS achievable ceiling - the Intel AI team has shared their measurements and they scored ...
an incredible 99.4% TFLOPS efficiency for Gaudi 2!
That's quite amazing! Your ROI on these accelerators will be very high.
As we have seen the competitors get their achievable efficiency worse with each new generation, I'm looking forward to see if Gaudi 3 will keep the high bar!
Thanks to Avi Rubin, Lakshman Chari, Imtiaz Sajwani, Ramy J and Zhiqi Tao for helping to get these numbers to the community.
reacted to AdinaY's
post with โค๏ธabout 2 months ago
What I mean here is that traditional LLMs are trained on tasks irrelevant to what they will do for the user. Itโs like training a plane to efficiently operate on the runway, but not to fly. In short, it is almost impossible to train an LLM, and evaluating is just as challenging. Then, training is not even necessary. In this article, I dive on all these topics.
โก๏ธ Training LLMs for the wrong tasks
Since the beginnings with Bert, training an LLM typically consists of predicting the next tokens in a sentence, or removing some tokens and then have your algorithm fill the blanks. You optimize the underlying deep neural networks to perform these supervised learning tasks as well as possible. Typically, it involves growing the list of tokens in the training set to billions or trillions, increasing the cost and time to train. However, recently, there is a tendency to work with smaller datasets, by distilling the input sources and token lists. After all, out of one trillion tokens, 99% are noise and do not contribute to improving the results for the end-user; they may even contribute to hallucinations. Keep in mind that human beings have a vocabulary of about 30,000 keywords, and that the number of potential standardized prompts on a specialized corpus (and thus the number of potential answers) is less than a million.
โก๏ธ Read the full articles at https://mltblog.com/3CEJ9Pt, also featuring issues with evaluation metrics and the benefits of untrained LLMs.
reacted to malhajar's
post with ๐ฅabout 2 months ago
๐ซ๐ท Lancement officiel de l'OpenLLM French Leaderboard : initiative open-source pour rรฉfรฉrencer lโรฉvaluation des LLMs francophones
Aprรจs beaucoup dโefforts et de sueurs avec Alexandre Lavallee, nous sommes ravis dโannoncer que le OpenLLMFrenchLeaderboard est en ligne sur Hugging Face (space url: le-leadboard/OpenLLMFrenchLeaderboard) la toute premiรจre plateforme dรฉdiรฉe ร lโรฉvaluation des grands modรจles de langage (LLM) en franรงais. ๐ซ๐ทโจ
Ce projet de longue haleine est avant tout une ลuvre de passion mais surtout une nรฉcessitรฉ absolue. Il devient urgent et vital d'oeuvrer ร plus de transparence dans ce domaine stratรฉgique des LLM dits multilingues. La premiรจre piรจce ร l'รฉdifice est donc la mise en place d'une รฉvaluation systรฉmatique et systรฉmique des modรจles actuels et futurs.
Votre modรจle IA franรงais est-il prรชt ร se dรฉmarquer ? Soumettez le dans notre espace, et voyez comment vous vous comparez par rapport aux autres modรจles.
โ Comment รงa marche : Soumettez votre LLM franรงais pour รฉvaluation, et nous le testerons sur des benchmarks de rรฉfรฉrence spรฉcifiquement adaptรฉs pour la langue franรงaise โ notre suite de benchmarks comprend :
Le processus est encore manuel, mais nous travaillons sur son automatisation, avec le soutien de la communautรฉ Hugging Face.
@clem , on se prรฉpare pour une mise ร niveau de lโespace ? ๐๐
Ce n'est pas qu'une question de chiffresโil s'agit de crรฉer une IA qui reflรจte vraiment notre langue, notre culture et nos valeurs. OpenLLMFrenchLeaderboard est notre contribution personnelle pour faรงonner l'avenir des LLM en France.