singhsidhukuldeep
posted an update Jul 25, 2024
Yet another post hailing how good Meta Llama 3.1 is? πŸ€” I guess not!

While Llama 3.1 is truly impressive, especially 405B (which gives GPT-4o a run for its money! πŸ’ͺ)

I was surprised to see that on the Open LLM Leaderboard, Llama 3.1 70B was not able to dethrone the current king Qwen2-72B! πŸ‘‘

Not only that, for a few benchmarks like MATH Lvl 5, it was completely lagging behind Qwen2-72B! πŸ“‰

Also, the benchmarks are completely off compared to the official numbers from Meta! 🀯

Based on the responses, I still believe Llama 3.1 will perform better than Qwen2 on LMSYS Chatbot Arena. πŸ€– But it still lags behind on too many benchmarks! πŸƒβ€β™‚οΈ

Open LLM Leaderboard: open-llm-leaderboard/open_llm_leaderboard 🌐

Hopefully, this is just an Open LLM Leaderboard error! @open-llm-leaderboard SOS! 🚨

Thanks @LeroyDyer for confirming this


You should also know that these models have already been trained on the test datasets; they game the system, so these benchmarks are very outdated now, at least until the benchmark datasets are changed!
Even my own model has been trained on these sets! That is how it beats all these models, just like NeuralBeagle etc. beats them, since it was merged with the top performers until it became the best!

So we can see the fakeness of it all here!

The real testing is done by YouTubers: they often use a set of questions that have not been trained on, and the models still underperform!

So the benchmarking system is misleading, because a guardrailed model cannot function properly when its inputs and outputs are being intercepted.
Questions that get flagged as having a political or sexual agenda (according to the morality of the guardrail) will BLOCK the model from functioning correctly.
It would need to be trained with the guardrails in place, along with full filtering of the training data as it comes in!

So only models that have been trained independently of these main model outlets are worth paying attention to!

Instead of looking at the board as an actual benchmark, look at it as a memorization test.

If you didn't memorize the answers to the exam, you'll do worse than someone who did.

Also, these benchmarks are not that useful for tackling real-world tasks.
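As a rough illustration of that memorization concern, one can run a simple word-level n-gram overlap check between a candidate training corpus and a benchmark's questions. The sketch below is a minimal, hypothetical example: the sample data and the 8-gram window are illustrative assumptions, not the leaderboard's actual decontamination procedure.

```python
# Minimal sketch: flag benchmark questions that share long word n-grams
# with a training corpus. High overlap hints at possible contamination.
# All data below is made up for illustration.

def ngrams(text, n=8):
    """Word-level n-grams of a lowercased text, as a set of tuples."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_fraction(train_docs, bench_questions, n=8):
    """Fraction of benchmark questions sharing at least one n-gram with the training docs."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    hits = sum(1 for q in bench_questions if ngrams(q, n) & train_grams)
    return hits / max(len(bench_questions), 1)

if __name__ == "__main__":
    train_docs = [
        "question: what is the capital of france answer: paris is the capital of france",
    ]
    bench_questions = [
        "question: what is the capital of france answer: paris is the capital of france",
        "solve for x in the equation two x plus three equals eleven",
    ]
    print(f"possibly contaminated: {contaminated_fraction(train_docs, bench_questions):.0%}")
```

Real decontamination pipelines use more robust matching (normalization, longer spans, fuzzy matching), but even this crude check captures the point that leaderboard scores can reflect memorization as much as ability.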

For me, in theory, Qwen is always inferior to Llama 3. Even if the two are not equal at all, the assessment above is not accurate.


Why do you say that? What task are you using it for?