Interesting stats

#25
by BBLL3456 - opened

It is interesting to see that a 30B model currently beats Llama-65B, and that Alpaca 13B performs worse than Llama 13B.

Keep in mind how close the numbers are, though. A difference of a few points still leaves the models roughly on par.

It's also funny to see 13B models doing so well compared to the 30B and 65B ones (also, oof, Galactica with the 120B), but there might be other benchmarks that could tell a fuller story.

Sure wish this leaderboard could handle 3- and 4-bit quantization, though. It seems like a glaring oversight considering the march of current tech.

Edit: also no RAM usage statistics, no tokens/second, etc. Honestly, I'm now not sure what this leaderboard is even meant to be comparing besides 'stuff that llama is good at'.
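For anyone who wants those numbers locally in the meantime, here's a minimal sketch of measuring tokens/second and peak GPU memory with transformers; the model ID is just an example, and it assumes a CUDA GPU:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "huggyllama/llama-7b"  # example model; swap in whatever you're testing
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto"
)

inputs = tok("The quick brown fox", return_tensors="pt").to(model.device)
torch.cuda.reset_peak_memory_stats()
start = time.time()
out = model.generate(**inputs, max_new_tokens=128)
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```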

Those are some great points; this has become so popular that surely an update must be on the cards. The stats you describe would be a great start.

> Sure wish this leaderboard could handle 3- and 4-bit quantization, though. It seems like a glaring oversight considering the march of current tech.

I just tried submitting some 4-bit quantized models, and they have been accepted for evaluation... so Hugging Face does listen to our grunts, haha.
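For reference, loading one of those 4-bit models locally looks roughly like this (a minimal sketch using transformers' bitsandbytes integration; the model ID is just a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4 bits on load
    bnb_4bit_quant_type="nf4",             # NormalFloat4, the QLoRA default
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the actual matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-13b",                # placeholder model ID
    quantization_config=bnb_config,
    device_map="auto",
)
```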

> Sure wish this leaderboard could handle 3- and 4-bit quantization, though. It seems like a glaring oversight considering the march of current tech.
>
> I just tried submitting some 4-bit quantized models, and they have been accepted for evaluation... so Hugging Face does listen to our grunts, haha.

It'll accept them, but that's why the leaderboard was stuck in the first place through day two: once it reached a 4-bit model, the queue went full derp.
No idea if they've fixed it.

It would be interesting to see SuperHot and Bluemoon, since they're among the few models featuring a larger context size. It would give us an idea of the impact of a larger context size on coherency, especially SuperHot, because SuperCot is already there to compare with. I feel like the stuff in the queue is more of the same.

> It would be interesting to see SuperHot and Bluemoon, since they're among the few models featuring a larger context size. It would give us an idea of the impact of a larger context size on coherency, especially SuperHot, because SuperCot is already there to compare with. I feel like the stuff in the queue is more of the same.

These have been submitted; looking forward to seeing their performance. I am also curious about Guanaco: so far, the results I have been getting from Guanaco 30B are the best (though I haven't tried Falcon yet).
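If you want to check what context size a checkpoint actually advertises before the results land, something like this works (a rough sketch; the config field name varies by architecture, and the model IDs are just examples):

```python
from transformers import AutoConfig

# Example IDs only; substitute the SuperHot / Bluemoon checkpoints you care about.
for name in ["huggyllama/llama-13b", "huggyllama/llama-30b"]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, getattr(cfg, "max_position_embeddings", "n/a"))
```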

Take this leaderboard with a grain of salt. Somehow my 19M-parameter OPT chatsalad, finetuned on a single 20MB corpus, has beaten half a dozen other models.

> Take this leaderboard with a grain of salt. Somehow my 19M-parameter OPT chatsalad, finetuned on a single 20MB corpus, has beaten half a dozen other models.

Is it published on HF?
Could you send a link?

> It is interesting to see that a 30B model currently beats Llama-65B, and that Alpaca 13B performs worse than Llama 13B.

Falcon isn't nearly as good as Llama 65B, in my opinion.

It's in my profile

Open LLM Leaderboard org

Hi! All the scores have been updated with the correct MMLU results, following the discussions here!
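For anyone who wants to double-check a number: the leaderboard's MMLU figures come from EleutherAI's lm-evaluation-harness run 5-shot, and reproducing one locally looks roughly like this (a sketch only; task and model-wrapper names vary across harness versions, and the model ID is just an example):

```python
# Rough sketch of reproducing a 5-shot MMLU score with EleutherAI's
# lm-evaluation-harness; exact task/model names vary by harness version.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                             # HF causal-LM wrapper (older harness naming)
    model_args="pretrained=huggyllama/llama-13b",  # example model
    tasks=["hendrycksTest-abstract_algebra"],      # one MMLU subtask; the full suite is hendrycksTest-*
    num_fewshot=5,                                 # the leaderboard's MMLU setting
)
print(results["results"])
```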

clefourrier changed discussion status to closed
