Ricardo Fernandez Gasca

ricfergas

AI & ML interests

None yet

Recent Activity

upvoted a paper about 1 month ago

Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers

liked a Space 5 months ago

afrideva/Janus-Pro-1b

upvoted an article 5 months ago

We now support VLMs in smolagents!

ricfergas's activity

upvoted a paper about 1 month ago

Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers

Paper • 2504.20752 • Published Apr 29 • 91

liked a Space 5 months ago

Janus Pro 1b

🌍

A unified multimodal understanding and generation model.

upvoted an article 5 months ago

We now support VLMs in smolagents!

Article • and 2 others • Jan 24 • 103

liked a model 5 months ago

deepseek-ai/DeepSeek-V3

Text Generation • Updated Mar 27 • 2.4M • 3.88k

upvoted a paper 10 months ago

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Paper • 2406.12624 • Published Jun 18, 2024 • 38

reacted to clefourrier's post with 👍 about 1 year ago

Post

2383

Fun fact about evaluation!

Did you know that if you evaluate the same model, with the same prompt formatting & the same fixed few-shot examples, only changing
♻️ the order in which the few-shot examples are added to the prompt ♻️
you get a difference of up to 3 points in evaluation score?

I did a small experiment using some MMLU subsets on the best performing 7B and lower pretrained models from the leaderboard.

I tried 8 different prompting methods (containing more or less information, such as just the question, or Question: question, or Question: question Choices: ..., see the x axis) that are commonly used in evaluation.
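As a rough illustration of what "more or less information" can mean, here is a minimal sketch of the three formats named above. The function names, the `PROMPT_FORMATS` mapping, and the lettering of the choices are assumptions made for this example; the post does not say how the choices were rendered or what the other five formats looked like.

```python
# Hypothetical formatters for the three prompt styles named in the post.
# The exact rendering of choices (letters, separators) is an assumption.

def format_bare(question, choices):
    # Just the question, with no label and no choices.
    return question


def format_labeled(question, choices):
    # "Question: question"
    return f"Question: {question}"


def format_with_choices(question, choices):
    # "Question: question Choices: ..."
    letters = "ABCDEFGH"
    listed = " ".join(
        f"{letter}. {choice}" for letter, choice in zip(letters, choices)
    )
    return f"Question: {question} Choices: {listed}"


PROMPT_FORMATS = {
    "bare": format_bare,
    "labeled": format_labeled,
    "with_choices": format_with_choices,
}

if __name__ == "__main__":
    for name, formatter in PROMPT_FORMATS.items():
        print(name, "->", formatter("What is 2 + 2?", ["3", "4", "5", "6"]))
```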

I then compared the results for all these methods, in 5-shot, across 2 runs. The *only difference* between the first and second run is that the samples used in few-shot are not introduced in the same order.
For example, run one would have been "A B C D E Current sample", vs, in run 2, "D C E A B Current sample".
All the other experiment parameters stayed exactly the same.
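The two-run setup can be sketched as follows. This is only an illustration of the idea, not the script used for the experiment: `build_prompt`, the blank-line separator, and the placeholder strings are assumptions, and the second ordering is the one given in the example above.

```python
# Minimal sketch: the same five few-shot examples, presented in two orders.
# build_prompt and the "\n\n" separator are hypothetical choices.

def build_prompt(few_shot_examples, current_sample, order):
    # Reorder the fixed few-shot examples, then append the sample to evaluate.
    ordered = [few_shot_examples[i] for i in order]
    return "\n\n".join(ordered + [current_sample])


examples = ["A", "B", "C", "D", "E"]  # placeholders for formatted examples

run_one_order = [0, 1, 2, 3, 4]  # A B C D E
run_two_order = [3, 2, 4, 0, 1]  # D C E A B

prompt_run_one = build_prompt(examples, "Current sample", run_one_order)
prompt_run_two = build_prompt(examples, "Current sample", run_two_order)

# Both prompts contain exactly the same examples; only their order differs.
assert sorted(run_one_order) == sorted(run_two_order)

if __name__ == "__main__":
    print(prompt_run_one.replace("\n\n", " "))
    print(prompt_run_two.replace("\n\n", " "))
```

Scoring the model on both prompts, with every other setting fixed, isolates the effect of example order.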

As you can see in the attached picture, you get a difference of up to 3 points between the 2 orderings of the few-shot samples.

So, when just changing *the order of the few-shot samples* can change your results by several points, what is the impact of all the other "minimal" and unreported prompting changes?

-> Any kind of model score, provided without an evaluation script for reproducibility, is basically bullshit (or comms).
-> This is why we need reproducible evaluation in a fair and identical setup, using evaluation suites such as lm_eval from the Harness, lighteval from HF, or the Open LLM Leaderboard.

4 replies

liked a dataset about 1 year ago

Posos/MedNERF

Viewer • Updated Jun 7, 2023 • 100 • 23 • 4

upvoted a paper about 1 year ago

Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

Paper • 2403.12943 • Published Mar 19, 2024 • 15

liked a Space over 1 year ago

10.4k

AI Comic Factory

👩

Create your own AI comic with a single prompt