Open RL Leaderboard

AI & ML interests

None defined yet.

Recent Activity

open-rl-leaderboard's activity

clefourrier 
posted an update 18 days ago
Always surprised that so few people actually read the FineTasks blog, on
✨how to select training evals with the highest signal✨

If you're serious about training models without wasting compute on shitty runs, you absolutely should read it!!

A high-signal eval tells you precisely, during training, how well and what your model is learning, allowing you to discard the bad runs/bad samplings/...!

The blog covers prompt choice, metrics, and datasets in depth, across languages/capabilities, and my fave section is "which properties should evals have"👌
(to know on your use case how to select the best evals for you)

Blog: HuggingFaceFW/blogpost-fine-tasks
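
For illustration, here's a toy sketch in that spirit (my own example, not code from the blog): one property a high-signal training eval should have is monotonicity, i.e. scores rising steadily across checkpoints rather than bouncing around.

```python
# Toy monotonicity check for a training eval: rank-correlate checkpoint
# steps with eval scores. Illustrative sketch only, not the blog's code.

def rank(values):
    """Ranks of values, assuming no ties (fine for this toy check)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def spearman(x, y):
    """Spearman rank correlation; with no ties, var(rank(x)) == var(rank(y))."""
    rx, ry = rank(x), rank(y)
    mean = (len(x) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

steps = [1000, 2000, 3000, 4000, 5000]   # made-up checkpoints
scores = [0.31, 0.34, 0.33, 0.38, 0.41]  # made-up eval scores per checkpoint
print(spearman(steps, scores))  # close to 1.0 => monotonic => high signal
```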
Aurelien-Morgan 
posted an update about 1 month ago
The Almighty function-caller

How would you like to build smart GenAI infrastructure?
Give your edge agentic system an extensive tools memory,
and optimize the resources it takes to run a high-performance set of agents?

We came up with a novel approach to function-calling at scale for smart companies and corporate-grade use cases.

Read our full-fledged blog article on this here on Hugging Face:
https://huggingface.co/blog/Aurelien-Morgan/the-almighty-function-caller
Aurelien-Morgan 
posted an update about 1 month ago
retrain-pipelines 0.1.2 finally dropped. It comes with a hot Hugging Face Hub integration. Go check it out. We have 2 articles about it coming up; one is already fully written, so be on the lookout!
@retrain-pipelines

Also, I'll be volunteering at GOSIM AI Paris 2025. If you're interested in chatting, hmu.
clefourrier 
posted an update 3 months ago
The Gemma3 family is out! Reading the tech report, this section was really interesting to me from a methods/scientific fairness pov.

Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**.
(Which everybody does, but people usually don't say)

For a tech report, it makes a lot of sense to report model performance when used optimally!
On leaderboards, on the other hand, comparisons will be apples to apples, but potentially in a suboptimal way for a given model family (just as some users interact sub-optimally with models).

It also contains a cool section (6) on training data memorization rates! It's important to see whether your model will output training data it has seen verbatim: always an issue for privacy/copyright/... but also very much for evaluation!

Because if your model knows its evals by heart, you're not testing for generalization.
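
To make that last point concrete, here's a minimal sketch of the kind of contamination check it implies (my own toy example using word-level n-gram overlap, not the report's methodology):

```python
# Toy contamination check: flag eval items that share an n-gram with the
# training text. Illustrative only; not the report's memorization analysis.

def ngrams(text, n=8):
    """Set of word-level n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(eval_items, train_texts, n=8):
    """Fraction of eval items sharing at least one n-gram with training data."""
    train_ngrams = set()
    for text in train_texts:
        train_ngrams |= ngrams(text, n)
    flagged = sum(1 for item in eval_items if ngrams(item, n) & train_ngrams)
    return flagged / len(eval_items)

train = ["the quick brown fox jumps over the lazy dog near the river bank"]
evals = ["quick brown fox jumps over the lazy dog near the river", "what is 2+2?"]
print(contamination_rate(evals, train))  # 0.5: the first item is contaminated
```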
Aurelien-Morgan 
posted an update 7 months ago
I just shipped retrain-pipelines 0.1.1 today. The doc is also pimped compared to the previous release; it was clearly not mature then.
I'll have to focus on another project for the next couple of weeks, but feel free to open issues on the GitHub repo and discuss any interest you'd have there (please?)!
In the meantime, you may enjoy retrying this:
https://huggingface.co/blog/Aurelien-Morgan/stateful-metaflow-on-colab
clefourrier 
posted an update about 1 year ago
In basic chatbots, errors are annoyances. In medical LLMs, errors can have life-threatening consequences 🩸

It's therefore vital to benchmark/follow advances in medical LLMs before even thinking about deployment.

This is why a small research team introduced a medical LLM leaderboard, to get reproducible and comparable results between LLMs, and allow everyone to follow advances in the field.

openlifescienceai/open_medical_llm_leaderboard

Congrats to @aaditya and @pminervini !
Learn more in the blog: https://huggingface.co/blog/leaderboard-medicalllm
clefourrier 
posted an update about 1 year ago
Contamination-free code evaluations with LiveCodeBench! 🖥️

LiveCodeBench is a new leaderboard, which contains:
- complete code evaluations (on code generation, self repair, code execution, tests)
- my favorite feature: problem selection by publication date 📅

This feature means you can get model scores averaged only over new problems that fall outside the training data. This means... contamination-free code evals! 🚀
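
Conceptually, the selection works like the sketch below (field names and numbers are mine for illustration, not the leaderboard's actual schema):

```python
# Hypothetical sketch of date-based problem selection; fields and scores
# are illustrative, not LiveCodeBench's actual API.
from datetime import date

problems = [
    {"id": "p1", "released": date(2023, 5, 1), "pass1": 0.80},  # likely seen
    {"id": "p2", "released": date(2024, 2, 10), "pass1": 0.35},
    {"id": "p3", "released": date(2024, 6, 3), "pass1": 0.30},
]

def score_after_cutoff(problems, cutoff):
    """Average pass@1 over problems published after the training cutoff."""
    fresh = [p["pass1"] for p in problems if p["released"] > cutoff]
    return sum(fresh) / len(fresh) if fresh else float("nan")

# A model with a 2023-12-31 training cutoff is scored only on p2 and p3:
print(score_after_cutoff(problems, date(2023, 12, 31)))  # 0.325
```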

Check it out!

Blog: https://huggingface.co/blog/leaderboard-livecodebench
Leaderboard: livecodebench/leaderboard

Congrats to @StringChaos @minimario @xu3kev @kingh0730 and @FanjiaYan for the super cool leaderboard!