Open RL Leaderboard

AI & ML interests

None defined yet.

Recent Activity

open-rl-leaderboard's activity

clefourrier 
posted an update 18 days ago
Always surprised that so few people actually read the FineTasks blog, on
✨how to select training evals with the highest signal✨

If you're serious about training models without wasting compute on shitty runs, you absolutely should read it!!

A high-signal eval tells you precisely, during training, how well and what your model is learning, allowing you to discard the bad runs/bad samplings/...!

The blog covers prompt choice, metrics, and datasets in depth, across languages/capabilities, and my fave section is "which properties should evals have"👌
(to know on your use case how to select the best evals for you)

Blog: HuggingFaceFW/blogpost-fine-tasks
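
For illustration, here's a toy sketch in that spirit (my own example, not code from the blog): one property a high-signal training eval should have is monotonicity, i.e. scores rising steadily across checkpoints rather than bouncing around.

```python
# Toy monotonicity check for a training eval: rank-correlate checkpoint
# steps with eval scores. Illustrative sketch only, not the blog's code.

def rank(values):
    """Ranks of values, assuming no ties (fine for this toy check)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def spearman(x, y):
    """Spearman rank correlation; with no ties, var(rank(x)) == var(rank(y))."""
    rx, ry = rank(x), rank(y)
    mean = (len(x) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

steps = [1000, 2000, 3000, 4000, 5000]   # made-up checkpoints
scores = [0.31, 0.34, 0.33, 0.38, 0.41]  # made-up eval scores per checkpoint
print(spearman(steps, scores))  # close to 1.0 => monotonic => high signal
```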
Aurelien-Morgan 
posted an update about 1 month ago
The Almighty function-caller

How would you like to build smart GenAI infrastructure?
Give your edge agentic system an extensive tools memory,
and optimize the resources it takes to run a high-performance set of agents?

We came up with a novel approach to function-calling at scale for smart companies and corporate-grade use cases.

Read our full-fledged blog article on this here on Hugging Face:
https://huggingface.co/blog/Aurelien-Morgan/the-almighty-function-caller
Aurelien-Morgan 
posted an update about 1 month ago
retrain-pipelines 0.1.2 finally dropped. It comes with a hot Hugging Face Hub integration. Go check it out. We have 2 articles about it coming up; one is already fully written, so be on the lookout!
@retrain-pipelines

Also, I'll be volunteering at GOSIM AI Paris 2025. If you're interested in chatting, hmu.
clefourrier 
posted an update 3 months ago
The Gemma3 family is out! Reading the tech report, this section was really interesting to me from a methods/scientific fairness pov.

Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**.
(Which everybody does, but people usually don't say)

For a tech report, it makes a lot of sense to report model performance when used optimally!
On leaderboards, on the other hand, comparisons will be apples to apples, but potentially in a suboptimal way for a given model family (just as some users interact sub-optimally with models).

It also contains a cool section (6) on training data memorization rates! It's important to see whether your model will output training data it has seen verbatim: always an issue for privacy/copyright/... but also very much for evaluation!

Because if your model knows its evals by heart, you're not testing for generalization.
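
To make that last point concrete, here's a minimal sketch of the kind of contamination check it implies (my own toy example using word-level n-gram overlap, not the report's methodology):

```python
# Toy contamination check: flag eval items that share an n-gram with the
# training text. Illustrative only; not the report's memorization analysis.

def ngrams(text, n=8):
    """Set of word-level n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(eval_items, train_texts, n=8):
    """Fraction of eval items sharing at least one n-gram with training data."""
    train_ngrams = set()
    for text in train_texts:
        train_ngrams |= ngrams(text, n)
    flagged = sum(1 for item in eval_items if ngrams(item, n) & train_ngrams)
    return flagged / len(eval_items)

train = ["the quick brown fox jumps over the lazy dog near the river bank"]
evals = ["quick brown fox jumps over the lazy dog near the river", "what is 2+2?"]
print(contamination_rate(evals, train))  # 0.5: the first item is contaminated
```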
Aurelien-Morgan 
posted an update 7 months ago
I just shipped retrain-pipelines 0.1.1 today. The doc is also pimped compared to the previous release; it was clearly not mature then.
I'll have to focus on another project for the next couple of weeks, but feel free to open issues on the GitHub repo and discuss any interest you'd have there (please?)!
In the meantime, you may enjoy retrying this:
https://huggingface.co/blog/Aurelien-Morgan/stateful-metaflow-on-colab
clefourrier 
posted an update about 1 year ago
In basic chatbots, errors are annoyances. In medical LLMs, errors can have life-threatening consequences 🩸

It's therefore vital to benchmark/follow advances in medical LLMs before even thinking about deployment.

This is why a small research team introduced a medical LLM leaderboard, to get reproducible and comparable results between LLMs, and allow everyone to follow advances in the field.

openlifescienceai/open_medical_llm_leaderboard

Congrats to @aaditya and @pminervini !
Learn more in the blog: https://huggingface.co/blog/leaderboard-medicalllm
clefourrier 
posted an update about 1 year ago
Contamination-free code evaluations with LiveCodeBench! 🖥️

LiveCodeBench is a new leaderboard, which contains:
- complete code evaluations (on code generation, self repair, code execution, tests)
- my favorite feature: problem selection by publication date 📅

This feature means you can get model scores averaged only over new problems that fall outside the training data. This means... contamination-free code evals! 🚀
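
Conceptually, the selection works like the sketch below (field names and numbers are mine for illustration, not the leaderboard's actual schema):

```python
# Hypothetical sketch of date-based problem selection; fields and scores
# are illustrative, not LiveCodeBench's actual API.
from datetime import date

problems = [
    {"id": "p1", "released": date(2023, 5, 1), "pass1": 0.80},  # likely seen
    {"id": "p2", "released": date(2024, 2, 10), "pass1": 0.35},
    {"id": "p3", "released": date(2024, 6, 3), "pass1": 0.30},
]

def score_after_cutoff(problems, cutoff):
    """Average pass@1 over problems published after the training cutoff."""
    fresh = [p["pass1"] for p in problems if p["released"] > cutoff]
    return sum(fresh) / len(fresh) if fresh else float("nan")

# A model with a 2023-12-31 training cutoff is scored only on p2 and p3:
print(score_after_cutoff(problems, date(2023, 12, 31)))  # 0.325
```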

Check it out!

Blog: https://huggingface.co/blog/leaderboard-livecodebench
Leaderboard: livecodebench/leaderboard

Congrats to @StringChaos @minimario @xu3kev @kingh0730 and @FanjiaYan for the super cool leaderboard!