hallucinations-leaderboard

community

https://www.neuralnoise.com

pminervini



Recent Activity


pingnieuk

authored a paper about 1 month ago

StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

Paper • 2505.20139 • Published May 26 • 18

pminervini

authored 2 papers about 1 month ago

MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

Paper • 2505.10610 • Published May 15 • 53

Neurosymbolic Diffusion Models

Paper • 2505.13138 • Published May 19 • 33

rohitsaxena

authored a paper about 1 month ago

What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations

Paper • 2502.08279 • Published Feb 12 • 1

yuzhaouoe

authored a paper about 1 month ago

MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

Paper • 2505.10610 • Published May 15 • 53

rohitsaxena

authored a paper about 1 month ago

MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

Paper • 2505.10610 • Published May 15 • 53

clefourrier

posted an update about 1 month ago


Always surprised that so few people actually read the FineTasks blog, on
✨how to select training evals with the highest signal✨

If you're serious about training models without wasting compute on shitty runs, you absolutely should read it!!

A high-signal eval tells you precisely, during training, how well your model is learning and what it is learning, allowing you to discard the bad runs, bad samplings, and so on!

The blog covers prompt choice, metrics, and datasets in depth, across languages and capabilities, and my fave section is "which properties should evals have"👌
(so you know how to select the best evals for your use case)

Blog: HuggingFaceFW/blogpost-fine-tasks


pingnieuk

authored a paper 3 months ago

ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations

Paper • 2504.00824 • Published Apr 1 • 43

aryopg

authored a paper 3 months ago

An Analysis of Decoding Methods for LLM-based Agents for Faithful Multi-Hop Question Answering

Paper • 2503.23415 • Published Mar 30 • 1

clefourrier

posted an update 4 months ago


Gemma3 family is out! Reading the tech report, and this section was really interesting to me from a methods/scientific fairness pov.

Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**.
(Which everybody does, but people usually don't say)

For a tech report, it makes a lot of sense to report model performance when used optimally!
On leaderboards, on the other hand, comparisons are apples to apples, but potentially in a setup that is suboptimal for a given model family (just as some users interact suboptimally with models).

The report also has a cool section (6) on training data memorization rate! It matters to know whether your model will output the training data it has seen verbatim: always an issue for privacy, copyright, and so on, but also very much for evaluation!

Because if your model knows its evals by heart, you're not testing for generalization.
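
To make the evaluation concern concrete, here is a minimal sketch of one common heuristic for spotting eval items that may have leaked into training data: flagging an item when it shares any word n-gram with the training corpus. This is an assumption-laden illustration, not the memorization measurement used in the Gemma 3 report; the function names, the whitespace tokenization, and the default n-gram size are all hypothetical choices.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercased whitespace-token n-grams of a text (empty if text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(eval_items: list[str], training_docs: list[str], n: int = 8) -> float:
    """Fraction of eval items sharing at least one n-gram with the training docs."""
    if not eval_items:
        return 0.0
    train_ngrams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    flagged = sum(1 for item in eval_items if ngrams(item, n) & train_ngrams)
    return flagged / len(eval_items)


# Toy example with made-up strings and a small n so the overlap is visible.
training_docs = ["the quick brown fox jumps over the lazy dog"]
eval_items = [
    "Quick brown fox jumps over",
    "an unrelated question about graphs",
]
rate = contamination_rate(eval_items, training_docs, n=3)
```

A nonzero rate does not prove the model memorized those items, and a zero rate does not rule out paraphrased leakage; it is only a cheap first filter before a closer look.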