Always surprised that so few people actually read the FineTasks blog on ✨how to select training evals with the highest signal✨
If you're serious about training models without wasting compute on shitty runs, you absolutely should read it!!
A high-signal eval actually tells you precisely, during training, how well & what your model is learning, letting you discard the bad runs/bad samplings/...!
The blog covers in depth prompt choice, metrics, and dataset selection across languages/capabilities, and my fave section is "which properties should evals have"👌 (to figure out, for your use case, how to select the best evals for you)
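To make that concrete, here's a toy Python sketch (mine, not from the blog — `eval_signal` and the `random_baseline` default are illustrative) of how you could score one eval's signal across training checkpoints, along the lines of the properties the blog discusses (monotonicity, low noise, above-random performance):

```python
import numpy as np
from scipy.stats import spearmanr

def eval_signal(steps, scores, random_baseline=0.25):
    """Toy heuristic for an eval's 'signal' across training checkpoints.

    steps:  training steps at which the eval was run
    scores: eval score at each checkpoint (same length as steps)
    Returns rough proxies for: monotonicity (does the score improve as
    training progresses?), noise (checkpoint-to-checkpoint jitter), and
    whether the final score is above the random baseline at all.
    """
    steps, scores = np.asarray(steps), np.asarray(scores)
    monotonicity, _ = spearmanr(steps, scores)   # +1 = strictly improving
    noise = np.std(np.diff(scores))              # jitter between checkpoints
    above_random = bool(scores[-1] > random_baseline)
    return {"monotonicity": monotonicity, "noise": noise,
            "above_random": above_random}

# Example: a noisy-but-improving eval vs. one stuck at random chance
good = eval_signal([1000, 2000, 3000, 4000], [0.26, 0.31, 0.38, 0.45])
flat = eval_signal([1000, 2000, 3000, 4000], [0.25, 0.24, 0.26, 0.25])
print(good)  # high monotonicity, above random -> worth tracking
print(flat)  # ~zero monotonicity, at random   -> low signal, drop it
```

An eval that scores well on checks like these is one you can actually trust mid-run; the blog goes much deeper than this toy version.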
Just explored #MedAgentBench from @Yale researchers and it's mind-blowing! They've created a cutting-edge benchmark that finally exposes the true capabilities of LLMs in complex medical reasoning.
⚡ Key discoveries:
- DeepSeek R1 & OpenAI O3 dominate clinical reasoning tasks
- Agent-based frameworks deliver exceptional performance-cost balance
- Open-source alternatives are closing the gap at a fraction of the cost
This work shatters previous benchmarks that failed to challenge today's advanced models. The future of medical AI is here: https://github.com/gersteinlab/medagents-benchmark #MedicalAI #MachineLearning #AIinHealthcare 🔥