AIM-Harvard

university

https://aim.hms.harvard.edu

dbittermanmd

AI & ML interests

Artificial Intelligence in Medicine (AIM) Program (NLP group/Bitterman lab: https://www.bittermanlab.org/)

Recent Activity

shanchen authored a paper about 17 hours ago

MedBrowseComp: Benchmarking Medical Deep Research and Computer Use

shanchen authored a paper about 17 hours ago

Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models

shanchen authored a paper about 17 hours ago

When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy

AIM-Harvard's activity

shanchen

authored 3 papers about 17 hours ago

MedBrowseComp: Benchmarking Medical Deep Research and Computer Use

Paper • 2505.14963 • Published 16 days ago • 1

Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models

Paper • 2505.13774 • Published 17 days ago • 1

When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy

Paper • 2505.22888 • Published 8 days ago • 6

shanchen

updated 2 datasets 14 days ago

AIM-Harvard/MedBrowseComp_Meta

Viewer • Updated 14 days ago • 442 • 69

AIM-Harvard/MedBrowseComp

Viewer • Updated 14 days ago • 1.14k • 470 • 4

clefourrier

posted an update 18 days ago

Post

595

Always surprised that so few people actually read the FineTasks blog, on
✨how to select training evals with the highest signal✨

If you're serious about training models without wasting compute on shitty runs, you absolutely should read it!!

A high-signal eval actually tells you precisely, during training, how well & what your model is learning, allowing you to discard the bad runs/bad samplings/...!

The blog covers prompt choice, metrics, and datasets in depth, across languages/capabilities, and my fave section is "which properties should evals have"👌
(so you know how to select the best evals for your use case)

Blog: HuggingFaceFW/blogpost-fine-tasks

shanchen

updated a collection 23 days ago

MedBrowseComp

Collection

MedBrowseComp: Benchmarking Medical Deep Research and Computer Use • 3 items • Updated 23 days ago

shanchen

published a dataset 23 days ago

AIM-Harvard/MedBrowseComp

Viewer • Updated 14 days ago • 1.14k • 470 • 4

shanchen

updated a dataset 23 days ago

AIM-Harvard/MedBrowseComp_CUA

Viewer • Updated 23 days ago • 484 • 32

shanchen

published a dataset 23 days ago

AIM-Harvard/MedBrowseComp_CUA

Viewer • Updated 23 days ago • 484 • 32

oskarvanderwal

authored 4 papers about 2 months ago

Inseq: An Interpretability Toolkit for Sequence Generation Models

Paper • 2302.13942 • Published Feb 27, 2023 • 1

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Paper • 2304.01373 • Published Apr 3, 2023 • 9

Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model

Paper • 2310.12611 • Published Oct 19, 2023

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Paper • 2211.05100 • Published Nov 9, 2022 • 32

clefourrier

posted an update 3 months ago

Post

2485

Gemma3 family is out! Reading the tech report, and this section was really interesting to me from a methods/scientific fairness pov.

Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**.
(Which everybody does, but people usually don't say)

For a tech report, it makes a lot of sense to report model performance when used optimally!
On leaderboards, on the other hand, comparisons will be apples to apples, but in a potentially suboptimal way for a given model family (just as some users interact suboptimally with models)

It also contains a cool section (6) on training data memorization rate! Important to see if your model will output the training data it has seen as such: always an issue for privacy/copyright/... but also very much for evaluation!

Because if your model knows its evals by heart, you're not testing for generalization.
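To make the memorization point concrete, here is a minimal sketch of one common way to look for eval contamination: flag eval items that share a long word n-gram with a sample of training text. This is not the method from the Gemma3 report; the function names (`ngrams`, `contamination_rate`), the whitespace tokenization, and the default `n=8` are all assumptions chosen for illustration.

```python
def ngrams(text, n):
    """Return the set of word n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(eval_items, training_text, n=8):
    """Hypothetical helper: fraction of eval items sharing any n-gram with the training text.

    Returns (rate, flagged_items). Exact n-gram overlap is a crude proxy and
    misses paraphrased or reformatted copies of an eval item.
    """
    if not eval_items:
        return 0.0, []
    train_grams = ngrams(training_text, n)
    flagged = [item for item in eval_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(eval_items), flagged


if __name__ == "__main__":
    # Toy data, made up for this example.
    training_text = "the quick brown fox jumps over the lazy dog near the river bank today"
    eval_items = [
        "the quick brown fox jumps over the lazy dog",
        "what is the capital of France and why is it famous",
    ]
    rate, flagged = contamination_rate(eval_items, training_text, n=5)
    print(rate, flagged)
```

A high rate on a real eval would suggest the benchmark is measuring recall of seen text rather than generalization; a low rate does not prove the eval is clean, since this check only catches verbatim overlap.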