Perry the Platypus PRO

AgPerry

AI & ML interests

None yet

Recent Activity

upvoted 2 papers 2 days ago

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

Paper • 2605.26340 • Published 21 days ago • 36

Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback

Paper • 2606.06113 • Published 11 days ago • 14

updated a dataset 4 days ago

TIGER-Lab/ClawBench

Viewer • Updated 4 days ago • 283 • 489

upvoted a paper 11 days ago

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

Paper • 2605.30288 • Published 17 days ago • 22

updated a Space 21 days ago

ClawBench Leaderboard

Can AI agents complete everyday online tasks?

updated 4 datasets 21 days ago

TIGER-Lab/ClawBenchV2Trace

Updated 21 days ago • 9.17k

NAIL-Group/ClawBenchV2Trace

Updated 21 days ago • 4.07k

NAIL-Group/ClawBenchV1Trace

Updated 21 days ago • 7.18k

NAIL-Group/ClawBench

Viewer • Updated 21 days ago • 153 • 226 • 2

commented a paper 29 days ago

RewardHarness: Self-Evolving Agentic Post-Training

Paper • 2605.08703 • Published May 9 • 10

upvoted a paper 29 days ago

RewardHarness: Self-Evolving Agentic Post-Training

Paper • 2605.08703 • Published May 9 • 10

New activity in huggingface/HuggingDiscussions about 1 month ago

[FEEDBACK] Daily Papers

#32 opened about 2 years ago by

submitted a paper to Daily Papers about 1 month ago

RewardHarness: Self-Evolving Agentic Post-Training

Paper • 2605.08703 • Published May 9 • 10

updated a collection about 1 month ago

ClawBench

Benchmark dataset (V1+V2), live leaderboard Space, and full V1 execution traces — everything you need to run, regrade, or compare on ClawBench. • 5 items • Updated May 12

published a Space about 1 month ago

ClawBench Leaderboard

Can AI agents complete everyday online tasks?

updated a collection about 1 month ago

ClawBench

Benchmark dataset (V1+V2), live leaderboard Space, and full V1 execution traces — everything you need to run, regrade, or compare on ClawBench. • 5 items • Updated May 12

updated a Space about 1 month ago

ClawBench Leaderboard

Live leaderboard for the ClawBench web-agent benchmark

updated 2 collections about 1 month ago

ClawBench

Benchmark dataset (V1+V2), live leaderboard Space, and full V1 execution traces — everything you need to run, regrade, or compare on ClawBench. • 5 items • Updated May 12

ClawBench — Browser Agent Benchmark Suite

Benchmark dataset (V1+V2), live leaderboard Space, and full V1 execution traces — everything you need to run, regrade, or compare on ClawBench. • 5 items • Updated May 12 • 1

published a dataset about 1 month ago

TIGER-Lab/ClawBenchV2Trace

Updated 21 days ago • 9.17k