Performance Discrepancy on SWE-bench-Verified with OpenHands

#1
by zengliangcs - opened

When using the latest version of OpenHands, I was unable to reproduce the reported 36.6% resolved rate on SWE-bench-Verified with SWE-Dev-32B. In my evaluation, the model only achieved a 25.2% resolved rate. Could this discrepancy be due to changes in the OpenHands version or its scaffold? I've attached the evaluation log below for reference:

```json
{
    "total_instances": 500,
    "submitted_instances": 500,
    "completed_instances": 488,
    "resolved_instances": 126,
    "unresolved_instances": 362,
    "empty_patch_instances": 7,
    "error_instances": 5
}
```
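For reference, the resolved rate follows directly from the log counts. A minimal sketch (the dictionary below just mirrors the log fields above; it is not OpenHands code):

```python
# Sanity-check the resolved rate implied by the evaluation log above.
log = {
    "total_instances": 500,
    "submitted_instances": 500,
    "completed_instances": 488,
    "resolved_instances": 126,
    "unresolved_instances": 362,
    "empty_patch_instances": 7,
    "error_instances": 5,
}

# Resolved rate is computed over all instances, not only completed ones.
resolved_rate = log["resolved_instances"] / log["total_instances"]
print(f"{resolved_rate:.1%}")  # 25.2%
```

This confirms 126/500 = 25.2%, matching the rate quoted above, well below the reported 36.6%.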
Z.ai & THUKEG org

Hi! Please refer to our blog. We use OpenHands v0.15.0, and I think the version matters :)
