Performance Discrepancy on SWE-bench-Verified with OpenHands
#1 opened by zengliangcs
When using the latest version of OpenHands, I was unable to reproduce the reported 36.6% resolved rate on SWE-bench-Verified with SWE-Dev-32B; in my evaluation, the model achieved only a 25.2% resolved rate. Could this discrepancy be due to changes in the OpenHands version or its scaffold? I've attached the evaluation log below for reference:
"total_instances": 500,
"submitted_instances": 500,
"completed_instances": 488,
"resolved_instances": 126,
"unresolved_instances": 362,
"empty_patch_instances": 7,
"error_instances": 5,