Performance Discrepancy on SWE-bench-Verified with OpenHands

#1
by zengliangcs - opened

When using the latest version of OpenHands, I was unable to reproduce the reported 36.6% resolved rate on SWE-bench-Verified with SWE-Dev-32B. In my evaluation, the model only achieved a 25.2% resolved rate. Could this discrepancy be due to changes in the OpenHands version or its scaffold? I've attached the evaluation log below for reference:

```json
{
    "total_instances": 500,
    "submitted_instances": 500,
    "completed_instances": 488,
    "resolved_instances": 126,
    "unresolved_instances": 362,
    "empty_patch_instances": 7,
    "error_instances": 5
}
```
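For reference, the resolved rate follows directly from the log counts. A minimal sketch (the dictionary below just mirrors the log fields above; it is not OpenHands code):

```python
# Sanity-check the resolved rate implied by the evaluation log above.
log = {
    "total_instances": 500,
    "submitted_instances": 500,
    "completed_instances": 488,
    "resolved_instances": 126,
    "unresolved_instances": 362,
    "empty_patch_instances": 7,
    "error_instances": 5,
}

# Resolved rate is computed over all instances, not only completed ones.
resolved_rate = log["resolved_instances"] / log["total_instances"]
print(f"{resolved_rate:.1%}")  # 25.2%
```

This confirms 126/500 = 25.2%, matching the rate quoted above, well below the reported 36.6%.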
Z.ai & THUKEG org

Hi! Please refer to our blog. We use OpenHands v0.15.0, and I think the version matters :)
