Unveiling More Reliable Human Preferences in Code Generation via Execution
Compare two AI models by sending them coding tasks.