Qwen3-Coder-EKTO-30B

Qwen3-Coder-EKTO-30B is trained with EntroPO (Entropy-Enhanced Preference Optimization), a novel method designed to preserve solution diversity and significantly improve performance on complex software engineering problems. The base model is Qwen/Qwen3-Coder-30B-A3B-Instruct.

This model achieves state-of-the-art results among open-weight models on the SWE-bench leaderboard, demonstrating its effectiveness in solving real-world GitHub issues.

Model Description

LLM-powered software engineering agents often face a "diversity collapse" problem: when an agent generates multiple candidate solutions, the outputs tend to be too similar, limiting the chance that any one of them is correct. This is a common side effect of preference-optimization techniques such as DPO.

EntroPO was created to solve this. It is an entropy-enhanced preference optimization method that fine-tunes the model to preserve a diverse range of potential solutions. By learning from entire solution trajectories and explicitly rewarding policy entropy, EntroPO trains agents that are better at exploring the solution space and less likely to get stuck on a single, incorrect idea.

The key innovations are:

  1. Entropy-Enhanced Optimization: The training objective is modified to directly counteract diversity collapse by rewarding policy entropy, encouraging the agent to explore meaningfully different solution pathways (see the schematic sketch after this list).
  2. Multi-Turn Trajectory Optimization: Instead of evaluating only the final code, EntroPO learns from preferences over the entire sequence of actions an agent takes, teaching it to make better decisions at every step.
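
To make the first idea concrete, here is a minimal, schematic sketch of what an entropy-enhanced preference objective can look like: a standard DPO loss over trajectory log-probabilities plus a policy-entropy bonus. This is not the released EntroPO training code; the function shape and the `beta` / `entropy_coef` coefficients are illustrative assumptions.

```python
# Schematic sketch only -- NOT the authors' implementation. It combines a
# standard DPO preference loss with an entropy bonus, the general shape that
# "rewarding policy entropy" suggests. All names and coefficients are assumed.
import torch
import torch.nn.functional as F

def entropy_enhanced_preference_loss(
    logp_chosen,        # policy log-prob of the preferred trajectory, shape (B,)
    logp_rejected,      # policy log-prob of the dispreferred trajectory, (B,)
    ref_logp_chosen,    # reference-model log-probs for the same trajectories
    ref_logp_rejected,
    token_logits,       # policy logits over the vocabulary, (B, T, V)
    beta=0.1,           # assumed DPO temperature
    entropy_coef=0.01,  # assumed weight on the entropy bonus
):
    # Standard DPO: widen the chosen-vs-rejected margin relative to the
    # frozen reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    dpo_loss = -F.logsigmoid(margin).mean()

    # Mean per-token policy entropy over the trajectory.
    log_probs = F.log_softmax(token_logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()

    # Subtracting the entropy term rewards higher-entropy (more diverse)
    # policies, counteracting the diversity collapse described above.
    return dpo_loss - entropy_coef * entropy
```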

How to use

You can use this model with SGLang (recommended) or vLLM for fast inference.
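
As a minimal sketch, the model can also be loaded with vLLM's offline Python API. The repo id below is a placeholder for wherever this checkpoint is hosted, and `tensor_parallel_size` should match your hardware; the sampling values mirror the recommended hyper-parameters further down.

```python
# Minimal vLLM sketch. "<org>/Qwen3-Coder-EKTO-30B" is a placeholder repo id;
# substitute the actual Hugging Face path for this checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="<org>/Qwen3-Coder-EKTO-30B", tensor_parallel_size=2)

params = SamplingParams(
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    repetition_penalty=1.05,
    max_tokens=2048,
)

prompts = ["Write a Python function that parses a unified diff into hunks."]
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)
```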

Performance

The model's performance was evaluated on SWE-bench-Verified and SWE-bench-Lite. Note that all evaluations use the R2E scaffold with the maximum context length capped at 130k tokens due to compute limitations. The results for the original model may therefore differ from Qwen's officially reported results, which were evaluated on the OpenHands scaffold.

| Method | SWE-bench-Verified | SWE-bench-Lite |
| --- | --- | --- |
| original | 37.4% | 28.00% |
| SFT | 43.8% | 33.67% |
| SFT + EKTO | 51.6% | 44.67% |
| SFT + EKTO @bo16 | 59.8% | 49.33% |

Here @bo16 denotes best-of-16 sampling; a sketch of the selection loop follows below.
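
Diversity preservation pays off most under best-of-n sampling. The hypothetical sketch below illustrates the selection loop; `generate_candidate` and `score` are stand-ins for the actual patch generator and verifier (e.g. validating a candidate patch against the repository's tests), which are not specified here.

```python
# Hypothetical best-of-n loop illustrating the "@bo16" setting. The generator
# and scorer are stand-ins, not part of the released evaluation harness.
import random

def best_of_n(generate_candidate, score, n=16):
    """Sample n candidates and keep the highest-scoring one."""
    candidates = [generate_candidate() for _ in range(n)]
    return max(candidates, key=score)

# Toy usage with stand-in functions:
best = best_of_n(generate_candidate=random.random, score=lambda c: c)
print(best)
```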

Recommended Hyper-parameters

  • temperature: 0.7
  • top_p: 0.8
  • top_k: 20
  • repetition_penalty: 1.05
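
When serving through SGLang's or vLLM's OpenAI-compatible server, these values can be passed per request, as in the sketch below. It assumes a server already running at localhost:8000; the served model name is a placeholder, and top_k / repetition_penalty go through `extra_body` because the OpenAI client has no first-class fields for them.

```python
# Sketch of a request to an OpenAI-compatible endpoint (SGLang or vLLM) using
# the recommended sampling settings. base_url and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen3-Coder-EKTO-30B",
    messages=[{"role": "user", "content": "Explain this failing test and propose a fix."}],
    temperature=0.7,
    top_p=0.8,
    extra_body={"top_k": 20, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
```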

Intended Use and Limitations

This model is primarily intended for use in AI-powered software engineering agents. It excels at multi-step tasks that require reasoning and tool use to resolve real-world coding issues.

Limitations:

  • The model requires significant computational resources due to its size (30B parameters).
  • It is highly specialized for code-related tasks and may not perform as well on general-purpose NLP tasks like creative writing or summarization.
  • It is trained with the R2E scaffold and may not perform optimally with other scaffolds such as OpenHands or SWE-Agent.