---
license: mit
base_model:
- Qwen/Qwen3-8B
---

## Introduction

We adapted the official speculative-sampling training method, EAGLE3, to train a draft model for Qwen3-8B. With EAGLE3 enabled, the inference throughput of Qwen3-8B served with the SGLang framework on a single H200 GPU improved from 187 tokens/s to 365 tokens/s, an increase of nearly 100%. On a single RTX 5090 the gain was even larger: throughput increased from 90 tokens/s to 220 tokens/s, an improvement of nearly 140%.

| Model           | GPU      | TPS (tokens/s) |
|-----------------|----------|----------------|
| Qwen3-8B        | RTX 5090 | 90             |
| Qwen3-8B-Eagle3 | RTX 5090 | 220            |
| Qwen3-8B        | H200     | 187            |
| Qwen3-8B-Eagle3 | H200     | 365            |

## How to use

The launch command for serving Qwen3-8B with EAGLE3 speculative decoding in SGLang is:

```bash
python3 -m sglang.launch_server \
  --model Qwen/Qwen3-8B \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path Tengyunw/qwen3_8b_eagle3 \
  --speculative-num-steps 6 \
  --speculative-eagle-topk 10 \
  --speculative-num-draft-tokens 32 \
  --mem-fraction 0.9 \
  --cuda-graph-max-bs 2 \
  --dtype bfloat16
```

A minimal client example for querying the running server is sketched at the end of this card.

## How to train

Training dataset: ultrachat_200k. Only the prompts from this dataset were used for data synthesis, and the synthesized data was then used to train the EAGLE3 draft module (a hypothetical synthesis sketch follows below).

Dataset size: 600K samples, about 1B tokens.

Evaluation datasets: ShareGPT, GSM8K, HumanEval, MT-Bench, Alpaca.

Our ShareGPT test data is located in the eagle_data.jsonl file under this directory.
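To make the data-synthesis step concrete, here is a minimal, hypothetical sketch, not the authors' actual pipeline: it takes prompts from ultrachat_200k, generates responses with the target Qwen3-8B model, and writes prompt/response pairs to a JSONL file for draft-model training. The dataset ID, the output filename, the 1,000-sample slice, and the generation settings are illustrative assumptions.

```python
# Hypothetical data-synthesis sketch: regenerate responses for ultrachat_200k
# prompts with the target model, then save prompt/response pairs as JSONL.
import json

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype="auto", device_map="auto"
)

# Assumed dataset ID; only the prompt field is used.
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

with open("eagle3_train_data.jsonl", "w") as f:
    for example in dataset.select(range(1000)):  # small slice for illustration
        prompt = example["prompt"]
        input_ids = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(model.device)
        output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=False)
        # Keep only the newly generated tokens as the synthesized response.
        response = tokenizer.decode(
            output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True
        )
        f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
```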
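As a usage example for the server launched in the "How to use" section, the sketch below sends a chat request through SGLang's OpenAI-compatible API. It assumes the server is running locally on SGLang's default port 30000; speculative decoding is transparent to the client, so no EAGLE3-specific options are needed on this side.

```python
# Minimal client sketch; assumes the SGLang server above is listening on
# http://localhost:30000 and exposes its OpenAI-compatible API under /v1.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Explain speculative decoding in one paragraph."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```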