---
license: mit
base_model:
- Qwen/Qwen3-8B
---

## Introduction

We adapted the official speculative-sampling training method, EAGLE3, to train a draft model for Qwen3-8B. With EAGLE3 enabled, the inference throughput of Qwen3-8B served with the SGLang framework on a single H200 GPU improved from 187 tokens/s to 365 tokens/s, an increase of nearly 100%. On a single RTX 5090 the gain was even larger: throughput increased from 90 tokens/s to 220 tokens/s, an improvement of nearly 140%.

| Model           | GPU      | TPS (tokens/s) |
|-----------------|----------|----------------|
| Qwen3-8B        | RTX 5090 | 90             |
| Qwen3-8B-Eagle3 | RTX 5090 | 220            |
| Qwen3-8B        | H200     | 187            |
| Qwen3-8B-Eagle3 | H200     | 365            |

## How to use

The launch command for serving Qwen3-8B with EAGLE3 speculative decoding in SGLang is:

```bash
python3 -m sglang.launch_server \
  --model Qwen/Qwen3-8B \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path Tengyunw/qwen3_8b_eagle3 \
  --speculative-num-steps 6 \
  --speculative-eagle-topk 10 \
  --speculative-num-draft-tokens 32 \
  --mem-fraction 0.9 \
  --cuda-graph-max-bs 2 \
  --dtype bfloat16
```

A minimal client example for querying the running server is sketched at the end of this card.

## How to train

Training dataset: ultrachat_200k. Only the prompts from this dataset were used for data synthesis, and the synthesized data was then used to train the EAGLE3 draft module (a hypothetical synthesis sketch follows below).

Dataset size: 600K samples, about 1B tokens.

Evaluation datasets: ShareGPT, GSM8K, HumanEval, MT-Bench, Alpaca.

Our ShareGPT test data is located in the eagle_data.jsonl file under this directory.
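To make the data-synthesis step concrete, here is a minimal, hypothetical sketch, not the authors' actual pipeline: it takes prompts from ultrachat_200k, generates responses with the target Qwen3-8B model, and writes prompt/response pairs to a JSONL file for draft-model training. The dataset ID, the output filename, the 1,000-sample slice, and the generation settings are illustrative assumptions.

```python
# Hypothetical data-synthesis sketch: regenerate responses for ultrachat_200k
# prompts with the target model, then save prompt/response pairs as JSONL.
import json

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype="auto", device_map="auto"
)

# Assumed dataset ID; only the prompt field is used.
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

with open("eagle3_train_data.jsonl", "w") as f:
    for example in dataset.select(range(1000)):  # small slice for illustration
        prompt = example["prompt"]
        input_ids = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(model.device)
        output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=False)
        # Keep only the newly generated tokens as the synthesized response.
        response = tokenizer.decode(
            output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True
        )
        f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
```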
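As a usage example for the server launched in the "How to use" section, the sketch below sends a chat request through SGLang's OpenAI-compatible API. It assumes the server is running locally on SGLang's default port 30000; speculative decoding is transparent to the client, so no EAGLE3-specific options are needed on this side.

```python
# Minimal client sketch; assumes the SGLang server above is listening on
# http://localhost:30000 and exposes its OpenAI-compatible API under /v1.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Explain speculative decoding in one paragraph."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```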