Tengyunw/qwen3_8b_eagle2_v0

This is a weight file that uses the EAGLE method to accelerate inference for Qwen3-8B

You can use EAGLE with sglang:

python3 -m sglang.launch_server --model Qwen/Qwen3-8B-FP8 --speculative-algorithm EAGLE
--speculative-draft-model-path Tengyunw/qwen3_8b_eagle2_v0 --speculative-num-steps 5
--speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.6 \

Under a single H200 GPU, the TPS for single concurrency using the Eagle method on gsm8k reaches 248, compared to 172 without the Eagle method, achieving a 44% improvement.

Here is a test case from the GSM8K dataset that you can use to benchmark generation speed: “Darrell and Allen's ages are in the ratio of 7:11. If their total age now is 162, calculate Allen's age 10 years from now”

Tengyunw
/

qwen3_8b_eagle2_v0

Model tree for Tengyunw/qwen3_8b_eagle2_v0