Tengyunw/qwen_2.5_7b_instruct_eagle2_v0

This is a weight file that uses the EAGLE method to accelerate inference for Qwen2.5-7B-Instruct

You can use EAGLE with sglang:

python3 -m sglang.launch_server --model Qwen/Qwen2.5-7B-Instruct --speculative-algorithm EAGLE
--speculative-draft-model-path Tengyunw/qwen_2.5_7b_instruct_eagle2_v0 --speculative-num-steps 5
--speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.6 \

Under a single H200 GPU, the TPS for single concurrency using the Eagle method on gsm8k reaches 241, compared to 167 without the Eagle method, achieving a 44.3% improvement.

Here is a test case from the GSM8K dataset that you can use to benchmark generation speed: “Darrell and Allen's ages are in the ratio of 7:11. If their total age now is 162, calculate Allen's age 10 years from now”

Tengyunw
/

qwen_2.5_7b_instruct_eagle2_v0

Model tree for Tengyunw/qwen_2.5_7b_instruct_eagle2_v0