This is a weight file that uses the EAGLE method to accelerate inference for Qwen2.5-7B-Instruct
You can use EAGLE with sglang:
python3 -m sglang.launch_server --model Qwen/Qwen2.5-7B-Instruct --speculative-algorithm EAGLE
--speculative-draft-model-path Tengyunw/qwen_2.5_7b_instruct_eagle2_v0 --speculative-num-steps 5
--speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.6 \
Under a single H200 GPU, the TPS for single concurrency using the Eagle method on gsm8k reaches 241, compared to 167 without the Eagle method, achieving a 44.3% improvement.
Here is a test case from the GSM8K dataset that you can use to benchmark generation speed: “Darrell and Allen's ages are in the ratio of 7:11. If their total age now is 162, calculate Allen's age 10 years from now”
- Downloads last month
- 16