Speeding Up LLM Decoding with Advanced Universal Assisted Generation Techniques

Community Article Published March 24, 2025

⭐ TL;DR: LLMs can achieve faster inference using speculative decoding. However, they often lack much smaller versions that can be used as assistant models to achieve even further acceleration. This blog post introduces UAG-TLI, a new method for Universal Assisted Generation (UAG) that allows any small LM to serve as the assistant, delivering enhanced speed boosts. Our experiments with state-of-the-art LLMs demonstrate speedups of up to 2.5x. The UAG-TLI method is now integrated into Transformers🤗 release 4.50.0 as part of Assisted Generation (AG), making advanced AG more accessible. 🚀

Introduction

Large Language Models (LLMs) such as DeepSeek are transforming AI applications, from chatbots to code generation. However, their slow inference speed remains a major bottleneck. Speculative Decoding (SD) has emerged as a practical solution, accelerating text generation by predicting multiple tokens at once.

Traditional SD methods require the assistant and target models to share the same vocabulary. However, many LLMs do not have smaller, lightweight versions available to serve as the assistant model. This limitation reduces the flexibility of SD and hinders its broader adoption.

In our previous blog post we introduced UAG, which mitigates this pain point and enables any off-the-shelf model to serve as the assistant model, regardless of its vocabulary. That method, however, was limited to greedy decoding. Yet probabilistic decoding is crucial for generating diverse, fluent, and coherent text.

In this blog post, we introduce UAG-TLI (UAG-Token-Level Intersection), an extension of UAG that enables probabilistic decoding (i.e., sampling). This enhancement makes speculative decoding through UAG more widely applicable, easier to integrate, and further expands its ability to boost LLM inference speed. In the second half of the blog post, we showcase the acceleration of LLMs such as DeepSeek-R1 using the various speculative decoding techniques and share a code example.

UAG-TLI (Token Level Intersection)

The core idea of UAG-TLI is to map the assistant model probability distribution onto the intersection of its vocabulary with the target model’s vocabulary. In other words, we eliminate tokens from the assistant model’s vocabulary that do not exist in the target model’s vocabulary. This ensures the assistant model only produces tokens from the target vocabulary, eliminating the need for translation between vocabularies and allowing the use of the rejection sampling method introduced in the traditional SD paper.
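
To make the mechanism concrete, here is a minimal, simplified sketch of the intersection-and-masking step. This is not the actual Transformers implementation; the checkpoints are illustrative, and details such as tokenizer normalization and padded model output dimensions are ignored.

import torch
from transformers import AutoTokenizer

# Illustrative target/assistant pair with different vocabularies.
target_tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
assistant_tokenizer = AutoTokenizer.from_pretrained("double7/vicuna-68m")

# Token strings present in both vocabularies (the intersection).
shared = set(target_tokenizer.get_vocab()) & set(assistant_tokenizer.get_vocab())

# Boolean mask over the assistant vocabulary: True where the token also exists in the target vocabulary.
assistant_vocab = assistant_tokenizer.get_vocab()  # token string -> assistant token id
keep = torch.zeros(len(assistant_vocab), dtype=torch.bool)
for token, token_id in assistant_vocab.items():
    if token in shared:
        keep[token_id] = True

def restrict_to_intersection(assistant_logits: torch.Tensor) -> torch.Tensor:
    """Suppress tokens outside the intersection and renormalize the assistant's distribution."""
    masked = assistant_logits.masked_fill(~keep, float("-inf"))
    return torch.softmax(masked, dim=-1)

Because the assistant now only proposes tokens that exist in the target vocabulary, the target model can verify the drafted tokens with the standard rejection-sampling rule from the traditional SD paper.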

The main benefits of UAG-TLI are:

🔧 Enables using any model on the Hugging Face Hub as an assistant

🎲 Enables sampling (temperature > 0)

📈 Boosts inference speed

Benchmarking Results on DeepSeek AI Models

To demonstrate the impact of this technique, we benchmarked various LLMs. Our benchmarking code is available at https://github.com/keyboardAnt/hf-bench. The tables below show significant speedups without compromising accuracy.

  • Table 1 shows the speedup achieved for models that lack much smaller variants sharing their vocabulary. This makes UAG-TLI the go-to solution for achieving inference speedups when sampling with non-zero temperature.
  • Table 2 shows models that do have smaller variants sharing their vocabulary. In many cases, UAG-TLI proves more effective than Traditional SD. For example, vicuna-68m serving as the assistant for gemma-2-9b-it on the humaneval dataset with UAG-TLI achieves a higher speedup (1.46x) than gemma-2-2b-it serving as the assistant in Traditional SD mode (1.36x).
  • We note that DeepSeek-R1-Distill-Qwen-14B and DeepSeek-R1-Distill-Qwen-32B do not share the same vocabulary as DeepSeek-R1-Distill-Qwen-1.5B; therefore, this model can serve as an assistant for them only through UAG-TLI.
| Target | HW | Dataset | Method | Drafter | Speedup |
|---|---|---|---|---|---|
| Mixtral-8x22B-Instruct-v0.1 | 4*H100 NVL | scrolls | UAG-TLI | Qwen2.5-0.5B-Instruct | 1.69x |
| Mixtral-8x22B-Instruct-v0.1 | 4*H100 NVL | humaneval | UAG-TLI | vicuna-68m | 1.67x |
| Mixtral-8x22B-Instruct-v0.1 | 4*H100 NVL | cnn_dailymail | UAG-TLI | vicuna-68m | 1.53x |
| phi-4 | 1*H100 NVL | scrolls | UAG-TLI | Qwen2.5-0.5B-Instruct | 1.45x |
| CodeLlama-13b-Instruct-hf | 1*A6000 | humaneval | UAG-TLI | tiny_starcoder | 1.74x |
| DeepSeek-R1-Distill-Qwen-14B | 1*A6000 | scrolls | UAG-TLI | vicuna-68m | 1.59x |
| DeepSeek-R1-Distill-Qwen-14B | 1*A6000 | cnn_dailymail | UAG-TLI | vicuna-68m | 1.31x |
| DeepSeek-R1-Distill-Qwen-14B | 1*A6000 | humaneval | UAG-TLI | tiny_starcoder | 1.30x |

Table 1: Speedup performance across target models lacking smaller variants that share their vocabulary.
*Speedup compared to running the target in autoregressive mode.

| Target | HW | Dataset | Method | Drafter | Speedup |
|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-32B | 2*A100 80GB PCIe | scrolls | Traditional SD | DeepSeek-R1-Distill-Qwen-7B | 2.02x |
| DeepSeek-R1-Distill-Qwen-32B | 2*A100 80GB PCIe | scrolls | UAG-TLI | DeepSeek-R1-Distill-Qwen-1.5B | 2.26x |
| DeepSeek-R1-Distill-Qwen-32B | 2*A100 80GB PCIe | cnn_dailymail | Traditional SD | DeepSeek-R1-Distill-Qwen-7B | 1.37x |
| DeepSeek-R1-Distill-Qwen-32B | 2*A100 80GB PCIe | cnn_dailymail | UAG-TLI | vicuna-68m | 1.38x |
| DeepSeek-R1-Distill-Qwen-32B | 2*A100 80GB PCIe | humaneval | Traditional SD | DeepSeek-R1-Distill-Qwen-7B | 1.96x |
| DeepSeek-R1-Distill-Qwen-32B | 2*A100 80GB PCIe | humaneval | UAG-TLI | DeepSeek-R1-Distill-Qwen-1.5B | 1.70x |
| gemma-2-9b-it | 1*H100 NVL | scrolls | Traditional SD | gemma-2-2b-it | 2.49x |
| gemma-2-9b-it | 1*H100 NVL | scrolls | UAG-TLI | vicuna-68m | 2.04x |
| gemma-2-9b-it | 1*H100 NVL | humaneval | Traditional SD | gemma-2-2b-it | 1.36x |
| gemma-2-9b-it | 1*H100 NVL | humaneval | UAG-TLI | vicuna-68m | 1.46x |
| DeepSeek-R1-Distill-Llama-70B | 2*H100 NVL | scrolls | Baseline | DeepSeek-R1-Distill-Llama-8B | 1.98x |
| DeepSeek-R1-Distill-Llama-70B | 2*H100 NVL | scrolls | UAG-TLI | DeepSeek-R1-Distill-Qwen-1.5B | 1.82x |
| DeepSeek-R1-Distill-Llama-70B | 2*A100 80GB PCIe | humaneval | Baseline | DeepSeek-R1-Distill-Llama-8B | 2.3x |
| DeepSeek-R1-Distill-Llama-70B | 2*A100 80GB PCIe | humaneval | UAG-TLI | tiny_starcoder | 1.44x |

Table 2: Comparison of Traditional SD and UAG-TLI for target models that have smaller variants sharing their vocabulary.
*Speedup compared to running the target in autoregressive mode.

We note that using DeepSeek-R1-Distill-Qwen-14B as a draft model for DeepSeek-R1-Distill-Qwen-32B on a single A100 80GB device showed a significant slowdown, probably due to memory offloading.

Available Now in Hugging Face Transformers

UAG-TLI is now available in the Hugging Face Transformers library and serves as the default choice for heterogeneous (different vocabularies) Speculative Decoding when do_sample=True. See UniversalSpeculativeDecodingGenerator class. Developers can easily integrate these techniques into their workflows and unlock faster LLM inference today.

from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-2-9b",
    assistant_model="double7/vicuna-68m",  # This extra line is all that's needed!
    torch_dtype="bfloat16"
)
pipe_output = pipe("Your prompt here", max_new_tokens=50, do_sample=True)
print(pipe_output[0]["generated_text"])
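
For finer-grained control, the same setup can be expressed through model.generate. The sketch below uses the same checkpoints as the pipeline example and passes both tokenizers explicitly, since the target and assistant use different vocabularies; the prompt and generation parameters are illustrative.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
assistant_tokenizer = AutoTokenizer.from_pretrained("double7/vicuna-68m")

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b", torch_dtype="bfloat16")
assistant_model = AutoModelForCausalLM.from_pretrained("double7/vicuna-68m")

inputs = tokenizer("Your prompt here", return_tensors="pt")
outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,
    tokenizer=tokenizer,                      # target model's tokenizer
    assistant_tokenizer=assistant_tokenizer,  # assistant's tokenizer (different vocabulary)
    do_sample=True,                           # sampling selects the UAG-TLI path
    max_new_tokens=50,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])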

Reference

Universal Assisted Generation: Faster Decoding with Any Assistant Model

Citation

@article{timor2025acceleratingllminferencelossless,
      title={Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies}, 
      author={Nadav Timor and Jonathan Mamou and Daniel Korat and Moshe Berchansky and Oren Pereg and Gaurav Jain and Roy Schwartz and Moshe Wasserblat and David Harel},
      year={2025},
      eprint={2502.05202},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.05202}, 
}

Community

As mentioned, we’ve open-sourced our benchmarking code here: https://github.com/keyboardAnt/hf-bench
