Speeding Up LLM Decoding with Advanced Universal Assisted Generation Techniques
⭐ TL;DR: LLMs can achieve faster inference through speculative decoding, but many of them lack much smaller versions that can serve as assistant models and deliver further acceleration. This blog post introduces UAG-TLI, a new Universal Assisted Generation (UAG) method that allows any small LM to serve as the assistant and provide enhanced speed boosts. Our experiments with state-of-the-art LLMs demonstrate speedups of up to 2.5x. The UAG-TLI method is now integrated into Transformers🤗 release 4.50.0 as part of Assisted Generation (AG), making advanced AG more accessible. 🚀
Introduction
Large Language Models (LLMs) such as DeepSeek are transforming AI applications, from chatbots to code generation. However, their slow inference speed remains a major bottleneck. Speculative Decoding (SD) has emerged as a practical solution, accelerating text generation by predicting multiple tokens at once.
Traditional SD methods require the assistant and target models to share the same vocabulary, yet many LLMs have no smaller, lightweight version available to serve as the assistant model. This limitation reduces the flexibility of SD and hinders its broader adoption.
In our previous blog post we introduced UAG, which mitigates this pain point and enables any off-the-shelf model to serve as the assistant, regardless of its vocabulary. That method, however, was limited to greedy decoding, while probabilistic decoding is crucial for generating diverse, fluent, and coherent text.
In this blog post, we introduce UAG-TLI (UAG - Token-Level Intersection), an extension of UAG that enables probabilistic decoding (i.e., sampling). This enhancement makes speculative decoding through UAG more applicable, easier to integrate, and further expands its ability to boost LLM inference speed. In the second half of the post, we showcase the acceleration of LLMs such as DeepSeek-R1 using various speculative decoding techniques and share code examples.
UAG-TLI (Token Level Intersection)
The core idea of UAG-TLI is to map the assistant model probability distribution onto the intersection of its vocabulary with the target model’s vocabulary. In other words, we eliminate tokens from the assistant model’s vocabulary that do not exist in the target model’s vocabulary. This ensures the assistant model only produces tokens from the target vocabulary, eliminating the need for translation between vocabularies and allowing the use of the rejection sampling method introduced in the traditional SD paper.
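To make this concrete, below is a minimal sketch of the idea: building a token-level intersection mask from two off-the-shelf tokenizers, restricting the drafter's distribution to that intersection, and verifying drafted tokens with the standard rejection test from the SD paper. The model choices, helper names, and the naive string matching of tokens are illustrative assumptions; this is not the Transformers implementation, which handles additional details such as tokenizer-specific token surface forms.

```python
import torch
from transformers import AutoTokenizer

# Illustrative target/assistant pair with different vocabularies;
# any pair from the Hub follows the same pattern.
target_tok = AutoTokenizer.from_pretrained("google/gemma-2-9b")
assistant_tok = AutoTokenizer.from_pretrained("double7/vicuna-68m")

target_vocab = target_tok.get_vocab()        # token string -> target id
assistant_vocab = assistant_tok.get_vocab()  # token string -> assistant id

# Boolean mask over the assistant vocabulary: True where the token string
# also exists in the target vocabulary (the "token-level intersection").
mask = torch.zeros(max(assistant_vocab.values()) + 1, dtype=torch.bool)
for token, assistant_id in assistant_vocab.items():
    if token in target_vocab:
        mask[assistant_id] = True


def restrict_to_intersection(assistant_probs: torch.Tensor) -> torch.Tensor:
    """Zero out probability mass on tokens outside the intersection and
    renormalize, so every drafted token exists in the target vocabulary."""
    masked = assistant_probs * mask
    return masked / masked.sum(dim=-1, keepdim=True)


def accept_draft_token(p_target: float, q_draft: float) -> bool:
    """Standard speculative-decoding rejection test: accept the drafted
    token with probability min(1, p_target / q_draft)."""
    return torch.rand(()).item() < min(1.0, p_target / q_draft)
```

Because every surviving assistant token also exists in the target vocabulary, drafted tokens can be mapped one-to-one to target-vocabulary ids and scored by the target model without any translation step.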
The main benefits of UAG-TLI are:
🔧 Enables the use of any model on the model hub as the assistant
🎲 Enables sampling (temperature > 0)
📈 Boosts inference speed
Benchmarking Results on DeepSeek AI Models
To demonstrate the impact of this technique, we benchmarked various LLMs; the benchmark code is available. The tables below show significant speedups without compromising accuracy.
- Table 1 shows the speedups achieved for models that lack much smaller variants sharing their vocabulary. This makes UAG-TLI the go-to solution for achieving inference speedups when sampling with non-zero temperature.
- Table 2 shows models that do have smaller variants sharing their vocabulary. In many cases, UAG-TLI proves more effective than traditional AG (Traditional SD). For example, vicuna-68m serving as the assistant for gemma-2-9b-it on the humaneval dataset with UAG-TLI (1.46x) outperforms gemma-2-2b-it as the assistant in traditional SD mode (1.36x).
- We note that DeepSeek-R1-Distill-Qwen-14B and DeepSeek-R1-Distill-Qwen-32B do not share the same vocabulary as DeepSeek-R1-Distill-Qwen-1.5B; therefore, the 1.5B model can serve as the assistant for these models only with UAG-TLI.
| Target | HW | Dataset | Method | Drafter | Speedup |
|---|---|---|---|---|---|
| Mixtral-8x22B-Instruct-v0.1 | 4*H100 NVL | scrolls | UAG-TLI | Qwen2.5-0.5B-Instruct | 1.69x |
| | | humaneval | UAG-TLI | vicuna-68m | 1.67x |
| | | cnn_dailymail | UAG-TLI | vicuna-68m | 1.53x |
| phi-4 | 1*H100 NVL | scrolls | UAG-TLI | Qwen2.5-0.5B-Instruct | 1.45x |
| CodeLlama-13b-Instruct-hf | 1*A6000 | humaneval | UAG-TLI | tiny_starcoder | 1.74x |
| DeepSeek-R1-Distill-Qwen-14B | 1*A6000 | scrolls | UAG-TLI | vicuna-68m | 1.59x |
| | | cnn_dailymail | UAG-TLI | vicuna-68m | 1.31x |
| | | humaneval | UAG-TLI | tiny_starcoder | 1.30x |
Table 1: Speedup across target models that lack smaller variants sharing their vocabulary.
*Speedup compared to running the target model in autoregressive mode.
| Target | HW | Dataset | Method | Drafter | Speedup |
|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-32B | 2*A100 80GB PCIe | scrolls | Traditional SD | DeepSeek-R1-Distill-Qwen-7B | 2.02x |
| | | | UAG-TLI | DeepSeek-R1-Distill-Qwen-1.5B | 2.26x |
| | | cnn_dailymail | Traditional SD | DeepSeek-R1-Distill-Qwen-7B | 1.37x |
| | | | UAG-TLI | vicuna-68m | 1.38x |
| | | humaneval | Traditional SD | DeepSeek-R1-Distill-Qwen-7B | 1.96x |
| | | | UAG-TLI | DeepSeek-R1-Distill-Qwen-1.5B | 1.70x |
| gemma-2-9b-it | 1*H100 NVL | scrolls | Traditional SD | gemma-2-2b-it | 2.49x |
| | | | UAG-TLI | vicuna-68m | 2.04x |
| | | humaneval | Traditional SD | gemma-2-2b-it | 1.36x |
| | | | UAG-TLI | vicuna-68m | 1.46x |
| DeepSeek-R1-Distill-Llama-70B | 2*H100 NVL | scrolls | Baseline | DeepSeek-R1-Distill-Llama-8B | 1.98x |
| | | | UAG-TLI | DeepSeek-R1-Distill-Qwen-1.5B | 1.82x |
| | 2*A100 80GB PCIe | humaneval | Baseline | DeepSeek-R1-Distill-Llama-8B | 2.3x |
| | | | UAG-TLI | tiny_starcoder | 1.44x |
Table 2: Comparison of Traditional SD and UAG-TLI for target models that have smaller variants sharing their vocabulary.
*Speedup compared to running the target model in autoregressive mode.
We note that using DeepSeek-R1-Distill-Qwen-14B as the draft model for DeepSeek-R1-Distill-Qwen-32B on a single A100 80GB device resulted in a significant slowdown, likely due to memory offloading.
Available Now in Hugging Face Transformers
UAG-TLI is now available in the Hugging Face Transformers library and serves as the default choice for heterogeneous (different-vocabulary) speculative decoding when `do_sample=True`; see the `UniversalSpeculativeDecodingGenerator` class. Developers can easily integrate these techniques into their workflows and unlock faster LLM inference today.
```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-2-9b",
    assistant_model="double7/vicuna-68m",  # This extra line is all that's needed!
    torch_dtype="bfloat16",
)
pipe_output = pipe("Your prompt here", max_new_tokens=50, do_sample=True)
print(pipe_output[0]["generated_text"])
```
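If you prefer to call `generate()` directly, the following is a minimal sketch using the documented universal assisted generation arguments (`assistant_model`, `tokenizer`, `assistant_tokenizer`); the checkpoints and prompt here are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Target and assistant models with different vocabularies (illustrative choices).
checkpoint = "google/gemma-2-9b"
assistant_checkpoint = "double7/vicuna-68m"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
assistant_tokenizer = AutoTokenizer.from_pretrained(assistant_checkpoint)

model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)

inputs = tokenizer("Your prompt here", return_tensors="pt")

# Passing both tokenizers enables universal assisted generation;
# do_sample=True exercises the sampling (UAG-TLI) path.
outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,
    tokenizer=tokenizer,
    assistant_tokenizer=assistant_tokenizer,
    do_sample=True,
    max_new_tokens=50,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```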
Reference
Universal Assisted Generation: Faster Decoding with Any Assistant Model
Citation
```bibtex
@article{timor2025acceleratingllminferencelossless,
  title={Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies},
  author={Nadav Timor and Jonathan Mamou and Daniel Korat and Moshe Berchansky and Oren Pereg and Gaurav Jain and Roy Schwartz and Moshe Wasserblat and David Harel},
  year={2025},
  eprint={2502.05202},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.05202},
}
```