Speeding Up LLM Decoding with Advanced Universal Assisted Generation Techniques
⭐ TL;DR: LLMs can achieve faster inference through speculative decoding, but many of them lack much smaller versions that can serve as assistant models and deliver further acceleration. This blog post introduces UAG-TLI, a new Universal Assisted Generation (UAG) method that allows any small LM to serve as the assistant and provide enhanced speed boosts. Our experiments with state-of-the-art LLMs demonstrate speedups of up to 2.5x. The UAG-TLI method is now integrated into Transformers🤗 release 4.50.0 as part of Assisted Generation (AG), making advanced AG more accessible. 🚀
Introduction
Large Language Models (LLMs) such as DeepSeek are transforming AI applications, from chatbots to code generation. However, their slow inference speed remains a major bottleneck. Speculative Decoding (SD) has emerged as a practical solution, accelerating text generation by predicting multiple tokens at once.
Traditional SD methods require the assistant and target models to share the same vocabulary, yet many LLMs have no smaller, lightweight version available to serve as the assistant model. This limitation reduces the flexibility of SD and hinders its broader adoption.
In our previous blog post we introduced UAG, which mitigates this pain point and enables any off-the-shelf model to serve as the assistant, regardless of its vocabulary. That method, however, was limited to greedy decoding, while probabilistic decoding is crucial for generating diverse, fluent, and coherent text.
In this blog post, we introduce UAG-TLI (UAG - Token-Level Intersection), an extension of UAG that enables probabilistic decoding (i.e., sampling). This enhancement makes speculative decoding through UAG more applicable, easier to integrate, and further expands its ability to boost LLM inference speed. In the second half of the post, we showcase the acceleration of LLMs such as DeepSeek-R1 using various speculative decoding techniques and share code examples.
UAG-TLI (Token Level Intersection)
The core idea of UAG-TLI is to map the assistant model probability distribution onto the intersection of its vocabulary with the target model’s vocabulary. In other words, we eliminate tokens from the assistant model’s vocabulary that do not exist in the target model’s vocabulary. This ensures the assistant model only produces tokens from the target vocabulary, eliminating the need for translation between vocabularies and allowing the use of the rejection sampling method introduced in the traditional SD paper.
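To make this concrete, below is a minimal sketch of the idea: building a token-level intersection mask from two off-the-shelf tokenizers, restricting the drafter's distribution to that intersection, and verifying drafted tokens with the standard rejection test from the SD paper. The model choices, helper names, and the naive string matching of tokens are illustrative assumptions; this is not the Transformers implementation, which handles additional details such as tokenizer-specific token surface forms.

```python
import torch
from transformers import AutoTokenizer

# Illustrative target/assistant pair with different vocabularies;
# any pair from the Hub follows the same pattern.
target_tok = AutoTokenizer.from_pretrained("google/gemma-2-9b")
assistant_tok = AutoTokenizer.from_pretrained("double7/vicuna-68m")

target_vocab = target_tok.get_vocab()        # token string -> target id
assistant_vocab = assistant_tok.get_vocab()  # token string -> assistant id

# Boolean mask over the assistant vocabulary: True where the token string
# also exists in the target vocabulary (the "token-level intersection").
mask = torch.zeros(max(assistant_vocab.values()) + 1, dtype=torch.bool)
for token, assistant_id in assistant_vocab.items():
    if token in target_vocab:
        mask[assistant_id] = True


def restrict_to_intersection(assistant_probs: torch.Tensor) -> torch.Tensor:
    """Zero out probability mass on tokens outside the intersection and
    renormalize, so every drafted token exists in the target vocabulary."""
    masked = assistant_probs * mask
    return masked / masked.sum(dim=-1, keepdim=True)


def accept_draft_token(p_target: float, q_draft: float) -> bool:
    """Standard speculative-decoding rejection test: accept the drafted
    token with probability min(1, p_target / q_draft)."""
    return torch.rand(()).item() < min(1.0, p_target / q_draft)
```

Because every surviving assistant token also exists in the target vocabulary, drafted tokens can be mapped one-to-one to target-vocabulary ids and scored by the target model without any translation step.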
The main benefits of UAG-TLI are:
🔧 Enables the use of any model on the model hub as the assistant
🎲 Enables sampling (temperature > 0)
📈 Boosts inference speed
Benchmarking Results on DeepSeek AI Models
To demonstrate the impact of this technique, we benchmarked various LLMs; the benchmark code is available. The tables below show significant speedups without compromising accuracy.
- Table 1 shows the speedups achieved for models that lack much smaller variants sharing their vocabulary. This makes UAG-TLI the go-to solution for achieving inference speedups when sampling with non-zero temperature.
- Table 2 shows models that do have smaller variants sharing their vocabulary. In many cases, UAG-TLI proves more effective than traditional AG (Traditional SD). For example, vicuna-68m serving as the assistant for gemma-2-9b-it on the humaneval dataset with UAG-TLI (1.46x) outperforms gemma-2-2b-it as the assistant in traditional SD mode (1.36x).
- We note that DeepSeek-R1-Distill-Qwen-14B and DeepSeek-R1-Distill-Qwen-32B do not share the same vocabulary as DeepSeek-R1-Distill-Qwen-1.5B; therefore, the 1.5B model can serve as the assistant for these models only with UAG-TLI.
| Target | HW | Dataset | Method | Drafter | Speedup |
|---|---|---|---|---|---|
| Mixtral-8x22B-Instruct-v0.1 | 4*H100 NVL | scrolls | UAG-TLI | Qwen2.5-0.5B-Instruct | 1.69x |
| | | humaneval | UAG-TLI | vicuna-68m | 1.67x |
| | | cnn_dailymail | UAG-TLI | vicuna-68m | 1.53x |
| phi-4 | 1*H100 NVL | scrolls | UAG-TLI | Qwen2.5-0.5B-Instruct | 1.45x |
| CodeLlama-13b-Instruct-hf | 1*A6000 | humaneval | UAG-TLI | tiny_starcoder | 1.74x |
| DeepSeek-R1-Distill-Qwen-14B | 1*A6000 | scrolls | UAG-TLI | vicuna-68m | 1.59x |
| | | cnn_dailymail | UAG-TLI | vicuna-68m | 1.31x |
| | | humaneval | UAG-TLI | tiny_starcoder | 1.30x |
Table 1: Speedup across target models that lack smaller variants sharing their vocabulary.
*Speedup compared to running the target model in autoregressive mode.
| Target | HW | Dataset | Method | Drafter | Speedup |
|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-32B | 2*A100 80GB PCIe | scrolls | Traditional SD | DeepSeek-R1-Distill-Qwen-7B | 2.02x |
| | | | UAG-TLI | DeepSeek-R1-Distill-Qwen-1.5B | 2.26x |
| | | cnn_dailymail | Traditional SD | DeepSeek-R1-Distill-Qwen-7B | 1.37x |
| | | | UAG-TLI | vicuna-68m | 1.38x |
| | | humaneval | Traditional SD | DeepSeek-R1-Distill-Qwen-7B | 1.96x |
| | | | UAG-TLI | DeepSeek-R1-Distill-Qwen-1.5B | 1.70x |
| gemma-2-9b-it | 1*H100 NVL | scrolls | Traditional SD | gemma-2-2b-it | 2.49x |
| | | | UAG-TLI | vicuna-68m | 2.04x |
| | | humaneval | Traditional SD | gemma-2-2b-it | 1.36x |
| | | | UAG-TLI | vicuna-68m | 1.46x |
| DeepSeek-R1-Distill-Llama-70B | 2*H100 NVL | scrolls | Baseline | DeepSeek-R1-Distill-Llama-8B | 1.98x |
| | | | UAG-TLI | DeepSeek-R1-Distill-Qwen-1.5B | 1.82x |
| | 2*A100 80GB PCIe | humaneval | Baseline | DeepSeek-R1-Distill-Llama-8B | 2.3x |
| | | | UAG-TLI | tiny_starcoder | 1.44x |
Table 2: Comparison of Traditional SD and UAG-TLI for target models that have smaller variants sharing their vocabulary.
*Speedup compared to running the target model in autoregressive mode.
We note that using DeepSeek-R1-Distill-Qwen-14B as the draft model for DeepSeek-R1-Distill-Qwen-32B on a single A100 80GB device resulted in a significant slowdown, likely due to memory offloading.
Available Now in Hugging Face Transformers
UAG-TLI is now available in the Hugging Face Transformers library and serves as the default choice for heterogeneous (different-vocabulary) speculative decoding when `do_sample=True`; see the `UniversalSpeculativeDecodingGenerator` class. Developers can easily integrate these techniques into their workflows and unlock faster LLM inference today.
```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-2-9b",
    assistant_model="double7/vicuna-68m",  # This extra line is all that's needed!
    torch_dtype="bfloat16",
)
pipe_output = pipe("Your prompt here", max_new_tokens=50, do_sample=True)
print(pipe_output[0]["generated_text"])
```
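If you prefer to call `generate()` directly, the following is a minimal sketch using the documented universal assisted generation arguments (`assistant_model`, `tokenizer`, `assistant_tokenizer`); the checkpoints and prompt here are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Target and assistant models with different vocabularies (illustrative choices).
checkpoint = "google/gemma-2-9b"
assistant_checkpoint = "double7/vicuna-68m"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
assistant_tokenizer = AutoTokenizer.from_pretrained(assistant_checkpoint)

model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)

inputs = tokenizer("Your prompt here", return_tensors="pt")

# Passing both tokenizers enables universal assisted generation;
# do_sample=True exercises the sampling (UAG-TLI) path.
outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,
    tokenizer=tokenizer,
    assistant_tokenizer=assistant_tokenizer,
    do_sample=True,
    max_new_tokens=50,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```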
Reference
Universal Assisted Generation: Faster Decoding with Any Assistant Model
Citation
```bibtex
@article{timor2025acceleratingllminferencelossless,
  title={Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies},
  author={Nadav Timor and Jonathan Mamou and Daniel Korat and Moshe Berchansky and Oren Pereg and Gaurav Jain and Roy Schwartz and Moshe Wasserblat and David Harel},
  year={2025},
  eprint={2502.05202},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.05202},
}
```