arxiv:2506.22694

VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs

Published on Jun 28
· Submitted by RaghavvGoel on Jul 1
Authors:

Abstract

A technique called VocabTrim improves drafter-based speculative decoding by reducing the vocabulary of the drafter language model, thus decreasing drafting latency in memory-bound environments.

AI-generated summary

In this paper, we introduce a simple, training-free technique to improve the performance of drafter-based speculative decoding (SpD) methods that incorporate a language modeling head (LM head) during the drafting process. Drafter-based speculative decoding leverages one or more smaller language models, a.k.a. drafters or draft models, to sample a draft sequence or tree of multiple tokens, which is then verified by a base LLM, the target model, which accepts a subset as its valid generation. Since speculative decoding is usually assumed to require a one-to-one mapping between the vocabularies of the target model and the draft model, it has been natural to share the vocabulary between them, or even to share the LM head as in EAGLE or Medusa. We first identify that this draft token sampling scheme inherently incurs unnecessary inference overhead during drafting, especially for target LLMs with very large vocabularies. We then propose a simple technique, VocabTrim, that mitigates this drafting overhead to improve generation speed in memory-bound environments. VocabTrim reconstructs the drafter's LM head to contain only a limited set of tokens, selected as those most frequently sampled from the target model's vocabulary. While limiting the drafting vocabulary slightly degrades the acceptance rate, it significantly reduces drafting latency in memory-bound settings, which is often the case on edge devices, yielding a higher memory-bound speed-up (MBSU). We show that our method boosts the memory-bound speed-up for Llama-3 models on Spec-Bench, specifically by 16% for Llama-3.2-3B-Instruct.
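To make the vocabulary-pruning step concrete, below is a minimal PyTorch sketch of how a drafter's LM head could be trimmed to the most frequently sampled target-model tokens. This is not the authors' released code: the function name, the calibration-derived `token_counts` tensor, and the returned index map are illustrative assumptions.

```python
# Minimal sketch of VocabTrim-style LM-head pruning (illustrative, not the paper's code).
import torch
import torch.nn as nn

def build_trimmed_lm_head(lm_head: nn.Linear, token_counts: torch.Tensor, keep: int):
    """Keep only the `keep` most frequently sampled target-model tokens.

    lm_head      : original drafter LM head mapping hidden states -> full-vocab logits
    token_counts : per-token sampling frequencies collected from the target model
                   on some calibration text (shape: [vocab_size]); an assumption here
    keep         : size of the trimmed vocabulary
    Returns the trimmed head and the index map from trimmed ids -> full-vocab ids.
    """
    kept_ids = torch.topk(token_counts, keep).indices            # most frequent token ids
    trimmed = nn.Linear(lm_head.in_features, keep, bias=lm_head.bias is not None)
    trimmed.weight.data = lm_head.weight.data[kept_ids].clone()  # copy only the kept rows
    if lm_head.bias is not None:
        trimmed.bias.data = lm_head.bias.data[kept_ids].clone()
    return trimmed, kept_ids
```

Because only rows of the LM head are dropped, no retraining is needed; at draft time the smaller projection reduces the memory traffic that dominates latency in memory-bound decoding.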

Community

Paper submitter

We present VOCABTRIM, a training-free method to accelerate drafter-based speculative decoding (SpD) by reducing inference overhead during draft generation. SpD uses smaller draft models to propose token sequences, which are then verified by a larger target language model (LLM). Existing approaches often share vocabularies or LM heads between models, leading to inefficiencies, especially with large-vocabulary LLMs. VOCABTRIM addresses this by reconstructing the drafter's LM head with a reduced vocabulary comprising the most frequently sampled tokens from the target model. This significantly lowers drafting latency in memory-bound environments, with minimal impact on acceptance rate. On Spec-Bench, VOCABTRIM achieves up to 16% memory-bound speed-up on Llama-3.2-3B-Instruct.
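For completeness, here is a hedged sketch of how drafting might look with the trimmed head: the drafter samples within the reduced vocabulary, and each draft token is mapped back to its full-vocabulary id before the target model verifies it. The model interface, greedy sampling, and names such as `drafter` and `kept_ids` are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of a drafting loop with a trimmed LM head (names are illustrative).
import torch

@torch.no_grad()
def draft_tokens(drafter, trimmed_head, kept_ids, input_ids, num_draft: int):
    draft = []
    ids = input_ids
    for _ in range(num_draft):
        hidden = drafter(ids).last_hidden_state[:, -1]  # last-position hidden state
        logits = trimmed_head(hidden)                    # logits over trimmed vocab only
        next_trimmed = logits.argmax(dim=-1)             # greedy draft token for simplicity
        next_token = kept_ids[next_trimmed]              # map back to full-vocab token id
        draft.append(next_token)
        ids = torch.cat([ids, next_token[:, None]], dim=-1)
    return torch.stack(draft, dim=-1)                    # later verified by the target LLM
```

Since the draft tokens are mapped back to full-vocabulary ids, the target model's verification step is unchanged; the trade-off is that any token outside the kept set can never be drafted, which is the source of the small acceptance-rate drop noted above.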


