August 8, 2025 • Written By Aaron Pham, Frost Ming, Larme Zhao and Sherlock Xu
LLMs are powerful, but they’re slow during inference. That’s because they’re trained with an autoregressive pattern: generating the next token based on all the previous ones. During inference, this means every new token requires a full forward pass through the model, followed by sampling and then appending that token to the input for the next step. This process is inherently sequential, so the model can’t compute future tokens ahead of time even when the GPU has spare capacity. The result is high Inter-Token Latency (ITL) and poor GPU utilization.
Speculative decoding offers a solution. By having a small draft model predict several tokens in advance, and letting a larger target model verify them in parallel, you can accelerate the token generation process.
In practice, however, we found speculative decoding only delivers the expected speedup if the draft model’s distribution matches closely with the target model. The key is using the right draft model for your workload, which in many real-world cases means training one on your own data.
In this post, we’ll walk through:

- What speculative decoding is and how it works
- Why the acceptance rate determines whether you actually see a speedup
- Simulated and real-world benchmarks with vLLM and SGLang
- How we trained and validated a custom EAGLE 3 draft model
Speculative decoding is an inference-time optimization technique that speeds up LLM token generation without sacrificing output quality. It’s inspired by the concept of speculative execution, where operations are computed in parallel ahead of time and discarded if unneeded.
This technique builds on two key observations about LLM inference:

- Many tokens are easy to predict, so a much smaller model can often guess them correctly.
- Verifying several tokens in a single forward pass of the target model costs roughly the same as generating one token, because decoding is bound by memory bandwidth rather than compute.
To take advantage of these facts, speculative decoding uses a draft-then-verify paradigm:
Note: This draft-then-verify idea was first introduced by Stern et al. (2018), and later extended into a statistically grounded technique called speculative sampling by DeepMind. Speculative decoding is the application of speculative sampling to inference from autoregressive models, like transformers.
Here is how it works in more detail:

1. The draft model proposes the next k tokens based on the current context.
2. The target model then verifies these tokens in parallel in a single forward pass.
3. If the target model agrees with the draft token at position i (per the speculative sampling acceptance rule), the token is accepted; the first rejection ends the round and discards everything after it.
4. The target model generates the next token after the last accepted one, and the cycle continues.
This technique parallelizes the expensive part (i.e., the forward pass) and replaces many slow sequential steps with a single verification, thus reducing ITL. It is especially helpful for interactive and latency-sensitive applications such as chatbots, coding assistants, and agentic workflows.
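To make the loop concrete, here is a minimal sketch of one draft-then-verify round. The token IDs, probabilities, and the k = 5 proposal count are made up for illustration; this is not vLLM’s or SGLang’s implementation, just the speculative sampling acceptance rule in isolation.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify(draft_tokens, draft_probs, target_probs):
    """Return the prefix of draft_tokens that the target model accepts.

    draft_probs[i] / target_probs[i] are the probabilities each model assigned
    to the i-th proposed token (placeholder values in this sketch).
    """
    accepted = []
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        # Speculative sampling: accept with probability min(1, p_target / p_draft).
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            break  # First rejection ends the round; the target model resamples here.
    return accepted

# One hypothetical round with k = 5 proposed tokens.
print(verify(
    draft_tokens=[11, 42, 7, 99, 3],
    draft_probs=[0.9, 0.8, 0.7, 0.6, 0.5],
    target_probs=[0.95, 0.7, 0.1, 0.5, 0.4],
))
```

In real schedulers, the target model’s single verification pass also yields one extra token after the accepted prefix, which is why every round makes at least one token of progress.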
Speculative decoding promises faster LLM inference, but only when it works well. And its effectiveness depends on a critical factor: how often the target model accepts the draft model’s predictions.
This is known as the acceptance rate (α). It is not a fixed number and can vary based on several factors like decoding strategy (e.g., nucleus vs. random sampling) and application domains.
High α means:

- Most draft tokens are accepted, so each verification pass commits several tokens at once.
- ITL drops and throughput rises, approaching the theoretical speedup.

Low α means:

- Draft tokens are frequently rejected, wasting the draft model’s work.
- The drafting and verification overhead can erase the gains, and can even make inference slower than standard decoding (see the quick calculation after this list).
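One way to see why low α hurts is the wall-clock speedup formula from the speculative decoding paper, which divides the expected acceptance length by the per-round cost: γ draft steps at relative cost c, plus one target verification. The values of γ and c below are illustrative assumptions, not measurements from our benchmarks.

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Leviathan et al.: (1 - alpha**(gamma + 1)) / ((1 - alpha) * (gamma * c + 1)).

    alpha: acceptance rate, gamma: speculative tokens per round,
    c: cost of one draft step relative to one target forward pass.
    """
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))

# With gamma = 5 and a draft model ~20x cheaper than the target (c = 0.05):
for alpha in (0.9, 0.7, 0.5):
    print(alpha, round(expected_speedup(alpha, gamma=5, c=0.05), 2))
# ~3.75x at alpha=0.9, ~2.35x at alpha=0.7, only ~1.58x at alpha=0.5
```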
To understand how acceptance rate impacts performance, we patched vLLM to simulate speculative decoding using theoretical acceptance rates. We bypassed the draft model and tested different speculative token counts (the number of tokens proposed by the draft model each step).
Here’s how we set it up:
We evaluated the following metrics:
- Output throughput (output tokens per second, TPS)
- End-to-end latency (E2EL) in milliseconds
- Speedup relative to the non-speculative baseline
- Acceptance length (τ): the average number of tokens accepted per round of decoding. According to the paper Fast Inference from Transformers via Speculative Decoding, the theoretical value is τ = (1 − α^(γ+1)) / (1 − α), where α is the acceptance rate and γ is the number of speculative tokens proposed per round.
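As a quick sanity check of that formula, the helper below (a standalone snippet, not part of our benchmark harness) computes the expected acceptance length for a given α and γ:

```python
def expected_accept_length(alpha: float, gamma: int) -> float:
    """Expected tokens accepted per round: (1 - alpha**(gamma + 1)) / (1 - alpha)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# For example, alpha = 0.8 with gamma = 5 speculative tokens gives ~3.69 tokens
# per round, i.e. nearly four sequential decode steps collapsed into one verification.
print(expected_accept_length(0.8, 5))
```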
Here are the results:
Key takeaways:
But how do you actually achieve a high acceptance rate in the real world?
In the previous section, we showed that speculative decoding can offer up to 3× speedups, but only in theory. Those results were based on simulated acceptance rates, not actual draft model performance.
To understand how this works in practice, we ran real-world benchmarks using both vLLM and SGLang with and without their respective EAGLE 3 draft models.
Note: EAGLE improves on vanilla speculative decoding. It reuses the top-layer features of the target model (the features before the LM head). It trains the draft model to predict the next feature and then uses the target model’s LM head to obtain the draft token. By leveraging the rich information from the target model, EAGLE achieves significantly better acceleration compared to vanilla speculative sampling.
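The toy sketch below only illustrates that data flow (target features in, draft token out through the shared LM head). The layer sizes and the single-layer draft head are made-up placeholders, not the actual EAGLE 3 architecture or training objective.

```python
import torch
import torch.nn as nn

HIDDEN, VOCAB = 512, 32000  # toy sizes, far smaller than a real Llama model

# The LM head is shared with (frozen from) the target model.
target_lm_head = nn.Linear(HIDDEN, VOCAB, bias=False)
# A single lightweight layer stands in for the trained EAGLE draft head.
draft_head = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=8, batch_first=True)

def propose_draft_token(target_features: torch.Tensor) -> torch.Tensor:
    """target_features: [batch, seq, HIDDEN] top-layer features from the target model."""
    next_feature = draft_head(target_features)[:, -1]  # predict the next feature
    logits = target_lm_head(next_feature)              # reuse the target's LM head
    return logits.argmax(dim=-1)                       # greedy draft token

print(propose_draft_token(torch.randn(1, 16, HIDDEN)).shape)  # torch.Size([1])
```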
Here is our test setup:
Here are the measured results:
Despite achieving ~2× speedups, we found that real-world acceptance lengths (τ) were lower than ideal. Based on the test results in the previous section, the acceptance rates (α) were likely in the 0.6–0.8 range, not the near-perfect values used in theory.
We believe two major factors limit acceptance rates when using out-of-the-box EAGLE 3 draft models:
Based on these findings, our conclusion is that off-the-shelf draft models rarely match the distribution of a specific target model and workload closely enough to reach the theoretical speedup. To get there, you usually need to train a draft model on your own data.
To validate our conclusion and test speculative decoding in practice, we trained a custom EAGLE 3 draft model using the official EAGLE repository.
We successfully replicated EAGLE 3 and observed the hypothetical speedup in downstream inference benchmarks.
Here is how we did it.
The EAGLE repo provides a script to build a training dataset by mixing UltraChat-200k and ShareGPT. Run the script to convert both datasets to a shared ShareGPT-style format:
```bash
python eagle/train/eagle3/prepare.py
```
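For reference, ShareGPT-style records are simply multi-turn conversations. The example below shows the general shape using the common ShareGPT field names; the exact keys emitted by the script may differ.

```python
example_record = {
    "id": "sample-0",
    "conversations": [
        {"from": "human", "value": "What is speculative decoding?"},
        {"from": "gpt", "value": "A technique that drafts tokens with a small model and verifies them with the target model."},
    ],
}
```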
Note that:

- We set `gradient_accumulation_steps = 8`. This helps fully utilize available VRAM and speeds up each epoch. If you're using fewer GPUs, adjust this value accordingly.

Once configured, training can be launched. After 10 epochs, you’ll be ready to validate your draft model.
Copy your trained config into the state directory:
```bash
cp config.json output_dir/state_<epoch>/config.json
```
Run vLLM with EAGLE 3:
```bash
VLLM_USE_V1=1 VLLM_LOGGING_LEVEL=DEBUG \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  -tp 1 \
  --speculative-config '{"method": "eagle3", "model": "path/to/state_dir", "num_speculative_tokens": 5, "draft_tensor_parallel_size": 1}'
```
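Before benchmarking, you can sanity-check the server with any OpenAI-compatible client. The snippet below assumes vLLM’s default port 8000 and the `openai` Python package.

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; port 8000 is the default for `vllm serve`.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain speculative decoding in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```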
In a separate terminal, benchmark the setup. Here is an example command:

```bash
vllm bench serve \
  --save-result \
  --sharegpt-output-len 1024 \
  --dataset-name sharegpt \
  --ramp-up-strategy linear \
  --ramp-up-start-rps 1 \
  --ramp-up-end-rps 2 \
  --label eagle3-2048-o1024-tp1 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-path ~/path/to/your/datasets \
  --endpoint-type openai-chat \
  --endpoint "/v1/chat/completions"
```
Speculative decoding is a powerful technique to speed up LLM inference. It’s also a fast-moving research area. Other methods like LayerSkip and MTP are being actively explored and may offer further gains in the future.
Through both simulation and real-world benchmarks, we showed that:

- With high simulated acceptance rates, speculative decoding can deliver up to ~3× speedups in theory.
- Off-the-shelf EAGLE 3 draft models reach roughly 2× speedups in practice, limited by lower real-world acceptance rates.
- Training a draft model on data that matches your workload is the key to closing that gap.
If your LLM application is latency-sensitive, don’t just plug in speculative decoding as a drop-in solution. Benchmark it. Tune it. Train your draft model if necessary.