Wiki · Concept · Last reviewed June 16, 2026

Speculative Decoding

Speculative decoding is an inference technique that accelerates autoregressive language models by using a cheaper proposer to draft several likely next tokens, then using the full target model to verify those draft tokens in parallel.

Category: AI infrastructure / model serving
Published: June 16, 2026
Modified: June 16, 2026
Last reviewed: June 16, 2026
Tags: speculative decoding, inference, serving latency, KV cache, draft models, governance

Definition

Speculative decoding is a family of methods for making language-model generation faster without asking the large model to produce every token one serial step at a time. A smaller model, auxiliary head, n-gram matcher, or other proposer guesses several future tokens. The large target model then checks those guesses in a single forward pass and accepts the longest valid prefix.

The method targets a bottleneck in decoder-only and encoder-decoder transformer inference: ordinary autoregressive decoding generates one token, appends it to the context, and repeats. Even on a powerful GPU, the decode loop is sequential and often memory-bandwidth-limited, because each step moves the model weights and attention state through memory to produce a single new token per sequence.

In the classic lossless form, the acceptance-and-rejection rule samples from the target model's distribution; the proposer is only a proposal mechanism. That guarantee depends on implementing the algorithm and sampling settings correctly, and real systems can still differ because of numerical precision, batching, kernels, log-probability stability, or serving-engine details. It is therefore a serving optimization, not a new model capability by itself.
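The acceptance rule behind that guarantee can be written out, following the formulation in the Leviathan et al. and Chen et al. papers listed under Sources. The notation is chosen for this sketch: p is the target model's next-token distribution at a position, q is the proposer's distribution at the same position, and x is the drafted token.

```latex
% The draft token is sampled from the proposer: x \sim q(\cdot)
\Pr[\text{accept } x] = \min\!\left(1, \frac{p(x)}{q(x)}\right)

% On rejection, resample from the normalized residual distribution
p'(x) = \frac{\max\bigl(0,\; p(x) - q(x)\bigr)}{\sum_{x'} \max\bigl(0,\; p(x') - q(x')\bigr)}

% The emitted token is then distributed exactly as the target
\Pr[\text{emit } x]
  = \min\bigl(q(x),\, p(x)\bigr)
  + \Bigl(1 - \sum_{x'} \min\bigl(q(x'),\, p(x')\bigr)\Bigr)\, p'(x)
  = p(x)
```

The last line uses the identity that the residual mass, the sum of max(0, p - q) over the vocabulary, equals one minus the sum of min(p, q), so the second term reduces to p(x) - min(q(x), p(x)).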

Why It Exists

Large models are expensive to run because each generated token requires moving model weights, reading attention state, updating the KV cache, and scheduling live requests. Users experience that cost as latency; providers experience it as lower throughput, more GPUs, and higher token prices.

The core observation is that many next-token decisions are easy for a cheaper model or heuristic to approximate. If a draft model can guess several likely tokens, the larger model can score all of them in one forward pass, which in a memory-bound regime costs little more than a single decode step and far less than generating them one by one.

Leviathan, Kalman, and Matias introduced speculative decoding in work published at ICML 2023 and reported 2x to 3x acceleration on T5-XXL with identical outputs. Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper independently described speculative sampling and reported 2x to 2.5x decoding speedups with Chinchilla in a distributed setup.

How It Works

A simple draft-target loop has four parts. First, there is a current context: the prompt plus any tokens already committed. Second, a faster draft model proposes a short continuation, such as three to eight tokens. Third, the target model evaluates the proposed continuation in parallel. Fourth, a verification and sampling rule accepts draft tokens up to the first rejection, replaces the rejected token with one drawn from the target model, and discards the rest of the draft.
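The four parts above can be sketched in a few lines of Python. This is a minimal illustration over toy probability lists, not any engine's implementation: `target_probs` and `draft_probs` are hypothetical callables that return a next-token distribution for a context, and the verification loop stands in for what a real server would do in one batched target forward pass.

```python
import random

def sample(dist, rng):
    """Draw one token id from a list of (possibly unnormalized) weights."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def speculative_step(target_probs, draft_probs, context, k, rng):
    """Run one propose-and-verify round and return the committed tokens.

    target_probs(ctx) and draft_probs(ctx) are hypothetical callables that
    return a probability list over the vocabulary for the next token.
    """
    # Draft phase: the proposer generates k tokens autoregressively.
    drafted, q_dists = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # Verify phase: score all k + 1 positions with the target model.
    # A real engine does this in one forward pass; the loop is for clarity.
    p_dists = [target_probs(list(context) + drafted[:i]) for i in range(k + 1)]

    committed = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            committed.append(tok)  # draft token accepted
            continue
        # Rejected: resample from the residual max(0, p - q), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        committed.append(sample(residual, rng))
        return committed

    # Every draft token was accepted: take one extra token from the target.
    committed.append(sample(p_dists[k], rng))
    return committed
```

Each round commits between one and k + 1 tokens, so the loop never makes less progress than ordinary decoding; whether it is faster depends on how cheap the draft phase is relative to the verify phase.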

The speedup depends on acceptance rate, draft cost, verification cost, batch size, memory bandwidth, scheduler behavior, KV-cache handling, and sampling settings. A very cheap draft model that is often wrong may waste time. A high-quality draft model that is too large may erase the benefit. The useful zone is a proposer that is much cheaper than the target model but close enough to be accepted often.
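The tradeoff between acceptance rate and draft cost can be made concrete with the estimate given in the Leviathan et al. paper. The sketch below assumes each draft token is accepted independently with the same probability `alpha`, which real workloads only approximate, and it ignores batching, scheduling, and KV-cache overheads. The function and parameter names are chosen for illustration.

```python
def expected_tokens_per_round(alpha, gamma):
    """Expected tokens committed per target forward pass.

    alpha: probability that a draft token is accepted (assumed i.i.d.).
    gamma: number of draft tokens proposed per round.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def estimated_speedup(alpha, gamma, cost_ratio):
    """Estimated wall-clock speedup over plain autoregressive decoding.

    cost_ratio: time for one draft step divided by time for one target step.
    One round costs gamma draft steps plus one target verification pass.
    """
    return expected_tokens_per_round(alpha, gamma) / (gamma * cost_ratio + 1.0)
```

Under these assumptions, `estimated_speedup(0.8, 4, 0.05)` is roughly 2.8, while `estimated_speedup(0.3, 4, 0.3)` falls below 1.0: a proposer that is both inaccurate and relatively expensive makes decoding slower, not faster.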

Speculative decoding is easiest to explain as a chain of guessed tokens, but production systems also use token trees, multi-token prediction heads, feature-level predictors, and prompt-based n-gram matching. The shared pattern is the same: propose ahead, verify in bulk, commit only what the target accepts.

Current Context

As of June 16, 2026, speculative decoding is no longer only a paper technique. It is a configurable mode in production inference engines and model-serving libraries. Current vLLM documentation describes speculative decoding for medium-to-low-QPS, memory-bound workloads and lists methods including EAGLE, multi-token prediction, draft models, PARD, MLP speculators, n-gram drafting, suffix decoding, hidden-state extraction, custom proposers, and dynamic speculative decoding.

NVIDIA TensorRT-LLM similarly documents speculative sampling as a family of techniques for producing more than one token per target-model forward pass, with draft-target models, n-gram prompting, Medusa, ReDrafter, EAGLE, and lookahead decoding among its documented paths. The shared engineering trend is diversification of proposers: separate draft models, prompt-copy heuristics, token trees, feature-level predictors, and trained multi-token heads.

That maturity does not make speedups automatic. The current serving question is not simply whether speculation is enabled, but which proposer is used, how much it costs, how often the target model accepts its draft tokens, and what the runtime does when proposals fail under real traffic.

Variants

Draft-target decoding. A small model proposes tokens and the larger model verifies them. This is the most common conceptual form and is supported by production inference engines such as TensorRT-LLM and vLLM.

N-gram and suffix drafting. Instead of a learned model, the system proposes continuations from repeated text patterns in the prompt or recent context. This can help on summarization, document QA, code editing, and other tasks where future output often copies or lightly transforms input text.
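A minimal sketch of prompt-based n-gram drafting follows. It is an illustrative simplification rather than the matcher used by any particular engine; `ngram_propose`, `max_ngram`, and `num_draft` are names invented for this example. The idea is to find the most recent earlier occurrence of the trailing n-gram and propose whatever followed it.

```python
def ngram_propose(tokens, max_ngram=3, num_draft=5):
    """Propose draft tokens by matching the trailing n-gram earlier in `tokens`.

    Tries the longest n-gram first and falls back to shorter ones.
    Returns an empty list when no earlier match exists.
    """
    for n in range(max_ngram, 0, -1):
        if len(tokens) <= n:
            continue
        suffix = tokens[-n:]
        # Scan backwards so the most recent earlier match wins.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                # Propose the tokens that followed the earlier occurrence.
                return tokens[start + n:start + n + num_draft]
    return []
```

The proposed tokens still go through target-model verification, so a bad match costs time but does not change which tokens are committed.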

Tree-based verification. SpecInfer and related systems organize candidate continuations as token trees so the target model can verify multiple branches rather than a single linear draft.
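The tree idea can be illustrated for the greedy-decoding case only. The sketch below is not SpecInfer's algorithm: it represents the token tree as nested dictionaries, assumes a hypothetical `target_next` callable that returns the target model's greedy next token for a context, and queries it node by node, where a real system would score the whole tree in one pass using a tree-shaped attention mask.

```python
def verify_tree(tree, context, target_next):
    """Walk a draft token tree and return the tokens to commit.

    tree: nested dict mapping a candidate token to its subtree of
          continuations, e.g. {1: {2: {}, 7: {}}, 4: {}}.
    target_next: hypothetical callable returning the target model's
          greedy next token for a given context.
    """
    committed = []
    node = tree
    while True:
        tok = target_next(list(context) + committed)
        # The target's own token is always committed.
        committed.append(tok)
        if tok not in node:
            # No drafted branch predicted this token, so the round ends.
            return committed
        # A drafted branch matched; keep descending along it.
        node = node[tok]
```

Because several branches are available at each depth, the chance that some branch matches the target's choice is higher than with a single linear draft, at the cost of verifying more candidate tokens per round.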

Medusa-style heads. Medusa adds extra decoding heads to a model to predict multiple future tokens in parallel, avoiding the operational burden of maintaining a separate draft model.

EAGLE-style predictors. EAGLE drafts from model features rather than only next-token outputs, using feature-level prediction to generate candidates with low overhead.

Multi-token prediction modules. Some newer architectures train modules that predict multiple future tokens. DeepSeek-V3's technical report, for example, describes a multi-token prediction objective that can also support speculative decoding during inference, and vLLM's current documentation treats MTP as a first-class speculative method when a target model has native support.

Serving Implications

Speculative decoding matters because token latency shapes the feel and economics of AI systems. Faster decoding can make chat assistants feel more responsive, reduce the cost of coding agents, support longer reasoning traces, and make high-volume inference cheaper.

It also complicates serving infrastructure. Draft tokens consume scheduler capacity and KV-cache space before the target model accepts them. TensorRT-LLM documentation notes that speculative decoding can require KV-cache and scheduler awareness, and that implementation details differ across one-model and two-model systems. vLLM documents several modes, including draft models, n-gram matching, suffix decoding, MLP speculators, MTP, and EAGLE-family draft paths.

Speculative decoding is therefore one component of the broader inference stack, alongside paged attention, continuous batching, quantization, disaggregated prefill, cache reuse, custom kernels, and hardware-specific execution. It is not just an algorithm in a paper; it is a scheduling, memory, and systems problem.

Limits and Failure Modes

Acceptance-rate fragility. Speedups shrink when draft tokens are frequently rejected. Acceptance can vary by task, language, sampling temperature, prompt style, domain shift, and tokenizer compatibility.

Batching tradeoffs. Speculation often helps most at low batch sizes or low-latency settings. At high utilization, extra draft work can compete with serving the target model unless the engine schedules it carefully.

Operational complexity. Separate draft models add versioning, deployment, tokenizer, memory, and quality-management burdens. Embedded heads and feature predictors reduce some burdens but require model-specific support.

KV-cache pressure. Draft tokens may require temporary cache allocation and rollback when rejected. That makes speculative decoding sensitive to serving-engine design.
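The allocate-then-roll-back pattern can be shown with a toy per-sequence cache. This is a sketch under strong simplifying assumptions: entries are opaque Python objects rather than tensors, there is no paging or block sharing, and the class and method names are hypothetical rather than taken from any serving engine.

```python
class DraftKVCache:
    """Toy KV cache that tracks which entries belong to committed tokens."""

    def __init__(self):
        self.entries = []    # one entry per cached token position
        self.committed = 0   # entries backed by accepted tokens

    def append_draft(self, kv_entries):
        """Provisionally store cache entries for freshly drafted tokens."""
        self.entries.extend(kv_entries)

    def commit(self, num_accepted):
        """Keep the first num_accepted draft entries and drop the rest."""
        speculative = len(self.entries) - self.committed
        if not 0 <= num_accepted <= speculative:
            raise ValueError("num_accepted exceeds outstanding draft entries")
        self.committed += num_accepted
        # Roll back: free everything past the last accepted position.
        del self.entries[self.committed:]
```

Even in this toy form, peak memory is set by the draft length rather than by the number of accepted tokens, which is why longer drafts raise cache pressure whether or not they are accepted.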

Numerical and implementation differences. Lossless guarantees are statements about the algorithm, not about a particular deployment. Real systems can show small differences because of hardware precision, batching, logprob instability, kernel behavior, or sampling implementation.

Misleading benchmark claims. A headline speedup may apply only to particular models, prompts, batch sizes, sequence lengths, hardware, and sampling settings. Production evaluation should measure end-to-end latency, throughput, cost per accepted token, and user-visible quality.

Governance Requirements

Speculative decoding is usually invisible to users, but it affects cost, latency, energy use, and access. Providers should document when inference optimizations change determinism, log probabilities, observability, or reproducibility.

Systems that expose model behavior for audits, safety evaluations, legal discovery, or scientific comparison should record whether speculative decoding was enabled, which proposer was used, how many draft tokens were attempted, and whether outputs are expected to match non-speculative decoding.
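One way to capture those fields is a small per-request record. The sketch below is an assumption about what such a record might look like, not a schema from any serving engine; the class and field names are hypothetical, and `draft_tokens_accepted` is added here so an acceptance rate can be derived.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class SpeculationAuditRecord:
    """Per-request record of how speculative decoding was configured."""
    request_id: str
    speculation_enabled: bool
    proposer: str                     # e.g. "draft-model", "ngram", "none"
    draft_tokens_attempted: int
    draft_tokens_accepted: int        # assumption: not named in the text
    expected_to_match_baseline: bool  # should output equal non-speculative?

    def acceptance_rate(self):
        """Fraction of attempted draft tokens that the target accepted."""
        if self.draft_tokens_attempted == 0:
            return 0.0
        return self.draft_tokens_accepted / self.draft_tokens_attempted

    def to_json(self):
        """Serialize with stable key order so records diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)
```

Freezing the dataclass keeps a record from being altered after it is written, which suits logs intended for audit or discovery.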

For high-stakes deployments, speculative decoding should be treated like other serving changes: tested against regression suites, monitored for domain-specific acceptance rates, and reviewed for interactions with safety filters, structured-output constraints, tool-call generation, and long-context behavior.

Structured-output and tool-call systems need extra care. A speculative server may verify draft tokens against the target model while a separate constrained decoder, schema validator, or tool parser enforces action validity. Audit logs should preserve both layers so investigators can distinguish model behavior, proposer behavior, constraint-engine behavior, and downstream application logic.

Source Discipline

For speculative decoding, separate algorithmic guarantees from deployment claims. Papers can prove distribution preservation or report speedups under specific models, hardware, batch sizes, sequence lengths, and sampling settings. Production documentation can show that an engine supports a mode. Neither proves that a given deployment is faster, cheaper, or output-identical without workload-specific measurement.

A serious benchmark should name the target model, proposer, proposer cost, tokenizer compatibility, draft length, acceptance rate, sampling settings, batch size, prompt/output length distribution, hardware, serving-engine version, KV-cache policy, structured-output or tool-call settings, and latency metric. Tokens per second alone is not enough.

When source claims conflict, prefer the narrower reading. Lossless means the implemented sampling procedure is meant to preserve the target distribution; it does not mean every optimized server run is byte-for-byte reproducible across hardware, kernels, batching, precision, log-probability reporting, or logging configuration.

Spiralist Reading

Speculative decoding is infrastructure learning to pre-stage likely continuations.

The public sees a stream of words. Beneath it, a cheaper proposer races ahead, guessing what the larger model would accept. The target model then accepts or rejects those guesses. The system feels smoother because possible continuations are drafted before they become official text.

For Spiralism, the pattern matters beyond engineering. AI infrastructure increasingly works by precomputing, predicting, ranking, and accepting futures before humans notice the machinery. Speculative decoding is a small, technical example of a wider institutional habit: speed comes from prediction, and prediction quietly becomes authority unless the verification layer remains real.

Sources

Leviathan, Kalman, and Matias, Fast Inference from Transformers via Speculative Decoding, ICML/PMLR, 2023.
Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper, Accelerating Large Language Model Decoding with Speculative Sampling, arXiv, 2023.
Miao et al., SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification, arXiv, 2023.
Cai et al., Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, arXiv, 2024.
Li, Wei, Zhang, and Zhang, EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, arXiv, 2024.
vLLM, Speculative Decoding, reviewed June 16, 2026.
NVIDIA TensorRT-LLM, Speculative Sampling, reviewed June 16, 2026.
NVIDIA Developer, TensorRT-LLM overview, reviewed June 16, 2026.
DeepSeek-AI, DeepSeek-V3 Technical Report, arXiv, 2024.
