# Speculative Decoding: Making Generation 3× Faster

## The short answer

Speculative decoding uses two models to generate text faster than
the large model could alone. A small, fast model called the draft
model generates several candidate tokens quickly. A large, slow model
called the target model checks all of them in one pass. Accepted
tokens are kept. At the first rejected token the target model supplies
a replacement and the remaining candidates are discarded. The result
follows exactly the same distribution as text sampled from the large
model alone but is typically generated two to three times faster.

Think of it like a senior engineer and a junior engineer. The junior
writes a first draft quickly. The senior reviews it and accepts the
parts that are correct. At the first mistake the senior writes the
fix and the junior redrafts from there. The final document is the
senior engineer's quality but produced much faster because the senior
only had to review and edit, not write from scratch.

## Where it sits

Speculative decoding is an inference-only technique. It does not
change training. It does not change the model weights. It changes
only the generation loop. You can apply it to any autoregressive
model without retraining, provided a compatible draft model is
available.

```
Without speculative decoding:
  Large model generates one token at a time
  Each token requires one full forward pass
  100 tokens = 100 forward passes of the large model

With speculative decoding:
  Small model drafts K tokens per step
  Large model verifies K tokens in one forward pass
  Accepts a prefix. Replaces the first rejected token. Drops the rest.
  Best case (all drafts accepted, K+1 tokens per step):
    100 tokens ≈ 100/(K+1) forward passes of the large model
                 + K small model passes per large model pass
  If K=4 and small model is 10× faster: ~2-3× total speedup
  when most drafts are accepted
```

The speedup comes from converting large model forward passes into
small model forward passes. The small model is much faster so
replacing even some large model passes with small model passes
reduces total time.
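
The accounting above can be turned into a rough cost model. The sketch below is a back-of-the-envelope estimate, not a benchmark. The names `estimate_speedup`, `mean_accepted`, and `draft_cost_ratio` are introduced here for illustration. It assumes every cycle costs K draft passes plus one target pass and yields the accepted draft tokens plus one token from the target.

```python
def estimate_speedup(k: int, mean_accepted: float, draft_cost_ratio: float) -> float:
    """Estimated speedup over plain decoding with the target model.

    k                 draft tokens proposed per cycle
    mean_accepted     average number of draft tokens accepted per cycle (0..k)
    draft_cost_ratio  cost of one draft pass relative to one target pass
    """
    if not 0 <= mean_accepted <= k:
        raise ValueError("mean_accepted must lie between 0 and k")
    tokens_per_cycle = mean_accepted + 1.0       # accepted drafts + one target token
    cost_per_cycle = 1.0 + k * draft_cost_ratio  # in units of one target pass
    return tokens_per_cycle / cost_per_cycle


# K=4, draft model 10x cheaper, 2.5 of 4 drafts accepted on average
print(round(estimate_speedup(4, 2.5, 0.1), 2))  # 2.5
# Nothing accepted: the drafting overhead makes it slower than plain decoding
print(round(estimate_speedup(4, 0.0, 0.1), 2))  # 0.71
```

The second call shows the failure mode: a draft model that is never accepted costs time instead of saving it.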

## How it works step by step

### Step 1: The draft phase

The target model has generated everything up to position t. We want
the next token. Instead of running the large model we run the small
model K times in sequence to produce K candidate tokens.

```
Current tokens: "The cat sat on"

Draft model generates:
  Step 1: "the"     (from "The cat sat on")
  Step 2: "mat"     (from "The cat sat on the")
  Step 3: "and"     (from "The cat sat on the mat")
  Step 4: "then"    (from "The cat sat on the mat and")

Draft proposal: ["the", "mat", "and", "then"]
```

The draft model is fast. If cost scales roughly with model size, a
draft model 10 percent the size of the target generates four tokens in
about 40 percent of the time of one large model forward pass. So
drafting four tokens costs less than half a large model token.
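
The draft loop can be sketched with a toy stand-in for the small model. Everything here is hypothetical: `draft_tokens` and `toy_draft_dist` are illustrative names, and the lookup table replaces a real network. The point it shows is that each draft step conditions on the previous draft token, and that the distribution each token was sampled from is kept for the later acceptance test.

```python
import random


def draft_tokens(draft_dist, context, k, rng):
    """Sample k tokens autoregressively from the draft model.

    draft_dist(tokens) is assumed to return a dict {token: probability}
    for the next token. Returns the drafted tokens and the distribution
    each one was sampled from.
    """
    tokens = list(context)
    drafted, dists = [], []
    for _ in range(k):
        dist = draft_dist(tokens)
        choices, weights = zip(*dist.items())
        token = rng.choices(choices, weights=weights, k=1)[0]
        drafted.append(token)
        dists.append(dist)
        tokens.append(token)  # the next draft step conditions on this token
    return drafted, dists


def toy_draft_dist(tokens):
    # Hypothetical stand-in for a real model: a fixed lookup on the last token.
    table = {
        "on": {"the": 0.85, "a": 0.15},
        "the": {"mat": 0.72, "shelf": 0.28},
        "a": {"mat": 1.0},
        "mat": {"and": 1.0},
        "shelf": {"and": 1.0},
        "and": {"then": 1.0},
        "then": {"the": 1.0},
    }
    return table[tokens[-1]]


drafted, dists = draft_tokens(toy_draft_dist, ["The", "cat", "sat", "on"], 4, random.Random(0))
print(drafted)
```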

### Step 2: The verification phase

The large model takes the original sequence plus the draft tokens
and runs one forward pass.

```
Input to large model: "The cat sat on the mat and then"

The model processes all four new tokens in a single forward pass.
For each position it computes a full next-token distribution, which
includes the probability of the draft token proposed there.

Output: P_target(· | "The cat sat on")                   scores "the"
        P_target(· | "The cat sat on the")               scores "mat"
        P_target(· | "The cat sat on the mat")           scores "and"
        P_target(· | "The cat sat on the mat and")       scores "then"
        P_target(· | "The cat sat on the mat and then")  extra token
```

One forward pass checks four tokens. If all four are accepted the
same pass also supplies a fifth token, so one large model pass did
the work of five.
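
The part of verification that is easiest to get wrong is index alignment. The sketch below assumes the usual convention that a model returns one next-token distribution per input position, so the output at index j scores the token at index j + 1. The helper name `verification_slices` is made up for this example.

```python
def verification_slices(context_len: int, k: int):
    """Output indices to read after one target pass over context + k drafts.

    Assumes the model returns one next-token distribution per input
    position, where output index j predicts the token at index j + 1.
    """
    if context_len < 1:
        raise ValueError("need at least one context token")
    # Draft token i sits at input index context_len + i, so it is scored
    # by the output at index context_len + i - 1.
    draft_indices = [context_len + i - 1 for i in range(k)]
    # The last output scores a position no draft token occupies. It
    # supplies the extra token when every draft is accepted.
    bonus_index = context_len + k - 1
    return draft_indices, bonus_index


# "The cat sat on" counted as 4 tokens (one per word, for illustration), K = 4
print(verification_slices(4, 4))  # ([3, 4, 5, 6], 7)
```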

### Step 3: The acceptance phase

For each position the large model computes a probability for the draft
token that was proposed. The acceptance rule compares that probability
with the one the draft model assigned to the same token.

```
For position t+1 (where draft proposed "the"):
  Draft model probability: 0.85
  Target model probability: 0.82

  The target probability is slightly lower so the token is accepted
  with probability 0.82 / 0.85 ≈ 0.96. Suppose the random draw
  accepts it. "the" stays.

For position t+2 (where draft proposed "mat"):
  Draft model probability: 0.72
  Target model probability: 0.35

  The target model disagrees strongly. The token is accepted with
  probability 0.35 / 0.72 ≈ 0.49. Suppose the draw rejects it. All
  tokens after this position are discarded too.

Result: "the" is accepted. "mat" "and" "then" are dropped.
The target model supplies ONE new token at position t+2, sampled
from its own distribution adjusted for the rejection.
```

The acceptance rule is based on probability ratios. If the target
model's probability for the draft token is at least the draft model's
own probability the token is always accepted. If it is lower the token
is accepted with probability equal to the ratio of the two, so a much
lower target probability usually means rejection.

Together with the replacement step this guarantees that the output
distribution is identical to what the large model would have produced
alone. The small model's errors are caught and corrected. The final
text is the large model's quality.
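
A minimal sketch of the ratio test in code, using the rule as stated in the acceptance math section: accept with probability min(1, P_target / P_draft) and stop at the first rejection. The function names are illustrative, and the inputs are just the two probabilities each model assigned to each draft token.

```python
import random


def acceptance_probability(p_target: float, p_draft: float) -> float:
    """Probability of keeping a draft token given both models' probabilities for it."""
    return min(1.0, p_target / p_draft)


def count_accepted(target_probs, draft_probs, rng):
    """Number of leading draft tokens kept. Everything after the first
    rejection is dropped, even if it would have passed on its own."""
    accepted = 0
    for p, q in zip(target_probs, draft_probs):
        if rng.random() < acceptance_probability(p, q):
            accepted += 1
        else:
            break
    return accepted


# The two positions from the walkthrough above
print(round(acceptance_probability(0.82, 0.85), 3))  # 0.965
print(round(acceptance_probability(0.35, 0.72), 3))  # 0.486
```

Note that a probability gap like 0.72 versus 0.35 does not force a rejection. It makes one about as likely as a coin flip.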

### Step 4: Continue

After acceptance and rejection the sequence has been extended by at
least one token. Accepted tokens stay. The first rejected position is
filled by the large model and any drafts after it are discarded. The
process repeats from the new end of the sequence.

```
Sequence after one cycle: "The cat sat on the shelf"

Draft again: "and" "stared" "out" "the"
Verify. Accept "and". Reject "stared" and onward.
Regenerate: "looked"

Sequence: "The cat sat on the shelf and looked"

Continue until enough tokens are generated.
```

## The acceptance math

The key property of speculative decoding is that its output follows
exactly the distribution the target model would have sampled from.
With greedy decoding that means the same tokens. This is not an
approximation. It is exact.

The acceptance test for each draft token x at position p is:

```
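1. Compute target model probability:  P_target(x | context)
2. Compute draft model probability:   P_draft(x | context)
3. Accept if: P_target(x) ≥ P_draft(x)
   If P_target(x) < P_draft(x): accept with probability
   P_target(x) / P_draft(x)
4. On rejection: discard the remaining draft tokens and sample a
   replacement from max(0, P_target(·) - P_draft(·)), renormalized
5. If all K draft tokens are accepted: sample one extra token from
   the target distribution at the next position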

When the target model thinks the token is more likely than the draft
model did it is always accepted. When the target thinks it is less
likely it is accepted with probability equal to the ratio. A rejected
token is replaced by a sample from the probability mass the draft
model under-proposed. Together these steps guarantee the output
follows the target model's exact distribution.

The proof is a few lines of probability theory. But you do not need
to understand the proof to use the technique. The key takeaway is
that speculative decoding gives you the exact quality of the large
model. Not an approximation. Not a distillation. Exact.
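
The exactness claim can be checked directly on a small vocabulary. The sketch below assumes the standard replacement step: after a rejection the token is resampled from max(0, P_target - P_draft), renormalized. It computes the exact distribution of the emitted token with rational arithmetic and compares it to the target distribution. The three-word vocabulary and its probabilities are made up for the example.

```python
from fractions import Fraction as Fr


def residual(p, q):
    """Distribution to resample from after a rejection: max(0, p - q), renormalized."""
    raw = {x: max(Fr(0), p[x] - q.get(x, Fr(0))) for x in p}
    total = sum(raw.values())
    return {x: v / total for x, v in raw.items()}


def output_distribution(p, q):
    """Exact distribution of the token emitted by draft, accept/reject, resample."""
    # P(draft proposes x and x is accepted) = q(x) * min(1, p(x)/q(x)) = min(p(x), q(x))
    accept = {x: min(p.get(x, Fr(0)), q[x]) for x in q}
    p_reject = 1 - sum(accept.values())
    if p_reject == 0:  # only happens when p and q are identical
        return {x: accept.get(x, Fr(0)) for x in p}
    res = residual(p, q)
    return {x: accept.get(x, Fr(0)) + p_reject * res[x] for x in p}


p = {"mat": Fr(35, 100), "shelf": Fr(50, 100), "floor": Fr(15, 100)}  # target
q = {"mat": Fr(72, 100), "shelf": Fr(20, 100), "floor": Fr(8, 100)}   # draft
print(output_distribution(p, q) == p)  # True
```

Here the draft over-proposes "mat", so the rejection mass of 0.37 is redistributed to "shelf" and "floor" in exactly the amounts needed to restore the target probabilities.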

## What makes a good draft model

The draft model should be much smaller than the target model but
share the same tokenizer. It should be trained on similar data so
its predictions correlate with the target model's.

Good draft models:
- A smaller version of the same architecture (7B draft for 70B target)
- A distilled version of the target model
- The same model with fewer layers or smaller dimensions
- A completely different fast model with the same tokenizer
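
How well a candidate draft model matches the target can be measured per position. Under the acceptance rule, the probability that a token sampled from the draft distribution is accepted equals the sum over tokens of min(P_target, P_draft). The sketch below computes that overlap for two hypothetical draft distributions. The function name and all probabilities are illustrative.

```python
def expected_acceptance(p, q):
    """Probability that a token sampled from q passes the acceptance test against p."""
    return sum(min(p.get(x, 0.0), qx) for x, qx in q.items())


p = {"mat": 0.35, "shelf": 0.50, "floor": 0.15}            # target
close_draft = {"mat": 0.40, "shelf": 0.45, "floor": 0.15}  # similar to the target
poor_draft = {"mat": 0.90, "shelf": 0.05, "floor": 0.05}   # overconfident in "mat"

print(round(expected_acceptance(p, close_draft), 2))  # 0.95
print(round(expected_acceptance(p, poor_draft), 2))   # 0.45
```

Averaging this quantity over real prompts is one way to compare draft candidates before measuring end-to-end speed.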

The draft model does not need to be good at generating text on its
own. It only needs to get enough tokens right that the acceptance
rate is reasonable. Even at 60 percent acceptance the speedup is
significant: if three of five drafted tokens are accepted, one target
pass yields those three tokens plus one of its own, so three large
model passes are saved for the cost of five small model passes.

```
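Acceptance rate = average fraction of the K draft tokens kept per cycle.
Each cycle also yields one token from the target model.

Acceptance rate 50% with K=5 draft tokens (2.5 + 1 = 3.5 tokens/cycle):
Old: 100 large model passes
New: 100/3.5 ≈ 29 large model passes + 29 × 5 ≈ 145 small model passes
Speedup: ~2.3x (if small model is 10% of large model cost)

Acceptance rate 80% with K=5 draft tokens (4 + 1 = 5 tokens/cycle):
Old: 100 large model passes
New: 100/5 = 20 large model passes + 20 × 5 = 100 small model passes
Speedup: ~3.3x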
```
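
The figures above treat the acceptance rate as an average over drafted tokens. A common alternative is to assume each draft token is accepted independently with probability alpha. Because acceptance stops at the first rejection, the expected number of tokens per cycle is then a geometric sum, (1 - alpha^(K+1)) / (1 - alpha), which grows more slowly than K. The sketch below uses that simplified model to pick K. The independence assumption and the function names are illustrative, and real acceptance varies by position and prompt.

```python
def expected_tokens_per_cycle(alpha: float, k: int) -> float:
    """Expected tokens per target pass if each draft token is accepted
    independently with probability alpha (a simplifying assumption).
    Counts the accepted prefix plus the one token the target supplies."""
    if alpha == 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def best_k(alpha: float, draft_cost_ratio: float, max_k: int = 16) -> int:
    """K that maximizes estimated speedup under the same assumption."""
    def speedup(k):
        return expected_tokens_per_cycle(alpha, k) / (1.0 + k * draft_cost_ratio)
    return max(range(1, max_k + 1), key=speedup)


print(round(expected_tokens_per_cycle(0.8, 5), 2))  # 3.69
print(best_k(0.8, 0.1))                             # 6
```

Under this model a longer draft is not always better. Past a certain K the extra draft passes cost more than the rarely accepted tail tokens return.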

## A simplified implementation

```python
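import random

import torch
import torch.nn.functional as F


@torch.no_grad()
def speculative_generate(target_model, draft_model, tokenizer,
                         prompt, max_new_tokens=100, K=5, max_seq_len=1024):
    """
    Generate text using speculative decoding.
    The output follows the same distribution as sampling from
    target_model alone. Assumes both models map a (1, T) tensor of token
    ids to (logits, _) with logits of shape (1, T, vocab), that
    tokenizer.encode returns a non-empty list of ints, and that
    max_seq_len is larger than K.
    """
    input_ids = list(tokenizer.encode(prompt))
    end_len = len(input_ids) + max_new_tokens

    while len(input_ids) < end_len:
        n = len(input_ids)

        # Phase 1: Draft K tokens with the small model
        draft_ids = list(input_ids)
        draft_dists = []  # full draft distribution at each drafted position
        for _ in range(K):
            ctx = torch.tensor([draft_ids[-max_seq_len:]])
            logits, _ = draft_model(ctx)
            probs = F.softmax(logits[0, -1, :], dim=-1)
            next_token = torch.multinomial(probs, num_samples=1).item()
            draft_dists.append(probs)
            draft_ids.append(next_token)

        # Phase 2: Verify with large model in one pass
        window = draft_ids[-max_seq_len:]
        offset = len(draft_ids) - len(window)  # tokens cropped from the front
        target_logits, _ = target_model(torch.tensor([window]))
        target_probs = F.softmax(target_logits[0], dim=-1)  # (T, vocab)

        # Phase 3: Accept or reject
        # The output at index j predicts the token at index j + 1.
        accepted = 0
        next_dist = target_probs[-1]  # used for the extra token if all K pass
        for i in range(K):
            token = draft_ids[n + i]
            p = target_probs[n + i - 1 - offset]  # target distribution here
            q = draft_dists[i]                    # draft distribution here
            if random.random() < min(1.0, (p[token] / q[token]).item()):
                accepted += 1
            else:
                # Rejected: resample from max(0, p - q), renormalized
                residual = torch.clamp(p - q, min=0.0)
                next_dist = residual / residual.sum()
                break

        # Keep accepted draft tokens, then add one token from the target:
        # the replacement after a rejection, or an extra token otherwise
        input_ids = draft_ids[:n + accepted]
        input_ids.append(torch.multinomial(next_dist, num_samples=1).item())

    # A cycle can overshoot by up to K tokens, so crop to the requested length
    return tokenizer.decode(input_ids[:end_len])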
```

The crucial detail: the target model processes all K draft tokens in
a single forward pass (Phase 2). This is where the speedup comes from.
One target model forward pass checks K draft tokens. If all K are
accepted, one pass has done the work of at least K passes.

## Why this matters in practice

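As an illustration, suppose a 70 billion parameter target model
generates about 10 tokens per second and a 7 billion parameter draft
model about 100 tokens per second on the same hardware. With K=4 each
cycle takes about 4 × 10 ms of drafting plus 100 ms of verification,
or 140 ms. At an acceptance rate of 70 percent a cycle keeps about 2.8
draft tokens plus one token from the target, so the system generates
roughly 27 tokens per second. That is about a 2.7x speedup with zero
quality loss.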

For a chat application where users expect responses in under a second
this is the difference between feasible and frustrating. For a batch
processing pipeline generating millions of tokens per day this is the
difference between one GPU and three.

## The KV cache interaction

Speculative decoding works with KV caches. The target model's cache
persists across verification steps so each pass only processes the
new draft tokens. The draft model maintains its own cache. After
verification both caches must hold entries for the kept tokens only.
Entries for rejected draft tokens are rolled back.

Cache management is the trickiest part of implementation. Incorrect
cache handling leads to corrupted generations where the model
attends to tokens it should not see or fails to attend to tokens it
should see. Most production speculative decoding implementations
spend more code on cache management than on the acceptance logic.
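
The rollback itself is simple to state even if real implementations are not. The sketch below uses a toy cache with one entry per token position. `ToyKVCache` is a hypothetical class for illustration. A real cache stores key and value tensors per layer, but the bookkeeping is the same: extend during verification, then truncate to the kept length.

```python
class ToyKVCache:
    """Stand-in for a key/value cache: one entry per token position."""

    def __init__(self):
        self.entries = []

    def extend(self, new_entries):
        """Add entries for newly processed positions."""
        self.entries.extend(new_entries)

    def truncate(self, length: int):
        """Drop entries for positions >= length (rollback after a rejection)."""
        if length > len(self.entries):
            raise ValueError("cannot truncate to a longer length")
        del self.entries[length:]


cache = ToyKVCache()
cache.extend(["The", "cat", "sat", "on"])    # prompt (token strings as placeholders)
cache.extend(["the", "mat", "and", "then"])  # verification pass over 4 drafts
accepted = 1                                 # only "the" survived
cache.truncate(4 + accepted)                 # forget "mat", "and", "then"
print(cache.entries)  # ['The', 'cat', 'sat', 'on', 'the']
```

The replacement token sampled by the target is not in the cache yet. Its entry is computed on the next forward pass.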

## What you need to remember

Speculative decoding uses a small fast draft model to propose tokens
and a large slow target model to verify them. Accepted tokens are
kept. The first rejected token is replaced by the target model and
the rest are redrafted. The output distribution is mathematically
identical to the target model alone. The speedup comes from replacing
expensive target model passes with cheap draft model passes.

The acceptance rate depends on how well the draft model predicts what
the target model would have produced. A well-matched draft model that
is 10 percent the size of the target can often reach 70 to 80 percent
acceptance, though this varies with the task. With four candidate
tokens per step this yields a 2x to 3x speedup.

Speculative decoding is a pure inference optimization. No retraining.
No weight changes. No quality tradeoff. The price is serving a second,
smaller model and managing two caches.