# Speculative Decoding: Making Generation 3× Faster

## The short answer

Speculative decoding uses two models to generate text faster than the large model could alone. A small, fast model called the draft model generates several candidate tokens quickly. A large, slow model called the target model checks all of them in one pass. Accepted tokens are kept. At the first rejected token, the large model supplies a replacement and the remaining drafts are discarded. The result follows exactly the same distribution as text the large model would have produced alone, but is typically generated two to three times faster.

Think of it like a senior engineer and a junior engineer. The junior writes a first draft quickly. The senior reviews it and accepts the parts that are correct. The senior rewrites the parts that are wrong. The final document has the senior engineer's quality but is produced much faster, because the senior only had to edit, not write from scratch.

## Where it sits

Speculative decoding is an inference-only technique. It does not change training. It does not change the model weights. It changes only the generation loop. You can apply it to any autoregressive model without retraining, provided a compatible draft model is available.

```
Without speculative decoding:
  Large model generates one token at a time
  Each token requires one full forward pass
  100 tokens = 100 forward passes of the large model

With speculative decoding:
  Small model drafts K tokens per step
  Large model verifies K tokens in one forward pass
  Accepts some. Rejects and regenerates others.
  100 tokens ≈ 100/K forward passes of the large model (best case: every draft accepted)
             + 100 forward passes of the small model

If K=4 and small model is 10× faster: ~2-3× total speedup
```

The speedup comes from converting large model forward passes into small model forward passes. The small model is much faster, so replacing even some large model passes with small model passes reduces total time.

## How it works step by step

### Step 1: The draft phase

The target model has generated everything up to position t. We want the next token.
Instead of running the large model, we run the small model K times in sequence to produce K candidate tokens.

```
Current tokens: "The cat sat on"

Draft model generates:
  Step 1: "the"  (from "The cat sat on")
  Step 2: "mat"  (from "The cat sat on the")
  Step 3: "and"  (from "The cat sat on the mat")
  Step 4: "then" (from "The cat sat on the mat and")

Draft proposal: ["the", "mat", "and", "then"]
```

The draft model is fast. If latency scaled with size, generating four tokens with a model that is 10 percent the size would take about 40 percent of the time of one large model forward pass. So drafting four tokens costs less than half a large model token.

### Step 2: The verification phase

The large model takes the original sequence plus the draft tokens and runs one forward pass.

```
Input to large model: "The cat sat on the mat and then"

The model processes all four new tokens in a single forward pass.
For each position it computes the probability of the draft token
that was at that position.

Output:
  probability(correct for position t+1)
  probability(correct for position t+2)
  probability(correct for position t+3)
  probability(correct for position t+4)
```

One forward pass checks four tokens. If they are all correct, we just saved three large model forward passes.

### Step 3: The acceptance phase

For each position, the large model computes a probability for the draft token that was proposed, along with its own full distribution over the next token. The acceptance rule compares the probabilities from both models.

```
For position t+1 (where draft proposed "the"):
  Draft model probability:  0.85
  Target model probability: 0.82
  The target probability is slightly lower than the draft probability,
  so the token is accepted with probability 0.82 / 0.85 ≈ 0.96.
  Suppose the random draw accepts. "the" stays.

For position t+2 (where draft proposed "mat"):
  Draft model probability:  0.72
  Target model probability: 0.35
  The target model disagrees strongly. The acceptance probability
  is only 0.35 / 0.72 ≈ 0.49. Suppose the random draw rejects.
  All tokens after this position are discarded as well.

Result: "the" is accepted.
"mat" "and" "then" are rejected. The target model generates ONE new token at position t+2. ``` The acceptance rule is based on probability ratios. If the target model's probability for the draft token is similar to or higher than the draft model's own probability the token is accepted. If the target probability is much lower the token is rejected. This guarantees that the output distribution is identical to what the large model would have produced alone. The small model's errors are caught and corrected. The final text is the large model's quality. ### Step 4: Continue After acceptance and rejection the sequence has been extended by some number of tokens. Accepted tokens stay. Rejected positions are filled by the large model. The process repeats from the new end of the sequence. ``` Sequence after one cycle: "The cat sat on the shelf" Draft again: "and" "stared" "out" "the" Verify. Accept "and". Reject "stared" and onward. Regenerate: "looked" Sequence: "The cat sat on the shelf and looked" Continue until enough tokens are generated. ``` ## The acceptance math The magic of speculative decoding is that the output is mathematically identical to what the target model would have generated. This is not an approximation. It is exact. The acceptance test for each draft token x at position p is: ``` 1. Compute target model probability: P_target(x | context) 2. Compute draft model probability: P_draft(x | context) 3. Accept if: P_target(x) ≥ P_draft(x) If P_target(x) < P_draft(x): accept with probability P_target(x) / P_draft(x) ``` When the target model thinks the token is more likely than the draft model did it is always accepted. When the target thinks it is less likely it is accepted proportionally to the ratio. This sampling procedure guarantees the output follows the target model's exact distribution. The proof is a few lines of probability theory. But you do not need to understand the proof to use the technique. 
The key takeaway is that speculative decoding gives you the exact quality of the large model. Not an approximation. Not a distillation. Exact.

## What makes a good draft model

The draft model should be much smaller than the target model but share the same tokenizer. It should be trained on similar data so its predictions correlate with the target model's.

Good draft models:

- A smaller version of the same architecture (7B draft for 70B target)
- A distilled version of the target model
- The same model with fewer layers or smaller dimensions
- A completely different fast model with the same tokenizer

The draft model does not need to be good at generating text on its own. It only needs to get enough tokens right that the acceptance rate is reasonable. Even with 60 percent acceptance the speedup is significant, because accepting three out of five tokens means three large model passes saved for the cost of five small model passes.

```
Acceptance rate 50% with K=5 draft tokens (about 2.5 tokens kept per cycle):
  Old: 100 large model passes
  New: 100/2.5 = 40 large model passes + 40 × 5 = 200 small model passes
  Cost: 40 + 200 × 0.1 = 60 large-pass equivalents
  Speedup: ~1.7x (if small model is 10% of large model cost)

Acceptance rate 80% with K=5 draft tokens (about 4 tokens kept per cycle):
  Old: 100 large model passes
  New: 100/4 = 25 large model passes + 25 × 5 = 125 small model passes
  Cost: 25 + 125 × 0.1 = 37.5 large-pass equivalents
  Speedup: ~2.7x
```

## A simplified implementation

```python
import random

import torch
import torch.nn.functional as F


def speculative_generate(target_model, draft_model, tokenizer, prompt,
                         max_new_tokens=100, K=5):
    """
    Generate text using speculative decoding.

    The output follows the same distribution as sampling from
    target_model alone, but is produced faster.
""" input_ids = tokenizer.encode(prompt) while len(input_ids) < max_new_tokens: # Phase 1: Draft K tokens with the small model draft_ids = input_ids.copy() draft_probs = [] for _ in range(K): logits, _ = draft_model(draft_ids[-max_seq_len:]) probs = F.softmax(logits[:, -1, :], dim=-1) next_token = torch.multinomial(probs, num_samples=1).item() draft_probs.append(probs[0, next_token].item()) draft_ids.append(next_token) # Phase 2: Verify with large model in one pass full_sequence = draft_ids[-max_seq_len:] target_logits, _ = target_model(full_sequence) target_probs = F.softmax(target_logits, dim=-1) # Phase 3: Accept or reject accepted = 0 for i in range(K): pos = len(input_ids) - K + i draft_token = draft_ids[pos] target_prob = target_probs[0, pos, draft_token].item() draft_prob = draft_probs[i] if target_prob >= draft_prob: accepted += 1 elif random.random() < target_prob / draft_prob: accepted += 1 else: break # Keep accepted tokens input_ids = draft_ids[:len(input_ids) - K + accepted] # If no tokens accepted sample one from target if accepted == 0: target_probs = F.softmax(target_logits[:, -1, :], dim=-1) next_token = torch.multinomial(target_probs, num_samples=1).item() input_ids.append(next_token) return tokenizer.decode(input_ids) ``` The crucial detail: the target model processes all K draft tokens in a single forward pass (line 26). This is where the speedup comes from. One target model forward pass checks K draft tokens. If most are accepted we save K minus one target model passes. ## Why this matters in practice A 70 billion parameter model generates about 10 tokens per second on an A100 GPU. A 7 billion parameter draft model generates about 100 tokens per second. With K=4 and an acceptance rate of 70 percent the speculative decoding system generates about 25 tokens per second. A 2.5x speedup with zero quality loss. For a chat application where users expect responses in under a second this is the difference between feasible and frustrating. 
For a batch processing pipeline generating millions of tokens per day, this is the difference between one GPU and three.

## The KV cache interaction

Speculative decoding works with KV caches. The target model's cache is reused across verification steps, and the draft model maintains its own cache. After acceptance, the target model's cache is extended to include the accepted tokens. After rejection, it is rolled back to just before the rejected tokens, and the draft model's cache needs the same rollback.

Cache management is the trickiest part of implementation. Incorrect cache handling leads to corrupted generations where the model attends to tokens it should not see or fails to attend to tokens it should see. Production speculative decoding implementations often spend more code on cache management than on the acceptance logic.

## What you need to remember

Speculative decoding uses a small, fast draft model to propose tokens and a large, slow target model to verify them. Accepted tokens are kept. The first rejected token is replaced by the target model and the drafts after it are discarded. The output distribution is mathematically identical to the target model alone. The speedup comes from replacing expensive target model passes with cheap draft model passes.

The acceptance rate depends on how well the draft model predicts what the target model would have produced. A draft model that is 10 percent the size of the target can achieve 70 to 80 percent acceptance on text it predicts well. With four candidate tokens per step, this yields roughly a 2x to 3x speedup.

Speculative decoding is a pure inference optimization. No retraining. No weight changes. No quality tradeoff. It is close to free speed for any model serving system that can afford to host the draft model.
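
As a rough sanity check on the speedup figures, the sketch below estimates speedup from a simple cost model. It assumes, as a simplification not stated in the text, that each draft token is accepted independently with probability `alpha`, that one cycle costs K draft passes plus one target pass, and that every cycle ends with one extra token from the target (the replacement or the bonus token). Under those assumptions the expected number of tokens per cycle is (1 - alpha^(K+1)) / (1 - alpha). The function names and the `draft_cost` parameter are illustrative only.

```python
def expected_tokens_per_cycle(alpha, K):
    """Expected tokens produced per draft-and-verify cycle.

    alpha: assumed independent per-token acceptance probability.
    K: number of draft tokens proposed per cycle.
    """
    if alpha >= 1.0:
        return K + 1  # every draft accepted, plus the bonus token
    return (1 - alpha ** (K + 1)) / (1 - alpha)


def estimated_speedup(alpha, K, draft_cost):
    """Speedup over plain decoding, which costs 1 target pass per token.

    draft_cost: cost of one draft pass as a fraction of one target pass.
    """
    cycle_cost = K * draft_cost + 1  # K draft passes + 1 target pass
    return expected_tokens_per_cycle(alpha, K) / cycle_cost


for alpha in (0.7, 0.8, 0.9):
    print(alpha, round(estimated_speedup(alpha, K=4, draft_cost=0.1), 2))
```

Here `alpha` is a per-token probability, which is a stricter notion than the fraction of drafted tokens accepted, so these estimates come out more conservative than a simple rate-times-K calculation. Real systems also depend on batch size, memory bandwidth, and cache handling, so treat the output as an order-of-magnitude guide.
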