# KV Cache: Making Text Generation Fast ## What is it The KV cache stores the Key and Value vectors from all previous tokens during text generation. When the model generates the next token it reuses these stored vectors instead of recomputing them from scratch. This makes generation hundreds of times faster. Think of it like writing a long email. Without a KV cache you would reread the entire email from the start every time you typed a new word. With a KV cache you remember everything you already wrote and only think about the new word. The difference in effort is enormous. ## Where is it used The KV cache lives inside the attention layer. Every attention head in every transformer block has its own cache. For a twelve block model with twelve heads there are one hundred and forty four separate caches. Each stores the Keys and Values for every token generated so far. ``` Text generation without cache: For each new token: Run the ENTIRE sequence through all layers This recomputes K and V for every past token Time grows quadratically with sequence length Text generation with cache: For each new token: Only compute K and V for the new token Append to cache Reuse cached K and V for all past tokens Time grows linearly with sequence length ``` ## Why we need it The attention formula is Q times K transpose. The Q matrix has one row per token. The K matrix has one row per token. If we have five hundred tokens we multiply a five hundred by sixty four matrix by a sixty four by five hundred matrix. That is already some work. Without a cache we would do this multiplication from scratch for every new token. When the sequence is one token long we do one comparison. When it is two tokens long we do two comparisons for token zero and two for token one. When it is five hundred tokens long we do five hundred comparisons for each of the five hundred tokens. The total work grows with the square of the sequence length. This is painfully slow. With a cache we only compute the comparisons involving the new token. Token five hundred compares itself against all five hundred previous tokens. That is five hundred new comparisons. Not five hundred squared. The cached Keys save us from recomputing the old comparisons that have not changed. ### The speed numbers ``` Sequence length 100: Without cache: 100² = 10,000 comparisons per generation step With cache: 100 comparisons per generation step Speedup: 100× Sequence length 1000: Without cache: 1,000² = 1,000,000 comparisons With cache: 1,000 comparisons Speedup: 1,000× ``` For a long conversation with thousands of tokens the KV cache makes the difference between waiting seconds and waiting minutes for each new word. ## When was it invented The KV cache was described in the original transformer paper in 2017. It was not an optimization added later. It was part of the design from the beginning. The authors knew that autoregressive generation would be painfully slow without it. Every transformer implementation since 2017 uses some form of KV cache. ## How it works step by step ### Step 1: the first token We have a prompt that is five tokens long. The model processes all five tokens in parallel during a prefill step. ``` Prefill: Tokens: [The, cat, sat, on, the] K cache for head 0: store K for all 5 tokens (5 × 64 matrix) V cache for head 0: store V for all 5 tokens (5 × 64 matrix) ... repeat for all 12 heads ... ``` This initial step is expensive but it only happens once. ### Step 2: generating the next token The model predicts that the next token is *mat*. We append *mat* to the sequence. Now we have six tokens. ``` Generation step 1: New token: [mat] Compute K for "mat" only: (1 × 64 matrix) Compute V for "mat" only: (1 × 64 matrix) Full K cache: 5 old rows + 1 new row = 6 × 64 matrix Full V cache: 5 old rows + 1 new row = 6 × 64 matrix Compute Q for "mat" only: (1 × 64 matrix) Attention scores: Q_mat × K_full^T = 1 × 6 comparisons Only the new token's Q was computed. Old Q values are not needed. ``` We computed only one new Key one new Value and one new Query. The attention scores for *mat* are computed against all six tokens because *mat* needs to attend to everything that came before. But the attention scores for *The* and *cat* and *sat* are not recomputed. They do not change. Why would they. Their context has not changed. Only the new token has new context to process. ### Step 3: memory growth For each new token we add one row to every K cache and every V cache. The cache grows linearly with the sequence length. ``` Memory for a GPT-2 Small model generating 1000 tokens: K cache: 12 layers × 12 heads × 1000 tokens × 64 dims × 2 bytes (bfloat16) = 18,432,000 bytes = 17.6 MB V cache: same as K cache = 17.6 MB Total KV cache: 35.2 MB ``` Thirty five megabytes is nothing for a modern GPU. But remember this is GPT-2 Small with only twelve layers and 768 dimensions. ``` Memory for a GPT-3 Large model generating 1000 tokens: K cache: 96 layers × 96 heads × 1000 tokens × 128 dims × 2 bytes = 2,359,296,000 bytes = 2.2 GB V cache: same as K cache = 2.2 GB Total KV cache: 4.4 GB ``` Now the cache is a significant portion of GPU memory. For long conversations spanning thousands of tokens the KV cache can become the dominant memory consumer. This is one reason why running large models at long context lengths requires enormous amounts of VRAM. ## A simplified code sketch This is not production code but it shows the idea. ```python class AttentionWithCache: def __init__(self): self.k_cache = None # Will hold accumulated K values self.v_cache = None # Will hold accumulated V values def forward(self, x, use_cache=False): q, k, v = self.qkv_proj(x) if use_cache and self.k_cache is not None: # Append new K and V to cache k = torch.cat([self.k_cache, k], dim=2) v = torch.cat([self.v_cache, v], dim=2) # Update cache for next step self.k_cache = k self.v_cache = v # Normal attention computation scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim) weights = F.softmax(scores, dim=-1) return weights @ v def reset_cache(self): self.k_cache = None self.v_cache = None ``` The key idea is on lines eleven and twelve. Instead of throwing away the old K and V we append the new ones. The cache grows over time. At the start of a new conversation we reset the cache to empty. ## The tradeoff KV cache trades memory for speed. We accept that generation will use more GPU memory in exchange for making each step much faster. For most real world applications this is a good trade. Memory is relatively cheap compared to the user's patience waiting for each word to appear. However for very long sequences the memory cost becomes prohibitive. Researchers have developed techniques like grouped query attention and multi query attention that reduce the number of K and V heads. Fewer heads means a smaller cache. These techniques are standard in modern models like LLaMA 2 and Mistral. ## What you need to remember The KV cache stores previously computed Keys and Values so they do not need to be recomputed for each new token. This changes the complexity of text generation from quadratic to linear in the sequence length. For a thousand token sequence this is a thousand times speedup. The cost is memory. The cache grows with the sequence length and the model size. For small models the memory cost is negligible. For large models at long context lengths the cache can consume tens of gigabytes. Without a KV cache text generation would be slow enough that chatbots would be unusable. Every response would take minutes instead of seconds. The cache is a necessary optimization for any real world deployment.