# Temperature Top-K and Top-P: Controlling Text Generation

## What are they

Temperature top-k and top-p are three knobs you can turn to
control how a language model generates text. They change how the
model picks the next word. Without them the model would always
pick the single most likely word. The output would be boring and
repetitive. With these knobs you can make the output more focused
or more creative. You can make it safe or adventurous.

Think of the model as a chef choosing ingredients. Without any
controls the chef always picks the most common ingredient for
every dish. Every meal is chicken. Every dessert is vanilla.
Temperature tells the chef to sometimes try the less common
ingredients. Top-k limits the pantry to only the most sensible
options. Top-p lets the chef grab ingredients until they have
enough variety and then stops.

## Where are they used

These knobs are applied right after the model produces its raw
scores and right before it picks the next token.

```
Model output (logits for 50257 tokens)
  → Divide by temperature
    → Keep only top-k tokens
      → Keep tokens until cumulative probability exceeds top-p
        → Softmax to probabilities
          → Pick one token randomly
```

They are used during text generation only. Not during training.
During training the model always sees the correct answer. During
generation there is no correct answer. The model must explore the
space of possible next words. These knobs control how it explores.

## Why we need them

Without any controls the model does one thing: it picks the token
with the highest probability. Always. Every time.

```
Prompt: "The cat sat on the"
Model prediction always: "mat"
Generated: "The cat sat on the mat. The cat sat on the mat. The cat sat on the mat..."
```

The output loops. It gets stuck in a cycle. This happens because
the highest probability path through language is often a loop.
Once the model says *The cat sat on the mat* the next most
likely continuation is *The cat sat on the mat* again. The
probabilities form a trap.

The knobs break this trap by introducing controlled randomness.
Instead of always picking the top token the model sometimes
picks the second best or the third best. The output stays
sensible but avoids repetition.

## Temperature: how adventurous should the model be

Temperature is a simple division. Take the model's raw scores and
divide every one by the temperature.

```
Low temperature (0.3):
  Scores are amplified. The top token gets even more probability.
  The model is confident and predictable.

  Prompt: "The capital of France is"
  Output: "Paris, which is located in the Île-de-France region."

High temperature (1.5):
  Scores are flattened. All tokens get more equal probability.
  The model is creative and unpredictable.

  Prompt: "The capital of France is"
  Output: "Paris, where baguettes dream of becoming croissants."
```

The math is simple. Here is a tiny example with four candidate
tokens.

```python
logits = [4.0, 2.0, 1.0, 0.5]

# Temperature 0.5 (focused)
scaled = [4.0/0.5, 2.0/0.5, 1.0/0.5, 0.5/0.5]
       = [8.0, 4.0, 2.0, 1.0]
probs  = softmax([8.0, 4.0, 2.0, 1.0])
       = [0.97, 0.02, 0.01, 0.00]
# Token 0 has 97% chance. Very confident.

# Temperature 2.0 (creative)
scaled = [4.0/2.0, 2.0/2.0, 1.0/2.0, 0.5/2.0]
       = [2.0, 1.0, 0.5, 0.25]
probs  = softmax([2.0, 1.0, 0.5, 0.25])
       = [0.48, 0.18, 0.18, 0.16]
# Token 0 has only 48% chance. Much more spread out.
```

Temperature of 0 means always pick the most likely token. This
is called greedy decoding. Temperature of 1 means use the natural
probabilities with no modification. Temperature above 1 makes the
model more random. Temperature below 1 makes the model more
focused.

## Top-K: only consider the best options

Temperature spreads the probabilities but even a tiny probability
is still a chance for complete nonsense. Top-k puts a hard limit.
Only the k most likely tokens are considered. Everything else gets
probability zero.

```python
# All 50257 tokens have some probability after temperature
# With top-k=50 we keep only the 50 most likely ones

v, _ = torch.topk(logits, 50)
logits[logits < v[:, -1:]] = float('-inf')
# Now only 50 tokens have non zero probability
```

The magic number is often 50. This eliminates truly nonsensical
completions while keeping enough variety for interesting output.
A smaller k like 10 makes the output more focused. A larger k
like 200 makes it more varied.

## Top-P: dynamic cutoff based on confidence

Top-k always keeps exactly k tokens. But the model's confidence
varies from word to word. Sometimes the model is very sure and
only a few tokens are reasonable. Sometimes the model is unsure
and many tokens are plausible. Top-p adapts to the situation.

Top-p also called nucleus sampling keeps the smallest set of
tokens whose cumulative probability exceeds p.

```
Tokens sorted by probability:
[0.45, 0.22, 0.13, 0.08, 0.05, 0.03, 0.02, 0.01, 0.01]

Top-p = 0.9:
  Cumulative: 0.45 > keep
  Cumulative: 0.45 + 0.22 = 0.67 > keep
  Cumulative: 0.45 + 0.22 + 0.13 = 0.80 > keep
  Cumulative: 0.45 + 0.22 + 0.13 + 0.08 = 0.88 > keep
  Cumulative: 0.45 + 0.22 + 0.13 + 0.08 + 0.05 = 0.93 > stop!
  Keep first 5 tokens. Drop the rest.

Top-p = 0.5:
  Cumulative: 0.45 > keep
  Cumulative: 0.45 + 0.22 = 0.67 > stop!
  Keep first 2 tokens.
```

When the model is very confident the top few tokens might already
have total probability 0.9. Top-p keeps just those few. When the
model is uncertain it takes many more tokens to reach 0.9. Top-p
keeps more options. This adaptive behavior is why top-p is often
preferred over top-k.

## The recommended combination

Most production systems use all three together.

```python
logits = logits / temperature          # Step 1: control randomness
logits = filter_top_k(logits, k=50)   # Step 2: eliminate nonsense
logits = filter_top_p(logits, p=0.9)  # Step 3: adapt to confidence
probs = softmax(logits)                # Step 4: convert to probabilities
next_token = sample(probs)            # Step 5: pick one
```

A common default that works well for general conversation is
temperature 0.7 with top-p 0.9 and top-k 50. For factual
responses lower the temperature. For creative writing raise it.

## A tiny code example

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    # Apply temperature
    logits = logits / temperature

    # Top-k filtering
    if top_k is not None:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[:, -1:]] = float('-inf')

    # Top-p filtering
    if top_p is not None:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probs = torch.cumsum(
            F.softmax(sorted_logits, dim=-1), dim=-1)

        sorted_mask = cumulative_probs > top_p
        sorted_mask[:, 1:] = sorted_mask[:, :-1].clone()
        sorted_mask[:, 0] = False

        mask = sorted_mask.scatter(1, sorted_indices, sorted_mask)
        logits[mask] = float('-inf')

    # Sample
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Test
logits = torch.tensor([[4.0, 2.0, 1.5, 0.8, 0.3, 0.1, 0.05, 0.02]])

print("Same prompt different temperatures:")
for temp in [0.3, 0.7, 1.5]:
    sampled = []
    for _ in range(5):
        t = sample_next_token(logits, temperature=temp, top_k=5)
        sampled.append(t.item())
    print(f"  T={temp}: samples={sampled}")
```

## What you need to remember

Temperature top-k and top-p control how the model picks the next
token during text generation. Temperature adjusts the randomness
of the whole distribution. Top-k keeps only the best k options.
Top-p adapts the number of options based on the model's confidence.

Without these controls text generation would be deterministic and
repetitive. The model would loop on the same phrases forever. With
these controls generation becomes varied and natural. Different
temperatures give different writing styles from the same model.
This is why the same language model can write both technical
documentation and poetry. The model is the same. The knobs are
different.