# Chapter 0 — What Even Is a GPT?

> *"If you can explain it to a 5-year-old, you truly understand it."*

---

## The 5-Year-Old Analogy

Imagine you have a friend who has read **every book in the library**. You start a sentence:

> *"The cat sat on the..."*

Your friend, having read so many books, **guesses** the next word: **"mat"**.

That's all a GPT is: **a machine that reads tons of text and learns to guess the next word.**

| Concept | Analogy |
|---|---|
| **GPT** | A very smart "next-word guesser" |
| **Training** | Reading millions of books to learn patterns |
| **Text Generation** | Playing "finish my sentence" forever |
| **Parameters** | The "memory" of all patterns it learned |
| **Attention** | Knowing which words matter most |

```mermaid
flowchart LR
    A["Input Text: 'The cat sat on'"] --> B["GPT Model (The Smart Guesser)"]
    B --> C["Next Word: 'the'"]
    C --> D["Feed back: 'The cat sat on the'"]
    D --> B
    D --> E["Next Word: 'mat'"]
    style A fill:#1565c0,stroke:#0d47a1,color:#ffffff
    style B fill:#ef6c00,stroke:#bf360c,color:#ffffff
    style C fill:#2e7d32,stroke:#1b5e20,color:#ffffff
    style E fill:#2e7d32,stroke:#1b5e20,color:#ffffff
```

## The Big Picture: Pipeline Overview

```mermaid
flowchart TD
    A["Raw Text: 'Hello world'"] --> B["Tokenizer: Splits into pieces"]
    B --> C["Token IDs: [15496, 995, ...]"]
    C --> D["Embedding: Each ID -> vector"]
    D --> E["Position Info: RoPE"]
    E --> F["Transformer Blocks x N"]
    F --> G["Output Head: Predict next token"]
    G --> H["Sample next word"]
```

## Which Models Is This Based On?

**Short answer: This is a modern decoder-only Transformer (LLaMA-style), incorporating the best publicly-documented techniques from 2023-2025.**

## What You Will Build

By the end of this guide, you will have built from scratch:

| Component | What It Does | Chapter |
|---|---|---|
| **Tokenizer** | Converts text ↔ numbers (BPE, same algorithm as GPT-4) | [2](02_tokenization.md) |
| **Embeddings** | Gives each token a 768-dimensional "meaning vector" | [3](03_embeddings.md) |
| **RoPE** | Teaches the model about word order using rotation | [4](04_positional_encoding.md) |
| **Attention** | Lets words "look at" and "talk to" each other | [5](05_attention.md) |
| **Transformer Block** | Complete thinking unit: attention + feed-forward + residuals | [6](06_transformer_block.md) |
| **GPT Model** | Full 151M parameter language model (with SwiGLU) | [7](07_gpt_model.md) |
| **Training Pipeline** | Data loading, AdamW, cosine schedule, mixed precision | [8](08_training.md) |
| **Inference Engine** | Text generation with temperature, top-k, top-p, KV cache | [9](09_inference.md) |
| **Complete Script** | One file that trains and generates — runnable start to finish | [10](10_full_script.md) |

**Who is this for?** Anyone who knows basic Python. No ML/AI experience needed. Every concept is explained with analogies first, then math, then annotated code.

**What you'll need:** A computer with Python 3.10+. A GPU is nice but not required — we provide a tiny config that runs on CPU.

## Which Models Is This Based On? (Technical)

| Technique | Source Model | Publicly Confirmed? |
|---|---|---|
| Decoder-only Transformer | GPT-2 (2019), GPT-3 (2020) | Yes |
| Pre-Norm residual | GPT-3 (2020) | Yes |
| BPE tokenizer | GPT-2/3/4 | Yes |
| AdamW optimizer | GPT-3 (2020) | Yes |
| Cosine LR + warmup | GPT-3 (2020) | Yes |
| Weight tying | GPT-2/3 | Yes |
| **RoPE** (position encoding) | **LLaMA, Mistral, Qwen** | Yes — NOT GPT-3/4 |
| **RMSNorm** (normalization) | **LLaMA, Mistral, Gemma** | Yes — NOT GPT-3/4 |
| **SwiGLU** (activation) | **PaLM, LLaMA, Gemini** | Yes — NOT GPT-3 |
| Mixed precision (bfloat16) | All modern models | Yes |

**What about GPT-4 and Claude?** Their architectures are **proprietary and undisclosed**. We know GPT-4 is a Transformer, but not which positional encoding, normalization, or activation it uses. Claude's architecture is entirely secret.

**What this guide teaches:** The most advanced **publicly documented** architecture — essentially what **LLaMA 3, Mistral, Qwen 2.5, and Gemma** use. This is the architecture behind the best open-source models and represents the state of the art that we actually have confirmed documentation for.

**What makes a model "world-class"?**

1. **Scale** — billions of parameters trained on trillions of tokens
2. **Architecture** — the modern Transformer (our focus)
3. **Data Quality** — clean, diverse, well-filtered text
4. **Training Tricks** — mixed precision, gradient clipping, LR schedules

> We'll build a tiny version using the **same publicly-documented techniques** as the best open-source models.

---

**Next:** [Chapter 1 — Setup & Tooling](01_setup.md)