# Chapter 1 — Setup & Tooling

## What You Need to Know Before We Start

### "What is Python, really?"

Python is just a **language for telling the computer what to do**. You write instructions in a `.py` file, and Python "reads" them and executes them one by one. If you've written any Python before — even just `print("hello")` — you're ready.

### "What is a GPU and why do I need one?"

**Analogy:** Imagine you need to paint 10,000 tiny tiles.

- A **CPU** is like a master artist who paints tiles ONE at a time — precise but slow.
- A **GPU** is like 10,000 art students who each paint ONE tile simultaneously — faster, even though each student is "dumber" than the master.

Training neural networks involves millions of **identical, independent math operations** (matrix multiplications). GPUs have thousands of small cores designed exactly for this. A GPU can be 50-100x faster than CPU for training.

**Do you absolutely need a GPU?** No — our tiny test model will run on CPU, just very slowly (minutes vs hours). For real training, a GPU is essential.

| Your Hardware | What You Can Train | Approximate Speed |
|---|---|---|
| CPU only | Tiny model (4 layers, 256 dims) | Hours |
| Apple M1/M2/M3 | Small model (12 layers, 768 dims) | Hours |
| RTX 3060/4060 (12GB) | GPT-2 small (124M params) | Few hours |
| RTX 3090/4090 (24GB) | GPT-2 medium (350M) | Few hours |
| A100 (80GB) | GPT-2 large (774M) | Hours |

### "What is a virtual environment?"

A virtual environment (`venv`) is like a **clean, empty kitchen** just for this project. Without it, you'd be mixing your project's ingredients (Python packages) with everything else on your computer — leading to conflicts when two projects need different versions of the same package.

```bash
# Create a clean kitchen
python -m venv gpt_env

# Step into it
source gpt_env/bin/activate          # Mac/Linux
# OR:
gpt_env\Scripts\activate             # Windows

# Now pip install only affects this kitchen
# To leave: type `deactivate`
```

### "What is pip?"

`pip` is Python's **package installer**. It downloads code other people have written (libraries) from the internet and installs them into your environment. Think of it as an "app store" for Python code.

### "What is PyTorch?"

PyTorch is the framework we'll use to build our neural network. It provides:

| PyTorch Feature | What It Does | Analogy |
|---|---|---|
| `torch.Tensor` | Multi-dimensional arrays | Like NumPy arrays, but can live on GPU |
| `torch.nn.Module` | Building blocks for networks | LEGO pieces you snap together |
| `torch.optim` | Algorithms that update weights | The "learning" part of machine learning |
| `autograd` | Automatic gradient calculation | Does calculus for you automatically |
| `DataLoader` | Feeds data efficiently | A conveyor belt delivering training data |

## Installation — Step by Step

```bash
# Step 1: Create the virtual environment
python -m venv gpt_env

# Step 2: Activate it
source gpt_env/bin/activate          # Mac/Linux
# gpt_env\Scripts\activate           # Windows

# Step 3: Install PyTorch (choose the right one)
# For CPU only (default, works everywhere):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# For Apple Silicon (M1/M2/M3):
# pip install torch torchvision torchaudio

# For NVIDIA GPU (CUDA 11.8):
# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# For NVIDIA GPU (CUDA 12.1 - newer cards like RTX 40 series):
# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Step 4: Install remaining packages
pip install tiktoken datasets numpy matplotlib

# Step 5: Verify everything works
python -c "import torch; print(f'PyTorch {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"
```

## What Each Library Does (In Detail)

| Library | What It Does | Why We Need It |
|---|---|---|
| **torch** | Core PyTorch: tensors, GPU ops, autograd | The foundation — everything else builds on this |
| **tiktoken** | Fast BPE tokenizer from OpenAI | Same tokenizer GPT-3.5/4 use. Written in Rust, extremely fast |
| **datasets** (HuggingFace) | Downloads + caches training data | Saves us from manually downloading and parsing Wikipedia |
| **numpy** | Fast numerical arrays on CPU | For quick data manipulation (though PyTorch handles most) |
| **matplotlib** | Creates charts and graphs | To visualize our training loss — is the model learning? |
| **math** (built-in) | sqrt, sin, cos, pi | Mathematical constants for positional encoding |
| **time** (built-in) | Measure elapsed time | Track training speed in tokens/second |
| **os** (built-in) | Create directories, save files | Save model checkpoints so we don't lose progress |

## Our Complete Import Block

```python
# ===== WHAT: Standard Python libraries =====
import math              # WHY: sqrt(), sin(), cos() for positional encoding math
import time              # WHY: measure training speed (tokens per second)
import os                # WHY: create directories, save/load model checkpoint files
from dataclasses import dataclass  # WHY: clean config class — no messy dictionaries

# ===== WHAT: NumPy — the CPU array library =====
import numpy as np       # WHY: fast numerical operations on CPU arrays
                         #      (mostly used for quick data checks, not heavy lifting)

# ===== WHAT: PyTorch — the neural network framework =====
import torch             # WHY: core library — tensors, GPU support, autograd
import torch.nn as nn               # WHY: neural network building blocks:
                                     #      Linear (dense layers), Embedding (lookup tables),
                                     #      Dropout (regularization), ModuleList (stacking layers)
import torch.nn.functional as F     # WHY: stateless functions used inside forward():
                                     #      softmax (convert to probabilities),
                                     #      cross_entropy (measure prediction error),
                                     #      silu (SwiGLU activation function)
from torch.utils.data import Dataset, DataLoader  # WHY: efficient data pipeline
#                                  Dataset = define how to load one sample
#                                  DataLoader = batch them, shuffle, prefetch

# ===== WHAT: tiktoken — OpenAI's fast BPE tokenizer =====
import tiktoken          # WHY: same Byte Pair Encoding tokenizer as GPT-3.5/GPT-4
                         #      Written in Rust, ~100x faster than pure Python tokenizers
                         #      Handles 50K+ vocabulary efficiently

# ===== WHAT: HuggingFace datasets — download training text =====
from datasets import load_dataset    # WHY: one line to download WikiText-103
                                     #      Handles caching (only downloads once),
                                     #      streaming (for datasets too big for disk),
                                     #      and format conversion automatically

# ===== WHAT: matplotlib — plot loss curves =====
import matplotlib.pyplot as plt      # WHY: visualize training progress
                                     #      Is the loss going down? Is it plateauing?
                                     #      A picture is worth 1,000 log lines

# ===== WHAT: Quick verification =====
# WHY: Always test your environment before writing 500 lines of code.
#      A missing import now saves hours of debugging later.
print("All imports ready!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU:             {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory:      {torch.cuda.get_device_properties(0).total_mem / 1e9:.1f} GB")
```

**Expected output (with GPU):**
```
All imports ready!
PyTorch version: 2.1.0
CUDA available:  True
GPU:             NVIDIA GeForce RTX 3090
GPU Memory:      24.0 GB
```

**Expected output (CPU only):**
```
All imports ready!
PyTorch version: 2.1.0
CUDA available:  False
```

If you see the GPU output, you're ready to train. If you see CPU only, training will work — just slower. Either way, let's continue.

---

## How to Think About the Rest of This Guide

Every chapter follows this pattern:

1. **Analogy** — Explain the concept in plain English (like teaching a 5-year-old)
2. **Math** — Show the actual formulas and why they work
3. **Code** — Every single line annotated with WHAT it does and WHY
4. **Visual** — Diagram or worked example showing data flowing through

If you ever feel lost, go back to the analogy. If the code feels overwhelming, focus on the WHAT/WHY comments — they're designed to be read top-to-bottom like a story.

---

**Previous:** [Chapter 0 — Overview](00_overview.md)
**Next:** [Chapter 2 — Tokenization](02_tokenization.md)