{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 4: NumPy for Data & ML\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/04-numpy.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/04-numpy.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Python Foundations**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook assumes you have completed Parts 1-3 (`01-python-core.ipynb`, `02-control-flow.ipynb`, `03-python-patterns.ipynb`). If you have not, start there.\n", "\n", "Every ML library you will use (pandas, scikit-learn, PyTorch, TensorFlow) stores its data as **NumPy arrays** (or something modelled directly on them) and builds its fast operations on top of NumPy's rules. This notebook teaches those rules from first principles, using the same **university analytics platform** running example from Parts 1-3: this time we build a small student feature matrix (`study_hours`, `attendance_pct`, `prior_gpa`) and use it to predict `exam_score`.\n", "\n", "::: {.callout-note collapse=\"true\" icon=false}\n", "## Topics covered\n", "\n", "| Topic | Why it matters |\n", "|---|---|\n", "| **ndarray vs. 
list** | Why ML libraries do not use plain Python lists |\n", "| **Creating arrays** | Build feature matrices and synthetic data |\n", "| **Shape & dtype** | Avoid silent shape-mismatch bugs |\n", "| **Indexing & slicing** | Select rows, columns, and subsets fast |\n", "| **Boolean masking** | Vectorised filtering and labelling |\n", "| **Aggregation & axis** | Per-feature vs. per-sample statistics |\n", "| **Broadcasting** | The rule behind every elementwise ML operation |\n", "| **Vectorisation** | Why loops are slow and arrays are fast |\n", "| **Linear algebra basics** | What is under the hood of a linear model |\n", "| **Saving & loading** | Persist arrays between pipeline stages |\n", ":::\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Explain why NumPy arrays outperform Python lists for numeric data | Sec. 1 |\n", "| 2 | Create arrays with `array`, `arange`, `zeros`, `ones`, and a modern `Generator` | Sec. 2 |\n", "| 3 | Inspect and reshape arrays using `shape`, `dtype`, `reshape`, and stacking | Sec. 3 |\n", "| 4 | Select data with slicing, fancy indexing, and boolean masks | Sec. 4, 5 |\n", "| 5 | Compute per-row / per-column statistics with `axis` | Sec. 6 |\n", "| 6 | Apply NumPy's broadcasting rule to vectorise feature engineering | Sec. 7 |\n", "| 7 | Explain why vectorised code beats Python loops | Sec. 8 |\n", "| 8 | Use `@`/`np.dot` to express a linear model as matrix multiplication | Sec. 9 |\n", "| 9 | Save and reload arrays with `.npy` / `.npz` | Sec. 10 |\n", "| 10 | Recognise NumPy's most common silent bugs (views, dtype, float equality) | Sec. 11 |\n", ":::\n" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## 0. 
Meet NumPy\n", "\n", "You have written Python for three chapters. You can loop, slice, and unpack. Now imagine you need to normalise 2,400 exam scores: subtract the mean, divide by the standard deviation, do it across three columns. A Python `for` loop would work. It would also be slow, verbose, and nothing like the code your colleagues expect to read.\n", "\n", "NumPy ([numpy.org](https://numpy.org)) was created in 2005 by Travis Oliphant to give Python scientists the numeric performance of Fortran and C without leaving the language. The idea was simple: store numbers in a contiguous block of memory with a fixed type, and let a thin Python wrapper call heavily optimised C and Fortran routines on that block. Two decades later, nearly every numerical computing library in Python (pandas, scikit-learn, PyTorch, TensorFlow) either builds on NumPy arrays or converts to and from them.\n", "\n", "### How it compares\n", "\n", "| Approach | Speed on large arrays | Readable math | When to use |\n", "| --- | --- | --- | --- |\n", "| Python `list` + loop | Slow (boxed Python objects, interpreter overhead) | Verbose | Small collections, mixed types |\n", "| **NumPy `ndarray`** | **Fast (C/Fortran, contiguous)** | **Concise (`a * 2`)** | **Numeric data of any size** |\n", "| PyTorch `Tensor` | Fast (optionally GPU) | Similar to NumPy | Deep learning, autodiff |\n", "| JAX `Array` | Very fast (XLA, JIT, GPU/TPU) | NumPy-compatible | Research, differentiable programs |\n", "| CuPy `ndarray` | Fast (GPU only) | NumPy-compatible | Large-scale GPU computing |\n", "\n", "For everything up to classical ML on a laptop, NumPy is the right level of abstraction. PyTorch and JAX add complexity (device management, gradient tracking) that you do not need yet.\n", "\n", "### Already in your environment\n", "\n", "NumPy is included in `pyproject.toml`. 
If you ever start a standalone project:\n", "\n", "```bash\n", "uv add numpy # or: pip install numpy\n", "```\n", "\n", "Official docs and API reference: [numpy.org/doc](https://numpy.org/doc/stable/)" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## 1. Why NumPy? The `ndarray`\n", "\n", "A Python `list` can hold anything (mixed types, nested objects), which makes it flexible but slow for numeric work: every element is a separate Python object, and arithmetic on a list means a Python-level loop.\n", "\n", "A NumPy **`ndarray`** (\"n-dimensional array\") is different: it stores **one fixed dtype** in a single contiguous block of memory. That uniformity lets NumPy hand the math off to compiled C/Fortran loops instead of the Python interpreter, often 10-100x faster, and with far less memory per element." ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "study_hours = [12, 5, 18, 9, 22]\n", "\n", "# A Python list has no element-wise arithmetic: * repeats the list, it does not multiply!\n", "print(\"list * 2 :\", study_hours * 2)\n", "\n", "hours = np.array(study_hours)\n", "print(\"array * 2 :\", hours * 2) # element-wise multiplication" ] }, { "cell_type": "raw", "id": "7", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "> **Memory layout: Python list vs NumPy ndarray**\n", "\n", "```{mermaid}\n", "flowchart LR\n", " subgraph py [\"Python list: scattered pointers\"]\n", " P1[\"int obj\"] & P2[\"float obj\"] & P3[\"str obj\"]\n", " L[\"list\"] -->|ptr| P1\n", " L -->|ptr| P2\n", " L -->|ptr| P3\n", " end\n", " subgraph np [\"NumPy ndarray: contiguous memory\"]\n", " direction LR\n", " M1[\"64-bit float\"] --- M2[\"64-bit float\"] --- M3[\"64-bit float\"]\n", " A[\"array\"] --> M1\n", " end\n", "\n", " style np fill:#EBF5F0,stroke:#059669\n", " style py fill:#FEF2F2,stroke:#DC2626\n", "```\n" ] }, { "cell_type": "markdown", "id": "8", "metadata": 
{}, "source": [ "
\n", " Key Concept: ndarray = dtype + shape + contiguous memory

\n", "A NumPy array is homogeneous (one dtype for every element) and has a fixed shape. Because elements sit next to each other in memory, NumPy can vectorise operations: apply one compiled loop to the whole array instead of looping in Python.

Lists are general-purpose containers; arrays are numeric data structures. Use a list for a heterogeneous bag of objects, an array for a column of numbers.\n", "
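The three ingredients named above can be inspected directly. This is a minimal sketch, assuming only NumPy and the standard library; the names `values` and `arr` are illustrative and not part of the running example, and the list size it reports is specific to CPython.

```python
import sys

import numpy as np

values = [12.0, 5.0, 18.0, 9.0, 22.0]  # hypothetical study-hours data
arr = np.array(values)

# The three ingredients of an ndarray: dtype, shape, one contiguous buffer.
print('dtype      :', arr.dtype)       # one type for every element
print('shape      :', arr.shape)
print('contiguous :', arr.flags['C_CONTIGUOUS'])
print('itemsize   :', arr.itemsize, 'bytes per element')
print('nbytes     :', arr.nbytes, 'bytes of raw data')

# A list stores pointers to separate float objects, so its footprint is the
# pointer table plus every boxed float (an approximation, CPython-specific).
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
print('list total :', list_bytes, 'bytes (approximate)')
```

The array's raw data is just `itemsize * size` bytes, while the list pays for a pointer and a full Python object per element.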
" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "## 2. Creating Arrays\n", "\n", "The most direct way to create an array is `np.array()` from a Python list (or list of lists, for 2D data). For larger or synthetic data, NumPy provides dedicated creation functions so you never have to type out values by hand." ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "# 1D array: one feature for five students\n", "study_hours = np.array([12, 5, 18, 9, 22])\n", "\n", "# 2D array: rows = students, columns = features\n", "# columns: [study_hours, attendance_pct, prior_gpa]\n", "features = np.array(\n", " [\n", " [12, 85, 3.1],\n", " [5, 60, 2.4],\n", " [18, 95, 3.8],\n", " [9, 70, 2.9],\n", " [22, 98, 3.9],\n", " ]\n", ")\n", "print(features)\n", "print(f\"shape: {features.shape}\") # (5 students, 3 features)" ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "For larger arrays, typing out every value is impractical. 
These functions build arrays from a rule instead of a literal list:\n", "\n", "| Function | Produces |\n", "|---|---|\n", "| `np.arange(start, stop, step)` | Evenly spaced integers/floats, like `range()` |\n", "| `np.linspace(start, stop, n)` | `n` evenly spaced floats, **inclusive** of both ends |\n", "| `np.zeros(shape)` / `np.ones(shape)` | Array filled with `0.0` / `1.0` |\n", "| `np.full(shape, value)` | Array filled with a constant |\n", "| `np.eye(n)` | `n x n` identity matrix |" ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "print(\"arange(0, 10) :\", np.arange(0, 10))\n", "print(\"arange(0, 10, 2) :\", np.arange(0, 10, 2))\n", "print(\"linspace(0, 1, 5) :\", np.linspace(0, 1, 5))\n", "print(\"zeros((2, 3)) :\\n\", np.zeros((2, 3)))\n", "print(\"ones(3) :\", np.ones(3))\n", "print(\"full((2, 2), 7) :\\n\", np.full((2, 2), 7))" ] }, { "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "### Random Data with a `Generator`\n", "\n", "Synthetic data and simulations need reproducible randomness. Modern NumPy (1.17+) recommends `np.random.default_rng(seed)` over the legacy `np.random.seed(...)`. A `Generator` object is self-contained, so two generators never interfere with each other's state (unlike the old global `np.random.seed`, which silently affects every call anywhere in the program)." ] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "rng = np.random.default_rng(seed=42) # one independent, reproducible stream\n", "\n", "print(\"uniform [0, 1) :\", rng.random(3))\n", "print(\"normal(mean=0,std=1):\", rng.normal(0, 1, size=3))\n", "print(\"integers [60, 100) :\", rng.integers(60, 100, size=5))" ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "
\n", " Pro Tip: Prefer default_rng over np.random.seed

\n", "np.random.seed(42) mutates a single global random state shared by your whole program. Any other code (or library) calling np.random.* shifts that shared state and breaks your reproducibility. rng = np.random.default_rng(42) gives you an isolated generator: pass it around explicitly, and your results stay reproducible no matter what else runs.\n", "
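The isolation claim can be checked with a short experiment. A minimal sketch, assuming nothing beyond NumPy; the names `rng_a`, `rng_b`, `first_a`, and `first_b` are illustrative.

```python
import numpy as np

# Two generators built from the same seed start from the same state.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

first_a = rng_a.random(3)

# Unrelated code touches the legacy global state in between...
np.random.seed(0)
np.random.random(1000)

# ...but rng_b is unaffected and reproduces the same draws as rng_a.
first_b = rng_b.random(3)
print('same draws:', np.array_equal(first_a, first_b))
```

Passing `rng` as an argument to the functions that need randomness keeps this guarantee across a whole pipeline.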
" ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "
\n", " Activity 1 - Build a Synthetic Student Dataset

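\n", "\n", "Goal: Using rng = np.random.default_rng(7), generate a feature matrix X of shape (50, 3) for 50 students with columns:
study_hours (uniform between 0 and 25), attendance_pct (uniform between 50 and 100), prior_gpa (uniform between 2.0 and 4.0).
\n", "\n", "Combine the three 1D arrays into one (50, 3) array with np.column_stack.\n", "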
X.shape  # -> (50, 3)\n",
    "X[0]     # -> array([study_hours_0, attendance_pct_0, prior_gpa_0])
\n", "Hint: np.column_stack([a, b, c]) stacks 1D arrays as columns of a 2D array.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [], "source": [ "rng = np.random.default_rng(7)\n", "\n", "study_hours = rng.uniform(0, 25, size=50)\n", "attendance_pct = rng.uniform(50, 100, size=50)\n", "prior_gpa = rng.uniform(2.0, 4.0, size=50)\n", "\n", "X = ... # TODO: combine the three arrays into one (50, 3) feature matrix\n", "\n", "# print(f\"X.shape : {X.shape}\")\n", "# print(f\"X[0] : {X[0]}\")" ] }, { "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "## 3. Shape, Size, and dtype\n", "\n", "Every array carries metadata you should check before trusting a computation: `shape` (size along each dimension), `ndim` (number of dimensions), `size` (total element count), and `dtype` (the single data type of every element)." ] }, { "cell_type": "code", "execution_count": null, "id": "19", "metadata": {}, "outputs": [], "source": [ "X = np.array(\n", " [\n", " [12.0, 85.0, 3.1],\n", " [5.0, 60.0, 2.4],\n", " [18.0, 95.0, 3.8],\n", " [9.0, 70.0, 2.9],\n", " ]\n", ")\n", "\n", "print(f\"shape : {X.shape}\") # (rows, columns)\n", "print(f\"ndim : {X.ndim}\")\n", "print(f\"size : {X.size}\") # rows * columns\n", "print(f\"dtype : {X.dtype}\")" ] }, { "cell_type": "markdown", "id": "20", "metadata": {}, "source": [ "
\n", " Common Mistake: Mixed int/float input silently upcasts

\n", "np.array([1, 2, 3.5]) produces a float64 array, not a mix of int and float. NumPy must pick one dtype for the whole array and silently widens every element to fit. This is usually harmless, but the same one-dtype rule causes silent surprises in the other direction: assigning 3.7 into an integer array stores 3 (the fraction is dropped), and int8 arithmetic that passes 127 wraps around to negative values. Always check .dtype when results look wrong.\n", "
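A short sketch of how a fixed dtype shapes results, assuming only NumPy; the array names are illustrative. Note that true division with `/` always returns floats, so truncation comes from storing a float into an integer array, not from dividing.

```python
import numpy as np

# Mixed int/float input is widened to a single float64 dtype.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)

# True division produces floats, even from an integer array.
halves = np.array([1, 2, 3], dtype=np.int32) / 2
print(halves, halves.dtype)

# Assigning a float INTO an integer array drops the fraction silently.
counts = np.array([1, 2, 3])
counts[0] = 3.7
print(counts)

# Fixed-width integers wrap around instead of growing.
small = np.array([127], dtype=np.int8)
wrapped = small + np.int8(1)
print(wrapped, wrapped.dtype)
```

None of these lines raises an exception, which is why printing `.dtype` is the first debugging step when numbers look off.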
" ] }, { "cell_type": "markdown", "id": "21", "metadata": {}, "source": [ "### Reshaping\n", "\n", "`reshape()` returns the **same data** viewed with a different shape. It does not copy or reorder values, so the total element count (`size`) must stay the same. `-1` tells NumPy \"infer this dimension from the others\":" ] }, { "cell_type": "code", "execution_count": null, "id": "22", "metadata": {}, "outputs": [], "source": [ "x = np.arange(12)\n", "print(f\"x : {x} shape={x.shape}\")\n", "\n", "x_grid = x.reshape(3, 4)\n", "print(f\"reshape(3,4):\\n{x_grid}\")\n", "\n", "# -1 means \"figure this dimension out for me\"\n", "x_col = x.reshape(-1, 1) # turn a 1D array into a single column\n", "print(f\"reshape(-1,1) shape: {x_col.shape}\")\n", "\n", "x_flat = x_grid.flatten() # back to 1D - always returns a COPY\n", "print(f\"flatten() : {x_flat}\")" ] }, { "cell_type": "markdown", "id": "23", "metadata": {}, "source": [ "
\n", " Common Mistake: reshape() returns a view, flatten() returns a copy

\n", "Mutating the result of .reshape() mutates the original array too, because the two share the same underlying memory. (reshape returns a view whenever the memory layout allows it, which covers every array in this section; it falls back to a copy only when it must, for example when flattening a transposed 2D array.) .flatten() always copies, so mutating its result is safe. This distinction (view vs. copy) comes up constantly in NumPy; Sec. 11 covers it in more depth.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "24", "metadata": {}, "outputs": [], "source": [ "original = np.arange(6)\n", "view = original.reshape(2, 3)\n", "view[0, 0] = 99 # mutating the reshaped VIEW...\n", "\n", "print(f\"view :\\n{view}\")\n", "print(f\"original : {original}\") # ...also changed the original!" ] }, { "cell_type": "markdown", "id": "25", "metadata": {}, "source": [ "### Combining Arrays: `column_stack`, `hstack`, `vstack`\n", "\n", "Feature engineering often means assembling separate 1D feature vectors into one 2D matrix, or stacking two matrices together. Use the right function for the shape change you want:\n", "\n", "| Function | Effect |\n", "|---|---|\n", "| `np.column_stack([a, b, ...])` | 1D arrays -> columns of a 2D array |\n", "| `np.hstack([a, b])` | Join side-by-side (same number of rows) |\n", "| `np.vstack([a, b])` | Stack on top of each other (same number of columns) |\n", "| `np.concatenate([a, b], axis=...)` | General join along a chosen axis |" ] }, { "cell_type": "code", "execution_count": null, "id": "26", "metadata": {}, "outputs": [], "source": [ "gpa = np.array([3.1, 2.4, 3.8, 2.9])\n", "attendance = np.array([85, 60, 95, 70])\n", "\n", "# Two 1D feature vectors -> one (4, 2) matrix\n", "combined = np.column_stack([gpa, attendance])\n", "print(f\"column_stack:\\n{combined}\")\n", "\n", "# A (4, 2) batch plus a (2, 2) batch of students -> one (6, 2) matrix\n", "more_students = np.array([[3.5, 90], [2.0, 55]])\n", "all_students = np.vstack([combined, more_students])\n", "print(f\"vstack shape: {all_students.shape}\")" ] }, { "cell_type": "markdown", "id": "27", "metadata": {}, "source": [ "## 4. Indexing and Slicing\n", "\n", "NumPy indexing extends Python's list slicing to multiple dimensions. For a 2D array, the convention is `array[rows, columns]`, and negative indices still count from the end." 
] }, { "cell_type": "code", "execution_count": null, "id": "28", "metadata": {}, "outputs": [], "source": [ "scores = np.array([62, 78, 85, 91, 55, 73, 88, 95, 67, 80])\n", "\n", "print(f\"first three : {scores[:3]}\")\n", "print(f\"last three : {scores[-3:]}\")\n", "print(f\"between 3 & 7 : {scores[3:7]}\")\n", "print(f\"every other : {scores[::2]}\")\n", "print(f\"reversed : {scores[::-1]}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "29", "metadata": {}, "outputs": [], "source": [ "X = np.array(\n", " [\n", " [12, 85, 3.1],\n", " [5, 60, 2.4],\n", " [18, 95, 3.8],\n", " [9, 70, 2.9],\n", " [22, 98, 3.9],\n", " ]\n", ")\n", "\n", "print(f\"row 0 : {X[0]}\")\n", "print(f\"column 1 (all rows): {X[:, 1]}\") # every row, attendance column\n", "print(f\"rows 1-3, col 0 & 2: \\n{X[1:4, [0, 2]]}\") # fancy column indexing\n", "print(f\"single cell [2, 1] : {X[2, 1]}\")" ] }, { "cell_type": "markdown", "id": "30", "metadata": {}, "source": [ "
\n", " Common Mistake: Basic slices are views; fancy/boolean indexing copies

\n", "X[:, 1] (a slice) returns a view: mutating it mutates X. X[1:4, [0, 2]] (a list of indices, known as \"fancy indexing\") always returns a copy. If you need an independent array from a slice, call .copy() explicitly: col = X[:, 1].copy().\n", "
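The view/copy split can be verified rather than memorised. A minimal sketch, assuming only NumPy: `np.shares_memory` reports whether two arrays overlap in memory, and the names `col_view`, `col_fancy`, and `col_masked` are illustrative.

```python
import numpy as np

X = np.array([[12.0, 85.0, 3.1],
              [5.0, 60.0, 2.4],
              [18.0, 95.0, 3.8]])

col_view = X[:, 1]             # basic slice: a view onto X
col_fancy = X[:, [1]]          # fancy indexing: an independent copy
col_masked = X[X[:, 1] > 70]   # boolean indexing: also a copy

print(np.shares_memory(X, col_view))
print(np.shares_memory(X, col_fancy))
print(np.shares_memory(X, col_masked))

col_view[0] = -1.0             # writes through to X
col_fancy[0, 0] = 999.0        # leaves X untouched
print(X[0, 1])
```

When in doubt about an expression, this check is cheaper than debugging a silently modified feature matrix.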
" ] }, { "cell_type": "markdown", "id": "31", "metadata": {}, "source": [ "
\n", " Activity 2 - Select Top Performers

\n", "\n", "Goal: Given the X feature matrix above (columns: study_hours, attendance_pct, prior_gpa), use slicing to print:

\n", "
  1. The prior_gpa column (all rows)\n", "
  2. The first two rows, all columns\n", "
  3. The study_hours and prior_gpa columns (skip attendance) for every row\n", "
\n", "Hint: For (3), use fancy column indexing: X[:, [0, 2]].\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "32", "metadata": {}, "outputs": [], "source": [ "X = np.array(\n", " [\n", " [12, 85, 3.1],\n", " [5, 60, 2.4],\n", " [18, 95, 3.8],\n", " [9, 70, 2.9],\n", " [22, 98, 3.9],\n", " ]\n", ")\n", "\n", "gpa_column = ... # TODO\n", "first_two_rows = ... # TODO\n", "hours_and_gpa = ... # TODO\n", "\n", "print(f\"gpa_column : {gpa_column}\")\n", "print(f\"first_two_rows :\\n{first_two_rows}\")\n", "print(f\"hours_and_gpa :\\n{hours_and_gpa}\")" ] }, { "cell_type": "markdown", "id": "33", "metadata": {}, "source": [ "## 5. Boolean Masking & Vectorised Conditionals\n", "\n", "Comparing an array to a value produces a **boolean array** of the same shape: a \"mask.\" Using that mask to index the original array keeps only the `True` positions. This replaces `if`/`for` filtering loops entirely." ] }, { "cell_type": "code", "execution_count": null, "id": "34", "metadata": {}, "outputs": [], "source": [ "scores = np.array([62, 78, 85, 91, 55, 73, 88, 95, 67, 80])\n", "\n", "passing_mask = scores >= 70\n", "print(f\"mask : {passing_mask}\")\n", "print(f\"passing : {scores[passing_mask]}\") # boolean indexing: keeps True positions\n", "print(f\"n passing: {passing_mask.sum()}\") # True counts as 1, False as 0" ] }, { "cell_type": "markdown", "id": "35", "metadata": {}, "source": [ "Combine conditions with `&` (and) / `|` (or), **not** Python's `and`/`or`, which only work on single booleans, not arrays. 
Each side needs its own parentheses because `&`/`|` bind tighter than comparison operators:" ] }, { "cell_type": "code", "execution_count": null, "id": "36", "metadata": {}, "outputs": [], "source": [ "attendance = np.array([85, 60, 95, 70, 98, 45, 88, 92, 55, 80])\n", "scores = np.array([62, 78, 85, 91, 55, 73, 88, 95, 67, 80])\n", "\n", "# Parentheses are required: without them, & binds tighter than the < comparisons\n", "at_risk = (scores < 70) & (attendance < 70)\n", "print(f\"at_risk mask : {at_risk}\")\n", "print(f\"n at risk : {at_risk.sum()}\")" ] }, { "cell_type": "markdown", "id": "37", "metadata": {}, "source": [ "`np.where(condition, if_true, if_false)` builds a new array by choosing between two values element-wise. It is the vectorised equivalent of a ternary expression inside a loop:" ] }, { "cell_type": "code", "execution_count": null, "id": "38", "metadata": {}, "outputs": [], "source": [ "labels = np.where(scores >= 70, \"pass\", \"fail\")\n", "print(labels)\n", "\n", "# np.select handles more than two outcomes (the first matching condition wins)\n", "grade = np.select(\n", " [scores >= 90, scores >= 80, scores >= 70, scores >= 60],\n", " [\"A\", \"B\", \"C\", \"D\"],\n", " default=\"F\",\n", ")\n", "print(grade)" ] }, { "cell_type": "markdown", "id": "39", "metadata": {}, "source": [ "
\n", " Activity 3 - Flag Students Needing Intervention

\n", "\n", "Goal: Given scores and attendance arrays, build a boolean mask needs_help that flags students with score < 70 or attendance < 60, then print how many students were flagged and their scores.\n", "
scores     = [62, 78, 85, 91, 55, 73, 88, 95, 67, 80]\n",
    "attendance = [85, 60, 95, 70, 98, 45, 88, 92, 55, 80]\n",
    "# needs_help -> True at indices 0, 4, 5, 8  (score<70 OR attendance<60)
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "40", "metadata": {}, "outputs": [], "source": [ "scores = np.array([62, 78, 85, 91, 55, 73, 88, 95, 67, 80])\n", "attendance = np.array([85, 60, 95, 70, 98, 45, 88, 92, 55, 80])\n", "\n", "needs_help = ... # TODO: boolean mask, score < 70 OR attendance < 60\n", "\n", "print(f\"needs_help : {needs_help}\")\n", "# print(f\"n flagged : {needs_help.sum()}\")\n", "# print(f\"flagged scores : {scores[needs_help]}\")" ] }, { "cell_type": "markdown", "id": "41", "metadata": {}, "source": [ "## 6. Aggregations Along an Axis\n", "\n", "`mean()`, `sum()`, `std()`, `min()`, `max()` collapse an array to a single number by default. On a 2D matrix, the `axis` argument controls **which dimension gets collapsed**. Picking the wrong one is the single most common source of \"right function, wrong number\" bugs in DS/ML code, so get the convention straight now:\n", "\n", "- **`axis=0`** collapses **rows** -> one result **per column** (per feature)\n", "- **`axis=1`** collapses **columns** -> one result **per row** (per sample)" ] }, { "cell_type": "code", "execution_count": null, "id": "42", "metadata": {}, "outputs": [], "source": [ "X = np.array(\n", " [\n", " [12.0, 85.0, 3.1],\n", " [5.0, 60.0, 2.4],\n", " [18.0, 95.0, 3.8],\n", " [9.0, 70.0, 2.9],\n", " [22.0, 98.0, 3.9],\n", " ]\n", ")\n", "\n", "print(f\"overall mean : {X.mean():.2f}\") # one number, all 15 values\n", "print(f\"per-feature mean : {X.mean(axis=0)}\") # shape (3,): one per column\n", "print(f\"per-student mean : {X.mean(axis=1)}\") # shape (5,): one per row\n", "print(f\"per-feature std : {X.std(axis=0)}\")\n", "print(f\"per-feature min/max : {X.min(axis=0)} / {X.max(axis=0)}\")" ] }, { "cell_type": "markdown", "id": "43", "metadata": {}, "source": [ "
\n", " Pro Tip: Say the axis name out loud

\n", "\"axis=0\" is easy to misremember. Read it as: \"collapse axis 0 (the row axis): what's left is one value per column.\" If you want one statistic per feature (the usual case before normalising a feature matrix), that is always axis=0.\n", "
" ] }, { "cell_type": "markdown", "id": "44", "metadata": {}, "source": [ "## 7. Broadcasting\n", "\n", "**Broadcasting** is the rule NumPy uses to apply an operation between two arrays of *different* shapes, by virtually \"stretching\" the smaller one, without actually copying any data. It is what lets you write `X - X.mean(axis=0)` instead of a loop over rows.\n", "\n", "
\n", " Key Concept: The Broadcasting Rule

\n", "Compare shapes from the right-hand side. Two dimensions are compatible when they are equal, or when one of them is 1 (it gets stretched to match). Missing leading dimensions are treated as 1.

(5, 3) and (3,) → treat (3,) as (1, 3) → stretch to (5, 3). ✅
(5, 3) and (5,) → treat (5,) as (1, 5)3 != 5 and neither is 1. ❌\n", "
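The rule can be tested on shapes alone, before any data exists. A minimal sketch, assuming NumPy 1.20 or newer (where `np.broadcast_shapes` is available); the variable names are illustrative.

```python
import numpy as np

# np.broadcast_shapes applies the broadcasting rule to shapes only,
# without allocating any arrays.
row_case = np.broadcast_shapes((5, 3), (3,))    # (3,) is treated as (1, 3)
col_case = np.broadcast_shapes((5, 3), (5, 1))  # the size-1 axis stretches to 3
print(row_case, col_case)

try:
    np.broadcast_shapes((5, 3), (5,))           # trailing sizes 3 and 5 clash
    compatible = True
except ValueError as err:
    compatible = False
    print('incompatible:', err)
```

Both compatible cases resolve to the shape of the larger operand; the incompatible one raises the same kind of `ValueError` that the arithmetic itself would.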
" ] }, { "cell_type": "code", "execution_count": null, "id": "45", "metadata": {}, "outputs": [], "source": [ "X = np.array(\n", " [\n", " [12.0, 85.0, 3.1],\n", " [5.0, 60.0, 2.4],\n", " [18.0, 95.0, 3.8],\n", " [9.0, 70.0, 2.9],\n", " [22.0, 98.0, 3.9],\n", " ]\n", ")\n", "\n", "feature_mean = X.mean(axis=0) # shape (3,) -- one mean per feature\n", "feature_std = X.std(axis=0) # shape (3,)\n", "\n", "print(f\"X.shape : {X.shape}\")\n", "print(f\"feature_mean.shape : {feature_mean.shape}\")\n", "\n", "# (5, 3) - (3,) broadcasts the mean across every row: no loop needed\n", "X_normalised = (X - feature_mean) / feature_std\n", "print(f\"normalised:\\n{X_normalised}\")\n", "print(f\"new per-feature mean (~0): {X_normalised.mean(axis=0).round(6)}\")" ] }, { "cell_type": "raw", "id": "46", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "> **Broadcasting: shapes align right, size-1 dimensions stretch**\n", "\n", "```{mermaid}\n", "flowchart LR\n", " A[\"(3, 3) array\n", "[[1, 2, 3],\n", " [4, 5, 6],\n", " [7, 8, 9]]\"] -->|\"+ scalar ()\"| R1[\"broadcasts to (3,3)\n", "[[11,12,13],\n", " [14,15,16],\n", " [17,18,19]]\"]\n", " B[\"(3, 3) array\"] -->|\"+ row (1, 3)\n", "[10, 20, 30]\"| R2[\"row expands to (3,3)\n", "[[11,22,33],\n", " [14,25,36],\n", " [17,28,39]]\"]\n", " C[\"(3, 1) col\"] -->|\"+ (1, 3) row\n", "shapes align on right\"| R3[\"(3,3) result\n", "outer-product-like\"]\n", "\n", " style R1 fill:#EBF5F0,stroke:#059669,color:#065F46\n", " style R2 fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E\n", " style R3 fill:#F5F3FF,stroke:#7C3AED,color:#3B0764\n", "```\n" ] }, { "cell_type": "markdown", "id": "47", "metadata": {}, "source": [ "The diagram below makes the rule literal: the top row is `feature_mean`, shape `(3,)`. NumPy stretches it down to align with every one of the 5 rows in `X`, without ever actually copying it 5 times in memory." 
] }, { "cell_type": "code", "execution_count": null, "id": "48", "metadata": {}, "outputs": [], "source": [ "from ark.plot.diagrams import broadcasting_diagram\n", "\n", "broadcasting_diagram();" ] }, { "cell_type": "markdown", "id": "49", "metadata": {}, "source": [ "
\n", " Common Mistake: Broadcasting a (5,) array against a (5, 3) matrix fails for a subtle reason

\n", "If you instead compute per_student_mean = X.mean(axis=1) (shape (5,)) and try X - per_student_mean, NumPy raises ValueError: operands could not be broadcast together. It is comparing the trailing dimensions 3 vs 5, not what you meant. Fix it by giving the per-row result an explicit column shape with keepdims=True: X.mean(axis=1, keepdims=True) has shape (5, 1), which broadcasts correctly against (5, 3).\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "50", "metadata": {}, "outputs": [], "source": [ "row_mean = X.mean(axis=1, keepdims=True) # shape (5, 1), NOT (5,)\n", "print(f\"row_mean.shape : {row_mean.shape}\")\n", "\n", "# (5, 3) - (5, 1) broadcasts the per-row mean across every column\n", "centered_per_row = X - row_mean\n", "print(f\"centered_per_row:\\n{centered_per_row}\")" ] }, { "cell_type": "markdown", "id": "51", "metadata": {}, "source": [ "
\n", " Activity 4 - Min-Max Scale Every Feature

\n", "\n", "Goal: Write a one-line expression that scales every column of X to the [0, 1] range using the formula (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)). Confirm the result's per-column min is 0 and max is 1.\n", "
X_scaled.min(axis=0)  # -> array([0., 0., 0.])\n",
    "X_scaled.max(axis=0)  # -> array([1., 1., 1.])
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "52", "metadata": {}, "outputs": [], "source": [ "X = np.array(\n", " [\n", " [12.0, 85.0, 3.1],\n", " [5.0, 60.0, 2.4],\n", " [18.0, 95.0, 3.8],\n", " [9.0, 70.0, 2.9],\n", " [22.0, 98.0, 3.9],\n", " ]\n", ")\n", "\n", "X_scaled = ... # TODO: min-max scale every column to [0, 1]\n", "\n", "# print(f\"min per column: {X_scaled.min(axis=0)}\")\n", "# print(f\"max per column: {X_scaled.max(axis=0)}\")" ] }, { "cell_type": "markdown", "id": "53", "metadata": {}, "source": [ "## 8. Vectorisation vs. Python Loops\n", "\n", "Now that you can express normalisation as `(X - mean) / std` with no loop, it is worth seeing *why* that matters. Every NumPy operation you have used so far runs as a single compiled loop over contiguous memory; a Python `for` loop pays the cost of the interpreter on every single element." ] }, { "cell_type": "code", "execution_count": null, "id": "54", "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "big = np.random.default_rng(0).normal(size=1_000_000)\n", "\n", "\n", "def zscore_loop(values: np.ndarray) -> list[float]:\n", " mean = sum(values) / len(values)\n", " variance = sum((v - mean) ** 2 for v in values) / len(values)\n", " std = variance**0.5\n", " return [(v - mean) / std for v in values]\n", "\n", "\n", "start = time.perf_counter()\n", "_ = zscore_loop(big)\n", "loop_time = time.perf_counter() - start\n", "\n", "start = time.perf_counter()\n", "_ = (big - big.mean()) / big.std()\n", "vector_time = time.perf_counter() - start\n", "\n", "print(f\"Python loop time : {loop_time:.4f}s\")\n", "print(f\"Vectorised time : {vector_time:.4f}s\")\n", "print(f\"Speedup : {loop_time / vector_time:,.0f}x\")" ] }, { "cell_type": "markdown", "id": "55", "metadata": {}, "source": [ "
\n", " Pro Tip: If you are writing a for loop over an array, stop and look for a vectorised way

\n", "Almost every elementwise transformation, filter, or aggregation you would write as a Python loop already has a NumPy equivalent: arithmetic operators, np.where, boolean masks, axis aggregations. Reach for those first: a hand-written loop over a large array is one of the most common DS/ML performance bugs, and it is usually a 10-100x slowdown for no benefit.\n", "
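As a sketch of what that replacement looks like, here is one loop and its vectorised twin. The rule itself (a 5-point bonus for scores under 70) is a hypothetical example, not part of the running dataset's logic.

```python
import numpy as np

scores = np.array([62, 78, 85, 91, 55, 73, 88, 95, 67, 80])

# Loop version: apply the hypothetical bonus one element at a time.
curved_loop = []
for s in scores:
    curved_loop.append(s + 5 if s < 70 else s)

# Vectorised version: the same rule as a single np.where call.
curved_vec = np.where(scores < 70, scores + 5, scores)

print(curved_vec)
print(np.array_equal(curved_loop, curved_vec))
```

The two versions agree element for element; only the vectorised one stays fast as the array grows.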
" ] }, { "cell_type": "markdown", "id": "56", "metadata": {}, "source": [ "## 9. Linear Algebra Essentials\n", "\n", "A linear model's prediction is a **dot product**: multiply each feature by a learned weight, sum the results, add a bias. `@` (matrix multiplication) computes this for every row of a feature matrix at once, no loop over students required." ] }, { "cell_type": "code", "execution_count": null, "id": "57", "metadata": {}, "outputs": [], "source": [ "X = np.array(\n", " [\n", " [12.0, 85.0, 3.1],\n", " [5.0, 60.0, 2.4],\n", " [18.0, 95.0, 3.8],\n", " [9.0, 70.0, 2.9],\n", " [22.0, 98.0, 3.9],\n", " ]\n", ") # shape (5 students, 3 features)\n", "\n", "# Suppose a (already-fitted) linear model has these learned weights and bias\n", "weights = np.array([1.5, 0.3, 8.0]) # one weight per feature, shape (3,)\n", "bias = 10.0\n", "\n", "# X @ weights: (5, 3) @ (3,) -> (5,) -- one prediction per student\n", "predicted_scores = X @ weights + bias\n", "print(f\"predicted_scores: {predicted_scores.round(1)}\")" ] }, { "cell_type": "markdown", "id": "58", "metadata": {}, "source": [ "`@` on a matrix and a vector is shorthand for: for each row, multiply element-wise by `weights` and sum: exactly `(X * weights).sum(axis=1)`. 
Verify the two are equivalent, then measure prediction error against the true scores with `np.linalg.norm` (the Euclidean / RMS-style distance):" ] }, { "cell_type": "code", "execution_count": null, "id": "59", "metadata": {}, "outputs": [], "source": [ "# @ is equivalent to elementwise multiply + sum along axis=1\n", "manual = (X * weights).sum(axis=1) + bias\n", "print(f\"@ matches manual sum: {np.allclose(predicted_scores, manual)}\")\n", "\n", "actual_scores = np.array([88.0, 65.0, 95.0, 78.0, 99.0])\n", "errors = predicted_scores - actual_scores\n", "rmse = np.linalg.norm(errors) / np.sqrt(len(errors))\n", "print(f\"errors : {errors.round(1)}\")\n", "print(f\"RMSE : {rmse:.2f}\")" ] }, { "cell_type": "markdown", "id": "60", "metadata": {}, "source": [ "
\n", "**Common Mistake: Comparing floats with `==`**\n", "\n", "`np.allclose(a, b)` was used above instead of `(a == b).all()` on purpose: floating-point arithmetic accumulates tiny rounding errors, so two mathematically-equal results can differ in their last bit. Always compare floats with a tolerance: `np.allclose`, or `abs(a - b) < 1e-9`, never with exact `==`.\n", "
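To see the difference in practice, the sketch below compares a round-trip result (`np.sqrt(a) ** 2`, mathematically equal to `a`) exactly and with a tolerance. The array `a` and the names `roundtrip`, `exact`, and `close` are illustrative only and not part of the student dataset; `np.isclose` is the element-wise counterpart of `np.allclose`.

```python
import numpy as np

a = np.array([2.0, 3.0, 4.0])
roundtrip = np.sqrt(a) ** 2  # mathematically equal to a, but rounded along the way

exact = roundtrip == a            # exact comparison, element by element
close = np.isclose(roundtrip, a)  # tolerant comparison, element by element

print(f'exact    : {exact}')
print(f'isclose  : {close}')
print(f'allclose : {np.allclose(roundtrip, a)}')
print(f'largest absolute difference: {np.abs(roundtrip - a).max():.1e}')
```

Both functions default to `rtol=1e-05` and `atol=1e-08`. A fixed cut-off such as `abs(a - b) < 1e-9` is a purely absolute tolerance, so it suits values of roughly unit size better than very large or very small ones.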
" ] }, { "cell_type": "markdown", "id": "61", "metadata": {}, "source": [ "## 10. Saving & Loading Arrays\n", "\n", "A typical pipeline computes a feature matrix once and reuses it across many later steps (training, evaluation, serving). `.npy` stores a single array in NumPy's own binary format: typically smaller and much faster to read than CSV for numeric data, and it preserves `dtype` and `shape` exactly. `.npz` bundles several named arrays together in one file." ] }, { "cell_type": "code", "execution_count": null, "id": "62", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "tmp_dir = Path(\"tmp_numpy_activity\")\n", "tmp_dir.mkdir(exist_ok=True)\n", "\n", "X = np.array([[12.0, 85.0, 3.1], [5.0, 60.0, 2.4], [18.0, 95.0, 3.8]])\n", "y = np.array([88.0, 65.0, 95.0])\n", "\n", "# Save a single array\n", "np.save(tmp_dir / \"X.npy\", X)\n", "X_loaded = np.load(tmp_dir / \"X.npy\")\n", "print(f\"round-trip equal: {np.array_equal(X, X_loaded)}\")\n", "\n", "# Save several named arrays together in one file\n", "np.savez(tmp_dir / \"dataset.npz\", features=X, target=y)\n", "# np.load on an .npz keeps the file open, so use it as a context manager\n", "with np.load(tmp_dir / \"dataset.npz\") as bundle:\n", "    print(f\"keys : {list(bundle.keys())}\")\n", "    print(f\"target : {bundle['target']}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "63", "metadata": {}, "outputs": [], "source": [ "import shutil\n", "\n", "shutil.rmtree(tmp_dir)\n", "print(f\"cleaned up: {not tmp_dir.exists()}\")" ] }, { "cell_type": "markdown", "id": "64", "metadata": {}, "source": [ "## 11. Common Gotchas\n", "\n", "Like the Python gotchas in Part 3, none of these raise an exception. They silently produce a wrong (or surprising) result. Recognise them now so you do not lose hours to them later." 
] }, { "cell_type": "code", "execution_count": null, "id": "65", "metadata": {}, "outputs": [], "source": [ "# GOTCHA 1: basic slicing returns a VIEW, not a copy\n", "scores = np.array([62, 78, 85, 91, 55])\n", "top_three = scores[:3]\n", "top_three[0] = 0  # mutating the \"slice\"...\n", "\n", "print(f\"top_three : {top_three}\")\n", "print(f\"scores : {scores}\")  # ...changed the original too!\n", "\n", "# Fix: copy explicitly when you need an independent array\n", "safe_copy = scores[:3].copy()" ] }, { "cell_type": "code", "execution_count": null, "id": "66", "metadata": {}, "outputs": [], "source": [ "# GOTCHA 2: narrow integer dtypes overflow silently (the value wraps around)\n", "small = np.array([120, 10], dtype=np.int8)\n", "print(f\"int8 + 50 : {small + 50}\")  # 120 + 50 = 170 exceeds int8's max of 127 and wraps to -86!\n", "\n", "# Fix: use a wide-enough dtype, or let NumPy infer one (the default is int64/float64 on most platforms)\n", "safe = small.astype(np.int64) + 50\n", "print(f\"int64 + 50: {safe}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "67", "metadata": {}, "outputs": [], "source": [ "# GOTCHA 3: {} is a dict, not a set -- same trap as in plain Python (Part 3, Sec. 8)\n", "empty_dict = {}\n", "empty_set = set()\n", "print(f\"type({{}}) : {type(empty_dict)}\")\n", "print(f\"type(set()) : {type(empty_set)}\")\n", "\n", "# GOTCHA 4: comparing floats with == (see Sec. 9 for the fix: np.allclose)\n", "a = np.array([0.1 + 0.2])\n", "print(f\"0.1 + 0.2 == 0.3 : {a == 0.3}\")  # False! 0.30000000000000004 != 0.3\n", "print(f\"np.allclose : {np.allclose(a, 0.3)}\")  # True: tolerant comparison" ] }, { "cell_type": "markdown", "id": "68", "metadata": {}, "source": [ "## 12. Capstone Exercises\n", "\n", "Apply everything from this notebook together. Each exercise is self-contained." ] }, { "cell_type": "markdown", "id": "69", "metadata": {}, "source": [ "
\n", "**Exercise 1 - Build, Normalise, and Predict**\n", "\n", "**Goal:** Using the students dataset below:\n", "\n", "1. Build a `(6, 3)` feature matrix `X` with `np.column_stack`\n", "2. Z-score normalise `X` (Sec. 7)\n", "3. Predict `exam_score` with the given `weights`/`bias` using `@` (Sec. 9), applied to the normalised features\n", "4. Compute the RMSE against `actual_scores` using `np.linalg.norm`\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "70", "metadata": {}, "outputs": [], "source": [ "study_hours = np.array([12, 5, 18, 9, 22, 14])\n", "attendance_pct = np.array([85, 60, 95, 70, 98, 80])\n", "prior_gpa = np.array([3.1, 2.4, 3.8, 2.9, 3.9, 3.3])\n", "actual_scores = np.array([88.0, 65.0, 95.0, 78.0, 99.0, 84.0])\n", "\n", "weights = np.array([0.8, 0.5, 6.0])\n", "bias = 55.0\n", "\n", "# TODO: 1) build X, 2) normalise it, 3) predict, 4) compute RMSE\n", "X = ...\n", "X_normalised = ...\n", "predicted = ...\n", "rmse = ...\n", "\n", "print(f\"predicted: {predicted}\")\n", "print(f\"RMSE : {rmse:.2f}\")" ] }, { "cell_type": "markdown", "id": "71", "metadata": {}, "source": [ "
\n", "**Exercise 2 - Vectorised Anomaly Detector**\n", "\n", "**Goal:** Rewrite the deque-based anomaly detector from Part 3 (Sec. 9, Exercise 3) without any explicit loop. Flag any reading more than 2 standard deviations from the overall mean using a single boolean mask.\n", "\n", "    readings = [36.5, 36.7, 36.8, 36.6, 36.9, 39.5, 36.7, 36.8]\n", "    # Expected: reading 39.5 (index 5) flagged as anomaly\n", "\n", "**Hint:** `z = (readings - readings.mean()) / readings.std()`, then mask `np.abs(z) > 2`.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "72", "metadata": {}, "outputs": [], "source": [ "readings = np.array([36.5, 36.7, 36.8, 36.6, 36.9, 39.5, 36.7, 36.8])\n", "\n", "z_scores = ...  # TODO\n", "anomaly_mask = ...  # TODO\n", "\n", "print(f\"z_scores : {z_scores.round(2)}\")\n", "print(f\"anomaly_mask : {anomaly_mask}\")\n", "print(f\"anomalies : {readings[anomaly_mask]}\")" ] }, { "cell_type": "markdown", "id": "73", "metadata": {}, "source": [ "## What's New in NumPy 2.0\n", "\n", "NumPy 2.0 (released June 2024) is the first major version bump in almost two decades. Two things matter most when you move day-to-day data science code to it:\n", "\n", "### Removed type aliases\n", "\n", "The old Python-builtin aliases (`np.int`, `np.float`, `np.bool`, `np.complex`, `np.object`, `np.str`) were deprecated in NumPy 1.20 and removed in NumPy 1.24, so code that still relies on them fails on 2.x as well. They were only ever aliases for the Python built-ins, so the fix is mechanical. One exception: NumPy 2.0 brought `np.bool` back, this time as a name for NumPy's own boolean type (the same type as `np.bool_`).\n", "\n", "| Old (removed) | Replacement |\n", "|---|---|\n", "| `np.bool` | `np.bool_` or Python `bool` (`np.bool` is valid again from 2.0) |\n", "| `np.int` | `np.int_`, a sized type such as `np.int64`, or Python `int` |\n", "| `np.float` | `np.float64` or Python `float` |\n", "| `np.complex` | `np.complex128` or Python `complex` |\n", "| `np.object` | `np.object_` or Python `object` |\n", "| `np.str` | `np.str_` or Python `str` |\n", "\n", "### `StringDType` for proper string arrays\n", "\n", "NumPy 2.0 introduced `np.dtypes.StringDType()`, a true variable-length string dtype that stores its data as UTF-8. The older `np.str_` dtype stores fixed-width UCS-4 strings (4 bytes per character, with a separate dtype for every width, such as `<U5` or `<U7`), so every element is padded to the longest string; `StringDType` stores strings of arbitrary length without that padding.\n", "\n", "
\n", "**Pro Tip: Use `np.float64`, not `np.float`, in new code**\n", "\n", "If you see a `module 'numpy' has no attribute 'float'` error, the code was written for a NumPy release older than 1.24. Replacing `np.float` with `np.float64` (and likewise for the other names in the table) fixes it. Match whole names only, so that a blanket search-and-replace does not also rewrite `np.float32` or `np.float64`.\n", "
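A blind search-and-replace has one pitfall: `np.float` is also the prefix of `np.float32` and `np.float64`. The sketch below runs on a made-up one-line snippet (`legacy` is hypothetical) and uses a regular expression with a negative lookahead so that only the bare alias is rewritten. Treat it as one possible approach, not an official migration tool.

```python
import re

# Hypothetical legacy line: one bare alias, one explicit dtype that must survive
legacy = 'x = np.zeros(3, dtype=np.float); y = x.astype(np.float32)'

# Naive replacement also hits the np.float prefix inside np.float32
naive = legacy.replace('np.float', 'np.float64')

# Match np.float only when no further name character (letter, digit, _) follows
pattern = re.compile('np[.]float(?![0-9A-Za-z_])')
fixed = pattern.sub('np.float64', legacy)

print(f'naive: {naive}')
print(f'fixed: {fixed}')
```

The same pattern, with the alias name swapped, covers the other rows of the table; review the resulting diff before committing, since string literals and comments are rewritten too.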
" ] }, { "cell_type": "code", "execution_count": null, "id": "74", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# The old aliases no longer exist in NumPy 2.x; this would raise AttributeError:\n", "# arr = np.array([1, 2, 3], dtype=np.float)  # ❌\n", "\n", "# Use the explicit dtype names instead:\n", "arr = np.array([1, 2, 3], dtype=np.float64)  # ✅\n", "print(f\"dtype: {arr.dtype}\")\n", "\n", "# StringDType: variable-length strings, efficient UTF-8 storage (NumPy 2.0+)\n", "names = np.array([\"Alice\", \"Bob\", \"Charlie\"], dtype=np.dtypes.StringDType())\n", "print(f\"names : {names}\")\n", "print(f\"dtype : {names.dtype}\")\n", "\n", "# The old fixed-width str_ is still available but StringDType is the modern choice\n", "old_style = np.array([\"Alice\", \"Bob\", \"Charlie\"])  # infers str_ (fixed width)\n", "print(f\"old dtype: {old_style.dtype}\")  # e.g. <U7" ] }, { "cell_type": "markdown", "id": "75", "metadata": {}, "source": [ "| Concept | Key takeaway |\n", "|---|---|\n", "| `axis` | 0 collapses rows -> one value per **column**; 1 collapses columns -> one value per **row** |\n", "| Broadcasting | Compare shapes right-to-left; dims match if equal or one is `1`; use `keepdims=True` to broadcast a per-row stat |\n", "| Vectorisation | A NumPy expression typically beats a Python loop by 10-100x; look for one before writing `for` |\n", "| `@` / `np.dot` | Matrix multiplication: `X @ weights` predicts every row in one call |\n", "| `np.allclose` | Always compare floats with a tolerance, never `==` |\n", "| `.npy` / `.npz` | Compact, dtype/shape-preserving array storage between pipeline stages |\n", "\n", "\n", "**Next:** `05-matplotlib.ipynb`, covering how to visualise arrays and DataFrames with matplotlib." ] } ], "metadata": { "kernelspec": { "display_name": "ark (3.12.12.final.0)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }