{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 4: NumPy for Data & ML\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[Open in Colab](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/04-numpy.ipynb) [Download notebook](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/04-numpy.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Python Foundations**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook assumes you have completed Parts 1-3 (`01-python-core.ipynb`, `02-control-flow.ipynb`, `03-python-patterns.ipynb`). If you have not, start there.\n", "\n", "Every ML library you will use (pandas, scikit-learn, PyTorch, TensorFlow) stores its data as **NumPy arrays** (or something modelled directly on them) and builds its fast operations on top of NumPy's rules. This notebook teaches those rules from first principles, using the same **university analytics platform** running example from Parts 1-3: this time we build a small student feature matrix (`study_hours`, `attendance_pct`, `prior_gpa`) and use it to predict `exam_score`.\n", "\n", "::: {.callout-note collapse=\"true\" icon=false}\n", "## Topics covered\n", "\n", "| Topic | Why it matters |\n", "|---|---|\n", "| **ndarray vs. list** | Why ML libraries do not use plain Python lists |\n", "| **Creating arrays** | Build feature matrices and synthetic data |\n", "| **Shape & dtype** | Avoid silent shape-mismatch bugs |\n", "| **Indexing & slicing** | Select rows, columns, and subsets fast |\n", "| **Boolean masking** | Vectorised filtering and labelling |\n", "| **Aggregation & axis** | Per-feature vs. 
per-sample statistics |\n", "| **Broadcasting** | The rule behind every elementwise ML operation |\n", "| **Vectorisation** | Why loops are slow and arrays are fast |\n", "| **Linear algebra basics** | What is under the hood of a linear model |\n", "| **Saving & loading** | Persist arrays between pipeline stages |\n", ":::\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Explain why NumPy arrays outperform Python lists for numeric data | Sec. 1 |\n", "| 2 | Create arrays with `array`, `arange`, `zeros`, `ones`, and a modern `Generator` | Sec. 2 |\n", "| 3 | Inspect and reshape arrays using `shape`, `dtype`, `reshape`, and stacking | Sec. 3 |\n", "| 4 | Select data with slicing, fancy indexing, and boolean masks | Sec. 4, 5 |\n", "| 5 | Compute per-row / per-column statistics with `axis` | Sec. 6 |\n", "| 6 | Apply NumPy's broadcasting rule to vectorise feature engineering | Sec. 7 |\n", "| 7 | Explain why vectorised code beats Python loops | Sec. 8 |\n", "| 8 | Use `@`/`np.dot` to express a linear model as matrix multiplication | Sec. 9 |\n", "| 9 | Save and reload arrays with `.npy` / `.npz` | Sec. 10 |\n", "| 10 | Recognise NumPy's most common silent bugs (views, dtype, float equality) | Sec. 11 |\n", ":::\n" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## 0. Meet NumPy\n", "\n", "You have written Python for three chapters. You can loop, slice, and unpack. Now imagine you need to normalise 2,400 exam scores: subtract the mean, divide by the standard deviation, do it across three columns. A Python `for` loop would work. 
It would also be slow, verbose, and nothing like the code your colleagues expect to read.\n", "\n", "NumPy ([numpy.org](https://numpy.org)) was created in 2005 by Travis Oliphant to give Python scientists the numeric performance of Fortran and C without leaving the language. The idea was simple: store numbers in a contiguous block of memory with a fixed type, and let a thin Python wrapper call heavily optimised C and Fortran routines on that block. Two decades later, nearly every numerical computing library in Python (pandas, scikit-learn, PyTorch, TensorFlow) accepts or produces NumPy arrays as its common currency.\n", "\n", "### How it compares\n", "\n", "| Approach | Speed on large arrays | Readable math | When to use |\n", "| --- | --- | --- | --- |\n", "| Python `list` + loop | Slow (boxed Python objects, interpreter overhead) | Verbose | Small collections, mixed types |\n", "| **NumPy `ndarray`** | **Fast (C/Fortran, contiguous)** | **Concise (`a * 2`)** | **Numeric data of any size** |\n", "| PyTorch `Tensor` | Fast (optionally GPU) | Similar to NumPy | Deep learning, autodiff |\n", "| JAX `Array` | Very fast (XLA, JIT, GPU/TPU) | NumPy-compatible | Research, differentiable programs |\n", "| CuPy `ndarray` | Fast (GPU only) | NumPy-compatible | Large-scale GPU computing |\n", "\n", "For everything up to classical ML on a laptop, NumPy is the right level of abstraction. PyTorch and JAX add complexity (device management, gradient tracking) that you do not need yet.\n", "\n", "### Already in your environment\n", "\n", "NumPy is included in `pyproject.toml`. If you ever start a standalone project:\n", "\n", "```bash\n", "uv add numpy # or: pip install numpy\n", "```\n", "\n", "Official docs and API reference: [numpy.org/doc](https://numpy.org/doc/stable/)" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## 1. Why NumPy? 
The `ndarray`\n", "\n", "A Python `list` can hold anything (mixed types, nested objects), which makes it flexible but slow for numeric work: every element is a separate Python object, and arithmetic on a list means a Python-level loop.\n", "\n", "A NumPy **`ndarray`** (\"n-dimensional array\") is different: it stores **one fixed dtype** in a single contiguous block of memory. That uniformity lets NumPy hand the math off to compiled C/Fortran loops instead of the Python interpreter, often 10-100x faster, and with far less memory per element." ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "study_hours = [12, 5, 18, 9, 22]\n", "\n", "# A Python list has no element-wise arithmetic: this is list repetition (the list is concatenated with itself), not math!\n", "print(\"list * 2 :\", study_hours * 2)\n", "\n", "hours = np.array(study_hours)\n", "print(\"array * 2 :\", hours * 2) # element-wise multiplication" ] }, { "cell_type": "raw", "id": "7", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "> **Memory layout: Python list vs NumPy ndarray**\n", "\n", "```{mermaid}\n", "flowchart LR\n", " subgraph py [\"Python list: scattered pointers\"]\n", " P1[\"int obj\"] & P2[\"float obj\"] & P3[\"str obj\"]\n", " L[\"list\"] -->|ptr| P1\n", " L -->|ptr| P2\n", " L -->|ptr| P3\n", " end\n", " subgraph np [\"NumPy ndarray: contiguous memory\"]\n", " direction LR\n", " M1[\"64-bit float\"] --- M2[\"64-bit float\"] --- M3[\"64-bit float\"]\n", " A[\"array\"] --> M1\n", " end\n", "\n", " style np fill:#EBF5F0,stroke:#059669\n", " style py fill:#FEF2F2,stroke:#DC2626\n", "```\n" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "
An `ndarray` is homogeneous (one `dtype` for every element) and has a fixed shape. Because elements sit next to each other in memory, NumPy can vectorise operations: apply one compiled loop to the whole array instead of looping in Python.\n",
"\n",
"**Prefer `default_rng` over `np.random.seed`.** `np.random.seed(42)` mutates a single global random state shared by your whole program. Any other code (or library) calling `np.random.*` shifts that shared state and breaks your reproducibility. `rng = np.random.default_rng(42)` gives you an isolated generator: pass it around explicitly, and your results stay reproducible no matter what else runs.\n",
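The isolation claim can be checked directly. The sketch below is illustrative only: the variable names and the seed values are assumptions, not part of the course dataset. It builds two generators from the same seed, disturbs the legacy global state in between, and confirms that both generators still produce the same stream.

```python
import numpy as np

# Two generators built from the same seed produce identical streams.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
draws_a = rng_a.uniform(0, 25, size=5)

# Unrelated code touching the legacy global random state...
np.random.seed(0)
np.random.rand(1000)

# ...does not disturb an isolated Generator.
draws_b = rng_b.uniform(0, 25, size=5)
same_stream = np.array_equal(draws_a, draws_b)
print("isolated generators agree:", same_stream)
```

Passing `rng` into functions as an argument, rather than calling `np.random.*` inside them, keeps this property across a whole pipeline.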
"rng = np.random.default_rng(7), generate a feature matrix X of shape (50, 3) for 50 students with columns:study_hours: rng.uniform(0, 25, size=50)attendance_pct: rng.uniform(50, 100, size=50)prior_gpa: rng.uniform(2.0, 4.0, size=50)(50, 3) array with np.column_stack.\n",
"X.shape # -> (50, 3)\n",
"X[0] # -> array([study_hours_0, attendance_pct_0, prior_gpa_0])\n",
"Hint: np.column_stack([a, b, c]) stacks 1D arrays as columns of a 2D array.\n",
"np.array([1, 2, 3.5]) produces a float64 array, not a mix of int and float. NumPy must pick one dtype for the whole array and silently widens every element to fit. This is usually harmless, but np.array([1, 2, 3], dtype=np.int32) / 2 truncating, or an unexpected int8 overflowing past 127, are the same root cause: always check .dtype when results look wrong.\n",
"reshape() returns a view, flatten() returns a copy.reshape() mutates the original array too. They share the same underlying memory. .flatten() always copies, so mutating it is safe. This distinction (view vs. copy) comes up constantly in NumPy; Sec. 11 covers it in more depth.\n",
"X[:, 1] (a slice) returns a view: mutating it mutates X. X[1:4, [0, 2]] (a list of indices, known as \"fancy indexing\") always returns a copy. If you need an independent array from a slice, call .copy() explicitly: col = X[:, 1].copy().\n",
"X feature matrix above (columns: study_hours, attendance_pct, prior_gpa), use slicing to print:prior_gpa column (all rows)study_hours and prior_gpa columns (skip attendance) for every rowX[:, [0, 2]].\n",
"scores and attendance arrays, build a boolean mask needs_help that flags students with score < 70 or attendance < 60, then print how many students were flagged and their scores.\n",
"scores = [62, 78, 85, 91, 55, 73, 88, 95, 67, 80]\n",
"attendance = [85, 60, 95, 70, 98, 45, 88, 92, 55, 80]\n",
"# needs_help -> True at indices 0, 4, 5, 8 (score<70 OR attendance<60)\n",
"axis=0\" is easy to misremember. Read it as: \"collapse axis 0 (the row axis): what's left is one value per column.\" If you want one statistic per feature (the usual case before normalising a feature matrix), that is always axis=0.\n",
"(5, 3) and (3,) → treat (3,) as (1, 3) → stretch to (5, 3). ✅(5, 3) and (5,) → treat (5,) as (1, 5) → 3 != 5 and neither is 1. ❌\n",
"per_student_mean = X.mean(axis=1) (shape (5,)) and try X - per_student_mean, NumPy raises ValueError: operands could not be broadcast together. It is comparing the trailing dimensions 3 vs 5, not what you meant. Fix it by giving the per-row result an explicit column shape with keepdims=True: X.mean(axis=1, keepdims=True) has shape (5, 1), which broadcasts correctly against (5, 3).\n",
"X to the [0, 1] range using the formula (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)). Confirm the result's per-column min is 0 and max is 1.\n",
"X_scaled.min(axis=0) # -> array([0., 0., 0.])\n",
"X_scaled.max(axis=0) # -> array([1., 1., 1.])\n",
"for loop over an array, stop and look for a vectorised waynp.where, boolean masks, axis aggregations. Reach for those first: a hand-written loop over a large array is one of the most common DS/ML performance bugs, and it is usually a 10-100x slowdown for no benefit.\n",
"==np.allclose(a, b) was used above instead of (a == b).all() on purpose: floating-point arithmetic accumulates tiny rounding errors, so two mathematically-equal results can differ in their last bit. Always compare floats with a tolerance: np.allclose, or abs(a - b) < 1e-9, never with exact ==.\n",
"students dataset below:(6, 3) feature matrix X with np.column_stackX (Sec. 7)exam_score with the given weights/bias using @ (Sec. 9), applied to the normalised featuresactual_scores using np.linalg.norm2 standard deviations from the overall mean using a single boolean mask.\n",
"readings = [36.5, 36.7, 36.8, 36.6, 36.9, 39.5, 36.7, 36.8]\n",
"# Expected: reading 39.5 (index 5) flagged as anomaly\n",
"Hint: z = (readings - readings.mean()) / readings.std(), then mask np.abs(z) > 2.\n",
"np.float64 not np.float in new codemodule 'numpy' has no attribute 'float' error, the codebase was written against NumPy 1.x. A global search-and-replace of np.float → np.float64 (and so on for the others in the table) is the complete fix.\n",
"