{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "---\n",
    "title: \"Part 4: NumPy for Data & ML\"\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/04-numpy.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/04-numpy.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2",
   "metadata": {},
   "source": [
    "**DS-MLOps Python Foundations**\n",
    "\n",
    "**Python 3.12+ | Author: Anthony Faustine**\n",
    "\n",
    "## Before you begin\n",
    "\n",
    "This notebook assumes you have completed Parts 1-3 (`01-python-core.ipynb`, `02-control-flow.ipynb`, `03-python-patterns.ipynb`). If you have not, start there.\n",
    "\n",
    "Every ML library you will use (pandas, scikit-learn, PyTorch, TensorFlow) stores its data as **NumPy arrays** (or something modelled directly on them) and builds its fast operations on top of NumPy's rules. This notebook teaches those rules from first principles, using the same **university analytics platform** running example from Parts 1-3: this time we build a small student feature matrix (`study_hours`, `attendance_pct`, `prior_gpa`) and use it to predict `exam_score`.\n",
    "\n",
    "::: {.callout-note collapse=\"true\" icon=false}\n",
    "## Topics covered\n",
    "\n",
    "| Topic | Why it matters |\n",
    "|---|---|\n",
    "| **ndarray vs. list** | Why ML libraries do not use plain Python lists |\n",
    "| **Creating arrays** | Build feature matrices and synthetic data |\n",
    "| **Shape & dtype** | Avoid silent shape-mismatch bugs |\n",
    "| **Indexing & slicing** | Select rows, columns, and subsets fast |\n",
    "| **Boolean masking** | Vectorised filtering and labelling |\n",
    "| **Aggregation & axis** | Per-feature vs. per-sample statistics |\n",
    "| **Broadcasting** | The rule behind every elementwise ML operation |\n",
    "| **Vectorisation** | Why loops are slow and arrays are fast |\n",
    "| **Linear algebra basics** | What is under the hood of a linear model |\n",
    "| **Saving & loading** | Persist arrays between pipeline stages |\n",
    ":::\n",
    "\n",
    "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "::: {.callout-note collapse=\"true\" icon=false}\n",
    "## Learning Objectives\n",
    "\n",
    "| # | Skill | Covered in |\n",
    "|---|---|---|\n",
    "| 1 | Explain why NumPy arrays outperform Python lists for numeric data | Sec. 1 |\n",
    "| 2 | Create arrays with `array`, `arange`, `zeros`, `ones`, and a modern `Generator` | Sec. 2 |\n",
    "| 3 | Inspect and reshape arrays using `shape`, `dtype`, `reshape`, and stacking | Sec. 3 |\n",
    "| 4 | Select data with slicing, fancy indexing, and boolean masks | Sec. 4, 5 |\n",
    "| 5 | Compute per-row / per-column statistics with `axis` | Sec. 6 |\n",
    "| 6 | Apply NumPy's broadcasting rule to vectorise feature engineering | Sec. 7 |\n",
    "| 7 | Explain why vectorised code beats Python loops | Sec. 8 |\n",
    "| 8 | Use `@`/`np.dot` to express a linear model as matrix multiplication | Sec. 9 |\n",
    "| 9 | Save and reload arrays with `.npy` / `.npz` | Sec. 10 |\n",
    "| 10 | Recognise NumPy's most common silent bugs (views, dtype, float equality) | Sec. 11 |\n",
    ":::\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4",
   "metadata": {},
   "source": [
    "## 0. Meet NumPy\n",
    "\n",
    "You have written Python for three chapters. You can loop, slice, and unpack. Now imagine you need to normalise 2,400 exam scores: subtract the mean, divide by the standard deviation, do it across three columns. A Python `for` loop would work. It would also be slow, verbose, and nothing like the code your colleagues expect to read.\n",
    "\n",
    "NumPy ([numpy.org](https://numpy.org)) was created in 2005 by Travis Oliphant to give Python scientists the numeric performance of Fortran and C without leaving the language. The idea was simple: store numbers in a contiguous block of memory with a fixed type, and let a thin Python wrapper call heavily optimised C and Fortran routines on that block. Two decades later, nearly every numerical computing library in Python (pandas, scikit-learn, PyTorch, TensorFlow) uses NumPy arrays as its currency.\n",
    "\n",
    "### How it compares\n",
    "\n",
    "| Approach | Speed on large arrays | Readable math | When to use |\n",
    "| --- | --- | --- | --- |\n",
    "| Python `list` + loop | Slow (Python objects, GIL) | Verbose | Small collections, mixed types |\n",
    "| **NumPy `ndarray`** | **Fast (C/Fortran, contiguous)** | **Concise (`a * 2`)** | **Numeric data of any size** |\n",
    "| PyTorch `Tensor` | Fast (optionally GPU) | Similar to NumPy | Deep learning, autodiff |\n",
    "| JAX `Array` | Very fast (XLA, JIT, GPU/TPU) | NumPy-compatible | Research, differentiable programs |\n",
    "| CuPy `ndarray` | GPU only | NumPy-compatible | Large-scale GPU computing |\n",
    "\n",
    "For everything up to classical ML on a laptop, NumPy is the right level of abstraction. PyTorch and JAX add complexity (device management, gradient tracking) that you do not need yet.\n",
    "\n",
    "### Already in your environment\n",
    "\n",
    "NumPy is included in `pyproject.toml`. If you ever start a standalone project:\n",
    "\n",
    "```bash\n",
    "uv add numpy          # or: pip install numpy\n",
    "```\n",
    "\n",
    "Official docs and API reference: [numpy.org/doc](https://numpy.org/doc/stable/)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "## 1. Why NumPy? The `ndarray`\n",
    "\n",
    "A Python `list` can hold anything (mixed types, nested objects), which makes it flexible but slow for numeric work: every element is a separate Python object, and arithmetic on a list means a Python-level loop.\n",
    "\n",
    "A NumPy **`ndarray`** (\"n-dimensional array\") is different: it stores **one fixed dtype** in a single contiguous block of memory. That uniformity lets NumPy hand the math off to compiled C/Fortran loops instead of the Python interpreter, often 10-100x faster, and with far less memory per element."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "study_hours = [12, 5, 18, 9, 22]\n",
    "\n",
    "# A Python list has no element-wise arithmetic: this is string repetition, not math!\n",
    "print(\"list  * 2 :\", study_hours * 2)\n",
    "\n",
    "hours = np.array(study_hours)\n",
    "print(\"array * 2 :\", hours * 2)  # element-wise multiplication"
   ]
  },
  {
   "cell_type": "raw",
   "id": "7",
   "metadata": {
    "raw_mimetype": "text/markdown"
   },
   "source": [
    "> **Memory layout: Python list vs NumPy ndarray**\n",
    "\n",
    "```{mermaid}\n",
    "flowchart LR\n",
    "    subgraph py [\"Python list: scattered pointers\"]\n",
    "        P1[\"int obj\"] & P2[\"float obj\"] & P3[\"str obj\"]\n",
    "        L[\"list\"] -->|ptr| P1\n",
    "        L -->|ptr| P2\n",
    "        L -->|ptr| P3\n",
    "    end\n",
    "    subgraph np [\"NumPy ndarray: contiguous memory\"]\n",
    "        direction LR\n",
    "        M1[\"64-bit float\"] --- M2[\"64-bit float\"] --- M3[\"64-bit float\"]\n",
    "        A[\"array\"] --> M1\n",
    "    end\n",
    "\n",
    "    style np fill:#EBF5F0,stroke:#059669\n",
    "    style py fill:#FEF2F2,stroke:#DC2626\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8",
   "metadata": {},
   "source": [
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: ndarray = dtype + shape + contiguous memory</span><br><br>\n",
    "A NumPy array is homogeneous (one <code>dtype</code> for every element) and has a fixed <code>shape</code>. Because elements sit next to each other in memory, NumPy can vectorise operations: apply one compiled loop to the whole array instead of looping in Python.<br><br>Lists are general-purpose containers; arrays are <b>numeric data structures</b>. Use a list for a heterogeneous bag of objects, an array for a column of numbers.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {},
   "source": [
    "## 2. Creating Arrays\n",
    "\n",
    "The most direct way to create an array is `np.array()` from a Python list (or list of lists, for 2D data). For larger or synthetic data, NumPy provides dedicated creation functions so you never have to type out values by hand."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1D array: one feature for five students\n",
    "study_hours = np.array([12, 5, 18, 9, 22])\n",
    "\n",
    "# 2D array: rows = students, columns = features\n",
    "# columns: [study_hours, attendance_pct, prior_gpa]\n",
    "features = np.array(\n",
    "    [\n",
    "        [12, 85, 3.1],\n",
    "        [5, 60, 2.4],\n",
    "        [18, 95, 3.8],\n",
    "        [9, 70, 2.9],\n",
    "        [22, 98, 3.9],\n",
    "    ]\n",
    ")\n",
    "print(features)\n",
    "print(f\"shape: {features.shape}\")  # (5 students, 3 features)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11",
   "metadata": {},
   "source": [
    "For larger arrays, typing out every value is impractical. These functions build arrays from a rule instead of a literal list:\n",
    "\n",
    "| Function | Produces |\n",
    "|---|---|\n",
    "| `np.arange(start, stop, step)` | Evenly spaced integers/floats, like `range()` |\n",
    "| `np.linspace(start, stop, n)` | `n` evenly spaced floats, **inclusive** of both ends |\n",
    "| `np.zeros(shape)` / `np.ones(shape)` | Array filled with `0.0` / `1.0` |\n",
    "| `np.full(shape, value)` | Array filled with a constant |\n",
    "| `np.eye(n)` | `n x n` identity matrix |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"arange(0, 10)       :\", np.arange(0, 10))\n",
    "print(\"arange(0, 10, 2)    :\", np.arange(0, 10, 2))\n",
    "print(\"linspace(0, 1, 5)   :\", np.linspace(0, 1, 5))\n",
    "print(\"zeros((2, 3))       :\\n\", np.zeros((2, 3)))\n",
    "print(\"ones(3)             :\", np.ones(3))\n",
    "print(\"full((2, 2), 7)     :\\n\", np.full((2, 2), 7))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13",
   "metadata": {},
   "source": [
    "### Random Data with a `Generator`\n",
    "\n",
    "Synthetic data and simulations need reproducible randomness. Modern NumPy (1.17+) recommends `np.random.default_rng(seed)` over the legacy `np.random.seed(...)`. A `Generator` object is self-contained, so two generators never interfere with each other's state (unlike the old global `np.random.seed`, which silently affects every call anywhere in the program)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14",
   "metadata": {},
   "outputs": [],
   "source": [
    "rng = np.random.default_rng(seed=42)  # one independent, reproducible stream\n",
    "\n",
    "print(\"uniform [0, 1)      :\", rng.random(3))\n",
    "print(\"normal(mean=0,std=1):\", rng.normal(0, 1, size=3))\n",
    "print(\"integers [60, 100)  :\", rng.integers(60, 100, size=5))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: Prefer <code>default_rng</code> over <code>np.random.seed</code></span><br><br>\n",
    "<code>np.random.seed(42)</code> mutates a single <b>global</b> random state shared by your whole program. Any other code (or library) calling <code>np.random.*</code> shifts that shared state and breaks your reproducibility. <code>rng = np.random.default_rng(42)</code> gives you an isolated generator: pass it around explicitly, and your results stay reproducible no matter what else runs.\n",
    "</div>"
   ]
  },
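  {
   "cell_type": "markdown",
   "id": "15a",
   "metadata": {},
   "source": [
    "The isolation described above can be checked directly. The following is a minimal sketch (the names `rng_a`, `rng_b`, `first_a`, and `first_b` are illustrative, not part of the running example): two generators built from the same seed produce the same stream, and extra draws from one do not disturb the other.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng_a = np.random.default_rng(42)\n",
    "rng_b = np.random.default_rng(42)\n",
    "\n",
    "first_a = rng_a.random(3)\n",
    "_ = rng_a.random(100)  # extra draws advance rng_a only\n",
    "first_b = rng_b.random(3)  # rng_b still starts at the beginning of its own stream\n",
    "\n",
    "print(f\"first_a  : {first_a}\")\n",
    "print(f\"first_b  : {first_b}\")\n",
    "print(f\"identical: {np.array_equal(first_a, first_b)}\")\n",
    "```\n",
    "\n",
    "Passing a generator into each function that needs randomness, rather than seeding globally, is what keeps this guarantee intact in larger programs."
   ]
  },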
  {
   "cell_type": "markdown",
   "id": "16",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 1 - Build a Synthetic Student Dataset</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Using <code>rng = np.random.default_rng(7)</code>, generate a feature matrix <code>X</code> of shape <code>(50, 3)</code> for 50 students with columns:<br><br>\n",
    "<ul>\n",
    "<li><code>study_hours</code>: <code>rng.uniform(0, 25, size=50)</code></li>\n",
    "<li><code>attendance_pct</code>: <code>rng.uniform(50, 100, size=50)</code></li>\n",
    "<li><code>prior_gpa</code>: <code>rng.uniform(2.0, 4.0, size=50)</code></li>\n",
    "</ul>\n",
    "Combine the three 1D arrays into one <code>(50, 3)</code> array with <code>np.column_stack</code>.\n",
    "<pre style='background:#FCE8DA;padding:10px;border-radius:4px;font-size:0.9em'>X.shape  # -> (50, 3)\n",
    "X[0]     # -> array([study_hours_0, attendance_pct_0, prior_gpa_0])</pre>\n",
    "<b>Hint:</b> <code>np.column_stack([a, b, c])</code> stacks 1D arrays as columns of a 2D array.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17",
   "metadata": {},
   "outputs": [],
   "source": [
    "rng = np.random.default_rng(7)\n",
    "\n",
    "study_hours = rng.uniform(0, 25, size=50)\n",
    "attendance_pct = rng.uniform(50, 100, size=50)\n",
    "prior_gpa = rng.uniform(2.0, 4.0, size=50)\n",
    "\n",
    "X = ...  # TODO: combine the three arrays into one (50, 3) feature matrix\n",
    "\n",
    "# print(f\"X.shape : {X.shape}\")\n",
    "# print(f\"X[0]    : {X[0]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18",
   "metadata": {},
   "source": [
    "## 3. Shape, Size, and dtype\n",
    "\n",
    "Every array carries metadata you should check before trusting a computation: `shape` (size along each dimension), `ndim` (number of dimensions), `size` (total element count), and `dtype` (the single data type of every element)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19",
   "metadata": {},
   "outputs": [],
   "source": [
    "X = np.array(\n",
    "    [\n",
    "        [12.0, 85.0, 3.1],\n",
    "        [5.0, 60.0, 2.4],\n",
    "        [18.0, 95.0, 3.8],\n",
    "        [9.0, 70.0, 2.9],\n",
    "    ]\n",
    ")\n",
    "\n",
    "print(f\"shape : {X.shape}\")  # (rows, columns)\n",
    "print(f\"ndim  : {X.ndim}\")\n",
    "print(f\"size  : {X.size}\")  # rows * columns\n",
    "print(f\"dtype : {X.dtype}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20",
   "metadata": {},
   "source": [
    "<div style='background:#FEF2F2;border-left:5px solid #DC2626;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#991B1B;font-weight:bold'><i class=\"bi bi-bug-fill\"></i> Common Mistake: Mixed int/float input silently upcasts</span><br><br>\n",
    "<code>np.array([1, 2, 3.5])</code> produces a <code>float64</code> array, not a mix of <code>int</code> and <code>float</code>. NumPy must pick <b>one</b> dtype for the whole array and silently widens every element to fit. This is usually harmless, but <code>np.array([1, 2, 3], dtype=np.int32) / 2</code> truncating, or an unexpected <code>int8</code> overflowing past 127, are the same root cause: always check <code>.dtype</code> when results look wrong.\n",
    "</div>"
   ]
  },
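  {
   "cell_type": "markdown",
   "id": "20a",
   "metadata": {},
   "source": [
    "The dtype behaviours in the callout can be reproduced in a few lines. This is a minimal sketch, not part of the running example; the array names (`mixed`, `counts`, `halves`, `small`, `wrapped`) are illustrative.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Mixed int/float input: every element is widened to float64\n",
    "mixed = np.array([1, 2, 3.5])\n",
    "print(f\"mixed dtype  : {mixed.dtype}\")\n",
    "\n",
    "# Writing a float into an integer array drops the fractional part\n",
    "counts = np.array([1, 2, 3], dtype=np.int32)\n",
    "counts[0] = 2.9\n",
    "print(f\"counts       : {counts}\")\n",
    "\n",
    "# True division returns a float array; it does not truncate\n",
    "halves = counts / 2\n",
    "print(f\"halves dtype : {halves.dtype}\")\n",
    "\n",
    "# Fixed-width integers wrap around instead of growing\n",
    "small = np.array([127], dtype=np.int8)\n",
    "wrapped = small + 1\n",
    "print(f\"int8 127 + 1 : {wrapped}\")\n",
    "```\n",
    "\n",
    "If a computation needs fractional values or a wider range, converting up front with `astype(np.float64)` or `astype(np.int64)` avoids both surprises."
   ]
  },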
  {
   "cell_type": "markdown",
   "id": "21",
   "metadata": {},
   "source": [
    "### Reshaping\n",
    "\n",
    "`reshape()` returns the **same data** viewed with a different shape. It does not copy or reorder values, so the total element count (`size`) must stay the same. `-1` tells NumPy \"infer this dimension from the others\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22",
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.arange(12)\n",
    "print(f\"x          : {x}  shape={x.shape}\")\n",
    "\n",
    "x_grid = x.reshape(3, 4)\n",
    "print(f\"reshape(3,4):\\n{x_grid}\")\n",
    "\n",
    "# -1 means \"figure this dimension out for me\"\n",
    "x_col = x.reshape(-1, 1)  # turn a 1D array into a single column\n",
    "print(f\"reshape(-1,1) shape: {x_col.shape}\")\n",
    "\n",
    "x_flat = x_grid.flatten()  # back to 1D - always returns a COPY\n",
    "print(f\"flatten()   : {x_flat}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23",
   "metadata": {},
   "source": [
    "<div style='background:#FEF2F2;border-left:5px solid #DC2626;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#991B1B;font-weight:bold'><i class=\"bi bi-bug-fill\"></i> Common Mistake: <code>reshape()</code> returns a view, <code>flatten()</code> returns a copy</span><br><br>\n",
    "Mutating the result of <code>.reshape()</code> mutates the <b>original</b> array too. They share the same underlying memory. <code>.flatten()</code> always copies, so mutating it is safe. This distinction (view vs. copy) comes up constantly in NumPy; Sec. 11 covers it in more depth.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24",
   "metadata": {},
   "outputs": [],
   "source": [
    "original = np.arange(6)\n",
    "view = original.reshape(2, 3)\n",
    "view[0, 0] = 99  # mutating the reshaped VIEW...\n",
    "\n",
    "print(f\"view     :\\n{view}\")\n",
    "print(f\"original : {original}\")  # ...also changed the original!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25",
   "metadata": {},
   "source": [
    "### Combining Arrays: `column_stack`, `hstack`, `vstack`\n",
    "\n",
    "Feature engineering often means assembling separate 1D feature vectors into one 2D matrix, or stacking two matrices together. Use the right function for the shape change you want:\n",
    "\n",
    "| Function | Effect |\n",
    "|---|---|\n",
    "| `np.column_stack([a, b, ...])` | 1D arrays -> columns of a 2D array |\n",
    "| `np.hstack([a, b])` | Join side-by-side (same number of rows) |\n",
    "| `np.vstack([a, b])` | Stack on top of each other (same number of columns) |\n",
    "| `np.concatenate([a, b], axis=...)` | General join along a chosen axis |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26",
   "metadata": {},
   "outputs": [],
   "source": [
    "gpa = np.array([3.1, 2.4, 3.8, 2.9])\n",
    "attendance = np.array([85, 60, 95, 70])\n",
    "\n",
    "# Two 1D feature vectors -> one (4, 2) matrix\n",
    "combined = np.column_stack([gpa, attendance])\n",
    "print(f\"column_stack:\\n{combined}\")\n",
    "\n",
    "# Two (4, 2) batches of students -> one (8, 2) matrix\n",
    "more_students = np.array([[3.5, 90], [2.0, 55]])\n",
    "all_students = np.vstack([combined, more_students])\n",
    "print(f\"vstack shape: {all_students.shape}\")"
   ]
  },
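  {
   "cell_type": "markdown",
   "id": "26a",
   "metadata": {},
   "source": [
    "The table also lists `np.hstack` and `np.concatenate`, which the cell above does not exercise. The sketch below uses two small illustrative arrays (`left` and `right` are made-up names and values) to show that, for 2D inputs, `hstack` matches `concatenate(..., axis=1)` and `vstack` matches `concatenate(..., axis=0)`.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "left = np.array([[1, 2], [3, 4]])  # shape (2, 2)\n",
    "right = np.array([[5], [6]])  # shape (2, 1): same number of rows as left\n",
    "\n",
    "side_by_side = np.hstack([left, right])  # shape (2, 3)\n",
    "via_axis1 = np.concatenate([left, right], axis=1)\n",
    "\n",
    "stacked = np.vstack([left, left])  # shape (4, 2)\n",
    "via_axis0 = np.concatenate([left, left], axis=0)\n",
    "\n",
    "print(f\"hstack shape : {side_by_side.shape}\")\n",
    "print(f\"vstack shape : {stacked.shape}\")\n",
    "print(f\"hstack == concatenate(axis=1): {np.array_equal(side_by_side, via_axis1)}\")\n",
    "print(f\"vstack == concatenate(axis=0): {np.array_equal(stacked, via_axis0)}\")\n",
    "```"
   ]
  },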
  {
   "cell_type": "markdown",
   "id": "27",
   "metadata": {},
   "source": [
    "## 4. Indexing and Slicing\n",
    "\n",
    "NumPy indexing extends Python's list slicing to multiple dimensions. For a 2D array, the convention is `array[rows, columns]`, and negative indices still count from the end."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28",
   "metadata": {},
   "outputs": [],
   "source": [
    "scores = np.array([62, 78, 85, 91, 55, 73, 88, 95, 67, 80])\n",
    "\n",
    "print(f\"first three   : {scores[:3]}\")\n",
    "print(f\"last three    : {scores[-3:]}\")\n",
    "print(f\"between 3 & 7 : {scores[3:7]}\")\n",
    "print(f\"every other   : {scores[::2]}\")\n",
    "print(f\"reversed      : {scores[::-1]}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29",
   "metadata": {},
   "outputs": [],
   "source": [
    "X = np.array(\n",
    "    [\n",
    "        [12, 85, 3.1],\n",
    "        [5, 60, 2.4],\n",
    "        [18, 95, 3.8],\n",
    "        [9, 70, 2.9],\n",
    "        [22, 98, 3.9],\n",
    "    ]\n",
    ")\n",
    "\n",
    "print(f\"row 0              : {X[0]}\")\n",
    "print(f\"column 1 (all rows): {X[:, 1]}\")  # every row, attendance column\n",
    "print(f\"rows 1-3, col 0 & 2: \\n{X[1:4, [0, 2]]}\")  # fancy column indexing\n",
    "print(f\"single cell [2, 1] : {X[2, 1]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30",
   "metadata": {},
   "source": [
    "<div style='background:#FEF2F2;border-left:5px solid #DC2626;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#991B1B;font-weight:bold'><i class=\"bi bi-bug-fill\"></i> Common Mistake: Basic slices are views; fancy/boolean indexing copies</span><br><br>\n",
    "<code>X[:, 1]</code> (a slice) returns a <b>view</b>: mutating it mutates <code>X</code>. <code>X[1:4, [0, 2]]</code> (a list of indices, known as \"fancy indexing\") always returns a <b>copy</b>. If you need an independent array from a slice, call <code>.copy()</code> explicitly: <code>col = X[:, 1].copy()</code>.\n",
    "</div>"
   ]
  },
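  {
   "cell_type": "markdown",
   "id": "30a",
   "metadata": {},
   "source": [
    "A short check of the view/copy split described above. This is a sketch on a small illustrative matrix (`X_demo`, `col_view`, and `picked` are made-up names); `np.shares_memory` reports whether two arrays overlap in memory.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "X_demo = np.array([[12.0, 85.0, 3.1], [5.0, 60.0, 2.4], [18.0, 95.0, 3.8]])\n",
    "\n",
    "col_view = X_demo[:, 1]  # basic slice -> view\n",
    "col_view[0] = 0.0  # writes through to X_demo\n",
    "\n",
    "picked = X_demo[:, [0, 2]]  # fancy indexing -> copy\n",
    "picked[0, 0] = -1.0  # X_demo is untouched\n",
    "\n",
    "print(f\"X_demo[0, 1] after editing the view : {X_demo[0, 1]}\")\n",
    "print(f\"X_demo[0, 0] after editing the copy : {X_demo[0, 0]}\")\n",
    "print(f\"slice shares memory : {np.shares_memory(X_demo, col_view)}\")\n",
    "print(f\"fancy shares memory : {np.shares_memory(X_demo, picked)}\")\n",
    "```"
   ]
  },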
  {
   "cell_type": "markdown",
   "id": "31",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 2 - Select Top Performers</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Given the <code>X</code> feature matrix above (columns: study_hours, attendance_pct, prior_gpa), use slicing to print:<br><br>\n",
    "<ol>\n",
    "<li>The <code>prior_gpa</code> column (all rows)</li>\n",
    "<li>The first two rows, all columns</li>\n",
    "<li>The <code>study_hours</code> and <code>prior_gpa</code> columns (skip attendance) for every row</li>\n",
    "</ol>\n",
    "<b>Hint:</b> For (3), use fancy column indexing: <code>X[:, [0, 2]]</code>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32",
   "metadata": {},
   "outputs": [],
   "source": [
    "X = np.array(\n",
    "    [\n",
    "        [12, 85, 3.1],\n",
    "        [5, 60, 2.4],\n",
    "        [18, 95, 3.8],\n",
    "        [9, 70, 2.9],\n",
    "        [22, 98, 3.9],\n",
    "    ]\n",
    ")\n",
    "\n",
    "gpa_column = ...  # TODO\n",
    "first_two_rows = ...  # TODO\n",
    "hours_and_gpa = ...  # TODO\n",
    "\n",
    "print(f\"gpa_column     : {gpa_column}\")\n",
    "print(f\"first_two_rows :\\n{first_two_rows}\")\n",
    "print(f\"hours_and_gpa  :\\n{hours_and_gpa}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33",
   "metadata": {},
   "source": [
    "## 5. Boolean Masking & Vectorised Conditionals\n",
    "\n",
    "Comparing an array to a value produces a **boolean array** of the same shape: a \"mask.\" Using that mask to index the original array keeps only the `True` positions. This replaces `if`/`for` filtering loops entirely."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34",
   "metadata": {},
   "outputs": [],
   "source": [
    "scores = np.array([62, 78, 85, 91, 55, 73, 88, 95, 67, 80])\n",
    "\n",
    "passing_mask = scores >= 70\n",
    "print(f\"mask     : {passing_mask}\")\n",
    "print(f\"passing  : {scores[passing_mask]}\")  # boolean indexing: keeps True positions\n",
    "print(f\"n passing: {passing_mask.sum()}\")  # True counts as 1, False as 0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35",
   "metadata": {},
   "source": [
    "Combine conditions with `&` (and) / `|` (or), **not** Python's `and`/`or`, which only work on single booleans, not arrays. Each side needs its own parentheses because `&`/`|` bind tighter than comparison operators:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36",
   "metadata": {},
   "outputs": [],
   "source": [
    "attendance = np.array([85, 60, 95, 70, 98, 45, 88, 92, 55, 80])\n",
    "scores = np.array([62, 78, 85, 91, 55, 73, 88, 95, 67, 80])\n",
    "\n",
    "# Parentheses are required: & binds tighter than >= without them\n",
    "at_risk = (scores < 70) & (attendance < 70)\n",
    "print(f\"at_risk mask : {at_risk}\")\n",
    "print(f\"n at risk    : {at_risk.sum()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37",
   "metadata": {},
   "source": [
    "`np.where(condition, if_true, if_false)` builds a new array by choosing between two values element-wise: the vectorised equivalent of a ternary expression inside a loop:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38",
   "metadata": {},
   "outputs": [],
   "source": [
    "labels = np.where(scores >= 70, \"pass\", \"fail\")\n",
    "print(labels)\n",
    "\n",
    "# np.select handles more than two outcomes\n",
    "grade = np.select(\n",
    "    [scores >= 90, scores >= 80, scores >= 70, scores >= 60],\n",
    "    [\"A\", \"B\", \"C\", \"D\"],\n",
    "    default=\"F\",\n",
    ")\n",
    "print(grade)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 3 - Flag Students Needing Intervention</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Given <code>scores</code> and <code>attendance</code> arrays, build a boolean mask <code>needs_help</code> that flags students with <code>score &lt; 70</code> <b>or</b> <code>attendance &lt; 60</code>, then print how many students were flagged and their scores.\n",
    "<pre style='background:#FCE8DA;padding:10px;border-radius:4px;font-size:0.9em'>scores     = [62, 78, 85, 91, 55, 73, 88, 95, 67, 80]\n",
    "attendance = [85, 60, 95, 70, 98, 45, 88, 92, 55, 80]\n",
    "# needs_help -> True at indices 0, 4, 5, 8  (score<70 OR attendance<60)</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "40",
   "metadata": {},
   "outputs": [],
   "source": [
    "scores = np.array([62, 78, 85, 91, 55, 73, 88, 95, 67, 80])\n",
    "attendance = np.array([85, 60, 95, 70, 98, 45, 88, 92, 55, 80])\n",
    "\n",
    "needs_help = ...  # TODO: boolean mask, score < 70 OR attendance < 60\n",
    "\n",
    "print(f\"needs_help        : {needs_help}\")\n",
    "# print(f\"n flagged         : {needs_help.sum()}\")\n",
    "# print(f\"flagged scores    : {scores[needs_help]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41",
   "metadata": {},
   "source": [
    "## 6. Aggregations Along an Axis\n",
    "\n",
    "`mean()`, `sum()`, `std()`, `min()`, `max()` collapse an array to a single number by default. On a 2D matrix, the `axis` argument controls **which dimension gets collapsed** : this is the single most common source of \"right function, wrong number\" bugs in DS/ML code, so get the convention straight now:\n",
    "\n",
    "- **`axis=0`** collapses **rows** -> one result **per column** (per feature)\n",
    "- **`axis=1`** collapses **columns** -> one result **per row** (per sample)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42",
   "metadata": {},
   "outputs": [],
   "source": [
    "X = np.array(\n",
    "    [\n",
    "        [12.0, 85.0, 3.1],\n",
    "        [5.0, 60.0, 2.4],\n",
    "        [18.0, 95.0, 3.8],\n",
    "        [9.0, 70.0, 2.9],\n",
    "        [22.0, 98.0, 3.9],\n",
    "    ]\n",
    ")\n",
    "\n",
    "print(f\"overall mean        : {X.mean():.2f}\")  # one number, all 15 values\n",
    "print(f\"per-feature mean    : {X.mean(axis=0)}\")  # shape (3,): one per column\n",
    "print(f\"per-student mean    : {X.mean(axis=1)}\")  # shape (5,): one per row\n",
    "print(f\"per-feature std     : {X.std(axis=0)}\")\n",
    "print(f\"per-feature min/max : {X.min(axis=0)} / {X.max(axis=0)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: Say the axis name out loud</span><br><br>\n",
    "\"<code>axis=0</code>\" is easy to misremember. Read it as: \"collapse <b>axis 0</b> (the row axis): what's left is one value per column.\" If you want one statistic per <i>feature</i> (the usual case before normalising a feature matrix), that is always <code>axis=0</code>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44",
   "metadata": {},
   "source": [
    "## 7. Broadcasting\n",
    "\n",
    "**Broadcasting** is the rule NumPy uses to apply an operation between two arrays of *different* shapes, by virtually \"stretching\" the smaller one, without actually copying any data. It is what lets you write `X - X.mean(axis=0)` instead of a loop over rows.\n",
    "\n",
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: The Broadcasting Rule</span><br><br>\n",
    "Compare shapes from the <b>right-hand side</b>. Two dimensions are compatible when they are <b>equal</b>, or when <b>one of them is 1</b> (it gets stretched to match). Missing leading dimensions are treated as 1.<br><br> <code>(5, 3)</code> and <code>(3,)</code> &rarr; treat <code>(3,)</code> as <code>(1, 3)</code> &rarr; stretch to <code>(5, 3)</code>. ✅<br> <code>(5, 3)</code> and <code>(5,)</code> &rarr; treat <code>(5,)</code> as <code>(1, 5)</code> &rarr; <code>3 != 5</code> and neither is 1. ❌\n",
    "</div>"
   ]
  },
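  {
   "cell_type": "markdown",
   "id": "44a",
   "metadata": {},
   "source": [
    "The rule can be tested on shapes alone, before any data exists. The sketch below assumes NumPy 1.20 or newer, where `np.broadcast_shapes` is available; the variable names are illustrative.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "ok_row = np.broadcast_shapes((5, 3), (3,))  # (3,) is treated as (1, 3)\n",
    "ok_col = np.broadcast_shapes((5, 3), (5, 1))  # the size-1 axis stretches to 3\n",
    "print(f\"(5, 3) with (3,)   -> {ok_row}\")\n",
    "print(f\"(5, 3) with (5, 1) -> {ok_col}\")\n",
    "\n",
    "try:\n",
    "    np.broadcast_shapes((5, 3), (5,))  # trailing dimensions 3 vs 5: incompatible\n",
    "    failed = False\n",
    "except ValueError:\n",
    "    failed = True\n",
    "print(f\"(5, 3) with (5,) fails: {failed}\")\n",
    "```\n",
    "\n",
    "Checking shapes this way is a cheap sanity test when an elementwise operation raises an unexpected broadcasting error."
   ]
  },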
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45",
   "metadata": {},
   "outputs": [],
   "source": [
    "X = np.array(\n",
    "    [\n",
    "        [12.0, 85.0, 3.1],\n",
    "        [5.0, 60.0, 2.4],\n",
    "        [18.0, 95.0, 3.8],\n",
    "        [9.0, 70.0, 2.9],\n",
    "        [22.0, 98.0, 3.9],\n",
    "    ]\n",
    ")\n",
    "\n",
    "feature_mean = X.mean(axis=0)  # shape (3,)  -- one mean per feature\n",
    "feature_std = X.std(axis=0)  # shape (3,)\n",
    "\n",
    "print(f\"X.shape            : {X.shape}\")\n",
    "print(f\"feature_mean.shape : {feature_mean.shape}\")\n",
    "\n",
    "# (5, 3) - (3,) broadcasts the mean across every row: no loop needed\n",
    "X_normalised = (X - feature_mean) / feature_std\n",
    "print(f\"normalised:\\n{X_normalised}\")\n",
    "print(f\"new per-feature mean (~0): {X_normalised.mean(axis=0).round(6)}\")"
   ]
  },
  {
   "cell_type": "raw",
   "id": "46",
   "metadata": {
    "raw_mimetype": "text/markdown"
   },
   "source": [
    "> **Broadcasting: shapes align right, size-1 dimensions stretch**\n",
    "\n",
    "```{mermaid}\n",
    "flowchart LR\n",
    "    A[\"(3, 3) array\n",
    "[[1, 2, 3],\n",
    " [4, 5, 6],\n",
    " [7, 8, 9]]\"] -->|\"+ scalar ()\"| R1[\"broadcasts to (3,3)\n",
    "[[11,12,13],\n",
    " [14,15,16],\n",
    " [17,18,19]]\"]\n",
    "    B[\"(3, 3) array\"] -->|\"+ row (1, 3)\n",
    "[10, 20, 30]\"| R2[\"row expands to (3,3)\n",
    "[[11,22,33],\n",
    " [14,25,36],\n",
    " [17,28,39]]\"]\n",
    "    C[\"(3, 1) col\"] -->|\"+ (1, 3) row\n",
    "shapes align on right\"| R3[\"(3,3) result\n",
    "outer-product-like\"]\n",
    "\n",
    "    style R1 fill:#EBF5F0,stroke:#059669,color:#065F46\n",
    "    style R2 fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E\n",
    "    style R3 fill:#F5F3FF,stroke:#7C3AED,color:#3B0764\n",
    "```\n"
   ]
  },
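  {
   "cell_type": "markdown",
   "id": "46a",
   "metadata": {},
   "source": [
    "The three cases in the diagram can be checked directly. The sketch below is illustrative only: the array values mirror the diagram, but the names `grid`, `row`, and `col` (and the values stored in `col`) are assumptions made for this example.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "grid = np.arange(1, 10).reshape(3, 3)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]\n",
    "row = np.array([[10, 20, 30]])  # shape (1, 3)\n",
    "col = np.array([[100], [200], [300]])  # shape (3, 1), hypothetical values\n",
    "\n",
    "plus_scalar = grid + 10  # scalar, shape (), stretches to (3, 3)\n",
    "plus_row = grid + row  # (1, 3) stretches down the rows\n",
    "col_plus_row = col + row  # (3, 1) + (1, 3): both operands stretch to (3, 3)\n",
    "\n",
    "print(plus_scalar)\n",
    "print(plus_row)\n",
    "print(col_plus_row)\n",
    "\n",
    "# np.broadcast_shapes reports the result shape without computing anything\n",
    "result_shape = np.broadcast_shapes((3, 1), (1, 3))\n",
    "print(result_shape)\n",
    "```\n",
    "\n",
    "`np.broadcast_shapes` is a cheap way to test whether two shapes are compatible before running an expensive operation; it raises `ValueError` when they are not."
   ]
  },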
  {
   "cell_type": "markdown",
   "id": "47",
   "metadata": {},
   "source": [
    "The diagram below makes the rule literal: the top row is `feature_mean`, shape `(3,)`. NumPy stretches it down to align with every one of the 5 rows in `X`, without ever actually copying it 5 times in memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48",
   "metadata": {},
   "outputs": [],
   "source": [
    "from ark.plot.diagrams import broadcasting_diagram\n",
    "\n",
    "broadcasting_diagram();"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49",
   "metadata": {},
   "source": [
    "<div style='background:#FEF2F2;border-left:5px solid #DC2626;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#991B1B;font-weight:bold'><i class=\"bi bi-bug-fill\"></i> Common Mistake: Broadcasting a (5,) array against a (5, 3) matrix fails for a subtle reason</span><br><br>\n",
    "If you instead compute <code>per_student_mean = X.mean(axis=1)</code> (shape <code>(5,)</code>) and try <code>X - per_student_mean</code>, NumPy raises <code>ValueError: operands could not be broadcast together</code>. Shapes align from the right, so NumPy compares <code>X</code>'s trailing dimension <code>3</code> with <code>5</code> instead of matching the two row counts as you intended. Fix it by giving the per-row result an explicit column shape with <code>keepdims=True</code>: <code>X.mean(axis=1, keepdims=True)</code> has shape <code>(5, 1)</code>, which broadcasts correctly against <code>(5, 3)</code>.\n",
    "</div>"
   ]
  },
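  {
   "cell_type": "markdown",
   "id": "49a",
   "metadata": {},
   "source": [
    "The failure described above can be reproduced in a few lines. This is a minimal sketch that uses a stand-in `(5, 3)` matrix built with `np.arange` rather than the student data; `as_column` is an illustrative name. It also shows that adding the axis by hand with `np.newaxis` produces the same `(5, 1)` shape as `keepdims=True`.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "X = np.arange(15, dtype=float).reshape(5, 3)  # stand-in (5, 3) feature matrix\n",
    "per_student_mean = X.mean(axis=1)  # shape (5,)\n",
    "\n",
    "try:\n",
    "    X - per_student_mean  # trailing dims are 3 vs 5: incompatible\n",
    "except ValueError as exc:\n",
    "    print(f\"ValueError: {exc}\")\n",
    "\n",
    "# Adding a trailing axis by hand gives the same (5, 1) shape as keepdims=True\n",
    "as_column = per_student_mean[:, np.newaxis]\n",
    "print(as_column.shape)\n",
    "print(np.array_equal(as_column, X.mean(axis=1, keepdims=True)))\n",
    "```"
   ]
  },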
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50",
   "metadata": {},
   "outputs": [],
   "source": [
    "row_mean = X.mean(axis=1, keepdims=True)  # shape (5, 1), NOT (5,)\n",
    "print(f\"row_mean.shape : {row_mean.shape}\")\n",
    "\n",
    "# (5, 3) - (5, 1) broadcasts the per-row mean across every column\n",
    "centered_per_row = X - row_mean\n",
    "print(f\"centered_per_row:\\n{centered_per_row}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 4 - Min-Max Scale Every Feature</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Write a one-line expression that scales every column of <code>X</code> to the <code>[0, 1]</code> range using the formula <code>(X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))</code>. Confirm the result's per-column min is 0 and max is 1.\n",
    "<pre style='background:#FCE8DA;padding:10px;border-radius:4px;font-size:0.9em'>X_scaled.min(axis=0)  # -> array([0., 0., 0.])\n",
    "X_scaled.max(axis=0)  # -> array([1., 1., 1.])</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52",
   "metadata": {},
   "outputs": [],
   "source": [
    "X = np.array(\n",
    "    [\n",
    "        [12.0, 85.0, 3.1],\n",
    "        [5.0, 60.0, 2.4],\n",
    "        [18.0, 95.0, 3.8],\n",
    "        [9.0, 70.0, 2.9],\n",
    "        [22.0, 98.0, 3.9],\n",
    "    ]\n",
    ")\n",
    "\n",
    "X_scaled = ...  # TODO: min-max scale every column to [0, 1]\n",
    "\n",
    "# print(f\"min per column: {X_scaled.min(axis=0)}\")\n",
    "# print(f\"max per column: {X_scaled.max(axis=0)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53",
   "metadata": {},
   "source": [
    "## 8. Vectorisation vs. Python Loops\n",
    "\n",
    "Now that you can express normalisation as `(X - mean) / std` with no loop, it is worth seeing *why* that matters. Every NumPy operation you have used so far runs as a single compiled loop over contiguous memory; a Python `for` loop pays the cost of the interpreter on every single element."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "big = np.random.default_rng(0).normal(size=1_000_000)\n",
    "\n",
    "\n",
    "def zscore_loop(values: np.ndarray) -> list[float]:\n",
    "    mean = sum(values) / len(values)\n",
    "    variance = sum((v - mean) ** 2 for v in values) / len(values)\n",
    "    std = variance**0.5\n",
    "    return [(v - mean) / std for v in values]\n",
    "\n",
    "\n",
    "start = time.perf_counter()\n",
    "_ = zscore_loop(big)\n",
    "loop_time = time.perf_counter() - start\n",
    "\n",
    "start = time.perf_counter()\n",
    "_ = (big - big.mean()) / big.std()\n",
    "vector_time = time.perf_counter() - start\n",
    "\n",
    "print(f\"Python loop time : {loop_time:.4f}s\")\n",
    "print(f\"Vectorised time  : {vector_time:.4f}s\")\n",
    "print(f\"Speedup          : {loop_time / vector_time:,.0f}x\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: If you are writing a <code>for</code> loop over an array, stop and look for a vectorised way</span><br><br>\n",
    "Almost every elementwise transformation, filter, or aggregation you would write as a Python loop already has a NumPy equivalent: arithmetic operators, <code>np.where</code>, boolean masks, <code>axis</code> aggregations. Reach for those first: a hand-written loop over a large array is one of the most common DS/ML performance bugs, and it is usually a 10-100x slowdown for no benefit.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56",
   "metadata": {},
   "source": [
    "## 9. Linear Algebra Essentials\n",
    "\n",
    "A linear model's prediction is a **dot product**: multiply each feature by a learned weight, sum the results, add a bias. `@` (matrix multiplication) computes this for every row of a feature matrix at once, no loop over students required."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "57",
   "metadata": {},
   "outputs": [],
   "source": [
    "X = np.array(\n",
    "    [\n",
    "        [12.0, 85.0, 3.1],\n",
    "        [5.0, 60.0, 2.4],\n",
    "        [18.0, 95.0, 3.8],\n",
    "        [9.0, 70.0, 2.9],\n",
    "        [22.0, 98.0, 3.9],\n",
    "    ]\n",
    ")  # shape (5 students, 3 features)\n",
    "\n",
    "# Suppose an (already-fitted) linear model has these learned weights and bias\n",
    "weights = np.array([1.5, 0.3, 8.0])  # one weight per feature, shape (3,)\n",
    "bias = 10.0\n",
    "\n",
    "# X @ weights: (5, 3) @ (3,) -> (5,) -- one prediction per student\n",
    "predicted_scores = X @ weights + bias\n",
    "print(f\"predicted_scores: {predicted_scores.round(1)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58",
   "metadata": {},
   "source": [
    "`@` on a matrix and a vector is shorthand for \"for each row, multiply element-wise by `weights` and sum\", which is exactly `(X * weights).sum(axis=1)`. Verify the two are equivalent, then measure prediction error against the true scores with `np.linalg.norm`, the Euclidean length of the error vector; dividing it by `sqrt(n)` gives the root-mean-square error (RMSE):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "59",
   "metadata": {},
   "outputs": [],
   "source": [
    "# @ is equivalent to elementwise multiply + sum along axis=1\n",
    "manual = (X * weights).sum(axis=1) + bias\n",
    "print(f\"@ matches manual sum: {np.allclose(predicted_scores, manual)}\")\n",
    "\n",
    "actual_scores = np.array([88.0, 65.0, 95.0, 78.0, 99.0])\n",
    "errors = predicted_scores - actual_scores\n",
    "rmse = np.linalg.norm(errors) / np.sqrt(len(errors))\n",
    "print(f\"errors : {errors.round(1)}\")\n",
    "print(f\"RMSE   : {rmse:.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60",
   "metadata": {},
   "source": [
    "<div style='background:#FEF2F2;border-left:5px solid #DC2626;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#991B1B;font-weight:bold'><i class=\"bi bi-bug-fill\"></i> Common Mistake: Comparing floats with <code>==</code></span><br><br>\n",
    "<code>np.allclose(a, b)</code> was used above instead of <code>(a == b).all()</code> on purpose: floating-point arithmetic accumulates tiny rounding errors, so two mathematically-equal results can differ in their last bit. Always compare floats with a tolerance: <code>np.allclose</code>, or <code>abs(a - b) &lt; 1e-9</code>, never with exact <code>==</code>.\n",
    "</div>"
   ]
  },
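  {
   "cell_type": "markdown",
   "id": "60a",
   "metadata": {},
   "source": [
    "The tolerance itself deserves a second look. `np.isclose` is the elementwise counterpart of `np.allclose`, and both test `|a - b| <= atol + rtol * |b|` with defaults `rtol=1e-05` and `atol=1e-08`. The sketch below uses values picked for illustration and shows that the default absolute tolerance treats very small numbers as equal, which may or may not be what you want.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "a = np.array([0.1 + 0.2, 1e-10])\n",
    "b = np.array([0.3, 2e-10])\n",
    "\n",
    "exact = a == b  # exact comparison\n",
    "default_tol = np.isclose(a, b)  # default rtol and atol\n",
    "relative_only = np.isclose(a, b, rtol=1e-9, atol=0.0)  # purely relative tolerance\n",
    "\n",
    "print(exact)\n",
    "print(default_tol)\n",
    "print(relative_only)\n",
    "```\n",
    "\n",
    "When the data are legitimately tiny (probabilities, learning rates), passing an explicit `atol` suited to that scale is the safer choice."
   ]
  },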
  {
   "cell_type": "markdown",
   "id": "61",
   "metadata": {},
   "source": [
    "## 10. Saving & Loading Arrays\n",
    "\n",
    "A typical pipeline computes a feature matrix once and reuses it across many later steps (training, evaluation, serving). `.npy` stores a single array in NumPy's own binary format: for numeric data it is typically smaller and much faster to read than CSV, and it preserves `dtype` and `shape` exactly. `.npz` bundles several named arrays together in one file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "tmp_dir = Path(\"tmp_numpy_activity\")\n",
    "tmp_dir.mkdir(exist_ok=True)\n",
    "\n",
    "X = np.array([[12.0, 85.0, 3.1], [5.0, 60.0, 2.4], [18.0, 95.0, 3.8]])\n",
    "y = np.array([88.0, 65.0, 95.0])\n",
    "\n",
    "# Save a single array\n",
    "np.save(tmp_dir / \"X.npy\", X)\n",
    "X_loaded = np.load(tmp_dir / \"X.npy\")\n",
    "print(f\"round-trip equal: {np.array_equal(X, X_loaded)}\")\n",
    "\n",
    "# Save several named arrays together in one file\n",
    "np.savez(tmp_dir / \"dataset.npz\", features=X, target=y)\n",
    "# np.load on a .npz keeps the file open; the with-block closes it on exit\n",
    "with np.load(tmp_dir / \"dataset.npz\") as bundle:\n",
    "    print(f\"keys   : {list(bundle.keys())}\")\n",
    "    print(f\"target : {bundle['target']}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63",
   "metadata": {},
   "outputs": [],
   "source": [
    "import shutil\n",
    "\n",
    "shutil.rmtree(tmp_dir)\n",
    "print(f\"cleaned up: {not tmp_dir.exists()}\")  # True once the directory is gone"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64",
   "metadata": {},
   "source": [
    "## 11. Common Gotchas\n",
    "\n",
    "Like the Python gotchas in Part 3, none of these raise an exception. They silently produce a wrong (or surprising) result. Recognise them now so you do not lose hours to them later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65",
   "metadata": {},
   "outputs": [],
   "source": [
    "# GOTCHA 1: basic slicing returns a VIEW, not a copy\n",
    "scores = np.array([62, 78, 85, 91, 55])\n",
    "top_three = scores[:3]\n",
    "top_three[0] = 0  # mutating the \"slice\"...\n",
    "\n",
    "print(f\"top_three : {top_three}\")\n",
    "print(f\"scores    : {scores}\")  # ...changed the original too!\n",
    "\n",
    "# Fix: copy explicitly when you need an independent array\n",
    "safe_copy = scores[:3].copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66",
   "metadata": {},
   "outputs": [],
   "source": [
    "# GOTCHA 2: fixed-width integer dtypes overflow silently (values wrap around)\n",
    "small = np.array([120, 10], dtype=np.int8)\n",
    "print(f\"int8 + 50 : {small + 50}\")  # 120 + 50 = 170 exceeds int8's max of 127 and wraps to -86\n",
    "\n",
    "# Fix: use a wide-enough dtype, or let NumPy infer one (int64/float64 on most platforms)\n",
    "safe = small.astype(np.int64) + 50\n",
    "print(f\"int64 + 50: {safe}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67",
   "metadata": {},
   "outputs": [],
   "source": [
    "# GOTCHA 3: {} is a dict, not a set -- same trap as in plain Python (Part 3, Sec. 8)\n",
    "empty_dict = {}\n",
    "empty_set = set()\n",
    "print(f\"type({{}})   : {type(empty_dict)}\")\n",
    "print(f\"type(set()) : {type(empty_set)}\")\n",
    "\n",
    "# GOTCHA 4: comparing floats with == (see Sec. 9 for the fix: np.allclose)\n",
    "a = np.array([0.1 + 0.2])\n",
    "print(f\"0.1 + 0.2 == 0.3 : {a == 0.3}\")  # False! 0.30000000000000004 != 0.3\n",
    "print(f\"np.allclose      : {np.allclose(a, 0.3)}\")  # True: tolerant comparison"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68",
   "metadata": {},
   "source": [
    "## 12. Capstone Exercises\n",
    "\n",
    "Apply everything from this notebook together. Each exercise is self-contained."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Exercise 1 - Build, Normalise, and Predict</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Using the <code>students</code> dataset below:<br><br>\n",
    "<ol>\n",
    "<li>Build a <code>(6, 3)</code> feature matrix <code>X</code> with <code>np.column_stack</code></li>\n",
    "<li>Z-score normalise <code>X</code> (Sec. 7)</li>\n",
    "<li>Predict <code>exam_score</code> with the given <code>weights</code>/<code>bias</code> using <code>@</code> (Sec. 9), applied to the <b>normalised</b> features</li>\n",
    "<li>Compute the RMSE against <code>actual_scores</code> using <code>np.linalg.norm</code></li>\n",
    "</ol>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70",
   "metadata": {},
   "outputs": [],
   "source": [
    "study_hours = np.array([12, 5, 18, 9, 22, 14])\n",
    "attendance_pct = np.array([85, 60, 95, 70, 98, 80])\n",
    "prior_gpa = np.array([3.1, 2.4, 3.8, 2.9, 3.9, 3.3])\n",
    "actual_scores = np.array([88.0, 65.0, 95.0, 78.0, 99.0, 84.0])\n",
    "\n",
    "weights = np.array([0.8, 0.5, 6.0])\n",
    "bias = 55.0\n",
    "\n",
    "# TODO: 1) build X, 2) normalise it, 3) predict, 4) compute RMSE\n",
    "X = ...\n",
    "X_normalised = ...\n",
    "predicted = ...\n",
    "rmse = ...\n",
    "\n",
    "print(f\"predicted: {predicted}\")\n",
    "print(f\"RMSE     : {rmse:.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "71",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Exercise 2 - Vectorised Anomaly Detector</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Rewrite the deque-based anomaly detector from Part 3 (Sec. 9, Exercise 3) without any explicit loop. Flag any reading more than <code>2</code> standard deviations from the <b>overall</b> mean using a single boolean mask.\n",
    "<pre style='background:#FCE8DA;padding:10px;border-radius:4px;font-size:0.9em'>readings = [36.5, 36.7, 36.8, 36.6, 36.9, 39.5, 36.7, 36.8]\n",
    "# Expected: reading 39.5 (index 5) flagged as anomaly</pre>\n",
    "<b>Hint:</b> <code>z = (readings - readings.mean()) / readings.std()</code>, then mask <code>np.abs(z) > 2</code>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "72",
   "metadata": {},
   "outputs": [],
   "source": [
    "readings = np.array([36.5, 36.7, 36.8, 36.6, 36.9, 39.5, 36.7, 36.8])\n",
    "\n",
    "z_scores = ...  # TODO\n",
    "anomaly_mask = ...  # TODO\n",
    "\n",
    "print(f\"z_scores      : {z_scores.round(2)}\")\n",
    "print(f\"anomaly_mask  : {anomaly_mask}\")\n",
    "print(f\"anomalies     : {readings[anomaly_mask]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73",
   "metadata": {},
   "source": [
    "## What's New in NumPy 2.0\n",
    "\n",
    "NumPy 2.0 (released June 2024) is the first major version bump in almost two decades. Two changes matter most for day-to-day data science work:\n",
    "\n",
    "### Removed type aliases\n",
    "\n",
    "The old Python-builtin aliases (`np.int`, `np.float`, `np.bool`, `np.complex`, `np.object`, `np.str`) were deprecated in NumPy 1.20 and removed in NumPy 1.24, so code that still uses them fails on 2.x as well. They were just aliases to the Python built-ins anyway, so the fix is mechanical. One exception: NumPy 2.0 brought `np.bool` back, this time as the name of NumPy's own boolean type (`np.bool_` remains as an alias), so `np.bool` works again on 2.x but not on 1.24-1.26. NumPy 2.0 itself also removed a further batch of legacy names, for example `np.float_` (use `np.float64`), `np.complex_` (use `np.complex128`), and `np.NaN` (use `np.nan`).\n",
    "\n",
    "| Old alias | Replacement |\n",
    "|---|---|\n",
    "| `np.bool` | `np.bool_` (works on every version) or Python `bool` |\n",
    "| `np.int` | `np.intp` or Python `int` |\n",
    "| `np.float` | `np.float64` or Python `float` |\n",
    "| `np.complex` | `np.complex128` or Python `complex` |\n",
    "| `np.object` | `np.object_` or Python `object` |\n",
    "| `np.str` | `np.str_` or Python `str` |\n",
    "\n",
    "### `StringDType` for proper string arrays\n",
    "\n",
    "NumPy 2.0 introduced `np.dtypes.StringDType()`, a true variable-length string dtype that stores its data as UTF-8. The default `np.str_` dtype is fixed-width UCS-4: every element is padded to the longest string in the array, and each width is a distinct dtype (`<U7`, `<U8`, ...). `StringDType` instead stores each string at its own length.\n",
    "\n",
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: Use <code>np.float64</code> not <code>np.float</code> in new code</span><br><br>\n",
    "If you see a <code>module 'numpy' has no attribute 'float'</code> error, the codebase was written against a NumPy release older than 1.24. A search-and-replace of <code>np.float</code> → <code>np.float64</code> (and so on for the others in the table) is the complete fix; match whole words only, so that existing names such as <code>np.float32</code> or <code>np.floating</code> are left untouched.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "74",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# np.float was removed in NumPy 1.24 and is still absent in 2.x: this would raise AttributeError\n",
    "# arr = np.array([1, 2, 3], dtype=np.float)   # ❌\n",
    "\n",
    "# Use the explicit dtype names instead:\n",
    "arr = np.array([1, 2, 3], dtype=np.float64)  # ✅\n",
    "print(f\"dtype: {arr.dtype}\")\n",
    "\n",
    "# StringDType: variable-length strings, efficient UTF-8 storage (NumPy 2.0+)\n",
    "names = np.array([\"Alice\", \"Bob\", \"Charlie\"], dtype=np.dtypes.StringDType())\n",
    "print(f\"names : {names}\")\n",
    "print(f\"dtype : {names.dtype}\")\n",
    "\n",
    "# The old fixed-width str_ is still available but StringDType is the modern choice\n",
    "old_style = np.array([\"Alice\", \"Bob\", \"Charlie\"])  # infers str_ (fixed width)\n",
    "print(f\"old dtype: {old_style.dtype}\")  # <U7 means Unicode, 7 chars wide"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75",
   "metadata": {},
   "source": [
    "## Further Reading\n",
    "\n",
    "| Resource | Why it matters |\n",
    "|---|---|\n",
    "| Harris, C.R. et al. (2020). [Array programming with NumPy](https://doi.org/10.1038/s41586-020-2649-2). *Nature* 585, 357–362. | The primary citation for NumPy; the paper explains the design decisions behind broadcasting and ufuncs |\n",
    "| VanderPlas, J. (2016). *Python Data Science Handbook*, Ch. 2. O'Reilly. | Free online — the most readable treatment of fancy indexing, broadcasting, and structured arrays |\n",
    "| [NumPy documentation — Broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html) | Official broadcasting rules with diagrams; bookmark for the next time the shapes don't align |\n",
    "| [NumPy documentation — Indexing](https://numpy.org/doc/stable/user/basics.indexing.html) | Covers basic, advanced, and boolean indexing in one place |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "| Concept | Key rule |\n",
    "|---|---|\n",
    "| `ndarray` | One dtype, contiguous memory, fast because it skips the Python interpreter |\n",
    "| Creation | `np.array`, `arange`, `linspace`, `zeros`/`ones`; `np.random.default_rng(seed)` for reproducible random data |\n",
    "| `shape`/`dtype` | Always check before trusting a result; mixed-type input silently upcasts |\n",
    "| `reshape` | Returns a **view** whenever the memory layout allows (otherwise a copy): same data, same `size`, different shape |\n",
    "| `flatten` | Always returns a **copy** |\n",
    "| Slicing | Basic slices (`X[:, 0]`) are views; fancy/boolean indexing always copies |\n",
    "| Boolean masks | `&` / `\\|` (not `and`/`or`) on arrays, each side parenthesised |\n",
    "| `np.where` / `np.select` | Vectorised if/else and multi-branch labelling |\n",
    "| `axis=0` vs `axis=1` | 0 collapses rows -> one value per **column**; 1 collapses columns -> one value per **row** |\n",
    "| Broadcasting | Compare shapes right-to-left; dims match if equal or one is `1`; use `keepdims=True` to broadcast a per-row stat |\n",
    "| Vectorisation | A NumPy expression typically beats a Python loop by 10-100x; look for one before writing `for` |\n",
    "| `@` / `np.dot` | Matrix multiplication: `X @ weights` predicts every row in one call |\n",
    "| `np.allclose` | Always compare floats with a tolerance, never `==` |\n",
    "| `.npy` / `.npz` | Compact, dtype/shape-preserving array storage between pipeline stages |\n",
    "\n",
    "\n",
    "**Next:** `05-matplotlib.ipynb`, covering how to visualise arrays and DataFrames with matplotlib."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ark (3.12.12.final.0)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}