{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "---\n",
    "title: \"Part 1: Language Core (Data & Types)\"\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/01-python-core.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/01-python-core.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2",
   "metadata": {},
   "source": [
    "**DS-MLOps Python Foundations**\n",
    "\n",
    "**Python 3.12+ | Author: Anthony Faustine**\n",
    "\n",
    "Part 1 covers Python's core data vocabulary: variables, types, strings, the four collection types, the standard library's extra collections, and operators. All examples come from a single realistic scenario: a **university analytics platform** that tracks student performance, course enrollment, and model experiment logs.\n",
    "\n",
    "Part 2 (`02-control-flow.ipynb`) continues directly from this notebook with control flow and comprehensions. Read it right after this one to complete the language foundation.\n",
    "\n",
    "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "## Before You Begin\n",
    "\n",
    "### What is Python?\n",
    "\n",
    "**Python** is a general-purpose programming language created in 1991. A programming language is a set of rules for writing instructions a computer can execute. Unlike a spreadsheet, code lets you automate tasks, process millions of data points, and build models that learn from data.\n",
    "\n",
    "### Why Python for data science and AI?\n",
    "\n",
    "Python was not built for data science. It became the de facto standard because of three compounding advantages:\n",
    "\n",
    "**1. Readable syntax.** Python code reads closer to plain English than any other mainstream language. A data scientist can focus on the algorithm, not the language syntax. `for score in scores: total += score` needs no translation.\n",
    "\n",
    "**2. A world-class numerical ecosystem.** The entire scientific Python stack is Python-first:\n",
    "\n",
    "| Library | What it does |\n",
    "|---|---|\n",
    "| **NumPy** | Fast multi-dimensional arrays; the foundation everything else builds on |\n",
    "| **pandas** | Tabular data: load, clean, reshape, merge DataFrames |\n",
    "| **matplotlib / seaborn** | Visualisation: line charts, heatmaps, histograms |\n",
    "| **scikit-learn** | Classical ML: linear models, trees, SVMs, pipelines |\n",
    "| **PyTorch / JAX** | Deep learning: neural networks trained on GPU |\n",
    "| **HuggingFace Transformers** | Large language models and vision models |\n",
    "\n",
    "Every major AI breakthrough in the last decade (ResNet, BERT, GPT, Llama) was released as Python code. Reproducing or building on that research requires Python.\n",
    "\n",
    "**3. Interactive computing with Jupyter.** Jupyter notebooks let you run one cell at a time, see results immediately, and iterate without a compile step. This matches how data exploration actually works: inspect the data, transform it, visualise, repeat.\n",
    "\n",
    "### What is a Jupyter notebook?\n",
    "\n",
    "This file is a **Jupyter notebook**: a document that mixes formatted text (like this paragraph) with executable code. It consists of **cells**:\n",
    "\n",
    "- **Markdown cells** (like this one): formatted text, explanations, tables, equations.\n",
    "- **Code cells** (the grey boxes below): Python code. Press **Shift + Enter** to run\n",
    "  a cell; its output appears directly below.\n",
    "\n",
    "**Always run cells from top to bottom.** Later cells often use variables created in earlier ones. If something breaks, use **Kernel → Restart & Run All** to start fresh.\n",
    "\n",
    "### What we will build together\n",
    "\n",
    "Every example uses the same scenario: a **university analytics platform** tracking student scores, course enrollments, and ML experiment logs. The same data structures recur across every section so you can focus on the Python concept, not a new domain each time.\n",
    "\n",
    "By the end of Part 2 you will have the full language foundation needed to work with real datasets using NumPy and pandas.\n",
    "\n",
    "### Python vs other languages\n",
    "\n",
    "The best way to appreciate Python's readability is side-by-side comparison. Here is \"print Hello\" in three languages:\n",
    "\n",
    "**Java**: 5 lines of ceremony for one instruction:\n",
    "```java\n",
    "public class Hello {\n",
    "    public static void main(String[] args) {\n",
    "        System.out.println(\"Hello\");\n",
    "    }\n",
    "}\n",
    "```\n",
    "\n",
    "**C++**: requires a header and an entry-point function:\n",
    "```cpp\n",
    "#include <iostream>\n",
    "int main() { std::cout << \"Hello\\n\"; }\n",
    "```\n",
    "\n",
    "**Python**: one line that reads like English:\n",
    "```python\n",
    "print(\"Hello\")\n",
    "```\n",
    "\n",
    "This gap widens as programs grow. A 300-line data pipeline in Python stays readable; the equivalent in Java or C++ becomes much harder to navigate. That is why Python dominates explorative, iterative work like data science and ML."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: Running notebooks in VS Code</span><br><br>\n",
    "You can run all notebooks in this book inside VS Code with the <a href=\"../../tutorials/02-dev-tools/00-vscode-setup.qmd\">IDE Setup tutorial (Part 12)</a> guiding you through the setup. VS Code gives you IntelliSense inside cells, a Variable Inspector, and integrated git, all without leaving the editor. If you prefer JupyterLab, skip ahead — every notebook works identically there.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "::: {.callout-note collapse=\"true\" icon=false}\n",
    "## Learning Objectives\n",
    "\n",
    "By the end of Part 1 you will be able to:\n",
    "\n",
    "| # | Skill | Covered in |\n",
    "|---|---|---|\n",
    "| 1 | Annotate variables with type hints (`list[float]`, `X | None`) | Sec. 1 |\n",
    "| 2 | Apply PEP 8 naming conventions (`snake_case`, `PascalCase`, `UPPER_SNAKE`) | Sec. 1.4 |\n",
    "| 3 | Clean, parse, and format strings | Sec. 2 |\n",
    "| 4 | Choose the right collection for any task | Sec. 3-7 |\n",
    "| 5 | Use `dict |` merge, `TypedDict`, and `NamedTuple` | Sec. 5, 4 |\n",
    "| 6 | Apply the walrus operator `:=` where it clarifies code | Sec. 8 |\n",
    "\n",
    "> **Note on forward references:** Sections 2-7 occasionally use `for` loops and `class`\n",
    "> definitions before they are formally introduced. `for` loops are covered in Part 2\n",
    "> (`02-control-flow.ipynb`); classes are covered in Part 3 (`03-python-patterns.ipynb`).\n",
    "> Whenever you see `for item in collection:` early, read it as \"repeat this block once\n",
    "> per item.\" Full explanations follow in their dedicated sections.\n",
    ":::\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6",
   "metadata": {},
   "source": [
    "## 1. Variables, Types & Type Hints\n",
    "\n",
    "### What is a variable?\n",
    "\n",
    "A **variable** is a named container that stores a value in your program's memory. Think of it as a labelled box:\n",
    "\n",
    "```\n",
    "name     ──►  \"Alice Kamau\"\n",
    "gpa      ──►  3.85\n",
    "enrolled ──►  True\n",
    "```\n",
    "\n",
    "You create a variable with the **assignment operator** `=`:\n",
    "\n",
    "```python\n",
    "name = \"Alice Kamau\"   # create a box called 'name', put the value in it\n",
    "```\n",
    "\n",
    "> ⚠️ The `=` sign in Python means **assign** (store this value). It is NOT the\n",
    "> mathematical equals sign. To check equality, use `==` (two equals signs).\n",
    "\n",
    "### Python's four core types\n",
    "\n",
    "Every value has a **type**: a label describing what kind of data it is:\n",
    "\n",
    "| Type | What it stores | Examples | Real-world use |\n",
    "|---|---|---|---|\n",
    "| `int` | Whole numbers | `42`, `2024001`, `-7` | Student IDs, epoch counts, ranks |\n",
    "| `float` | Decimal numbers | `3.85`, `0.001`, `92.3` | GPA, learning rate, accuracy |\n",
    "| `str` | Text (any characters) | `'Alice'`, `\"CS301\"` | Names, labels, file paths |\n",
    "| `bool` | True or False only | `True`, `False` | \"Is enrolled?\", \"Did it converge?\" |\n",
    "\n",
    "Python figures out the type of every value automatically. You never need to declare it.\n",
    "\n",
    "### Why add type hints?\n",
    "\n",
    "Without hints, Python happily lets you store the wrong type in a variable:\n",
    "\n",
    "```python\n",
    "gpa = 3.85       # float ✓\n",
    "gpa = \"unknown\"  # str  - legal but wrong! breaks any later calculation\n",
    "```\n",
    "\n",
    "**Type hints** are optional annotations that make your intent explicit so that tools can catch mistakes like the one above:\n",
    "\n",
    "```python\n",
    "gpa: float = 3.85   # hint says this must be a float\n",
    "```\n",
    "\n",
    "The syntax is `name: type = value`. Hints are **not enforced at runtime**: Python will not crash if you violate them, but the type checker `ty` will report an error the moment you try to assign the wrong type.\n",
    "\n",
    "**Python 3.9+:** `list[int]`, `dict[str, float]` (no imports needed) **Python 3.10+:** `float | None` means \"a float, or nothing\" (replaces `Optional[float]`)\n",
    "\n",
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: Type Hints</span><br><br>\n",
    "A type hint annotates what type a variable <em>should</em> hold: <code>name: str = 'Alice'</code>. Hints are read by the type checker (<code>ty</code>) and your editor, not enforced at runtime. Annotate every variable, function parameter, and return value you write.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7",
   "metadata": {},
   "source": [
    "Start with the simplest possible case: create a few variables and print them. No type hints yet, just the core idea of \"give a name to a value\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your first Python variables: no type hints yet\n",
    "# The = sign puts the value on the right into the name on the left\n",
    "name = \"Alice Kamau\"  # text value (str)\n",
    "score = 87.5  # decimal number (float)\n",
    "rank = 1  # whole number (int)\n",
    "enrolled = True  # True or False (bool)\n",
    "\n",
    "# print() displays a value in the output area below this cell\n",
    "print(name)\n",
    "print(score)\n",
    "print(rank)\n",
    "print(enrolled)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {},
   "source": [
    "Python knows the type of every value. `type()` reveals it, and `isinstance()` tests whether a value belongs to a given type. Run this cell to confirm:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10",
   "metadata": {},
   "outputs": [],
   "source": [
    "# type() tells you what Python has inferred\n",
    "print(type(name))  # <class 'str'>\n",
    "print(type(score))  # <class 'float'>\n",
    "print(type(rank))  # <class 'int'>\n",
    "print(type(enrolled))  # <class 'bool'>\n",
    "\n",
    "# Without hints, Python lets you overwrite with the wrong type: silently\n",
    "rank = \"first\"  # rank was an int, now it's a str: Python allows it\n",
    "print(f\"rank is now a {type(rank).__name__}\")  # str!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11",
   "metadata": {},
   "source": [
    "That last reassignment (`rank = 'first'`) would silently break any code that later tries to do arithmetic with `rank`. Type hints prevent this by making your intent explicit. Now see the same variables with proper annotations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12",
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- Student enrollment record ---\n",
    "student_id: int = 2024001\n",
    "full_name: str = \"Maria Garcia\"\n",
    "gpa: float = 3.85\n",
    "is_enrolled: bool = True\n",
    "scholarship_amount: float | None = None  # union type: float or None (Python 3.10+)\n",
    "\n",
    "print(f\"Student : {full_name} (ID: {student_id})\")\n",
    "print(f\"GPA     : {gpa}  Enrolled: {is_enrolled}\")\n",
    "print(f\"Scholar.: {scholarship_amount}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13",
   "metadata": {},
   "source": [
    "Run this to see Python's runtime type information. `isinstance()` is preferred over `type()` because it handles class hierarchies. `bool` is a subclass of `int`, so `isinstance(True, int)` returns `True`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14",
   "metadata": {},
   "outputs": [],
   "source": [
    "# isinstance() is preferred over type() for checks: handles subclasses\n",
    "print(f\"type(gpa)                      -> {type(gpa)}\")\n",
    "print(f\"isinstance(gpa, float)         -> {isinstance(gpa, float)}\")\n",
    "print(f\"isinstance(gpa, int | float)  -> {isinstance(gpa, int | float)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15",
   "metadata": {},
   "source": [
    "Python also has a built-in **complex number** type, used in signal processing and Fourier analysis:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16",
   "metadata": {},
   "outputs": [],
   "source": [
    "# complex numbers: real + imaginary parts\n",
    "frequency: complex = 3 + 2j  # j is the imaginary unit\n",
    "print(f\"complex   : {frequency}\")\n",
    "print(f\"real part : {frequency.real}\")\n",
    "print(f\"imag part : {frequency.imag}\")\n",
    "print(f\"magnitude : {abs(frequency):.3f}\")  # |z| = sqrt(real² + imag²)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: f-string debugging with <code>=</code></span><br><br>\n",
    "Python 3.8+ added <code>f'{var=}'</code> which prints the variable name <b>and</b> its value in one shot. This is faster than writing <code>print(f'var = {var}')</code> and far more useful during exploration.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18",
   "metadata": {},
   "outputs": [],
   "source": [
    "# f'{var=}': name + value, invaluable for debugging\n",
    "loss: float = 0.4231\n",
    "epoch: int = 12\n",
    "learning_rate: float = 0.001\n",
    "\n",
    "print(f\"{loss=}\")  # loss=0.4231\n",
    "print(f\"{epoch=}\")  # epoch=12\n",
    "print(f\"{learning_rate=}\")  # learning_rate=0.001\n",
    "print(f\"{loss:.4f}\")  # 0.4231  (formatted, no name)\n",
    "print(f\"{loss * 0.9 = }\")  # loss * 0.9 = 0.38078999999999997 (expressions too)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 1 - Annotate a Dataset Row</span><br><br>\n",
    "<b>Goal:</b> Replace each <code>...</code> with the correct type from the table above (<code>int</code>, <code>float</code>, <code>str</code>, <code>bool</code>, or <code>float | None</code>).<br><br> <b>How to decide:</b> look at the value on the right of <code>=</code> and ask: \"Is it a whole number? A decimal? Text? True/False? Could it be missing?\"<br><br> <b>Expected:</b> after filling in the hints, your editor should show no type errors.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: replace each ... with the correct type annotation\n",
    "course_code: ... = \"CS301\"\n",
    "credits: ... = 3\n",
    "pass_rate: ... = 0.87\n",
    "instructor: ... = \"Dr. Nkosi\"\n",
    "lab_room: ... = None  # lab not yet assigned\n",
    "is_core_course: ... = True\n",
    "\n",
    "# When you are done, print each variable with f'{var=}'\n",
    "print(f\"{course_code=}\")\n",
    "print(f\"{credits=}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21",
   "metadata": {},
   "source": [
    "### 1.4 Naming Conventions ([PEP 8](https://peps.python.org/pep-0008/))\n",
    "\n",
    "[PEP 8](https://peps.python.org/pep-0008/) (Python Enhancement Proposal 8) is the **official Python style guide**, written by Python's creator Guido van Rossum. Every serious Python project follows it; the linter `ruff` enforces it automatically (`ruff check .`).\n",
    "\n",
    "Python defines four naming styles. Each signals a specific **role** in the language:\n",
    "\n",
    "\n",
    "#### `snake_case`\n",
    "All lowercase, words joined by underscores. The default style for **everything that is not a class or a constant**: variables, functions, method names, and module file names.\n",
    "\n",
    "```python\n",
    "student_gpa    = 3.85     # variable\n",
    "learning_rate  = 0.001    # variable\n",
    "def load_data(): ...      # function name\n",
    "# module file: data_loader.py\n",
    "```\n",
    "\n",
    "\n",
    "#### `PascalCase` (also called UpperCamelCase)\n",
    "Every word starts with a capital letter; no underscores. Reserved exclusively for **class names**, `NamedTuple`s, and `TypedDict`s -- anything that defines a new **type**.\n",
    "\n",
    "```python\n",
    "class StudentRecord: ...      # class\n",
    "class ModelConfig: ...        # class\n",
    "class ExperimentRow(TypedDict): ...   # TypedDict\n",
    "```\n",
    "\n",
    "\n",
    "#### `UPPER_SNAKE_CASE`\n",
    "All uppercase, words separated by underscores. Use only for **module-level constants**: values set once, never reassigned. The style signals to every reader: \"do not change this.\"\n",
    "\n",
    "```python\n",
    "MAX_EPOCHS       = 100\n",
    "BASE_LEARNING_RATE = 0.001\n",
    "DATASET_PATH     = 'data/students.csv'\n",
    "```\n",
    "\n",
    "\n",
    "#### `_leading_underscore`\n",
    "A single underscore prefix signals that a name is **private / internal** -- an implementation detail not meant to be called from outside the module or class. Python does not enforce this; it is a convention your team respects.\n",
    "\n",
    "```python\n",
    "def _validate_scores(scores): ...  # internal helper\n",
    "_cache: dict[str, float] = {}      # internal state\n",
    "```\n",
    "\n",
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: PEP 8 naming</span><br><br>\n",
    "<code>snake_case</code> for variables &amp; functions &nbsp;|&nbsp; <code>PascalCase</code> for classes &amp; types &nbsp;|&nbsp; <code>UPPER_SNAKE</code> for constants &nbsp;|&nbsp; <code>_leading</code> for internals.<br> The computer ignores these conventions. Your teammates will not. Run <code>ruff check .</code> to catch violations automatically.\n",
    "</div>\n",
    "\n",
    "<div style='background:#FEF2F2;border-left:5px solid #DC2626;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#991B1B;font-weight:bold'><i class=\"bi bi-bug-fill\"></i> Common Mistake: Mixing styles</span><br><br>\n",
    "<code>StudentGPA = 3.85</code> looks like a class (PascalCase), not a variable.<br> <code>LOAD_DATA = lambda: ...</code> looks like a constant, not a function.<br> Misleading names cause bugs that are hard to find. Be consistent.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22",
   "metadata": {},
   "outputs": [],
   "source": [
    "# snake_case: variables and functions\n",
    "max_epochs: int = 100\n",
    "learning_rate: float = 0.001\n",
    "model_accuracy: float = 0.945\n",
    "is_converged: bool = False  # bool names read like a yes/no question\n",
    "student_gpa_scores: list[float] = [3.95, 3.45, 3.88]\n",
    "\n",
    "# UPPER_SNAKE_CASE: module-level constants\n",
    "MAX_BATCH_SIZE: int = 32\n",
    "DATASET_PATH: str = \"data/students.csv\"\n",
    "\n",
    "# Avoid: cryptic abbreviations\n",
    "# lr   = 0.001    # unclear: is this learning rate? loss ratio?\n",
    "# ma   = 0.945    # unclear\n",
    "# b    = 32       # unclear\n",
    "\n",
    "# ruff catches naming violations:\n",
    "#   ruff check tutorials/  -->  E741 Ambiguous variable name: 'l'\n",
    "\n",
    "print(f\"Accuracy: {model_accuracy:.1%}\")  # .1% formats as a percentage\n",
    "print(f\"Converged: {is_converged}\")\n",
    "print(f\"Dataset: {DATASET_PATH}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23",
   "metadata": {},
   "source": [
    "## 2. Strings & String Methods\n",
    "\n",
    "A **string** is any piece of text: a student name, a course code, a log message, a file path. Create one by wrapping text in matching quotes:\n",
    "\n",
    "```python\n",
    "name   = 'Alice Kamau'       # single quotes\n",
    "course = \"Machine Learning\"  # double quotes - both work identically\n",
    "```\n",
    "\n",
    "Strings are used constantly in data science: reading CSV column headers, cleaning field values, building file paths, and formatting model output. Python provides dozens of built-in methods, no imports needed.\n",
    "\n",
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: Strings are Immutable Sequences</span><br><br>\n",
    "A <code>str</code> is an ordered, <b>immutable</b> sequence of Unicode characters. Every string method returns a <b>new</b> string. The original is never changed. In data science you use strings to parse CSV rows, clean field values, build file paths, and format model output. Mastering the handful of methods below covers 95% of string work you will encounter.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24",
   "metadata": {},
   "outputs": [],
   "source": [
    "# f-strings: the standard for formatting output in Python 3.6+\n",
    "name: str = \"Alice Kamau\"\n",
    "score: float = 87.5\n",
    "rank: int = 3\n",
    "\n",
    "print(f\"Student : {name}\")\n",
    "print(f\"Score   : {score:.1f}%\")  # one decimal place\n",
    "print(f\"Score   : {score:.0f}%\")  # rounded to integer\n",
    "print(f\"Rank    : #{rank:02d}\")  # zero-padded two digits\n",
    "print(f\"Pass?   : {'Yes' if score >= 70 else 'No'}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25",
   "metadata": {},
   "source": [
    "Alignment specifiers (`{name:<8}`, `{score:5.1f}`) format values into fixed-width columns. This cell uses a `for` loop for display; for loops are covered properly in Part 2. Read `for name, s in [...]:` as \"for each (name, score) pair in the list, do this\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Alignment: useful for building readable reports\n",
    "for student, s in [(\"Alice\", 92.1), (\"Bob\", 74.8), (\"Carol\", 88.5)]:\n",
    "    bar = \"#\" * int(s // 10)\n",
    "    print(f\"{student:<8} {s:5.1f}  {bar}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: Recognising Older Formatting Styles</span><br><br>\n",
    "You will encounter two older styles in legacy code and tutorials. Know them so you can read them, but write f-strings.<br><br> <code>print(\"Accuracy: %d%%\" % 92)</code> &nbsp;&nbsp; ← <b>%-formatting</b> (Python 2 era, still valid)<br> <code>print(\"Accuracy: {}\".format(92))</code> &nbsp;&nbsp; ← <b>.format()</b> (Python 3.0+, more flexible than %)<br> <code>print(f\"Accuracy: {acc}\")</code> &nbsp;&nbsp; ← <b>f-strings</b> (Python 3.6+, fastest and most readable, use this)\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28",
   "metadata": {},
   "source": [
    "### Cleaning & Parsing\n",
    "Real-world data always arrives dirty: extra spaces, inconsistent delimiters, mixed case. `strip()` + `split()` is the most common two-step clean-up in any data pipeline:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cleaning and parsing: the most common string operations in data work\n",
    "raw_row: str = \"  Alice Kamau , 2024001 , 3.95 , Computer Science  \"\n",
    "\n",
    "# strip() removes leading and trailing whitespace\n",
    "cleaned: str = raw_row.strip()\n",
    "\n",
    "# split() on a delimiter returns a list; strip each part too\n",
    "parts: list[str] = [p.strip() for p in cleaned.split(\",\")]\n",
    "name, sid, gpa_str, major = parts\n",
    "\n",
    "print(f\"Name  : {name!r}\")\n",
    "print(f\"ID    : {sid}\")\n",
    "print(f\"GPA   : {float(gpa_str):.2f}\")\n",
    "print(f\"Major : {major}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30",
   "metadata": {},
   "source": [
    "`join()` is the inverse of `split()`. It reassembles a list of strings into one string with a chosen separator. `replace()` and case methods normalise individual field values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31",
   "metadata": {},
   "outputs": [],
   "source": [
    "# join() is the inverse of split(): reassemble with a new delimiter\n",
    "tsv_row: str = \"\\t\".join(parts)\n",
    "print(f\"TSV   : {tsv_row!r}\")\n",
    "\n",
    "# replace(): swap delimiters or fix typos\n",
    "print(cleaned.replace(\",\", \" |\"))\n",
    "\n",
    "# Case methods\n",
    "tag: str = \"  machine_learning  \"\n",
    "print(tag.strip().replace(\"_\", \" \").title())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32",
   "metadata": {},
   "source": [
    "### Searching & Slicing\n",
    "Test membership, find positions, and count occurrences, all without writing a loop:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Searching strings: common in log parsing and feature extraction\n",
    "log: str = \"[ERROR] epoch 42: validation loss exceeded threshold (loss=1.234)\"\n",
    "\n",
    "print(f\"starts with [ERROR]  : {log.startswith('[ERROR]')}\")\n",
    "print(f\"ends with threshold  : {log.endswith('threshold')}\")\n",
    "print(f'contains \"loss\"      : {\"loss\" in log}')\n",
    "print(f'find \"epoch\"         : index {log.find(\"epoch\")}')\n",
    "print(f'count of \"e\"         : {log.count(\"e\")}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34",
   "metadata": {},
   "source": [
    "String slicing (`s[start:stop]`) extracts a substring by position, using the same syntax as list slicing. `rpartition(sep)` splits at the **last** occurrence of `sep`, returning `(before, sep, after)`, the cleanest way to separate a filename from its extension:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35",
   "metadata": {},
   "outputs": [],
   "source": [
    "log: str = \"[ERROR] epoch 42: validation loss exceeded threshold (loss=1.234)\"\n",
    "\n",
    "# Extract structured data from a log line\n",
    "epoch_part: str = log.split(\"epoch \")[1].split(\":\")[0]\n",
    "print(f\"Epoch number : {epoch_part}\")\n",
    "\n",
    "# Slicing: same rules as lists\n",
    "prefix: str = log[:7]  # '[ERROR]'\n",
    "body: str = log[9:]\n",
    "print(f\"Prefix : {prefix!r}\")\n",
    "print(f\"Body   : {body!r}\")\n",
    "\n",
    "# rpartition(): split at the LAST occurrence of a separator\n",
    "filename: str = \"model_experiment_run_42.parquet\"\n",
    "stem, _, ext = filename.rpartition(\".\")\n",
    "print(f\"stem={stem!r}  ext={ext!r}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 2 - Parse a Messy Log Line</span><br><br>\n",
    "<b>Goal:</b> Extract the model name, epoch, and loss value from the raw log string below into typed variables.<br><br>\n",
    "<pre style='background:#FCE8DA;padding:10px;border-radius:4px;font-size:0.9em'>raw = '  [INFO]  model=RandomForest | epoch=  5 | train_loss=0.3421 | val_loss=0.3812  '\n",
    "\n",
    "# Expected\n",
    "model_name  = 'RandomForest'\n",
    "epoch       = 5\n",
    "train_loss  = 0.3421\n",
    "val_loss    = 0.3812</pre>\n",
    "<b>Hint:</b> Use <code>strip()</code>, <code>split('|')</code>, and <code>split('=')</code>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37",
   "metadata": {},
   "outputs": [],
   "source": [
    "raw: str = \"  [INFO]  model=RandomForest | epoch=  5 | train_loss=0.3421 | val_loss=0.3812  \"\n",
    "\n",
    "# TODO: parse raw into the variables below\n",
    "model_name: str = ...\n",
    "epoch: int = ...\n",
    "train_loss: float = ...\n",
    "val_loss: float = ...\n",
    "\n",
    "# Verify\n",
    "print(f\"{model_name=}  {epoch=}  {train_loss=}  {val_loss=}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38",
   "metadata": {},
   "source": [
    "## 3. Collections: List\n",
    "\n",
    "A **list** is Python's most versatile built-in container: an **ordered, mutable** sequence of items of any type.\n",
    "\n",
    "```python\n",
    "scores  : list[float] = [78.0, 85.5, 92.0]   # floats\n",
    "names   : list[str]   = ['Alice', 'Bob']       # strings\n",
    "mixed   :  list       = [42, 'label', True]    # any types (avoid in practice)\n",
    "```\n",
    "\n",
    "**When to use a list:**\n",
    "- Order matters: items have a defined first and last position\n",
    "- You need to add, remove, or change elements after creation\n",
    "- You are collecting results in a loop: training losses, processed records, file paths\n",
    "\n",
    "**Key operations at a glance:**\n",
    "\n",
    "| Operation | Syntax | Notes |\n",
    "|---|---|---|\n",
    "| Index | `a[i]` | 0-based; negative counts from end |\n",
    "| Slice | `a[start:stop:step]` | Returns new list; stop is exclusive |\n",
    "| Append | `a.append(x)` | Add one item to the end |\n",
    "| Extend | `a.extend(iterable)` | Add all items from another sequence |\n",
    "| Insert | `a.insert(i, x)` | Insert before index `i` |\n",
    "| Remove | `a.remove(x)` | Remove first occurrence of value `x` |\n",
    "| Pop | `a.pop(i)` | Remove & return item at index `i` (default: last) |\n",
    "| Delete | `del a[i]` / `del a[i:j]` | Remove item or slice, returns nothing |\n",
    "| Clear | `a.clear()` | Remove all items (same as `del a[:]`) |\n",
    "| Membership | `x in a` | Returns `True` / `False` |\n",
    "| Length | `len(a)` | Number of items |\n",
    "| Sort | `a.sort()` / `sorted(a)` | In-place vs new list |\n",
    "| Count | `a.count(x)` | Occurrences of `x` |\n",
    "| Index | `a.index(x)` | Position of first `x` |\n",
    "| Copy | `a.copy()` | Shallow independent copy |\n",
    "\n",
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: Ordered &amp; Mutable</span><br><br>\n",
    "A list maintains <b>insertion order</b> and supports <b>in-place modification</b>. Annotate as <code>list[int]</code> (Python 3.9+, no import needed).<br> Full reference: <a href=\"https://docs.python.org/3/tutorial/datastructures.html\" style=\"color:#0369A1\">docs.python.org: 5.1 More on Lists</a>\n",
    "</div>\n",
    "\n",
    "<div style='background:#FEF2F2;border-left:5px solid #DC2626;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#991B1B;font-weight:bold'><i class=\"bi bi-bug-fill\"></i> Common Mistake: Assignment Is Not a Copy</span><br><br>\n",
    "<code>b = a</code> makes <code>b</code> point to the <b>same</b> list. Mutating <code>b</code> also changes <code>a</code>.<br> Use <code>b = a.copy()</code> or <code>b = a[:]</code> for an independent copy.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Quiz scores for a cohort of students\n",
    "quiz_scores: list[float] = [78.0, 85.5, 92.0, 88.5, 95.0, 67.0, 81.0]\n",
    "\n",
    "# Indexing: 0-based; negative index counts from the end\n",
    "print(f\"First  : {quiz_scores[0]}\")\n",
    "print(f\"Last   : {quiz_scores[-1]}\")\n",
    "print(f\"[1:4]  : {quiz_scores[1:4]}\")\n",
    "print(f\"[::2]  : {quiz_scores[::2]}\")  # every other element\n",
    "\n",
    "# Aggregates\n",
    "n: int = len(quiz_scores)\n",
    "mean: float = sum(quiz_scores) / n\n",
    "print(f\"n={n}  min={min(quiz_scores)}  max={max(quiz_scores)}  mean={mean:.1f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40",
   "metadata": {},
   "source": [
    "### Slicing\n",
    "\n",
    "A **slice** extracts a sub-list without modifying the original. The syntax is `a[start : stop : step]`:\n",
    "\n",
    "| Part | Default | Meaning |\n",
    "|---|---|---|\n",
    "| `start` | `0` | Index to begin from (inclusive) |\n",
    "| `stop` | `len(a)` | Index to stop at (**exclusive**: this element is NOT included) |\n",
    "| `step` | `1` | How many positions to advance each time |\n",
    "\n",
    "```python\n",
    "a = [10, 20, 30, 40, 50]\n",
    "a[1:4]    # [20, 30, 40]   - stop=4 is excluded\n",
    "a[:3]     # [10, 20, 30]   - start defaults to 0\n",
    "a[2:]     # [30, 40, 50]   - stop defaults to end\n",
    "a[::2]    # [10, 30, 50]   - every second element\n",
    "a[::-1]   # [50, 40, 30, 20, 10] - reversed\n",
    "a[:]      # [10, 20, 30, 40, 50] - full copy (shallow)\n",
    "```\n",
    "\n",
    "Slicing never raises an `IndexError`. Out-of-range start/stop are clamped silently."
   ]
  },
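  {
   "cell_type": "markdown",
   "id": "40a",
   "metadata": {},
   "source": [
    "The clamping rule is easiest to see next to plain indexing, which does raise. A minimal sketch, using a made-up list `a` that is not part of the platform data:\n",
    "\n",
    "```python\n",
    "a = [10, 20, 30, 40, 50]\n",
    "\n",
    "# Out-of-range slice bounds are clamped to the ends of the list\n",
    "beyond_end = a[3:100]      # stop is clamped to len(a)\n",
    "past_everything = a[10:]   # start beyond the end gives an empty list\n",
    "\n",
    "# Indexing a single out-of-range position raises IndexError instead\n",
    "try:\n",
    "    a[10]\n",
    "    index_failed = False\n",
    "except IndexError:\n",
    "    index_failed = True\n",
    "\n",
    "print(beyond_end, past_everything, index_failed)\n",
    "```\n",
    "\n",
    "A slice is therefore safe to take without checking lengths first, while a single index is not."
   ]
  },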
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "41",
   "metadata": {},
   "outputs": [],
   "source": [
    "quiz_scores: list[float] = [78.0, 85.5, 92.0, 88.5, 95.0, 67.0, 81.0]\n",
    "\n",
    "# Basic slices\n",
    "print(f\"First 3     : {quiz_scores[:3]}\")  # [78.0, 85.5, 92.0]\n",
    "print(f\"Last 3      : {quiz_scores[-3:]}\")  # [95.0, 67.0, 81.0]\n",
    "print(f\"Middle      : {quiz_scores[2:5]}\")  # [92.0, 88.5, 95.0]\n",
    "\n",
    "# Step\n",
    "print(f\"Every 2nd   : {quiz_scores[::2]}\")  # [78.0, 92.0, 95.0, 81.0]\n",
    "print(f\"Reversed    : {quiz_scores[::-1]}\")\n",
    "\n",
    "# Shallow copy via slice\n",
    "copy_via_slice: list[float] = quiz_scores[:]\n",
    "copy_via_slice[0] = 0.0\n",
    "print(f\"Original[0] : {quiz_scores[0]}\")  # unchanged: 78.0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42",
   "metadata": {},
   "source": [
    "`=` copies the **reference**, not the data. Both names then point to the same list in memory. Confirm the difference between a reference and an independent copy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copy vs reference: a critical distinction\n",
    "quiz_scores: list[float] = [78.0, 85.5, 92.0, 88.5, 95.0, 67.0, 81.0]\n",
    "\n",
    "backup: list[float] = quiz_scores.copy()  # independent copy\n",
    "ref: list[float] = quiz_scores  # same object!\n",
    "quiz_scores[0] = 99.0\n",
    "\n",
    "print(\"After quiz_scores[0] = 99.0:\")\n",
    "print(f\"  quiz_scores[0] : {quiz_scores[0]}\")\n",
    "print(f\"  ref[0]         : {ref[0]}\")  # also changed: same object\n",
    "print(f\"  backup[0]      : {backup[0]}\")  # unchanged: independent copy"
   ]
  },
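  {
   "cell_type": "markdown",
   "id": "43a",
   "metadata": {},
   "source": [
    "`.copy()` and `a[:]` are *shallow* copies: the outer list is new, but any nested mutable items are still shared. A small sketch, assuming a hypothetical nested list `weekly_scores`; `copy.deepcopy` from the standard library copies the nested lists as well:\n",
    "\n",
    "```python\n",
    "import copy\n",
    "\n",
    "weekly_scores = [[78.0, 85.5], [92.0, 88.5]]   # hypothetical nested data\n",
    "\n",
    "shallow = weekly_scores.copy()         # new outer list, same inner lists\n",
    "deep = copy.deepcopy(weekly_scores)    # new outer list and new inner lists\n",
    "\n",
    "weekly_scores[0][0] = 0.0              # mutate an inner list\n",
    "\n",
    "print(shallow[0][0])   # 0.0: the inner list is shared with the original\n",
    "print(deep[0][0])      # 78.0: fully independent\n",
    "```\n",
    "\n",
    "For a flat list of numbers or strings, as in the cell above, a shallow copy is all you need."
   ]
  },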
  {
   "cell_type": "markdown",
   "id": "44",
   "metadata": {},
   "source": [
    "### Modifying Lists\n",
    "\n",
    "**Mutability** means a value can be changed after it is created. A list is **mutable**: you can add, remove, or replace any element at any time, without creating a new list. This is unlike strings and tuples, which are **immutable**: once created, their contents cannot change.\n",
    "\n",
    "| Type | Mutable? | What it means |\n",
    "|---|---|---|\n",
    "| `list` | Yes | Change any element, add or remove items freely: `scores[0] = 99` |\n",
    "| `str` | No | Methods like `.upper()` return a *new* string; the original is untouched |\n",
    "| `tuple` | No | Elements are fixed at creation and cannot be reassigned |\n",
    "\n",
    "Because lists are mutable, the methods below **modify the original list in place** and return `None`, not a new list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45",
   "metadata": {},
   "outputs": [],
   "source": [
    "scores: list[float] = [85.0, 92.0, 78.0, 65.0, 88.0]\n",
    "\n",
    "#: Adding items --\n",
    "scores.append(95.0)  # add one item to the end       [85, 92, 78, 65, 88, 95]\n",
    "scores.insert(1, 90.0)  # insert 90.0 before index 1\n",
    "scores.extend([81.5, 76.0])  # add all items from another list\n",
    "\n",
    "#: Removing items --\n",
    "scores.remove(65.0)  # remove first occurrence of 65.0 (raises ValueError if absent)\n",
    "last = scores.pop()  # remove and return last item\n",
    "second = scores.pop(1)  # remove and return item at index 1\n",
    "del scores[0]  # remove item at index 0 (no return value)\n",
    "# del scores[1:3]             # delete a slice: removes multiple items at once\n",
    "\n",
    "#: Membership test --\n",
    "print(f\"95.0 in scores   : {95.0 in scores}\")  # True / False\n",
    "print(f\"999.0 in scores  : {999.0 in scores}\")\n",
    "\n",
    "print(f\"scores : {scores}\")\n",
    "print(f\"popped : last={last}, second={second}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46",
   "metadata": {},
   "source": [
    "### List as a Stack & clear()\n",
    "\n",
    "A **stack** is a Last-In, First-Out (LIFO) structure: the last item appended is the first one popped. Lists implement this naturally with `append()` + `pop()`.\n",
    "\n",
    "`clear()` removes all items from the list in place (equivalent to `del a[:]`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "47",
   "metadata": {},
   "outputs": [],
   "source": [
    "# List as a LIFO stack: useful for depth-first search, undo history, backtracking\n",
    "call_stack: list[str] = []\n",
    "\n",
    "# Push\n",
    "call_stack.append(\"load_data\")\n",
    "call_stack.append(\"clean_data\")\n",
    "call_stack.append(\"train_model\")\n",
    "print(f\"Stack (top is last) : {call_stack}\")\n",
    "\n",
    "# Pop (LIFO order)\n",
    "while call_stack:\n",
    "    task = call_stack.pop()\n",
    "    print(f\"  Processing: {task}\")\n",
    "\n",
    "print(f\"Stack after popping : {call_stack}\")\n",
    "\n",
    "# clear(): empty a list in place (the name 'call_stack' still exists)\n",
    "call_stack.extend([\"task_a\", \"task_b\", \"task_c\"])\n",
    "call_stack.clear()\n",
    "print(f\"After clear()       : {call_stack}\")  # []"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48",
   "metadata": {},
   "source": [
    "`sorted()` returns a **new** sorted list; `.sort()` modifies the list **in place** and returns `None`. Assigning the result of `.sort()` is a common silent bug:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49",
   "metadata": {},
   "outputs": [],
   "source": [
    "# sorted() returns a new list; .sort() modifies in place\n",
    "ascending: list[float] = sorted(scores)\n",
    "descending: list[float] = sorted(scores, reverse=True)\n",
    "print(f\"asc    : {ascending}\")\n",
    "print(f\"desc   : {descending}\")\n",
    "\n",
    "# Search\n",
    "print(f\"count of 85.0 : {scores.count(85.0)}\")\n",
    "print(f\"index of 92.0 : {scores.index(92.0)}\")"
   ]
  },
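  {
   "cell_type": "markdown",
   "id": "49a",
   "metadata": {},
   "source": [
    "The silent bug mentioned above looks like this. A minimal sketch with sample values chosen for illustration (it does not reuse the notebook's `scores`):\n",
    "\n",
    "```python\n",
    "scores = [92.0, 78.0, 88.0]   # sample values for illustration\n",
    "\n",
    "wrong = scores.sort()         # .sort() sorts in place and returns None\n",
    "print(wrong)                  # None\n",
    "print(scores)                 # [78.0, 88.0, 92.0]: the list itself was sorted\n",
    "\n",
    "# Either sort in place and keep using the same name...\n",
    "scores.sort(reverse=True)\n",
    "# ...or ask sorted() for a new list and leave the original alone\n",
    "ascending = sorted(scores)\n",
    "```\n",
    "\n",
    "No exception is raised at the assignment; the problem only surfaces later, when code tries to use `wrong` as a list."
   ]
  },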
  {
   "cell_type": "markdown",
   "id": "50",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 3 - Summarise a Score List</span><br><br>\n",
    "<b>Goal:</b> Given the raw scores below, produce a cleaned, sorted list and a summary string.<br><br>\n",
    "<pre style='background:#FCE8DA;padding:10px;border-radius:4px;font-size:0.9em'>raw = [91.0, None, 74.5, 88.0, None, 63.0, 95.5, 80.0]\n",
    "\n",
    "# Expected output\n",
    "clean = [63.0, 74.5, 80.0, 88.0, 91.0, 95.5]   # sorted, None removed\n",
    "summary = 'n=6  min=63.0  max=95.5  mean=82.0'</pre>\n",
    "<b>Hint:</b> Filter with a list comprehension, then use <code>sorted()</code>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51",
   "metadata": {},
   "outputs": [],
   "source": [
    "raw: list[float | None] = [91.0, None, 74.5, 88.0, None, 63.0, 95.5, 80.0]\n",
    "\n",
    "# TODO: build clean (filtered + sorted) and print summary\n",
    "clean: list[float] = ...\n",
    "print(f\"clean   : {clean}\")\n",
    "print(f\"n={len(clean)}  min={min(clean)}  max={max(clean)}  mean={sum(clean) / len(clean):.1f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52",
   "metadata": {},
   "source": [
    "## 4. Collections: Tuple & NamedTuple\n",
    "\n",
    "A **tuple** is an **ordered, immutable** sequence, similar to a list, but its contents are fixed at creation. You cannot add, remove, or change any element.\n",
    "\n",
    "**Immutable** means locked. Once you write `coords = (1.29, 36.82)`, those two numbers cannot be replaced. This is intentional: immutability makes tuples safe to use as dictionary keys, pass between functions, and share across threads without risk of accidental modification.\n",
    "\n",
    "**When to use a tuple:**\n",
    "- The number of elements is fixed by design (a coordinate pair is always 2 values)\n",
    "- Returning multiple values from a function (Python packs them into a tuple)\n",
    "- You need a hashable key for a `dict` or `set` (lists cannot be dict keys)\n",
    "- Signalling to a reader that this data must not change\n",
    "\n",
    "**Key operations at a glance:**\n",
    "\n",
    "| Operation | Syntax | Notes |\n",
    "|---|---|---|\n",
    "| Index | `t[i]` | Same as list; negative index counts from end |\n",
    "| Slice | `t[start:stop:step]` | Returns a new tuple |\n",
    "| Unpack | `a, b, c = t` | Assign each element to a name |\n",
    "| Extended unpack | `first, *rest = t` | `*rest` collects remaining into a list |\n",
    "| Swap | `a, b = b, a` | Pythonic; no temporary variable needed |\n",
    "| Length | `len(t)` | Number of elements |\n",
    "| Membership | `x in t` | `True` / `False` |\n",
    "| Count | `t.count(x)` | Number of occurrences of `x` |\n",
    "| Find | `t.index(x)` | Index of first occurrence of `x` |\n",
    "| Concatenate | `t1 + t2` | Returns a new, longer tuple |\n",
    "\n",
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: Ordered &amp; Immutable</span><br><br>\n",
    "Use a <code>tuple</code> for data that <b>must not change</b>: coordinate pairs, database rows, function return values. Annotate the type of each position: <code>tuple[str, int, float]</code>.<br><br> <code>typing.NamedTuple</code> adds field names and type hints, giving you a lightweight, typed, self-documenting record with zero runtime overhead over a plain tuple.\n",
    "</div>\n",
    "\n",
    "> **Class syntax note:** `NamedTuple` uses the `class` keyword.\n",
    "> Full class mechanics are covered in Part 3. For now, read\n",
    "> `class Foo(NamedTuple):` as \"define a named-tuple type called Foo with these fields.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Tuple: annotate with the exact types of each position\n",
    "record: tuple[str, int, float] = (\"Alice Kamau\", 2024001, 3.95)\n",
    "\n",
    "# Unpack all elements at once\n",
    "name, student_id, gpa = record\n",
    "print(f\"{name=}  {student_id=}  {gpa=}\")\n",
    "\n",
    "# Extended unpacking with *\n",
    "first, *middle, last = (82.0, 91.5, 74.0, 88.0, 95.5)\n",
    "print(f\"{first=}  {middle=}  {last=}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54",
   "metadata": {},
   "source": [
    "Python's swap idiom packs two values into a tuple and immediately unpacks them in the opposite order, no temporary variable needed. Tuples also enforce immutability at runtime:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "55",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pythonic variable swap: no temp variable needed\n",
    "x, y = \"train\", \"val\"\n",
    "x, y = y, x\n",
    "print(f\"After swap: {x=}  {y=}\")\n",
    "\n",
    "# Immutability: tuples cannot be changed after creation\n",
    "record: tuple[str, int, float] = (\"Alice Kamau\", 2024001, 3.95)\n",
    "try:\n",
    "    record[0] = \"Bob\"  # type: ignore[index]\n",
    "except TypeError as exc:\n",
    "    print(f\"Immutable: {exc}\")"
   ]
  },
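  {
   "cell_type": "markdown",
   "id": "55a",
   "metadata": {},
   "source": [
    "Because a tuple of hashable values is itself hashable, it can serve as a dictionary key, which a list cannot. A minimal sketch, assuming a hypothetical lookup table keyed by `(course_code, semester)` pairs (dicts are covered in Section 5):\n",
    "\n",
    "```python\n",
    "# Hypothetical lookup: (course_code, semester) -> enrollment\n",
    "enrollment = {\n",
    "    ('CS301', 'Fall'): 42,\n",
    "    ('CS301', 'Spring'): 38,\n",
    "}\n",
    "fall_count = enrollment[('CS301', 'Fall')]\n",
    "\n",
    "# A list is mutable, therefore unhashable, and is rejected as a key\n",
    "try:\n",
    "    enrollment[['CS301', 'Fall']] = 40\n",
    "    list_key_ok = True\n",
    "except TypeError:\n",
    "    list_key_ok = False\n",
    "\n",
    "print(fall_count, list_key_ok)\n",
    "```"
   ]
  },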
  {
   "cell_type": "markdown",
   "id": "56",
   "metadata": {},
   "source": [
    "### NamedTuple: Named, Typed Fields\n",
    "`NamedTuple` gives a plain tuple field names and type annotations. It uses `class` syntax (see the note in the section header). For now, read this as \"create a named tuple type with these typed fields\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "57",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import NamedTuple\n",
    "\n",
    "\n",
    "class StudentRecord(NamedTuple):\n",
    "    \"\"\"Typed, immutable student record.\"\"\"\n",
    "\n",
    "    name: str\n",
    "    student_id: int\n",
    "    gpa: float\n",
    "    major: str = \"Undeclared\"  # field with default value"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58",
   "metadata": {},
   "source": [
    "Create instances by calling the class like a function. `__repr__` is generated automatically: field names appear in the output:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "59",
   "metadata": {},
   "outputs": [],
   "source": [
    "alice = StudentRecord(\"Alice Kamau\", 2024001, 3.95, \"Computer Science\")\n",
    "bob = StudentRecord(\"Bob Mwangi\", 2024002, 3.45)  # uses default major\n",
    "\n",
    "print(alice)\n",
    "print(bob)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60",
   "metadata": {},
   "source": [
    "Access fields by name for readability or by index for tuple-compatible tools. `_replace()` returns a **new** record with selected fields updated. The original is immutable and unchanged:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Access by name (readable) or by index (tuple-compatible)\n",
    "print(f\"{alice.name}, GPA: {alice.gpa}\")\n",
    "print(f\"By index alice[2]: {alice[2]}\")\n",
    "\n",
    "# _replace() creates a new record with selected fields changed\n",
    "alice_updated = alice._replace(gpa=3.97)\n",
    "print(f\"Updated: {alice_updated}\")\n",
    "\n",
    "# NamedTuples unpack just like plain tuples\n",
    "name, sid, gpa, major = alice\n",
    "print(f\"Unpacked: {name}, {major}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62",
   "metadata": {},
   "source": [
    "## 5. Collections: Dict\n",
    "\n",
    "A **dictionary** (`dict`) maps unique **keys** to **values**. Think of it as a lookup table: given a key, you get back its associated value in O(1) time: instantly, regardless of how many entries the dict contains.\n",
    "\n",
    "Unlike a list (where you access items by numeric position), a dict lets you access data by a meaningful label:\n",
    "\n",
    "```python\n",
    "student = {'name': 'Alice', 'gpa': 3.95, 'enrolled': True}\n",
    "student['gpa']       # 3.95  - by label, not by position\n",
    "student.get('age')   # None  - safe access, no KeyError\n",
    "```\n",
    "\n",
    "**When to use a dict:**\n",
    "- Access by name: student record, model config, API response payload\n",
    "- Counting occurrences: `{'cat': 3, 'dog': 1, 'bird': 2}`\n",
    "- Grouping: `{course_id: [student, student, ...]}`\n",
    "\n",
    "Python 3.7+ dicts preserve **insertion order**: you get keys back in the order you added them.\n",
    "\n",
    "**Key operations at a glance:**\n",
    "\n",
    "| Operation | Syntax | Notes |\n",
    "|---|---|---|\n",
    "| Access | `d[key]` | Raises `KeyError` if key is missing |\n",
    "| Safe access | `d.get(key, default)` | Returns `default` (or `None`) if key missing |\n",
    "| Add / update | `d[key] = value` | Creates key if absent; overwrites if present |\n",
    "| Bulk update | `d.update(other)` | Merge another dict or iterable of pairs |\n",
    "| Remove | `d.pop(key)` | Remove and return value; `KeyError` if absent |\n",
    "| Remove (safe) | `d.pop(key, default)` | Returns `default` instead of raising |\n",
    "| Delete | `del d[key]` | Remove key in place; no return value |\n",
    "| Clear | `d.clear()` | Remove all pairs; dict remains (now empty) |\n",
    "| Membership | `key in d` | Checks keys only, O(1) |\n",
    "| Keys | `d.keys()` | Live view of all keys |\n",
    "| Values | `d.values()` | Live view of all values |\n",
    "| Pairs | `d.items()` | Live view of `(key, value)` tuples, used in `for` loops |\n",
    "| Length | `len(d)` | Number of key-value pairs |\n",
    "| Merge (3.9+) | `a \\| b` | New merged dict; right side wins on conflicts |\n",
    "| Merge in-place | `a \\|= b` | Update `a` with `b` in place |\n",
    "| Copy | `d.copy()` | Shallow independent copy |\n",
    "\n",
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: Key-Value Map</span><br><br>\n",
    "A <code>dict</code> maps unique, hashable keys to values. Insertion order is preserved (Python 3.7+). Use <code>dict[str, float]</code> to annotate key and value types.<br><br> <code>TypedDict</code> (Python 3.8+) defines a typed schema for a dict, essential for model configs and API payloads where every key and its type must be known.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Course record as a dict\n",
    "course: dict[str, object] = {\n",
    "    \"code\": \"CS301\",\n",
    "    \"title\": \"Machine Learning\",\n",
    "    \"credits\": 3,\n",
    "    \"enrollment\": 42,\n",
    "    \"pass_rate\": 0.87,\n",
    "}\n",
    "\n",
    "# Access: [] raises KeyError on missing key; .get() returns a default\n",
    "print(course[\"title\"])\n",
    "print(course.get(\"lab_room\", \"TBA\"))\n",
    "\n",
    "# Membership checks keys\n",
    "print(f'\"pass_rate\" in course  : {\"pass_rate\" in course}')\n",
    "print(f'\"semester\" in course   : {\"semester\" in course}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64",
   "metadata": {},
   "source": [
    "### Modifying a Dict\n",
    "\n",
    "Dicts are **mutable**: you can add, change, and remove keys after creation. `.pop()` removes a key and returns its value. `.items()` gives `(key, value)` pairs for iteration *(for loops are covered in Part 2)*:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add / update / remove\n",
    "course[\"lab_room\"] = \"Lab 3A\"\n",
    "course.update({\"enrollment\": 45, \"semester\": \"Fall 2024\"})\n",
    "semester = course.pop(\"semester\")  # remove and return\n",
    "\n",
    "# Iterate over all key-value pairs\n",
    "for key, value in course.items():\n",
    "    print(f\"  {key:<12} : {value}\")"
   ]
  },
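  {
   "cell_type": "markdown",
   "id": "65a",
   "metadata": {},
   "source": [
    "The counting and grouping use cases listed at the top of this section both rest on safe access. A minimal sketch with made-up label and enrollment data: `dict.get` supplies a starting count, and `dict.setdefault` creates the empty list the first time a key is seen:\n",
    "\n",
    "```python\n",
    "labels = ['cat', 'dog', 'cat', 'bird', 'cat']   # sample data\n",
    "pairs = [('CS301', 'Alice'), ('CS302', 'Bob'), ('CS301', 'Carol')]\n",
    "\n",
    "# Counting: start from 0 when the label has not been seen yet\n",
    "counts = {}\n",
    "for label in labels:\n",
    "    counts[label] = counts.get(label, 0) + 1\n",
    "\n",
    "# Grouping: setdefault returns the existing list or inserts a new empty one\n",
    "by_course = {}\n",
    "for course_id, student in pairs:\n",
    "    by_course.setdefault(course_id, []).append(student)\n",
    "\n",
    "print(counts)      # {'cat': 3, 'dog': 1, 'bird': 1}\n",
    "print(by_course)   # {'CS301': ['Alice', 'Carol'], 'CS302': ['Bob']}\n",
    "```\n",
    "\n",
    "Section 7 introduces `Counter` and `defaultdict`, which package these two patterns."
   ]
  },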
  {
   "cell_type": "markdown",
   "id": "66",
   "metadata": {},
   "source": [
    "### Dict Merge (Python 3.9+)\n",
    "`a | b` creates a **new** merged dict; the right-hand side wins on key conflicts. `a |= b` merges `b` into `a` in place. This replaces the older `{**a, **b}` pattern:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Python 3.9+ dict merge operator | and |=\n",
    "# Replaces the older {**a, **b} pattern: cleaner and faster\n",
    "\n",
    "default_config: dict[str, object] = {\n",
    "    \"learning_rate\": 0.001,\n",
    "    \"epochs\": 10,\n",
    "    \"batch_size\": 32,\n",
    "    \"optimizer\": \"adam\",\n",
    "}\n",
    "\n",
    "run_overrides: dict[str, object] = {\n",
    "    \"epochs\": 50,  # override\n",
    "    \"batch_size\": 64,  # override\n",
    "    \"dropout\": 0.2,  # new key\n",
    "}\n",
    "\n",
    "# | creates a NEW merged dict; right side wins on key conflicts\n",
    "run_config = default_config | run_overrides\n",
    "print(\"Merged run config:\")\n",
    "for k, v in run_config.items():\n",
    "    print(f\"  {k:<16}: {v}\")\n",
    "\n",
    "# |= updates the dict in place\n",
    "default_config |= {\"weight_decay\": 1e-4}\n",
    "print(f\"\\ndefault_config after |=: {default_config}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68",
   "metadata": {},
   "source": [
    "### TypedDict: Typed Schema for a Dict\n",
    "`TypedDict` defines which keys a dict must have and the type of each value. It uses `class` syntax (see section header note). At runtime it is a plain `dict` with zero overhead. The schema is enforced only by the type checker:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import TypedDict\n",
    "\n",
    "\n",
    "class ModelConfig(TypedDict):\n",
    "    learning_rate: float\n",
    "    epochs: int\n",
    "    batch_size: int\n",
    "    optimizer: str\n",
    "\n",
    "\n",
    "class ExperimentResult(TypedDict):\n",
    "    run_id: str\n",
    "    accuracy: float\n",
    "    val_loss: float"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "70",
   "metadata": {},
   "source": [
    "Annotate a variable with your `TypedDict` class. The type checker flags wrong key names or value types. `type(config)` at runtime confirms it is simply a `dict`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "71",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TypedDict is a plain dict at runtime: no overhead\n",
    "# ty checks that keys and value types match the schema\n",
    "config: ModelConfig = {\n",
    "    \"learning_rate\": 0.001,\n",
    "    \"epochs\": 50,\n",
    "    \"batch_size\": 32,\n",
    "    \"optimizer\": \"adam\",\n",
    "}\n",
    "\n",
    "result: ExperimentResult = {\n",
    "    \"run_id\": \"exp-2024-001\",\n",
    "    \"accuracy\": 0.923,\n",
    "    \"val_loss\": 0.218,\n",
    "}\n",
    "\n",
    "print(f\"Config : {config}\")\n",
    "print(f\"Result : {result}\")\n",
    "print(f\"Accuracy: {result['accuracy']:.1%}\")\n",
    "print(f\"type(config): {type(config)}\")"
   ]
  },
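  {
   "cell_type": "markdown",
   "id": "71a",
   "metadata": {},
   "source": [
    "Since the schema is checked only statically, Python itself will not stop a value of the wrong type from being stored. A minimal sketch, using a cut-down two-field `ModelConfig` redefined here so the snippet is self-contained:\n",
    "\n",
    "```python\n",
    "from typing import TypedDict\n",
    "\n",
    "\n",
    "class ModelConfig(TypedDict):\n",
    "    learning_rate: float\n",
    "    epochs: int\n",
    "\n",
    "\n",
    "# A type checker would flag this value; Python itself runs it without complaint\n",
    "bad: ModelConfig = {'learning_rate': 'fast', 'epochs': 50}  # type: ignore\n",
    "\n",
    "print(type(bad))                    # <class 'dict'>\n",
    "print(type(bad['learning_rate']))   # <class 'str'>\n",
    "```\n",
    "\n",
    "When values come from outside your program, such as a config file or an API response, validate them explicitly at runtime as well."
   ]
  },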
  {
   "cell_type": "markdown",
   "id": "72",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 4 - Merge Experiment Configs</span><br><br>\n",
    "<b>Goal:</b> Use the <code>|</code> operator to produce a final run config where <code>overrides</code> wins on conflicts, then add a <code>run_id</code> key.\n",
    "<pre style='background:#FCE8DA;padding:10px;border-radius:4px;font-size:0.9em'>base = {'lr': 0.01, 'epochs': 5, 'optimizer': 'sgd'}\n",
    "overrides = {'lr': 0.001, 'epochs': 20}\n",
    "\n",
    "# Expected\n",
    "final = {'lr': 0.001, 'epochs': 20, 'optimizer': 'sgd', 'run_id': 'run-001'}</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73",
   "metadata": {},
   "outputs": [],
   "source": [
    "base: dict[str, object] = {\"lr\": 0.01, \"epochs\": 5, \"optimizer\": \"sgd\"}\n",
    "overrides: dict[str, object] = {\"lr\": 0.001, \"epochs\": 20}\n",
    "\n",
    "# TODO: merge and add run_id\n",
    "final: dict[str, object] = ...\n",
    "print(f\"final: {final}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74",
   "metadata": {},
   "source": [
    "## 6. Collections: Set\n",
    "\n",
    "A **set** is an **unordered** collection of **unique** values. Duplicates are discarded automatically. You never need to deduplicate manually.\n",
    "\n",
    "Two properties make sets special:\n",
    "\n",
    "1. **Uniqueness**: every value appears at most once, always\n",
    "2. **O(1) membership testing**: `x in my_set` takes the same time whether the set\n",
    "   has 10 or 10,000,000 items. The equivalent `x in my_list` slows down linearly.\n",
    "\n",
    "**When to use a set:**\n",
    "- Removing duplicates from a list: `unique = set(my_list)`\n",
    "- Fast membership check: `if label in valid_labels:`\n",
    "- Data pipeline integrity: find overlap or difference between train/val/test IDs\n",
    "\n",
    "**Key operations at a glance:**\n",
    "\n",
    "| Operation | Syntax / Method | Notes |\n",
    "|---|---|---|\n",
    "| Create | `{1, 2, 3}` or `set(iterable)` | `{}` creates a **dict**, use `set()` for empty |\n",
    "| Add | `s.add(x)` | No effect if `x` already present |\n",
    "| Remove | `s.remove(x)` | Raises `KeyError` if `x` absent |\n",
    "| Remove (safe) | `s.discard(x)` | No error if `x` absent |\n",
    "| Pop | `s.pop()` | Remove and return an arbitrary element |\n",
    "| Clear | `s.clear()` | Remove all elements |\n",
    "| Membership | `x in s` | O(1), instant regardless of set size |\n",
    "| Length | `len(s)` | Number of elements |\n",
    "| Union | `s \\| t` or `s.union(t)` | All elements from both sets |\n",
    "| Intersection | `s & t` or `s.intersection(t)` | Elements present in both |\n",
    "| Difference | `s - t` or `s.difference(t)` | In `s` but not in `t` |\n",
    "| Symmetric diff | `s ^ t` or `s.symmetric_difference(t)` | In one but not both |\n",
    "| Subset | `s <= t` or `s.issubset(t)` | Every element of `s` is in `t` |\n",
    "| Superset | `s >= t` or `s.issuperset(t)` | Every element of `t` is in `s` |\n",
    "| Disjoint | `s.isdisjoint(t)` | No elements in common |\n",
    "| Immutable copy | `frozenset(s)` | Immutable set, can be used as a dict key |\n",
    "\n",
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: Unique Values &amp; O(1) Lookup</span><br><br>\n",
    "A <code>set</code> never stores duplicates and tests membership in constant time. Annotate as <code>set[str]</code>. For an immutable, hashable set that can be used as a dict key, use <code>frozenset</code>.\n",
    "</div>\n",
    "\n",
    "<div style='background:#FEF2F2;border-left:5px solid #DC2626;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#991B1B;font-weight:bold'><i class=\"bi bi-bug-fill\"></i> Common Mistake: {} Is a Dict, Not a Set</span><br><br>\n",
    "<code>empty = {}</code> creates an empty <b>dict</b>.<br> <code>empty = set()</code> creates an empty <b>set</b>.<br> This trips up nearly every Python learner once. Now you know.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sets remove duplicates on creation\n",
    "raw_labels: list[str] = [\"cat\", \"dog\", \"cat\", \"bird\", \"dog\", \"cat\"]\n",
    "unique_labels: set[str] = set(raw_labels)\n",
    "print(f\"raw    : {raw_labels}\")\n",
    "print(f\"unique : {sorted(unique_labels)}\")\n",
    "\n",
    "# O(1) membership test: much faster than list for large collections\n",
    "valid_formats: set[str] = {\"parquet\", \"csv\", \"json\", \"feather\"}\n",
    "print(f\"parquet valid : {'parquet' in valid_formats}\")\n",
    "print(f\"xlsx valid    : {'xlsx' in valid_formats}\")\n",
    "\n",
    "# Mutation\n",
    "valid_formats.add(\"orc\")\n",
    "valid_formats.discard(\"feather\")  # safe: no error if element is absent\n",
    "print(f\"formats : {sorted(valid_formats)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76",
   "metadata": {},
   "source": [
    "Confirm the `{}` gotcha by running this cell. The type output makes it unmistakable:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "77",
   "metadata": {},
   "outputs": [],
   "source": [
    "# GOTCHA: {} creates a dict, not a set: always use set() for an empty set\n",
    "empty_dict = {}\n",
    "empty_set = set()\n",
    "print(f\"type({{}})   : {type(empty_dict)}\")\n",
    "print(f\"type(set()) : {type(empty_set)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78",
   "metadata": {},
   "source": [
    "### Set Algebra\n",
    "Sets support mathematical operations directly with operators. These are invaluable for data-pipeline integrity checks such as detecting train/validation leakage:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "79",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set algebra: very common in data pipeline checks\n",
    "train_ids: set[int] = {101, 102, 103, 104, 105, 106, 107, 108}\n",
    "val_ids: set[int] = {107, 108, 109, 110}\n",
    "\n",
    "print(f\"Union        : {sorted(train_ids | val_ids)}\")\n",
    "print(f\"Intersection : {sorted(train_ids & val_ids)}\")\n",
    "print(f\"Difference   : {sorted(train_ids - val_ids)}\")\n",
    "print(f\"Sym. diff    : {sorted(train_ids ^ val_ids)}\")\n",
    "\n",
    "# Practical: check for data leakage between splits\n",
    "leakage: set[int] = train_ids & val_ids\n",
    "if leakage:\n",
    "    print(f\"\\nWARNING: {len(leakage)} IDs in both train and val : data leakage! {leakage}\")\n",
    "else:\n",
    "    print(\"\\nNo data leakage between splits.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80",
   "metadata": {},
   "source": [
    "## 7. Standard Library Collections\n",
    "\n",
    "Python ships with a large collection of ready-to-use modules called the **standard library**: available without any `pip install`. The `collections` module contains specialised containers that solve common data patterns more cleanly than plain `list` and `dict`.\n",
    "\n",
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: Specialised Containers from <code>collections</code></span><br><br>\n",
    "Three tools from the standard library cover the most common data-science patterns beyond the built-in types:\n",
    "<ul>\n",
    "<li><b>Counter</b>: count occurrences; perfect for label frequencies and class imbalance checks</li>\n",
    "<li><b>defaultdict</b>: group items without writing <code>if key not in d: d[key] = []</code></li>\n",
    "<li><b>deque</b>: O(1) append and pop from both ends; ideal for sliding windows in time series</li>\n",
    "</ul>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "81",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "# Class imbalance check using Counter\n",
    "predicted_labels: list[str] = [\n",
    "    \"pass\",\n",
    "    \"pass\",\n",
    "    \"fail\",\n",
    "    \"pass\",\n",
    "    \"pass\",\n",
    "    \"fail\",\n",
    "    \"pass\",\n",
    "    \"pass\",\n",
    "    \"pass\",\n",
    "    \"fail\",\n",
    "    \"pass\",\n",
    "    \"pass\",\n",
    "]\n",
    "\n",
    "counts: Counter[str] = Counter(predicted_labels)\n",
    "print(f\"All counts : {counts}\")\n",
    "print(f'\"pass\"     : {counts[\"pass\"]}')\n",
    "print(f\"Unknown    : {counts['unknown']}\")  # returns 0, not KeyError\n",
    "print(f\"Top 2      : {counts.most_common(2)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82",
   "metadata": {},
   "source": [
    "Build a class-distribution report and combine counters from multiple batches using Counter arithmetic: `+` merges counts, `-` subtracts (removing zeros):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "83",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Class distribution report\n",
    "total: int = sum(counts.values())\n",
    "for label, n in counts.most_common():\n",
    "    print(f\"  {label:<8}: {n:2d}/{total} ({n / total:.1%})\")\n",
    "\n",
    "# Counter arithmetic: combine counts from multiple batches\n",
    "batch_a: Counter[str] = Counter([\"pass\", \"pass\", \"fail\"])\n",
    "batch_b: Counter[str] = Counter([\"fail\", \"fail\", \"pass\"])\n",
    "combined = batch_a + batch_b\n",
    "print(f\"\\nCombined batches: {combined}\")"
   ]
  },
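  {
   "cell_type": "markdown",
   "id": "83-1",
   "metadata": {},
   "source": [
    "Subtraction is not shown above, so here is a minimal sketch of it. The `expected` and `observed` counters are hypothetical values made up for illustration; the point is only that `-` drops any key whose result is zero or negative:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "83-2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "# Hypothetical label counts, invented for this sketch\n",
    "expected: Counter[str] = Counter({\"pass\": 5, \"fail\": 3})\n",
    "observed: Counter[str] = Counter({\"pass\": 4, \"fail\": 3, \"skip\": 1})\n",
    "\n",
    "# \"fail\" (3 - 3 = 0) and \"skip\" (0 - 1 = -1) are dropped from the result\n",
    "missing = expected - observed\n",
    "print(f\"Missing labels: {missing}\")  # Counter({'pass': 1})"
   ]
  },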
  {
   "cell_type": "markdown",
   "id": "84",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 5 - Label Frequency Report</span><br><br>\n",
    "<b>Goal:</b> Use <code>Counter</code> to produce a class distribution report from the labels list below.\n",
    "<pre style='background:#FCE8DA;padding:10px;border-radius:4px;font-size:0.9em'>labels = ['A','B','A','C','B','A','A','B','C','A','B','A']\n",
    "\n",
    "# Expected output\n",
    "A : 6/12 (50.0%)  [##############################]\n",
    "B : 4/12 (33.3%)  [####################          ]\n",
    "C : 2/12 (16.7%)  [##########                    ]</pre>\n",
    "<b>Hint:</b> Build the bar with <code>'#' * int(pct * 30)</code>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "labels: list[str] = [\"A\", \"B\", \"A\", \"C\", \"B\", \"A\", \"A\", \"B\", \"C\", \"A\", \"B\", \"A\"]\n",
    "\n",
    "# TODO: print a class distribution report\n",
    "counts: Counter[str] = Counter(labels)\n",
    "total: int = sum(counts.values())\n",
    "\n",
    "for label, n in counts.most_common():\n",
    "    pct = n / total\n",
    "    bar = ...  # TODO: build the bar string\n",
    "    print(f\"{label} : {n}/{total} ({pct:.1%})  [{bar:<30}]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86",
   "metadata": {},
   "source": [
    "### defaultdict: Zero-Setup Grouping\n",
    "`defaultdict(factory)` calls `factory()` to create a new default value whenever a missing key is accessed, eliminating the `if key not in d: d[key] = []` boilerplate. `defaultdict(list)` is the standard pattern for grouping *(uses a for loop, covered in Part 2)*:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "87",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "students: list[dict[str, object]] = [\n",
    "    {\"name\": \"Alice\", \"major\": \"CS\", \"gpa\": 3.95},\n",
    "    {\"name\": \"Bob\", \"major\": \"Math\", \"gpa\": 3.45},\n",
    "    {\"name\": \"Carol\", \"major\": \"CS\", \"gpa\": 3.88},\n",
    "    {\"name\": \"Dan\", \"major\": \"Math\", \"gpa\": 3.72},\n",
    "    {\"name\": \"Eve\", \"major\": \"CS\", \"gpa\": 3.60},\n",
    "]\n",
    "\n",
    "# Group students by major: no 'if key not in d: d[key] = []' needed\n",
    "by_major: defaultdict[str, list[str]] = defaultdict(list)\n",
    "for s in students:\n",
    "    by_major[str(s[\"major\"])].append(str(s[\"name\"]))\n",
    "\n",
    "print(\"Students by major:\")\n",
    "for major, names in sorted(by_major.items()):\n",
    "    print(f\"  {major}: {names}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88",
   "metadata": {},
   "source": [
    "The same pattern works for numeric accumulation. `defaultdict(float)` starts every new key at `0.0`, making sum-per-group pipelines one-liners:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "89",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Accumulate GPA sums per major\n",
    "gpa_total: defaultdict[str, float] = defaultdict(float)\n",
    "gpa_count: defaultdict[str, int] = defaultdict(int)\n",
    "\n",
    "for s in students:\n",
    "    key = str(s[\"major\"])\n",
    "    gpa_total[key] += float(s[\"gpa\"])  # type: ignore[arg-type]\n",
    "    gpa_count[key] += 1\n",
    "\n",
    "print(\"Average GPA by major:\")\n",
    "for major in sorted(gpa_total):\n",
    "    print(f\"  {major}: {gpa_total[major] / gpa_count[major]:.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90",
   "metadata": {},
   "source": [
    "### deque: Fixed-Size Rolling Buffer\n",
    "`deque(maxlen=N)` discards the oldest element automatically when the buffer is full. This is the standard tool for rolling statistics over time-series streams:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import deque\n",
    "\n",
    "# Rolling mean with maxlen: oldest element auto-discards when full\n",
    "temperature_readings: list[float] = [\n",
    "    36.5,\n",
    "    36.7,\n",
    "    37.1,\n",
    "    37.8,\n",
    "    38.2,\n",
    "    38.0,\n",
    "    37.5,\n",
    "    37.2,\n",
    "    36.9,\n",
    "    36.6,\n",
    "]\n",
    "WINDOW: int = 3\n",
    "\n",
    "window: deque[float] = deque(maxlen=WINDOW)\n",
    "rolling_means: list[float] = []\n",
    "\n",
    "for reading in temperature_readings:\n",
    "    window.append(reading)\n",
    "    if len(window) == WINDOW:\n",
    "        rolling_means.append(round(sum(window) / WINDOW, 2))\n",
    "\n",
    "print(f\"Readings      : {temperature_readings}\")\n",
    "print(f\"Rolling mean-3: {rolling_means}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92",
   "metadata": {},
   "source": [
    "A `deque` doubles as a FIFO queue. `appendleft()` and `popleft()` are O(1) -- far faster than `list.insert(0, ...)` which is O(n):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "93",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import deque\n",
    "\n",
    "# deque as a task queue: O(1) popleft vs list's O(n)\n",
    "pipeline: deque[str] = deque([\"load_data\", \"clean\", \"feature_eng\", \"train\", \"evaluate\"])\n",
    "pipeline.appendleft(\"validate_schema\")  # high-priority step prepended\n",
    "\n",
    "print(\"Pipeline execution order:\")\n",
    "while pipeline:\n",
    "    step = pipeline.popleft()\n",
    "    print(f\"  -> {step}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "94",
   "metadata": {},
   "source": [
    "### statistics: Descriptive Stats Without NumPy\n",
    "\n",
    "The `statistics` module computes common descriptive statistics on plain Python lists, no NumPy required. Use it for quick sanity checks and lightweight scripts. For large arrays, NumPy is faster (covered in Part 4)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95",
   "metadata": {},
   "outputs": [],
   "source": [
    "import statistics\n",
    "\n",
    "exam_scores: list[float] = [72.0, 85.0, 91.0, 68.0, 88.0, 77.0, 94.0, 63.0]\n",
    "\n",
    "mean = statistics.mean(exam_scores)\n",
    "median = statistics.median(exam_scores)\n",
    "stdev = statistics.stdev(exam_scores)  # sample std deviation\n",
    "pstdev = statistics.pstdev(exam_scores)  # population std deviation\n",
    "var = statistics.variance(exam_scores)\n",
    "\n",
    "print(f\"n      = {len(exam_scores)}\")\n",
    "print(f\"mean   = {mean:.2f}\")\n",
    "print(f\"median = {median:.2f}\")\n",
    "print(f\"stdev  = {stdev:.2f}   (sample)\")\n",
    "print(f\"var    = {var:.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96",
   "metadata": {},
   "source": [
    "Use `statistics.NormalDist` to compute z-scores and check where any value falls in the distribution:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "97",
   "metadata": {},
   "outputs": [],
   "source": [
    "from statistics import NormalDist\n",
    "\n",
    "dist = NormalDist(mu=mean, sigma=stdev)\n",
    "\n",
    "for score in [63.0, 77.0, 94.0]:\n",
    "    z = (score - mean) / stdev\n",
    "    pct = dist.cdf(score) * 100  # percentile rank\n",
    "    print(f\"  score={score:5.1f}  z={z:+.2f}  percentile={pct:.1f}%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98",
   "metadata": {},
   "source": [
    "## 8. Operators\n",
    "\n",
    "An **operator** is a symbol that performs a computation on one or two values. You already know arithmetic operators from mathematics. Python adds several more:\n",
    "\n",
    "| Category | Operators | Example |\n",
    "|---|---|---|\n",
    "| Arithmetic | `+` `-` `*` `/` `//` `%` `**` | `7 // 2` → `3` |\n",
    "| Comparison | `==` `!=` `<` `>` `<=` `>=` | `score >= 70` → `True` |\n",
    "| Logical | `and` `or` `not` | `a and b` |\n",
    "| Assignment expression | `:=` (walrus) | `if (n := len(data)) > 10:` |\n",
    "\n",
    "Three operator families matter most in data science work: **arithmetic**, **comparison + logical**, and the **walrus** `:=` (Python 3.8+)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Weighted grade calculation\n",
    "midterm: float = 82.0\n",
    "final: float = 91.0\n",
    "project: float = 88.0\n",
    "\n",
    "weighted_grade = midterm * 0.30 + final * 0.50 + project * 0.20\n",
    "print(f\"Weighted grade: {weighted_grade:.1f}\")\n",
    "\n",
    "# Augmented assignment modifies in place\n",
    "loss: float = 1.0\n",
    "for _ in range(6):\n",
    "    loss *= 0.75\n",
    "print(f\"Loss after 6 decay steps: {loss:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "100",
   "metadata": {},
   "source": [
    "`/` and `//` are **different operators**. This is one of the most common Python gotchas: `//` floors toward negative infinity, not toward zero:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "101",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Division: / is always true division; // is floor (rounds toward -inf)\n",
    "print(f\"7 / 2  = {7 / 2}\")  # 3.5 : always float\n",
    "print(f\"7 // 2 = {7 // 2}\")  # 3   : floor, not truncate\n",
    "print(f\"7 % 2  = {7 % 2}\")  # 1   : remainder\n",
    "print(f\"2**10  = {2**10}\")  # 1024: exponentiation\n",
    "print(f\"-7//2  = {-7 // 2}\")  # -4  : floors TOWARD negative infinity"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "102",
   "metadata": {},
   "source": [
    "### Comparison & Logical Operators\n",
    "Comparison operators return `bool`. Logical operators combine conditions and use **short-circuit** evaluation: the right side is not evaluated if the left side already determines the result:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "103",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Comparison and logical operators\n",
    "score: float = 84.5\n",
    "attendance: int = 90\n",
    "\n",
    "passes = score >= 70\n",
    "qualifies = score >= 80 and attendance >= 85  # both must be true\n",
    "at_risk = score < 60 or attendance < 70  # either triggers\n",
    "not_pass = not passes\n",
    "\n",
    "print(f\"{passes=}  {qualifies=}  {at_risk=}  {not_pass=}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "104",
   "metadata": {},
   "source": [
    "Short-circuit evaluation prevents errors like dividing by an empty list. Use `is`/`is not` to check **object identity** (same object in memory) and `==`/`!=` to check **value equality**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "105",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Short-circuit evaluation: right side is NOT evaluated if left decides outcome\n",
    "scores: list[float] | None = [82.0, 91.5, 74.0]\n",
    "mean = scores and sum(scores) / len(scores)  # safe: skips divide if scores is None\n",
    "print(f\"mean (safe): {mean}\")\n",
    "\n",
    "# Identity (is) vs equality (==)\n",
    "a: list[int] = [1, 2, 3]\n",
    "b: list[int] = [1, 2, 3]\n",
    "c: list[int] = a\n",
    "print(f\"a == b : {a == b}\")  # True : same values\n",
    "print(f\"a is b : {a is b}\")  # False: different objects\n",
    "print(f\"a is c : {a is c}\")  # True : same object"
   ]
  },
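  {
   "cell_type": "markdown",
   "id": "105-1",
   "metadata": {},
   "source": [
    "As a small sketch of what the guard evaluates to, the cell below applies the same `and` expression to a non-empty list, an empty list, and `None`. The sample inputs are made up for illustration. The falsy cases yield the operand itself rather than a number, so a conditional expression is shown alongside as one possible alternative when a uniform `None` is preferred:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "105-2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical inputs, invented for this sketch\n",
    "candidates: list[list[float] | None] = [[82.0, 91.5], [], None]\n",
    "\n",
    "for candidate in candidates:\n",
    "    # and-guard: returns the falsy operand itself ([] or None) when it short-circuits\n",
    "    guarded = candidate and sum(candidate) / len(candidate)\n",
    "    # conditional expression: always returns either a float or None\n",
    "    explicit = sum(candidate) / len(candidate) if candidate else None\n",
    "    print(f\"{candidate!r:<14} and-guard={guarded!r:<6} conditional={explicit!r}\")"
   ]
  },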
  {
   "cell_type": "markdown",
   "id": "106",
   "metadata": {},
   "source": [
    "### Walrus Operator `:=` (Python 3.8+)\n",
    "`:=` is an **assignment expression**: unlike `=` (a statement), it both assigns a value and evaluates to that value. This lets you assign inside a condition, avoiding computing the same value twice:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "107",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Walrus operator := (Python 3.8+)\n",
    "# Assigns AND returns a value in the same expression.\n",
    "exam_scores: list[float] = [82.0, 91.5, 74.0, 88.0, 95.5, 64.0, 79.0]\n",
    "\n",
    "# Without walrus: must store mean manually before using in condition\n",
    "m = sum(exam_scores) / len(exam_scores)\n",
    "if m < 80:\n",
    "    print(f\"[without walrus] Cohort average is low: {m:.1f}\")\n",
    "\n",
    "# With walrus: compute once, test, and use: all in one expression\n",
    "if (m := sum(exam_scores) / len(exam_scores)) < 80:\n",
    "    print(f\"[with walrus]    Cohort average is low: {m:.1f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "108",
   "metadata": {},
   "source": [
    "`:=` is especially useful inside `while` loops that consume a stream and inside comprehensions that need to reuse a computed intermediate value (both covered in Part 2):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "109",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Walrus in a while loop: consume a stream until None sentinel\n",
    "data_stream = iter([10.0, 20.0, 30.0, None, 40.0])\n",
    "while (value := next(data_stream, None)) is not None:\n",
    "    print(f\"  Read: {value}\")\n",
    "\n",
    "# Walrus in a comprehension: strip once, reuse stripped value\n",
    "raw_names: list[str] = [\"  Alice  \", \"\", \"  Bob  \", \" \", \"  Carol  \"]\n",
    "clean_names: list[str] = [stripped for name in raw_names if (stripped := name.strip())]\n",
    "print(f\"Clean names: {clean_names}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "110",
   "metadata": {},
   "source": [
    "## Further Reading\n",
    "\n",
    "| Resource | Why it matters |\n",
    "|---|---|\n",
    "| [Python Data Model](https://docs.python.org/3/reference/datamodel.html) | The official spec for `__dunder__` methods and how Python objects work under the hood |\n",
    "| VanderPlas, J. (2016). *Python Data Science Handbook*. O'Reilly. | Free at [jakevdp.github.io/PythonDataScienceHandbook](https://jakevdp.github.io/PythonDataScienceHandbook) — the NumPy and pandas chapters build directly on this one |\n",
    "| Ramalho, L. (2022). *Fluent Python*, 2nd ed. O'Reilly. | Chapter 2 (sequences) and Chapter 3 (dicts and sets) go deeper than any tutorial; the book treats Python as a first-class design language |\n",
    "| PEP 572 — Assignment Expressions | Background and rationale for the walrus operator (`:=`) introduced in Python 3.8 |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "111",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "| Concept | Key rule |\n",
    "|---|---|\n",
    "| Type hints | `x: int`, `list[float]`, `dict[str, int]`, `X \\| None`, checked by `ty` but not enforced at runtime |\n",
    "| f-strings | `f'{var=}'` for debugging; `f'{val:.2f}'` for formatting |\n",
    "| Strings | `.strip()`, `.split()`, `.join()`, `.replace()` cover most data cleaning |\n",
    "| `list` | Ordered, mutable; use `.copy()` not `=` when you need independence |\n",
    "| `tuple` / `NamedTuple` | Immutable records; unpack with `a, b = t` or `a, *rest, b = t` |\n",
    "| `dict` / `TypedDict` | Key-value; merge with `\\|`; typed schema with `TypedDict` |\n",
    "| `set` | Unique values, O(1) membership; `\\|` union, `&` intersection, `-` difference |\n",
    "| `Counter` | Frequency counts; `.most_common(n)` |\n",
    "| `defaultdict` | Group items without `KeyError`; `defaultdict(list)` |\n",
    "| `deque` | Sliding windows; `maxlen=` auto-drops oldest |\n",
    "| Walrus `:=` | Assign inside a condition to avoid re-computing |\n",
    "\n",
    "**Next:** `02-control-flow.ipynb`, covering `if`/`elif`/`else`, `match`/`case`, `for`, `while`, and comprehensions."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}