{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 3: Patterns for Data Science & ML\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/03-python-patterns.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/03-python-patterns.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Python Foundations**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook assumes you have completed Part 1 (`01-python-core.ipynb`) and Part 2 (`02-control-flow.ipynb`). If you have not, start there. The concepts here build directly on both.\n", "\n", "Part 3 introduces **professional coding patterns**: the habits and structures that separate a working script from maintainable, production-grade code. 
These patterns are used every day in real data science and ML engineering work.\n", "\n", "| Pattern | Why it matters |\n", "|---|---|\n", "| **Functions** | Reuse logic without copying code; make code testable |\n", "| **Lambda** | Write concise callbacks for `sorted()`, `map()`, pandas `.apply()` |\n", "| **\\*args / \\*\\*kwargs** | Handle flexible inputs like scikit-learn and PyTorch do |\n", "| **Dataclasses** | Typed, structured containers for configs and pipeline state |\n", "| **Modules** | Organise code into files; use the standard library |\n", "| **Exceptions** | Handle errors gracefully instead of crashing |\n", "| **pathlib** | Read and write files safely, cross-platform |\n", "\n", "The running example is the same **university analytics platform** from Part 1.\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Write type-annotated functions with Google-style docstrings | Sec. 1 |\n", "| 2 | Use lambda, `*args`, and `**kwargs` | Sec. 2, 3 |\n", "| 3 | Define structured data with `@dataclass` | Sec. 4 |\n", "| 4 | Import and use the standard library | Sec. 5 |\n", "| 5 | Handle exceptions with `try/except/else/finally` | Sec. 6 |\n", "| 6 | Read and write files with `pathlib.Path` | Sec. 7 |\n", "| 7 | Recognise and avoid the most common Python gotchas | Sec. 8 |\n", ":::\n" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## 1. Functions\n", "\n", "A **function** is a named, reusable block of code. 
You define it once and call it as many times as you need, with different inputs each time.\n", "\n", "```python\n", "# Define once:\n", "def greet(name):\n", " print(f'Hello, {name}!')\n", "\n", "# Call many times:\n", "greet('Alice') # Hello, Alice!\n", "greet('Bob') # Hello, Bob!\n", "```\n", "\n", "Without functions, any repeated logic must be copy-pasted, and copy-pasted code means bugs fixed in one place but not the other. Functions are the foundation of all reusable, testable code.\n", "\n", "
\n", " Key Concept: Type Hints + Google Docstrings

\n", "Every function you write for production should have:
\n", "
    \n", "
  1. Type annotations on all parameters and the return value: mypy and your IDE use them to catch bugs. This is the same name: type syntax from Part 1, Sec. 1, just applied to function signatures.
\n", "
  2. A docstring that explains what the function does, its arguments, and what it returns.\n", "This project uses Google style.
\n", "
\n", "The default parameter rule: defaults are evaluated once at definition time. Never use a mutable object (list, dict) as a default; see the Common Mistake callout in Sec. 8.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "5", "metadata": {}, "outputs": [], "source": [ "def weighted_grade(\n", " midterm: float,\n", " final: float,\n", " project: float,\n", " weights: tuple[float, float, float] = (0.30, 0.50, 0.20),\n", ") -> float:\n", " \"\"\"Compute a weighted final grade from three components.\n", "\n", " Args:\n", " midterm: Midterm exam score (0-100).\n", " final: Final exam score (0-100).\n", " project: Project score (0-100).\n", " weights: (midterm_w, final_w, project_w) : must sum to 1.0.\n", "\n", " Returns:\n", " Weighted average on a 0-100 scale.\n", "\n", " Raises:\n", " ValueError: If weights do not sum to 1.0 within tolerance.\n", " \"\"\"\n", " w_mid, w_fin, w_proj = weights\n", " if abs(w_mid + w_fin + w_proj - 1.0) > 1e-9:\n", " raise ValueError(f\"Weights must sum to 1.0; got {w_mid + w_fin + w_proj}\")\n", " return midterm * w_mid + final * w_fin + project * w_proj" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "Call the function with default weights, then with custom weights. Keyword arguments make the call self-documenting:" ] }, { "cell_type": "code", "execution_count": null, "id": "7", "metadata": {}, "outputs": [], "source": [ "# Default weights (0.30 mid / 0.50 final / 0.20 project)\n", "grade = weighted_grade(midterm=82.0, final=91.0, project=88.0)\n", "print(f\"Default weights : {grade:.1f}\")\n", "\n", "# Override weights - the tuple must still sum to 1.0\n", "custom = weighted_grade(82.0, 91.0, 88.0, weights=(0.20, 0.60, 0.20))\n", "print(f\"Custom weights : {custom:.1f}\")" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "Functions can return multiple values packed as a tuple. 
Callers unpack them directly into named variables with `a, b, c = func()`:" ] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": [ "def score_summary(scores: list[float]) -> tuple[float, float, float, float]:\n", " \"\"\"Return descriptive statistics for a score list.\n", "\n", " Args:\n", " scores: Non-empty list of numeric scores.\n", "\n", " Returns:\n", " Tuple of (mean, minimum, maximum, std_dev).\n", "\n", " Raises:\n", " ValueError: If scores is empty.\n", " \"\"\"\n", " if not scores:\n", " raise ValueError(\"scores must not be empty\")\n", " n = len(scores)\n", " mean = sum(scores) / n\n", " variance = sum((s - mean) ** 2 for s in scores) / n\n", " return mean, min(scores), max(scores), variance**0.5" ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "Python packs the four return values into a tuple. Unpack them in one line with tuple unpacking:" ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "exam_scores: list[float] = [78.0, 85.5, 92.0, 88.5, 95.0, 67.0, 81.0]\n", "mean, lo, hi, std = score_summary(exam_scores)\n", "print(f\"mean={mean:.1f} min={lo} max={hi} std={std:.1f}\")" ] }, { "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "Another essential DS function: z-score normalisation. 
It scales any value to \"how many standard deviations from the mean\", a common preprocessing step for scale-sensitive ML models:" ] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "import statistics\n", "\n", "\n", "def normalize(value: float, mean: float, std: float) -> float:\n", " \"\"\"Compute the z-score of a single value.\n", "\n", " Args:\n", " value: The raw data point.\n", " mean: Population or sample mean.\n", " std: Standard deviation (must be non-zero).\n", "\n", " Returns:\n", " Z-score: positive = above average, negative = below average.\n", "\n", " Raises:\n", " ValueError: If std is zero (all values identical).\n", " \"\"\"\n", " if std == 0.0:\n", " raise ValueError(\"std must be non-zero\")\n", " return (value - mean) / std" ] }, { "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "Apply it to an exam score list. The z-scores make it immediately clear who is above or below the class mean:" ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "exam_scores: list[float] = [72.0, 85.0, 91.0, 68.0, 88.0, 77.0, 94.0, 63.0]\n", "mu = statistics.mean(exam_scores)\n", "sig = statistics.stdev(exam_scores)\n", "\n", "print(f\"mean={mu:.1f} std={sig:.1f}\\n\")\n", "for score in exam_scores:\n", " z = normalize(score, mu, sig)\n", " label = \"above avg\" if z > 0 else \"below avg\"\n", " print(f\" {score:5.1f} z={z:+.2f} {label}\")" ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "
\n", " Activity 1 - Write a Grade Classifier Function

\n", "Goal: Write a fully annotated function classify_cohort that takes a list of scores and returns a dict mapping each grade letter to its count.\n", "
classify_cohort([95, 83, 71, 62, 45, 88, 76])\n",
    "# -> {'A': 1, 'B': 2, 'C': 2, 'D': 1, 'F': 1}
\n", "Hint: Use a helper function grade_letter(score) -> str and Counter.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "\n", "\n", "def grade_letter(score: float) -> str:\n", " \"\"\"Return the letter grade for a numeric score.\"\"\"\n", " ... # TODO\n", "\n", "\n", "def classify_cohort(scores: list[float]) -> dict[str, int]:\n", " \"\"\"Return grade-letter frequency counts for a cohort.\n", "\n", " Args:\n", " scores: List of numeric scores.\n", "\n", " Returns:\n", " Dict mapping each letter grade to its count.\n", " \"\"\"\n", " ... # TODO\n", "\n", "\n", "print(classify_cohort([95.0, 83.0, 71.0, 62.0, 45.0, 88.0, 76.0]))" ] }, { "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "
\n", " Activity 2 - Login Validator

\n", "Goal: Write accept_login(users, username, password) that takes a dict[str, str] of username→password pairs and returns True if the username exists and the password matches, False otherwise.

\n", "
users = {\"alice\": \"ds2024\", \"bob\": \"ml#secure\"}\n",
    "\n",
    "accept_login(users, \"alice\", \"ds2024\")   # True\n",
    "accept_login(users, \"alice\", \"wrong\")    # False  (bad password)\n",
    "accept_login(users, \"carol\", \"any\")      # False  (user not found)
\n", "Hint: Use dict.get() to avoid a KeyError on missing usernames.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "19", "metadata": {}, "outputs": [], "source": [ "def accept_login(users: dict[str, str], username: str, password: str) -> bool:\n", " \"\"\"Return True if username exists and password matches.\"\"\"\n", " # TODO: implement\n", "\n", "\n", "users: dict[str, str] = {\"alice\": \"ds2024\", \"bob\": \"ml#secure\"}\n", "print(accept_login(users, \"alice\", \"ds2024\")) # True\n", "print(accept_login(users, \"alice\", \"wrong\")) # False\n", "print(accept_login(users, \"carol\", \"any\")) # False" ] }, { "cell_type": "markdown", "id": "20", "metadata": {}, "source": [ "## 2. Lambda Functions\n", "\n", "A **lambda** is an anonymous (nameless) function defined in a single expression. It takes inputs on the left of `:` and produces a result on the right:\n", "\n", "```python\n", "double = lambda x: x * 2\n", "double(5) # -> 10\n", "```\n", "\n", "Lambdas are most useful as short callbacks: a function you pass to another function rather than call yourself. You will use them constantly with `sorted()`, `map()`, `filter()`, and pandas `.apply()`.\n", "\n", "
\n", " Key Concept: Anonymous Single-Expression Function

\n", "A lambda is a function with no name, a single expression, and an implicit return. Use it as a short callback for sorted(), map(), filter(), and especially pandas .apply().
If the body needs more than one expression, write a named function instead; lambdas must stay simple.\n", "
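As a small sketch of the point above, a lambda and a named function are interchangeable as callbacks. The helper `by_score` and the sample records are made up for this illustration:

```python
def by_score(record: tuple[str, float]) -> float:
    '''Return the score from a (name, score) pair.'''
    return record[1]


records: list[tuple[str, float]] = [('Alice', 91.0), ('Bob', 78.5), ('Carol', 85.0)]

# Both calls sort by the second element of each pair
with_lambda = sorted(records, key=lambda r: r[1])
with_named = sorted(records, key=by_score)

print(with_lambda == with_named)  # True
print(with_lambda[0])             # ('Bob', 78.5)
```

The named version earns its extra lines once the logic needs a docstring, a test, or more than one expression.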
" ] }, { "cell_type": "code", "execution_count": null, "id": "21", "metadata": {}, "outputs": [], "source": [ "# Lambda syntax: lambda params: expression\n", "square = lambda x: x**2 # noqa: E731\n", "clamp = lambda x, lo, hi: max(lo, min(x, hi)) # noqa: E731\n", "\n", "print(f\"square(5) = {square(5)}\")\n", "print(f\"clamp(150, 0, 100) = {clamp(150, 0, 100)}\")" ] }, { "cell_type": "markdown", "id": "22", "metadata": {}, "source": [ "The most common real-world use for lambdas is as a **sort key**: a function that extracts the comparison value from each element:" ] }, { "cell_type": "code", "execution_count": null, "id": "23", "metadata": {}, "outputs": [], "source": [ "# Most common use: sort key: avoids writing a throwaway named function\n", "students: list[dict[str, object]] = [\n", " {\"name\": \"Alice\", \"gpa\": 3.95, \"major\": \"CS\"},\n", " {\"name\": \"Bob\", \"gpa\": 3.45, \"major\": \"Math\"},\n", " {\"name\": \"Carol\", \"gpa\": 3.88, \"major\": \"CS\"},\n", " {\"name\": \"Dan\", \"gpa\": 3.72, \"major\": \"Math\"},\n", "]\n", "\n", "by_gpa = sorted(students, key=lambda s: s[\"gpa\"], reverse=True)\n", "by_major = sorted(students, key=lambda s: (s[\"major\"], s[\"gpa\"]))\n", "\n", "print(\"By GPA (desc):\")\n", "for s in by_gpa:\n", " print(f\" {s['name']:<8} GPA={s['gpa']}\")\n", "\n", "print(\"\\nBy major then GPA:\")\n", "for s in by_major:\n", " print(f\" {s['major']:<6} {s['name']:<8} GPA={s['gpa']}\")" ] }, { "cell_type": "markdown", "id": "24", "metadata": {}, "source": [ "### map() and filter()\n", "`map(func, iterable)` applies `func` to every element. 
`filter(func, iterable)` keeps the elements for which `func` returns a truthy value. Both return **lazy iterators** -- wrap with `list()` to materialise the result:" ] }, { "cell_type": "code", "execution_count": null, "id": "25", "metadata": {}, "outputs": [], "source": [ "# map() applies a function to every element: returns a lazy iterator\n", "raw_scores: list[str] = [\"78.5\", \"85.0\", \"92.3\", \"61.0\", \"88.7\"]\n", "\n", "# Convert strings to floats\n", "scores: list[float] = list(map(float, raw_scores))\n", "print(f\"Converted : {scores}\")\n", "\n", "# Normalise each score to 0-1\n", "lo, hi = min(scores), max(scores)\n", "normed: list[float] = list( # noqa: C417\n", " map(lambda s: round((s - lo) / (hi - lo), 3), scores)\n", ")\n", "print(f\"Normalised: {normed}\")" ] }, { "cell_type": "markdown", "id": "26", "metadata": {}, "source": [ "`filter(func, iterable)` keeps only the elements for which `func` returns a truthy value. It is equivalent to `[x for x in items if func(x)]` and reads especially well when `func` is a named predicate:" ] }, { "cell_type": "code", "execution_count": null, "id": "27", "metadata": {}, "outputs": [], "source": [ "# filter() keeps elements where the function returns a truthy value\n", "passing: list[float] = list(filter(lambda s: s >= 70, scores))\n", "print(f\"Passing: {passing}\")\n", "\n", "# Preview: lambda is the backbone of pandas .apply()\n", "# df['grade'] = df['score'].apply(lambda s: 'pass' if s >= 70 else 'fail')\n", "# You will see this pattern in the Data Analysis module." ] }, { "cell_type": "markdown", "id": "28", "metadata": {}, "source": [ "## 3. 
*args and **kwargs\n", "\n", "### What are *args and **kwargs?\n", "\n", "Sometimes you want a function to accept **any number of arguments** without listing them all:\n", "\n", "```python\n", "def add(*numbers): # *numbers collects all positional args into a tuple\n", " return sum(numbers)\n", "\n", "add(1, 2) # -> 3\n", "add(1, 2, 3, 4) # -> 10 (same function, different number of args)\n", "```\n", "\n", "This pattern is used by virtually every ML library: `nn.Sequential(*layers)`, `model.fit(X, y, **config)`, `pd.concat([df1, df2], **options)`.\n", "\n", "
\n", " Key Concept: Variable Positional and Keyword Arguments

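\n", "\n", "*args collects any extra positional arguments into a tuple; **kwargs collects any extra keyword arguments into a dict. You will encounter both constantly in scikit-learn, PyTorch, and FastAPI APIs: model.fit(X, y, **config), nn.Sequential(*layers).\n", "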
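A minimal sketch of the two directions in which `*` and `**` work: collecting arguments in a signature and unpacking them at a call site. The function `describe` and the sample values are hypothetical:

```python
def describe(*args: object, **kwargs: object) -> str:
    '''Report how the call arguments were collected.'''
    return f'args={args} kwargs={kwargs}'


print(describe(1, 2, lr=0.01))
# args=(1, 2) kwargs={'lr': 0.01}

# Unpacking at the call site is the mirror image of collecting
layers = ['linear', 'relu']
options = {'lr': 0.01, 'epochs': 5}
print(describe(*layers, **options))
# args=('linear', 'relu') kwargs={'lr': 0.01, 'epochs': 5}
```

Inside the function, `args` is always a tuple and `kwargs` is always a dict, whichever way the caller supplied the values.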
" ] }, { "cell_type": "code", "execution_count": null, "id": "29", "metadata": {}, "outputs": [], "source": [ "# *args collects all positional arguments into a tuple\n", "def ensemble_predict(*predictions: float) -> float:\n", " \"\"\"Return the mean of any number of model predictions.\n", "\n", " Args:\n", " *predictions: Floats from individual model predictions.\n", "\n", " Returns:\n", " Mean of all predictions.\n", "\n", " Raises:\n", " ValueError: If no predictions are provided.\n", " \"\"\"\n", " if not predictions:\n", " raise ValueError(\"At least one prediction required\")\n", " return sum(predictions) / len(predictions)" ] }, { "cell_type": "markdown", "id": "30", "metadata": {}, "source": [ "Call with one value, three values, or unpack a list with `*`. The function signature stays the same in all three cases:" ] }, { "cell_type": "code", "execution_count": null, "id": "31", "metadata": {}, "outputs": [], "source": [ "print(ensemble_predict(0.82)) # single model\n", "print(ensemble_predict(0.82, 0.91, 0.87)) # three models\n", "\n", "# * unpacks a list into positional arguments\n", "model_preds: list[float] = [0.82, 0.91, 0.87, 0.79]\n", "print(ensemble_predict(*model_preds)) # four models" ] }, { "cell_type": "markdown", "id": "32", "metadata": {}, "source": [ "`**kwargs` collects any number of keyword arguments into a dict. 
Unpacking a dict with `**` passes its contents as keyword arguments to another function:" ] }, { "cell_type": "code", "execution_count": null, "id": "33", "metadata": {}, "outputs": [], "source": [ "# **kwargs collects all keyword arguments into a dict\n", "def build_config(model: str, **hyperparams: object) -> dict[str, object]:\n", " \"\"\"Assemble a model config dict from keyword arguments.\n", "\n", " Args:\n", " model: Model name identifier.\n", " **hyperparams: Any additional hyperparameter key/value pairs.\n", "\n", " Returns:\n", " Config dict with model name plus all hyperparameters.\n", " \"\"\"\n", " return {\"model\": model} | dict(hyperparams)" ] }, { "cell_type": "markdown", "id": "34", "metadata": {}, "source": [ "Pass any combination of keyword arguments. `**` also unpacks a dict into keyword arguments at the call site:" ] }, { "cell_type": "code", "execution_count": null, "id": "35", "metadata": {}, "outputs": [], "source": [ "cfg1 = build_config(\"xgboost\", n_estimators=200, max_depth=6, learning_rate=0.05)\n", "cfg2 = build_config(\"linear\", C=1.0, penalty=\"l2\")\n", "print(\"Config 1:\", cfg1)\n", "print(\"Config 2:\", cfg2)\n", "\n", "# ** unpacks a dict into keyword arguments\n", "base_params: dict[str, object] = {\"n_estimators\": 100, \"max_depth\": 4}\n", "cfg3 = build_config(\"xgboost\", **base_params, learning_rate=0.01)\n", "print(\"Config 3:\", cfg3)" ] }, { "cell_type": "markdown", "id": "36", "metadata": {}, "source": [ "You can combine all four argument forms in one signature: fixed positional, `*args`, keyword-only (after `*`), and `**kwargs`. 
This is the pattern used by scikit-learn, PyTorch, and FastAPI:" ] }, { "cell_type": "code", "execution_count": null, "id": "37", "metadata": {}, "outputs": [], "source": [ "# All four kinds of argument in one signature:\n", "# positional → run_id\n", "# *args → tags (zero or more strings)\n", "# keyword-only → verbose (must be named at the call site)\n", "# **kwargs → metrics (any float metrics)\n", "def log_run(\n", " run_id: str,\n", " *tags: str,\n", " verbose: bool = False,\n", " **metrics: float,\n", ") -> None:\n", " \"\"\"Log a training run with optional tags and metrics.\"\"\"\n", " tag_str = \", \".join(tags) if tags else \"none\"\n", " metric_str = \" \".join(f\"{k}={v:.3f}\" for k, v in metrics.items())\n", " print(f\"[{run_id}] tags=[{tag_str}] {metric_str}\")\n", " if verbose:\n", " print(f\" (full metrics: {metrics})\")" ] }, { "cell_type": "markdown", "id": "38", "metadata": {}, "source": [ "Call it with positional tags and keyword metric pairs. `verbose` is keyword-only. It cannot be passed positionally:" ] }, { "cell_type": "code", "execution_count": null, "id": "39", "metadata": {}, "outputs": [], "source": [ "log_run(\"run-001\", \"baseline\", \"v1\", accuracy=0.923, loss=0.218)\n", "log_run(\"run-002\", verbose=True, accuracy=0.934, precision=0.918)" ] }, { "cell_type": "markdown", "id": "40", "metadata": {}, "source": [ "
\n", " Activity 3 - Flexible Metric Logger

\n", "Goal: Write log_metrics(epoch, **metrics) that prints a formatted line and returns a dict.\n", "
log_metrics(5, loss=0.312, accuracy=0.901, val_loss=0.334)\n",
    "# prints:\n",
    "# Epoch 05 | loss=0.3120  accuracy=0.9010  val_loss=0.3340\n",
    "# returns:\n",
    "# {'epoch': 5, 'loss': 0.312, 'accuracy': 0.901, 'val_loss': 0.334}
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "41", "metadata": {}, "outputs": [], "source": [ "def log_metrics(epoch: int, **metrics: float) -> dict[str, object]:\n", " \"\"\"Print and return a training metrics snapshot.\n", "\n", " Args:\n", " epoch: Current training epoch.\n", " **metrics: Metric name -> value pairs.\n", "\n", " Returns:\n", " Dict containing epoch and all metrics.\n", " \"\"\"\n", " ... # TODO\n", "\n", "\n", "result = log_metrics(5, loss=0.312, accuracy=0.901, val_loss=0.334)\n", "print(result)" ] }, { "cell_type": "markdown", "id": "42", "metadata": {}, "source": [ "## 4. Dataclasses\n", "\n", "A **dataclass** is a class (a blueprint for creating objects) where Python automatically generates the `__init__`, `__repr__`, and `__eq__` methods from your field annotations. It is the modern replacement for the plain `dict` records from Part 1, Sec. 5, once the structure of your data is fixed and known ahead of time.\n", "\n", "```python\n", "from dataclasses import dataclass\n", "\n", "# Instead of this:\n", "config = {'model': 'xgboost', 'lr': 0.001, 'epochs': 50}\n", "\n", "# Use this - self-documenting and typed (hints are checked by mypy, not enforced at runtime):\n", "@dataclass\n", "class Config:\n", " model: str\n", " lr: float\n", " epochs: int\n", "```\n", "\n", "> If you have not used classes before, think of a class as a custom type you define.\n", "> Dataclasses are the gentlest introduction. They need no understanding of\n", "> inheritance or `self` beyond what is shown here.\n", "\n", "
\n", " Key Concept: Typed Structured Data

\n", "A @dataclass (Python 3.7+) generates __init__, __repr__, and __eq__ from field annotations automatically. It is the modern replacement for plain dicts when the shape of your data is known and fixed.

When to use what:\n", "\n", "
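One practical difference between a plain dict and a dataclass can be sketched as follows. `RunConfig` and its fields are hypothetical names chosen for this illustration:

```python
from dataclasses import dataclass


@dataclass
class RunConfig:
    '''Hypothetical config record with a fixed, known shape.'''

    model: str
    lr: float


as_dict = {'model': 'xgboost', 'lr': 0.001}
as_dict['learning_rate'] = 0.01  # mistaken key: silently adds a new entry
print(len(as_dict))              # 3

cfg = RunConfig(model='xgboost', lr=0.001)
try:
    RunConfig(model='xgboost', learning_rate=0.01)  # unknown field name
except TypeError as exc:
    print(f'Rejected: {exc}')
```

The generated `__init__` rejects unknown field names, but it does not check value types at runtime; that part is left to a static checker such as mypy.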
" ] }, { "cell_type": "code", "execution_count": null, "id": "43", "metadata": {}, "outputs": [], "source": [ "from dataclasses import dataclass, field\n", "\n", "\n", "@dataclass\n", "class TrainingConfig:\n", " \"\"\"Configuration for a single training run.\"\"\"\n", "\n", " model_name: str\n", " learning_rate: float\n", " epochs: int\n", " batch_size: int = 32 # field with a default value\n", " optimizer: str = \"adam\"\n", " tags: list[str] = field(default_factory=list) # mutable default: use field()!\n", "\n", " def is_fast_run(self) -> bool:\n", " \"\"\"Return True if this is a quick smoke-test run (≤ 5 epochs).\"\"\"\n", " return self.epochs <= 5" ] }, { "cell_type": "markdown", "id": "44", "metadata": {}, "source": [ "`@dataclass` generates `__init__`, `__repr__`, and `__eq__` automatically. Create an instance exactly like calling a function:" ] }, { "cell_type": "code", "execution_count": null, "id": "45", "metadata": {}, "outputs": [], "source": [ "cfg = TrainingConfig(\n", " model_name=\"xgboost-v2\",\n", " learning_rate=0.001,\n", " epochs=50,\n", " tags=[\"baseline\", \"production\"],\n", ")\n", "\n", "print(cfg) # __repr__ generated: no boilerplate needed\n", "print(f\"Fast run : {cfg.is_fast_run()}\")\n", "print(f\"Tags : {cfg.tags}\")" ] }, { "cell_type": "markdown", "id": "46", "metadata": {}, "source": [ "Dataclass fields are mutable by default. 
`__eq__` compares **field values**, not object identity: two separately created instances with the same fields are equal:" ] }, { "cell_type": "code", "execution_count": null, "id": "47", "metadata": {}, "outputs": [], "source": [ "# Mutation: dataclass fields are mutable by default\n", "cfg.epochs = 100\n", "cfg.tags.append(\"extended\")\n", "print(f\"Updated epochs: {cfg.epochs} tags: {cfg.tags}\")\n", "\n", "# Equality: __eq__ is generated automatically from field values\n", "cfg2 = TrainingConfig(\"xgboost-v2\", 0.001, 100, tags=[\"baseline\", \"production\", \"extended\"])\n", "print(f\"cfg == cfg2: {cfg == cfg2}\")" ] }, { "cell_type": "markdown", "id": "48", "metadata": {}, "source": [ "### frozen=True: Immutable, Hashable Dataclass\n", "`frozen=True` prevents field mutation after creation and makes the object **hashable** -- it can then be used as a dict key or placed in a set:" ] }, { "cell_type": "code", "execution_count": null, "id": "49", "metadata": {}, "outputs": [], "source": [ "from dataclasses import dataclass\n", "\n", "\n", "@dataclass(frozen=True)\n", "class DatasetSplit:\n", " \"\"\"Immutable description of a train/val/test split.\"\"\"\n", "\n", " train_size: float\n", " val_size: float\n", " test_size: float\n", " random_seed: int = 42\n", "\n", " def __post_init__(self) -> None:\n", " total = self.train_size + self.val_size + self.test_size\n", " if abs(total - 1.0) > 1e-9:\n", " raise ValueError(f\"Splits must sum to 1.0; got {total}\")" ] }, { "cell_type": "markdown", "id": "50", "metadata": {}, "source": [ "`__post_init__` runs automatically after `__init__`. 
Any invalid split ratio is caught immediately on construction:" ] }, { "cell_type": "code", "execution_count": null, "id": "51", "metadata": {}, "outputs": [], "source": [ "split = DatasetSplit(train_size=0.70, val_size=0.15, test_size=0.15)\n", "print(split)\n", "\n", "# Invalid split: __post_init__ raises ValueError\n", "try:\n", " bad = DatasetSplit(train_size=0.80, val_size=0.15, test_size=0.15)\n", "except ValueError as exc:\n", " print(f\"Caught: {exc}\")" ] }, { "cell_type": "markdown", "id": "52", "metadata": {}, "source": [ "A frozen dataclass can be used as a dict key for result caching. Attempting to set a field after construction raises `FrozenInstanceError`:" ] }, { "cell_type": "code", "execution_count": null, "id": "53", "metadata": {}, "outputs": [], "source": [ "# frozen=True means the object is hashable: usable as a dict key or set element\n", "cache: dict[DatasetSplit, float] = {split: 0.923}\n", "print(f\"Cached accuracy: {cache[split]}\")\n", "\n", "# Attempting mutation raises FrozenInstanceError (a subclass of AttributeError)\n", "try:\n", " split.train_size = 0.80 # type: ignore[misc]\n", "except AttributeError as exc:\n", " print(f\"Immutable: {exc}\")" ] }, { "cell_type": "markdown", "id": "54", "metadata": {}, "source": [ "## 5. Modules & the Standard Library\n", "\n", "A **module** is usually a single Python file. When you write `import math`, Python locates the `math` module in the standard library, loads it, and makes its contents available under the name `math`. Most standard-library modules are plain `.py` files (for example `statistics.py` and `random.py`); `math` itself is compiled C code in CPython, but you import it the same way.\n", "\n", "```python\n", "import math\n", "math.sqrt(9) # -> 3.0\n", "```\n", "\n", "You can import your own code files the same way. Splitting code into modules is how real projects stay organised as they grow.\n", "\n", "
\n", " Key Concept: Import Patterns

\n", "\n", "\n", "\n", "\n", "
import mathimport the whole module; access with math.sqrt()
from math import sqrt, piimport specific names; use directly
import numpy as npalias (conventional for large packages)

\n", "Prefer import module over from module import *. The star import pollutes the namespace and hides where names come from.\n", "
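The three import patterns side by side, as a short sketch. The alias `stats` is chosen here only for illustration and is not a community convention:

```python
import math                 # whole module: names stay behind the math. prefix
from math import pi, sqrt   # specific names: usable directly
import statistics as stats  # alias: a shorter handle for the module

print(math.sqrt(16))                # 4.0
print(sqrt(16))                     # 4.0
print(round(pi, 2))                 # 3.14
print(stats.mean([1.0, 2.0, 3.0]))  # 2.0

# Both spellings refer to the same function object
print(math.sqrt is sqrt)            # True
```

The prefixed form keeps it obvious where each name comes from, which is the main argument against star imports.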
" ] }, { "cell_type": "code", "execution_count": null, "id": "55", "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "# math: precise numeric operations\n", "print(f\"pi = {math.pi:.6f}\")\n", "print(f\"sqrt(2) = {math.sqrt(2):.6f}\")\n", "print(f\"log base 2(8) = {math.log2(8)}\")\n", "print(f\"ceil(4.2) = {math.ceil(4.2)}\")\n", "print(f\"floor(4.9) = {math.floor(4.9)}\")" ] }, { "cell_type": "markdown", "id": "56", "metadata": {}, "source": [ "### random: Sampling, Shuffling, and Simulation\n", "\n", "`random` provides pseudo-random number generation for sampling, shuffling, and Monte Carlo simulation. Always call `random.seed(n)` at the start of any script that uses randomness. It makes results reproducible:\n", "\n", "| Function | What it does |\n", "|---|---|\n", "| `random.seed(n)` | Fix the random state for reproducibility |\n", "| `random.uniform(a, b)` | Float in `[a, b]` |\n", "| `random.randint(a, b)` | Integer in `[a, b]` (both inclusive) |\n", "| `random.choice(seq)` | One element from a sequence |\n", "| `random.sample(seq, k)` | `k` unique elements (no replacement) |\n", "| `random.shuffle(lst)` | Shuffle a list **in place** |" ] }, { "cell_type": "code", "execution_count": null, "id": "57", "metadata": {}, "outputs": [], "source": [ "import random\n", "\n", "# Always set a seed before any random operation for reproducibility\n", "random.seed(42)\n", "\n", "scores: list[float] = [72.0, 85.0, 91.0, 68.0, 88.0, 77.0, 94.0, 63.0]\n", "\n", "# shuffle: randomise in place (S311: not for crypto -- pedagogical demo)\n", "shuffled = scores.copy()\n", "random.shuffle(shuffled) # noqa: S311\n", "print(f\"Shuffled : {shuffled}\")\n", "\n", "# choice: pick one element at random\n", "winner = random.choice(scores) # noqa: S311\n", "print(f\"Random winner: {winner}\")\n", "\n", "# sample: pick k unique elements (without replacement)\n", "batch = random.sample(scores, k=3) # noqa: S311\n", "print(f\"Random batch : {batch}\")\n", "\n", "# uniform: 
float in [a, b]\n", "points: list[float] = [random.uniform(0.0, 1.0) for _ in range(5)] # noqa: S311\n", "print(f\"Uniform [0,1]: {[round(p, 3) for p in points]}\")\n", "\n", "# randint: integer in [a, b] (inclusive both ends)\n", "dice_rolls: list[int] = [random.randint(1, 6) for _ in range(10)] # noqa: S311\n", "print(f\"Dice rolls : {dice_rolls}\")" ] }, { "cell_type": "markdown", "id": "58", "metadata": {}, "source": [ "### json: Serialise Python Objects\n", "`json` converts Python dicts, lists, strings, numbers, and booleans to a JSON string and back. It is a widely used format for saving model configs and experiment results:" ] }, { "cell_type": "code", "execution_count": null, "id": "59", "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "# json: serialise / deserialise Python objects to JSON strings\n", "run_result: dict[str, object] = {\n", " \"run_id\": \"exp-2024-001\",\n", " \"model\": \"xgboost\",\n", " \"accuracy\": 0.923,\n", " \"loss\": 0.218,\n", " \"tags\": [\"baseline\", \"production\"],\n", "}\n", "\n", "json_str: str = json.dumps(run_result, indent=2)\n", "print(\"JSON string:\")\n", "print(json_str)\n", "\n", "loaded: dict[str, object] = json.loads(json_str)\n", "print(f\"Round-trip accuracy: {loaded['accuracy']}\")" ] }, { "cell_type": "markdown", "id": "60", "metadata": {}, "source": [ "### datetime: Timezone-Aware Timestamps\n", "`datetime` represents a point in time. Always attach a timezone such as `UTC` (`datetime.UTC`, an alias for `timezone.utc` available since Python 3.11) to avoid ambiguous \"naive\" datetime objects that can silently shift across time zones:" ] }, { "cell_type": "code", "execution_count": null, "id": "61", "metadata": {}, "outputs": [], "source": [ "from datetime import UTC, datetime\n", "\n", "# datetime: timestamp experiments and logs\n", "now = datetime.now(tz=UTC)\n", "print(f\"Timestamp : {now.isoformat()}\")\n", "print(f\"Date part : {now.strftime('%Y-%m-%d')}\")" ] }, { "cell_type": "markdown", "id": "62", "metadata": {}, "source": [ "## 6. 
Exception Handling\n", "\n", "An **exception** is an error that occurs while your program is running. By default, Python stops immediately and prints a traceback. Exception handling lets you catch the error, respond to it gracefully, and keep the program running.\n", "\n", "```python\n", "# Without handling - program crashes:\n", "int(\"abc\") # ValueError: invalid literal for int() with base 10: 'abc'\n", "\n", "# With handling - program continues:\n", "try:\n", " int(\"abc\")\n", "except ValueError:\n", " print(\"That was not a number, skipping\")\n", "```\n", "\n", "In data pipelines and ML training loops, unhandled exceptions can discard hours of computation. Always handle errors at system boundaries (user input, file I/O, APIs).\n", "\n", "
\n", " Key Concept: try / except / else / finally

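\n", "\n", "Catch the most specific exception you can. A bare except: also traps KeyboardInterrupt and SystemExit, so Ctrl+C may no longer stop the program. except Exception: lets those two through, but it still hides unrelated bugs.\n", "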
" ] }, { "cell_type": "code", "execution_count": null, "id": "63", "metadata": {}, "outputs": [], "source": [ "def parse_score(raw: str) -> float:\n", " \"\"\"Parse a score string and validate it is in [0, 100].\n", "\n", " Args:\n", " raw: String representation of a numeric score.\n", "\n", " Returns:\n", " Validated float score.\n", "\n", " Raises:\n", " ValueError: If raw is not numeric or out of range.\n", " \"\"\"\n", " try:\n", " value = float(raw)\n", " except ValueError:\n", " raise ValueError(f\"{raw!r} is not a valid number\") from None\n", "\n", " if not 0 <= value <= 100:\n", " raise ValueError(f\"Score {value} is out of range [0, 100]\")\n", "\n", " return value" ] }, { "cell_type": "markdown", "id": "64", "metadata": {}, "source": [ "Test `parse_score()` against a range of inputs: valid numbers, out-of-range values, and non-numeric strings. The `else` clause runs only on the **success** path:" ] }, { "cell_type": "code", "execution_count": null, "id": "65", "metadata": {}, "outputs": [], "source": [ "# Test parse_score against valid and invalid inputs\n", "test_inputs: list[str] = [\"87.5\", \"105\", \"abc\", \"-3\", \"72\"]\n", "\n", "for raw in test_inputs:\n", " try:\n", " score = parse_score(raw)\n", " except ValueError as exc:\n", " print(f\" {raw!r:<8} -> ERROR: {exc}\")\n", " else:\n", " print(f\" {raw!r:<8} -> OK: {score}\")" ] }, { "cell_type": "markdown", "id": "66", "metadata": {}, "source": [ "`finally` runs regardless of whether an exception occurred or was handled. Use it for cleanup code (closing files, releasing connections) that must execute either way. This example illustrates the pattern using explicit `open`/`close`; in practice, always use `with open(...) as fh` instead (shown in Sec. 
7):" ] }, { "cell_type": "code", "execution_count": null, "id": "67", "metadata": {}, "outputs": [], "source": [ "# else runs ONLY when try succeeds; finally ALWAYS runs (cleanup guarantee)\n", "def load_scores(filepath: str) -> list[float]:\n", "    \"\"\"Load numeric scores from a text file, one per line.\"\"\"\n", "    fh = None\n", "    try:\n", "        fh = open(filepath, encoding=\"utf-8\")  # noqa: SIM115, PTH123\n", "        lines = fh.readlines()\n", "    except FileNotFoundError:\n", "        print(f\"File not found: {filepath!r}\")\n", "        return []\n", "    except PermissionError as exc:\n", "        print(f\"Permission denied: {exc}\")\n", "        return []\n", "    else:\n", "        print(f\"Loaded {len(lines)} lines successfully\")\n", "        return [float(line.strip()) for line in lines if line.strip()]\n", "    finally:\n", "        if fh is not None:\n", "            fh.close()\n", "            print(\"File handle closed\")" ] }, { "cell_type": "markdown", "id": "68", "metadata": {}, "source": [ "`finally` runs even when an `except` branch returns early. In the call below the file does not exist, so `open()` never succeeds, `fh` stays `None`, and there is nothing to close; if an error occurred after the file had been opened, `finally` would still close the handle:" ] }, { "cell_type": "code", "execution_count": null, "id": "69", "metadata": {}, "outputs": [], "source": [ "# NOTE: prefer `with open(...) as fh` in practice (shown in Sec. 7).\n", "# This example uses explicit open/close to make else/finally visible.\n", "result = load_scores(\"nonexistent.txt\")\n", "print(f\"Result: {result}\")" ] }, { "cell_type": "markdown", "id": "70", "metadata": {}, "source": [ "### Custom Exception Classes\n", "Subclass a built-in exception to give callers a **specific type** to catch. 
Store the structured context as instance attributes for programmatic access:" ] }, { "cell_type": "code", "execution_count": null, "id": "71", "metadata": {}, "outputs": [], "source": [ "# Custom exception classes give callers something specific to catch\n", "class DataValidationError(ValueError):\n", " \"\"\"Raised when a data record fails validation.\"\"\"\n", "\n", " def __init__(self, field: str, value: object, reason: str) -> None:\n", " self.field = field\n", " self.value = value\n", " self.reason = reason\n", " super().__init__(f\"Validation failed for {field!r}={value!r}: {reason}\")" ] }, { "cell_type": "markdown", "id": "72", "metadata": {}, "source": [ "Define a validator that raises `DataValidationError` with field-level context, then test it. The `except DataValidationError` clause catches only your custom type -- not any accidental `ValueError` from elsewhere in the code:" ] }, { "cell_type": "code", "execution_count": null, "id": "73", "metadata": {}, "outputs": [], "source": [ "def validate_student(record: dict[str, object]) -> None:\n", " \"\"\"Validate a student record dict against required field constraints.\"\"\"\n", " gpa = record.get(\"gpa\")\n", " if not isinstance(gpa, int | float):\n", " raise DataValidationError(\"gpa\", gpa, \"must be numeric\")\n", " if not 0.0 <= float(gpa) <= 4.0:\n", " raise DataValidationError(\"gpa\", gpa, \"must be in [0.0, 4.0]\")\n", "\n", " name = record.get(\"name\", \"\")\n", " if not isinstance(name, str) or not name.strip():\n", " raise DataValidationError(\"name\", name, \"must be a non-empty string\")" ] }, { "cell_type": "markdown", "id": "74", "metadata": {}, "source": [ "Test against valid and invalid records. 
The custom exception prints exactly which field failed and why:" ] }, { "cell_type": "code", "execution_count": null, "id": "75", "metadata": {}, "outputs": [], "source": [ "test_records: list[dict[str, object]] = [\n", " {\"name\": \"Alice\", \"gpa\": 3.95}, # valid\n", " {\"name\": \"Bob\", \"gpa\": 5.0}, # GPA out of range\n", " {\"name\": \"\", \"gpa\": 3.5}, # empty name\n", "]\n", "\n", "for rec in test_records:\n", " try:\n", " validate_student(rec)\n", " print(f\" {rec['name']!r:<10} -> valid\")\n", " except DataValidationError as exc:\n", " print(f\" {rec.get('name')!r:<10} -> {exc}\")" ] }, { "cell_type": "markdown", "id": "76", "metadata": {}, "source": [ "
\n", " Activity 3 - Safe Batch Parser

\n", "Goal: Write parse_batch(rows) that returns (valid, errors): a list of successfully parsed floats and a list of error messages.\n", "
rows = ['85.0', '92', 'n/a', '-5', '78.5', '110', '63']\n",
    "\n",
    "valid, errors = parse_batch(rows)\n",
    "# valid  = [85.0, 92.0, 78.5, 63.0]\n",
    "# errors = [\"'n/a' is not a valid number\",\n",
    "#           \"'-5' out of range [0, 100]\",\n",
    "#           \"'110' out of range [0, 100]\"]
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "77", "metadata": {}, "outputs": [], "source": [ "def parse_batch(rows: list[str]) -> tuple[list[float], list[str]]:\n", " \"\"\"Parse a batch of score strings, separating valid from invalid.\n", "\n", " Args:\n", " rows: List of raw score strings.\n", "\n", " Returns:\n", " Tuple of (valid_scores, error_messages).\n", " \"\"\"\n", " valid: list[float] = []\n", " errors: list[str] = []\n", " # TODO: iterate rows, use parse_score from above, collect results\n", " return valid, errors\n", "\n", "\n", "rows: list[str] = [\"85.0\", \"92\", \"n/a\", \"-5\", \"78.5\", \"110\", \"63\"]\n", "valid, errors = parse_batch(rows)\n", "print(f\"valid = {valid}\")\n", "print(f\"errors = {errors}\")" ] }, { "cell_type": "markdown", "id": "78", "metadata": {}, "source": [ "## 7. File I/O with pathlib\n", "\n", "**File I/O** (Input/Output) means reading data from files on disk and writing results back. Almost every data science workflow starts by loading a CSV, JSON, or Parquet file and ends by saving results somewhere.\n", "\n", "`pathlib.Path` is the modern Python way to work with file paths. It is cross-platform (works on Windows, macOS, and Linux without changes) and composable:\n", "\n", "```python\n", "from pathlib import Path\n", "\n", "data_dir = Path('tutorials') / 'data' # / joins path parts\n", "csv_file = data_dir / 'students.csv'\n", "print(csv_file) # tutorials/data/students.csv\n", "```\n", "\n", "
\n", " Key Concept: pathlib.Path, the Modern Way to Handle Paths

\n", "Since Python 3.4, pathlib.Path is the standard for file-system work. It is cross-platform, composable with /, and carries methods for existence checks, directory creation, and reading/writing, all in one object.

Always use with open(...) as fh (context manager) so the file is closed automatically, even if an exception occurs.\n", "
\n", "\n", "
\n", " Common Mistake: Bare String Paths

\n", "open('data/file.csv') works but gives you no path-manipulation methods and is fragile on Windows vs. macOS/Linux. Use Path('data') / 'file.csv' instead.\n", "
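
The context-manager advice above can be checked directly. The sketch below is only an illustration under assumed names (`demo_path`, the file name `scores.txt`, and the flag `closed_after_error` are hypothetical): it writes a throwaway file, triggers an error inside the `with` block, and then inspects the handle:

```python
import tempfile
from pathlib import Path

closed_after_error = False

# Work in a throwaway directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_name:
    demo_path = Path(tmp_name) / 'scores.txt'  # hypothetical file name
    demo_path.write_text('not-a-number', encoding='utf-8')

    try:
        with demo_path.open(encoding='utf-8') as fh:
            value = float(fh.read())  # raises ValueError inside the with block
    except ValueError:
        # The with statement closed fh before control reached this handler
        closed_after_error = fh.closed

print(f'Closed after error: {closed_after_error}')  # True
```

No explicit `fh.close()` or `finally` clause is needed: leaving the `with` block, whether normally or through an exception, closes the file.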
" ] }, { "cell_type": "code", "execution_count": null, "id": "79", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "# Path composition: / operator joins parts, cross-platform\n", "project_root: Path = Path()\n", "data_dir: Path = project_root / \"data\"\n", "output_file: Path = data_dir / \"results\" / \"run_001.json\"\n", "\n", "print(f\"data_dir : {data_dir}\")\n", "print(f\"output_file : {output_file}\")" ] }, { "cell_type": "markdown", "id": "80", "metadata": {}, "source": [ "Every `Path` object knows its own parts: no string slicing to extract a filename or extension. `mkdir(exist_ok=True)` is the safest way to create a directory (no error if it already exists):" ] }, { "cell_type": "code", "execution_count": null, "id": "81", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "# Path properties: inspect parts of a path without string slicing\n", "p = Path(\"tutorials/03-data-analysis/data/primary.csv\")\n", "print(f\"p.name : {p.name}\") # 'primary.csv'\n", "print(f\"p.stem : {p.stem}\") # 'primary'\n", "print(f\"p.suffix : {p.suffix}\") # '.csv'\n", "print(f\"p.parent : {p.parent}\") # 'tutorials/03-data-analysis/data'\n", "print(f\"p.parts : {p.parts}\")\n", "\n", "# Safe directory creation: no error if it already exists\n", "tmp_dir = Path(\"tmp_activity\")\n", "tmp_dir.mkdir(exist_ok=True)\n", "print(f\"\\ntmp_dir exists: {tmp_dir.exists()}\")" ] }, { "cell_type": "markdown", "id": "82", "metadata": {}, "source": [ "### Reading & Writing Files\n", "Always use the `with` statement. It closes the file automatically, even if an exception occurs. 
`DictWriter` writes rows as dicts keyed by column name:" ] }, { "cell_type": "code", "execution_count": null, "id": "83", "metadata": {}, "outputs": [], "source": [ "import csv\n", "from pathlib import Path\n", "\n", "tmp = Path(\"tmp_activity\")\n", "tmp.mkdir(exist_ok=True)\n", "\n", "csv_path = tmp / \"students.csv\"\n", "rows: list[dict[str, object]] = [\n", " {\"name\": \"Alice Kamau\", \"gpa\": 3.95, \"major\": \"CS\"},\n", " {\"name\": \"Bob Mwangi\", \"gpa\": 3.45, \"major\": \"Math\"},\n", " {\"name\": \"Carol Osei\", \"gpa\": 3.88, \"major\": \"CS\"},\n", "]\n", "\n", "with csv_path.open(\"w\", newline=\"\", encoding=\"utf-8\") as fh:\n", " writer = csv.DictWriter(fh, fieldnames=[\"name\", \"gpa\", \"major\"])\n", " writer.writeheader()\n", " writer.writerows(rows)\n", "\n", "print(f\"Wrote: {csv_path}\")" ] }, { "cell_type": "markdown", "id": "84", "metadata": {}, "source": [ "`DictReader` reads each row back as a `dict` keyed by header names, with no positional index access needed:" ] }, { "cell_type": "code", "execution_count": null, "id": "85", "metadata": {}, "outputs": [], "source": [ "import csv\n", "from pathlib import Path\n", "\n", "csv_path = Path(\"tmp_activity\") / \"students.csv\"\n", "\n", "with csv_path.open(encoding=\"utf-8\") as fh:\n", " reader = csv.DictReader(fh)\n", " loaded: list[dict[str, str]] = list(reader)\n", "\n", "for row in loaded:\n", " print(f\" {row['name']:<15} GPA={row['gpa']} {row['major']}\")" ] }, { "cell_type": "markdown", "id": "86", "metadata": {}, "source": [ "For single-document JSON files, `Path.write_text()` + `json.dumps()` and `Path.read_text()` + `json.loads()` is the most concise round-trip:" ] }, { "cell_type": "code", "execution_count": null, "id": "87", "metadata": {}, "outputs": [], "source": [ "import json\n", "from pathlib import Path\n", "\n", "tmp = Path(\"tmp_activity\")\n", "json_path = tmp / \"run_result.json\"\n", "run_data = {\"run_id\": \"exp-001\", \"accuracy\": 0.923, \"tags\": 
[\"baseline\"]}\n", "\n", "# Write: write_text is the cleanest one-liner for JSON\n", "json_path.write_text(json.dumps(run_data, indent=2), encoding=\"utf-8\")\n", "print(f\"Wrote: {json_path}\")\n", "\n", "# Read: read_text + json.loads\n", "reloaded: dict[str, object] = json.loads(json_path.read_text(encoding=\"utf-8\"))\n", "print(f\"Read back: {reloaded}\")" ] }, { "cell_type": "markdown", "id": "88", "metadata": {}, "source": [ "### Finding Files\n", "`Path.iterdir()` yields the immediate children of a directory. `Path.rglob(pattern)` searches the entire subtree recursively:" ] }, { "cell_type": "code", "execution_count": null, "id": "89", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "tmp = Path(\"tmp_activity\")\n", "print(\"Files in tmp_activity:\")\n", "for f in sorted(tmp.iterdir()):\n", " size = f.stat().st_size\n", " print(f\" {f.name:<30} {size:>6} bytes\")" ] }, { "cell_type": "markdown", "id": "90", "metadata": {}, "source": [ "`rglob('*.ipynb')` finds all matching files at any depth. After exploring, clean up the temporary directory with `shutil.rmtree()`:" ] }, { "cell_type": "code", "execution_count": null, "id": "91", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "import shutil\n", "\n", "# rglob: recursive search by pattern\n", "notebooks = list(Path(\"tutorials\").rglob(\"*.ipynb\"))\n", "print(f\"Notebooks found: {len(notebooks)}\")\n", "for nb in sorted(notebooks)[:5]:\n", " print(f\" {nb}\")\n", "\n", "# Clean up tmp directory\n", "tmp = Path(\"tmp_activity\")\n", "shutil.rmtree(tmp)\n", "print(f\"\\nCleaned up: {tmp} exists = {tmp.exists()}\")" ] }, { "cell_type": "markdown", "id": "92", "metadata": {}, "source": [ "### Creating & Checking Directories\n", "\n", "`Path.mkdir()` creates directories; `Path.exists()` and `Path.is_dir()` check state without raising an error. 
Always prefer `mkdir(parents=True, exist_ok=True)` over conditionally calling `os.makedirs()`:" ] }, { "cell_type": "code", "execution_count": null, "id": "93", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "results_dir = Path(\"results\") / \"experiment_001\"\n", "print(f\"Exists before : {results_dir.exists()}\")\n", "\n", "# parents=True creates any missing parent directories\n", "# exist_ok=True is silent if the directory already exists\n", "results_dir.mkdir(parents=True, exist_ok=True)\n", "print(f\"Exists after : {results_dir.exists()}\")\n", "print(f\"Is directory : {results_dir.is_dir()}\")\n", "\n", "# Write a file into the new directory\n", "log_file = results_dir / \"metrics.txt\"\n", "log_file.write_text(\"accuracy=0.923\\nval_loss=0.218\\n\")\n", "print(f\"Log file size : {log_file.stat().st_size} bytes\")\n", "\n", "# Clean up\n", "log_file.unlink()\n", "results_dir.rmdir()\n", "results_dir.parent.rmdir()\n", "print(\"Cleaned up\")" ] }, { "cell_type": "markdown", "id": "94", "metadata": {}, "source": [ "
\n", " Activity 4 - Experiment Logger

\n", "Goal: Write a function that appends an experiment result as a JSON line to a log file, then reads and prints all logged runs.\n", "
log_experiment(Path('runs.jsonl'), run_id='run-001', accuracy=0.901, loss=0.312)\n",
    "log_experiment(Path('runs.jsonl'), run_id='run-002', accuracy=0.923, loss=0.218)\n",
    "\n",
    "# runs.jsonl contents:\n",
    "# {\"run_id\": \"run-001\", \"accuracy\": 0.901, \"loss\": 0.312}\n",
    "# {\"run_id\": \"run-002\", \"accuracy\": 0.923, \"loss\": 0.218}
\n", "Hint: JSONL (JSON Lines), one JSON object per line, is the standard format for streaming experiment logs. Use mode='a' to append.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "95", "metadata": {}, "outputs": [], "source": [ "import json\n", "from pathlib import Path\n", "\n", "\n", "def log_experiment(log_path: Path, **metrics: object) -> None:\n", " \"\"\"Append an experiment result as a JSON line to log_path.\n", "\n", " Args:\n", " log_path: Path to the .jsonl log file (created if absent).\n", " **metrics: Any metric name/value pairs to record.\n", " \"\"\"\n", " ... # TODO\n", "\n", "\n", "log_path = Path(\"runs.jsonl\")\n", "if log_path.exists():\n", " log_path.unlink() # start fresh for this activity\n", "\n", "log_experiment(log_path, run_id=\"run-001\", accuracy=0.901, loss=0.312)\n", "log_experiment(log_path, run_id=\"run-002\", accuracy=0.923, loss=0.218)\n", "\n", "# Read back and print all runs\n", "print(\"Logged runs:\")\n", "for line in log_path.read_text(encoding=\"utf-8\").splitlines():\n", " run = json.loads(line)\n", " print(f\" {run}\")\n", "\n", "log_path.unlink() # clean up" ] }, { "cell_type": "markdown", "id": "96", "metadata": {}, "source": [ "### Why study gotchas?\n", "\n", "The bugs in this section are **silent**: they do not raise an error. Python happily runs the code and produces the **wrong answer**. These patterns appear in real data pipelines and ML training scripts and can waste hours of debugging time.\n", "\n", "Read through them now and you will recognise them instantly in the wild.\n", "\n", "## 8. Common Gotchas\n", "\n", "
\n", " Key Concept: Bugs That Are Hard to See

\n", "The following patterns cause real bugs in data pipelines and ML code. None of them raise an exception. They silently produce wrong results. Learn to recognise them now so you never spend hours debugging them later.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "97", "metadata": {}, "outputs": [], "source": [ "# GOTCHA 1: Mutable default argument\n", "# The default [] is created ONCE at function definition time - shared across all calls!\n", "def append_score_bad(score: float, history: list[float] = []) -> list[float]: # noqa: B006\n", " history.append(score)\n", " return history\n", "\n", "\n", "print(\"Bad default: the list leaks between calls:\")\n", "print(append_score_bad(82.0)) # [82.0] expected\n", "print(append_score_bad(91.0)) # [82.0, 91.0] WRONG: previous call leaked in!" ] }, { "cell_type": "markdown", "id": "98", "metadata": {}, "source": [ "The fix: use `None` as the default sentinel and create a fresh list inside the function body on each call:" ] }, { "cell_type": "code", "execution_count": null, "id": "99", "metadata": {}, "outputs": [], "source": [ "def append_score(score: float, history: list[float] | None = None) -> list[float]:\n", " if history is None:\n", " history = [] # new list created on every call where history is not provided\n", " history.append(score)\n", " return history\n", "\n", "\n", "print(\"Fixed: independent list each time:\")\n", "print(append_score(82.0)) # [82.0]\n", "print(append_score(91.0)) # [91.0] fresh list\n", "\n", "# Rule: never use a mutable object (list, dict, set) as a default argument value.\n", "# With @dataclass use field(default_factory=list) instead (shown in Sec. 4)." ] }, { "cell_type": "markdown", "id": "100", "metadata": {}, "source": [ "**Gotcha 2: assignment is not a copy.** `b = a` creates a second name for the **same** list. Shallow `.copy()` creates a new outer container but inner objects are still shared. 
Use `copy.deepcopy()` for fully independent nested structures:" ] }, { "cell_type": "code", "execution_count": null, "id": "101", "metadata": {}, "outputs": [], "source": [ "# GOTCHA 2: Assignment is NOT a copy\n", "# For nested structures, .copy() is a SHALLOW copy: inner objects are still shared.\n", "\n", "import copy\n", "\n", "original: list[list[int]] = [[1, 2], [3, 4]]\n", "\n", "ref = original # same object\n", "shallow_copy = original.copy() # new outer list, shared inner lists\n", "deep_copy = copy.deepcopy(original) # completely independent\n", "\n", "original[0].append(99)\n", "print(f\"original : {original}\") # [[1, 2, 99], [3, 4]]\n", "print(f\"ref : {ref}\") # [[1, 2, 99], [3, 4]] : same object\n", "print(f\"shallow_copy : {shallow_copy}\") # [[1, 2, 99], [3, 4]] : inner list shared!\n", "print(f\"deep_copy : {deep_copy}\") # [[1, 2], [3, 4]] : fully independent" ] }, { "cell_type": "markdown", "id": "102", "metadata": {}, "source": [ "**Gotcha 3: `/` vs `//`.** Both divide, but `//` floors toward negative infinity, not toward zero. 
Note that `//` on floats returns a float, not an int:" ] }, { "cell_type": "code", "execution_count": null, "id": "103", "metadata": {}, "outputs": [], "source": [ "# GOTCHA 3: / vs //: easy to confuse\n", "print(f\"7 / 2 = {7 / 2}\") # 3.5 : true division, always float\n", "print(f\"7 // 2 = {7 // 2}\") # 3 : floor, NOT truncate\n", "print(f\"-7 // 2 = {-7 // 2}\") # -4 : floors toward negative infinity\n", "print(f\"7.5//2 = {7.5 // 2}\") # 3.0 : floor of float is still float" ] }, { "cell_type": "markdown", "id": "104", "metadata": {}, "source": [ "**Gotcha 4 & 5: `{}` is a dict, and truthiness is not `None`-ness.** Python treats `0`, `0.0`, `None`, `[]`, and `''` all as falsy, which silently breaks the common `value or default` pattern when `0.0` is a legitimate result:" ] }, { "cell_type": "code", "execution_count": null, "id": "105", "metadata": {}, "outputs": [], "source": [ "# GOTCHA 4: {} creates a dict, not a set\n", "empty1 = {}\n", "empty2 = set()\n", "print(f\"type({{}}) : {type(empty1)}\")\n", "print(f\"type(set()) : {type(empty2)}\")\n", "\n", "# GOTCHA 5: Boolean short-circuit: 0.0, None, '', [] are all falsy\n", "score: float | None = None\n", "result = score or 0.0 # 0.0: but breaks if score is legitimately 0.0!\n", "print(f'\\n0.0 or \"default\" : {0.0 or \"default\"}') # noqa: SIM222\n", "\n", "# Prefer an explicit None check\n", "result2 = score if score is not None else 0.0\n", "print(f\"Explicit check : {result2}\")" ] }, { "cell_type": "markdown", "id": "106", "metadata": {}, "source": [ "## 9. Capstone Exercises\n", "\n", "Apply everything from Parts 1-3. Each exercise is self-contained. Attempt them without looking at previous sections first.\n", "\n", "
\n", " Exercise 1 - Student Report Generator

\n", "Goal: Given a list of student dicts, produce a formatted text report.\n", "
students = [\n",
    "    {'name': 'Alice', 'scores': [88, 92, 85], 'major': 'CS'},\n",
    "    {'name': 'Bob',   'scores': [62, 70, 58], 'major': 'Math'},\n",
    "    {'name': 'Carol', 'scores': [91, 95, 89], 'major': 'CS'},\n",
    "]\n",
    "\n",
    "# Expected output:\n",
    "# Name      Major    Avg    Grade\n",
    "# Alice     CS       88.3   B\n",
    "# Carol     CS       91.7   A\n",
    "# Bob       Math     63.3   D\n",
    "# (sorted by average score, descending)
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "107", "metadata": {}, "outputs": [], "source": [ "students: list[dict[str, object]] = [\n", " {\"name\": \"Alice\", \"scores\": [88, 92, 85], \"major\": \"CS\"},\n", " {\"name\": \"Bob\", \"scores\": [62, 70, 58], \"major\": \"Math\"},\n", " {\"name\": \"Carol\", \"scores\": [91, 95, 89], \"major\": \"CS\"},\n", "]\n", "\n", "# TODO: produce the formatted report\n", "..." ] }, { "cell_type": "markdown", "id": "108", "metadata": {}, "source": [ "
\n", " Exercise 2 - Experiment Tracker Dataclass

\n", "Goal: Define an Experiment dataclass, populate a list of runs, then print the best run by validation accuracy.\n", "
@dataclass\n",
    "class Experiment:\n",
    "    run_id: str\n",
    "    model: str\n",
    "    val_accuracy: float\n",
    "    config: TrainingConfig   # from Sec. 4\n",
    "\n",
    "# Expected output:\n",
    "# Best run: Experiment(run_id='run-002', model='xgboost', val_accuracy=0.934, ...)
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "109", "metadata": {}, "outputs": [], "source": [ "from dataclasses import dataclass\n", "\n", "\n", "@dataclass\n", "class Experiment:\n", " \"\"\"Record for a single training experiment.\"\"\"\n", "\n", " run_id: str\n", " model: str\n", " val_accuracy: float\n", " # TODO: add more fields as needed\n", "\n", "\n", "runs: list[Experiment] = [\n", " Experiment(\"run-001\", \"xgboost\", 0.901),\n", " Experiment(\"run-002\", \"xgboost\", 0.934),\n", " Experiment(\"run-003\", \"linear\", 0.881),\n", "]\n", "\n", "# TODO: find and print the best run\n", "best: Experiment = ...\n", "print(f\"Best run: {best}\")" ] }, { "cell_type": "markdown", "id": "110", "metadata": {}, "source": [ "
\n", " Exercise 3 - Rolling Anomaly Detector

\n", "Goal: Using a deque of size 5, flag any reading that deviates more than 2 standard deviations from the window mean.\n", "
readings = [36.5, 36.7, 36.8, 36.6, 36.9, 39.5, 36.7, 36.8]\n",
    "\n",
    "# Expected: reading 39.5 flagged as anomaly
\n", "Hint: Use score_summary() from Sec. 1 (or inline the calculation).\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "111", "metadata": {}, "outputs": [], "source": [ "from collections import deque\n", "\n", "readings: list[float] = [36.5, 36.7, 36.8, 36.6, 36.9, 39.5, 36.7, 36.8]\n", "WINDOW_SIZE: int = 5\n", "THRESHOLD_STD: float = 2.0\n", "\n", "window: deque[float] = deque(maxlen=WINDOW_SIZE)\n", "\n", "# TODO: detect and print anomalies\n", "for _reading in readings:\n", " ..." ] }, { "cell_type": "markdown", "id": "112", "metadata": {}, "source": [ "
\n", " Exercise 4 - Moving Window Average

\n", "A raw metric stream (sensor readings, training loss per step) can be noisy. A moving window average smooths it by replacing each value with the average of its neighbours within a fixed window.

Task: Implement moving_window_average(x, n_neighbors):
\n", "\n", "
losses = [0.95, 0.82, 0.91, 0.78, 0.65, 0.70, 0.60, 0.55]\n",
    "moving_window_average(losses, n_neighbors=1)\n",
    "# window of 3: each value = mean of itself + 1 left + 1 right\n",
    "# → [0.887, 0.893, 0.837, 0.780, 0.710, 0.650, 0.617, 0.575]
\n", "Then: compute the average for n_neighbors in 1–4 and print the range (max − min) of each smoothed list. Does the range shrink as the window grows? Why?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "113", "metadata": {}, "outputs": [], "source": [ "def moving_window_average(x: list[float], n_neighbors: int = 1) -> list[float]:\n", " \"\"\"Replace each value with the mean of its n_neighbors on each side.\n", "\n", " Args:\n", " x: Input list of floats.\n", " n_neighbors: Number of neighbours on each side to include.\n", "\n", " Returns:\n", " Smoothed list of the same length as x.\n", " \"\"\"\n", " n = len(x)\n", " width = n_neighbors * 2 + 1\n", " # Pad: repeat first/last element for missing neighbours at edges\n", " padded = [x[0]] * n_neighbors + x + [x[-1]] * n_neighbors\n", " # TODO: return the windowed means\n", " return [sum(padded[i : i + width]) / width for i in range(n)]\n", "\n", "\n", "training_loss: list[float] = [0.95, 0.82, 0.91, 0.78, 0.65, 0.70, 0.60, 0.55]\n", "print(\"Original :\", training_loss)\n", "\n", "for k in range(1, 5):\n", " smoothed = moving_window_average(training_loss, n_neighbors=k)\n", " rng = round(max(smoothed) - min(smoothed), 4)\n", " print(f\" n={k}: range={rng} {[round(v, 3) for v in smoothed]}\")" ] }, { "cell_type": "markdown", "id": "114", "metadata": {}, "source": [ "## Further Reading\n", "\n", "| Resource | Why it matters |\n", "|---|---|\n", "| [PEP 484 — Type Hints](https://peps.python.org/pep-0484/) | The original proposal; reading it explains *why* certain type annotation rules exist |\n", "| [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html) | The docstring format used throughout this notebook comes from here |\n", "| Van Rossum, G., Warsaw, B. & Coghlan, N. (2001). [PEP 8 — Style Guide for Python Code](https://peps.python.org/pep-0008/). | The canonical style reference for all Python code |\n", "| Hunt, A. & Thomas, D. (1999). *The Pragmatic Programmer*. 
| Chapters on DRY, orthogonality, and design-by-contract translate directly to the patterns in this notebook |\n" ] }, { "cell_type": "markdown", "id": "115", "metadata": {}, "source": [ "## Summary\n", "\n", "| Concept | Key rule |\n", "|---|---|\n", "| Functions | Annotate all params and return types; use Google-style docstrings |\n", "| Defaults | Never use a mutable object as a default, use `None` and check inside |\n", "| Lambda | Use for short key functions: `sorted(items, key=lambda x: x['score'])` |\n", "| `*args` | Collect variable positional args into a tuple; unpack a list with `*list` |\n", "| `**kwargs` | Collect variable keyword args into a dict; unpack a dict with `**dict` |\n", "| `@dataclass` | Generated `__init__`, `__repr__`, `__eq__`; use `frozen=True` for immutability |\n", "| `field(default_factory=list)` | The correct way to give a dataclass a mutable default |\n", "| Imports | `import module` over `from module import *`; use conventional aliases |\n", "| Exceptions | `except SpecificError` not bare `except`; use `else` for success path, `finally` for cleanup |\n", "| `pathlib.Path` | Cross-platform paths; compose with `/`; read with `.read_text()`, write with `.write_text()` |\n", "| Context manager | `with open(...) as fh` always, file closes automatically |\n", "| Gotchas | Mutable defaults, `=` is not copy, `//` floors, `{}` is dict |\n", "\n", "\n", "**Next:** `04-numpy.ipynb`, covering arrays, broadcasting, and vectorised operations with NumPy." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }