{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 3: Patterns for Data Science & ML\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/03-python-patterns.ipynb) [](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/03-python-patterns.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Python Foundations**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook assumes you have completed Part 1 (`01-python-core.ipynb`) and Part 2 (`02-control-flow.ipynb`). If you have not, start there. The concepts here build directly on both.\n", "\n", "Part 3 introduces **professional coding patterns**: the habits and structures that separate a working script from maintainable, production-grade code. These patterns are used every day in real data science and ML engineering work.\n", "\n", "| Pattern | Why it matters |\n", "|---|---|\n", "| **Functions** | Reuse logic without copying code; make code testable |\n", "| **Lambda** | Write concise callbacks for `sorted()`, `map()`, pandas `.apply()` |\n", "| **\\*args / \\*\\*kwargs** | Handle flexible inputs like scikit-learn and PyTorch do |\n", "| **Dataclasses** | Typed, structured containers for configs and pipeline state |\n", "| **Modules** | Organise code into files; use the standard library |\n", "| **Exceptions** | Handle errors gracefully instead of crashing |\n", "| **pathlib** | Read and write files safely, cross-platform |\n", "\n", "The running example is the same **university analytics platform** from Part 1.\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." 
] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Write type-annotated functions with Google-style docstrings | Sec. 1 |\n", "| 2 | Use lambda, `*args`, and `**kwargs` | Sec. 2, 3 |\n", "| 3 | Define structured data with `@dataclass` | Sec. 4 |\n", "| 4 | Import and use the standard library | Sec. 5 |\n", "| 5 | Handle exceptions with `try/except/else/finally` | Sec. 6 |\n", "| 6 | Read and write files with `pathlib.Path` | Sec. 7 |\n", "| 7 | Recognise and avoid the most common Python gotchas | Sec. 8 |\n", ":::\n" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## 1. Functions\n", "\n", "A **function** is a named, reusable block of code. You define it once and call it as many times as you need, with different inputs each time.\n", "\n", "```python\n", "# Define once:\n", "def greet(name):\n", "    print(f'Hello, {name}!')\n", "\n", "# Call many times:\n", "greet('Alice')  # Hello, Alice!\n", "greet('Bob')  # Hello, Bob!\n", "```\n", "\n", "Without functions, any repeated logic must be copy-pasted, and copy-pasted code means a bug gets fixed in one copy but not in the others. Functions are the foundation of all reusable, testable code.\n", "\n", "
Type hints use the same `name: type` syntax from Part 1, Sec. 1, just applied to function signatures.\n",
"\n",
"Write a function `classify_cohort` that takes a list of scores and returns a dict mapping each grade letter to its count.\n",
"classify_cohort([95, 83, 71, 62, 45, 88, 76])\n",
"# -> {'A': 1, 'B': 2, 'C': 2, 'D': 1, 'F': 1}\n",
"Hint: Use a helper function grade_letter(score) -> str and Counter.\n",
"accept_login(users, username, password) that takes a dict[str, str] of username→password pairs and returns True if the username exists and the password matches, False otherwise.users = {\"alice\": \"ds2024\", \"bob\": \"ml#secure\"}\n",
"\n",
"accept_login(users, \"alice\", \"ds2024\") # True\n",
"accept_login(users, \"alice\", \"wrong\") # False (bad password)\n",
"accept_login(users, \"carol\", \"any\") # False (user not found)\n",
"Hint: Use dict.get() to avoid a KeyError on missing usernames.\n",
"lambda is a function with no name, a single expression, and an implicit return. Use it as a short callback for sorted(), map(), filter(), and especially pandas .apply().*args collects any number of positional arguments into a tuple.**kwargs collects any number of keyword arguments into a dict.model.fit(X, y, **config), nn.Sequential(*layers).\n",
"log_metrics(epoch, **metrics) that prints a formatted line and returns a dict.\n",
"log_metrics(5, loss=0.312, accuracy=0.901, val_loss=0.334)\n",
"# prints:\n",
"# Epoch 05 | loss=0.3120 accuracy=0.9010 val_loss=0.3340\n",
"# returns:\n",
"# {'epoch': 5, 'loss': 0.312, 'accuracy': 0.901, 'val_loss': 0.334}\n",
"@dataclass (Python 3.7+) generates __init__, __repr__, and __eq__ from field annotations automatically. It is the modern replacement for plain dicts when the shape of your data is known and fixed.dict: flexible, arbitrary keys, JSON-friendlyNamedTuple: immutable record, tuple-compatible@dataclass: mutable typed object with methods; default for ML configs and pipeline state@dataclass(frozen=True): immutable dataclass; hashable, usable as dict key| import math | import the whole module; access with math.sqrt() |
| from math import sqrt, pi | import specific names; use directly |
| import numpy as np | alias (conventional for large packages) |
import module over from module import *. The star import pollutes the namespace and hides where names come from.\n",
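As a rough sketch of the import styles discussed above, the block below uses only the standard library. The alias line uses `statistics as stats` as a stand-in for the `import numpy as np` convention, so the example runs without third-party packages; that substitution is an assumption made for this illustration.

```python
import math                    # whole module; names stay behind the `math.` prefix
from math import pi, sqrt      # specific names, used directly
import statistics as stats     # alias, in the spirit of `import numpy as np`

hypotenuse = math.hypot(3, 4)            # 5.0
root = sqrt(16)                          # 4.0
circle_area = pi * 2 ** 2                # area of a circle with radius 2
mean_score = stats.mean([80, 90, 100])   # 90

print(hypotenuse, root, round(circle_area, 2), mean_score)
```

With the prefixed forms (`math.hypot`, `stats.mean`) a reader can always tell which module a name came from, which is exactly what a star import hides.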
"try: code that might raise an exceptionexcept ExcType as e: handle a specific exceptionelse: runs only if NO exception was raised in tryfinally: always runs, even if an exception propagates (use for cleanup)except: or except Exception: hides bugs and silences keyboard interrupts.\n",
"parse_batch(rows) that returns (valid, errors): a list of successfully parsed floats and a list of error messages.\n",
"rows = ['85.0', '92', 'n/a', '-5', '78.5', '110', '63']\n",
"\n",
"valid, errors = parse_batch(rows)\n",
"# valid = [85.0, 92.0, 78.5, 63.0]\n",
"# errors = [\"'n/a' is not a valid number\",\n",
"# \"'-5' out of range [0, 100]\",\n",
"# \"'110' out of range [0, 100]\"]\n",
"pathlib.Path is the standard for file-system work. It is cross-platform, composable with /, and carries methods for existence checks, directory creation, and reading/writing, all in one object. with open(...) as fh (context manager) so the file is closed automatically, even if an exception occurs.\n",
"open('data/file.csv') works but gives you no path-manipulation methods and is fragile on Windows vs. macOS/Linux. Use Path('data') / 'file.csv' instead.\n",
"log_experiment(Path('runs.jsonl'), run_id='run-001', accuracy=0.901, loss=0.312)\n",
"log_experiment(Path('runs.jsonl'), run_id='run-002', accuracy=0.923, loss=0.218)\n",
"\n",
"# runs.jsonl contents:\n",
"# {\"run_id\": \"run-001\", \"accuracy\": 0.901, \"loss\": 0.312}\n",
"# {\"run_id\": \"run-002\", \"accuracy\": 0.923, \"loss\": 0.218}\n",
"Hint: JSONL (JSON Lines), one JSON object per line, is the standard format for streaming experiment logs. Use mode='a' to append.\n",
"students = [\n",
" {'name': 'Alice', 'scores': [88, 92, 85], 'major': 'CS'},\n",
" {'name': 'Bob', 'scores': [62, 70, 58], 'major': 'Math'},\n",
" {'name': 'Carol', 'scores': [91, 95, 89], 'major': 'CS'},\n",
"]\n",
"\n",
"# Expected output:\n",
"# Name Major Avg Grade\n",
"# Alice CS 88.3 B\n",
"# Carol CS 91.7 A\n",
"# Bob Math 63.3 D\n",
"# (sorted by average score, descending)\n",
"Experiment dataclass, populate a list of runs, then print the best run by validation accuracy.\n",
"@dataclass\n",
"class Experiment:\n",
" run_id: str\n",
" model: str\n",
" val_accuracy: float\n",
" config: TrainingConfig # from Sec. 4\n",
"\n",
"# Expected output:\n",
"# Best run: Experiment(run_id='run-002', model='xgboost', val_accuracy=0.934, ...)\n",
"deque of size 5, flag any reading that deviates more than 2 standard deviations from the window mean.\n",
"readings = [36.5, 36.7, 36.8, 36.6, 36.9, 39.5, 36.7, 36.8]\n",
"\n",
"# Expected: reading 39.5 flagged as anomaly\n",
"Hint: Use score_summary() from Sec. 1 (or inline the calculation).\n",
"moving_window_average(x, n_neighbors):n_neighbors elements on each side plus the element itselflosses = [0.95, 0.82, 0.91, 0.78, 0.65, 0.70, 0.60, 0.55]\n",
"moving_window_average(losses, n_neighbors=1)\n",
"# window of 3: each value = mean of itself + 1 left + 1 right\n",
"# → [0.887, 0.893, 0.837, 0.780, 0.710, 0.650, 0.617, 0.575]\n",
"Then: compute the average for n_neighbors in 1–4 and print the range (max − min) of each smoothed list. Does the range shrink as the window grows? Why?\n",
"