{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 15: Type Annotations\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/02-dev-tools/03-type-annotations.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/02-dev-tools/03-type-annotations.ipynb)\n" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Dev Tools**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook assumes you have completed [Part 13: Project Setup with uv](01-uv-project-setup.qmd) and [Part 14: Code Quality with ruff](02-code-quality-ruff.qmd). The `grade-predictor` project from those chapters is the codebase we annotate here.\n", "\n", "This is the one notebook in Part 3, because type annotations are pure Python: running annotated functions live shows the gap between what Python accepts at runtime and what a type checker would flag statically. Every example is executable.\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide).\n" ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 15 you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Explain why type annotations matter in DS code | Sec. 1 |\n", "| 2 | Write annotated function signatures with basic types | Sec. 2 |\n", "| 3 | Annotate numpy arrays with `NDArray` and pandas DataFrames | Sec. 
3 |\n", "| 4 | Use `TypeAlias` and `Protocol` for complex DS types | Sec. 4 |\n", "| 5 | Interpret `ty check` output and fix type errors | Sec. 5 |\n", "| 6 | Apply gradual typing: where to start and what to skip | Sec. 6 |\n", ":::\n" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## 1. Why Type Annotations Matter" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "Two versions of the same function:\n", "\n", "```python\n", "# Without annotations\n", "def compute_grade(midterm, final, project, weights):\n", "    ...\n", "\n", "# With annotations\n", "def compute_grade(\n", "    midterm: float,\n", "    final: float,\n", "    project: float,\n", "    weights: tuple[float, float, float] = (0.30, 0.45, 0.25),\n", ") -> float:\n", "    ...\n", "```\n", "\n", "The annotated version is self-documenting: any editor with a type checker installed will warn you the moment you pass `\"82\"` instead of `82.0`. The unannotated version gives no such warning. Depending on the arithmetic, the mistake either fails late with a `TypeError` (`\"82\" * 0.30` cannot be evaluated) or silently produces a wrong value (`\"82\" * 2` evaluates to `\"8282\"`). That is real Python behavior, not a hypothetical.\n", "\n", "Python does not enforce annotations at runtime. That is the job of a static type checker. The annotation is documentation that a machine can check.\n" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "
\n", " Key Concept: Annotations are documentation a machine can check

\n", "They tell collaborators and your future self what a function expects and returns, without writing a word of prose. A type checker like ty reads them and flags type mismatches before the code runs.\n", "
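
A minimal sketch of the gap between runtime and static checking, using a hypothetical `scale_score` helper that is not part of `grade-predictor`. Python stores the annotations on the function object but never consults them when the function is called, so a `str` argument either fails late or slips through, depending on the arithmetic.

```python
def scale_score(score: float, factor: float) -> float:
    # Hypothetical helper, for illustration only
    return score * factor


# Annotations are kept as metadata on the function object
assert scale_score.__annotations__['score'] is float

# str * int is sequence repetition: no error, silently wrong result
wrong = scale_score('82', 2)
print(repr(wrong))

# str * float is undefined, so the mistake surfaces only inside the body
late_error = None
try:
    scale_score('82', 0.30)
except TypeError as exc:
    late_error = str(exc)
    print(f'TypeError: {late_error}')
```

A static checker reading the same signature would be expected to reject both calls before the code runs, because neither passes a `float` for `score`.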
" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "## 2. Basic Annotations" ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "# The basic scalar types used in grade-predictor\n", "from __future__ import annotations # enables newer type syntax on Python 3.10+\n", "\n", "\n", "def compute_grade(\n", " midterm: float,\n", " final: float,\n", " project: float,\n", " weights: tuple[float, float, float] = (0.30, 0.45, 0.25),\n", ") -> float:\n", " if abs(sum(weights) - 1.0) > 0.001:\n", " raise ValueError(f\"weights must sum to 1, got {sum(weights):.3f}\")\n", " return midterm * weights[0] + final * weights[1] + project * weights[2]\n", "\n", "\n", "compute_grade(80.0, 85.0, 90.0)" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "Python runs `compute_grade(\"82\", 85.0, 90.0)` without raising an error. The annotation is a contract, not a runtime check:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "# Python does NOT enforce annotations at runtime\n", "result = compute_grade(\"82\", 85.0, 90.0) # no error from Python\n", "print(result) # \"82\" * 0.30 -> TypeError: can't multiply sequence by non-int of type 'float'\n", "# Actually raises TypeError here, but only because float multiplication fails on str\n", "# With an int weight it would silently produce wrong output" ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "The full set of basic types used in DS function signatures:\n", "\n", "| Type | Use for |\n", "|---|---|\n", "| `int` | counts, indices |\n", "| `float` | scores, rates, measurements |\n", "| `str` | labels, column names, IDs |\n", "| `bool` | flags, binary outcomes |\n", "| `int or float` | either, when both are valid |\n", "| `float or None` | an optional numeric value |\n", "| `list[float]` | a sequence of floats |\n", "| `tuple[float, float, float]` | a fixed-length 
sequence |\n", "| `dict[str, float]` | a mapping from string keys to float values |\n" ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "def grade_to_letter(average: float) -> str:\n", "    if average >= 85:\n", "        return \"A\"\n", "    elif average >= 70:\n", "        return \"B\"\n", "    elif average >= 55:\n", "        return \"C\"\n", "    elif average >= 45:\n", "        return \"D\"\n", "    return \"F\"\n", "\n", "\n", "def flag_at_risk(score: float | None, threshold: float = 50.0) -> bool:\n", "    if score is None:\n", "        return True  # missing score is treated as at-risk\n", "    return score < threshold\n", "\n", "\n", "def grade_summary(midterm: float, final: float, project: float) -> dict[str, float]:\n", "    avg = compute_grade(midterm, final, project)\n", "    return {\"average\": avg, \"midterm\": midterm, \"final\": final, \"project\": project}\n", "\n", "\n", "grade_summary(80.0, 85.0, 90.0)" ] }, { "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "
\n", " Activity 1 - Annotate Three Functions

\n", "Goal: Annotate these three signatures. Include one with a float | None parameter for a nullable score, one that returns dict[str, float], and one that takes a list[str] of column names.\n", "
def normalize_score(raw, min_val, max_val): ...\n",
    "def compute_cohort_summary(scores): ...          # returns dict\n",
    "def select_columns(df, columns): ...             # columns is list[str]
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "# TODO: annotate the three functions\n", "..." ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "## 3. Annotating numpy Arrays and pandas DataFrames" ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "This is the gap in most Python type annotation tutorials. DS code is full of numpy arrays and pandas DataFrames, and the annotations for them are not obvious.\n", "\n", "For numpy, use `NDArray` from `numpy.typing`:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from numpy.typing import NDArray\n", "import pandas as pd\n", "\n", "\n", "def normalize(X: NDArray[np.float64]) -> NDArray[np.float64]:\n", " mean = X.mean(axis=0)\n", " std = X.std(axis=0)\n", " return (X - mean) / std\n", "\n", "\n", "# NDArray[np.float64] is a typed array: a 2D array of 64-bit floats\n", "scores = np.array([[80.0, 85.0, 90.0], [70.0, 75.0, 80.0]])\n", "normalize(scores)" ] }, { "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "For pandas, `pd.DataFrame` is the practical annotation, even though it carries no column-level information:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "19", "metadata": {}, "outputs": [], "source": [ "def add_average_marks(df: pd.DataFrame) -> pd.DataFrame:\n", " df = df.copy()\n", " df[\"average_marks\"] = (df[\"midterm_score\"] + df[\"final_score\"] + df[\"project_score\"]) / 3\n", " return df\n", "\n", "\n", "def flag_at_risk_series(df: pd.DataFrame, threshold: float = 50.0) -> pd.Series:\n", " return df[\"average_marks\"] < threshold\n", "\n", "\n", "# Create a sample DataFrame to test\n", "sample = pd.DataFrame(\n", " {\n", " \"student_id\": [\"S0001\", \"S0002\"],\n", " \"midterm_score\": [80.0, 60.0],\n", " \"final_score\": [85.0, 55.0],\n", " \"project_score\": [90.0, 65.0],\n", " 
}\n", ")\n", "result = add_average_marks(sample)\n", "result" ] }, { "cell_type": "markdown", "id": "20", "metadata": {}, "source": [ "
\n", " Pro Tip: pd.DataFrame is practical; pandera adds column types

\n", "pd.DataFrame is a useful annotation even though it carries no column information. The next step is pandera.typing.DataFrame[Schema], which encodes column names and dtypes at the type level. Start with pd.DataFrame and graduate to pandera when you need column-level guarantees in a data pipeline.\n", "
" ] }, { "cell_type": "markdown", "id": "21", "metadata": {}, "source": [ "
\n", " Activity 2 - Annotate Array and DataFrame Functions

\n", "Goal: Write and annotate two functions: one that takes NDArray[np.float64] and returns a normalized array, and one that takes a pd.DataFrame and returns a filtered pd.DataFrame. Confirm both run correctly on the sample DataFrame above.\n", "
def normalize_features(X: NDArray[np.float64]) -> NDArray[np.float64]: ...\n",
    "def filter_passing(df: pd.DataFrame, threshold: float = 50.0) -> pd.DataFrame: ...
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "22", "metadata": {}, "outputs": [], "source": [ "# TODO: write and annotate the two functions\n", "..." ] }, { "cell_type": "markdown", "id": "23", "metadata": {}, "source": [ "## 4. TypeAlias and Protocol" ] }, { "cell_type": "markdown", "id": "24", "metadata": {}, "source": [ "When the same complex type appears in many function signatures, give it a name. In Python 3.12, the `type` keyword creates a type alias clearly and without imports:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "25", "metadata": {}, "outputs": [], "source": [ "type ScoreVector = list[float]\n", "type GradeMap = dict[str, str] # student_id -> letter grade\n", "type WeightTuple = tuple[float, float, float]\n", "\n", "\n", "def batch_compute_grades(\n", " score_rows: list[ScoreVector],\n", " weights: WeightTuple = (0.30, 0.45, 0.25),\n", ") -> GradeMap:\n", " results = {}\n", " for i, (midterm, final, project) in enumerate(score_rows):\n", " avg = compute_grade(midterm, final, project, weights)\n", " results[f\"S{i + 1:04d}\"] = grade_to_letter(avg)\n", " return results\n", "\n", "\n", "batch_compute_grades([[80.0, 85.0, 90.0], [55.0, 60.0, 58.0]])" ] }, { "cell_type": "markdown", "id": "26", "metadata": {}, "source": [ "`Protocol` is for duck-typed objects. 
Instead of importing a specific class, you describe the interface you need:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "27", "metadata": {}, "outputs": [], "source": [ "from typing import Protocol\n", "\n", "\n", "class Predictor(Protocol):\n", "    def predict(self, X: NDArray[np.float64]) -> NDArray[np.float64]: ...\n", "    # sklearn estimators return self from fit, so the protocol accepts any return value\n", "    def fit(self, X: NDArray[np.float64], y: NDArray[np.float64]) -> object: ...\n", "\n", "\n", "def evaluate(model: Predictor, X_test: NDArray[np.float64], y_test: NDArray[np.float64]) -> float:\n", "    predictions = model.predict(X_test)\n", "    return float(np.mean((predictions - y_test) ** 2) ** 0.5)  # RMSE\n", "\n", "\n", "# Any sklearn-compatible model satisfies Predictor without importing sklearn\n", "print(\"Predictor protocol defined\")" ] }, { "cell_type": "markdown", "id": "28", "metadata": {}, "source": [ "
\n", " Key Concept: Protocol over import

\n", "evaluate(model: Predictor, ...) accepts any object with predict and fit methods: sklearn's LinearRegression, XGBRegressor, a custom class. No import of sklearn needed in the type signature. This is structural subtyping, and it keeps your utility functions independent of any specific ML library.\n", "
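
A stdlib-only sketch of the same idea, with plain sequences standing in for numpy arrays so it runs without third-party packages. `SupportsPredict`, `MeanModel`, and `rmse` are hypothetical names for illustration; the point is that `MeanModel` never inherits from the protocol and still satisfies it.

```python
import math
from collections.abc import Sequence
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsPredict(Protocol):
    # Hypothetical protocol: anything with fit and predict qualifies
    def fit(self, X: Sequence[float], y: Sequence[float]) -> object: ...
    def predict(self, X: Sequence[float]) -> list[float]: ...


class MeanModel:
    # Does not inherit from SupportsPredict; it only has matching methods
    def __init__(self) -> None:
        self.mean_ = 0.0

    def fit(self, X: Sequence[float], y: Sequence[float]) -> 'MeanModel':
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X: Sequence[float]) -> list[float]:
        return [self.mean_ for _ in X]


def rmse(model: SupportsPredict, X: Sequence[float], y: Sequence[float]) -> float:
    predictions = model.predict(X)
    squared = [(p - t) ** 2 for p, t in zip(predictions, y)]
    return math.sqrt(sum(squared) / len(squared))


model = MeanModel().fit([1.0, 2.0, 3.0], [70.0, 80.0, 90.0])
score = rmse(model, [1.0, 2.0], [80.0, 80.0])
# With runtime_checkable, isinstance only checks that the methods exist
is_predictor = isinstance(model, SupportsPredict)
print(score, is_predictor)
```

The `runtime_checkable` decorator is optional and is used here only to make the structural match visible at runtime. It checks method names, not signatures, so the static checker remains the real guard.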
" ] }, { "cell_type": "markdown", "id": "29", "metadata": {}, "source": [ "## 5. Running ty" ] }, { "cell_type": "markdown", "id": "30", "metadata": {}, "source": [ "Install `ty` and run it on the `grade-predictor` source:\n", "\n", "```bash\n", "uv add --optional dev ty\n", "uv run ty check src/\n", "```\n", "\n", "Reading the output: each line is `file:line:col: error[code] message`. Errors must be fixed. Warnings are suggestions.\n", "\n", "Common errors in DS code:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "31", "metadata": {}, "outputs": [], "source": [ "# Simulate what ty would flag:\n", "\n", "# 1. Return type mismatch\n", "def get_threshold() -> float:\n", " return \"50.0\" # str, not float -- ty flags this\n", "\n", "\n", "# 2. Argument type mismatch\n", "def double_score(score: float) -> float:\n", " return score * 2\n", "\n", "\n", "result = double_score(\"82\") # str passed as float -- ty flags this\n", "\n", "\n", "# 3. Optional not handled\n", "def safe_grade(score: float | None) -> str:\n", " return grade_to_letter(score) # score might be None -- ty flags this\n", "\n", "\n", "# Correct version\n", "def safe_grade_fixed(score: float | None) -> str:\n", " if score is None:\n", " return \"N/A\"\n", " return grade_to_letter(score)\n", "\n", "\n", "safe_grade_fixed(None)" ] }, { "cell_type": "markdown", "id": "32", "metadata": {}, "source": [ "Configure `ty` in `pyproject.toml`:\n", "\n", "```toml\n", "[tool.ty]\n", "python-version = \"3.12\"\n", "```\n", "\n", "The `--ignore-missing-imports` flag suppresses errors from third-party packages that lack type stubs. Pandas stubs are partial; great-tables has no stubs. Use it when third-party noise hides real errors in your own code.\n" ] }, { "cell_type": "markdown", "id": "33", "metadata": {}, "source": [ "
\n", " Activity 3 - Run ty and Fix Errors

\n", "Goal: Add full type annotations to core.py. Run uv run ty check src/. Fix every error (not warning) that ty reports in your own code. Confirm the output is clean before moving on.\n", "
uv run ty check src/\n",
    "# Fix each error line by line\n",
    "uv run ty check src/  # should report 0 errors
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "34", "metadata": {}, "outputs": [], "source": [ "# TODO: annotate core.py fully and run ty check" ] }, { "cell_type": "markdown", "id": "35", "metadata": {}, "source": [ "## 6. Gradual Typing: Where to Start" ] }, { "cell_type": "markdown", "id": "36", "metadata": {}, "source": [ "You do not need to annotate everything at once. Gradual typing means adding annotations incrementally, in the order that buys the most value.\n", "\n", "Priorities for a DS codebase:\n", "1. Public function signatures first: what callers see\n", "2. Return types before argument types: return type mismatches catch more bugs\n", "3. Skip internal helpers and one-off notebook cells initially\n", "4. Use `Any` as a placeholder when you need to annotate something complex you will refine later\n", "\n", "`Any` is not giving up. It is a marker that says: this is unannotated, I know it, I will return to it.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "37", "metadata": {}, "outputs": [], "source": [ "from typing import Any\n", "\n", "\n", "# Acceptable as a placeholder during gradual annotation\n", "def process_raw_data(data: Any) -> pd.DataFrame:\n", " # Will be refined once the input schema is settled\n", " return pd.DataFrame(data)\n", "\n", "\n", "# The same function with a more specific type once the schema is known\n", "def process_records(data: list[dict[str, Any]]) -> pd.DataFrame:\n", " return pd.DataFrame(data)\n", "\n", "\n", "# Test both\n", "process_records([{\"student_id\": \"S0001\", \"midterm_score\": 80.0}])" ] }, { "cell_type": "markdown", "id": "38", "metadata": {}, "source": [ "
\n", " Common Mistake: Annotating everything at once

\n", "Trying to annotate a 2000-line codebase in a single session produces two outcomes: you give up halfway, or you annotate things badly and introduce incorrect type information that misleads the checker. Start with the five most-called public functions. Get them clean. Move on.\n", "
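
One possible next step after `list[dict[str, Any]]`, sketched with the standard library only: a `TypedDict` names the expected keys and value types of each record. `StudentRecord`, `mean_midterm_loose`, and `mean_midterm` are hypothetical names, and the keys are borrowed from the sample data used earlier in this notebook.

```python
from typing import Any, TypedDict


class StudentRecord(TypedDict):
    # Hypothetical schema for one input row
    student_id: str
    midterm_score: float


# Step 1: placeholder annotation while the schema is unsettled
def mean_midterm_loose(rows: list[dict[str, Any]]) -> float:
    return sum(row['midterm_score'] for row in rows) / len(rows)


# Step 2: same body, but a checker can now flag a misspelled or missing key
def mean_midterm(rows: list[StudentRecord]) -> float:
    return sum(row['midterm_score'] for row in rows) / len(rows)


rows: list[StudentRecord] = [
    {'student_id': 'S0001', 'midterm_score': 80.0},
    {'student_id': 'S0002', 'midterm_score': 60.0},
]
average = mean_midterm(rows)
print(average)
```

Like every annotation in this notebook, the `TypedDict` is checked statically only. At runtime each row is still a plain `dict`, so the refinement costs nothing in behavior and can be applied one function at a time.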
" ] }, { "cell_type": "markdown", "id": "39", "metadata": {}, "source": [ "## Capstone: Fully Annotate core.py\n", "\n", "Bring the `grade-predictor/src/grade_predictor/core.py` to zero type errors.\n" ] }, { "cell_type": "markdown", "id": "40", "metadata": {}, "source": [ "
\n", " Capstone - Zero ty Errors

\n", "Goal:\n", "
    \n", "
  1. Annotate every function in core.py: compute_grade, grade_to_letter, flag_at_risk, add_average_marks
  2. \n", "
  3. Use NDArray[np.float64] for any numpy array parameters
  4. \n", "
  5. Use pd.DataFrame and pd.Series for pandas types
  6. \n", "
  7. Run uv run ty check src/ and bring it to zero errors
  8. \n", "
  9. Commit: git commit -m \"feat(types): fully annotate core.py\"
  10. \n", "
\n", "
uv run ty check src/\n",
    "# Fix all errors\n",
    "uv run ty check src/  # zero errors
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "41", "metadata": {}, "outputs": [], "source": [ "# TODO: annotate core.py and confirm zero ty errors\n", "..." ] }, { "cell_type": "markdown", "id": "42", "metadata": {}, "source": [ "## Further Reading\n", "\n", "| Resource | Why it matters |\n", "|---|---|\n", "| [ty documentation](https://github.com/astral-sh/ty) | Astral's type checker, integrated with the uv/ruff toolchain |\n", "| [numpy.typing reference](https://numpy.org/doc/stable/reference/typing.html) | `NDArray` and array annotation reference |\n", "| [ty documentation](https://github.com/astral-sh/ty) | Astral type checker; authoritative reference for ty errors and configuration |\n", "| PEP 544, [Protocols](https://peps.python.org/pep-0544/) | The spec behind structural subtyping |\n", "| [pandas type stubs](https://github.com/pandas-dev/pandas-stubs) | Official stubs for IDE-level type inference on DataFrames |\n" ] }, { "cell_type": "markdown", "id": "43", "metadata": {}, "source": [ "## Summary\n", "\n", "| Concept | Key rule |\n", "|---|---|\n", "| Runtime vs static | Python does not enforce annotations at runtime. A type checker does. |\n", "| `NDArray[np.float64]` | The correct annotation for a typed numpy array |\n", "| `pd.DataFrame` | Practical but untyped at the column level. Pandera adds column types. |\n", "| `type ScoreVector = ...` | Name a complex type you repeat in three or more places (Python 3.12 `type` keyword) |\n", "| `Protocol` | Accept any object with a given method, without importing its concrete class |\n", "| Gradual typing | Start with public function signatures. Use `Any` as a placeholder, not a cop-out. 
|\n", "\n", "**Next:** [Part 16: Git and GitHub](04-git-github.qmd) versions the typed, clean codebase you have here.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }