{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "---\n",
    "title: \"Part 19: Data Validation with Pydantic\"\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/02-dev-tools/07-pydantic-validation.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/02-dev-tools/07-pydantic-validation.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2",
   "metadata": {},
   "source": [
    "**DS-MLOps Dev Tools**\n",
    "\n",
    "**Python 3.12+ | Author: Anthony Faustine**\n",
    "\n",
    "## Before you begin\n",
    "\n",
    "This notebook assumes you have completed [Part 15: Type Annotations](03-type-annotations.ipynb). Pydantic is type annotations put to work: the type hints you write in Part 15 become runtime validators that reject wrong inputs before they corrupt a pipeline.\n",
    "\n",
    "The `grade-predictor` project continues here: a `GradeConfig` model replaces the ad-hoc defaults in `compute_grade`, and a `StudentRecord` model validates incoming data before it ever reaches a computation function.\n",
    "\n",
    "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "::: {.callout-note collapse=\"true\" icon=false}\n",
    "## Learning Objectives\n",
    "\n",
    "By the end of Part 19 you will be able to:\n",
    "\n",
    "| # | Skill | Covered in |\n",
    "| --- | --- | --- |\n",
    "| 1 | Define Pydantic models and understand how they differ from dataclasses | Sec. 1 |\n",
    "| 2 | Validate input data and handle `ValidationError` cleanly | Sec. 2 |\n",
    "| 3 | Write field validators and cross-field validators | Sec. 3 |\n",
    "| 4 | Build CLI tools with argparse and typer, and understand when to use each | Sec. 4 |\n",
    "| 5 | Use `BaseSettings` to manage configuration from environment variables | Sec. 5 |\n",
    "| 6 | Build a typed config object for a DS pipeline | Sec. 6 |\n",
    "| 7 | Validate a batch of student records and collect all errors at once | Sec. 7 |\n",
    ":::\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4",
   "metadata": {},
   "source": [
    "## 0. The Bug That Type Annotations Cannot Catch\n",
    "\n",
    "You have just finished building the `grade-predictor` pipeline. It reads a CSV, computes weighted scores, applies a pass threshold, and writes results to a file. You add type annotations everywhere. Your type checker passes clean. You deploy.\n",
    "\n",
    "Two weeks later, a bug report arrives: the pipeline produced a passing grade for a student whose `midterm_score` was logged as `150.0` — which is impossible on a 100-point scale. The pipeline did not crash. The math ran correctly on the wrong number. A downstream report was wrong, and nobody noticed until a student appealed their grade.\n",
    "\n",
    "The problem was not the computation. It was that nothing stopped the invalid value from entering the pipeline in the first place. Type annotations say *what type a value should be*. They do not say *whether a value of that type makes sense at runtime*. `midterm_score: float` accepts any float: `150.0`, `-20.0`, `float(\"inf\")`. The annotation is a hint to the reader and the type checker. It is not a gate.\n",
    "\n",
    "**Pydantic** ([pydantic.dev](https://docs.pydantic.dev/latest/)) turns annotations into gates. Define a model, describe the constraints, and Pydantic enforces them on every construction — before the value reaches any computation. One validation point at entry; everything downstream trusts the types.\n",
    "\n",
    "### Install\n",
    "\n",
    "Pydantic is in `pyproject.toml`. For a standalone project:\n",
    "\n",
    "```bash\n",
    "uv add pydantic pydantic-settings    # or: pip install pydantic pydantic-settings\n",
    "```"
   ]
  },
  {
   "cell_type": "raw",
   "id": "5",
   "metadata": {
    "raw_mimetype": "text/markdown"
   },
   "source": [
    "## 1. Models: Type Annotations That Bite\n",
    "\n",
    "```{mermaid}\n",
    "flowchart LR\n",
    "    A[\"raw input\\ndict / JSON / env vars\"] --> B[\"Pydantic model\\nconstructor\"]\n",
    "    B --> C[\"field validators\\ntype coercion + constraints\"]\n",
    "    C -->|\"type error or\\nconstraint violated\"| E[\"ValidationError\\nfield path + message\"]\n",
    "    C -->|\"all fields valid\"| D[\"model validators\\ncross-field checks\"]\n",
    "    D -->|\"invariant violated\"| E\n",
    "    D -->|\"all pass\"| F[\"model instance\\ntype-safe, immutable\"]\n",
    "\n",
    "    style F fill:#EBF5F0,stroke:#059669,color:#065F46\n",
    "    style E fill:#FEF2F2,stroke:#DC2626,color:#991B1B\n",
    "    style B fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6",
   "metadata": {},
   "source": [
    "A Python dataclass accepts any value for any field, regardless of what the annotation says. A Pydantic model validates on construction. Pass the wrong type and it raises an error immediately, before the value reaches any computation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataclasses import dataclass\n",
    "\n",
    "from pydantic import BaseModel\n",
    "\n",
    "\n",
    "@dataclass\n",
    "class DataclassStudent:\n",
    "    student_id: str\n",
    "    midterm_score: float\n",
    "\n",
    "\n",
    "class PydanticStudent(BaseModel):\n",
    "    student_id: str\n",
    "    midterm_score: float\n",
    "\n",
    "\n",
    "# Dataclass: accepts \"eighty\" silently\n",
    "dc = DataclassStudent(student_id=\"S0001\", midterm_score=\"eighty\")\n",
    "print(dc.midterm_score)  # \"eighty\" -- no error\n",
    "\n",
    "# Pydantic: raises ValidationError immediately\n",
    "try:\n",
    "    ps = PydanticStudent(student_id=\"S0001\", midterm_score=\"eighty\")\n",
    "except Exception as e:\n",
    "    print(type(e).__name__, e)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8",
   "metadata": {},
   "source": [
    "Pydantic also coerces where the conversion is unambiguous. `\"85\"` becomes `85.0` for a `float` field; `\"true\"` becomes `True` for a `bool` field. This makes it practical for inputs from CSV files, APIs, or environment variables, where everything arrives as a string.\n",
    "\n",
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: Validate at the boundary, not inside the pipeline</span><br><br>\n",
    "A pipeline that validates its inputs at the entry point can trust every value it works with from that point on. A pipeline that validates inside individual functions repeats the same checks everywhere and still misses values that enter through unexpected paths. Pydantic at the boundary is the architectural principle; the model is the mechanism.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {},
   "source": [
    "## 2. `StudentRecord`: Validating Incoming Data\n",
    "\n",
    "Define a model for a student record coming in from a CSV or API. The `model_config` attribute controls Pydantic v2's behaviour:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from typing import Annotated\n",
    "\n",
    "from pydantic import BaseModel, Field, ValidationError, field_validator\n",
    "\n",
    "\n",
    "class StudentRecord(BaseModel):\n",
    "    model_config = {\"str_strip_whitespace\": True, \"validate_assignment\": True}\n",
    "\n",
    "    student_id: str\n",
    "    midterm_score: Annotated[float, Field(ge=0.0, le=100.0)]\n",
    "    final_score: Annotated[float, Field(ge=0.0, le=100.0)]\n",
    "    project_score: Annotated[float, Field(ge=0.0, le=100.0)]\n",
    "    program: str\n",
    "    has_internet: bool = True\n",
    "\n",
    "    @field_validator(\"student_id\")\n",
    "    @classmethod\n",
    "    def student_id_format(cls, v: str) -> str:\n",
    "        if not re.match(r\"^S\\d{4}$\", v):\n",
    "            raise ValueError(f\"student_id must match S followed by 4 digits, got '{v}'\")\n",
    "        return v"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11",
   "metadata": {},
   "source": [
    "`Field(ge=0.0, le=100.0)` means \"greater than or equal to 0, less than or equal to 100\". Pydantic v2 uses `Annotated` with `Field` for constraints; the old `validator` decorator is replaced by `field_validator`.\n",
    "\n",
    "Try it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Valid record: succeeds\n",
    "record = StudentRecord(\n",
    "    student_id=\"S0042\",\n",
    "    midterm_score=78.5,\n",
    "    final_score=82.0,\n",
    "    project_score=\"91.0\",  # string coerced to float\n",
    "    program=\"CS\",\n",
    ")\n",
    "print(record.model_dump())\n",
    "\n",
    "# Invalid: midterm over 100\n",
    "try:\n",
    "    bad = StudentRecord(\n",
    "        student_id=\"S0042\",\n",
    "        midterm_score=150.0,  # violates le=100.0\n",
    "        final_score=80.0,\n",
    "        project_score=75.0,\n",
    "        program=\"CS\",\n",
    "    )\n",
    "except ValidationError as e:\n",
    "    print(e)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 1 - Extend StudentRecord</span><br><br>\n",
    "<b>Goal:</b> Add a <code>semester</code> field to <code>StudentRecord</code> that must be one of <code>\"Fall\"</code>, <code>\"Spring\"</code>, <code>\"Summer\"</code>. Use <code>Literal[\"Fall\", \"Spring\", \"Summer\"]</code> as the type annotation. Try creating a record with <code>semester=\"Winter\"</code> and confirm it raises a <code>ValidationError</code>.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>from typing import Literal\n",
    "\n",
    "class StudentRecord(BaseModel):\n",
    "    ...\n",
    "    semester: Literal[\"Fall\", \"Spring\", \"Summer\"]\n",
    "\n",
    "StudentRecord(..., semester=\"Winter\")  # should raise ValidationError</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14",
   "metadata": {},
   "source": [
    "## 3. Field Validators and Cross-Field Validators\n",
    "\n",
    "A `@field_validator` runs on a single field after type coercion. Use it for business rules that go beyond a simple range check.\n",
    "\n",
    "A `@model_validator` runs after all fields are set and has access to the complete model. Use it for cross-field constraints — \"weights must sum to 1.0\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Annotated\n",
    "\n",
    "from pydantic import BaseModel, Field, ValidationError, field_validator, model_validator\n",
    "\n",
    "\n",
    "class GradeConfig(BaseModel):\n",
    "    midterm_weight: Annotated[float, Field(gt=0.0, lt=1.0)] = 0.30\n",
    "    final_weight: Annotated[float, Field(gt=0.0, lt=1.0)] = 0.45\n",
    "    project_weight: Annotated[float, Field(gt=0.0, lt=1.0)] = 0.25\n",
    "    pass_threshold: Annotated[float, Field(ge=50.0, le=80.0)] = 60.0\n",
    "\n",
    "    @model_validator(mode=\"after\")\n",
    "    def weights_must_sum_to_one(self) -> \"GradeConfig\":\n",
    "        total = self.midterm_weight + self.final_weight + self.project_weight\n",
    "        if abs(total - 1.0) > 1e-6:\n",
    "            raise ValueError(\n",
    "                f\"Weights must sum to 1.0, got {total:.6f}. \"\n",
    "                f\"Adjust midterm ({self.midterm_weight}), \"\n",
    "                f\"final ({self.final_weight}), or project ({self.project_weight}).\"\n",
    "            )\n",
    "        return self"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16",
   "metadata": {},
   "source": [
    "The `mode=\"after\"` argument means the validator runs after Pydantic has already validated and coerced every individual field. Use `mode=\"before\"` when you need to transform raw input before type coercion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Valid: weights sum to 1.0\n",
    "cfg = GradeConfig(midterm_weight=0.3, final_weight=0.5, project_weight=0.2)\n",
    "print(cfg)\n",
    "\n",
    "# Invalid: weights sum to 1.05\n",
    "try:\n",
    "    bad_cfg = GradeConfig(midterm_weight=0.4, final_weight=0.45, project_weight=0.2)\n",
    "except ValidationError as e:\n",
    "    print(e)"
   ]
  },
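  {
   "cell_type": "markdown",
   "id": "17a",
   "metadata": {},
   "source": [
    "The `mode=\"before\"` variant mentioned above can be sketched as follows. `PercentScore` and its validator are hypothetical names chosen for illustration, and the example assumes scores sometimes arrive as strings with a trailing percent sign.\n",
    "\n",
    "```python\n",
    "from typing import Any\n",
    "\n",
    "from pydantic import BaseModel, field_validator\n",
    "\n",
    "\n",
    "class PercentScore(BaseModel):  # hypothetical model, not part of grade-predictor\n",
    "    score: float\n",
    "\n",
    "    @field_validator(\"score\", mode=\"before\")\n",
    "    @classmethod\n",
    "    def strip_percent_sign(cls, v: Any) -> Any:\n",
    "        # Runs on the raw input, before Pydantic converts it to float.\n",
    "        if isinstance(v, str):\n",
    "            return v.strip().removesuffix(\"%\")\n",
    "        return v\n",
    "\n",
    "\n",
    "print(PercentScore(score=\"85%\").score)\n",
    "```\n",
    "\n",
    "A before-validator receives whatever the caller passed, so it should return non-string inputs unchanged and leave type checking to Pydantic."
   ]
  },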
  {
   "cell_type": "markdown",
   "id": "18",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: Use <code>model_dump()</code> and <code>model_validate()</code> to round-trip through dicts and JSON</span><br><br>\n",
    "<code>config.model_dump()</code> converts a Pydantic model to a plain Python dict. <code>GradeConfig.model_validate(some_dict)</code> validates and constructs a model from a dict. <code>GradeConfig.model_validate_json(json_string)</code> parses and validates from a JSON string. These three methods cover the most common I/O patterns for config files, API payloads, and database rows.\n",
    "</div>\n",
    "\n",
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 2 - Config from a Dict</span><br><br>\n",
    "<b>Goal:</b> Create a <code>GradeConfig</code> from a Python dict using <code>model_validate()</code>. Then call <code>model_dump()</code> to get a plain dict back. Confirm the two dicts have the same values. Also confirm that a dict with weights that do not sum to 1.0 raises a <code>ValidationError</code>.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>raw = {\"midterm_weight\": 0.3, \"final_weight\": 0.45, \"project_weight\": 0.25}\n",
    "cfg = GradeConfig.model_validate(raw)\n",
    "assert cfg.model_dump() == {**raw, \"pass_threshold\": 60.0}</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19",
   "metadata": {},
   "source": [
    "## 4. CLI Interfaces: argparse and typer\n",
    "\n",
    "Every ML pipeline eventually needs to be run from the command line: `python train.py --data-path data/train.csv --epochs 50 --threshold 0.65`. Two tools exist for this: `argparse` (stdlib, the traditional approach) and `typer` (modern, annotation-based, by the same author as FastAPI).\n",
    "\n",
    "### argparse: the baseline\n",
    "\n",
    "`argparse` requires you to declare each argument manually, convert strings to types yourself, and write help text by hand:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20",
   "metadata": {},
   "outputs": [],
   "source": [
    "import argparse\n",
    "\n",
    "# argparse: explicit, verbose, manual type conversion\n",
    "parser = argparse.ArgumentParser(description=\"Grade predictor CLI\")\n",
    "parser.add_argument(\"--data-path\", type=str, default=\"data/university_analytics.csv\")\n",
    "parser.add_argument(\"--pass-threshold\", type=float, default=60.0)\n",
    "parser.add_argument(\"--debug\", action=\"store_true\", default=False)\n",
    "\n",
    "# In a script: args = parser.parse_args()\n",
    "# In a notebook: simulate with a list\n",
    "args = parser.parse_args([\"--data-path\", \"data/train.csv\", \"--pass-threshold\", \"65.0\"])\n",
    "print(args.data_path, args.pass_threshold, args.debug)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21",
   "metadata": {},
   "source": [
    "### typer: annotations as the CLI spec\n",
    "\n",
    "`typer` reads your function's type annotations and builds the parser for you. The same type hints that describe the function's signature generate `--help`, automatic type coercion, and validation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22",
   "metadata": {},
   "outputs": [],
   "source": [
    "# pip install typer  /  uv add typer\n",
    "# This is a module-level demo -- run as: python script.py --data-path ... from terminal\n",
    "\n",
    "import typer\n",
    "\n",
    "app = typer.Typer()\n",
    "\n",
    "\n",
    "@app.command()\n",
    "def train(\n",
    "    data_path: str = typer.Option(\"data/university_analytics.csv\", help=\"Path to CSV\"),\n",
    "    pass_threshold: float = typer.Option(60.0, min=50.0, max=80.0, help=\"Pass threshold\"),\n",
    "    debug: bool = typer.Option(False, help=\"Enable debug logging\"),\n",
    ") -> None:\n",
    "    typer.echo(f\"Loading data from: {data_path}\")\n",
    "    typer.echo(f\"Pass threshold: {pass_threshold}\")\n",
    "    typer.echo(f\"Debug: {debug}\")\n",
    "\n",
    "\n",
    "# In a real script: if __name__ == \"__main__\": app()\n",
    "# Running `python train.py --help` auto-generates full usage docs."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23",
   "metadata": {},
   "source": [
    "### typer + pydantic-settings: the DS/MLOps pattern\n",
    "\n",
    "In production, you typically want *two* sources of configuration: environment variables (for secrets, deployment targets) and CLI arguments (for per-run overrides). The pattern is:\n",
    "\n",
    "- `pydantic-settings` reads from `.env` and environment — covered in Sec 5\n",
    "- `typer` provides the CLI surface\n",
    "- The typer command constructs a `PipelineConfig` from both sources\n",
    "\n",
    "The two tools compose naturally because they share the same type-annotation vocabulary.\n",
    "\n",
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: argparse vs typer</span><br><br>\n",
    "Use <code>argparse</code> when you want zero dependencies and a simple script. Use <code>typer</code> when you want auto-generated help, validation, subcommands, and code that reads like the function signature it already is. For DS/MLOps tools that are both importable as a library and runnable as a CLI, typer is the better fit.\n",
    "</div>\n",
    "\n",
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 3 - typer Command</span><br><br>\n",
    "<b>Goal:</b> Write a typer command <code>validate_data</code> that accepts <code>--data-path</code> (str) and <code>--max-errors</code> (int, default 10). The command should print the arguments it received. Confirm that passing <code>--max-errors abc</code> would raise a typer error (type mismatch).\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>@app.command()\n",
    "def validate_data(\n",
    "    data_path: str = typer.Option(..., help=\"Path to CSV to validate\"),\n",
    "    max_errors: int = typer.Option(10, help=\"Stop after this many errors\"),\n",
    ") -> None:\n",
    "    typer.echo(f\"Validating {data_path} (max errors: {max_errors})\")</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24",
   "metadata": {},
   "source": [
    "## 5. `BaseSettings`: Config from Environment Variables\n",
    "\n",
    "`pydantic-settings` extends Pydantic with a `BaseSettings` class that reads values from environment variables and `.env` files. This replaces the manual `os.getenv` pattern with a typed, validated config object.\n",
    "\n",
    "```bash\n",
    "uv add pydantic-settings\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25",
   "metadata": {},
   "outputs": [],
   "source": [
    "# grade_predictor/config.py\n",
    "from typing import Annotated\n",
    "\n",
    "from pydantic import Field\n",
    "from pydantic_settings import BaseSettings, SettingsConfigDict\n",
    "\n",
    "\n",
    "class GradeSettings(BaseSettings):\n",
    "    model_config = SettingsConfigDict(\n",
    "        env_file=\".env\",\n",
    "        env_prefix=\"GRADE_\",  # GRADE_API_KEY maps to api_key\n",
    "        case_sensitive=False,\n",
    "    )\n",
    "\n",
    "    api_key: str = Field(default=\"dev-key\", description=\"API key for the university data service\")\n",
    "    data_path: str = Field(default=\"data/university_analytics.csv\")\n",
    "    pass_threshold: Annotated[float, Field(ge=50.0, le=80.0)] = 60.0\n",
    "    debug: bool = False\n",
    "\n",
    "\n",
    "# Load (reads .env if it exists, otherwise uses defaults)\n",
    "settings = GradeSettings()\n",
    "print(settings.data_path)\n",
    "print(settings.pass_threshold)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26",
   "metadata": {},
   "source": [
    "With a `.env` file:\n",
    "\n",
    "```bash\n",
    "GRADE_API_KEY=secret-key-here\n",
    "GRADE_DATA_PATH=data/production.csv\n",
    "GRADE_PASS_THRESHOLD=65.0\n",
    "```\n",
    "\n",
    "Values from the `.env` file override defaults; environment variables override both.\n",
    "\n",
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: Settings validation catches misconfiguration at startup, not mid-run</span><br><br>\n",
    "Without validation, a missing or malformed environment variable is discovered when the code first uses it: halfway through a pipeline run, after 20 minutes of processing. <code>BaseSettings</code> validates all settings at construction time, failing fast at startup with a clear error that lists every missing or invalid variable.\n",
    "</div>\n",
    "\n",
    "<div style='background:#FEF2F2;border-left:5px solid #DC2626;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#991B1B;font-weight:bold'><i class=\"bi bi-bug-fill\"></i> Common Mistake: Constructing settings inside a function that runs in a loop</span><br><br>\n",
    "<code>GradeSettings()</code> reads the <code>.env</code> file and environment on every call. Calling it inside a loop that processes thousands of rows reads and validates the config thousands of times. Construct settings once at module or application level and pass the object through.\n",
    "</div>\n",
    "\n",
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 4 - Settings with a Test Override</span><br><br>\n",
    "<b>Goal:</b> Create a <code>GradeSettings</code> instance, overriding individual settings without a <code>.env</code> file by passing values directly to the constructor: <code>GradeSettings(api_key=\"test-key\", pass_threshold=70.0)</code>. Confirm the values are set correctly and that an invalid <code>pass_threshold</code> (e.g., <code>95.0</code>) raises a <code>ValidationError</code>.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>test_settings = GradeSettings(api_key=\"test-key\", pass_threshold=70.0)\n",
    "assert test_settings.pass_threshold == 70.0\n",
    "\n",
    "GradeSettings(api_key=\"test-key\", pass_threshold=95.0)  # should raise</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27",
   "metadata": {},
   "source": [
    "## 6. Typing a DS Pipeline Config\n",
    "\n",
    "A real DS pipeline has more than one config object. The pattern is to compose small focused models into a root config that is loaded once:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Annotated\n",
    "\n",
    "from pydantic import BaseModel, Field, model_validator\n",
    "from pydantic_settings import BaseSettings, SettingsConfigDict\n",
    "\n",
    "\n",
    "class WeightConfig(BaseModel):\n",
    "    midterm: Annotated[float, Field(gt=0.0, lt=1.0)] = 0.30\n",
    "    final: Annotated[float, Field(gt=0.0, lt=1.0)] = 0.45\n",
    "    project: Annotated[float, Field(gt=0.0, lt=1.0)] = 0.25\n",
    "\n",
    "    @model_validator(mode=\"after\")\n",
    "    def weights_sum_to_one(self) -> \"WeightConfig\":\n",
    "        total = self.midterm + self.final + self.project\n",
    "        if abs(total - 1.0) > 1e-6:\n",
    "            raise ValueError(f\"Weights must sum to 1.0, got {total:.4f}\")\n",
    "        return self\n",
    "\n",
    "\n",
    "class PipelineConfig(BaseSettings):\n",
    "    model_config = SettingsConfigDict(env_file=\".env\", env_prefix=\"GRADE_\")\n",
    "\n",
    "    weights: WeightConfig = WeightConfig()\n",
    "    pass_threshold: Annotated[float, Field(ge=50.0, le=80.0)] = 60.0\n",
    "    data_path: str = \"data/university_analytics.csv\"\n",
    "    output_path: str = \"data/results.parquet\"\n",
    "    debug: bool = False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29",
   "metadata": {},
   "outputs": [],
   "source": [
    "config = PipelineConfig()\n",
    "\n",
    "\n",
    "def compute_grade(midterm: float, final: float, project: float, cfg: PipelineConfig) -> float:\n",
    "    w = cfg.weights\n",
    "    return midterm * w.midterm + final * w.final + project * w.project\n",
    "\n",
    "\n",
    "print(compute_grade(78.0, 82.0, 91.0, config))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 5 - Nested Config</span><br><br>\n",
    "<b>Goal:</b> Create a <code>PipelineConfig</code> object. Confirm that <code>config.weights.midterm</code> returns the expected default. Then create one with custom weights <code>WeightConfig(midterm=0.25, final=0.50, project=0.25)</code> and confirm it validates correctly. Finally confirm that invalid nested weights (sum != 1.0) raise a <code>ValidationError</code> on the nested model.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31",
   "metadata": {},
   "source": [
    "## 7. Validating a Batch of Records\n",
    "\n",
    "In DS, data comes in batches. Pydantic validates each record independently; the question is how to handle errors without stopping on the first failure:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from pydantic import ValidationError\n",
    "\n",
    "\n",
    "def validate_batch(\n",
    "    records: list[dict],\n",
    ") -> tuple[list[StudentRecord], list[dict]]:\n",
    "    valid: list[StudentRecord] = []\n",
    "    errors: list[dict] = []\n",
    "\n",
    "    for raw in records:\n",
    "        try:\n",
    "            valid.append(StudentRecord.model_validate(raw))\n",
    "        except ValidationError as e:\n",
    "            errors.append({**raw, \"errors\": e.errors()})\n",
    "\n",
    "    return valid, errors\n",
    "\n",
    "\n",
    "# Simulate a small batch\n",
    "sample = [\n",
    "    {\"student_id\": \"S0001\", \"midterm_score\": 78.5, \"final_score\": 82.0, \"project_score\": 90.0, \"program\": \"CS\"},\n",
    "    {\n",
    "        \"student_id\": \"S0002\",\n",
    "        \"midterm_score\": 150.0,\n",
    "        \"final_score\": 80.0,\n",
    "        \"project_score\": 75.0,\n",
    "        \"program\": \"EE\",\n",
    "    },  # invalid\n",
    "    {\n",
    "        \"student_id\": \"INVALID\",\n",
    "        \"midterm_score\": 70.0,\n",
    "        \"final_score\": 68.0,\n",
    "        \"project_score\": 72.0,\n",
    "        \"program\": \"CS\",\n",
    "    },  # invalid\n",
    "]\n",
    "\n",
    "valid_records, error_records = validate_batch(sample)\n",
    "print(f\"Valid: {len(valid_records)}, Errors: {len(error_records)}\")\n",
    "for rec in error_records:\n",
    "    print(rec[\"student_id\"], \"->\", rec[\"errors\"][0][\"msg\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: Use <code>TypeAdapter</code> to validate a list without a wrapper model</span><br><br>\n",
    "Pydantic v2's <code>TypeAdapter</code> validates arbitrary types, including <code>list[StudentRecord]</code>, without defining a wrapper model:\n",
    "<pre style='background:#F4F5F6;padding:10px;border-radius:4px;font-size:0.9em'>from pydantic import TypeAdapter\n",
    "\n",
    "adapter = TypeAdapter(list[StudentRecord])\n",
    "\n",
    "# Raises ValidationError for the first invalid record\n",
    "all_records = adapter.validate_python(df.to_dict(orient=\"records\"))\n",
    "\n",
    "# Generate JSON Schema for documentation\n",
    "print(adapter.json_schema())</pre>\n",
    "This is cleaner than wrapping in a container model when you only need batch validation.\n",
    "</div>\n",
    "\n",
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 6 - Batch Validation Report</span><br><br>\n",
    "<b>Goal:</b> Load <code>university_analytics.csv</code>. Introduce two invalid rows: one with <code>midterm_score=150.0</code> and one with <code>student_id=\"INVALID\"</code>. Run <code>validate_batch()</code> on all rows. Print the count of valid and invalid records, and the error details for the two invalid rows.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>df_with_errors = df.copy()\n",
    "df_with_errors.loc[0, \"midterm_score\"] = 150.0\n",
    "df_with_errors.loc[1, \"student_id\"] = \"INVALID\"\n",
    "\n",
    "valid, errors = validate_batch(df_with_errors.to_dict(orient=\"records\"))\n",
    "print(f\"Valid: {len(valid)}, Errors: {len(errors)}\")</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34",
   "metadata": {},
   "source": [
    "## Capstone: Typed grade-predictor Pipeline\n",
    "\n",
    "Bring Pydantic and typer into the `grade-predictor` project end to end.\n",
    "\n",
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Capstone - A Validated, CLI-Driven Pipeline</span><br><br>\n",
    "<ol>\n",
    "<li>Define <code>StudentRecord</code> in <code>grade_predictor/models.py</code> with all CSV fields, appropriate constraints, and a <code>student_id</code> format validator</li>\n",
    "<li>Define <code>PipelineConfig</code> in <code>grade_predictor/config.py</code> with <code>WeightConfig</code>, <code>pass_threshold</code>, and <code>data_path</code></li>\n",
    "<li>Update <code>compute_grade</code> in <code>core.py</code> to accept a <code>PipelineConfig</code> instead of separate weight arguments</li>\n",
    "<li>Write a <code>load_and_validate(path: str) -> tuple[list[StudentRecord], list[dict]]</code> function that reads the CSV and validates every row</li>\n",
    "<li>Add a typer CLI command <code>run</code> that accepts <code>--data-path</code> and <code>--pass-threshold</code>, constructs a <code>PipelineConfig</code>, calls <code>load_and_validate</code>, and prints the valid/error counts</li>\n",
    "<li>Write two tests: one confirming a valid <code>StudentRecord</code> is constructed correctly, and one confirming an invalid record raises <code>ValidationError</code> with the right field name</li>\n",
    "</ol>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35",
   "metadata": {},
   "source": [
    "## Further Reading\n",
    "\n",
    "| Resource | Why it matters |\n",
    "| --- | --- |\n",
    "| [Pydantic v2 documentation](https://docs.pydantic.dev/latest/) | The primary reference; the migration guide from v1 is worth reading if you encounter older Pydantic code |\n",
    "| [pydantic-settings documentation](https://docs.pydantic.dev/latest/concepts/pydantic_settings/) | `BaseSettings`, env file loading, nested settings, and secrets |\n",
    "| [typer documentation](https://typer.tiangolo.com/) | Full reference for CLI commands, subcommands, arguments vs options, and testing |\n",
    "| [Pydantic v2 validators](https://docs.pydantic.dev/latest/concepts/validators/) | `field_validator`, `model_validator`, `Annotated` constraints |\n",
    "| [FastAPI + Pydantic](https://fastapi.tiangolo.com/tutorial/body/) | FastAPI uses Pydantic models for request/response; the same `BaseModel` patterns apply directly |\n",
    "| [pandera](https://pandera.readthedocs.io/) | Schema-level DataFrame validation: the DataFrame equivalent of `StudentRecord` for column dtypes and constraints. Covered in Part 20. |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "| Concept | Key rule |\n",
    "| --- | --- |\n",
    "| `BaseModel` | Validates on construction; coerces compatible types; rejects the rest with `ValidationError` |\n",
    "| `Field(ge=0, le=100)` | Inline constraints via `Annotated`; replaces manual range checks inside functions |\n",
    "| `@field_validator` | Single-field business rules; runs after type coercion by default |\n",
    "| `@model_validator(mode=\"after\")` | Cross-field rules; has access to the fully constructed model |\n",
    "| `model_dump()` | Convert a model to a plain dict, e.g. for pandas; use `model_dump_json()` to get a JSON string |\n",
    "| `model_validate(dict)` | Construct and validate from a plain dict; use `model_validate_json()` for a JSON string |\n",
    "| `argparse` | Stdlib CLI parser; verbose but zero-dependency |\n",
    "| `typer` | Type-annotation-driven CLI; auto-generates help; preferred for DS/MLOps tools |\n",
    "| `BaseSettings` | Reads from env vars and `.env` files; validates at construction time |\n",
    "| `env_prefix` | Namespaces all env vars for a settings class |\n",
    "| Batch validation | Loop with `try/except ValidationError`; collect errors without stopping |\n",
    "| `TypeAdapter` | Validate arbitrary types including `list[Model]` without a wrapper model |\n",
    "\n",
    "**Next:** Part 20 covers DataFrame schema validation with Pandera: the same \"validate at the boundary\" principle, applied to tabular data instead of individual records."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}