{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 19: Data Validation with Pydantic\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/02-dev-tools/07-pydantic-validation.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/02-dev-tools/07-pydantic-validation.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Dev Tools**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook assumes you have completed [Part 15: Type Annotations](03-type-annotations.ipynb). Pydantic is type annotations put to work: the type hints you write in Part 15 become runtime validators that reject wrong inputs before they corrupt a pipeline.\n", "\n", "The `grade-predictor` project continues here: a `GradeConfig` model replaces the ad-hoc defaults in `compute_grade`, and a `StudentRecord` model validates incoming data before it ever reaches a computation function.\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 19 you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "| --- | --- | --- |\n", "| 1 | Define Pydantic models and understand how they differ from dataclasses | Sec. 1 |\n", "| 2 | Validate input data and handle `ValidationError` cleanly | Sec. 2 |\n", "| 3 | Write field validators and cross-field validators | Sec. 
3 |\n", "| 4 | Build CLI tools with argparse and typer, and understand when to use each | Sec. 4 |\n", "| 5 | Use `BaseSettings` to manage configuration from environment variables | Sec. 5 |\n", "| 6 | Build a typed config object for a DS pipeline | Sec. 6 |\n", "| 7 | Validate a batch of student records and collect all errors at once | Sec. 7 |\n", ":::\n" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## 0. The Bug That Type Annotations Cannot Catch\n", "\n", "You have just finished building the `grade-predictor` pipeline. It reads a CSV, computes weighted scores, applies a pass threshold, and writes results to a file. You add type annotations everywhere. Your type checker passes clean. You deploy.\n", "\n", "Two weeks later, a bug report arrives: the pipeline produced a passing grade for a student whose `midterm_score` was logged as `150.0` — which is impossible on a 100-point scale. The pipeline did not crash. The math ran correctly on the wrong number. A downstream report was wrong, and nobody noticed until a student appealed their grade.\n", "\n", "The problem was not the computation. It was that nothing stopped the invalid value from entering the pipeline in the first place. Type annotations say *what type a value should be*. They do not say *whether a value of that type makes sense at runtime*. `midterm_score: float` accepts any float: `150.0`, `-20.0`, `float(\"inf\")`. The annotation is a hint to the reader and the type checker. It is not a gate.\n", "\n", "**Pydantic** ([pydantic.dev](https://docs.pydantic.dev/latest/)) turns annotations into gates. Define a model, describe the constraints, and Pydantic enforces them on every construction — before the value reaches any computation. One validation point at entry; everything downstream trusts the types.\n", "\n", "### Install\n", "\n", "Pydantic is in `pyproject.toml`. 
For a standalone project:\n", "\n", "```bash\n", "uv add pydantic pydantic-settings # or: pip install pydantic pydantic-settings\n", "```" ] }, { "cell_type": "raw", "id": "5", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "## 1. Models: Type Annotations That Bite\n", "\n", "```{mermaid}\n", "flowchart LR\n", " A[\"raw input\\ndict / JSON / env vars\"] --> B[\"Pydantic model\\nconstructor\"]\n", " B --> C[\"field validators\\ntype coercion + constraints\"]\n", " C -->|\"type error or\\nconstraint violated\"| E[\"ValidationError\\nfield path + message\"]\n", " C -->|\"all fields valid\"| D[\"model validators\\ncross-field checks\"]\n", " D -->|\"invariant violated\"| E\n", " D -->|\"all pass\"| F[\"model instance\\nvalidated, type-safe\"]\n", "\n", " style F fill:#EBF5F0,stroke:#059669,color:#065F46\n", " style E fill:#FEF2F2,stroke:#DC2626,color:#991B1B\n", " style B fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E\n", "```" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "A Python dataclass accepts any value for any field, regardless of what the annotation says. A Pydantic model validates on construction. 
Pass the wrong type and it raises an error immediately, before the value reaches any computation:" ] }, { "cell_type": "code", "execution_count": null, "id": "7", "metadata": {}, "outputs": [], "source": [ "from dataclasses import dataclass\n", "\n", "from pydantic import BaseModel\n", "\n", "\n", "@dataclass\n", "class DataclassStudent:\n", " student_id: str\n", " midterm_score: float\n", "\n", "\n", "class PydanticStudent(BaseModel):\n", " student_id: str\n", " midterm_score: float\n", "\n", "\n", "# Dataclass: accepts \"eighty\" silently\n", "dc = DataclassStudent(student_id=\"S0001\", midterm_score=\"eighty\")\n", "print(dc.midterm_score) # \"eighty\" -- no error\n", "\n", "# Pydantic: raises ValidationError immediately\n", "try:\n", " ps = PydanticStudent(student_id=\"S0001\", midterm_score=\"eighty\")\n", "except Exception as e:\n", " print(type(e).__name__, e)" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "Pydantic also coerces where the conversion is unambiguous. `\"85\"` becomes `85.0` for a `float` field; `\"true\"` becomes `True` for a `bool` field. This makes it practical for inputs from CSV files, APIs, or environment variables, where everything arrives as a string.\n", "\n", "
\n", " Key Concept: Validate at the boundary, not inside the pipeline

\n", "A pipeline that validates its inputs at the entry point can trust every value it works with from that point on. A pipeline that validates inside individual functions repeats the same checks everywhere and still misses values that enter through unexpected paths. Pydantic at the boundary is the architectural principle; the model is the mechanism.\n", "
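To make the coercion rules above concrete, here is a minimal sketch, assuming Pydantic v2 in its default (lax) mode. `CoercionDemo` is a hypothetical model invented for this illustration; it is not part of `grade-predictor`.

```python
from pydantic import BaseModel, ValidationError


class CoercionDemo(BaseModel):  # hypothetical model, for illustration only
    score: float
    has_internet: bool


# Unambiguous strings are converted to the annotated types.
demo = CoercionDemo(score='85', has_internet='true')
print(demo.score, demo.has_internet)

# Ambiguous strings are rejected; one ValidationError reports both fields.
try:
    CoercionDemo(score='eighty', has_internet='maybe')
    failed_fields = []
except ValidationError as exc:
    failed_fields = [err['loc'][0] for err in exc.errors()]
print(failed_fields)
```

Both bad values are reported in a single error, so the caller learns about every failing field in one pass rather than fixing them one at a time.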
" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "## 2. `StudentRecord`: Validating Incoming Data\n", "\n", "Define a model for a student record coming in from a CSV or API. The `model_config` attribute controls Pydantic v2's behaviour:" ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "import re\n", "from typing import Annotated\n", "\n", "from pydantic import BaseModel, Field, ValidationError, field_validator\n", "\n", "\n", "class StudentRecord(BaseModel):\n", " model_config = {\"str_strip_whitespace\": True, \"validate_assignment\": True}\n", "\n", " student_id: str\n", " midterm_score: Annotated[float, Field(ge=0.0, le=100.0)]\n", " final_score: Annotated[float, Field(ge=0.0, le=100.0)]\n", " project_score: Annotated[float, Field(ge=0.0, le=100.0)]\n", " program: str\n", " has_internet: bool = True\n", "\n", " @field_validator(\"student_id\")\n", " @classmethod\n", " def student_id_format(cls, v: str) -> str:\n", " if not re.match(r\"^S\\d{4}$\", v):\n", " raise ValueError(f\"student_id must match S followed by 4 digits, got '{v}'\")\n", " return v" ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "`Field(ge=0.0, le=100.0)` means \"greater than or equal to 0, less than or equal to 100\". 
Pydantic v2 uses `Annotated` with `Field` for constraints; the old `validator` decorator is replaced by `field_validator`.\n", "\n", "Try it:" ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "# Valid record: succeeds\n", "record = StudentRecord(\n", " student_id=\"S0042\",\n", " midterm_score=78.5,\n", " final_score=82.0,\n", " project_score=\"91.0\", # string coerced to float\n", " program=\"CS\",\n", ")\n", "print(record.model_dump())\n", "\n", "# Invalid: midterm over 100\n", "try:\n", " bad = StudentRecord(\n", " student_id=\"S0042\",\n", " midterm_score=150.0, # violates le=100.0\n", " final_score=80.0,\n", " project_score=75.0,\n", " program=\"CS\",\n", " )\n", "except ValidationError as e:\n", " print(e)" ] }, { "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "
\n", " Activity 1 - Extend StudentRecord

\n", "Goal: Add a semester field to StudentRecord that must be one of \"Fall\", \"Spring\", \"Summer\". Use Literal[\"Fall\", \"Spring\", \"Summer\"] as the type annotation. Try creating a record with semester=\"Winter\" and confirm it raises a ValidationError.\n", "
from typing import Literal\n",
    "\n",
    "class StudentRecord(BaseModel):\n",
    "    ...\n",
    "    semester: Literal[\"Fall\", \"Spring\", \"Summer\"]\n",
    "\n",
    "StudentRecord(..., semester=\"Winter\")  # should raise ValidationError
\n", "
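The two `model_config` options set on `StudentRecord` in this section are easier to see in isolation. The sketch below assumes Pydantic v2 and uses a hypothetical, cut-down `ScoreCard` model (not part of the project) to show what `str_strip_whitespace` and `validate_assignment` each do.

```python
from typing import Annotated

from pydantic import BaseModel, Field, ValidationError


class ScoreCard(BaseModel):  # hypothetical reduced model, for illustration only
    model_config = {'str_strip_whitespace': True, 'validate_assignment': True}

    program: str
    midterm_score: Annotated[float, Field(ge=0.0, le=100.0)]


# str_strip_whitespace: surrounding whitespace is trimmed from string inputs.
card = ScoreCard(program='  CS  ', midterm_score=78.5)
print(repr(card.program))

# validate_assignment: constraints are re-checked when a field is reassigned.
try:
    card.midterm_score = 150.0
    assignment_rejected = False
except ValidationError:
    assignment_rejected = True
print(assignment_rejected, card.midterm_score)
```

Without `validate_assignment`, the reassignment would succeed silently and the model would hold a value that its own constraints forbid.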
" ] }, { "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "## 3. Field Validators and Cross-Field Validators\n", "\n", "A `@field_validator` runs on a single field after type coercion. Use it for business rules that go beyond a simple range check.\n", "\n", "A `@model_validator` runs after all fields are set and has access to the complete model. Use it for cross-field constraints — \"weights must sum to 1.0\":" ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "from typing import Annotated\n", "\n", "from pydantic import BaseModel, Field, ValidationError, field_validator, model_validator\n", "\n", "\n", "class GradeConfig(BaseModel):\n", " midterm_weight: Annotated[float, Field(gt=0.0, lt=1.0)] = 0.30\n", " final_weight: Annotated[float, Field(gt=0.0, lt=1.0)] = 0.45\n", " project_weight: Annotated[float, Field(gt=0.0, lt=1.0)] = 0.25\n", " pass_threshold: Annotated[float, Field(ge=50.0, le=80.0)] = 60.0\n", "\n", " @model_validator(mode=\"after\")\n", " def weights_must_sum_to_one(self) -> \"GradeConfig\":\n", " total = self.midterm_weight + self.final_weight + self.project_weight\n", " if abs(total - 1.0) > 1e-6:\n", " raise ValueError(\n", " f\"Weights must sum to 1.0, got {total:.6f}. \"\n", " f\"Adjust midterm ({self.midterm_weight}), \"\n", " f\"final ({self.final_weight}), or project ({self.project_weight}).\"\n", " )\n", " return self" ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "The `mode=\"after\"` argument means the validator runs after Pydantic has already validated and coerced every individual field. Use `mode=\"before\"` when you need to transform raw input before type coercion." 
] }, { "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [], "source": [ "# Valid: weights sum to 1.0\n", "cfg = GradeConfig(midterm_weight=0.3, final_weight=0.5, project_weight=0.2)\n", "print(cfg)\n", "\n", "# Invalid: weights sum to 1.05\n", "try:\n", " bad_cfg = GradeConfig(midterm_weight=0.4, final_weight=0.45, project_weight=0.2)\n", "except ValidationError as e:\n", " print(e)" ] }, { "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "
\n", " Pro Tip: Use model_dump() and model_validate() to round-trip through dicts and JSON

\n", "config.model_dump() converts a Pydantic model to a plain Python dict. GradeConfig.model_validate(some_dict) validates and constructs a model from a dict. GradeConfig.model_validate_json(json_string) parses and validates from a JSON string. These three methods cover the most common I/O patterns for config files, API payloads, and database rows.\n", "
\n", "\n", "
\n", " Activity 2 - Config from a Dict

\n", "Goal: Create a GradeConfig from a Python dict using model_validate(). Then call model_dump() to get a plain dict back. Confirm the two dicts have the same values. Also confirm that a dict with weights that do not sum to 1.0 raises a ValidationError.\n", "
raw = {\"midterm_weight\": 0.3, \"final_weight\": 0.45, \"project_weight\": 0.25}\n",
    "cfg = GradeConfig.model_validate(raw)\n",
    "assert cfg.model_dump() == {**raw, \"pass_threshold\": 60.0}
\n", "
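This section mentions `mode='before'` without showing it. As a minimal sketch, assuming Pydantic v2: the hypothetical `PercentWeights` model and `to_fraction` helper below (both invented for this illustration) accept weights written as percent strings by rewriting the raw input before any field is coerced.

```python
from typing import Any

from pydantic import BaseModel, model_validator


def to_fraction(value: Any) -> Any:
    # Hypothetical helper: turn '40%' into 0.4, leave everything else untouched.
    if isinstance(value, str) and value.strip().endswith('%'):
        return float(value.strip().rstrip('%')) / 100
    return value


class PercentWeights(BaseModel):  # hypothetical model, for illustration only
    midterm_weight: float
    final_weight: float

    @model_validator(mode='before')
    @classmethod
    def accept_percent_strings(cls, data: Any) -> Any:
        # Runs on the raw input, before any field has been coerced.
        if isinstance(data, dict):
            return {key: to_fraction(value) for key, value in data.items()}
        return data


weights = PercentWeights(midterm_weight='40%', final_weight=0.6)
print(weights.midterm_weight, weights.final_weight)
```

A `before` validator receives whatever the caller passed, so it should check the input type (here, `isinstance(data, dict)`) before touching it.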
" ] }, { "cell_type": "markdown", "id": "19", "metadata": {}, "source": [ "## 4. CLI Interfaces: argparse and typer\n", "\n", "Every ML pipeline eventually needs to be run from the command line: `python train.py --data-path data/train.csv --epochs 50 --threshold 0.65`. Two tools exist for this: `argparse` (stdlib, the traditional approach) and `typer` (modern, annotation-based, by the same author as FastAPI).\n", "\n", "### argparse: the baseline\n", "\n", "`argparse` requires you to declare each argument manually, convert strings to types yourself, and write help text by hand:" ] }, { "cell_type": "code", "execution_count": null, "id": "20", "metadata": {}, "outputs": [], "source": [ "import argparse\n", "\n", "# argparse: explicit, verbose, manual type conversion\n", "parser = argparse.ArgumentParser(description=\"Grade predictor CLI\")\n", "parser.add_argument(\"--data-path\", type=str, default=\"data/university_analytics.csv\")\n", "parser.add_argument(\"--pass-threshold\", type=float, default=60.0)\n", "parser.add_argument(\"--debug\", action=\"store_true\", default=False)\n", "\n", "# In a script: args = parser.parse_args()\n", "# In a notebook: simulate with a list\n", "args = parser.parse_args([\"--data-path\", \"data/train.csv\", \"--pass-threshold\", \"65.0\"])\n", "print(args.data_path, args.pass_threshold, args.debug)" ] }, { "cell_type": "markdown", "id": "21", "metadata": {}, "source": [ "### typer: annotations as the CLI spec\n", "\n", "`typer` reads your function's type annotations and builds the parser for you. The same type hints that describe the function's signature generate `--help`, automatic type coercion, and validation:" ] }, { "cell_type": "code", "execution_count": null, "id": "22", "metadata": {}, "outputs": [], "source": [ "# pip install typer / uv add typer\n", "# This is a module-level demo -- run as: python script.py --data-path ... 
from terminal\n", "\n", "import typer\n", "\n", "app = typer.Typer()\n", "\n", "\n", "@app.command()\n", "def train(\n", " data_path: str = typer.Option(\"data/university_analytics.csv\", help=\"Path to CSV\"),\n", " pass_threshold: float = typer.Option(60.0, min=50.0, max=80.0, help=\"Pass threshold\"),\n", " debug: bool = typer.Option(False, help=\"Enable debug logging\"),\n", ") -> None:\n", " typer.echo(f\"Loading data from: {data_path}\")\n", " typer.echo(f\"Pass threshold: {pass_threshold}\")\n", " typer.echo(f\"Debug: {debug}\")\n", "\n", "\n", "# In a real script: if __name__ == \"__main__\": app()\n", "# Running `python train.py --help` auto-generates full usage docs." ] }, { "cell_type": "markdown", "id": "23", "metadata": {}, "source": [ "### typer + pydantic-settings: the DS/MLOps pattern\n", "\n", "In production, you typically want *two* sources of configuration: environment variables (for secrets, deployment targets) and CLI arguments (for per-run overrides). The pattern is:\n", "\n", "- `pydantic-settings` reads from `.env` and environment — covered in Sec 5\n", "- `typer` provides the CLI surface\n", "- The typer command constructs a `PipelineConfig` from both sources\n", "\n", "The two tools compose naturally because they share the same type-annotation vocabulary.\n", "\n", "
\n", " Key Concept: argparse vs typer

\n", "Use argparse when you want zero dependencies and a simple script. Use typer when you want auto-generated help, validation, subcommands, and code that reads like the function signature it already is. For DS/MLOps tools that are both importable as a library and runnable as a CLI, typer is the better fit.\n", "
\n", "\n", "
\n", " Activity 3 - typer Command

\n", "Goal: Write a typer command validate_data that accepts --data-path (str) and --max-errors (int, default 10). The command should print the arguments it received. Confirm that passing --max-errors abc would raise a typer error (type mismatch).\n", "
@app.command()\n",
    "def validate_data(\n",
    "    data_path: str = typer.Option(..., help=\"Path to CSV to validate\"),\n",
    "    max_errors: int = typer.Option(10, help=\"Stop after this many errors\"),\n",
    ") -> None:\n",
    "    typer.echo(f\"Validating {data_path} (max errors: {max_errors})\")
\n", "
" ] }, { "cell_type": "markdown", "id": "24", "metadata": {}, "source": [ "## 5. `BaseSettings`: Config from Environment Variables\n", "\n", "`pydantic-settings` extends Pydantic with a `BaseSettings` class that reads values from environment variables and `.env` files. This replaces the manual `os.getenv` pattern with a typed, validated config object.\n", "\n", "```bash\n", "uv add pydantic-settings\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "25", "metadata": {}, "outputs": [], "source": [ "# grade_predictor/config.py\n", "from typing import Annotated\n", "\n", "from pydantic import Field\n", "from pydantic_settings import BaseSettings, SettingsConfigDict\n", "\n", "\n", "class GradeSettings(BaseSettings):\n", " model_config = SettingsConfigDict(\n", " env_file=\".env\",\n", " env_prefix=\"GRADE_\", # GRADE_API_KEY maps to api_key\n", " case_sensitive=False,\n", " )\n", "\n", " api_key: str = Field(default=\"dev-key\", description=\"API key for the university data service\")\n", " data_path: str = Field(default=\"data/university_analytics.csv\")\n", " pass_threshold: Annotated[float, Field(ge=50.0, le=80.0)] = 60.0\n", " debug: bool = False\n", "\n", "\n", "# Load (reads .env if it exists, otherwise uses defaults)\n", "settings = GradeSettings()\n", "print(settings.data_path)\n", "print(settings.pass_threshold)" ] }, { "cell_type": "markdown", "id": "26", "metadata": {}, "source": [ "With a `.env` file:\n", "\n", "```bash\n", "GRADE_API_KEY=secret-key-here\n", "GRADE_DATA_PATH=data/production.csv\n", "GRADE_PASS_THRESHOLD=65.0\n", "```\n", "\n", "Values from the `.env` file override defaults; environment variables override both.\n", "\n", "
\n", " Key Concept: Settings validation catches misconfiguration at startup, not mid-run

\n", "Without validation, a missing or malformed environment variable is discovered when the code first uses it: halfway through a pipeline run, after 20 minutes of processing. BaseSettings validates all settings at construction time, failing fast at startup with a clear error that lists every missing or invalid variable.\n", "
\n", "\n", "
\n", " Common Mistake: Constructing settings inside a function that runs in a loop

\n", "GradeSettings() reads the .env file and environment on every call. Calling it inside a loop that processes thousands of rows reads and validates the config thousands of times. Construct settings once at module or application level and pass the object through.\n", "
\n", "\n", "
\n", " Activity 4 - Settings with a Test Override

\n", "Goal: Create a GradeSettings instance, overriding individual settings without a .env file by passing values directly to the constructor: GradeSettings(api_key=\"test-key\", pass_threshold=70.0). Confirm the values are set correctly and that an invalid pass_threshold (e.g., 95.0) raises a ValidationError.\n", "
test_settings = GradeSettings(api_key=\"test-key\", pass_threshold=70.0)\n",
    "assert test_settings.pass_threshold == 70.0\n",
    "\n",
    "GradeSettings(api_key=\"test-key\", pass_threshold=95.0)  # should raise
\n", "
" ] }, { "cell_type": "markdown", "id": "27", "metadata": {}, "source": [ "## 6. Typing a DS Pipeline Config\n", "\n", "A real DS pipeline has more than one config object. The pattern is to compose small focused models into a root config that is loaded once:" ] }, { "cell_type": "code", "execution_count": null, "id": "28", "metadata": {}, "outputs": [], "source": [ "from typing import Annotated\n", "\n", "from pydantic import BaseModel, Field, model_validator\n", "from pydantic_settings import BaseSettings, SettingsConfigDict\n", "\n", "\n", "class WeightConfig(BaseModel):\n", " midterm: Annotated[float, Field(gt=0.0, lt=1.0)] = 0.30\n", " final: Annotated[float, Field(gt=0.0, lt=1.0)] = 0.45\n", " project: Annotated[float, Field(gt=0.0, lt=1.0)] = 0.25\n", "\n", " @model_validator(mode=\"after\")\n", " def weights_sum_to_one(self) -> \"WeightConfig\":\n", " total = self.midterm + self.final + self.project\n", " if abs(total - 1.0) > 1e-6:\n", " raise ValueError(f\"Weights must sum to 1.0, got {total:.4f}\")\n", " return self\n", "\n", "\n", "class PipelineConfig(BaseSettings):\n", " model_config = SettingsConfigDict(env_file=\".env\", env_prefix=\"GRADE_\")\n", "\n", " weights: WeightConfig = WeightConfig()\n", " pass_threshold: Annotated[float, Field(ge=50.0, le=80.0)] = 60.0\n", " data_path: str = \"data/university_analytics.csv\"\n", " output_path: str = \"data/results.parquet\"\n", " debug: bool = False" ] }, { "cell_type": "code", "execution_count": null, "id": "29", "metadata": {}, "outputs": [], "source": [ "config = PipelineConfig()\n", "\n", "\n", "def compute_grade(midterm: float, final: float, project: float, cfg: PipelineConfig) -> float:\n", " w = cfg.weights\n", " return midterm * w.midterm + final * w.final + project * w.project\n", "\n", "\n", "print(compute_grade(78.0, 82.0, 91.0, config))" ] }, { "cell_type": "markdown", "id": "30", "metadata": {}, "source": [ "
\n", " Activity 5 - Nested Config

\n", "Goal: Create a PipelineConfig object. Confirm that config.weights.midterm returns the expected default. Then create one with custom weights WeightConfig(midterm=0.25, final=0.50, project=0.25) and confirm it validates correctly. Finally confirm that invalid nested weights (sum != 1.0) raise a ValidationError on the nested model.\n", "
" ] }, { "cell_type": "markdown", "id": "31", "metadata": {}, "source": [ "## 7. Validating a Batch of Records\n", "\n", "In DS, data comes in batches. Pydantic validates each record independently; the question is how to handle errors without stopping on the first failure:" ] }, { "cell_type": "code", "execution_count": null, "id": "32", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from pydantic import ValidationError\n", "\n", "\n", "def validate_batch(\n", " records: list[dict],\n", ") -> tuple[list[StudentRecord], list[dict]]:\n", " valid: list[StudentRecord] = []\n", " errors: list[dict] = []\n", "\n", " for raw in records:\n", " try:\n", " valid.append(StudentRecord.model_validate(raw))\n", " except ValidationError as e:\n", " errors.append({**raw, \"errors\": e.errors()})\n", "\n", " return valid, errors\n", "\n", "\n", "# Simulate a small batch\n", "sample = [\n", " {\"student_id\": \"S0001\", \"midterm_score\": 78.5, \"final_score\": 82.0, \"project_score\": 90.0, \"program\": \"CS\"},\n", " {\n", " \"student_id\": \"S0002\",\n", " \"midterm_score\": 150.0,\n", " \"final_score\": 80.0,\n", " \"project_score\": 75.0,\n", " \"program\": \"EE\",\n", " }, # invalid\n", " {\n", " \"student_id\": \"INVALID\",\n", " \"midterm_score\": 70.0,\n", " \"final_score\": 68.0,\n", " \"project_score\": 72.0,\n", " \"program\": \"CS\",\n", " }, # invalid\n", "]\n", "\n", "valid_records, error_records = validate_batch(sample)\n", "print(f\"Valid: {len(valid_records)}, Errors: {len(error_records)}\")\n", "for rec in error_records:\n", " print(rec[\"student_id\"], \"->\", rec[\"errors\"][0][\"msg\"])" ] }, { "cell_type": "markdown", "id": "33", "metadata": {}, "source": [ "
\n", " Pro Tip: Use TypeAdapter to validate a list without a wrapper model

\n", "Pydantic v2's TypeAdapter validates arbitrary types, including list[StudentRecord], without defining a wrapper model:\n", "
from pydantic import TypeAdapter\n",
    "\n",
    "adapter = TypeAdapter(list[StudentRecord])\n",
    "\n",
    "# Raises a single ValidationError that reports every invalid record\n",
    "all_records = adapter.validate_python(df.to_dict(orient=\"records\"))\n",
    "\n",
    "# Generate JSON Schema for documentation\n",
    "print(adapter.json_schema())
\n", "This is cleaner than wrapping in a container model when you only need batch validation.\n", "
\n", "\n", "
\n", " Activity 6 - Batch Validation Report

\n", "Goal: Load university_analytics.csv. Introduce two invalid rows: one with midterm_score=150.0 and one with student_id=\"INVALID\". Run validate_batch() on all rows. Print the count of valid and invalid records, and the error details for the two invalid rows.\n", "
df = pd.read_csv(\"data/university_analytics.csv\")\n",
    "df_with_errors = df.copy()\n",
    "df_with_errors.loc[0, \"midterm_score\"] = 150.0\n",
    "df_with_errors.loc[1, \"student_id\"] = \"INVALID\"\n",
    "\n",
    "valid, errors = validate_batch(df_with_errors.to_dict(orient=\"records\"))\n",
    "print(f\"Valid: {len(valid)}, Errors: {len(errors)}\")
\n", "
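As a complement to the `TypeAdapter` tip in this section, the sketch below shows how to recover which rows failed from the error it raises. It assumes Pydantic v2; `ScoreRow` is a hypothetical, reduced stand-in for `StudentRecord`.

```python
from typing import Annotated

from pydantic import BaseModel, Field, TypeAdapter, ValidationError


class ScoreRow(BaseModel):  # hypothetical reduced stand-in for StudentRecord
    student_id: str
    midterm_score: Annotated[float, Field(ge=0.0, le=100.0)]


rows = [
    {'student_id': 'S0001', 'midterm_score': 78.5},
    {'student_id': 'S0002', 'midterm_score': 150.0},
    {'student_id': 'S0003', 'midterm_score': -5.0},
]

adapter = TypeAdapter(list[ScoreRow])
try:
    adapter.validate_python(rows)
    bad_rows = []
except ValidationError as exc:
    # Each error location starts with the list index of the offending row.
    bad_rows = sorted({err['loc'][0] for err in exc.errors()})
print(bad_rows)
```

The trade-off against `validate_batch()` is that `TypeAdapter` returns no partial result: if any row fails, no valid records come back, so the explicit loop remains the better fit when the good rows should still be processed.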
" ] }, { "cell_type": "markdown", "id": "34", "metadata": {}, "source": [ "## Capstone: Typed grade-predictor Pipeline\n", "\n", "Bring Pydantic and typer into the `grade-predictor` project end to end.\n", "\n", "
\n", " Capstone - A Validated, CLI-Driven Pipeline

\n", "\n",
    "1. Define StudentRecord in grade_predictor/models.py with all CSV fields, appropriate constraints, and a student_id format validator\n",
    "2. Define PipelineConfig in grade_predictor/config.py with WeightConfig, pass_threshold, and data_path\n",
    "3. Update compute_grade in core.py to accept a PipelineConfig instead of separate weight arguments\n",
    "4. Write a load_and_validate(path: str) -> tuple[list[StudentRecord], list[dict]] function that reads the CSV and validates every row\n",
    "5. Add a typer CLI command run that accepts --data-path and --pass-threshold, constructs a PipelineConfig, calls load_and_validate, and prints the valid/error counts\n",
    "6. Write two tests: one confirming a valid StudentRecord is constructed correctly, and one confirming an invalid record raises ValidationError with the right field name\n",
    "\n", "
" ] }, { "cell_type": "markdown", "id": "35", "metadata": {}, "source": [ "## Further Reading\n", "\n", "| Resource | Why it matters |\n", "| --- | --- |\n", "| [Pydantic v2 documentation](https://docs.pydantic.dev/latest/) | The primary reference; the migration guide from v1 is worth reading if you encounter older Pydantic code |\n", "| [pydantic-settings documentation](https://docs.pydantic.dev/latest/concepts/pydantic_settings/) | `BaseSettings`, env file loading, nested settings, and secrets |\n", "| [typer documentation](https://typer.tiangolo.com/) | Full reference for CLI commands, subcommands, arguments vs options, and testing |\n", "| [Pydantic v2 validators](https://docs.pydantic.dev/latest/concepts/validators/) | `field_validator`, `model_validator`, `Annotated` constraints |\n", "| [FastAPI + Pydantic](https://fastapi.tiangolo.com/tutorial/body/) | FastAPI uses Pydantic models for request/response; the same `BaseModel` patterns apply directly |\n", "| [pandera](https://pandera.readthedocs.io/) | Schema-level DataFrame validation: the DataFrame equivalent of `StudentRecord` for column dtypes and constraints. Covered in Part 20. 
|" ] }, { "cell_type": "markdown", "id": "36", "metadata": {}, "source": [ "## Summary\n", "\n", "| Concept | Key rule |\n", "| --- | --- |\n", "| `BaseModel` | Validates on construction; coerces compatible types; rejects the rest with `ValidationError` |\n", "| `Field(ge=0, le=100)` | Inline constraints via `Annotated`; replaces manual range checks inside functions |\n", "| `@field_validator` | Single-field business rules; runs after type coercion by default |\n", "| `@model_validator(mode=\"after\")` | Cross-field rules; has access to the fully constructed model |\n", "| `model_dump()` | Convert model to dict for JSON serialisation or pandas |\n", "| `model_validate(dict)` | Construct and validate from a plain dict; use `model_validate_json()` for a JSON string |\n", "| `argparse` | Stdlib CLI parser; verbose but zero-dependency |\n", "| `typer` | Type-annotation-driven CLI; auto-generates help; preferred for DS/MLOps tools |\n", "| `BaseSettings` | Reads from env vars and `.env` files; validates at construction time |\n", "| `env_prefix` | Namespaces all env vars for a settings class |\n", "| Batch validation | Loop with `try/except ValidationError`; collect errors without stopping |\n", "| `TypeAdapter` | Validate arbitrary types including `list[Model]` without a wrapper model |\n", "\n", "**Next:** Part 20 covers DataFrame schema validation with Pandera: the same \"validate at the boundary\" principle, applied to tabular data instead of individual records." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }