{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 19: Data Validation with Pydantic\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/02-dev-tools/07-pydantic-validation.ipynb) [](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/02-dev-tools/07-pydantic-validation.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Dev Tools**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook assumes you have completed [Part 15: Type Annotations](03-type-annotations.ipynb). Pydantic is type annotations put to work: the type hints you write in Part 15 become runtime validators that reject wrong inputs before they corrupt a pipeline.\n", "\n", "The `grade-predictor` project continues here: a `GradeConfig` model replaces the ad-hoc defaults in `compute_grade`, and a `StudentRecord` model validates incoming data before it ever reaches a computation function.\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 19 you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "| --- | --- | --- |\n", "| 1 | Define Pydantic models and understand how they differ from dataclasses | Sec. 1 |\n", "| 2 | Validate input data and handle `ValidationError` cleanly | Sec. 2 |\n", "| 3 | Write field validators and cross-field validators | Sec. 3 |\n", "| 4 | Build CLI tools with argparse and typer, and understand when to use each | Sec. 4 |\n", "| 5 | Use `BaseSettings` to manage configuration from environment variables | Sec. 
5 |\n", "| 6 | Build a typed config object for a DS pipeline | Sec. 6 |\n", "| 7 | Validate a batch of student records and collect all errors at once | Sec. 7 |\n", ":::\n" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## 0. The Bug That Type Annotations Cannot Catch\n", "\n", "You have just finished building the `grade-predictor` pipeline. It reads a CSV, computes weighted scores, applies a pass threshold, and writes results to a file. You add type annotations everywhere. Your type checker passes clean. You deploy.\n", "\n", "Two weeks later, a bug report arrives: the pipeline produced a passing grade for a student whose `midterm_score` was logged as `150.0` — which is impossible on a 100-point scale. The pipeline did not crash. The math ran correctly on the wrong number. A downstream report was wrong, and nobody noticed until a student appealed their grade.\n", "\n", "The problem was not the computation. It was that nothing stopped the invalid value from entering the pipeline in the first place. Type annotations say *what type a value should be*. They do not say *whether a value of that type makes sense at runtime*. `midterm_score: float` accepts any float: `150.0`, `-20.0`, `float(\"inf\")`. The annotation is a hint to the reader and the type checker. It is not a gate.\n", "\n", "**Pydantic** ([pydantic.dev](https://docs.pydantic.dev/latest/)) turns annotations into gates. Define a model, describe the constraints, and Pydantic enforces them on every construction — before the value reaches any computation. One validation point at entry; everything downstream trusts the types.\n", "\n", "### Install\n", "\n", "Pydantic is in `pyproject.toml`. For a standalone project:\n", "\n", "```bash\n", "uv add pydantic pydantic-settings # or: pip install pydantic pydantic-settings\n", "```" ] }, { "cell_type": "raw", "id": "5", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "## 1. 
Models: Type Annotations That Bite\n", "\n", "```{mermaid}\n", "flowchart LR\n", "    A[\"raw input\\ndict / JSON / env vars\"] --> B[\"Pydantic model\\nconstructor\"]\n", "    B --> C[\"field validators\\ntype coercion + constraints\"]\n", "    C -->|\"type error or\\nconstraint violated\"| E[\"ValidationError\\nfield path + message\"]\n", "    C -->|\"all fields valid\"| D[\"model validators\\ncross-field checks\"]\n", "    D -->|\"invariant violated\"| E\n", "    D -->|\"all pass\"| F[\"model instance\\nvalidated, type-safe\"]\n", "\n", "    style F fill:#EBF5F0,stroke:#059669,color:#065F46\n", "    style E fill:#FEF2F2,stroke:#DC2626,color:#991B1B\n", "    style B fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E\n", "```" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "A Python dataclass does not check field values at runtime, whatever the annotation says. A Pydantic model validates on construction. Pass a value that cannot be converted to the declared type and it raises a `ValidationError` immediately, before the value reaches any computation:" ] }, { "cell_type": "code", "execution_count": null, "id": "7", "metadata": {}, "outputs": [], "source": [ "from dataclasses import dataclass\n", "\n", "from pydantic import BaseModel, ValidationError\n", "\n", "\n", "@dataclass\n", "class DataclassStudent:\n", "    student_id: str\n", "    midterm_score: float\n", "\n", "\n", "class PydanticStudent(BaseModel):\n", "    student_id: str\n", "    midterm_score: float\n", "\n", "\n", "# Dataclass: accepts \"eighty\" silently\n", "dc = DataclassStudent(student_id=\"S0001\", midterm_score=\"eighty\")\n", "print(dc.midterm_score)  # \"eighty\" -- no error\n", "\n", "# Pydantic: raises ValidationError immediately\n", "try:\n", "    ps = PydanticStudent(student_id=\"S0001\", midterm_score=\"eighty\")\n", "except ValidationError as e:\n", "    print(type(e).__name__, e)" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "Pydantic also coerces where the conversion is unambiguous. 
`\"85\"` becomes `85.0` for a `float` field; `\"true\"` becomes `True` for a `bool` field. This makes it practical for inputs from CSV files, APIs, or environment variables, where everything arrives as a string.\n", "\n", "
Add a `semester` field to `StudentRecord` that must be one of `\"Fall\"`, `\"Spring\"`, or `\"Summer\"`. Use `Literal[\"Fall\", \"Spring\", \"Summer\"]` as the type annotation. Try creating a record with `semester=\"Winter\"` and confirm it raises a `ValidationError`.\n",
"from typing import Literal\n",
"\n",
"class StudentRecord(BaseModel):\n",
" ...\n",
" semester: Literal[\"Fall\", \"Spring\", \"Summer\"]\n",
"\n",
"StudentRecord(..., semester=\"Winter\") # should raise ValidationError\n",
"model_dump() and model_validate() to round-trip through dicts and JSONconfig.model_dump() converts a Pydantic model to a plain Python dict. GradeConfig.model_validate(some_dict) validates and constructs a model from a dict. GradeConfig.model_validate_json(json_string) parses and validates from a JSON string. These three methods cover the most common I/O patterns for config files, API payloads, and database rows.\n",
"GradeConfig from a Python dict using model_validate(). Then call model_dump() to get a plain dict back. Confirm the two dicts have the same values. Also confirm that a dict with weights that do not sum to 1.0 raises a ValidationError.\n",
"raw = {\"midterm_weight\": 0.3, \"final_weight\": 0.45, \"project_weight\": 0.25}\n",
"cfg = GradeConfig.model_validate(raw)\n",
"assert cfg.model_dump() == {**raw, \"pass_threshold\": 60.0}\n",
"argparse when you want zero dependencies and a simple script. Use typer when you want auto-generated help, validation, subcommands, and code that reads like the function signature it already is. For DS/MLOps tools that are both importable as a library and runnable as a CLI, typer is the better fit.\n",
"validate_data that accepts --data-path (str) and --max-errors (int, default 10). The command should print the arguments it received. Confirm that passing --max-errors abc would raise a typer error (type mismatch).\n",
"@app.command()\n",
"def validate_data(\n",
" data_path: str = typer.Option(..., help=\"Path to CSV to validate\"),\n",
" max_errors: int = typer.Option(10, help=\"Stop after this many errors\"),\n",
") -> None:\n",
" typer.echo(f\"Validating {data_path} (max errors: {max_errors})\")\n",
"BaseSettings validates all settings at construction time, failing fast at startup with a clear error that lists every missing or invalid variable.\n",
"GradeSettings() reads the .env file and environment on every call. Calling it inside a loop that processes thousands of rows reads and validates the config thousands of times. Construct settings once at module or application level and pass the object through.\n",
"GradeSettings instance, overriding individual settings without a .env file by passing values directly to the constructor: GradeSettings(api_key=\"test-key\", pass_threshold=70.0). Confirm the values are set correctly and that an invalid pass_threshold (e.g., 95.0) raises a ValidationError.\n",
"test_settings = GradeSettings(api_key=\"test-key\", pass_threshold=70.0)\n",
"assert test_settings.pass_threshold == 70.0\n",
"\n",
"GradeSettings(api_key=\"test-key\", pass_threshold=95.0) # should raise\n",
"PipelineConfig object. Confirm that config.weights.midterm returns the expected default. Then create one with custom weights WeightConfig(midterm=0.25, final=0.50, project=0.25) and confirm it validates correctly. Finally confirm that invalid nested weights (sum != 1.0) raise a ValidationError on the nested model.\n",
"TypeAdapter to validate a list without a wrapper modelTypeAdapter validates arbitrary types, including list[StudentRecord], without defining a wrapper model:\n",
"from pydantic import TypeAdapter\n",
"\n",
"adapter = TypeAdapter(list[StudentRecord])\n",
"\n",
"# Raises ValidationError for the first invalid record\n",
"all_records = adapter.validate_python(df.to_dict(orient=\"records\"))\n",
"\n",
"# Generate JSON Schema for documentation\n",
"print(adapter.json_schema())\n",
"This is cleaner than wrapping in a container model when you only need batch validation.\n",
"university_analytics.csv. Introduce two invalid rows: one with midterm_score=150.0 and one with student_id=\"INVALID\". Run validate_batch() on all rows. Print the count of valid and invalid records, and the error details for the two invalid rows.\n",
"df_with_errors = df.copy()\n",
"df_with_errors.loc[0, \"midterm_score\"] = 150.0\n",
"df_with_errors.loc[1, \"student_id\"] = \"INVALID\"\n",
"\n",
"valid, errors = validate_batch(df_with_errors.to_dict(orient=\"records\"))\n",
"print(f\"Valid: {len(valid)}, Errors: {len(errors)}\")\n",
"StudentRecord in grade_predictor/models.py with all CSV fields, appropriate constraints, and a student_id format validatorPipelineConfig in grade_predictor/config.py with WeightConfig, pass_threshold, and data_pathcompute_grade in core.py to accept a PipelineConfig instead of separate weight argumentsload_and_validate(path: str) -> tuple[list[StudentRecord], list[dict]] function that reads the CSV and validates every rowrun that accepts --data-path and --pass-threshold, constructs a PipelineConfig, calls load_and_validate, and prints the valid/error countsStudentRecord is constructed correctly, and one confirming an invalid record raises ValidationError with the right field name