{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "---\n",
    "title: \"Data Schema Validation with Pandera\"\n",
    "---\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/02-dev-tools/08-pandera-schema-validation.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/02-dev-tools/08-pandera-schema-validation.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2",
   "metadata": {},
   "source": [
    "**DS-MLOps Dev Tools**\n",
    "\n",
    "**Python 3.12+ | Author: Anthony Faustine**\n",
    "\n",
    "## Before you begin\n",
    "\n",
    "This notebook assumes you have completed [Part 19: Data Validation with Pydantic](07-pydantic-validation.ipynb). Pandera extends the same \"validate at the boundary\" principle to DataFrames: where Pydantic validates individual records, Pandera validates the schema of an entire table.\n",
    "\n",
    "The `grade-predictor` project continues here: a Pandera schema replaces the implicit assumptions about `university_analytics.csv` with explicit, testable contracts.\n",
    "\n",
    "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "::: {.callout-note collapse=\"true\" icon=false}\n",
    "## Learning Objectives\n",
    "\n",
    "By the end of Part 20 you will be able to:\n",
    "\n",
    "| # | Skill | Covered in |\n",
    "| --- | --- | --- |\n",
    "| 1 | Explain the difference between row-level (Pydantic) and schema-level (Pandera) validation | Sec. 1 |\n",
    "| 2 | Define a Pandera `DataFrameSchema` with column types and constraints | Sec. 2 |\n",
    "| 3 | Use the class-based API with `pa.DataFrameModel` for typed schemas | Sec. 3 |\n",
    "| 4 | Write custom element-wise and series-level checks | Sec. 4 |\n",
    "| 5 | Validate DataFrames in a pipeline and collect errors without stopping | Sec. 5 |\n",
    "| 6 | Use Pandera schemas as pytest fixtures to document data contracts | Sec. 6 |\n",
    ":::\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4",
   "metadata": {},
   "source": [
    "## 0. Pydantic Validated the Row. Who Validates the Table?\n",
    "\n",
    "You have a `StudentRecord` Pydantic model. It validates `midterm_score` is between 0 and 100, that `student_id` matches `S\\d{4}`, and that `program` is a non-empty string. Pydantic runs when a *single record* enters the system.\n",
    "\n",
    "Now you load `university_analytics.csv` into a DataFrame. There are 2,400 rows. You could loop through them with `StudentRecord.model_validate`, but that tells you nothing about the *table*: whether `student_id` is unique across all rows, whether the distribution of `program` values matches what you expect, whether the proportion of missing values in `has_internet` is within an acceptable bound. A row validator answers \"is this row correct?\". A schema validator answers \"is this table correct?\".\n",
    "\n",
    "**Pandera** ([pandera.readthedocs.io](https://pandera.readthedocs.io/)) is a statistical data validation library for DataFrames. You define a schema: column types, constraints, uniqueness, allowed values, statistical properties, and Pandera checks the whole DataFrame against it in one call. It supports pandas and Polars, integrates with pytest, and can generate synthetic data for testing.\n",
    "\n",
    "### Install\n",
    "\n",
    "```bash\n",
    "uv add pandera          # or: pip install pandera\n",
    "```"
   ]
  },
  {
   "cell_type": "raw",
   "id": "5",
   "metadata": {
    "raw_mimetype": "text/markdown"
   },
   "source": [
    "## 1. Row Validation vs Schema Validation\n",
    "\n",
    "```{mermaid}\n",
    "flowchart LR\n",
    "    CSV[\"university_analytics.csv\"] --> DF[\"pandas DataFrame\\n2,400 rows × 10 cols\"]\n",
    "    DF --> PA[\"Pandera schema\\nColumn types, constraints,\\nuniqueness, statistics\"]\n",
    "    PA -->|\"schema violation\"| ERR[\"SchemaError\\ncell location + rule\"]\n",
    "    PA -->|\"all checks pass\"| OK[\"validated DataFrame\\nsafe to use downstream\"]\n",
    "\n",
    "    style OK fill:#EBF5F0,stroke:#059669,color:#065F46\n",
    "    style ERR fill:#FEF2F2,stroke:#DC2626,color:#991B1B\n",
    "    style PA fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6",
   "metadata": {},
   "source": [
    "Pydantic and Pandera answer different questions:\n",
    "\n",
    "| | Pydantic `BaseModel` | Pandera `DataFrameSchema` |\n",
    "| --- | --- | --- |\n",
    "| Unit | One record (row) | Whole DataFrame |\n",
    "| Checks | Type coercion, field constraints, cross-field | Column dtype, nullability, uniqueness, value ranges, statistical bounds |\n",
    "| Returns | Validated model instance | Validated DataFrame |\n",
    "| Error info | Field path + message | Row index + column + failed check |\n",
    "| Best for | API inputs, config objects | CSVs, pipeline data, feature tables |\n",
    "\n",
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: Use both, not one</span><br><br>\n",
    "Pydantic and Pandera are not alternatives; they validate at different levels. A production pipeline uses Pydantic to validate records as they arrive (JSON from an API, rows from a queue) and Pandera to validate the assembled DataFrame before it enters a model or a write operation. Together they form a complete contract on the data.\n",
    "</div>"
   ]
  },
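  {
   "cell_type": "markdown",
   "id": "6a",
   "metadata": {},
   "source": [
    "To make the distinction concrete, here is a minimal sketch in plain pandas. The frame `toy_df` and its values are hypothetical, not taken from `university_analytics.csv`. Every row looks plausible on its own, yet the table as a whole breaks two table-level expectations that a per-row validator never sees:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data: every row is individually plausible\n",
    "toy_df = pd.DataFrame(\n",
    "    {\n",
    "        \"student_id\": [\"S0001\", \"S0002\", \"S0001\"],  # S0001 appears twice\n",
    "        \"has_internet\": [True, None, None],  # two of three values missing\n",
    "    }\n",
    ")\n",
    "\n",
    "# Table-level properties that no single-row validator can see\n",
    "ids_unique = toy_df[\"student_id\"].is_unique\n",
    "missing_share = toy_df[\"has_internet\"].isna().mean()\n",
    "\n",
    "print(\"student_id unique:\", ids_unique)\n",
    "print(f\"has_internet missing share: {missing_share:.2f}\")\n",
    "```\n",
    "\n",
    "A Pandera schema turns ad hoc checks like these into a declared, reusable contract."
   ]
  },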
  {
   "cell_type": "markdown",
   "id": "7",
   "metadata": {},
   "source": [
    "## 2. Defining a `DataFrameSchema`\n",
    "\n",
    "The functional API creates a schema by describing each column with `pa.Column`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import pandera as pa\n",
    "\n",
    "schema = pa.DataFrameSchema(\n",
    "    {\n",
    "        \"student_id\": pa.Column(str, pa.Check.str_matches(r\"^S\\d{4}$\")),\n",
    "        \"midterm_score\": pa.Column(float, [pa.Check.ge(0.0), pa.Check.le(100.0)]),\n",
    "        \"final_score\": pa.Column(float, [pa.Check.ge(0.0), pa.Check.le(100.0)]),\n",
    "        \"project_score\": pa.Column(float, [pa.Check.ge(0.0), pa.Check.le(100.0)]),\n",
    "        \"program\": pa.Column(str, pa.Check.isin([\"CS\", \"EE\", \"Math\", \"Physics\", \"Biology\"])),\n",
    "        \"has_internet\": pa.Column(bool),\n",
    "        \"school_id\": pa.Column(str, nullable=True),\n",
    "        \"teacher_count\": pa.Column(int, pa.Check.ge(1)),\n",
    "        \"school_size\": pa.Column(str, pa.Check.isin([\"Small\", \"Medium\", \"Large\"])),\n",
    "        \"pass_threshold\": pa.Column(float, pa.Check.between(50.0, 80.0), nullable=True),\n",
    "    },\n",
    "    checks=[\n",
    "        pa.Check(lambda df: df[\"student_id\"].is_unique, error=\"student_id must be unique\"),\n",
    "    ],\n",
    "    coerce=True,\n",
    ")\n",
    "\n",
    "print(\"Schema created with\", len(schema.columns), \"columns\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {},
   "source": [
    "`coerce=True` tells Pandera to attempt type coercion before validation: `\"78.5\"` becomes `78.5` for a `float` column, mirroring Pydantic's behaviour at the row level.\n",
    "\n",
    "Validate a DataFrame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build a small valid sample\n",
    "sample_df = pd.DataFrame(\n",
    "    {\n",
    "        \"student_id\": [\"S0001\", \"S0002\", \"S0003\"],\n",
    "        \"midterm_score\": [78.5, 65.0, 90.0],\n",
    "        \"final_score\": [82.0, 70.0, 88.0],\n",
    "        \"project_score\": [91.0, 75.0, 85.0],\n",
    "        \"program\": [\"CS\", \"EE\", \"Math\"],\n",
    "        \"has_internet\": [True, False, True],\n",
    "        \"school_id\": [\"SCH01\", \"SCH01\", \"SCH02\"],\n",
    "        \"teacher_count\": [12, 8, 15],\n",
    "        \"school_size\": [\"Large\", \"Medium\", \"Large\"],\n",
    "        \"pass_threshold\": [60.0, 60.0, 65.0],\n",
    "    }\n",
    ")\n",
    "\n",
    "validated = schema.validate(sample_df)\n",
    "print(\"Validation passed:\", validated.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Introduce an invalid row (midterm_score > 100)\n",
    "bad_df = sample_df.copy()\n",
    "bad_df.loc[0, \"midterm_score\"] = 150.0\n",
    "\n",
    "try:\n",
    "    schema.validate(bad_df)\n",
    "except pa.errors.SchemaError as e:\n",
    "    print(type(e).__name__)\n",
    "    print(e)"
   ]
  },
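  {
   "cell_type": "markdown",
   "id": "11a",
   "metadata": {},
   "source": [
    "The `coerce=True` flag can also be observed in isolation. The sketch below assumes two hypothetical one-column schemas and a frame whose scores arrive as strings (`score_schema`, `strict_schema`, and `raw_df` are illustrative names). With coercion, the strings are cast to `float` before the range checks run; without it, the dtype mismatch is itself a schema violation:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "import pandera as pa\n",
    "\n",
    "# Scores arrive as strings, as they might from a loosely typed export\n",
    "raw_df = pd.DataFrame({\"midterm_score\": [\"78.5\", \"65\", \"90.0\"]})\n",
    "\n",
    "# Hypothetical one-column schemas, identical except for coercion\n",
    "score_checks = [pa.Check.ge(0.0), pa.Check.le(100.0)]\n",
    "score_schema = pa.DataFrameSchema(\n",
    "    {\"midterm_score\": pa.Column(float, score_checks)},\n",
    "    coerce=True,\n",
    ")\n",
    "strict_schema = pa.DataFrameSchema({\"midterm_score\": pa.Column(float, score_checks)})\n",
    "\n",
    "coerced = score_schema.validate(raw_df)\n",
    "print(\"coerced dtype:\", coerced[\"midterm_score\"].dtype)\n",
    "\n",
    "try:\n",
    "    strict_schema.validate(raw_df)\n",
    "    strict_passed = True\n",
    "except pa.errors.SchemaError:\n",
    "    strict_passed = False  # a string column is not a float column\n",
    "\n",
    "print(\"passes without coercion:\", strict_passed)\n",
    "```\n",
    "\n",
    "Coercion is convenient at a CSV boundary, but it can also hide upstream typing problems, so whether to enable it is a per-pipeline judgment call."
   ]
  },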
  {
   "cell_type": "markdown",
   "id": "12",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 1 - Duplicate student_id</span><br><br>\n",
    "<b>Goal:</b> Create a DataFrame where two rows share the same <code>student_id</code>. Run <code>schema.validate(df)</code> and confirm it raises a <code>SchemaError</code> with a message about uniqueness.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>dup_df = sample_df.copy()\n",
    "dup_df.loc[1, \"student_id\"] = \"S0001\"  # duplicate\n",
    "schema.validate(dup_df)  # should raise</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13",
   "metadata": {},
   "source": [
    "## 3. `DataFrameModel`: The Class-Based API\n",
    "\n",
    "The functional `DataFrameSchema` API is explicit but verbose. Pandera's class-based API, `pa.DataFrameModel`, mirrors Pydantic's `BaseModel`: define columns as class-level fields with type annotations, and use `pa.Field` for constraints:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Optional\n",
    "\n",
    "import pandera as pa\n",
    "from pandera.typing import Series\n",
    "\n",
    "\n",
    "class StudentDataSchema(pa.DataFrameModel):\n",
    "    student_id: Series[str] = pa.Field(str_matches=r\"^S\\d{4}$\")\n",
    "    midterm_score: Series[float] = pa.Field(ge=0.0, le=100.0)\n",
    "    final_score: Series[float] = pa.Field(ge=0.0, le=100.0)\n",
    "    project_score: Series[float] = pa.Field(ge=0.0, le=100.0)\n",
    "    program: Series[str] = pa.Field(isin=[\"CS\", \"EE\", \"Math\", \"Physics\", \"Biology\"])\n",
    "    has_internet: Series[bool]\n",
    "    school_id: Series[str] | None = pa.Field(nullable=True)\n",
    "    teacher_count: Series[int] = pa.Field(ge=1)\n",
    "    school_size: Series[str] = pa.Field(isin=[\"Small\", \"Medium\", \"Large\"])\n",
    "\n",
    "    class Config:\n",
    "        coerce = True\n",
    "\n",
    "    @pa.check(\"student_id\")\n",
    "    def student_id_unique(cls, series: Series[str]) -> bool:  # noqa: N805\n",
    "        return series.is_unique\n",
    "\n",
    "\n",
    "validated = StudentDataSchema.validate(sample_df)\n",
    "print(\"Class-based validation passed:\", validated.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15",
   "metadata": {},
   "source": [
    "The `@pa.check` decorator attaches a column-level check as a classmethod. The check receives the entire `Series` and must return a boolean (or a boolean Series for element-wise checks).\n",
    "\n",
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: Use <code>DataFrameModel</code> for reuse, <code>DataFrameSchema</code> for quick scripts</span><br><br>\n",
    "<code>DataFrameModel</code> is easier to subclass, document, and test: it reads like a dataclass and fits naturally alongside Pydantic models. <code>DataFrameSchema</code> is useful when you want to build a schema programmatically at runtime, e.g., from a config file or database metadata.\n",
    "</div>\n",
    "\n",
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 2 - Extend the Schema</span><br><br>\n",
    "<b>Goal:</b> Subclass <code>StudentDataSchema</code> and add a <code>semester</code> column constrained to <code>[\"Fall\", \"Spring\", \"Summer\"]</code>. Validate a DataFrame that includes the column with valid values, then one that has an invalid value. Confirm only the second raises a <code>SchemaError</code>.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>class ExtendedSchema(StudentDataSchema):\n",
    "    semester: Series[str] = pa.Field(isin=[\"Fall\", \"Spring\", \"Summer\"])</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16",
   "metadata": {},
   "source": [
    "## 4. Custom Checks\n",
    "\n",
    "Built-in checks cover ranges, null counts, string patterns, and allowed values. Custom checks handle business rules that require more logic:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandera as pa\n",
    "from pandera.typing import Series\n",
    "\n",
    "\n",
    "class GradeSchema(pa.DataFrameModel):\n",
    "    midterm_score: Series[float] = pa.Field(ge=0.0, le=100.0)\n",
    "    final_score: Series[float] = pa.Field(ge=0.0, le=100.0)\n",
    "    project_score: Series[float] = pa.Field(ge=0.0, le=100.0)\n",
    "\n",
    "    @pa.check(\"midterm_score\", name=\"not_all_perfect\")\n",
    "    def not_all_perfect_scores(cls, series: Series[float]) -> bool:  # noqa: N805\n",
    "        return (series == 100.0).mean() < 0.5  # less than 50% perfect scores\n",
    "\n",
    "    @pa.dataframe_check\n",
    "    def weighted_average_reasonable(cls, df: pd.DataFrame) -> bool:  # noqa: N805\n",
    "        avg = df[\"midterm_score\"] * 0.3 + df[\"final_score\"] * 0.45 + df[\"project_score\"] * 0.25\n",
    "        return (avg >= 0.0).all() and (avg <= 100.0).all()\n",
    "\n",
    "\n",
    "# Validate the sample\n",
    "result = GradeSchema.validate(sample_df[[\"midterm_score\", \"final_score\", \"project_score\"]])\n",
    "print(\"Custom checks passed:\", result.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18",
   "metadata": {},
   "source": [
    "`@pa.check(\"column_name\")` adds a column-level check: return `True` (the whole series passes), a boolean `Series` (element-wise result), or raise with a message.\n",
    "\n",
    "`@pa.dataframe_check` gets the whole DataFrame: use it for cross-column constraints."
   ]
  },
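  {
   "cell_type": "markdown",
   "id": "18a",
   "metadata": {},
   "source": [
    "Column checks come in two styles, which can be contrasted directly. In the sketch below, `ScoreSchema`, its check names, and the half-point rule are hypothetical, chosen only for illustration. The first check is vectorised and returns one boolean per row; the second passes `element_wise=True`, so Pandera calls it once per value:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "import pandera as pa\n",
    "from pandera.typing import Series\n",
    "\n",
    "\n",
    "class ScoreSchema(pa.DataFrameModel):\n",
    "    final_score: Series[float]\n",
    "\n",
    "    @pa.check(\"final_score\")\n",
    "    def within_range(cls, series: Series[float]) -> Series[bool]:  # noqa: N805\n",
    "        # Vectorised: receives the whole Series, returns one boolean per row\n",
    "        return series.between(0.0, 100.0)\n",
    "\n",
    "    @pa.check(\"final_score\", element_wise=True)\n",
    "    def half_point_steps(cls, value: float) -> bool:  # noqa: N805\n",
    "        # Element-wise: receives a single value at a time\n",
    "        return float(value * 2).is_integer()\n",
    "\n",
    "\n",
    "scores = pd.DataFrame({\"final_score\": [82.0, 70.5, 88.25]})\n",
    "\n",
    "try:\n",
    "    ScoreSchema.validate(scores)\n",
    "    failures = None\n",
    "except pa.errors.SchemaError as exc:\n",
    "    failures = exc.failure_cases  # values that failed the violated check\n",
    "\n",
    "print(failures)\n",
    "```\n",
    "\n",
    "Vectorised checks are usually the better default on large frames; element-wise checks trade speed for simpler per-value logic."
   ]
  },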
  {
   "cell_type": "markdown",
   "id": "19",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 3 - Cross-Column Check</span><br><br>\n",
    "<b>Goal:</b> Add a <code>@pa.dataframe_check</code> to <code>GradeSchema</code> that verifies <code>midterm_score</code> and <code>final_score</code> are not both 0 for the same student (a student with both scores at 0 is almost certainly an error, not a result). Confirm it passes on valid data and fails when you introduce a row with both at 0.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>@pa.dataframe_check\n",
    "def not_both_zero(cls, df: pd.DataFrame) -> pd.Series:\n",
    "    return ~((df[\"midterm_score\"] == 0) & (df[\"final_score\"] == 0))</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20",
   "metadata": {},
   "source": [
    "## 5. Validation in a Pipeline\n",
    "\n",
    "In a pipeline, you want to validate without crashing on the first error, collecting all failures and decide what to do with them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import pandera as pa\n",
    "\n",
    "\n",
    "def load_and_validate(\n",
    "    path: str,\n",
    "    schema: type[pa.DataFrameModel],\n",
    "    *,\n",
    "    lazy: bool = True,\n",
    ") -> tuple[pd.DataFrame, list[dict]]:\n",
    "    df = pd.read_csv(path)\n",
    "\n",
    "    if not lazy:\n",
    "        return schema.validate(df, lazy=False), []\n",
    "\n",
    "    try:\n",
    "        return schema.validate(df, lazy=True), []\n",
    "    except pa.errors.SchemaErrors as e:\n",
    "        error_df = e.failure_cases\n",
    "        errors = error_df.to_dict(orient=\"records\")\n",
    "        # Return only the valid rows for downstream use\n",
    "        bad_idx = set(error_df[\"index\"].dropna().astype(int).tolist())\n",
    "        clean_df = df.drop(index=list(bad_idx)).reset_index(drop=True)\n",
    "        return clean_df, errors\n",
    "\n",
    "\n",
    "# Use it\n",
    "try:\n",
    "    clean, errors = load_and_validate(\n",
    "        \"tutorials/data/university_analytics.csv\",\n",
    "        StudentDataSchema,\n",
    "    )\n",
    "    print(f\"Clean rows: {len(clean)}, Errors: {len(errors)}\")\n",
    "    if errors:\n",
    "        print(\"First error:\", errors[0])\n",
    "except FileNotFoundError:\n",
    "    print(\"CSV not found in this context; run from repo root\")"
   ]
  },
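  {
   "cell_type": "markdown",
   "id": "21a",
   "metadata": {},
   "source": [
    "The difference between eager and lazy validation is easiest to see on a tiny frame. The sketch below uses a hypothetical two-column schema (`mini_schema`) and a frame (`broken`) with two independent violations; the names and data are illustrative only:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "import pandera as pa\n",
    "\n",
    "# Hypothetical two-column schema, kept small so the report is easy to read\n",
    "mini_schema = pa.DataFrameSchema(\n",
    "    {\n",
    "        \"midterm_score\": pa.Column(float, pa.Check.le(100.0)),\n",
    "        \"program\": pa.Column(str, pa.Check.isin([\"CS\", \"EE\"])),\n",
    "    }\n",
    ")\n",
    "\n",
    "# Two independent violations, in different rows and columns\n",
    "broken = pd.DataFrame({\"midterm_score\": [150.0, 65.0], \"program\": [\"CS\", \"Art\"]})\n",
    "\n",
    "try:\n",
    "    mini_schema.validate(broken)  # lazy=False: stops at the first failure\n",
    "except pa.errors.SchemaError as exc:\n",
    "    print(\"eager:\", type(exc).__name__)\n",
    "\n",
    "try:\n",
    "    mini_schema.validate(broken, lazy=True)  # collects every failure\n",
    "except pa.errors.SchemaErrors as exc:\n",
    "    report = exc.failure_cases[[\"column\", \"check\", \"failure_case\", \"index\"]]\n",
    "    print(report)\n",
    "```\n",
    "\n",
    "The lazy report holds one row per failing value, which is the structure `load_and_validate` relies on to separate clean rows from bad ones."
   ]
  },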
  {
   "cell_type": "markdown",
   "id": "22",
   "metadata": {},
   "source": [
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: lazy=True collects all errors; lazy=False stops on the first</span><br><br>\n",
    "<code>schema.validate(df, lazy=False)</code> (the default) raises a <code>SchemaError</code> on the first failure: fast and clear for development. <code>schema.validate(df, lazy=True)</code> collects every failure and raises a <code>SchemaErrors</code> (note the plural) at the end, better for production, where you want a full error report rather than a partial run.\n",
    "</div>\n",
    "\n",
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 4 - Full Error Report</span><br><br>\n",
    "<b>Goal:</b> Introduce three invalid rows into <code>sample_df</code>: one with <code>midterm_score=150.0</code>, one with an invalid <code>program</code>, and one with a duplicate <code>student_id</code>. Call <code>StudentDataSchema.validate(bad_df, lazy=True)</code>. Catch the <code>SchemaErrors</code> exception and print the <code>failure_cases</code> DataFrame showing all three failures at once.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23",
   "metadata": {},
   "source": [
    "## 6. Schemas as Data Contracts in Tests\n",
    "\n",
    "A Pandera schema is documentation that runs. Put it in a pytest fixture and every test that touches a DataFrame automatically validates the contract:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24",
   "metadata": {},
   "outputs": [],
   "source": [
    "# tests/test_data_contracts.py  (save and run: uv run pytest tests/)\n",
    "\n",
    "# ── paste into a test file and run with pytest ──\n",
    "import pandas as pd\n",
    "import pandera as pa\n",
    "from pandera.typing import Series\n",
    "import pytest\n",
    "\n",
    "\n",
    "class StudentDataSchema(pa.DataFrameModel):\n",
    "    student_id: Series[str] = pa.Field(str_matches=r\"^S\\d{4}$\")\n",
    "    midterm_score: Series[float] = pa.Field(ge=0.0, le=100.0)\n",
    "    final_score: Series[float] = pa.Field(ge=0.0, le=100.0)\n",
    "    project_score: Series[float] = pa.Field(ge=0.0, le=100.0)\n",
    "    program: Series[str] = pa.Field(isin=[\"CS\", \"EE\", \"Math\", \"Physics\", \"Biology\"])\n",
    "    has_internet: Series[bool]\n",
    "    teacher_count: Series[int] = pa.Field(ge=1)\n",
    "    school_size: Series[str] = pa.Field(isin=[\"Small\", \"Medium\", \"Large\"])\n",
    "\n",
    "    class Config:\n",
    "        coerce = True\n",
    "\n",
    "\n",
    "@pytest.fixture\n",
    "def valid_df():\n",
    "    return pd.DataFrame(\n",
    "        {\n",
    "            \"student_id\": [\"S0001\", \"S0002\"],\n",
    "            \"midterm_score\": [78.5, 65.0],\n",
    "            \"final_score\": [82.0, 70.0],\n",
    "            \"project_score\": [91.0, 75.0],\n",
    "            \"program\": [\"CS\", \"EE\"],\n",
    "            \"has_internet\": [True, False],\n",
    "            \"teacher_count\": [12, 8],\n",
    "            \"school_size\": [\"Large\", \"Medium\"],\n",
    "        }\n",
    "    )\n",
    "\n",
    "\n",
    "def test_valid_data_passes_schema(valid_df):\n",
    "    validated = StudentDataSchema.validate(valid_df)\n",
    "    assert len(validated) == 2  # noqa: S101\n",
    "\n",
    "\n",
    "def test_invalid_score_raises(valid_df):\n",
    "    bad = valid_df.copy()\n",
    "    bad.loc[0, \"midterm_score\"] = 150.0\n",
    "    with pytest.raises(pa.errors.SchemaError):\n",
    "        StudentDataSchema.validate(bad)\n",
    "\n",
    "\n",
    "def test_invalid_program_raises(valid_df):\n",
    "    bad = valid_df.copy()\n",
    "    bad.loc[0, \"program\"] = \"Underwater Basket Weaving\"\n",
    "    with pytest.raises(pa.errors.SchemaError):\n",
    "        StudentDataSchema.validate(bad)\n",
    "\n",
    "\n",
    "# Demonstration (not a real pytest run)\n",
    "print(\"Tests defined. Run with: uv run pytest tests/test_data_contracts.py -v\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: Use <code>pa.DataFrameModel.example()</code> to generate test data automatically</span><br><br>\n",
    "Pandera can synthesise valid DataFrames from a schema: <code>StudentDataSchema.example(size=50)</code> generates 50 valid rows matching all constraints. This removes the need to hand-craft test fixtures for every new schema.\n",
    "</div>\n",
    "\n",
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 5 - Schema-Driven Test Data</span><br><br>\n",
    "<b>Goal:</b> Call <code>StudentDataSchema.example(size=20)</code> to generate 20 synthetic rows. Confirm that the generated DataFrame passes <code>StudentDataSchema.validate()</code> without errors. Then confirm that if you corrupt one cell (set a score to 200), validation fails.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>synthetic = StudentDataSchema.example(size=20)\n",
    "StudentDataSchema.validate(synthetic)  # should pass</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26",
   "metadata": {},
   "source": [
    "## Capstone: Data Contract for grade-predictor\n",
    "\n",
    "Bring Pandera into the `grade-predictor` pipeline as a first-class data contract.\n",
    "\n",
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Capstone - End-to-End Validated Pipeline</span><br><br>\n",
    "<ol>\n",
    "<li>Define <code>StudentDataSchema</code> in <code>grade_predictor/schemas.py</code> covering all columns of <code>university_analytics.csv</code></li>\n",
    "<li>Update <code>load_and_validate</code> (from Part 19) to run Pandera schema validation <em>after</em> row-level Pydantic validation</li>\n",
    "<li>Add a <code>@pa.dataframe_check</code> that verifies the computed weighted average (using the weights from <code>PipelineConfig</code>) falls in [0, 100] for every row</li>\n",
    "<li>Write three tests: one that the real CSV passes the schema, one that a DataFrame with a bad score raises <code>SchemaError</code>, and one that uses <code>example()</code> to generate synthetic data and confirms it passes</li>\n",
    "<li>Run <code>uv run pytest -v</code> and confirm all three pass</li>\n",
    "</ol>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27",
   "metadata": {},
   "source": [
    "## 7. Why Pandera? Comparing Schema Validation Tools\n",
    "\n",
    "Before Pandera became the community standard for DataFrame validation, several tools tackled the same problem in different ways. Understanding the landscape helps you choose the right tool for your context.\n",
    "\n",
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: Schema validation is not the same as data cleaning</span><br><br>\n",
    "Schema validation answers a yes/no question: does this DataFrame conform to the contract? It runs fast and fails loudly. Data cleaning transforms and repairs data. The two serve different purposes: validate at the pipeline boundary, clean before you get there.\n",
    "</div>\n",
    "\n",
    "| Tool | Best for | Limitation |\n",
    "| --- | --- | --- |\n",
    "| **Pandera** | DataFrame contracts in Python pipelines | pandas/polars-specific; not for arbitrary dicts |\n",
    "| **Great Expectations** | Enterprise data quality suites, HTML reports, data docs | Heavy setup, complex configuration, overkill for most ML work |\n",
    "| **Pydantic** | Row-level validation, API payloads, config models | Not designed for tabular data; looping over rows is slow |\n",
    "| **Frictionless Data** | YAML-defined schemas, cross-language contracts | Less Python-native; validation logic lives in config files |\n",
    "| **Cerberus / marshmallow** | Dict and object validation | No DataFrame support; designed for records not tables |\n",
    "\n",
    "The practical split comes down to scope. Pydantic is the right choice when you are validating individual records at a system boundary: an API payload, a config file, a single row entering a pipeline. Pandera is the right choice when you are validating the structure and statistics of an entire DataFrame. They are complementary, not competing: Part 19 showed you Pydantic for the row, this notebook shows you Pandera for the table.\n",
    "\n",
    "Great Expectations is powerful but designed for data engineering teams who need data documentation, alerting, and audit trails across multiple data sources. For most ML projects, Pandera gives you 90% of the value with 10% of the setup.\n",
    "\n",
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: Use both Pydantic and Pandera in the same pipeline</span><br><br>\n",
    "Validate incoming JSON records with Pydantic as they arrive, then validate the assembled DataFrame with Pandera before it enters any model or transformation. The two checks run at different granularities and catch different classes of bugs: Pydantic catches a bad individual record, Pandera catches drift in the distribution across hundreds of records.\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28",
   "metadata": {},
   "source": [
    "## Further Reading\n",
    "\n",
    "| Resource | Why it matters |\n",
    "| --- | --- |\n",
    "| [Pandera documentation](https://pandera.readthedocs.io/) | Full reference: DataFrameSchema, DataFrameModel, check types, Polars support |\n",
    "| [Pandera + Polars](https://pandera.readthedocs.io/en/stable/polars.html) | Same API works on Polars DataFrames |\n",
    "| [Pandera pytest integration](https://pandera.readthedocs.io/en/stable/pandas_on_spark.html) | `@pa.check_types` decorator for function-level schema enforcement |\n",
    "| [Pydantic + Pandera together](https://pandera.readthedocs.io/en/stable/pydantic_integration.html) | Use a Pandera schema as a Pydantic field type |\n",
    "| [Great Expectations](https://docs.greatexpectations.io/) | Enterprise-grade data quality platform; Pandera for code-first, GX for teams with data catalogs |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "| Concept | Key rule |\n",
    "| --- | --- |\n",
    "| `DataFrameSchema` | Functional API: describe columns, constraints, table-level checks as a dict |\n",
    "| `DataFrameModel` | Class-based API: columns as annotated fields, checks as classmethods |\n",
    "| `pa.Field(ge=, le=)` | Same constraint vocabulary as Pydantic's `Field` |\n",
    "| `pa.Check.isin([...])` | Restrict a column to an explicit set of values |\n",
    "| `@pa.check(\"col\")` | Column-level custom check: receives the Series, returns bool or bool Series |\n",
    "| `@pa.dataframe_check` | Cross-column check: receives the full DataFrame |\n",
    "| `lazy=True` | Collect all failures; raises `SchemaErrors` (plural) |\n",
    "| `lazy=False` | Stop on first failure; raises `SchemaError` |\n",
    "| `schema.example(size=n)` | Generate n rows of valid synthetic data for testing |\n",
    "| Pydantic + Pandera | Validate rows with Pydantic, validate the assembled table with Pandera |\n",
    "\n",
    "**Next:** Part 21: Classical ML: Scikit-learn pipelines applied to the fully validated, typed `grade-predictor` dataset."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}