{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Compliance & Governance\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LakeLogic/LakeLogic/blob/main/examples/colab/02_compliance_governance.ipynb) [![View on GitHub](https://img.shields.io/badge/github-view_source-black?logo=github)](https://github.com/LakeLogic/LakeLogic/blob/main/examples/colab/02_compliance_governance.ipynb)\n", "\n", "GDPR erasure in 2 lines, automatic lineage, and per-entity cost intelligence." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import subprocess\n", "import sys\n", "import importlib.util\n", "import urllib.request\n", "import os\n", "\n", "if importlib.util.find_spec(\"lakelogic\") is None:\n", " subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"--upgrade\", \"-q\", \"lakelogic[polars]\"])\n", "if not os.path.exists(\"_setup.py\"):\n", " urllib.request.urlretrieve(\n", " \"https://raw.githubusercontent.com/LakeLogic/LakeLogic/main/examples/colab/_setup.py\", \"_setup.py\"\n", " )\n", "import _setup as s\n", "import lakelogic as ll" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 1. GDPR Erasure \u2014 `forget_subjects()` in 2 Lines\n", "\n", "**The Problem:** A customer submits an Article 17 erasure request. Your legal team needs proof that all PII was removed across every table \u2014 with a timestamped audit trail.\n", "\n", "**The Solution:** Mark fields as `pii: true` in the contract. Call `forget_subjects()`. Done." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from lakelogic.core.gdpr import forget_subjects\n", "\n", "contract_path = s.write_contract(\n", " \"\"\"\n", "version: 1.0.0\n", "dataset: customers\n", "\n", "model:\n", " fields:\n", " - name: customer_id\n", " type: string\n", " required: true\n", " - name: name\n", " type: string\n", " pii: true\n", " - name: email\n", " type: string\n", " pii: true\n", " - name: phone\n", " type: string\n", " pii: true\n", " - name: lifetime_value\n", " type: float\n", "\"\"\",\n", " \"02_compliance_governance_demo/customers.yaml\",\n", ")\n", "\n", "# Generate a dataset with PII\n", "source_df = ll.DataGenerator(contract_path).generate(rows=100)\n", "proc = ll.DataProcessor(contract_path, engine=\"polars\")\n", "good, _ = proc.run(source_df)\n", "\n", "import polars as pl\n", "\n", "sample_id = good[\"customer_id\"][0]\n", "print(f\"Before erasure ({sample_id}):\")\n", "\n", "display(\n", " good.filter(pl.col(\"customer_id\") == sample_id).select([\"customer_id\", \"name\", \"email\", \"phone\", \"lifetime_value\"])\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The Proof \u2014 erasure + audit trail\n", "erased = forget_subjects(\n", " good,\n", " proc.contract,\n", " subject_column=\"customer_id\",\n", " subject_ids=[sample_id],\n", " erasure_strategy=\"hash\",\n", ")\n", "\n", "print(f\"After erasure (subject: {sample_id}):\")\n", "audit_cols = [\n", " c\n", " for c in erased.columns\n", " if \"customer_id\" in c or \"name\" in c or \"email\" in c or \"phone\" in c or \"lifetime_value\" in c or \"_lakelogic_\" in c\n", "]\n", "display(erased.filter(pl.col(\"customer_id\") == sample_id).select(audit_cols[:8]))\n", "\n", "print(\"\\nPII fields hashed. Audit columns added. Erasure in 2 lines of code.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 2. 
Automatic Lineage \u2014 Trace Any Row to Its Source\n", "\n", "**The Problem:** An auditor asks: \"Where did this Gold-layer number come from?\" You spend 3 hours tracing through ETL jobs.\n", "\n", "**The Solution:** LakeLogic stamps every row with `_lakelogic_run_id`, `_lakelogic_loaded_at`, and `_lakelogic_source_path` \u2014 automatically." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run a pipeline and inspect lineage columns\n", "lineage_contract = s.write_contract(\n", " \"\"\"\n", "version: 1.0.0\n", "dataset: orders_lineage\n", "info:\n", " title: silver_orders\n", " target_layer: silver\n", "\n", "model:\n", " fields:\n", " - name: order_id\n", " type: integer\n", " required: true\n", " - name: amount\n", " type: float\n", " - name: status\n", " type: string\n", "\n", "quality:\n", " row_rules:\n", " - name: positive\n", " sql: \"amount > 0\"\n", "\n", "lineage:\n", " enabled: true\n", " upstream: [bronze.raw_orders]\n", "\n", "\"\"\",\n", " \"02_compliance_governance_demo/lineage.yaml\",\n", ")\n", "\n", "source_df = ll.DataGenerator(lineage_contract).generate(rows=50)\n", "proc = ll.DataProcessor(lineage_contract, engine=\"polars\")\n", "good, bad = proc.run(source_df, source_path=\"bronze/raw_orders/\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The Proof \u2014 lineage columns on every row\n", "lineage_cols = [c for c in good.columns if \"_lakelogic_\" in c]\n", "print(f\"Lineage columns added automatically: {lineage_cols}\")\n", "print()\n", "display(good.select([\"order_id\"] + lineage_cols).head(5))\n", "print(\"\\nEvery row traceable to its source. Every run has a unique ID.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 3. 
Pipeline Cost Intelligence\n", "\n", "**The Problem:** Your cloud bill grows 40% in a quarter but nobody knows which domain or pipeline is responsible.\n", "\n", "**The Solution:** LakeLogic estimates per-entity cost in every run report. Configure `metadata.cost` with a provider (`manual` or `databricks`) and LakeLogic calculates cost attribution using DBU rates, cluster scaling, and run duration." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run two contracts with cost tracking enabled\n", "small = s.write_contract(\n", " \"\"\"\n", "version: 1.0.0\n", "dataset: small_entity\n", "info:\n", " title: bronze_small_entity\n", " domain: marketing\n", " system: google_analytics\n", "\n", "metadata:\n", " domain: marketing\n", " system: google_analytics\n", " data_layer: bronze\n", " cost:\n", " provider: \"manual\"\n", " currency: \"USD\"\n", " rates:\n", " dbu_per_hour: 0.22\n", "\n", "model:\n", " fields:\n", " - name: id\n", " type: integer\n", " required: true\n", " - name: value\n", " type: string\n", "\"\"\",\n", " \"02_compliance_governance_demo/small.yaml\",\n", ")\n", "\n", "large = s.write_contract(\n", " \"\"\"\n", "version: 1.0.0\n", "dataset: large_entity\n", "info:\n", " title: bronze_large_entity\n", " domain: finance\n", " system: shopify\n", "\n", "metadata:\n", " domain: finance\n", " system: shopify\n", " data_layer: bronze\n", " cost:\n", " provider: \"manual\"\n", " currency: \"USD\"\n", " rates:\n", " dbu_per_hour: 0.55\n", " cluster:\n", " min_nodes: 2\n", " max_nodes: 8\n", " scaling_assumption: \"avg\"\n", "\n", "model:\n", " fields:\n", " - name: id\n", " type: integer\n", " required: true\n", " - name: value\n", " type: string\n", "\"\"\",\n", " \"02_compliance_governance_demo/large.yaml\",\n", ")\n", "\n", "p1 = ll.DataProcessor(small, engine=\"polars\")\n", "p1.run(ll.DataGenerator(small).generate(rows=100))\n", "r1 = p1.last_report\n", "\n", "p2 = ll.DataProcessor(large, 
engine=\"polars\")\n", "p2.run(ll.DataGenerator(large).generate(rows=5000))\n", "r2 = p2.last_report\n", "\n", "# \u2500\u2500 Per-Entity Cost Report \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n", "print(\"Per-Entity Cost Attribution\")\n", "print(\"=\" * 60)\n", "for label, r in [(\"marketing / small_entity\", r1), (\"finance / large_entity\", r2)]:\n", " counts = r.get(\"counts\", {})\n", " cost = r.get(\"estimated_cost\", 0) or 0\n", " currency = r.get(\"cost_currency\", \"USD\") or \"USD\"\n", " confidence = r.get(\"cost_confidence\", \"none\")\n", " duration = r.get(\"run_duration_seconds\", 0)\n", " print(f\" {label}\")\n", " print(f\" Rows : {counts.get('source', '?'):>6}\")\n", " print(f\" Duration : {duration:.3f}s\")\n", " print(f\" Cost : {currency} {cost:.6f} (confidence: {confidence})\")\n", " print()\n", "\n", "print(\"In production, these cost estimates flow into the run log\")\n", "print(\"and feed domain-level budget dashboards.\")\n", "print(\"\\nConfigure cost.provider in _system.yaml:\")\n", "print(\" manual \u2192 duration \u00d7 DBU rate \u00d7 nodes\")\n", "print(\" databricks \u2192 queries system.billing.usage for exact costs\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What You Just Saw\n", "\n", "- **GDPR erasure** \u2014 hash/nullify PII with an audit trail, in 2 lines\n", "- **Automatic lineage** \u2014 run ID and timestamp on every row, no config needed\n", "- **Cost intelligence** \u2014 per-entity, per-domain attribution in every run report" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Go Deeper \u2014 Explore by Capability\n", "\n", "Each notebook below maps to a pillar of LakeLogic's [Technical Capabilities](https://lakelogic.github.io/LakeLogic/#technical-capabilities):\n", "\n", "| # | Notebook | 
What You'll See |\n", "|---|---|---|\n", "| \ud83d\udee1\ufe0f | **[Data Quality & Trust](01_data_quality_trust.ipynb)** | Reconciliation proofs, Pydantic validation, SQL-first rules, SLO monitoring |\n", "| \ud83d\udcdc | **[Compliance & Governance](02_compliance_governance.ipynb)** | GDPR erasure in 2 lines, automatic lineage, cost intelligence |\n", "| \u26a1 | **[Engine & Scale](03_engine_scale.ipynb)** | Same contract on Polars & DuckDB, incremental processing, dry run |\n", "| \ud83d\udd27 | **[Developer Experience](04_developer_experience.ipynb)** | Structured diagnostics, DDL generation, surgical resets, multi-channel alerts |\n", "| \ud83e\uddec | **[Data Generation & AI](05_data_generation_ai.ipynb)** | Synthetic data, referential integrity, edge case injection, contract inference |\n", "| \ud83d\udd0c | **[Integrations](06_integrations.ipynb)** | dbt adapter, dlt sources, contract-driven quality gates on arrival |\n", "\n", "> **Each notebook is self-contained** \u2014 pick the capability that matters most to you and run it independently." ] } ], "metadata": { "colab": { "name": "LakeLogic \u2014 Compliance & Governance", "provenance": [] }, "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.13" } }, "nbformat": 4, "nbformat_minor": 0 }