{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Compliance & Governance\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LakeLogic/LakeLogic/blob/main/examples/colab/02_compliance_governance.ipynb) [![View on GitHub](https://img.shields.io/badge/github-view_source-black?logo=github)](https://github.com/LakeLogic/LakeLogic/blob/main/examples/colab/02_compliance_governance.ipynb)\n", "\n", "GDPR erasure in 2 lines, automatic lineage, and per-entity cost intelligence." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import subprocess\n", "import sys\n", "import importlib.util\n", "import urllib.request\n", "import os\n", "\n", "if importlib.util.find_spec(\"lakelogic\") is None:\n", " subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"--upgrade\", \"-q\", \"lakelogic[polars]\"])\n", "if not os.path.exists(\"_setup.py\"):\n", " urllib.request.urlretrieve(\n", " \"https://raw.githubusercontent.com/LakeLogic/LakeLogic/main/examples/colab/_setup.py\", \"_setup.py\"\n", " )\n", "import _setup as s\n", "import lakelogic as ll" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 1. GDPR Erasure \u2014 `forget_subjects()` in 2 Lines\n", "\n", "**The Problem:** A customer submits an Article 17 erasure request. Your legal team needs proof that all PII was removed across every table \u2014 with a timestamped audit trail.\n", "\n", "**The Solution:** Mark fields as `pii: true` in the contract. Call `forget_subjects()`. Done." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from lakelogic.core.gdpr import forget_subjects\n", "\n", "contract_path = s.write_contract(\n", " \"\"\"\n", "version: 1.0.0\n", "dataset: customers\n", "\n", "model:\n", " fields:\n", " - name: customer_id\n", " type: string\n", " required: true\n", " - name: name\n", " type: string\n", " pii: true\n", " - name: email\n", " type: string\n", " pii: true\n", " - name: phone\n", " type: string\n", " pii: true\n", " - name: lifetime_value\n", " type: float\n", "\"\"\",\n", " \"02_compliance_governance_demo/customers.yaml\",\n", ")\n", "\n", "# Generate a dataset with PII\n", "source_df = ll.DataGenerator(contract_path).generate(rows=100)\n", "proc = ll.DataProcessor(contract_path, engine=\"polars\")\n", "good, _ = proc.run(source_df)\n", "\n", "import polars as pl\n", "\n", "sample_id = good[\"customer_id\"][0]\n", "print(f\"Before erasure ({sample_id}):\")\n", "\n", "display(\n", " good.filter(pl.col(\"customer_id\") == sample_id).select([\"customer_id\", \"name\", \"email\", \"phone\", \"lifetime_value\"])\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The Proof \u2014 erasure + audit trail\n", "erased = forget_subjects(\n", " good,\n", " proc.contract,\n", " subject_column=\"customer_id\",\n", " subject_ids=[sample_id],\n", " erasure_strategy=\"hash\",\n", ")\n", "\n", "print(f\"After erasure (subject: {sample_id}):\")\n", "audit_cols = [\n", " c\n", " for c in erased.columns\n", " if \"customer_id\" in c or \"name\" in c or \"email\" in c or \"phone\" in c or \"lifetime_value\" in c or \"_lakelogic_\" in c\n", "]\n", "display(erased.filter(pl.col(\"customer_id\") == sample_id).select(audit_cols[:8]))\n", "\n", "print(\"\\nPII fields hashed. Audit columns added. Erasure in 2 lines of code.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 2. 
Automatic Lineage \u2014 Trace Any Row to Its Source\n", "\n", "**The Problem:** An auditor asks: \"Where did this Gold-layer number come from?\" You spend 3 hours tracing through ETL jobs.\n", "\n", "**The Solution:** LakeLogic stamps every row with `_lakelogic_run_id`, `_lakelogic_loaded_at`, and `_lakelogic_source_path` \u2014 automatically." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run a pipeline and inspect lineage columns\n", "lineage_contract = s.write_contract(\n", " \"\"\"\n", "version: 1.0.0\n", "dataset: orders_lineage\n", "info:\n", " title: silver_orders\n", " target_layer: silver\n", "\n", "model:\n", " fields:\n", " - name: order_id\n", " type: integer\n", " required: true\n", " - name: amount\n", " type: float\n", " - name: status\n", " type: string\n", "\n", "quality:\n", " row_rules:\n", " - name: positive\n", " sql: \"amount > 0\"\n", "\n", "lineage:\n", " enabled: true\n", " upstream: [bronze.raw_orders]\n", "\n", "\"\"\",\n", " \"02_compliance_governance_demo/lineage.yaml\",\n", ")\n", "\n", "source_df = ll.DataGenerator(lineage_contract).generate(rows=50)\n", "proc = ll.DataProcessor(lineage_contract, engine=\"polars\")\n", "good, bad = proc.run(source_df, source_path=\"bronze/raw_orders/\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The Proof \u2014 lineage columns on every row\n", "lineage_cols = [c for c in good.columns if \"_lakelogic_\" in c]\n", "print(f\"Lineage columns added automatically: {lineage_cols}\")\n", "print()\n", "display(good.select([\"order_id\"] + lineage_cols).head(5))\n", "print(\"\\nEvery row traceable to its source. Every run has a unique ID.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 3. 
Pipeline Cost Intelligence\n", "\n", "**The Problem:** Your cloud bill grows 40% in a quarter but nobody knows which domain or pipeline is responsible.\n", "\n", "**The Solution:** LakeLogic estimates per-entity cost in every run report. Configure `metadata.cost` with a provider (`manual` or `databricks`) and LakeLogic calculates cost attribution using DBU rates, cluster scaling, and run duration." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run two contracts with cost tracking enabled\n", "small = s.write_contract(\n", " \"\"\"\n", "version: 1.0.0\n", "dataset: small_entity\n", "info:\n", " title: bronze_small_entity\n", " domain: marketing\n", " system: google_analytics\n", "\n", "metadata:\n", " domain: marketing\n", " system: google_analytics\n", " data_layer: bronze\n", " cost:\n", " provider: \"manual\"\n", " currency: \"USD\"\n", " rates:\n", " dbu_per_hour: 0.22\n", "\n", "model:\n", " fields:\n", " - name: id\n", " type: integer\n", " required: true\n", " - name: value\n", " type: string\n", "\"\"\",\n", " \"02_compliance_governance_demo/small.yaml\",\n", ")\n", "\n", "large = s.write_contract(\n", " \"\"\"\n", "version: 1.0.0\n", "dataset: large_entity\n", "info:\n", " title: bronze_large_entity\n", " domain: finance\n", " system: shopify\n", "\n", "metadata:\n", " domain: finance\n", " system: shopify\n", " data_layer: bronze\n", " cost:\n", " provider: \"manual\"\n", " currency: \"USD\"\n", " rates:\n", " dbu_per_hour: 0.55\n", " cluster:\n", " min_nodes: 2\n", " max_nodes: 8\n", " scaling_assumption: \"avg\"\n", "\n", "model:\n", " fields:\n", " - name: id\n", " type: integer\n", " required: true\n", " - name: value\n", " type: string\n", "\"\"\",\n", " \"02_compliance_governance_demo/large.yaml\",\n", ")\n", "\n", "p1 = ll.DataProcessor(small, engine=\"polars\")\n", "p1.run(ll.DataGenerator(small).generate(rows=100))\n", "r1 = p1.last_report\n", "\n", "p2 = ll.DataProcessor(large, 
engine=\"polars\")\n", "p2.run(ll.DataGenerator(large).generate(rows=5000))\n", "r2 = p2.last_report\n", "\n", "# \u2500\u2500 Per-Entity Cost Report \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n", "print(\"Per-Entity Cost Attribution\")\n", "print(\"=\" * 60)\n", "for label, r in [(\"marketing / small_entity\", r1), (\"finance / large_entity\", r2)]:\n", " counts = r.get(\"counts\", {})\n", " cost = r.get(\"estimated_cost\", 0) or 0\n", " currency = r.get(\"cost_currency\", \"USD\") or \"USD\"\n", " confidence = r.get(\"cost_confidence\", \"none\")\n", " duration = r.get(\"run_duration_seconds\", 0)\n", " print(f\" {label}\")\n", " print(f\" Rows : {counts.get('source', '?'):>6}\")\n", " print(f\" Duration : {duration:.3f}s\")\n", " print(f\" Cost : {currency} {cost:.6f} (confidence: {confidence})\")\n", " print()\n", "\n", "print(\"In production, these cost estimates flow into the run log\")\n", "print(\"and feed domain-level budget dashboards.\")\n", "print(\"\\nConfigure cost.provider in _system.yaml:\")\n", "print(\" manual \u2192 duration \u00d7 DBU rate \u00d7 nodes\")\n", "print(\" databricks \u2192 queries system.billing.usage for exact costs\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What You Just Saw\n", "\n", "- **GDPR erasure** \u2014 hash/nullify PII with an audit trail, in 2 lines\n", "- **Automatic lineage** \u2014 run ID and timestamp on every row, no config needed\n", "- **Cost intelligence** \u2014 per-entity, per-domain attribution in every run report" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Go Deeper \u2014 Explore by Capability\n", "\n", "Each notebook below maps to a pillar of LakeLogic's [Technical Capabilities](https://lakelogic.github.io/LakeLogic/#technical-capabilities):\n", "\n", "| # | Notebook | 
What You'll See |\n", "|---|---|---|\n", "| \ud83d\udee1\ufe0f | **[Data Quality & Trust](01_data_quality_trust.ipynb)** | Reconciliation proofs, Pydantic validation, SQL-first rules, SLO monitoring |\n", "| \ud83d\udcdc | **[Compliance & Governance](02_compliance_governance.ipynb)** | GDPR erasure in 2 lines, automatic lineage, cost intelligence |\n", "| \u26a1 | **[Engine & Scale](03_engine_scale.ipynb)** | Same contract on Polars & DuckDB, incremental processing, dry run |\n", "| \ud83d\udd27 | **[Developer Experience](04_developer_experience.ipynb)** | Structured diagnostics, DDL generation, surgical resets, multi-channel alerts |\n", "| \ud83e\uddec | **[Data Generation & AI](05_data_generation_ai.ipynb)** | Synthetic data, referential integrity, edge case injection, contract inference |\n", "| \ud83d\udd0c | **[Integrations](06_integrations.ipynb)** | dbt adapter, dlt sources, contract-driven quality gates on arrival |\n", "\n", "> **Each notebook is self-contained** \u2014 pick the capability that matters most to you and run it independently." ] } ], "metadata": { "colab": { "name": "LakeLogic \u2014 Compliance & Governance", "provenance": [] }, "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.13" } }, "nbformat": 4, "nbformat_minor": 0 }