{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# LakeLogic — 5 Minute Quickstart\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LakeLogic/LakeLogic/blob/main/examples/colab/00_quickstart.ipynb) [![View on GitHub](https://img.shields.io/badge/github-view_source-black?logo=github)](https://github.com/lakelogic/LakeLogic/blob/main/examples/colab/00_quickstart.ipynb)\n", "\n", "One contract. One pipeline. Every row accounted for. Five minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import subprocess\n", "import sys\n", "import importlib.util\n", "import urllib.request\n", "import os\n", "\n", "if importlib.util.find_spec(\"lakelogic\") is None:\n", "    subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"--upgrade\", \"-q\", \"lakelogic[polars]\"])\n", "if not os.path.exists(\"_setup.py\"):\n", "    urllib.request.urlretrieve(\n", "        \"https://raw.githubusercontent.com/LakeLogic/LakeLogic/main/examples/colab/_setup.py\", \"_setup.py\"\n", "    )\n", "import _setup as s\n", "import lakelogic as ll" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Problem\n", "\n", "You have raw order data landing in your lake. Some rows have bad emails, negative amounts, or unknown statuses. You need to validate every row, quarantine the bad ones, and prove nothing was silently dropped — with zero custom Python logic."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Solution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "contract = s.write_contract(\n", "    \"\"\"\n", "version: 1.0.0\n", "dataset: orders\n", "\n", "info:\n", "  title: E-Commerce Orders\n", "  version: 1.0.0\n", "  owner: data-team@company.com\n", "  target_layer: silver\n", "\n", "model:\n", "  fields:\n", "    - name: order_id\n", "      type: integer\n", "      required: true\n", "    - name: customer_email\n", "      type: string\n", "      required: true\n", "      pii: true\n", "    - name: amount\n", "      type: float\n", "      required: true\n", "    - name: currency\n", "      type: string\n", "    - name: status\n", "      type: string\n", "    - name: created_at\n", "      type: string\n", "\n", "transformations:\n", "  - phase: \"post\"\n", "    derive:\n", "      field: \"amount_gbp\"\n", "      sql: \"CAST(CASE WHEN currency='USD' THEN amount*0.79 WHEN currency='EUR' THEN amount*0.86 ELSE amount END AS DECIMAL(10,2))\"\n", "\n", "quality:\n", "  row_rules:\n", "    - name: valid_email\n", "      sql: \"customer_email LIKE '%@%.%'\"\n", "    - name: positive_amount\n", "      sql: \"amount > 0\"\n", "    - name: valid_status\n", "      sql: \"status IN ('pending','shipped','delivered','returned')\"\n", "    - name: valid_currency\n", "      sql: \"currency IN ('GBP','USD','EUR')\"\n", "    - name: valid_order_id\n", "      sql: \"order_id > 0\"\n", "\n", "\"\"\",\n", "    \"00_quickstart_demo/orders_contract.yaml\",\n", ")\n", "\n", "# Generate 1000 rows — 10% intentionally bad\n", "source_df = ll.DataGenerator(contract).generate(rows=1000, invalid_ratio=0.10)\n", "\n", "# Run the pipeline\n", "proc = ll.DataProcessor(contract, engine=\"polars\")\n", "good, bad = proc.run(source_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bad" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Proof" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Every 
row accounted for\n", "s.assert_reconciliation(source_df, good, bad)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# What was caught\n", "print(\"Quarantined rows (sample):\")\n", "display(bad.head(10))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# What was good\n", "print(\"Valid rows (sample):\")\n", "display(good.head(10))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Full audit trail\n", "s.print_report(proc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What You Just Saw\n", "\n", "- **One YAML contract** defined schema, quality rules, and a derived column\n", "- **100% reconciliation** — source == good + bad, every row accounted for\n", "- **Automatic audit trail** — run ID, timestamp, per-rule failure counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Go Deeper — Explore by Capability\n", "\n", "Each notebook below maps to a pillar of LakeLogic's [Technical Capabilities](https://lakelogic.github.io/LakeLogic/#technical-capabilities):\n", "\n", "| # | Notebook | What You'll See |\n", "|---|---|---|\n", "| 🛡️ | **[Data Quality & Trust](01_data_quality_trust.ipynb)** | Reconciliation proofs, Pydantic validation, SQL-first rules, SLO monitoring |\n", "| 📜 | **[Compliance & Governance](02_compliance_governance.ipynb)** | GDPR erasure in 2 lines, automatic lineage, cost intelligence |\n", "| ⚡ | **[Engine & Scale](03_engine_scale.ipynb)** | Same contract on Polars & DuckDB, incremental processing, dry run |\n", "| 🔧 | **[Developer Experience](04_developer_experience.ipynb)** | Structured diagnostics, DDL generation, surgical resets, multi-channel alerts |\n", "| 🧬 | **[Data Generation & AI](05_data_generation_ai.ipynb)** | Synthetic data, referential integrity, edge case injection, contract inference |\n", "| 🔌 | **[Integrations](06_integrations.ipynb)** | 
dbt adapter, dlt sources, contract-driven quality gates on arrival |\n", "\n", "> **Each notebook is self-contained** — pick the capability that matters most to you and run it independently." ] } ], "metadata": { "colab": { "name": "LakeLogic — 5 Minute Quickstart", "provenance": [] }, "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.13" } }, "nbformat": 4, "nbformat_minor": 0 }