# Skill: Databricks Lakeflow — Data Engineering

## Metadata

| Field | Value |
|-------|-------|
| **id** | `databricks-lakeflow` |
| **domain** | `lakehouse` |
| **source** | https://learn.microsoft.com/en-us/azure/databricks/data-engineering/ |
| **extracted** | 2026-03-15 |
| **applies_to** | lakehouse, automations |
| **tags** | databricks, lakeflow, dlt, sdp, streaming, etl, connect, jobs, spark |

---

## What It Is

Lakeflow is Databricks' end-to-end data engineering solution. It has three components:

1. **Lakeflow Connect** — Managed connectors for ingestion
2. **Lakeflow Spark Declarative Pipelines (SDP)** — Declarative ETL framework (formerly DLT)
3. **Lakeflow Jobs** — Orchestration and production monitoring

## Lakeflow Connect (Ingestion)

| Connector Type | Description | IPAI Use |
|---------------|-------------|----------|
| **Managed connectors** | UI-based, config-driven, minimal ops | Odoo DB → Bronze (if DB replica available) |
| **Standard connectors** | Wider range, used within pipelines | ADLS, PostgreSQL, REST API |

### Relevant Managed Connectors

| Source | Connector | Status |
|--------|-----------|--------|
| PostgreSQL (Odoo DB) | `postgresql` | Available — use for Odoo → Bronze |
| Azure Blob / ADLS | `azure_storage` | Available — use for file ingestion |
| Kafka / Event Hubs | `kafka` | Available — for streaming events |
| REST API | Custom via standard | Build for Odoo Extract API |

## Lakeflow SDP (Transformation)

Formerly Delta Live Tables (DLT). A declarative framework for batch and streaming pipelines.

| Concept | Description | IPAI Equivalent |
|---------|-------------|-----------------|
| **Flows** | Process data using the Spark DataFrame API | n8n workflows (simpler) |
| **Streaming tables** | Delta tables with incremental processing | `ops.platform_events` (append-only) |
| **Materialized views** | Cached views with auto-refresh | Superset cached queries |
| **Sinks** | Output to Kafka, Event Hubs, external tables | Supabase REST API writes |
| **Pipelines** | Container that orchestrates flows, tables, views | n8n workflow sequences |

### SDP Pipeline Example (Python)

```python
import dlt
from pyspark.sql.functions import col, sum as spark_sum, trunc

# Bronze: ingest raw Odoo journal entries
@dlt.table(comment="Raw Odoo journal entries from Extract API")
def bronze_account_move():
    return (
        spark.read.format("parquet")
        .load("abfss://bronze@ipailakehouse.dfs.core.windows.net/odoo/account_move/")
    )

# Silver: clean and type
@dlt.table(comment="Cleaned journal entries")
@dlt.expect_or_drop("valid_state", "state IN ('posted', 'draft')")
def silver_account_move():
    return (
        dlt.read("bronze_account_move")
        .select(
            col("id"),
            col("name").alias("move_name"),
            col("date").cast("date").alias("move_date"),
            col("partner_id"),
            col("journal_id"),
            col("amount_total").cast("decimal(15,2)"),
            col("state"),
        )
    )

# Gold: monthly revenue mart
@dlt.table(comment="Monthly revenue aggregation")
def gold_monthly_revenue():
    return (
        dlt.read("silver_account_move")
        .filter(col("state") == "posted")
        .groupBy(trunc(col("move_date"), "month").alias("month"))
        .agg(spark_sum("amount_total").alias("total_revenue"))
    )
```

### Data Quality with Expectations

```python
@dlt.expect("valid_amount", "amount_total > 0")           # Warn but keep
@dlt.expect_or_drop("has_date", "move_date IS NOT NULL")  # Drop invalid
@dlt.expect_or_fail("valid_state", "state IS NOT NULL")   # Fail pipeline
```
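The Bronze table in the pipeline example above uses a batch `spark.read`; for incremental file ingestion from ADLS (the standard-connector path noted earlier), the same table can be declared as a streaming table with Auto Loader. A minimal sketch, assuming Parquet files land under the same `abfss://` path as the batch example; the function name `bronze_account_move_stream` is illustrative:

```python
import dlt

# Sketch only: incremental Bronze ingestion with Auto Loader (cloudFiles).
# The landing path matches the batch example above; SDP manages the
# checkpoint and schema-tracking locations for streaming tables.
@dlt.table(comment="Raw Odoo journal entries, ingested incrementally")
def bronze_account_move_stream():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("abfss://bronze@ipailakehouse.dfs.core.windows.net/odoo/account_move/")
    )
```

Downstream tables would read it with `dlt.read_stream("bronze_account_move_stream")` to keep the Silver and Gold steps incremental.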
## Lakeflow Jobs (Orchestration)

| Feature | Description | IPAI Equivalent |
|---------|-------------|-----------------|
| **Jobs** | Primary orchestration resource, scheduled | n8n cron workflows |
| **Tasks** | Unit of work (notebook, pipeline, SQL, ML) | n8n nodes |
| **Control flow** | If/else branching, for-each loops | n8n IF/Switch nodes |

### Job Types

| Task Type | Use Case |
|-----------|----------|
| Notebook | Python/SQL ETL scripts |
| Pipeline | Lakeflow SDP pipeline execution |
| SQL | Databricks SQL query execution |
| dbt | dbt project execution |
| Python wheel | Packaged Python applications |
| ML training | Model training jobs |

## Databricks Runtime

| Feature | Description |
|---------|-------------|
| **Photon** | High-performance vectorized query engine |
| **Structured Streaming** | Near real-time processing |
| **Autoscaling** | Automatic cluster scaling |
| **Delta Lake** | ACID transactions on Parquet |

## IPAI Integration Architecture

```
Source Systems          Databricks Lakeflow                             Consumption

Odoo PG ──────  Lakeflow Connect ──►  Bronze (Delta) ──►│
Supabase ─────  REST API / CDC ────►  Bronze (Delta) ──►│ SDP Pipeline ──► Silver ──► Gold
n8n events ───  Webhook / Batch ───►  Bronze (Delta) ──►│ (expectations)             │
                                                                                     ▼
                                                                          ┌─────────────────┐
                                                                          │ Superset (BI)   │
                                                                          │ Databricks SQL  │
                                                                          │ ML Models       │
                                                                          │ Reverse ETL     │
                                                                          └─────────────────┘
```

## Decision: n8n vs Lakeflow for ETL

| Criteria | n8n | Lakeflow |
|----------|-----|----------|
| Simple webhook routing | Winner | Overkill |
| Data quality (expectations) | Manual | Built-in |
| Large volume transforms | Limited (memory) | Spark-native (distributed) |
| Streaming | Not supported | Structured Streaming |
| Schema evolution | Manual | Delta Lake native |
| Cost | Self-hosted (free) | Databricks DBUs ($$$) |
| Monitoring | Basic | Production-grade |

**Recommendation**:

- **n8n** for: event routing, webhook handling, Odoo CRUD, Slack notifications, simple ETL (<100K rows)
- **Lakeflow** for: Bronze → Silver → Gold transforms, data quality, streaming, ML features (>100K rows)

## Next Steps (from n8n-odoo-supabase-etl skill)

1. Provision Databricks workspace (`dbw-ipai-dev` in `rg-ipai-ai-dev`)
2. Create Unity Catalog for data governance
3. Set up Lakeflow Connect for Odoo PostgreSQL
4. Build first SDP pipeline: `bronze_account_move` → `silver_account_move` → `gold_monthly_revenue`
5. Connect Superset to Databricks SQL warehouse for Gold queries
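To make step 4 concrete, here is a hedged sketch of scheduling the SDP pipeline as a Lakeflow Job using the Databricks SDK for Python (`databricks-sdk`). The job name, cron expression, and pipeline ID are placeholders, not values defined in this skill:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

# Sketch only: assumes the databricks-sdk package and a configured profile
# (host + token via environment variables or ~/.databrickscfg).
w = WorkspaceClient()

created = w.jobs.create(
    name="ipai-odoo-revenue-pipeline",           # hypothetical job name
    tasks=[
        jobs.Task(
            task_key="run_sdp_pipeline",
            pipeline_task=jobs.PipelineTask(
                pipeline_id="<sdp-pipeline-id>"  # ID of the deployed SDP pipeline
            ),
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 2 * * ?",    # daily at 02:00
        timezone_id="UTC",
    ),
)
print(f"Created job {created.job_id}")

# Trigger an immediate run instead of waiting for the schedule
w.jobs.run_now(job_id=created.job_id)
```

Scheduling through a Job rather than an ad hoc pipeline trigger keeps orchestration, retries, and production monitoring in one place, matching the Lakeflow Jobs section above.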