---
name: e2e-medallion-architecture
description: >
  Implement end-to-end Medallion Architecture (Bronze/Silver/Gold) lakehouse patterns in Microsoft Fabric
  using PySpark, Delta Lake, and Fabric Pipelines. Use when the user wants to: (1) design a Bronze/Silver/Gold
  data lakehouse, (2) set up multi-layer workspaces with lakehouses for each tier, (3) build
  ingestion-to-analytics pipelines with data quality enforcement, (4) optimize Spark configurations per
  medallion layer, (5) orchestrate Bronze-to-Silver-to-Gold flows via notebooks. Triggers: "medallion
  architecture", "bronze silver gold", "lakehouse layers", "e2e data pipeline", "end-to-end lakehouse",
  "data lakehouse pattern", "multi-layer lakehouse", "build medallion", "setup medallion".
---

> **Update Check — ONCE PER SESSION (mandatory)**
> The first time this skill is used in a session, run the **check-updates** skill before proceeding.
> - **GitHub Copilot CLI / VS Code**: invoke the `check-updates` skill.
> - **Claude Code / Cowork / Cursor / Windsurf / Codex**: compare the local vs remote package.json version.
> - Skip if the check was already performed earlier in this session.

> **CRITICAL NOTES**
> 1. To find workspace details (including its ID) from a workspace name: list all workspaces and then use JMESPath filtering.
> 2. To find item details (including its ID) from a workspace ID, item type, and item name: list all items of that type in that workspace and then use JMESPath filtering.

# End-to-End Medallion Architecture

## Prerequisite Knowledge

Read these companion documents — they contain the foundational context this skill depends on:

- [COMMON-CORE.md](../../common/COMMON-CORE.md) — Fabric REST API patterns, authentication, token audiences, item discovery
- [COMMON-CLI.md](../../common/COMMON-CLI.md) — `az rest`, `az login`, token acquisition, Fabric REST via CLI
- [SPARK-AUTHORING-CORE.md](../../common/SPARK-AUTHORING-CORE.md) — Notebook deployment, lakehouse creation, job execution
- [notebook-api-operations.md](../spark-authoring-cli/resources/notebook-api-operations.md) — **Required for notebook creation** — `.ipynb` structure requirements, cell format, `getDefinition`/`updateDefinition` workflow

For Spark-specific optimization details, see [data-engineering-patterns.md](../spark-authoring-cli/resources/data-engineering-patterns.md).
---

## Architecture Overview

**Medallion Architecture** is a data lakehouse pattern with three progressive layers:

| Layer | Purpose | Optimization Profile | Use Case |
|-------|---------|---------------------|----------|
| **Bronze** (Raw) | Land raw data exactly as received | Write-optimized, append-only, partitioned by ingestion date | Audit trail, reprocessing, lineage |
| **Silver** (Cleaned) | Deduplicated, validated, conformed data | Balanced read/write, partitioned by business date | Feature engineering, operational reporting |
| **Gold** (Aggregated) | Pre-calculated metrics for analytics | Read-optimized (ZORDER, compaction), partitioned by month/year | Power BI reports, dashboards, ad-hoc analytics via SQL endpoint |

- **Bronze**: Schema-on-read — flexible schema; Delta time travel supports audit and rollback
- **Silver**: Schema enforcement — reject non-conforming writes; handle schema evolution with `mergeSchema` when sources change
- **Gold**: Strict schema governance — curated, business-approved datasets only

---

## Must/Prefer/Avoid

### MUST DO

- Create a **separate lakehouse** for each medallion layer (Bronze, Silver, Gold)
- Add **metadata columns** in Bronze: ingestion timestamp, source file, batch ID
- Apply **data quality rules** in the Bronze-to-Silver transformation (deduplication, null handling, range validation)
- Use **Delta Lake format** for all medallion layer tables
- Use **partition-aware overwrite** in Silver/Gold writes to avoid reprocessing unchanged data
- Include **validation steps** after each layer (row counts, schema checks, anomaly detection)
- Follow the **`.ipynb` validation + Fabric nuances** in [notebook-api-operations.md](../spark-authoring-cli/resources/notebook-api-operations.md#ipynb-validation--fabric-nuances) when creating notebooks via REST API — every code cell must include `"outputs": []` and `"execution_count": null`
- **Default to separate workspaces per layer** for governance and access control: one workspace each for Bronze, Silver, and Gold
- **Complete the full end-to-end flow** — do not stop after creating notebooks; always bind lakehouses, execute notebooks sequentially (Bronze → Silver → Gold), verify results, and connect Power BI to the Gold layer unless the user explicitly requests a partial setup

### PREFER

- Incremental processing (watermark pattern) over full refresh — see the sketch after the AVOID list
- Separate notebooks per layer for independent testing and debugging
- ZORDER on frequently filtered columns in Gold tables
- Running OPTIMIZE after writes in Silver and Gold layers
- Environment-specific Spark configs (write-heavy for Bronze, balanced for Silver, read-heavy for Gold)
- OneLake shortcuts to expose Gold data to consumer workspaces without duplication
- Clear layer ownership: engineers own Bronze/Silver, analysts own Gold
- Fabric Variable Libraries to centralize paths and configuration across layers
- Multi-workspace deployment patterns for medium/high governance requirements (Bronze/Silver/Gold in separate workspaces)

### AVOID

- Storing all layers in a single lakehouse — this defeats isolation and independent optimization
- Skipping the Silver layer and going directly from Bronze to Gold
- Hardcoded workspace IDs, lakehouse IDs, or FQDNs — discover them via the REST API
- `SELECT *` without `LIMIT` on Bronze tables (they grow without bound)
- Running VACUUM without checking downstream dependencies
- Chaining OneLake shortcuts between medallion layers (Bronze→Silver→Gold) — each layer must be physically materialized for lineage and governance
- Copying complete implementation code into skills — guide the LLM to generate it instead
- Reading from **external HTTP/HTTPS URLs** directly in Spark — Fabric Spark cannot access arbitrary external URLs; land data in lakehouse `Files/` first (via `curl`, the OneLake API, or a Fabric pipeline Copy activity), then read from the lakehouse path
- Creating notebooks via REST API **without validating `.ipynb` structure** — missing `execution_count: null` or `outputs: []` on code cells causes silent failures or "Job instance failed without detail error"
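The watermark pattern preferred above can be sketched as follows — a minimal, illustrative example only. It assumes a Bronze table with the `ingestion_timestamp` metadata column added during ingestion and a Silver table that doubles as the watermark source; the table and column names (`bronze_events`, `silver_events`) are hypothetical.

```python
# Hedged sketch: watermark-based incremental processing (Bronze -> Silver).
# Table and column names are illustrative, not prescriptive.
from pyspark.sql import functions as F

# 1. Determine the last processed watermark (read here from the Silver table itself;
#    a dedicated control table works equally well).
last_watermark = (
    spark.table("silver_events")
    .agg(F.max("ingestion_timestamp"))
    .collect()[0][0]
)

# 2. Read only Bronze rows newer than the watermark (falls back to a full read on the first run).
bronze_df = spark.table("bronze_events")
if last_watermark is not None:
    bronze_df = bronze_df.filter(F.col("ingestion_timestamp") > F.lit(last_watermark))

# 3. Apply the Silver quality rules to bronze_df, then append/merge into Silver.
print(f"Processing {bronze_df.count()} new Bronze rows since watermark {last_watermark}")
```

The same idea scales to Gold: each layer only reprocesses rows newer than the watermark it last committed, rather than rebuilding the table on every run.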
---

## Workspace Setup Guidance

When setting up a medallion workspace, guide the LLM to generate commands for:

1. **Default architecture: create three workspaces** (recommended):
   - `{project}-bronze-{env}`
   - `{project}-silver-{env}`
   - `{project}-gold-{env}`
2. **Create one lakehouse per workspace**:
   - Bronze workspace → `{project}_bronze` lakehouse
   - Silver workspace → `{project}_silver` lakehouse
   - Gold workspace → `{project}_gold` lakehouse
3. **Assign RBAC per layer workspace**:
   - Bronze: ingestion/engineering write permissions
   - Silver: engineering/data quality permissions
   - Gold: analytics/BI consumer access with stricter curation controls
4. **Create notebooks** for each layer (one per transformation stage) — follow the `.ipynb` validation + Fabric nuances (a minimal notebook skeleton is sketched after this section)
5. **Bind each notebook to its lakehouse** — set `metadata.dependencies.lakehouse` with the correct lakehouse ID (see [notebook-api-operations.md § Default Lakehouse Binding](../spark-authoring-cli/resources/notebook-api-operations.md#default-lakehouse-binding)):
   - Bronze notebook → Bronze workspace/lakehouse
   - Silver notebook → Silver workspace/lakehouse (reads Bronze via cross-workspace OneLake access / fully qualified references)
   - Gold notebook → Gold workspace/lakehouse (reads Silver via cross-workspace access)
6. **Confirm notebook deployment** — check that `updateDefinition` returned `Succeeded`; this is sufficient confirmation that content and lakehouse binding persisted. Do NOT call `getDefinition` to re-verify — it is an async LRO and adds unnecessary latency.
7. **Execute notebooks** sequentially — Bronze first, then Silver, then Gold — using `POST .../jobs/instances?jobType=RunNotebook` with the correct `defaultLakehouse` in the execution config (both `id` and `name` required)
8. **Connect Power BI to the Gold layer** — discover the Gold lakehouse SQL endpoint, create a Direct Lake semantic model, and create a report with visuals on the Gold summary table (see [Gold Layer → Power BI Consumption](#gold-layer--power-bi-consumption))
9. **Create a pipeline** to orchestrate the Bronze → Silver → Gold flow for recurring execution

### Explicit Override: Single Workspace

If the user explicitly asks for a single-workspace deployment (for example, a POC, small team, or monolithic pattern), use the single-workspace layout:

- One workspace with separate Bronze/Silver/Gold lakehouses
- Preserve layer separation logically even when the workspace is shared
- Call out the governance trade-offs versus the multi-workspace design

Parameterize by environment: workspace name suffix (`-dev`, `-prod`), data volume (sample vs full), capacity SKU, and Bronze retention period.
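To make steps 4–5 concrete, the sketch below shows the minimal `.ipynb` shape this skill expects — `outputs: []` and `execution_count: null` on every code cell, plus the `metadata.dependencies.lakehouse` binding. The GUIDs, display names, and cell contents are placeholders; always follow notebook-api-operations.md for the authoritative structure before deploying.

```python
# Hedged sketch: minimal .ipynb payload with the Fabric lakehouse binding.
# GUIDs and names are placeholders; verify against notebook-api-operations.md.
import base64
import json

notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {
        "dependencies": {
            "lakehouse": {
                "default_lakehouse": "<bronze-lakehouse-guid>",
                "default_lakehouse_name": "sales_bronze",
                "default_lakehouse_workspace_id": "<bronze-workspace-guid>",
            }
        }
    },
    "cells": [
        {
            "cell_type": "code",
            "source": ["df = spark.read.format('csv').option('header', True).load('Files/landing/daily/')"],
            "metadata": {},
            "outputs": [],            # required — omitting this causes silent failures
            "execution_count": None,  # serialized as null, as Fabric expects
        }
    ],
}

# Base64-encode for the updateDefinition payload (see SPARK-AUTHORING-CORE.md).
payload = base64.b64encode(json.dumps(notebook).encode("utf-8")).decode("utf-8")
print(payload[:80], "...")
```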
---

## Bronze Layer — Ingestion Patterns

When a user requests data ingestion into the Bronze layer, guide the LLM to:

1. **Land data in the lakehouse first**: External data must be staged into the lakehouse `Files/` folder before Spark can read it — use one of:
   - **Fabric Pipeline Copy activity** (preferred for recurring loads) — connects to external sources (HTTP, FTP, databases, cloud storage) and writes to OneLake
   - **OneLake API / `curl`** — upload files via REST API using a `storage.azure.com` token (see COMMON-CLI.md § OneLake Data Access)
   - **OneLake Shortcut** — for data already in Azure ADLS Gen2, S3, or another OneLake location
   - **`notebookutils.fs`** — copy from mounted storage paths within a notebook
   - ⚠️ **Fabric Spark cannot read from arbitrary HTTP/HTTPS URLs** — `spark.read.format("csv").load("https://...")` will fail
2. **Read from the lakehouse path**: Once data is in `Files/`, read using lakehouse-relative paths (e.g., `spark.read.format("csv").load("Files/landing/daily/")`)
3. **Add metadata and write**: Add tracking columns (ingestion timestamp, source file, batch ID), write to a Delta table with a descriptive name, partition by ingestion date, use append mode
4. **Validate**: Log row counts, validate the schema structure, flag anomalies versus historical patterns

---

## Silver Layer — Transformation Patterns

When a user requests a Bronze-to-Silver transformation, guide the LLM to apply:

- **Quality rules**: Deduplicate on a natural/composite key, filter invalid ranges, handle nulls (drop rows missing required fields, fill optional ones), validate logical constraints
- **Schema conformance**: snake_case column names, standardized data types, derived columns (durations, percentages, categories)
- **Schema evolution**: Use the `mergeSchema` option when source schemas change; coordinate downstream updates to Gold tables and Power BI datasets
- **Write strategy**: Partition by business date, use partition-aware overwrite, run OPTIMIZE after the write, log before/after metrics

---

## Gold Layer — Aggregation Patterns

When a user requests Gold analytics tables, guide the LLM to generate:

- **Common aggregates**: Daily/weekly/monthly summaries, dimensional analysis (by location, category, type), trend breakdowns over time, demand patterns (hour-of-day, day-of-week)
- **Spark session config** — set these properties in the Gold notebook **before** any write operations:

  ```python
  spark.conf.set("spark.sql.parquet.vorder.default", "true")
  spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
  spark.conf.set("spark.databricks.delta.optimizeWrite.binSize", "1g")
  ```

  - **V-Order** (`vorder.default`) — applies Fabric's columnar sort optimization to all Parquet files, dramatically improving Direct Lake and SQL endpoint read performance
  - **Optimize Write** (`optimizeWrite.enabled`) — coalesces small partitions into optimally sized files (target ~1 GB per `binSize`), reducing file count and improving scan efficiency
- **Optimization**: ZORDER on filter columns, run OPTIMIZE after writes, pre-aggregate metrics to avoid runtime computation (a minimal Gold aggregation sketch follows this list)
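A minimal sketch of a Gold aggregation following these patterns is shown below. The table and column names (`silver_trips`, `gold_daily_summary`, `trip_date`, `fare_amount`) are illustrative assumptions, and the `OPTIMIZE ... ZORDER BY` statement assumes the Delta SQL support available in the Fabric Spark runtime.

```python
# Hedged sketch: Silver -> Gold daily summary (names are illustrative only).
from pyspark.sql import functions as F

gold_df = (
    spark.table("silver_trips")
    .groupBy("trip_date")
    .agg(
        F.count("*").alias("total_trips"),
        F.sum("fare_amount").alias("total_revenue"),
        F.avg("fare_amount").alias("avg_fare"),
    )
    .withColumn("year_month", F.date_format("trip_date", "yyyy-MM"))
)

# Partition by month for read-optimized scans; overwrite keeps the summary idempotent.
(gold_df.write
    .mode("overwrite")
    .partitionBy("year_month")
    .format("delta")
    .saveAsTable("gold_daily_summary"))

# Compact files and co-locate data on the most common filter column.
spark.sql("OPTIMIZE gold_daily_summary ZORDER BY (trip_date)")
```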
---

## End-to-End Execution Flow

When setting up medallion architecture end-to-end, the LLM **must not stop** after creating notebooks and deploying code. The complete lifecycle is:

```
Create Resources → Deploy Content → Bind Lakehouses → Execute → Verify Results
```

### Step-by-Step

1. **Create layer workspaces and lakehouses (default)** — one workspace and one lakehouse per layer (Bronze, Silver, Gold); capture the workspace IDs and lakehouse IDs
2. **Create notebooks** — one per layer, with valid `.ipynb` structure (see [notebook-api-operations.md](../spark-authoring-cli/resources/notebook-api-operations.md))
3. **Bind a lakehouse to each notebook** — include `metadata.dependencies.lakehouse` in the `.ipynb` payload with:
   - `default_lakehouse`: the target lakehouse GUID
   - `default_lakehouse_name`: the lakehouse display name
   - `default_lakehouse_workspace_id`: the workspace GUID
4. **Deploy notebook content** — `updateDefinition` with the Base64-encoded `.ipynb` payload (content + lakehouse binding together)
5. **Confirm deployment** — check that each `updateDefinition` LRO returned `Succeeded`; that is sufficient. Do NOT call `getDefinition` to re-verify — it is an async LRO and adds significant latency per notebook.
6. **Execute notebooks sequentially** — use `POST .../jobs/instances?jobType=RunNotebook` (payload shape sketched after this list):
   - Pass `defaultLakehouse` with both `id` and `name` in `executionData.configuration`
   - Run Bronze first → poll until `Completed` → run Silver → poll → run Gold → poll
   - Check for recent jobs before submitting (to prevent duplicates — see SPARK-AUTHORING-CORE.md)
7. **Verify results** — after each notebook completes, confirm that the expected tables exist and the row counts are reasonable
8. **Connect Power BI to Gold** — create a semantic model + report on the Gold summary tables (see [Gold Layer → Power BI Consumption](#gold-layer--power-bi-consumption))
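To illustrate step 6, the sketch below builds the job-submission body. The only shape asserted by this skill is a `defaultLakehouse` with both `id` and `name` under `executionData.configuration`; the placeholder GUID and name are assumptions, and the resulting JSON would be submitted via `az rest` against the `.../jobs/instances?jobType=RunNotebook` endpoint referenced above (see COMMON-CLI.md for the call pattern).

```python
# Hedged sketch: RunNotebook job body with the defaultLakehouse binding from step 6.
# The lakehouse GUID and name are placeholders.
import json

run_body = {
    "executionData": {
        "configuration": {
            "defaultLakehouse": {
                "id": "<bronze-lakehouse-guid>",  # both id and name are required
                "name": "sales_bronze",
            }
        }
    }
}

# Write this JSON to a file and pass it as the --body of the POST.
print(json.dumps(run_body, indent=2))
```

The same shape is reused for the Silver and Gold runs, swapping in each layer's lakehouse, with polling between submissions until each job reports `Completed`.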
### Common Failure: Stopping After Notebook Creation

If the flow stops after deploying notebook code without binding or executing:

- Notebooks will have no lakehouse context → `spark.sql()` and relative paths (`Tables/`, `Files/`) fail at runtime
- The user sees no output or results — the architecture is set up but never tested
- **Always complete through step 7** unless the user explicitly asks to stop at a specific step

---

## Gold Layer → Power BI Consumption

After the Gold tables are populated, connect Power BI to surface the analytics: build a semantic model on top of the Gold lakehouse using Direct Lake.

### Step-by-Step

1. **Discover the Gold lakehouse SQL endpoint** — call `GET /v1/workspaces/{workspaceId}/lakehouses/{goldLakehouseId}` and extract `properties.sqlEndpointProperties.connectionString` and `provisioningStatus`; wait until the status is `Success`
2. **Verify Gold tables via SQL** — connect to the SQL endpoint using `sqlcmd` (see [COMMON-CLI.md § SQL / TDS Data-Plane Access](../../common/COMMON-CLI.md#sql--tds-data-plane-access)) and confirm the target table exists:

   ```sql
   SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'nyc_taxi_daily_summary'
   ```

3. **Create a semantic model** — use the [powerbi-authoring-cli](../powerbi-authoring-cli/SKILL.md) skill for semantic model creation and TMDL deployment. Create via `POST /v1/workspaces/{workspaceId}/items` with `type: "SemanticModel"`, then deploy the definition via `updateDefinition` using TMDL format (see [ITEM-DEFINITIONS-CORE.md § SemanticModel](../../common/ITEM-DEFINITIONS-CORE.md#semanticmodel)):
   - The model must reference the Gold lakehouse SQL endpoint as its data source
   - Define a table mapping to the Gold summary table (e.g., `nyc_taxi_daily_summary`)
   - Use **Direct Lake** mode — this connects directly to Delta tables in OneLake without data import
   - Include measures for the key aggregations of interest (e.g., `Total Trips`, `Avg Fare`, `Total Revenue`, `Month over Month Growth`)
4. **Create a Power BI report** — `POST /v1/workspaces/{workspaceId}/items` with `type: "Report"`, then deploy the definition via `updateDefinition` using PBIR format (see [ITEM-DEFINITIONS-CORE.md § Report](../../common/ITEM-DEFINITIONS-CORE.md#report)):
   - Reference the semantic model created in step 3 via `definition.pbir`
   - Define at least one page with visuals on the Gold summary table
   - Suggested visuals: line chart (daily trend), card (KPI totals), bar chart (by category), table (detail view)
5. **Verify end-to-end** — use the `powerbi-consumption-cli` skill to run DAX queries against the semantic model and confirm data flows from the Gold tables through to the report

### Principles

- **Discover the SQL endpoint dynamically** — the connection string is in `properties.sqlEndpointProperties.connectionString` on the lakehouse response; never hardcode it
- **Wait for SQL endpoint provisioning** — the status must be `Success` before connecting; newly created lakehouses may take minutes to provision
- **Prefer Direct Lake mode** — avoids data duplication; the semantic model reads directly from OneLake Delta tables
- **Match table/column names exactly** — the semantic model table definition must use the exact Delta table and column names from the Gold lakehouse
- **For semantic model authoring** (TMDL, refresh, permissions), cross-reference the [powerbi-authoring-cli](../powerbi-authoring-cli/SKILL.md) skill
- **For DAX query validation**, cross-reference the [powerbi-consumption-cli](../powerbi-consumption-cli/SKILL.md) skill

---

## Pipeline Orchestration

When a user requests a pipeline for the medallion flow, guide the LLM to design it with:

- **Structure**: Sequential activities (Bronze → Silver → Gold), each waiting for the previous one to succeed; independent Gold aggregations can run in parallel; include validation and notification activities
- **Parameterization**: A pipeline-level processing date (defaulting to yesterday) passed to all notebooks; dynamic date expressions
- **Scheduling**: Daily runs aligned with the source refresh, watermark-based incremental processing, periodic full refresh for corrections
- **Error handling**: Retry with backoff for transient failures, alerting for persistent failures, graceful degradation (downstream uses the previous data if upstream fails)

---

## Environment Optimization

**For detailed Spark configurations and optimization strategies, see [data-engineering-patterns.md](../spark-authoring-cli/resources/data-engineering-patterns.md).** A per-layer config sketch follows the table below.

| Layer | Profile | Key Settings |
|-------|---------|-------------|
| Bronze | Write-heavy | Disable V-Order, enable autoCompact, large file targets, partition by ingestion_date |
| Silver | Balanced | Enable V-Order, adaptive query execution, partition by business date, ZORDER on filtered columns |
| Gold | Read-heavy | V-Order (`spark.sql.parquet.vorder.default=true`), Optimize Write (`optimizeWrite.enabled=true`, `binSize=1g`), vectorized readers, adaptive execution, ZORDER on all filter columns, pre-aggregate metrics |
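A hedged sketch of how the per-layer profiles above might translate into session config. The V-Order and Optimize Write keys come from the Gold Layer section of this skill; the Bronze `autoCompact` key and the adaptive-execution key are assumptions to verify against data-engineering-patterns.md. Each block would live in its own layer notebook — they are shown together here only for comparison.

```python
# Hedged per-layer Spark session config sketch; verify exact keys against
# data-engineering-patterns.md before relying on them.

# Bronze — write-heavy: skip V-Order sorting on write, let autoCompact merge small files.
spark.conf.set("spark.sql.parquet.vorder.default", "false")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")  # assumed key for autoCompact

# Silver — balanced: V-Order plus adaptive query execution.
spark.conf.set("spark.sql.parquet.vorder.default", "true")
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Gold — read-heavy: V-Order plus Optimize Write targeting ~1 GB files.
spark.conf.set("spark.sql.parquet.vorder.default", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.binSize", "1g")
```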
---

## Examples

### Example 1: Set Up Medallion Workspaces (Default)

**Prompt**: "Set up medallion architecture with separate Bronze, Silver, and Gold workspaces for sales analytics"

**What the LLM should generate**: REST API calls to:

1. Create workspaces: `sales-bronze-dev`, `sales-silver-dev`, `sales-gold-dev`
2. Create one lakehouse in each workspace: `sales_bronze`, `sales_silver`, `sales_gold`
3. Assign RBAC roles per workspace/layer

```bash
# Workspace creation (see COMMON-CLI.md for full patterns) — repeat for the silver and gold workspaces
cat > /tmp/body.json << 'EOF'
{"displayName": "sales-bronze-dev"}
EOF
workspace_id=$(az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces" \
  --body @/tmp/body.json --query "id" --output tsv)

# Create the Bronze lakehouse in the Bronze workspace
cat > /tmp/body.json << 'EOF'
{"displayName": "sales_bronze", "type": "Lakehouse"}
EOF
az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces/$workspace_id/items" \
  --body @/tmp/body.json
```

### Example 2: Design Bronze Ingestion

**Prompt**: "Ingest daily CSV files into bronze lakehouse with metadata columns"

**What the LLM should generate**: PySpark notebook that:

1. Reads source files with schema inference or an explicit schema
2. Adds `ingestion_timestamp`, `source_file`, `batch_id` columns
3. Writes to a Delta table partitioned by ingestion date
4. Logs row count and validation metrics

```python
# Bronze ingestion pattern (guide the LLM to generate the full implementation)
from pyspark.sql.functions import current_timestamp, input_file_name, lit, to_date
import uuid

batch_id = str(uuid.uuid4())

df = (spark.read.format("csv").option("header", True).load("Files/landing/daily/")
      .withColumn("ingestion_timestamp", current_timestamp())
      .withColumn("ingestion_date", to_date(current_timestamp()))  # partition column
      .withColumn("source_file", input_file_name())
      .withColumn("batch_id", lit(batch_id)))

df.write.mode("append").partitionBy("ingestion_date").format("delta").saveAsTable("bronze.events_raw")
```

### Example 3: Bronze-to-Silver Transformation

**Prompt**: "Clean bronze data: remove duplicates, filter invalid records, add derived columns, write to silver"

**What the LLM should generate**: PySpark notebook applying quality rules, schema conformance, and a partitioned write with optimization (a minimal sketch appears at the end of this document).

### Example 4: End-to-End Pipeline

**Prompt**: "Create a pipeline that runs bronze ingestion, then silver transformation, then gold aggregation daily at 2 AM"

**What the LLM should generate**: Pipeline JSON definition with sequential notebook activities, a date parameter, retry logic, and a schedule trigger.
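To make Example 3 concrete, a minimal Bronze-to-Silver sketch is shown below. The table names, key columns, and valid ranges are illustrative assumptions, and it presumes the Silver table already exists partitioned by `event_date`; the real notebook should apply the project's own quality rules and write strategy.

```python
# Hedged sketch for Example 3: Bronze -> Silver quality rules and partition-aware overwrite.
# Table, key, and range values are illustrative only; assumes silver_events already
# exists and is partitioned by event_date.
from pyspark.sql import functions as F

bronze_df = spark.table("bronze_events_raw")

silver_df = (
    bronze_df
    .dropDuplicates(["event_id"])                   # dedup on natural key
    .filter(F.col("amount").between(0, 100000))     # range validation
    .dropna(subset=["event_id", "event_date"])      # drop rows missing required fields
    .withColumn("event_date", F.to_date("event_date"))
    .withColumnRenamed("EventType", "event_type")   # conform to snake_case
)

processing_date = "2024-01-15"  # normally passed in as a pipeline parameter

# Partition-aware overwrite: replace only the processed business date.
(silver_df.filter(F.col("event_date") == F.lit(processing_date))
    .write.mode("overwrite")
    .option("replaceWhere", f"event_date = '{processing_date}'")
    .format("delta")
    .saveAsTable("silver_events"))

spark.sql("OPTIMIZE silver_events")
```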