---
name: etl-elt-patterns
description: Use when designing data pipelines, choosing between ETL and ELT approaches, or implementing data transformation patterns. Covers modern data pipeline architecture.
allowed-tools: Read, Glob, Grep
---

# ETL/ELT Patterns

Patterns for data extraction, loading, and transformation, including modern ELT approaches and pipeline design.

## When to Use This Skill

- Choosing between ETL and ELT
- Designing data pipelines
- Implementing data transformations
- Building modern data stacks
- Handling data quality in pipelines

## ETL vs ELT

### ETL (Extract, Transform, Load)

```text
┌─────────┐     ┌─────────────┐     ┌─────────────┐
│ Sources │ ──► │  Transform  │ ──► │  Warehouse  │
└─────────┘     │   Server    │     └─────────────┘
                └─────────────┘

(Transformation happens before loading)

Characteristics:
- Transform before load
- Requires ETL server/tool
- Schema-on-write
- Traditional approach

Best for:
- Complex transformations
- Limited target storage
- Strict data quality requirements
- Legacy systems
```

### ELT (Extract, Load, Transform)

```text
┌─────────┐     ┌─────────────┐     ┌─────────────┐
│ Sources │ ──► │   Target    │ ──► │  Transform  │
└─────────┘     │ (Load raw)  │     │ (in-place)  │
                └─────────────┘     └─────────────┘

(Transformation happens after loading)

Characteristics:
- Load first, transform later
- Uses target system's compute
- Schema-on-read
- Modern approach

Best for:
- Cloud data warehouses
- Flexible exploration
- Iterative development
- Large data volumes
```

### Comparison

| Factor | ETL | ELT |
| ------ | --- | --- |
| Transform timing | Before load | After load |
| Compute location | Separate server | Target system |
| Raw data access | Limited | Full |
| Flexibility | Low | High |
| Time to raw data | Higher (transform first) | Lower (load raw immediately) |
| Cost model | ETL server + storage | Storage + target compute |
| Best for | Complex, pre-defined | Exploratory, iterative |

## Modern Data Stack

```text
Extract: Fivetran, Airbyte, Stitch, Custom
    │
    ▼
Load: Cloud Warehouse (Snowflake, BigQuery, Redshift)
    │
    ▼
Transform: dbt, Dataform, SQLMesh
    │
    ▼
Visualize: Looker, Tableau, Metabase
```

### dbt (Data Build Tool)

```text
Core concepts:
- Models: SQL SELECT statements that define transformations
- Tests: Data quality assertions
- Documentation: Inline docs and lineage
- Macros: Reusable SQL snippets

Example model:

-- models/customers.sql
SELECT
    customer_id,
    first_name,
    last_name,
    order_count
FROM {{ ref('stg_customers') }}
LEFT JOIN {{ ref('customer_orders') }} USING (customer_id)
```

## Pipeline Patterns

### Full Refresh

```text
Strategy: Drop and recreate entire table

Process:
1. Extract all data from source
2. Truncate target table
3. Load all data

Pros: Simple, consistent, handles deletes naturally
Cons: Slow and costly for large tables
Best for: Small dimension tables, reference data
```

### Incremental Load

```text
Strategy: Only process new/changed records

Process:
1. Track a high watermark (last processed timestamp/ID)
2. Extract records > watermark
3. Merge into target

Pros: Fast, efficient
Cons: More complex; misses hard deletes and late-arriving updates
Best for: Large fact tables, event data
```
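A minimal sketch of the three steps above, using SQLite from the standard library for both source and target so it runs as-is; the `orders` table and its `id`/`status`/`updated_at` columns are illustrative placeholders, not part of any particular system:

```python
# Watermark-based incremental load: read the last processed timestamp,
# extract only newer rows from the source, and upsert them into the target.
# Uses sqlite3 (stdlib) so the sketch is self-contained; table and column
# names are illustrative.
import sqlite3


def incremental_load(source: sqlite3.Connection, target: sqlite3.Connection) -> int:
    # 1. High watermark: the latest updated_at already present in the target.
    (watermark,) = target.execute(
        "SELECT COALESCE(MAX(updated_at), 0) FROM orders"
    ).fetchone()

    # 2. Extract only records newer than the watermark.
    changed = source.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ?",
        (watermark,),
    ).fetchall()

    # 3. Merge into the target; SQLite's ON CONFLICT upsert plays the role
    #    of MERGE in Snowflake/BigQuery (requires a unique key on id).
    target.executemany(
        """
        INSERT INTO orders (id, status, updated_at) VALUES (?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET
            status = excluded.status,
            updated_at = excluded.updated_at
        """,
        changed,
    )
    target.commit()
    return len(changed)
```

In production the watermark is usually persisted in a dedicated control table (or the orchestrator's state) rather than derived from the target, which makes failed runs and backfills easier to reason about.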
### Change Data Capture (CDC)

```text
Strategy: Capture all changes from source

Approaches:

┌────────────────────────────────────────┐
│ Log-based CDC (Debezium, AWS DMS)      │
│ - Reads database transaction log       │
│ - Captures inserts, updates, deletes   │
│ - No source table modification needed  │
└────────────────────────────────────────┘

┌────────────────────────────────────────┐
│ Trigger-based CDC                      │
│ - Database triggers on changes         │
│ - Writes to change table               │
│ - Adds load to source                  │
└────────────────────────────────────────┘

┌────────────────────────────────────────┐
│ Timestamp-based CDC                    │
│ - Query by updated_at timestamp        │
│ - Simple but misses hard deletes       │
└────────────────────────────────────────┘
```

### Merge (Upsert) Pattern

```sql
-- Snowflake/BigQuery style MERGE
MERGE INTO target t
USING source s
  ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET
    t.name = s.name,
    t.updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
  INSERT (id, name, created_at)
  VALUES (s.id, s.name, CURRENT_TIMESTAMP);
```

## Data Quality Patterns

### Validation Gates

```text
Pipeline with quality gates:

Extract → Validate → Load → Transform → Validate → Serve
              │                             │
              ▼                             ▼
         Quarantine                    Alert/Block
       (bad records)                 (quality issue)
```

### Quality Checks

```text
Schema validation:
- Required fields present
- Data types match
- Field lengths within limits

Data validation:
- Null checks
- Range checks
- Format validation (dates, emails)
- Referential integrity

Statistical validation:
- Row count within expected range
- Value distributions match expectations
- No unexpected duplicates
```

### Data Contracts

```text
Define expectations between producer and consumer:

{
  "contract_version": "1.0",
  "schema": {
    "customer_id": {"type": "string", "required": true},
    "email": {"type": "string", "format": "email"},
    "created_at": {"type": "timestamp"}
  },
  "quality": {
    "freshness": "< 1 hour",
    "completeness": "> 99%",
    "row_count": "10000-100000"
  }
}
```

## Pipeline Architecture

### Batch Pipeline

```text
Schedule-based processing:

┌─────────┐     ┌─────────┐     ┌─────────┐
│  Cron   │ ──► │  Spark  │ ──► │   DW    │
│ (daily) │     │  (ETL)  │     │         │
└─────────┘     └─────────┘     └─────────┘

Best for: Non-real-time, large volumes
Tools: Airflow, Dagster, Prefect
```

### Streaming Pipeline

```text
Real-time processing:

┌─────────┐     ┌─────────┐     ┌─────────┐
│  Kafka  │ ──► │  Flink  │ ──► │   DW    │
│(events) │     │(process)│     │(stream) │
└─────────┘     └─────────┘     └─────────┘

Best for: Real-time analytics, event-driven
Tools: Kafka Streams, Flink, Spark Streaming
```

### Lambda Architecture

```text
Batch + Speed layers:

                    ┌─────────────────┐
              ┌───► │   Batch Layer   │ ────┐
              │     │ (comprehensive) │     │
┌─────────┐   │     └─────────────────┘     │     ┌─────────┐
│  Data   │ ──┤                             ├──►  │ Serving │
│ Sources │   │     ┌─────────────────┐     │     │  Layer  │
└─────────┘   └───► │   Speed Layer   │ ────┘     └─────────┘
                    │   (real-time)   │
                    └─────────────────┘

Pros: Complete + real-time
Cons: Complex, duplicate logic
```

### Kappa Architecture

```text
Streaming only (reprocess from log):

┌─────────┐     ┌─────────┐     ┌─────────┐
│  Kafka  │ ──► │ Stream  │ ──► │ Serving │
│  (log)  │     │ Process │     │  Layer  │
└─────────┘     └─────────┘     └─────────┘
     │
     └── Replay for reprocessing

Pros: Simple, single codebase
Cons: Requires replayable log
```

## Orchestration

### DAG-Based Orchestration

```text
Directed Acyclic Graph of tasks:

extract_a ──┐
            ├── transform ── load
extract_b ──┘

Tools: Airflow, Dagster, Prefect
```
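As a sketch of how the graph above maps onto an orchestrator, here is a minimal Airflow DAG (assuming Airflow 2.4+ with the TaskFlow API; the task names mirror the diagram and the task bodies are placeholders):

```python
# Minimal Airflow TaskFlow sketch of the DAG above.
# Task bodies are placeholders; only the dependency wiring matters here.
import pendulum
from airflow.decorators import dag, task


@dag(
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
)
def etl_example():
    @task
    def extract_a() -> list[dict]:
        return [{"id": 1, "source": "a"}]   # placeholder extract

    @task
    def extract_b() -> list[dict]:
        return [{"id": 2, "source": "b"}]   # placeholder extract

    @task
    def transform(a: list[dict], b: list[dict]) -> list[dict]:
        return a + b                        # placeholder transform

    @task
    def load(rows: list[dict]) -> None:
        print(f"loading {len(rows)} rows")  # placeholder load

    # Passing task outputs as arguments defines the edges:
    # extract_a / extract_b ──► transform ──► load
    load(transform(extract_a(), extract_b()))


etl_example()
```

Dagster and Prefect express the same dependency graph with their own decorators; the practices below apply regardless of the tool.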
### Orchestration Best Practices

```text
1. Idempotent tasks (safe to retry)
2. Clear dependencies
3. Appropriate granularity
4. Monitoring and alerting
5. Backfill support
6. Parameterized runs
```

## Error Handling

### Retry Strategies

```text
Transient errors (network, timeout):
- Exponential backoff
- Max retry count
- Circuit breaker

Data errors (validation failure):
- Quarantine bad records
- Continue processing good records
- Alert for review
```

### Dead Letter Queue

```text
Failed records → DLQ → Manual review → Reprocess

Capture:
- Original record
- Error message
- Timestamp
- Retry count
```

## Best Practices

### Pipeline Design

```text
1. Idempotent transformations
2. Clear lineage tracking
3. Appropriate checkpointing
4. Graceful failure handling
5. Comprehensive logging
6. Data quality gates
```

### Performance

```text
1. Partition data appropriately
2. Incremental processing when possible
3. Parallel extraction
4. Efficient file formats (Parquet, ORC)
5. Compression
6. Resource sizing
```

## Related Skills

- `data-architecture` - Data platform design
- `stream-processing` - Real-time processing
- `ml-system-design` - Feature engineering