---
name: data-architecture
description: Use when designing data platforms, choosing between data lakes/lakehouses/warehouses, or implementing data mesh patterns. Covers modern data architecture approaches.
allowed-tools: Read, Glob, Grep
---

# Data Architecture

This skill covers modern data architecture patterns: data warehouses, data lakes, lakehouses, data mesh, and data platform design.

## When to Use This Skill

- Choosing between data lake, warehouse, and lakehouse
- Designing a modern data platform
- Implementing data mesh principles
- Planning data storage strategy
- Understanding data architecture trade-offs

## Data Architecture Evolution

```text
Generation 1: Data Warehouse (1990s-2000s)
- Structured data only
- ETL into warehouse
- Star/snowflake schemas
- SQL-based analytics

Generation 2: Data Lake (2010s)
- All data types (structured, semi-structured, unstructured)
- Schema-on-read
- Initially Hadoop/HDFS based, later cloud object storage
- Cheap storage, complex processing

Generation 3: Lakehouse (2020s)
- Best of both: lake flexibility + warehouse features
- ACID transactions on lake
- Schema enforcement optional
- Unified analytics and ML
```

## Architecture Comparison

### Data Warehouse

```text
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │     ETL     │ ──► │  Warehouse  │
│ (Structured)│     │ (Transform) │     │ (Star/Snow) │
└─────────────┘     └─────────────┘     └─────────────┘
                                              │
                                              ▼
                                        ┌─────────────┐
                                        │     BI      │
                                        │  Analytics  │
                                        └─────────────┘

Characteristics:
- Schema-on-write
- Optimized for SQL queries
- Primarily structured data (some modern warehouses also handle semi-structured formats such as JSON)
- High data quality
- Expensive storage

Best for:
- Business intelligence
- Financial reporting
- Structured analytics
```

### Data Lake

```text
┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │  Data Lake  │
│    (All)    │     │   (Raw)     │
└─────────────┘     └─────────────┘
                          │
         ┌────────────────┼────────────────┐
         ▼                ▼                ▼
    ┌─────────┐     ┌─────────┐     ┌─────────┐
    │   ML    │     │   ETL   │     │  Spark  │
    │ Training│     │ to DW   │     │ Analysis│
    └─────────┘     └─────────┘     └─────────┘

Characteristics:
- Schema-on-read
- All data types
- Cheap storage
- Flexible processing
- Risk of "data swamp"

Best for:
- Data science/ML
- Unstructured data
- Experimental analysis
```

### Data Lakehouse

```text
┌─────────────┐     ┌─────────────────────────────────┐
│   Sources   │ ──► │         Data Lakehouse          │
│    (All)    │     │  ┌──────────────────────────┐   │
└─────────────┘     │  │    Metadata Layer        │   │
                    │  │ (Delta/Iceberg/Hudi)     │   │
                    │  └──────────────────────────┘   │
                    │  ┌──────────────────────────┐   │
                    │  │    Storage Layer         │   │
                    │  │    (Object Storage)      │   │
                    │  └──────────────────────────┘   │
                    └─────────────────────────────────┘
                                   │
              ┌────────────────────┼────────────────────┐
              ▼                    ▼                    ▼
         ┌─────────┐         ┌─────────┐         ┌─────────┐
         │   SQL   │         │   ML    │         │  Stream │
         │   BI    │         │ Workload│         │ Process │
         └─────────┘         └─────────┘         └─────────┘

Characteristics:
- ACID transactions
- Schema evolution
- Time travel
- Unified batch/streaming
- Open formats

Best for:
- Unified analytics
- Both BI and ML
- Modern data platforms
```
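
The metadata layer is what turns plain files in object storage into a table with ACID commits and time travel. The sketch below is a conceptual toy, not the actual Delta, Iceberg, or Hudi format: `TableLog`, `commit`, and `snapshot` are hypothetical names, and the file names are placeholders. It shows how an append-only log of add/remove actions lets a reader reconstruct the table as of any version.

```python
class TableLog:
    """Toy append-only commit log over data files (illustrative only)."""

    def __init__(self):
        # Each commit is a pair: (files added, files removed).
        self._commits = []

    def commit(self, add=(), remove=()):
        """Append one commit and return its version number."""
        missing = set(remove) - self.snapshot()
        if missing:
            raise ValueError(f"cannot remove files not in table: {sorted(missing)}")
        self._commits.append((frozenset(add), frozenset(remove)))
        return len(self._commits) - 1

    def snapshot(self, version=None):
        """Return the set of live files as of `version` (latest by default)."""
        if version is None:
            version = len(self._commits) - 1
        if not -1 <= version < len(self._commits):
            raise ValueError(f"unknown version {version}")
        files = set()
        # Replay the log from the beginning up to the requested version.
        for added, removed in self._commits[: version + 1]:
            files -= removed
            files |= added
        return files


log = TableLog()
log.commit(add=["part-0.parquet", "part-1.parquet"])       # version 0
log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])  # version 1: rewrite
latest = log.snapshot()     # current table contents
as_of_v0 = log.snapshot(0)  # time travel to version 0
```

Because old data files are never modified in place, reading an earlier version only requires replaying fewer log entries. Real table formats add concurrency control, checkpoints, and statistics on top of this idea.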

## Architecture Selection Guide

| Factor | Warehouse | Lake | Lakehouse |
| ------ | --------- | ---- | --------- |
| Data types | Structured | All | All |
| Query performance | Excellent | Poor-Medium | Good |
| Data quality | High | Variable | Configurable |
| Cost | High | Low | Medium |
| ML workloads | Limited | Excellent | Excellent |
| Real-time | Limited | Good | Good |
| Governance | Strong | Weak | Strong |
| Complexity | Low | High | Medium |

```text
Decision Tree:

Is data mostly structured with BI focus?
├── Yes → Data Warehouse
└── No
    └── Need ML + BI on same data?
        ├── Yes → Lakehouse
        └── No
            └── Primarily ML/unstructured?
                ├── Yes → Data Lake
                └── No → Lakehouse
```
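
The decision tree above can be written as a small function, which is handy for design checklists or tests. This is a minimal sketch; the `Workload` class and its field names are hypothetical labels for the three questions in the tree.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Answers to the three questions in the decision tree (hypothetical names)."""
    mostly_structured_bi: bool
    needs_ml_and_bi_on_same_data: bool
    primarily_ml_or_unstructured: bool


def recommend_architecture(w: Workload) -> str:
    """Walk the decision tree top to bottom and return the suggested architecture."""
    if w.mostly_structured_bi:
        return "Data Warehouse"
    if w.needs_ml_and_bi_on_same_data:
        return "Lakehouse"
    if w.primarily_ml_or_unstructured:
        return "Data Lake"
    # Mixed workloads with no dominant pattern default to the lakehouse.
    return "Lakehouse"


choice = recommend_architecture(
    Workload(
        mostly_structured_bi=False,
        needs_ml_and_bi_on_same_data=True,
        primarily_ml_or_unstructured=False,
    )
)
```

Treat the result as a starting point; the factors in the selection table (cost, governance, team skills) still need to be weighed for a real platform.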

## Lakehouse Technologies

### Delta Lake (created by Databricks)

```text
Features:
- ACID transactions
- Time travel (data versioning)
- Schema enforcement/evolution
- Unified batch/streaming
- Optimized performance (Z-ordering, compaction)

File format: Parquet + Delta log
```

### Apache Iceberg (created at Netflix)

```text
Features:
- ACID transactions
- Hidden partitioning
- Schema evolution
- Time travel
- Vendor neutral

File format: Parquet/ORC/Avro + metadata
```

### Apache Hudi (created at Uber)

```text
Features:
- ACID transactions
- Incremental processing
- Record-level updates
- Time travel
- Optimized for streaming

File format: Parquet base files (plus row-based log files for merge-on-read tables) + Hudi timeline metadata
```

### Technology Comparison

| Feature | Delta Lake | Iceberg | Hudi |
| ------- | ---------- | ------- | ---- |
| ACID | Yes | Yes | Yes |
| Time Travel | Yes | Yes | Yes |
| Schema Evolution | Good | Excellent | Good |
| Streaming | Excellent | Good | Excellent |
| Ecosystem | Databricks-centric | Wide | Wide |
| Performance | Excellent | Excellent | Good |
| Community | Large | Growing | Medium |

## Data Mesh

### Principles

```text
Data Mesh = Decentralized data architecture

Four Principles:

1. Domain Ownership
   - Data owned by domain teams
   - Not by a centralized data team

2. Data as a Product
   - Treat data like a product
   - Quality, discoverability, usability

3. Self-Serve Platform
   - Platform enables domain teams
   - Reduces friction

4. Federated Governance
   - Global standards
   - Local implementation
```

### Data Products

```text
Data Product = Autonomous unit of data

Components:
┌──────────────────────────────────────┐
│           Data Product               │
│  ┌──────────┐  ┌──────────────────┐ │
│  │   Data   │  │     Metadata     │ │
│  │ (Tables) │  │ (Schema, docs)   │ │
│  └──────────┘  └──────────────────┘ │
│  ┌──────────┐  ┌──────────────────┐ │
│  │   Code   │  │      APIs        │ │
│  │ (ETL)    │  │  (Access layer)  │ │
│  └──────────┘  └──────────────────┘ │
│  ┌──────────────────────────────────┐│
│  │         Quality + SLAs           ││
│  └──────────────────────────────────┘│
└──────────────────────────────────────┘
```
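
One way to make the components in the diagram concrete is a descriptor that each domain team publishes alongside its data. The sketch below is an assumption about how such a descriptor might look, not a standard: `DataProduct`, its fields, and the example values are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DataProduct:
    """Hypothetical descriptor mirroring the components in the diagram."""
    name: str
    owner_domain: str
    tables: list[str]            # Data
    schema: dict[str, str]       # Metadata: column name -> type
    docs: str                    # Metadata: human-readable description
    pipeline: str                # Code: entry point of the ETL job
    endpoint: str                # APIs: where consumers read the data
    freshness_sla_hours: int     # Quality + SLAs
    min_completeness: float      # Quality + SLAs, fraction in (0, 1]

    def missing_components(self) -> list[str]:
        """Return the names of components that are absent or invalid."""
        checks = {
            "data": bool(self.tables),
            "metadata": bool(self.schema) and bool(self.docs.strip()),
            "code": bool(self.pipeline.strip()),
            "apis": bool(self.endpoint.strip()),
            "quality_slas": self.freshness_sla_hours > 0
            and 0.0 < self.min_completeness <= 1.0,
        }
        return [name for name, ok in checks.items() if not ok]


orders = DataProduct(
    name="orders",
    owner_domain="sales",
    tables=["orders_daily"],
    schema={"order_id": "int", "amount": "float"},
    docs="One row per confirmed order.",
    pipeline="jobs/orders_daily.py",   # placeholder path
    endpoint="",                       # access layer not published yet
    freshness_sla_hours=24,
    min_completeness=0.99,
)
gaps = orders.missing_components()
```

A check like `missing_components` could run in the self-serve platform before a product is listed in the catalog, which is one way to apply global standards while leaving implementation to the domain.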

### Data Mesh vs Centralized

| Aspect | Centralized | Data Mesh |
| ------ | ----------- | --------- |
| Ownership | Central data team | Domain teams |
| Scaling | Team bottleneck | Scales with org |
| Domain knowledge | Lost in translation | Preserved |
| Governance | Centralized | Federated |
| Implementation | Uniform | Heterogeneous |
| Complexity | Lower initially | Higher initially |

## Data Modeling Patterns

### Star Schema

```text
        ┌─────────────┐
        │  Dim_Time   │
        └──────┬──────┘
               │
┌───────────┐  │  ┌───────────┐
│Dim_Product├──┼──┤Dim_Customer│
└───────────┘  │  └───────────┘
               │
        ┌──────┴──────┐
        │ Fact_Sales  │
        └─────────────┘

Pros: Simple, fast queries
Cons: Denormalized dimensions, data redundancy
Best for: BI, reporting
```
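
To illustrate why star-schema queries stay simple, the sketch below builds the tables from the diagram in an in-memory SQLite database and runs a typical BI aggregation. Column names and sample rows are made up for the example.

```python
import sqlite3


def build_star_schema() -> sqlite3.Connection:
    """Create the fact and dimension tables with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE fact_sales (
            product_id  INTEGER REFERENCES dim_product(product_id),
            customer_id INTEGER REFERENCES dim_customer(customer_id),
            time_id     INTEGER REFERENCES dim_time(time_id),
            amount      REAL
        );
        INSERT INTO dim_product  VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
        INSERT INTO dim_customer VALUES (1, 'EMEA'), (2, 'APAC');
        INSERT INTO dim_time     VALUES (1, 2024, 1), (2, 2024, 2);
        INSERT INTO fact_sales   VALUES (1, 1, 1, 1200.0), (1, 2, 2, 1100.0), (2, 1, 2, 300.0);
    """)
    return conn


def sales_by_category_and_region(conn: sqlite3.Connection):
    """Aggregate the fact table, joining each dimension exactly once."""
    return conn.execute("""
        SELECT p.category, c.region, SUM(f.amount) AS total
        FROM fact_sales f
        JOIN dim_product  p ON p.product_id  = f.product_id
        JOIN dim_customer c ON c.customer_id = f.customer_id
        GROUP BY p.category, c.region
        ORDER BY p.category, c.region
    """).fetchall()


rows = sales_by_category_and_region(build_star_schema())
```

Every dimension is one join away from the fact table, so the query shape stays the same no matter which attributes the report groups by.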

### Snowflake Schema

```text
Normalized dimensions:
Dim_Product → Dim_Subcategory → Dim_Category

Pros: Less redundancy
Cons: More joins, slower
Best for: Complex hierarchies
```

### Data Vault

```text
Hub (business keys) ←→ Link (relationships between hubs)
Satellite (descriptive attributes and their history) attaches to a Hub or a Link

Pros: Auditable, flexible, scalable
Cons: Complex, learning curve
Best for: Enterprise data warehouse
```

## Storage Layers

### Bronze/Silver/Gold (Medallion Architecture)

```text
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Bronze  │ ──► │ Silver  │ ──► │  Gold   │
│  (Raw)  │     │(Cleaned)│     │(Curated)│
└─────────┘     └─────────┘     └─────────┘

Bronze: Raw ingestion, append-only
Silver: Cleaned, validated, conformed
Gold: Business-level aggregates, ML features
```
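
The three layers can be sketched with plain Python lists standing in for tables. This is a minimal illustration, assuming a hypothetical orders feed with `order_id`, `region`, and `amount` fields; a real pipeline would run the same steps in an engine such as Spark or SQL.

```python
from collections import defaultdict


def to_silver(bronze_rows):
    """Bronze -> Silver: validate amounts, normalize regions, drop duplicates."""
    seen, silver = set(), []
    for row in bronze_rows:
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # invalid or missing amount: reject the row
        order_id = row.get("order_id")
        if order_id is None or order_id in seen:
            continue  # missing key or duplicate delivery
        seen.add(order_id)
        region = str(row.get("region", "")).strip().upper()
        silver.append({"order_id": order_id, "region": region, "amount": amount})
    return silver


def to_gold(silver_rows):
    """Silver -> Gold: business-level aggregate (revenue per region)."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


# Bronze is kept exactly as ingested, including bad and duplicate records.
bronze = [
    {"order_id": 1, "region": " emea ", "amount": "10.5"},
    {"order_id": 1, "region": " emea ", "amount": "10.5"},  # duplicate
    {"order_id": 2, "region": "apac", "amount": "n/a"},     # invalid amount
    {"order_id": 3, "region": "EMEA", "amount": 4.5},
]
gold = to_gold(to_silver(bronze))
```

Keeping bronze untouched means the silver and gold layers can always be rebuilt if cleaning rules change.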

### Zones in Data Lake

```text
Landing Zone: Raw files from sources
Raw Zone: Structured raw data
Curated Zone: Transformed, quality-checked
Consumption Zone: Ready for analytics
Sandbox Zone: Exploration and experimentation
```

## Best Practices

### Data Quality

```text
Implement quality gates:
- Schema validation
- Null checks
- Range validation
- Referential integrity
- Freshness monitoring
```
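
The gates listed above can be combined into a single check that runs before a batch is promoted. The sketch below assumes a hypothetical orders batch; `SCHEMA`, `quality_failures`, and the 24-hour freshness threshold are illustrative choices, not fixed rules.

```python
from datetime import datetime, timedelta, timezone

SCHEMA = {"order_id": int, "customer_id": int, "amount": float}  # hypothetical columns


def quality_failures(rows, known_customer_ids, loaded_at, now,
                     max_age=timedelta(hours=24)):
    """Return a list of failures; an empty list means the batch passes."""
    failures = []
    # Freshness monitoring: the batch must be recent enough.
    if now - loaded_at > max_age:
        failures.append("freshness: batch older than max_age")
    for i, row in enumerate(rows):
        for column, expected_type in SCHEMA.items():
            value = row.get(column)
            if value is None:
                # Null check (also catches a missing column).
                failures.append(f"row {i}: {column} is null")
            elif not isinstance(value, expected_type):
                # Schema validation: wrong type.
                failures.append(f"row {i}: {column} is not {expected_type.__name__}")
        # Range validation: amounts must be non-negative.
        amount = row.get("amount")
        if isinstance(amount, float) and amount < 0:
            failures.append(f"row {i}: amount out of range")
        # Referential integrity: the customer must exist in the dimension.
        customer_id = row.get("customer_id")
        if customer_id is not None and customer_id not in known_customer_ids:
            failures.append(f"row {i}: unknown customer_id {customer_id}")
    return failures


now = datetime(2024, 1, 2, tzinfo=timezone.utc)
batch = [{"order_id": 1, "customer_id": 10, "amount": 5.0}]
result = quality_failures(batch, {10}, now - timedelta(hours=1), now)
```

Returning all failures rather than stopping at the first one makes the report more useful to the team that owns the source.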

### Governance

```text
Key capabilities:
- Data catalog
- Lineage tracking
- Access control
- Privacy compliance
- Audit logging
```

### Performance

```text
Optimization techniques:
- Partitioning (by date, region)
- Clustering/Z-ordering
- Compaction
- Caching
- Materialized views
```
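
Partitioning pays off when the engine can skip whole directories. The sketch below assumes the common `key=value` directory convention and uses hypothetical helper names (`partition_path`, `prune_partitions`); it shows how a date and region filter narrows the set of paths that need to be read.

```python
from datetime import date


def partition_path(table: str, event_date: date, region: str) -> str:
    """Build a key=value style partition path (layout is an assumption)."""
    return f"{table}/date={event_date.isoformat()}/region={region}"


def prune_partitions(paths, start: date, end: date, region=None):
    """Keep only partitions with start <= date <= end and a matching region."""
    selected = []
    for path in paths:
        # Parse "key=value" segments out of the path.
        parts = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
        day = date.fromisoformat(parts["date"])
        if start <= day <= end and (region is None or parts["region"] == region):
            selected.append(path)
    return selected


all_paths = [
    partition_path("sales", date(2024, 1, d), r)
    for d in (1, 2, 3)
    for r in ("emea", "apac")
]
to_scan = prune_partitions(all_paths, date(2024, 1, 2), date(2024, 1, 3), region="emea")
```

Here only two of six partitions are scanned. Choosing partition keys that match the most common filters matters more than the number of partitions; too many small partitions creates the small-file problem that compaction then has to fix.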

## Related Skills

- `etl-elt-patterns` - Data transformation
- `stream-processing` - Real-time data
- `database-scaling` - Database patterns