---
name: sql-databases
description: SQL query optimization, schema design, indexing strategies, and relational database mastery for production data systems
sasmp_version: "1.3.0"
bonded_agent: 01-data-engineer
bond_type: PRIMARY_BOND
skill_version: "2.0.0"
last_updated: "2025-01"
complexity: foundational
estimated_mastery_hours: 100
prerequisites: [python-programming]
unlocks: [data-warehousing, etl-tools, nosql-databases]
---

# SQL & Relational Databases

Production-grade SQL skills for designing, querying, and optimizing relational databases in data engineering workflows.

## Quick Start

```sql
-- Modern PostgreSQL 16+ query with window functions and CTEs
WITH daily_metrics AS (
    SELECT
        DATE_TRUNC('day', created_at) AS metric_date,
        category,
        COUNT(*) AS event_count,
        SUM(amount) AS total_amount,
        AVG(amount) AS avg_amount
    FROM events
    WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY 1, 2
),
ranked_categories AS (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY metric_date ORDER BY total_amount DESC) AS daily_rank,
        LAG(total_amount) OVER (PARTITION BY category ORDER BY metric_date) AS prev_day_amount
    FROM daily_metrics
)
SELECT
    metric_date,
    category,
    event_count,
    total_amount,
    ROUND(100.0 * (total_amount - prev_day_amount) / NULLIF(prev_day_amount, 0), 2) AS day_over_day_pct
FROM ranked_categories
WHERE daily_rank <= 5
ORDER BY metric_date DESC, daily_rank;
```

## Core Concepts

### 1. Query Optimization with EXPLAIN ANALYZE

```sql
-- Always analyze before optimizing
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT u.id, u.email, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at >= '2024-01-01'
GROUP BY u.id, u.email
HAVING COUNT(o.id) > 10;

-- Reading EXPLAIN output:
-- Seq Scan          = Table scan (often slow, needs index)
-- Index Scan        = Using index (good)
-- Bitmap Index Scan = Multiple index conditions (acceptable)
-- Hash Join         = Building hash table (memory intensive)
-- Nested Loop       = Row-by-row join (slow for large tables)
```

### 2. Indexing Strategies

```sql
-- B-tree index (default, most common)
CREATE INDEX idx_users_email ON users(email);

-- Partial index (smaller, faster for specific queries)
CREATE INDEX idx_active_users ON users(created_at) WHERE status = 'active';

-- Composite index (column order matters!)
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at DESC);

-- Covering index (includes all columns the query needs)
CREATE INDEX idx_orders_covering ON orders(user_id) INCLUDE (total_amount, status);

-- Expression index (for computed values)
CREATE INDEX idx_users_lower_email ON users(LOWER(email));

-- GIN index for JSONB
CREATE INDEX idx_metadata_gin ON events USING GIN(metadata jsonb_path_ops);

-- Check index usage
SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
```
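The usage statistics above also surface indexes that are never hit and only slow down writes. A minimal follow-up sketch, assuming an unused index found via `pg_stat_user_indexes`; the index name `idx_orders_covering_old` is purely illustrative, so verify against your own output before dropping anything:

```sql
-- Candidate indexes with zero scans since the last stats reset
SELECT indexname, pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

-- Drop without blocking concurrent writes (CONCURRENTLY cannot run inside a transaction block);
-- idx_orders_covering_old is a hypothetical name
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_covering_old;
```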
### 3. Window Functions (2024-2025 Essential)

```sql
-- Running totals and moving averages
SELECT
    order_date,
    amount,
    -- Running total
    SUM(amount) OVER (ORDER BY order_date) AS running_total,
    -- 7-day moving average
    AVG(amount) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d,
    -- Rank within partition
    RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS customer_rank,
    -- Percent of total within partition
    100.0 * amount / SUM(amount) OVER (PARTITION BY customer_id) AS pct_of_customer_total
FROM orders;

-- Gap and island detection
WITH numbered AS (
    SELECT
        event_date,
        ROW_NUMBER() OVER (ORDER BY event_date) AS rn,
        event_date - (ROW_NUMBER() OVER (ORDER BY event_date) * INTERVAL '1 day') AS grp
    FROM events
)
SELECT
    MIN(event_date) AS streak_start,
    MAX(event_date) AS streak_end,
    COUNT(*) AS streak_length
FROM numbered
GROUP BY grp
ORDER BY streak_start;
```

### 4. Transaction Isolation & ACID

```sql
-- Read Committed (default in PostgreSQL)
BEGIN;
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- Sees committed data at query start
COMMIT;

-- Repeatable Read (for consistent snapshots)
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- Same query returns same results within transaction
COMMIT;

-- Serializable (strongest, may cause retries)
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Full isolation, but prepare for serialization failures
COMMIT;

-- Optimistic locking pattern
UPDATE products
SET stock = stock - 1, version = version + 1
WHERE id = $1 AND version = $2;
-- Check rows affected; if 0, concurrent modification occurred
```

### 5. Partitioning for Scale

```sql
-- Range partitioning (time-series data)
CREATE TABLE events (
    id BIGSERIAL,
    event_type TEXT NOT NULL,
    payload JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (created_at);

-- Create partitions
CREATE TABLE events_2024_q1 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
CREATE TABLE events_2024_q2 PARTITION OF events
    FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');

-- Auto-partition with pg_partman (production setup)
SELECT partman.create_parent(
    p_parent_table := 'public.events',
    p_control      := 'created_at',
    p_type         := 'native',
    p_interval     := 'monthly',
    p_premake      := 3
);

-- List partitioning (categorical data)
CREATE TABLE orders (
    id BIGSERIAL,
    region TEXT NOT NULL,
    total NUMERIC(10,2)
) PARTITION BY LIST (region);

CREATE TABLE orders_us PARTITION OF orders FOR VALUES IN ('US', 'CA');
CREATE TABLE orders_eu PARTITION OF orders FOR VALUES IN ('UK', 'DE', 'FR');
```
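Two follow-ups matter once partitions exist: checking that partition pruning actually kicks in, and retiring old partitions. A minimal sketch against the `events` table above; the retention choice (detach vs. drop) is illustrative:

```sql
-- Confirm partition pruning: the plan should touch only events_2024_q2
EXPLAIN (COSTS OFF)
SELECT COUNT(*)
FROM events
WHERE created_at >= '2024-04-01' AND created_at < '2024-05-01';

-- Retention: detach an old partition (the data survives as a standalone table)...
ALTER TABLE events DETACH PARTITION events_2024_q1;

-- ...then archive or drop it once it is no longer needed
DROP TABLE events_2024_q1;
```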
## Tools & Technologies

| Tool | Purpose | Version (2025) |
|------|---------|----------------|
| **PostgreSQL** | Primary OLTP database | 16+ |
| **MySQL** | Alternative OLTP | 8.0+ |
| **DBeaver** | Universal SQL client | Latest |
| **pgAdmin** | PostgreSQL admin | 8+ |
| **pgcli** | CLI with autocomplete | Latest |
| **pg_stat_statements** | Query performance | Built-in |
| **pgBouncer** | Connection pooling | 1.21+ |
| **Flyway/Liquibase** | Schema migrations | Latest |

## Learning Path

### Phase 1: Foundations (Weeks 1-2)

```
Week 1: SELECT, WHERE, JOINs, GROUP BY, ORDER BY
Week 2: Aggregations, HAVING, subqueries, UNION
```

### Phase 2: Intermediate (Weeks 3-4)

```
Week 3: Window functions, CTEs, CASE expressions
Week 4: Index fundamentals, EXPLAIN basics
```

### Phase 3: Advanced (Weeks 5-7)

```
Week 5: Query optimization, execution plans
Week 6: Transactions, locking, isolation levels
Week 7: Partitioning, schema design patterns
```

### Phase 4: Production (Weeks 8-10)

```
Week 8: Connection pooling, high availability
Week 9: Migrations, versioning, rollback strategies
Week 10: Monitoring, alerting, performance tuning
```

## Production Patterns

### Schema Design: Star Schema

```sql
-- Fact table (measures, metrics)
CREATE TABLE fact_sales (
    sale_id BIGSERIAL,
    date_key INT NOT NULL REFERENCES dim_date(date_key),
    product_key INT NOT NULL REFERENCES dim_product(product_key),
    customer_key INT NOT NULL REFERENCES dim_customer(customer_key),
    quantity INT NOT NULL,
    unit_price NUMERIC(10,2) NOT NULL,
    total_amount NUMERIC(12,2) NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    -- A primary key on a partitioned table must include the partition key
    PRIMARY KEY (sale_id, date_key)
) PARTITION BY RANGE (date_key);

-- Dimension table (descriptive attributes)
CREATE TABLE dim_product (
    product_key SERIAL PRIMARY KEY,
    product_id VARCHAR(50) NOT NULL,
    product_name VARCHAR(255) NOT NULL,
    category VARCHAR(100),
    subcategory VARCHAR(100),
    -- SCD Type 2 fields
    valid_from DATE NOT NULL DEFAULT CURRENT_DATE,
    valid_to DATE DEFAULT '9999-12-31',
    is_current BOOLEAN DEFAULT TRUE,
    UNIQUE(product_id, valid_from)
);
```
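The SCD Type 2 columns on `dim_product` imply a two-step write pattern that the DDL alone does not show: expire the current row, then insert the new version. A minimal sketch; the product values are illustrative:

```sql
BEGIN;

-- Expire the current version of the changed product
UPDATE dim_product
SET valid_to = CURRENT_DATE - 1,
    is_current = FALSE
WHERE product_id = 'SKU-1001'
  AND is_current;

-- Insert the new version; valid_to and is_current fall back to their defaults
INSERT INTO dim_product (product_id, product_name, category, subcategory, valid_from)
VALUES ('SKU-1001', 'Widget Pro', 'Widgets', 'Premium', CURRENT_DATE);

COMMIT;
```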
### Slow Query Detection

```sql
-- Enable pg_stat_statements
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 slowest queries
SELECT
    ROUND((total_exec_time / 1000)::numeric, 2) AS total_secs,
    calls,
    ROUND((mean_exec_time / 1000)::numeric, 4) AS mean_secs,
    ROUND((stddev_exec_time / 1000)::numeric, 4) AS stddev_secs,
    rows,
    query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Reset stats after optimization
SELECT pg_stat_statements_reset();
```

### Connection Pooling Configuration

```ini
# pgbouncer.ini
[databases]
mydb = host=localhost port=5432 dbname=mydb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt

# Pool modes:
#   session     = 1 connection per client session
#   transaction = 1 connection per transaction (recommended)
#   statement   = 1 connection per statement (limited use)
pool_mode = transaction
default_pool_size = 20
max_client_conn = 1000
reserve_pool_size = 5
```

## Troubleshooting Guide

### Common Failure Modes

| Issue | Symptoms | Root Cause | Fix |
|-------|----------|------------|-----|
| **Slow Query** | Query > 1s | Missing index, bad plan | EXPLAIN ANALYZE, add index |
| **Lock Wait** | Queries hang | Long transaction, deadlock | Check `pg_locks`, kill blocking |
| **Connection Exhausted** | "Too many connections" | No pooling, connection leak | Use pgbouncer, fix app |
| **Disk Full** | Write errors | Table bloat, WAL growth | VACUUM FULL, archive WAL |
| **Replication Lag** | Stale reads | Slow replica, network | Check `pg_stat_replication` |

### Debug Checklist

```sql
-- 1. Check active queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;

-- 2. Find blocking queries
SELECT
    blocked.pid AS blocked_pid,
    blocked.query AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_locks blocked_locks ON blocked.pid = blocked_locks.pid
JOIN pg_locks blocking_locks
    ON blocked_locks.locktype = blocking_locks.locktype
    AND blocked_locks.relation = blocking_locks.relation
    AND blocked_locks.pid != blocking_locks.pid
JOIN pg_stat_activity blocking ON blocking_locks.pid = blocking.pid
WHERE NOT blocked_locks.granted;

-- 3. Check table bloat
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size,
    n_dead_tup AS dead_rows,
    n_live_tup AS live_rows,
    ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- 4. Check index usage
SELECT
    schemaname,
    tablename,
    indexname,
    idx_scan AS times_used,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC  -- Least used first
LIMIT 20;
```

### Kill Long-Running Query

```sql
-- Graceful cancel (SIGINT)
SELECT pg_cancel_backend(pid);

-- Force terminate (SIGTERM)
SELECT pg_terminate_backend(pid);

-- Kill all queries older than 5 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < NOW() - INTERVAL '5 minutes'
  AND pid != pg_backend_pid();
```

## Unit Test Template

```python
import pytest
from sqlalchemy import create_engine, text
from testcontainers.postgres import PostgresContainer


@pytest.fixture(scope="module")
def postgres_container():
    """Spin up real PostgreSQL for integration tests."""
    with PostgresContainer("postgres:16-alpine") as postgres:
        yield postgres


@pytest.fixture
def db_connection(postgres_container):
    """Create connection with transaction rollback."""
    engine = create_engine(postgres_container.get_connection_url())
    with engine.connect() as conn:
        trans = conn.begin()
        yield conn
        trans.rollback()


class TestQueryOptimization:
    def test_index_is_used(self, db_connection):
        # Arrange
        db_connection.execute(text("""
            CREATE TABLE test_users (
                id SERIAL PRIMARY KEY,
                email VARCHAR(255) NOT NULL
            );
            CREATE INDEX idx_email ON test_users(email);
            INSERT INTO test_users (email)
            SELECT 'user' || i || '@test.com' FROM generate_series(1, 10000) i;
            ANALYZE test_users;
        """))

        # Act
        result = db_connection.execute(text("""
            EXPLAIN (FORMAT JSON)
            SELECT * FROM test_users WHERE email = 'user5000@test.com'
        """)).fetchone()
        plan = result[0][0]

        # Assert
        assert "Index Scan" in str(plan) or "Index Only Scan" in str(plan)

    def test_window_function_correctness(self, db_connection):
        # Arrange
        db_connection.execute(text("""
            CREATE TABLE test_sales (id SERIAL, amount NUMERIC);
            INSERT INTO test_sales (amount) VALUES (100), (200), (300);
        """))

        # Act
        result = db_connection.execute(text("""
            SELECT amount, SUM(amount) OVER (ORDER BY id) AS running_total
            FROM test_sales ORDER BY id
        """)).fetchall()

        # Assert
        assert result[0][1] == 100  # First running total
        assert result[1][1] == 300  # 100 + 200
        assert result[2][1] == 600  # 100 + 200 + 300
```

## Best Practices

### Query Writing

```sql
-- ✅ DO: Use explicit JOINs
SELECT u.name, o.total
FROM users u
INNER JOIN orders o ON u.id = o.user_id;

-- ❌ DON'T: Use implicit joins
SELECT u.name, o.total
FROM users u, orders o
WHERE u.id = o.user_id;

-- ✅ DO: Use EXISTS for existence checks
SELECT * FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);

-- ❌ DON'T: Use IN with subquery for large datasets
SELECT * FROM users WHERE id IN (SELECT user_id FROM orders);

-- ✅ DO: Paginate with keyset (cursor)
SELECT * FROM events
WHERE id > $last_seen_id
ORDER BY id
LIMIT 100;

-- ❌ DON'T: Use OFFSET for deep pagination
SELECT * FROM events ORDER BY id LIMIT 100 OFFSET 1000000;  -- Slow!
```

### Schema Design

```sql
-- ✅ DO: Use appropriate data types
CREATE TABLE metrics (
    id BIGSERIAL PRIMARY KEY,   -- Use BIGSERIAL for high-volume
    value NUMERIC(10,2),        -- Precise decimal
    created_at TIMESTAMPTZ      -- Always use timezone-aware
);

-- ✅ DO: Add constraints
ALTER TABLE orders ADD CONSTRAINT chk_positive_amount CHECK (amount > 0);

-- ✅ DO: Use JSONB for flexible schema
CREATE TABLE events (
    id BIGSERIAL PRIMARY KEY,
    event_type TEXT NOT NULL,
    payload JSONB NOT NULL DEFAULT '{}'
);
```
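The JSONB `payload` column ties back to the GIN index pattern from the indexing section; containment is the query shape `jsonb_path_ops` is built for. A minimal sketch against the `events` table above; the index name and key names are illustrative:

```sql
-- jsonb_path_ops is smaller and faster than the default opclass,
-- but supports fewer operators (containment @> and jsonpath matches)
CREATE INDEX idx_events_payload_gin ON events USING GIN (payload jsonb_path_ops);

-- Containment query that can use the index
SELECT id, event_type, payload
FROM events
WHERE payload @> '{"country": "DE", "plan": "pro"}';

-- Extract a single field as text for aggregation (does not use the GIN index)
SELECT payload->>'country' AS country, COUNT(*) AS event_count
FROM events
GROUP BY 1;
```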
## Resources

### Official Documentation

- [PostgreSQL Docs](https://www.postgresql.org/docs/)
- [MySQL Reference Manual](https://dev.mysql.com/doc/refman/8.0/en/)
- [SQL Standard](https://modern-sql.com/)

### Performance Tuning

- [Use The Index, Luke](https://use-the-index-luke.com/)
- [PostgreSQL EXPLAIN Visualizer](https://explain.dalibo.com/)
- [pgMustard](https://www.pgmustard.com/)

### Books

- "SQL Performance Explained" by Markus Winand
- "High Performance MySQL" by Schwartz et al.
- "The Art of PostgreSQL" by Dimitri Fontaine

## Next Skills

After mastering SQL databases:

- → `data-warehousing` - Snowflake, BigQuery, dimensional modeling
- → `etl-tools` - Build pipelines with Airflow
- → `nosql-databases` - MongoDB, Redis, DynamoDB
- → `big-data` - Spark SQL at scale

---

**Skill Certification Checklist:**

- [ ] Can write complex queries with CTEs and window functions
- [ ] Can analyze and optimize query performance with EXPLAIN
- [ ] Can design normalized schemas (3NF, BCNF)
- [ ] Can implement partitioning strategies
- [ ] Can configure connection pooling and monitoring