--- name: ELT Modeling description: Comprehensive guide to ELT (Extract, Load, Transform) modeling patterns, dimensional modeling, fact and dimension tables, and data warehouse design --- # ELT Modeling ## ELT vs ETL ### ETL (Traditional) ``` Extract → Transform → Load 1. Extract from source 2. Transform in ETL tool (Informatica, Talend) 3. Load to warehouse Pros: Clean data before loading Cons: Slow, expensive transformation servers ``` ### ELT (Modern) ``` Extract → Load → Transform 1. Extract from source 2. Load raw data to warehouse 3. Transform in warehouse (SQL, dbt) Pros: Fast loading, leverage warehouse power Cons: Raw data in warehouse (need governance) ``` ### Why ELT? - **Cloud warehouses:** Snowflake, BigQuery, Redshift (powerful, scalable) - **Separation of concerns:** EL (Fivetran, Airbyte) + T (dbt) - **Flexibility:** Transform multiple ways from same raw data - **Speed:** Parallel processing in warehouse --- ## Dimensional Modeling ### Star Schema ``` dim_customers | dim_products | fct_orders ← Central fact table | dim_dates | dim_locations ``` **Characteristics:** - One fact table (center) - Multiple dimension tables (points) - Denormalized (fast queries) ### Snowflake Schema ``` dim_customers | dim_customer_segments | fct_orders | dim_products | dim_product_categories ``` **Characteristics:** - Normalized dimensions - Less redundancy - More joins (slower queries) **Recommendation:** Use star schema for analytics (faster) --- ## Fact Tables ### Definition **Fact table:** Stores measurable events (transactions, orders, clicks) ### Characteristics - **Large:** Millions to billions of rows - **Numeric measures:** Amounts, quantities, counts - **Foreign keys:** Links to dimensions - **Grain:** Level of detail (one row per order, per day, etc.) ### Types of Facts **Transaction Facts:** ```sql -- One row per transaction fct_orders: order_id (PK) customer_id (FK) product_id (FK) order_date_id (FK) quantity amount ``` **Periodic Snapshot Facts:** ```sql -- One row per period (daily, monthly) fct_inventory_daily: date_id (PK) product_id (PK) warehouse_id (PK) quantity_on_hand quantity_sold ``` **Accumulating Snapshot Facts:** ```sql -- One row per process (order lifecycle) fct_order_lifecycle: order_id (PK) order_date payment_date shipment_date delivery_date days_to_ship days_to_deliver ``` ### Fact Table Design ```sql CREATE TABLE fct_orders ( -- Surrogate key order_key BIGINT PRIMARY KEY, -- Natural key order_id VARCHAR(50) NOT NULL, -- Foreign keys (dimensions) customer_key BIGINT NOT NULL, product_key BIGINT NOT NULL, date_key INT NOT NULL, -- Degenerate dimensions (no separate dim table) order_number VARCHAR(50), -- Measures quantity INT, unit_price DECIMAL(10,2), total_amount DECIMAL(10,2), discount_amount DECIMAL(10,2), -- Audit columns created_at TIMESTAMP, updated_at TIMESTAMP ); ``` --- ## Dimension Tables ### Definition **Dimension table:** Stores descriptive attributes (who, what, where, when) ### Characteristics - **Small:** Thousands to millions of rows - **Descriptive:** Text, categories, hierarchies - **Slowly changing:** Updates over time (SCD) ### Dimension Design ```sql CREATE TABLE dim_customers ( -- Surrogate key customer_key BIGINT PRIMARY KEY, -- Natural key customer_id VARCHAR(50) NOT NULL, -- Attributes customer_name VARCHAR(255), email VARCHAR(255), phone VARCHAR(50), -- Hierarchies city VARCHAR(100), state VARCHAR(100), country VARCHAR(100), region VARCHAR(100), -- Segments customer_segment VARCHAR(50), customer_tier VARCHAR(50), -- SCD Type 2 columns effective_date DATE, expiration_date DATE, is_current BOOLEAN, -- Audit columns created_at TIMESTAMP, updated_at TIMESTAMP ); ``` ### Slowly Changing Dimensions (SCD) **Type 1: Overwrite** ```sql -- Customer moves, update address UPDATE dim_customers SET city = 'New York', state = 'NY' WHERE customer_id = '123'; -- No history preserved ``` **Type 2: Add New Row (Historical)** ```sql -- Customer moves, add new row INSERT INTO dim_customers ( customer_id, city, state, effective_date, is_current ) VALUES ( '123', 'New York', 'NY', '2024-01-15', TRUE ); -- Mark old row as not current UPDATE dim_customers SET is_current = FALSE, expiration_date = '2024-01-14' WHERE customer_id = '123' AND is_current = TRUE; -- History preserved ``` **Type 3: Add New Column** ```sql -- Track previous value ALTER TABLE dim_customers ADD COLUMN previous_city VARCHAR(100); UPDATE dim_customers SET previous_city = city, city = 'New York' WHERE customer_id = '123'; ``` --- ## Grain Definition ### What is Grain? **Grain:** Level of detail in fact table (one row represents what?) ### Examples ``` Order grain: One row per order Order line grain: One row per product in order Daily grain: One row per day per product Hourly grain: One row per hour per product ``` ### Choosing Grain ``` Too fine (order line): - More rows - More storage - More flexibility Too coarse (monthly): - Fewer rows - Less storage - Less flexibility Balance: Choose finest grain needed for analysis ``` --- ## Kimball vs Inmon ### Kimball (Bottom-Up, Dimensional) ``` Data Marts (Star Schemas) → Enterprise Data Warehouse Pros: - Fast to implement - Business-friendly (denormalized) - Query performance Cons: - Data redundancy - Hard to change ``` ### Inmon (Top-Down, Normalized) ``` Enterprise Data Warehouse (3NF) → Data Marts Pros: - Single source of truth - Flexible - Less redundancy Cons: - Slow to implement - Complex queries (many joins) ``` **Modern Approach:** Hybrid (normalized staging, dimensional marts) --- ## Data Vault ### Components ``` Hubs: Business keys (customers, products) Links: Relationships (orders link customers + products) Satellites: Descriptive attributes (customer details) ``` ### Example ```sql -- Hub: Customer CREATE TABLE hub_customer ( customer_hub_key BIGINT PRIMARY KEY, customer_id VARCHAR(50) UNIQUE, load_date TIMESTAMP, record_source VARCHAR(50) ); -- Satellite: Customer Details CREATE TABLE sat_customer_details ( customer_hub_key BIGINT, load_date TIMESTAMP, customer_name VARCHAR(255), email VARCHAR(255), phone VARCHAR(50), PRIMARY KEY (customer_hub_key, load_date) ); -- Link: Order CREATE TABLE link_order ( order_link_key BIGINT PRIMARY KEY, customer_hub_key BIGINT, product_hub_key BIGINT, load_date TIMESTAMP ); ``` **Pros:** - Audit trail - Flexible - Handles source changes **Cons:** - Complex - Many joins - Steep learning curve --- ## Conformed Dimensions ### Definition **Conformed dimension:** Shared across multiple fact tables ### Example ```sql -- dim_date used by multiple facts dim_date ← fct_orders ← fct_shipments ← fct_returns -- Ensures consistent date attributes ``` ### Benefits - **Consistency:** Same date logic everywhere - **Drill-across:** Compare metrics across facts - **Reusability:** Build once, use many times --- ## Bridge Tables ### Many-to-Many Relationships ```sql -- Customer can have multiple accounts -- Account can have multiple customers CREATE TABLE bridge_customer_account ( customer_key BIGINT, account_key BIGINT, allocation_percentage DECIMAL(5,2), PRIMARY KEY (customer_key, account_key) ); ``` --- ## Junk Dimensions ### Definition **Junk dimension:** Collection of low-cardinality flags ### Example ```sql -- Instead of many boolean columns in fact CREATE TABLE dim_order_flags ( order_flag_key INT PRIMARY KEY, is_gift BOOLEAN, is_express_shipping BOOLEAN, is_first_order BOOLEAN, has_discount BOOLEAN ); -- Fact table references junk dimension fct_orders: order_key customer_key order_flag_key ← Reference to junk dimension amount ``` --- ## Role-Playing Dimensions ### Definition **Role-playing dimension:** Same dimension used multiple times with different meanings ### Example ```sql -- dim_date used multiple times fct_orders: order_key order_date_key ← dim_date (order date) ship_date_key ← dim_date (ship date) delivery_date_key ← dim_date (delivery date) amount ``` --- ## Surrogate Keys ### Natural Key vs Surrogate Key ```sql -- Natural key: Business identifier customer_id = 'CUST-12345' -- Surrogate key: System-generated customer_key = 1001 (auto-increment or hash) ``` ### Why Surrogate Keys? - **Performance:** Integer joins faster than string joins - **SCD Type 2:** Multiple rows for same natural key - **Independence:** Source system can change natural key - **Consistency:** Same format across all tables ### Generating Surrogate Keys ```sql -- Auto-increment customer_key BIGINT AUTO_INCREMENT PRIMARY KEY -- Hash (dbt) {{ dbt_utils.surrogate_key(['customer_id', 'effective_date']) }} -- Sequence (PostgreSQL) customer_key BIGINT DEFAULT nextval('customer_key_seq') ``` --- ## Data Warehouse Layers ### Bronze (Raw) ``` Purpose: Exact copy of source data Format: As-is from source Schema: Source schema Example: raw_salesforce_accounts ``` ### Silver (Cleaned) ``` Purpose: Cleaned, standardized Format: Consistent types, naming Schema: Staging schema Example: stg_salesforce_accounts ``` ### Gold (Analytics) ``` Purpose: Business-ready Format: Dimensional models Schema: Analytics schema Example: dim_customers, fct_orders ``` --- ## Best Practices ### 1. Define Grain Clearly ``` ✓ "One row per order" ✗ "Order data" ``` ### 2. Use Surrogate Keys ``` ✓ customer_key BIGINT ✗ customer_id VARCHAR(50) ``` ### 3. Denormalize Dimensions ``` ✓ dim_customers includes city, state, country ✗ Separate dim_cities, dim_states, dim_countries ``` ### 4. Keep Facts Narrow ``` ✓ fct_orders: Keys + measures only ✗ fct_orders: Keys + measures + customer name, product name ``` ### 5. Use Conformed Dimensions ``` ✓ dim_date shared across all facts ✗ Each fact has own date dimension ``` --- ## Common Patterns ### Daily Snapshot ```sql -- Capture state once per day INSERT INTO fct_inventory_daily SELECT CURRENT_DATE as snapshot_date, product_id, warehouse_id, quantity_on_hand, quantity_reserved FROM current_inventory; ``` ### Cumulative Metrics ```sql -- Running totals SELECT order_date, SUM(amount) OVER ( ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) as cumulative_revenue FROM fct_orders; ``` --- ## Summary **ELT:** Extract → Load → Transform (modern approach) **Dimensional Modeling:** - Star schema (recommended) - Fact tables (measures) - Dimension tables (attributes) **Fact Types:** - Transaction (one per event) - Periodic snapshot (one per period) - Accumulating snapshot (one per process) **SCD Types:** - Type 1: Overwrite - Type 2: Historical (recommended) - Type 3: Previous value **Grain:** Level of detail (define clearly!) **Surrogate Keys:** Use for performance and SCD **Layers:** - Bronze: Raw - Silver: Cleaned - Gold: Analytics-ready **Best Practices:** - Define grain - Use surrogate keys - Denormalize dimensions - Keep facts narrow - Use conformed dimensions