--- name: dbt-modeling description: Writing dbt models with proper CTE patterns, SQL structure, and layer-specific templates. Use this skill when writing or refactoring dbt models, implementing CTE patterns, creating staging/intermediate/mart models, or ensuring proper SQL structure and dependencies. --- # dbt Modeling ## Purpose Transform AI agents into experts on writing production-quality dbt models, providing guidance on CTE patterns, SQL structure, and best practices for creating maintainable and performant data models across all medallion architecture layers. ## When to Use This Skill Activate this skill when users ask about: - Writing or refactoring dbt models - Implementing CTE (Common Table Expression) patterns - Creating staging, intermediate, or mart models - Structuring SQL for readability and maintainability - Implementing proper model dependencies with ref() and source() - Converting existing SQL to dbt model format - Debugging model SQL structure issues --- ## CTE Pattern Structure All dbt models should follow this consistent CTE pattern for readability and maintainability: ```sql -- Import CTEs (staging and intermediate models) with customers as ( select * from {{ ref('stg_customers') }} ), orders as ( select * from {{ ref('stg_orders') }} ), -- Logical CTEs (business logic) customer_metrics as ( select customer_id, count(*) as order_count, sum(order_amount) as lifetime_value from orders group by customer_id ), -- Final CTE (column selection and standardization) final as ( select -- Primary key customers.customer_id, -- Attributes customers.customer_name, customers.customer_email, -- Metrics coalesce(customer_metrics.order_count, 0) as lifetime_orders, coalesce(customer_metrics.lifetime_value, 0) as lifetime_value, -- Metadata current_timestamp() as dbt_updated_at from customers left join customer_metrics on customers.customer_id = customer_metrics.customer_id ) select * from final ``` **Official dbt Documentation**: [How we structure our dbt projects](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview) --- ## Bronze Layer: Staging Model Template **Purpose**: One-to-one with source tables. Clean and standardize only. **Materialization**: `ephemeral` (set at folder level in dbt_project.yml) ### Basic Staging Model ```sql -- models/bronze/stg_salesforce__accounts.sql select -- Primary key id as account_id, -- Attributes name as account_name, type as account_type, industry, -- Standardized fields upper(trim(email)) as email_clean, cast(annual_revenue as number) as annual_revenue, -- Metadata current_timestamp() as dbt_loaded_at from {{ source('salesforce', 'accounts') }} ``` ### Staging Model with Light Cleaning ```sql -- models/bronze/stg_ecommerce__customers.sql select -- Primary key customer_id, -- Attributes (cleaned) trim(first_name) as first_name, trim(last_name) as last_name, lower(trim(email)) as email, -- Phone standardization regexp_replace(phone, '[^0-9]', '') as phone_clean, -- Boolean conversions case when status = 'active' then true else false end as is_active, -- Date standardization cast(created_at as timestamp) as created_at, -- Metadata current_timestamp() as dbt_loaded_at from {{ source('ecommerce', 'customers') }} ``` ### Staging Rules ✅ **DO**: - Use `{{ source() }}` for source references - Rename columns to standard naming conventions - Cast data types explicitly - Clean data (trim, upper/lower, standardize formats) - Add metadata columns (dbt_loaded_at, dbt_updated_at) ❌ **DON'T**: - Join multiple sources - Add business logic or calculations - Aggregate or group data - Filter rows (except for obvious bad data) - Hard-code table names --- ## Silver Layer: Intermediate Model Template **Purpose**: Reusable business logic, enrichment, and complex transformations. **Materialization**: `ephemeral` or `table` (set at folder level, override if needed) ### Basic Intermediate Model ```sql -- models/silver/int_customers__with_orders.sql with customers as ( select * from {{ ref('stg_salesforce__customers') }} ), orders as ( select * from {{ ref('stg_ecommerce__orders') }} ), customer_order_metrics as ( select customer_id, count(distinct order_id) as total_orders, sum(order_amount) as lifetime_value, min(order_date) as first_order_date, max(order_date) as last_order_date from orders group by customer_id ), final as ( select c.customer_id, c.customer_name, c.customer_email, coalesce(m.total_orders, 0) as total_orders, coalesce(m.lifetime_value, 0) as lifetime_value, m.first_order_date, m.last_order_date, datediff(day, m.first_order_date, m.last_order_date) as customer_tenure_days from customers c left join customer_order_metrics m on c.customer_id = m.customer_id ) select * from final ``` ### Complex Intermediate Model with Business Logic ```sql -- models/silver/int_customers__segmented.sql with customer_metrics as ( select * from {{ ref('int_customers__with_orders') }} ), rfm_scores as ( select customer_id, -- Recency (days since last order) datediff(day, last_order_date, current_date()) as recency_days, -- Frequency total_orders as frequency, -- Monetary lifetime_value as monetary, -- Quartile scoring ntile(4) over (order by datediff(day, last_order_date, current_date()) desc) as recency_score, ntile(4) over (order by total_orders) as frequency_score, ntile(4) over (order by lifetime_value) as monetary_score from customer_metrics where last_order_date is not null ), final as ( select customer_id, recency_days, frequency, monetary, recency_score, frequency_score, monetary_score, -- RFM segment case when recency_score >= 3 and frequency_score >= 3 and monetary_score >= 3 then 'Champions' when recency_score >= 3 and frequency_score >= 2 then 'Loyal Customers' when recency_score >= 3 and monetary_score >= 3 then 'Big Spenders' when recency_score <= 2 and frequency_score >= 3 then 'At Risk' when recency_score <= 1 then 'Lost' else 'Regular' end as customer_segment from rfm_scores ) select * from final ``` ### Intermediate Rules ✅ **DO**: - Reference staging and other intermediate models with `{{ ref() }}` - Add business logic and calculations - Create reusable components - Use descriptive CTE names - Group related logic into CTEs ❌ **DON'T**: - Reference `{{ source() }}` directly - Add presentation-layer logic (save for marts) - Create one-off transformations (ensure reusability) --- ## Gold Layer: Dimension Model Template **Purpose**: Business entities ready for BI tools. **Materialization**: `table` (set at folder level) ### Basic Dimension ```sql -- models/gold/dim_customers.sql with customers as ( select * from {{ ref('int_customers__segmented') }} ) select -- Primary key customer_id, -- Attributes customer_name, customer_email, -- Metrics total_orders, lifetime_value, first_order_date, last_order_date, customer_tenure_days, -- Segmentation customer_segment, -- Business classification case when customer_segment = 'Champions' then 'High Value' when customer_segment in ('Loyal Customers', 'Big Spenders') then 'Medium Value' else 'Low Value' end as customer_value_tier, -- Flags case when last_order_date >= dateadd(day, -90, current_date()) then true else false end as is_active_90d, case when total_orders = 1 then true else false end as is_one_time_buyer, -- Metadata current_timestamp() as dbt_updated_at from customers ``` ### Dimension with Type 0 SCD (Ghost Records) ```sql -- models/gold/dim_products.sql -- Includes ghost key for unknown/missing products with products as ( select * from {{ ref('int_products__enriched') }} ), ghost_key as ( select -1 as product_id, 'Unknown' as product_name, 'Unknown' as product_category, 0.00 as product_price, false as is_active, current_timestamp() as dbt_updated_at ), final as ( select * from products union all select * from ghost_key ) select * from final ``` ### Dimension Rules ✅ **DO**: - Include primary key as first column - Add business-friendly attributes - Include calculated flags and classifications - Add metadata columns - Document all columns in schema.yml - Test with `dbt_constraints.primary_key` ❌ **DON'T**: - Include transaction-level data (that's for facts) - Create overly wide tables (be selective) --- ## Gold Layer: Fact Model Template **Purpose**: Business processes and transactions. **Materialization**: `table` or `incremental` (override at model level for incremental) ### Basic Fact Table ```sql -- models/gold/fct_orders.sql with orders as ( select * from {{ ref('stg_ecommerce__orders') }} ), customers as ( select customer_id from {{ ref('dim_customers') }} ), products as ( select product_id from {{ ref('dim_products') }} ) select -- Primary key orders.order_id, -- Foreign keys coalesce(customers.customer_id, -1) as customer_id, -- Ghost key for unknown coalesce(products.product_id, -1) as product_id, -- Attributes orders.order_date, orders.order_status, -- Metrics orders.order_quantity, orders.order_amount, orders.discount_amount, orders.tax_amount, orders.total_amount, -- Metadata current_timestamp() as dbt_updated_at from orders left join customers on orders.customer_id = customers.customer_id left join products on orders.product_id = products.product_id ``` ### Incremental Fact Table ```sql -- models/gold/fct_order_lines.sql {{ config( materialized='incremental', unique_key='order_line_id', incremental_strategy='merge', merge_exclude_columns=['dbt_inserted_at'], cluster_by=['order_date'] ) }} with order_lines as ( select * from {{ ref('stg_ecommerce__order_lines') }} ) select -- Primary key order_line_id, -- Foreign keys order_id, product_id, customer_id, -- Attributes order_date, line_number, -- Metrics quantity, unit_price, discount_percent, line_total, -- Metadata {% if is_incremental() %} dbt_inserted_at, -- Preserve from merge_exclude_columns {% else %} current_timestamp() as dbt_inserted_at, {% endif %} current_timestamp() as dbt_updated_at from order_lines {% if is_incremental() %} where order_date > (select max(order_date) from {{ this }}) {% endif %} ``` ### Fact Rules ✅ **DO**: - Include all foreign keys to dimensions - Use ghost keys (-1) for unknown/missing references - Include metrics and measures - Use incremental for large tables (millions+ rows) - Add clustering keys for large tables - Test with `dbt_constraints.foreign_key` ❌ **DON'T**: - Denormalize dimension attributes into facts (use foreign keys) - Skip foreign key tests --- ## Model Configuration Strategy ### Folder-Level First (in dbt_project.yml) Most configuration should be at the folder level: ```yaml models: your_project: bronze: +materialized: ephemeral +tags: ["bronze", "staging"] silver: +materialized: ephemeral +tags: ["silver"] gold: +materialized: table +tags: ["gold", "marts"] ``` ### Model-Level Only for Unique Requirements Add `{{ config() }}` ONLY when overriding folder defaults: ```sql -- Only for incremental-specific settings {{ config( materialized='incremental', unique_key='order_id', incremental_strategy='merge', cluster_by=['order_date'] ) }} ``` --- ## Common Modeling Patterns ### Handling NULL Values ```sql -- Use COALESCE for safety select customer_id, coalesce(total_orders, 0) as total_orders, coalesce(lifetime_value, 0.00) as lifetime_value from {{ ref('customer_metrics') }} ``` ### Window Functions for Ranking ```sql -- Use QUALIFY in Snowflake for cleaner code select customer_id, order_date, order_amount, row_number() over (partition by customer_id order by order_date desc) as order_rank from {{ ref('stg_orders') }} qualify order_rank <= 5 -- Top 5 orders per customer ``` ### Conditional Aggregation ```sql select customer_id, count(*) as total_orders, sum(case when order_status = 'completed' then 1 else 0 end) as completed_orders, sum(case when order_status = 'cancelled' then 1 else 0 end) as cancelled_orders from {{ ref('stg_orders') }} group by customer_id ``` --- ## Helping Users with Modeling ### Strategy for Assisting Users When users ask for modeling help: 1. **Understand the goal**: What business question does this answer? 2. **Identify the layer**: Bronze/silver/gold based on purpose 3. **Recommend structure**: CTEs, column organization, logic flow 4. **Apply naming conventions**: Proper prefixes and column names 5. **Provide complete example**: Working code they can adapt 6. **Suggest tests**: Appropriate constraints and validations ### Common User Questions **"How do I write this model?"** - Identify source data (staging models) - Break logic into CTEs (import → logic → final) - Apply column naming standards - Add appropriate tests **"How do I join these tables?"** - Use CTE pattern (import CTEs at top) - Perform joins in logical CTEs - Select final columns in final CTE - Use `ref()` for all dbt model references **"Should this be ephemeral or table?"** - Ephemeral: Staging, reusable intermediate logic - Table: Dimensions, complex silver, production marts - Incremental: Large facts (millions+ rows) --- ## Related Official Documentation - [dbt Best Practices: How We Structure Our dbt Projects](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview) - [dbt Docs: Building Models](https://docs.getdbt.com/docs/build/models) - [dbt Docs: Jinja & Macros](https://docs.getdbt.com/docs/build/jinja-macros) --- **Goal**: Transform AI agents into expert dbt modelers who write clean, maintainable, production-quality SQL that follows industry best practices and is easy for teams to understand and extend.