--- name: dbt-migration-validation description: Comprehensive validation skill for dbt models and schema YAML files. Defines validation rules, common anti-patterns to detect, and auto-fix suggestions. Integrates with Claude Code hooks to enforce quality standards during migration. --- # dbt Migration Validation Skill ## Purpose Define and enforce validation rules for dbt models during migration. This skill provides comprehensive validation rules, common anti-patterns to detect, and auto-fix suggestions that are implemented by the validation hooks. ## When to Use This Skill Activate this skill when: - Reviewing dbt models for quality issues - Diagnosing validation hook failures - Understanding validation rules and their rationale - Looking up auto-fix suggestions for common issues - Configuring validation thresholds --- ## Validation Rules Reference ### Schema YAML Rules #### YAML001: Model Description Required **Severity:** Error **Description:** Every model must have a description in the schema YAML file. **Rationale:** Descriptions are essential for documentation, lineage understanding, and team collaboration. They appear in dbt docs and help stakeholders understand data assets. **Detection:** Model entry exists but `description` field is missing or empty. **Fix:** ```yaml ## Before models: - name: dim_customers columns: - name: customer_id ## After models: - name: dim_customers description: | Customer dimension containing customer attributes, contact info, and segmentation. Updated daily from CRM system. Source: CRM Database Owner: Analytics Team columns: - name: customer_id ``` --- #### YAML002: Primary Key Test Required **Severity:** Error **Description:** Columns that appear to be primary keys (ending in `_id`, `_key`, `_sk`) must have a `dbt_constraints.primary_key` test. **Rationale:** Primary key validation ensures data integrity. The `dbt_constraints` package creates actual database constraints for validation. **Detection:** Column name matches primary key pattern but lacks required test. **Primary Key Patterns:** - `*_id` (e.g., `customer_id`, `order_id`) - `*_key` (e.g., `surrogate_key`, `natural_key`) - `*_sk` (e.g., `customer_sk`) - `id` (exact match) **Fix:** ```yaml ## Before columns: - name: customer_id description: "Unique customer identifier" ## After columns: - name: customer_id description: "Unique customer identifier" data_type: integer tests: - dbt_constraints.primary_key ``` --- #### YAML003: Foreign Key Relationship Test **Severity:** Warning **Description:** Columns that appear to be foreign keys should have a `relationships` or `dbt_constraints.foreign_key` test. **Rationale:** Relationship tests validate referential integrity between tables. **Detection:** Column name matches foreign key pattern but lacks relationship test. **Foreign Key Patterns:** - `fk_*` (e.g., `fk_customer`) - `*_fk` (e.g., `customer_fk`) **Fix:** ```yaml ## Before columns: - name: fk_customer_id description: "Reference to customer" ## After columns: - name: fk_customer_id description: "Reference to customer" tests: - relationships: to: ref('dim_customers') field: customer_id ``` --- #### YAML004: Column Description **Severity:** Warning **Description:** All columns should have descriptions. **Rationale:** Column descriptions improve documentation and make data assets more discoverable. **Fix:** ```yaml ## Before columns: - name: signup_date ## After columns: - name: signup_date description: "Date the customer signed up for service" ``` --- #### YAML005: Model Naming Convention **Severity:** Error **Description:** Model names must follow layer-specific naming conventions. **Conventions by Layer:** | Layer | Prefix | Pattern | Example | | ------------------- | ------------------------------- | ----------------------------- | ---------------------------- | | Bronze/Staging | `stg_` | `stg_{source}__{table}` | `stg_sqlserver__customers` | | Silver/Intermediate | `int_`, `lookup_` | `int_{entity}__{description}` | `int_customers__with_orders` | | Gold/Mart | `dim_`, `fct_`, `mart_`, `agg_` | `dim_{entity}` | `dim_customers` | **Fix:** ```yaml ## Before (in gold layer) models: - name: customers ## After models: - name: dim_customers ``` --- #### YAML006: Column Data Type **Severity:** Warning **Description:** All columns should have `data_type` specified. **Rationale:** Explicit data types improve documentation and enable contract testing. **Fix:** ```yaml ## Before columns: - name: amount description: "Order amount" ## After columns: - name: amount description: "Order amount in USD" data_type: number(18,2) ``` --- ### SQL Model Rules #### SQL001: Config Block **Severity:** Warning **Description:** Models should have a config block specifying materialization. **Rationale:** Explicit configuration makes materialization strategy clear and enables per-model customization. **Fix:** ```sql -- Before select * from {{ ref('stg_customers') }} -- After {{ config( materialized='table', tags=['gold', 'customer'] ) }} select * from {{ ref('stg_customers') }} ``` --- #### SQL002: CTE Pattern **Severity:** Warning **Description:** Models should use the standard CTE pattern for readability. **Pattern:** 1. Import CTEs - Reference source data 2. Logical CTEs - Apply transformations 3. Final CTE - Prepare output 4. Final SELECT from final CTE **Fix:** ```sql -- Before select c.customer_id, c.customer_name, count(o.order_id) as order_count from {{ ref('stg_customers') }} c left join {{ ref('stg_orders') }} o on c.customer_id = o.customer_id group by 1, 2 -- After with customers as ( select * from {{ ref('stg_customers') }} ), orders as ( select * from {{ ref('stg_orders') }} ), customer_orders as ( select c.customer_id, c.customer_name, count(o.order_id) as order_count from customers c left join orders o on c.customer_id = o.customer_id group by c.customer_id, c.customer_name ), final as ( select customer_id, customer_name, order_count from customer_orders ) select * from final ``` --- #### SQL003: No SELECT \* in Final Output **Severity:** Error **Description:** The final query should explicitly list columns, not use `SELECT *`. **Rationale:** Explicit columns make the contract clear, prevent accidental exposure of new columns, and improve query performance. **Exception:** `SELECT * FROM final` is acceptable when `final` CTE explicitly lists columns. **Fix:** ```sql -- Before (problematic) select * from {{ ref('stg_customers') }} -- After select customer_id, customer_name, email, signup_date from {{ ref('stg_customers') }} ``` --- #### SQL004: Use ref() and source() **Severity:** Error **Description:** All table references must use `{{ ref() }}` or `{{ source() }}`, not hardcoded table names. **Rationale:** Using ref() and source() enables: - Automatic dependency tracking - Environment-aware table resolution - Proper lineage documentation **Detection:** Pattern `FROM schema.table` or `JOIN database.schema.table` without Jinja braces. **Fix:** ```sql -- Before select * from raw_data.customers -- After (for source tables) select * from {{ source('raw_data', 'customers') }} -- After (for dbt models) select * from {{ ref('stg_customers') }} ``` --- #### SQL005: Migration Header Comment **Severity:** Warning **Description:** Migrated models should include a header comment documenting the original source and conversion notes. **Detection:** Model contains migration indicators but lacks proper header. **Required Header Elements:** - Original object name and schema - Source platform - Migration date - Conversion notes - Breaking changes (if any) **Template:** ```sql /* Original Object: {schema}.{object_name} Source Platform: {SQL Server|Oracle|Teradata|etc.} Original Type: {Stored Procedure|View|Function} Migration Date: YYYY-MM-DD Conversion Notes: - Replaced ISNULL() with COALESCE() - Converted TOP to LIMIT - Removed NOLOCK hints Breaking Changes: - Output column 'old_name' renamed to 'new_name' */ {{ config(materialized='table') }} -- Model implementation ``` --- #### SQL006: Snowflake-Incompatible Syntax **Severity:** Error **Description:** Model contains SQL syntax that is not compatible with Snowflake. **Common Issues:** | Pattern | Platform | Snowflake Equivalent | | -------------------- | ---------- | ------------------------ | | `TOP N` | SQL Server | `LIMIT N` | | `ISNULL(a, b)` | SQL Server | `COALESCE(a, b)` | | `GETDATE()` | SQL Server | `CURRENT_TIMESTAMP()` | | `LEN(s)` | SQL Server | `LENGTH(s)` | | `CHARINDEX(a, b)` | SQL Server | `POSITION(a IN b)` | | `CONVERT(type, val)` | SQL Server | `CAST(val AS type)` | | `WITH (NOLOCK)` | SQL Server | (remove) | | `@@ROWCOUNT` | SQL Server | (use different approach) | | `ROWNUM` | Oracle | `ROW_NUMBER() OVER()` | | `DECODE(...)` | Oracle | `CASE WHEN...` | | `CONNECT BY` | Oracle | Recursive CTE | | `SYSDATE` | Oracle | `CURRENT_DATE()` | | `SEL` | Teradata | `SELECT` | | Backticks | MySQL | Double quotes | **Fix Examples:** ```sql -- SQL Server Before SELECT TOP 100 * FROM customers WITH (NOLOCK) WHERE ISNULL(email, '') = '' AND GETDATE() > signup_date -- Snowflake After SELECT * FROM customers WHERE COALESCE(email, '') = '' AND CURRENT_TIMESTAMP() > signup_date LIMIT 100 ``` --- ## Anti-Patterns to Detect ### Common Anti-Patterns #### 1. Direct Source References **Problem:** Using hardcoded table names instead of `source()` or `ref()`. **Impact:** Breaks dependency tracking, environment portability, and lineage. #### 2. Generic Column Names **Problem:** Columns named `col1`, `field1`, `temp`, etc. **Impact:** Poor documentation, confusing for consumers. #### 3. Mixed Naming Conventions **Problem:** Inconsistent casing or naming patterns within a model. **Impact:** Confusion, maintenance difficulty. #### 4. Missing Tests on Key Columns **Problem:** Primary/foreign keys without uniqueness or relationship tests. **Impact:** Data quality issues may go undetected. #### 5. Overly Complex Models **Problem:** Models with excessive CTEs, complex logic, or doing too much. **Impact:** Hard to maintain, test, and understand. #### 6. Platform-Specific Syntax **Problem:** SQL syntax from source database that won't work in Snowflake. **Impact:** Runtime errors, compilation failures. --- ## Hook Integration ### Validation Hook Configuration Hooks are configured in `.claude/settings.local.json`: ```json { "hooks": { "PostToolUse": [ { "matcher": "Write", "hooks": [ { "type": "command", "command": "python3 .claude/hooks/dbt-validation/validate_file.py \"$FILE_PATH\"", "timeout": 30000 } ] }, { "matcher": "Edit", "hooks": [ { "type": "command", "command": "python3 .claude/hooks/dbt-validation/validate_file.py \"$FILE_PATH\"", "timeout": 30000 } ] } ] } } ``` ### Exit Codes - **Exit 0**: Validation passed (or file not in scope) - **Exit 1**: Validation failed with errors Warnings are reported but don't cause exit code 1. ### File Scope Validation runs only on files matching: - `models/**/_models.yml` - Schema YAML validation - `models/**/_sources.yml` - Source YAML validation - `models/**/*.sql` - SQL model validation Other files are skipped (exit 0). --- ## Validation Scripts Validation is implemented in `.claude/hooks/dbt-validation/`: | Script | Purpose | | ----------------------------- | -------------------------------------------- | | `validate_file.py` | Entry point, routes to appropriate validator | | `validate_schema_yaml.py` | YAML rule validation | | `validate_dbt_model.py` | SQL rule validation | | `check_migration_status.py` | Cross-validation and reporting | | `rules/naming_conventions.py` | Naming convention checks | | `rules/cte_patterns.py` | CTE structure validation | | `rules/snowflake_syntax.py` | Platform syntax detection | --- ## Related Skills - $dbt-migration - Migration workflow - $dbt-testing - Test strategies - $dbt-architecture - Naming conventions - $dbt-modeling - CTE patterns