--- name: source-analyzer description: Analyze data source characteristics including update frequency, volume patterns, and schema stability. allowed-tools: Read Grep Glob Bash --- # Source Analyzer **Audience:** Data engineers planning ETL pipelines. **Goal:** Characterize data sources for extraction planning by analyzing volume, update patterns, schema, and quality baselines. ## Analysis Dimensions ### Volume Analysis - Current row count - Historical growth rate (if multiple snapshots available) - Projected growth - Peak vs average volume - Seasonality patterns ### Update Patterns - Full refresh vs incremental - Update frequency (hourly, daily, weekly) - Batch arrival times - Late-arriving data windows ### Schema Characteristics - Column count and types - Nullable patterns - Primary key candidates - Foreign key relationships - Schema change history ### Data Quality Baseline - Typical null rates per column - Expected value ranges - Cardinality expectations - Known data issues ## Workflow 1. **Inventory the source** - File location/database connection, access patterns (API, file drop, DB query), authentication requirements 2. **Sample multiple time periods** - Analyze data from different dates, identify temporal patterns 3. **Profile the schema** - Document all columns, identify stable vs volatile columns, note computed/derived columns 4. **Assess quality** - Baseline quality metrics, known quirks, required transformations 5. **Document extraction requirements** - Optimal extraction method, incremental key columns, filtering criteria ## Output: Source Specification ```yaml source: name: customer_orders type: postgresql connection: orders_db extraction: method: incremental key_column: updated_at batch_size: 100000 frequency: hourly schema: columns: - name: order_id type: bigint nullable: false primary_key: true - name: customer_id type: bigint nullable: false foreign_key: customers.id - name: order_date type: date nullable: false - name: total_amount type: decimal(10,2) nullable: false - name: status type: varchar(20) nullable: false values: [pending, confirmed, shipped, delivered, cancelled] - name: updated_at type: timestamp nullable: false incremental_key: true volume: current_rows: 5_200_000 daily_growth: 15_000 peak_hours: [10, 14, 18] quality: known_issues: - "status can be null for orders before 2023" - "total_amount occasionally negative (refunds)" null_rates: customer_id: 0% order_date: 0% status: 0.5% recommendations: - "Use updated_at for incremental loads" - "Add check constraint for status values" - "Consider partitioning by order_date" ``` ## Comparison Analysis When analyzing multiple related sources: | Attribute | Orders | Order_Items | |-----------|--------|-------------| | Row count | 5.2M | 18.7M | | Daily growth | 15K | 52K | | Key column | order_id | item_id | | Join key | order_id | order_id | | Update lag | < 1 hour | < 1 hour | **Relationship:** 1:N (avg 3.6 items per order) **Join strategy:** Hash join on order_id **Load order:** Orders first, then Order_Items ## API Source Analysis For API-based sources: ```yaml api_source: name: stripe_payments base_url: https://api.stripe.com/v1 auth: bearer_token endpoints: - path: /charges method: GET pagination: cursor rate_limit: 100/sec params: created[gte]: "{last_sync}" extraction: strategy: cursor_pagination page_size: 100 sync_frequency: 15min full_refresh_weekly: true data_characteristics: avg_response_size: 50KB records_per_page: 100 typical_daily_volume: 5000 error_handling: retry_codes: [429, 500, 502, 503] max_retries: 3 backoff: exponential ```