---
name: source-analyzer
description: Analyze data source characteristics including update frequency, volume patterns, and schema stability.
allowed-tools: Read Grep Glob Bash
---

# Source Analyzer

**Audience:** Data engineers planning ETL pipelines.

**Goal:** Characterize data sources for extraction planning by analyzing volume, update patterns, schema, and quality baselines.

## Analysis Dimensions

### Volume Analysis
- Current row count
- Historical growth rate (if multiple snapshots available)
- Projected growth
- Peak vs average volume
- Seasonality patterns

### Update Patterns
- Full refresh vs incremental
- Update frequency (hourly, daily, weekly)
- Batch arrival times
- Late-arriving data windows

### Schema Characteristics
- Column count and types
- Nullable patterns
- Primary key candidates
- Foreign key relationships
- Schema change history

### Data Quality Baseline
- Typical null rates per column
- Expected value ranges
- Cardinality expectations
- Known data issues

## Workflow

1. **Inventory the source** - File location/database connection, access patterns (API, file drop, DB query), authentication requirements
2. **Sample multiple time periods** - Analyze data from different dates, identify temporal patterns
3. **Profile the schema** - Document all columns, identify stable vs volatile columns, note computed/derived columns
4. **Assess quality** - Baseline quality metrics, known quirks, required transformations
5. **Document extraction requirements** - Optimal extraction method, incremental key columns, filtering criteria

## Output: Source Specification

```yaml
source:
  name: customer_orders
  type: postgresql
  connection: orders_db

  extraction:
    method: incremental
    key_column: updated_at
    batch_size: 100000
    frequency: hourly

  schema:
    columns:
      - name: order_id
        type: bigint
        nullable: false
        primary_key: true
      - name: customer_id
        type: bigint
        nullable: false
        foreign_key: customers.id
      - name: order_date
        type: date
        nullable: false
      - name: total_amount
        type: decimal(10,2)
        nullable: false
      - name: status
        type: varchar(20)
        nullable: false
        values: [pending, confirmed, shipped, delivered, cancelled]
      - name: updated_at
        type: timestamp
        nullable: false
        incremental_key: true

  volume:
    current_rows: 5_200_000
    daily_growth: 15_000
    peak_hours: [10, 14, 18]

  quality:
    known_issues:
      - "status can be null for orders before 2023"
      - "total_amount occasionally negative (refunds)"
    null_rates:
      customer_id: 0%
      order_date: 0%
      status: 0.5%

  recommendations:
    - "Use updated_at for incremental loads"
    - "Add check constraint for status values"
    - "Consider partitioning by order_date"
```

## Comparison Analysis

When analyzing multiple related sources:

| Attribute | Orders | Order_Items |
|-----------|--------|-------------|
| Row count | 5.2M | 18.7M |
| Daily growth | 15K | 52K |
| Key column | order_id | item_id |
| Join key | order_id | order_id |
| Update lag | < 1 hour | < 1 hour |

**Relationship:** 1:N (avg 3.6 items per order)
**Join strategy:** Hash join on order_id
**Load order:** Orders first, then Order_Items

## API Source Analysis

For API-based sources:

```yaml
api_source:
  name: stripe_payments
  base_url: https://api.stripe.com/v1
  auth: bearer_token

  endpoints:
    - path: /charges
      method: GET
      pagination: cursor
      rate_limit: 100/sec
      params:
        created[gte]: "{last_sync}"

  extraction:
    strategy: cursor_pagination
    page_size: 100
    sync_frequency: 15min
    full_refresh_weekly: true

  data_characteristics:
    avg_response_size: 50KB
    records_per_page: 100
    typical_daily_volume: 5000

  error_handling:
    retry_codes: [429, 500, 502, 503]
    max_retries: 3
    backoff: exponential
```