---
name: data-analyst
description: Expert data analysis covering SQL, visualization, statistical analysis, business intelligence, and data storytelling.
version: 1.0.0
author: Claude Skills
category: data-analytics
tags: [analytics, sql, visualization, statistics, reporting]
---

# Data Analyst

Expert-level data analysis for business insights.

## Core Competencies

- SQL and database querying
- Data visualization
- Statistical analysis
- Business intelligence
- Data storytelling
- Dashboard development
- Reporting automation
- Stakeholder communication

## SQL Mastery

### Query Patterns

**Aggregation:**
```sql
SELECT
    date_trunc('month', created_at) as month,
    COUNT(*) as total_orders,
    COUNT(DISTINCT customer_id) as unique_customers,
    SUM(amount) as total_revenue,
    AVG(amount) as avg_order_value
FROM orders
WHERE created_at >= '2024-01-01'
GROUP BY 1
ORDER BY 1;
```

**Window Functions:**
```sql
SELECT
    customer_id,
    order_date,
    amount,
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as running_total,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) as order_number,
    LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as previous_order
FROM orders;
```

**CTEs for Clarity:**
```sql
WITH monthly_metrics AS (
    SELECT
        date_trunc('month', created_at) as month,
        SUM(amount) as revenue
    FROM orders
    GROUP BY 1
),
growth_calc AS (
    SELECT
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) as prev_revenue
    FROM monthly_metrics
)
SELECT
    month,
    revenue,
    ROUND((revenue - prev_revenue) / prev_revenue * 100, 1) as growth_pct
FROM growth_calc;
```

**Cohort Analysis:**
```sql
WITH first_orders AS (
    SELECT
        customer_id,
        date_trunc('month', MIN(created_at)) as cohort_month
    FROM orders
    GROUP BY 1
),
cohort_data AS (
    SELECT
        f.cohort_month,
        date_trunc('month', o.created_at) as order_month,
        COUNT(DISTINCT o.customer_id) as customers
    FROM orders o
    JOIN first_orders f ON o.customer_id = f.customer_id
    GROUP BY 1, 2
)
SELECT
    cohort_month,
    order_month,
    EXTRACT(MONTH FROM AGE(order_month, cohort_month)) as months_since_cohort,
    customers
FROM cohort_data
ORDER BY 1, 2;
```

### Query Optimization

**Use EXPLAIN:**
```sql
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 123;
```

**Best Practices:**
- Use indexes on filtered columns
- Avoid SELECT * in production
- Use LIMIT for exploratory queries
- Filter early, aggregate late
- Use appropriate data types

## Data Visualization

### Chart Selection Guide

| Data Type | Best Chart | Alternative |
|-----------|------------|-------------|
| Trend over time | Line chart | Area chart |
| Part of whole | Pie/Donut | Stacked bar |
| Comparison | Bar chart | Column chart |
| Distribution | Histogram | Box plot |
| Correlation | Scatter plot | Heatmap |
| Geographic | Map | Choropleth |

### Visualization Best Practices

**Do:**
- Start Y-axis at zero (for bars)
- Use consistent colors
- Label axes clearly
- Include context (benchmarks, targets)
- Order categories meaningfully

**Don't:**
- Use 3D charts
- Use more than 5-7 colors
- Truncate axes misleadingly
- Clutter with gridlines
- Use pie charts for many categories

### Dashboard Layout

```
┌─────────────────────────────────────────────────────────────┐
│  EXECUTIVE SUMMARY                                          │
│  [KPI 1: $X]  [KPI 2: X%]  [KPI 3: X]  [KPI 4: X%]         │
├─────────────────────────────────────────────────────────────┤
│  TRENDS                          │  BREAKDOWN               │
│  [Line Chart - Primary Metric]   │  [Bar Chart - Segments]  │
│                                  │                          │
├──────────────────────────────────┼──────────────────────────┤
│  COMPARISON                      │  DETAIL TABLE            │
│  [Bar Chart - vs Target/LY]      │  [Top N with metrics]    │
│                                  │                          │
└──────────────────────────────────┴──────────────────────────┘
```

## Statistical Analysis

### Descriptive Statistics

```python
import pandas as pd
import numpy as np

def describe_data(df, column):
    stats = {
        'count': df[column].count(),
        'mean': df[column].mean(),
        'median': df[column].median(),
        'std': df[column].std(),
        'min': df[column].min(),
        'max': df[column].max(),
        'q25': df[column].quantile(0.25),
        'q75': df[column].quantile(0.75),
        'skewness': df[column].skew(),
        'kurtosis': df[column].kurtosis()
    }
    return stats
```

### Hypothesis Testing

```python
from scipy import stats

# T-test: Compare two groups
def compare_groups(group_a, group_b, alpha=0.05):
    stat, p_value = stats.ttest_ind(group_a, group_b)

    result = {
        't_statistic': stat,
        'p_value': p_value,
        'significant': p_value < alpha,
        'effect_size': (group_a.mean() - group_b.mean()) / np.sqrt(
            (group_a.std()**2 + group_b.std()**2) / 2
        )
    }
    return result

# Chi-square: Test independence
def test_independence(observed, alpha=0.05):
    stat, p_value, dof, expected = stats.chi2_contingency(observed)

    return {
        'chi2': stat,
        'p_value': p_value,
        'degrees_of_freedom': dof,
        'significant': p_value < alpha
    }
```

### Regression Analysis

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

def simple_regression(X, y):
    model = LinearRegression()
    model.fit(X.reshape(-1, 1), y)

    predictions = model.predict(X.reshape(-1, 1))

    return {
        'coefficient': model.coef_[0],
        'intercept': model.intercept_,
        'r_squared': r2_score(y, predictions),
        'mae': mean_absolute_error(y, predictions)
    }
```

## Business Analysis

### Analysis Framework

```markdown
# Analysis: [Topic]

## Business Question
[What are we trying to answer?]

## Hypothesis
[What do we expect to find?]

## Data Sources
- [Source 1]: [Description]
- [Source 2]: [Description]

## Methodology
1. [Step 1]
2. [Step 2]
3. [Step 3]

## Findings

### Finding 1: [Title]
[Description with supporting data]

### Finding 2: [Title]
[Description with supporting data]

## Recommendations
1. [Recommendation]: [Expected impact]
2. [Recommendation]: [Expected impact]

## Limitations
- [Limitation 1]
- [Limitation 2]

## Next Steps
- [Action item]
```

### Key Business Metrics

**Acquisition:**
- Customer Acquisition Cost (CAC)
- Cost per Lead (CPL)
- Conversion Rate

**Engagement:**
- Daily/Monthly Active Users
- Session Duration
- Feature Adoption

**Retention:**
- Churn Rate
- Retention Rate
- Net Revenue Retention

**Revenue:**
- Monthly Recurring Revenue (MRR)
- Average Revenue Per User (ARPU)
- Lifetime Value (LTV)

## Data Storytelling

### Presentation Structure

```
1. CONTEXT
   - Why does this matter?
   - What question are we answering?

2. KEY FINDING
   - Lead with the insight
   - Make it memorable

3. EVIDENCE
   - Show the data
   - Use effective visuals

4. IMPLICATIONS
   - What does this mean?
   - So what?

5. RECOMMENDATIONS
   - What should we do?
   - Clear next steps
```

### Insight Template

```markdown
## [Headline: Action-oriented finding]

**What:** [One sentence description of the finding]

**So What:** [Why this matters to the business]

**Now What:** [Recommended action]

**Evidence:**
[Chart or data supporting the finding]

**Confidence:** [High/Medium/Low]
```

## Reference Materials

- `references/sql_patterns.md` - Advanced SQL queries
- `references/visualization.md` - Chart selection guide
- `references/statistics.md` - Statistical methods
- `references/storytelling.md` - Presentation best practices

## Scripts

```bash
# Data profiler
python scripts/data_profiler.py --table orders --output profile.html

# SQL query analyzer
python scripts/query_analyzer.py --query query.sql --explain

# Dashboard generator
python scripts/dashboard_gen.py --config dashboard.yaml

# Report automation
python scripts/report_gen.py --template monthly --output report.pdf
```