---
name: data-analysis
description: Analyze, explore, clean, and visualize datasets with statistical rigor. Use when user asks to analyze data, find patterns, compute statistics, create visualizations, clean messy data, or explore a dataset. Trigger when user says things like "analyze this data", "what trends do you see", "find patterns in", "create a chart", "clean this dataset", "run statistics on", "what does this data tell us", or provides CSV/Excel/JSON data for exploration. Also trigger for A/B test analysis, cohort analysis, and data quality assessments. Do NOT trigger for simple data format conversions, database query writing without analysis, or ETL pipeline design.
metadata:
  version: "2.0.0"
  author: Upsonic
  tags: [data, analysis, statistics, visualization]
---

# Data Analysis

Explore, clean, analyze, and communicate findings from data. The goal is always to answer a question — start with what the user wants to know and work backward to the analysis that answers it.

## Before You Analyze

### Understand the Question

Before touching the data, clarify:

1. **What question are we answering?** ("Is our conversion rate improving?" is an answerable question. "Analyze this data" is not — help the user sharpen it.)
2. **Who needs the answer?** (Engineer debugging an issue? Executive making a budget decision? Researcher testing a hypothesis?)
3. **What decisions will this inform?** (This determines how precise you need to be and what format the answer should take.)
4. **What's the timeline?** (A quick sanity check and a thorough statistical analysis require different approaches.)

If the user says "analyze this data" without a specific question, help them formulate one:
- "What would be most useful to know from this data?"
- "Are you looking for trends over time, comparisons between groups, or something else?"
- "Is there a specific business question this should answer?"

## Reference Materials and Scripts

- Execute `profile_data.py` with a data file path to get a quick profile of any CSV, Excel, or JSON dataset — it reports shape, types, missing values, stats, and value distributions. Run with `--help` for usage.
- Load `statistical-tests-guide.md` when choosing statistical tests — it has a decision matrix for test selection, effect size interpretation tables, and sample size guidelines.

### Understand the Data

Before analysis, get your bearings:

1. **Source and context**: Where did this data come from? How was it collected? What time period does it cover?
2. **Schema**: What are the columns/fields? What do they represent? What are the data types?
3. **Scale**: How many rows/records? What's the granularity? (Per user? Per day? Per transaction?)
4. **Known issues**: Is the data known to be incomplete, biased, or have quality problems?

```python
# First look at any dataset
import pandas as pd

df = pd.read_csv("data.csv")  # or read_excel, read_json, etc.
print(f"Shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nFirst rows:\n{df.head()}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nBasic stats:\n{df.describe()}")
```

## Analysis Workflow

### Step 1: Clean and Validate

Data quality determines analysis quality. Don't skip this.

#### Handle Missing Values
- **Count them first**: What percentage of each column is missing?
- **Understand why**: Are they random? Systematic? (e.g., optional fields vs data collection failures)
- **Choose a strategy and document it**:
  - Drop rows: When missing data is rare and random (less than 5%)
  - Impute with median/mode: When missing data is moderate and the distribution is known
  - Flag as separate category: When missingness itself is informative
  - Leave as-is: When the analysis method handles nulls natively

```python
# Document your decisions
missing_pct = df.isnull().sum() / len(df) * 100
print("Missing data percentage per column:")
print(missing_pct[missing_pct > 0].sort_values(ascending=False))
```

#### Handle Outliers
- **Detect**: Use IQR method, z-scores, or domain knowledge
- **Investigate**: Are they errors or legitimate extreme values?
- **Document your decision**: Keep, cap, or remove — and explain why

```python
# IQR method for outlier detection
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['value'] < Q1 - 1.5 * IQR) | (df['value'] > Q3 + 1.5 * IQR)]
print(f"Found {len(outliers)} outliers ({len(outliers)/len(df)*100:.1f}%)")
```

#### Validate Data Types and Ranges
- Dates should be dates, numbers should be numbers
- Check for impossible values (negative ages, future dates, percentages over 100)
- Verify categorical values are consistent (watch for "USA", "US", "United States")

### Step 2: Explore

Start broad, then focus on what's interesting.

#### Descriptive Statistics
Always start here — understand the basics before going deeper.

```python
# Numerical columns
print(df.describe())

# Categorical columns
for col in df.select_dtypes(include='object').columns:
    print(f"\n{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts().head(10))
```

#### Distributions
Understanding shape matters for choosing the right tests.

```python
import matplotlib.pyplot as plt

# Distribution of key metrics
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, col in enumerate(['metric_a', 'metric_b', 'metric_c']):
    df[col].hist(ax=axes[i], bins=30)
    axes[i].set_title(col)
    axes[i].axvline(df[col].median(), color='red', linestyle='--', label='median')
    axes[i].legend()
plt.tight_layout()
plt.savefig("distributions.png")
```

#### Correlations and Relationships
Look for patterns between variables.

```python
# Correlation matrix for numerical columns
corr = df.select_dtypes(include='number').corr()
print("Strong correlations (|r| > 0.5):")
for i in range(len(corr.columns)):
    for j in range(i+1, len(corr.columns)):
        if abs(corr.iloc[i, j]) > 0.5:
            print(f"  {corr.columns[i]} vs {corr.columns[j]}: {corr.iloc[i,j]:.3f}")
```

#### Trends Over Time
If the data has a time dimension, always look at trends.

```python
# Time series analysis
df['date'] = pd.to_datetime(df['date'])
daily = df.groupby('date')['metric'].agg(['mean', 'count'])
daily['mean'].plot(figsize=(12, 4), title='Daily Average')
plt.savefig("trend.png")
```

### Step 3: Analyze

Choose the right method for the question.

#### Comparison Questions ("Is A different from B?")

Use the right statistical test:
- **Two groups, continuous outcome**: t-test (if normal) or Mann-Whitney U (if not)
- **Multiple groups**: ANOVA (if normal) or Kruskal-Wallis (if not)
- **Two groups, categorical outcome**: Chi-squared test
- **Before/after with same subjects**: Paired t-test or Wilcoxon signed-rank

Always report:
- Sample sizes for each group
- Effect size (not just p-value) — Cohen's d, odds ratio, or percentage difference
- Confidence intervals
- Practical significance, not just statistical significance

```python
from scipy import stats

# Example: comparing two groups
group_a = df[df['variant'] == 'A']['metric']
group_b = df[df['variant'] == 'B']['metric']

# Check normality first
_, p_normal_a = stats.shapiro(group_a.sample(min(5000, len(group_a))))
_, p_normal_b = stats.shapiro(group_b.sample(min(5000, len(group_b))))

if p_normal_a > 0.05 and p_normal_b > 0.05:
    stat, p_value = stats.ttest_ind(group_a, group_b)
    test_name = "t-test"
else:
    stat, p_value = stats.mannwhitneyu(group_a, group_b)
    test_name = "Mann-Whitney U"

# Effect size (Cohen's d)
pooled_std = ((group_a.std()**2 + group_b.std()**2) / 2) ** 0.5
cohens_d = (group_a.mean() - group_b.mean()) / pooled_std

print(f"Test: {test_name}")
print(f"Group A: mean={group_a.mean():.3f}, n={len(group_a)}")
print(f"Group B: mean={group_b.mean():.3f}, n={len(group_b)}")
print(f"Difference: {group_a.mean() - group_b.mean():.3f} ({(group_a.mean() - group_b.mean())/group_b.mean()*100:.1f}%)")
print(f"Cohen's d: {cohens_d:.3f}")
print(f"p-value: {p_value:.4f}")
```

#### Trend Questions ("Is this changing over time?")

- Use rolling averages to smooth noise
- Look for seasonality (day-of-week, monthly, quarterly patterns)
- Segment by key dimensions — overall trends can mask group-level differences
- Compare period-over-period (WoW, MoM, YoY)

#### Composition Questions ("What's the breakdown?")

- Use percentages and proportions
- Show both absolute numbers and percentages
- Look for the Pareto principle (80/20 rule)
- Segment into meaningful groups

#### A/B Test Analysis

For experiment analysis, follow this checklist:

1. **Sample size**: Was the test adequately powered?
2. **Duration**: Did it run long enough for weekly patterns?
3. **Randomization**: Were groups balanced on key dimensions?
4. **Metric definition**: Is the metric clearly defined and correctly computed?
5. **Statistical test**: Use the appropriate test for the metric type
6. **Multiple comparisons**: Correct for multiple testing if checking many metrics
7. **Practical significance**: Is the effect large enough to matter?

### Step 4: Communicate Findings

Structure matters as much as the analysis itself.

#### Lead With the Answer
Start with what the user asked. Then support it.

**Pattern:**
```
[Answer to the question in 1-2 sentences]

[Key supporting evidence: 2-4 bullet points with specific numbers]

[Methodology note: How you arrived at this, in 1-2 sentences]

[Caveats and limitations: What could affect this conclusion]

[Recommended next steps: What to do with this information]
```

#### Visualization Guidelines

Choose the right chart type:
- **Trend over time**: Line chart
- **Comparison between categories**: Bar chart (horizontal for many categories)
- **Distribution**: Histogram or box plot
- **Relationship between two variables**: Scatter plot
- **Part-of-whole**: Stacked bar or pie chart (only for 2-5 categories)
- **Multiple dimensions**: Heatmap or small multiples

For every chart:
- Title that states the finding, not just the data ("Conversion drops 40% on weekends" not "Conversion by day")
- Label axes with units
- Include sample size or time period
- Use consistent colors across related charts

#### What to Always Include
- **Sample sizes**: How much data backs each claim
- **Time periods**: When the data is from
- **Definitions**: How key metrics are calculated
- **Confidence**: How certain you are (confidence intervals, p-values where appropriate)
- **Limitations**: What the data can't tell you

## Common Pitfalls

- **Survivorship bias**: Only analyzing data from things that succeeded/persisted
- **Simpson's paradox**: A trend that appears in groups but reverses when combined
- **Confounding variables**: Correlation doesn't mean causation — look for third variables
- **Small sample sizes**: Don't draw strong conclusions from small N. State the limitation.
- **Cherry-picking timeframes**: Make sure the time period is representative
- **Precision theater**: Reporting "23.847%" when the data only supports "roughly 24%"
- **Missing baseline**: A number without context is meaningless ("10,000 errors" — is that a lot?)

## Handling Common Requests

### "Just give me a quick look"
Provide: shape, column summary, missing data count, top-level descriptive stats, one key insight. Keep it to one screen of output.

### "What's interesting in this data?"
Run the full exploration workflow (Step 2) and surface the 3-5 most notable patterns: unexpected distributions, strong correlations, outliers, trends.

### "Is this statistically significant?"
Clarify what comparison they mean, choose the right test, report effect size and p-value, and explain what it means in practical terms.

### "Create a dashboard / visualization"
Ask what decisions the dashboard supports. Design 3-5 charts that answer those questions. Use consistent styling and clear titles that state findings.