---
name: statistical-analyzer
description: Perform statistical hypothesis testing, regression analysis, ANOVA, and t-tests with plain-English interpretations and visualizations.
---

# Statistical Analyzer

Guided statistical analysis with hypothesis testing, regression, ANOVA, and plain-English results.

## Features

- **Hypothesis Testing**: t-tests, chi-square, proportion tests
- **Regression Analysis**: Linear, polynomial, multiple regression
- **ANOVA**: One-way, two-way ANOVA with post-hoc tests
- **Distribution Analysis**: Normality tests, Q-Q plots
- **Correlation Analysis**: Pearson, Spearman with significance
- **Plain-English Results**: Interpret statistical outputs
- **Visualizations**: Regression plots, residual analysis, box plots
- **Report Generation**: PDF/HTML reports with interpretations

## Quick Start

```python
from statistical_analyzer import StatisticalAnalyzer

analyzer = StatisticalAnalyzer()

# T-test
analyzer.load_data(df, group_col='treatment', value_col='score')
results = analyzer.t_test(group1='control', group2='experimental')
print(results['interpretation'])

# Regression
analyzer.load_data(df)
results = analyzer.linear_regression(x='age', y='income')
print(f"R²: {results['r_squared']}")
analyzer.plot_regression('regression.png')
```

## CLI Usage

```bash
# T-test
python statistical_analyzer.py --data data.csv --test t-test --group treatment --value score --output results.html

# ANOVA
python statistical_analyzer.py --data data.csv --test anova --group category --value score --output results.pdf

# Regression
python statistical_analyzer.py --data data.csv --test regression --x age --y income --output report.pdf

# Correlation matrix
python statistical_analyzer.py --data data.csv --test correlation --output correlation.png
```
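
The flags used above could be declared with `argparse` roughly as follows. This is only a sketch: the actual parser in `statistical_analyzer.py` is not shown here, so the `build_parser` name, the help strings, and which flags are required are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags in the CLI examples above."""
    parser = argparse.ArgumentParser(prog="statistical_analyzer.py")
    parser.add_argument("--data", required=True, help="Path to the input CSV file")
    parser.add_argument(
        "--test",
        required=True,
        choices=["t-test", "anova", "regression", "correlation"],
        help="Which analysis to run",
    )
    # Column flags are optional here because each test needs a different subset.
    parser.add_argument("--group", help="Grouping column (t-test, ANOVA)")
    parser.add_argument("--value", help="Value column (t-test, ANOVA)")
    parser.add_argument("--x", help="Predictor column (regression)")
    parser.add_argument("--y", help="Response column (regression)")
    parser.add_argument("--output", help="Output file (.html, .pdf, or .png)")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--data", "data.csv", "--test", "regression", "--x", "age", "--y", "income"]
)
print(args.test, args.x, args.y)
```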

## API Reference

### StatisticalAnalyzer Class

```python
class StatisticalAnalyzer:
    def __init__(self)

    # Data Loading
    def load_data(self, data, **kwargs) -> 'StatisticalAnalyzer'
    def load_csv(self, filepath, **kwargs) -> 'StatisticalAnalyzer'

    # Hypothesis Tests
    def t_test(self, group1, group2, paired=False, alternative='two-sided') -> Dict
    def one_sample_t_test(self, column, expected_mean, alternative='two-sided') -> Dict
    def anova(self, groups, value_col) -> Dict
    def chi_square(self, observed, expected=None) -> Dict
    def proportion_test(self, successes, total, expected_prop=0.5) -> Dict

    # Regression
    def linear_regression(self, x, y) -> Dict
    def polynomial_regression(self, x, y, degree=2) -> Dict
    def multiple_regression(self, predictors: List[str], target: str) -> Dict

    # Correlation
    def correlation(self, method='pearson') -> pd.DataFrame  # Correlation matrix
    def correlation_test(self, var1, var2, method='pearson') -> Dict

    # Distribution Tests
    def normality_test(self, column, method='shapiro') -> Dict
    def qq_plot(self, column, output=None) -> str

    # Visualization
    def plot_regression(self, output, x=None, y=None) -> str
    def plot_residuals(self, output) -> str
    def plot_distribution(self, column, output) -> str
    def plot_boxplot(self, groups, value_col, output) -> str

    # Reporting
    def generate_report(self, output, format='pdf') -> str
    def summary(self) -> str
```
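
For orientation, the chi-square and proportion tests listed above correspond to SciPy calls like the ones below. This is a sketch of the underlying statistics, not the analyzer's implementation: the helper names are hypothetical, and using an exact binomial test for the proportion test is an assumption (a z-test is another common choice).

```python
from scipy import stats


def chi_square_gof(observed, expected=None):
    """Chi-square goodness-of-fit test.

    When `expected` is None, SciPy assumes equal expected counts per category.
    """
    res = stats.chisquare(observed, f_exp=expected)
    return {"statistic": float(res.statistic), "p_value": float(res.pvalue)}


def exact_proportion_test(successes, total, expected_prop=0.5):
    """Exact two-sided binomial test of an observed proportion (SciPy >= 1.7)."""
    res = stats.binomtest(successes, total, p=expected_prop)
    return {"proportion": successes / total, "p_value": float(res.pvalue)}


print(chi_square_gof([18, 22, 20, 20]))
print(exact_proportion_test(successes=36, total=50, expected_prop=0.5))
```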

## Tests

### T-Test

Compare means between two groups:

```python
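# Assumes load_csv accepts the same column keyword arguments as load_data in Quick Start
analyzer.load_csv('data.csv', group_col='treatment', value_col='score')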

# Independent samples
results = analyzer.t_test(
    group1='control',
    group2='treatment',
    paired=False
)

# Results
print(results)
# {
#     'statistic': -2.45,
#     'p_value': 0.018,
#     'mean_diff': -5.2,
#     'ci': (-9.5, -0.9),
#     'interpretation': 'The difference is statistically significant (p=0.018)...'
# }

# Paired samples (before/after)
results = analyzer.t_test(
    group1='before',
    group2='after',
    paired=True
)
```
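
The fields in the result dictionary above can be reproduced with SciPy. The sketch below assumes a pooled-variance (Student's) t-test and a 95% confidence interval. Whether the analyzer uses the pooled or Welch variant is not stated here, and `independent_t_test` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats


def independent_t_test(a, b, alpha=0.05):
    """Two-sample t-test with a confidence interval for the mean difference."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    res = stats.ttest_ind(a, b, equal_var=True)

    # Pooled standard error of the difference in means.
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / df
    se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

    mean_diff = a.mean() - b.mean()
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return {
        "statistic": float(res.statistic),
        "p_value": float(res.pvalue),
        "mean_diff": float(mean_diff),
        "ci": (float(mean_diff - t_crit * se), float(mean_diff + t_crit * se)),
    }


control = [72, 75, 68, 80, 77, 74]
treatment = [78, 82, 79, 85, 81, 84]
print(independent_t_test(control, treatment))
```

For paired data, `scipy.stats.ttest_rel` plays the same role as `ttest_ind` here.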

### ANOVA

Compare means across multiple groups:

```python
results = analyzer.anova(
    groups=['control', 'treatment_a', 'treatment_b'],
    value_col='score'
)

# Results include post-hoc tests
print(results['interpretation'])
# "There is a statistically significant difference between groups (p<0.001).
#  Post-hoc tests show treatment_a differs from control (p=0.003)..."
```
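
A one-way ANOVA followed by post-hoc comparisons can be sketched with SciPy as below. Tukey's HSD is used for the post-hoc step as an assumption; this document does not say which post-hoc procedure the analyzer applies, and `one_way_anova` is a hypothetical helper.

```python
from itertools import combinations

from scipy import stats


def one_way_anova(groups: dict, alpha: float = 0.05) -> dict:
    """One-way ANOVA; runs Tukey's HSD only if the omnibus test is significant."""
    names = list(groups)
    samples = [groups[name] for name in names]
    f_stat, p_value = stats.f_oneway(*samples)

    result = {"statistic": float(f_stat), "p_value": float(p_value), "post_hoc": {}}
    if p_value < alpha:
        tukey = stats.tukey_hsd(*samples)  # requires SciPy >= 1.8
        for i, j in combinations(range(len(names)), 2):
            result["post_hoc"][(names[i], names[j])] = float(tukey.pvalue[i, j])
    return result


data = {
    "control": [10, 11, 12, 13, 14],
    "treatment_a": [20, 21, 22, 23, 24],
    "treatment_b": [10.5, 11.5, 12.5, 13.5, 14.5],
}
print(one_way_anova(data))
```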

### Regression Analysis

```python
# Simple linear regression
results = analyzer.linear_regression(x='hours_studied', y='exam_score')

print(f"R² = {results['r_squared']:.3f}")
print(f"Equation: y = {results['slope']:.2f}x + {results['intercept']:.2f}")
print(f"p-value: {results['p_value']:.4f}")

# Polynomial regression
results = analyzer.polynomial_regression(x='age', y='salary', degree=2)

# Multiple regression
results = analyzer.multiple_regression(
    predictors=['age', 'experience', 'education'],
    target='salary'
)
```
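
The `slope`, `intercept`, `r_squared`, and `p_value` fields printed above map directly onto `scipy.stats.linregress`. The sketch below shows that mapping and a polynomial fit via `numpy.polyfit`. Both helpers are hypothetical stand-ins, not the analyzer's code, and the `coefficients` key is an assumed name.

```python
import numpy as np
from scipy import stats


def simple_linear_regression(x, y):
    """Ordinary least-squares fit of y on a single predictor x."""
    fit = stats.linregress(x, y)
    return {
        "slope": float(fit.slope),
        "intercept": float(fit.intercept),
        "r_squared": float(fit.rvalue ** 2),
        "p_value": float(fit.pvalue),  # tests the null hypothesis slope == 0
    }


def polynomial_fit(x, y, degree=2):
    """Least-squares polynomial fit; coefficients are ordered highest power first."""
    y = np.asarray(y, dtype=float)
    coeffs = np.polyfit(x, y, deg=degree)
    predicted = np.polyval(coeffs, x)
    ss_res = np.sum((y - predicted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return {"coefficients": coeffs.tolist(), "r_squared": float(1 - ss_res / ss_tot)}


hours = [1, 2, 3, 4, 5]
scores = [3.1, 4.9, 7.2, 9.0, 10.8]
print(simple_linear_regression(hours, scores))
print(polynomial_fit(hours, scores, degree=2))
```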

### Correlation Analysis

```python
# Full correlation matrix
corr_matrix = analyzer.correlation(method='pearson')
print(corr_matrix)

# Test specific correlation
results = analyzer.correlation_test('height', 'weight', method='pearson')
print(results['interpretation'])
# "There is a strong positive correlation (r=0.82, p<0.001)"
```
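
A pairwise correlation test with an interpretation string can be sketched as follows. The strength cut-offs (0.3 and 0.7) follow the Interpretation Guide in this document. The function is a hypothetical stand-in for `correlation_test`, and the exact wording of the interpretation is an assumption.

```python
from scipy import stats


def correlation_test(x, y, method="pearson"):
    """Correlation coefficient, p-value, and a short plain-English summary."""
    if method == "pearson":
        r, p = stats.pearsonr(x, y)
    elif method == "spearman":
        r, p = stats.spearmanr(x, y)
    else:
        raise ValueError(f"Unsupported method: {method}")

    # Strength is judged on the absolute value of the coefficient.
    magnitude = abs(r)
    strength = "weak" if magnitude < 0.3 else "moderate" if magnitude <= 0.7 else "strong"
    direction = "positive" if r > 0 else "negative"
    return {
        "r": float(r),
        "p_value": float(p),
        "interpretation": f"There is a {strength} {direction} correlation (r={r:.2f}, p={p:.3g})",
    }


height = [150, 160, 165, 172, 180, 185]
weight = [52, 58, 63, 70, 79, 84]
print(correlation_test(height, weight)["interpretation"])
```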

### Distribution Tests

```python
# Test normality
results = analyzer.normality_test('scores', method='shapiro')
# Returns: {'statistic': 0.98, 'p_value': 0.35,
#           'interpretation': 'Data appears normally distributed (p=0.35)'}

# Q-Q plot
analyzer.qq_plot('scores', output='qq_plot.png')
```
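
The normality test and the data behind a Q-Q plot can both be obtained from SciPy, as in this sketch. `normality_summary` is a hypothetical helper, and the `qq_*` keys are assumed names rather than fields the analyzer is known to return. A p-value above the threshold means the test found no evidence against normality, which is weaker than proving the data are normal.

```python
import numpy as np
from scipy import stats


def normality_summary(values, alpha=0.05):
    """Shapiro-Wilk test plus the quantile pairs a Q-Q plot would display."""
    values = np.asarray(values, dtype=float)
    statistic, p_value = stats.shapiro(values)

    # probplot returns theoretical vs. ordered sample quantiles and a line fit.
    (theoretical, ordered), (slope, intercept, r) = stats.probplot(values, dist="norm")
    return {
        "statistic": float(statistic),
        "p_value": float(p_value),
        "rejects_normality": bool(p_value < alpha),
        "qq_theoretical": theoretical,
        "qq_sample": ordered,
        "qq_r": float(r),  # close to 1 when the points lie near a straight line
    }


rng = np.random.default_rng(seed=0)
summary = normality_summary(rng.normal(loc=70, scale=10, size=50))
print(summary["statistic"], summary["p_value"], summary["qq_r"])
```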

## Interpretation Guide

The analyzer translates p-values and effect sizes into the plain-English labels below:

### Significance Levels
- **p < 0.001**: "Highly significant"
- **0.001 ≤ p < 0.01**: "Very significant"
- **0.01 ≤ p < 0.05**: "Statistically significant"
- **p ≥ 0.05**: "Not statistically significant"

### Effect Sizes
- **Cohen's d** (absolute value): Small (≈0.2), Medium (≈0.5), Large (≈0.8)
- **R²**: Weak (<0.3), Moderate (0.3-0.7), Strong (>0.7)
- **Correlation** (absolute value of r): Weak (<0.3), Moderate (0.3-0.7), Strong (>0.7)
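
The thresholds above can be expressed as small lookup functions. This is a sketch using only the standard library. The function names are hypothetical, and the "Negligible" label for |d| below 0.2 is an assumption, since the guide does not name that range.

```python
import math
from statistics import mean, variance


def significance_label(p_value: float) -> str:
    """Map a p-value to the significance labels listed above."""
    if p_value < 0.001:
        return "Highly significant"
    if p_value < 0.01:
        return "Very significant"
    if p_value < 0.05:
        return "Statistically significant"
    return "Not statistically significant"


def cohens_d(a, b) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def cohens_d_label(d: float) -> str:
    """Label an effect size by its magnitude, ignoring the sign."""
    size = abs(d)
    if size >= 0.8:
        return "Large"
    if size >= 0.5:
        return "Medium"
    if size >= 0.2:
        return "Small"
    return "Negligible"  # assumed label; not defined in the guide


def strength_label(value: float) -> str:
    """Label R² or a correlation coefficient (the sign is ignored)."""
    size = abs(value)
    if size < 0.3:
        return "Weak"
    if size <= 0.7:
        return "Moderate"
    return "Strong"


d = cohens_d([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(significance_label(0.018), cohens_d_label(d), strength_label(0.82))
```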

## Visualizations

### Regression Plot
```python
analyzer.linear_regression(x='age', y='income')
analyzer.plot_regression('regression.png')
# Creates scatter plot with regression line and confidence interval
```
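
A comparable plot can be produced directly with matplotlib, as sketched below. This draws only the scatter and the fitted line; the confidence band mentioned above is omitted to keep the example short. `plot_simple_regression` is a hypothetical helper, not the analyzer's `plot_regression`.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats


def plot_simple_regression(x, y, output, x_label="x", y_label="y"):
    """Save a scatter plot with the least-squares line and return the file path."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    fit = stats.linregress(x, y)
    grid = np.linspace(x.min(), x.max(), 100)

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(x, y, alpha=0.7, label="Observations")
    ax.plot(grid, fit.intercept + fit.slope * grid, color="C1",
            label=f"Fit (R² = {fit.rvalue ** 2:.3f})")
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.legend()
    fig.savefig(output, dpi=150, bbox_inches="tight")
    plt.close(fig)  # release the figure's memory
    return output
```

Calling `plot_simple_regression(df['age'], df['income'], 'regression.png', 'age', 'income')` would write a file similar to the one described above.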

### Residual Plot
```python
analyzer.plot_residuals('residuals.png')
# Plots residuals of the most recently fitted regression to check assumptions (e.g. homoscedasticity)
```

### Box Plot
```python
analyzer.plot_boxplot(
    groups=['control', 'treatment_a', 'treatment_b'],
    value_col='score',
    output='boxplot.png'
)
```

### Distribution Plot
```python
analyzer.plot_distribution('scores', 'distribution.png')
# Histogram with normal curve overlay
```

## Reports

Generate a report covering every analysis run on the analyzer:

```python
analyzer.load_csv('data.csv')
analyzer.t_test(group1='control', group2='treatment')
analyzer.linear_regression(x='hours', y='score')

# PDF report with all analyses
analyzer.generate_report('analysis_report.pdf', format='pdf')

# HTML report
analyzer.generate_report('analysis_report.html', format='html')
```

Reports include:
- Summary statistics
- Test results with interpretations
- Visualizations
- Assumption checks
- Recommendations

## Assumptions Checking

Each test validates its assumptions automatically:

```python
# T-test checks:
# - Normality (Shapiro-Wilk)
# - Equal variances (Levene's test)
# Warns if an assumption is violated

# ANOVA checks:
# - Normality per group
# - Homogeneity of variances
# Suggests non-parametric alternatives

# Regression checks:
# - Linearity
# - Homoscedasticity
# - Normality of residuals
# - Independence (Durbin-Watson)
```
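
The t-test checks and the Durbin-Watson statistic listed above can be sketched with SciPy and NumPy as follows. The function names, the 0.05 threshold, and the wording of the warnings (including the suggested alternatives) are assumptions, not the analyzer's actual output.

```python
import numpy as np
from scipy import stats


def check_t_test_assumptions(a, b, alpha=0.05):
    """Run Shapiro-Wilk on each group and Levene's test across both groups."""
    normal = bool(stats.shapiro(a).pvalue > alpha and stats.shapiro(b).pvalue > alpha)
    equal_var = bool(stats.levene(a, b).pvalue > alpha)

    warnings = []
    if not normal:
        warnings.append("Normality is doubtful; consider the Mann-Whitney U test.")
    if not equal_var:
        warnings.append("Variances differ; consider Welch's t-test.")
    return {"normality": normal, "equal_variances": equal_var, "warnings": warnings}


def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 suggests no first-order autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))


control = [72, 75, 68, 80, 77, 74]
treatment = [78, 82, 79, 85, 81, 84]
print(check_t_test_assumptions(control, treatment))
print(durbin_watson([0.5, -0.3, 0.2, -0.4, 0.1]))
```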

## Dependencies

- scipy>=1.10.0
- statsmodels>=0.14.0
- pandas>=2.0.0
- numpy>=1.24.0
- matplotlib>=3.7.0
- seaborn>=0.12.0
- reportlab>=4.0.0