---
name: statistics-math
description: Statistics, probability, linear algebra, and mathematical foundations for data science
sasmp_version: "1.3.0"
bonded_agent: 04-data-scientist
bond_type: PRIMARY_BOND
skill_version: "2.0.0"
last_updated: "2025-01"
complexity: foundational
estimated_mastery_hours: 120
prerequisites: []
unlocks: [machine-learning, deep-learning, data-engineering]
---

# Statistics & Mathematics

Mathematical foundations for data science, machine learning, and statistical analysis.

## Quick Start

```python
import numpy as np
import scipy.stats as stats
from sklearn.linear_model import LinearRegression

# Descriptive Statistics
data = np.array([23, 45, 67, 32, 45, 67, 89, 12, 34, 56])
print(f"Mean: {np.mean(data):.2f}")
print(f"Median: {np.median(data):.2f}")
print(f"Std Dev: {np.std(data, ddof=1):.2f}")
print(f"IQR: {np.percentile(data, 75) - np.percentile(data, 25):.2f}")

# Hypothesis Testing
sample_a = [23, 45, 67, 32, 45]
sample_b = [56, 78, 45, 67, 89]
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print(f"T-statistic: {t_stat:.4f}, p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Reject null hypothesis: significant difference")
else:
    print("Fail to reject null hypothesis")
```

## Core Concepts

### 1. Probability Distributions

```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Normal Distribution
mu, sigma = 100, 15
normal_dist = stats.norm(loc=mu, scale=sigma)
x = np.linspace(50, 150, 100)

# PDF, CDF calculations
print(f"P(X < 85): {normal_dist.cdf(85):.4f}")
print(f"P(X > 115): {1 - normal_dist.cdf(115):.4f}")
print(f"95th percentile: {normal_dist.ppf(0.95):.2f}")

# Binomial Distribution (discrete)
n, p = 100, 0.3
binom_dist = stats.binom(n=n, p=p)
print(f"P(X = 30): {binom_dist.pmf(30):.4f}")
print(f"P(X <= 30): {binom_dist.cdf(30):.4f}")

# Poisson Distribution (events per time)
lambda_param = 5
poisson_dist = stats.poisson(mu=lambda_param)
print(f"P(X = 3): {poisson_dist.pmf(3):.4f}")

# Central Limit Theorem demonstration
population = np.random.exponential(scale=10, size=100000)
sample_means = [np.mean(np.random.choice(population, 30)) for _ in range(1000)]
print(f"Sample means are approximately normal: mean={np.mean(sample_means):.2f}")
```

### 2. Hypothesis Testing Framework

```python
from scipy import stats
import numpy as np

class HypothesisTest:
    """Framework for statistical hypothesis testing."""

    @staticmethod
    def two_sample_ttest(group_a, group_b, alpha=0.05):
        """Independent samples t-test."""
        t_stat, p_value = stats.ttest_ind(group_a, group_b)
        effect_size = (np.mean(group_a) - np.mean(group_b)) / np.sqrt(
            (np.var(group_a) + np.var(group_b)) / 2
        )
        return {
            "t_statistic": t_stat,
            "p_value": p_value,
            "significant": p_value < alpha,
            "effect_size_cohens_d": effect_size
        }

    @staticmethod
    def chi_square_test(observed, expected=None, alpha=0.05):
        """Chi-square test for categorical data."""
        if expected is None:
            chi2, p_value, dof, expected = stats.chi2_contingency(observed)
        else:
            chi2, p_value = stats.chisquare(observed, expected)
            dof = len(observed) - 1
        return {
            "chi2_statistic": chi2,
            "p_value": p_value,
            "degrees_of_freedom": dof,
            "significant": p_value < alpha
        }

    @staticmethod
    def ab_test_proportion(conversions_a, total_a, conversions_b, total_b, alpha=0.05):
        """Two-proportion z-test for A/B testing."""
        p_a = conversions_a / total_a
        p_b = conversions_b / total_b
        p_pooled = (conversions_a + conversions_b) / (total_a + total_b)

        se = np.sqrt(p_pooled * (1 - p_pooled) * (1/total_a + 1/total_b))
        z_stat = (p_a - p_b) / se
        p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

        return {
            "conversion_a": p_a,
            "conversion_b": p_b,
            "lift": (p_b - p_a) / p_a * 100,
            "z_statistic": z_stat,
            "p_value": p_value,
            "significant": p_value < alpha
        }

# Usage
result = HypothesisTest.ab_test_proportion(
    conversions_a=120, total_a=1000,
    conversions_b=150, total_b=1000
)
print(f"Lift: {result['lift']:.1f}%, p-value: {result['p_value']:.4f}")
```

### 3. Linear Algebra Essentials

```python
import numpy as np

# Matrix operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Basic operations
print("Matrix multiplication:", A @ B)
print("Element-wise:", A * B)
print("Transpose:", A.T)
print("Inverse:", np.linalg.inv(A))
print("Determinant:", np.linalg.det(A))

# Eigenvalues and eigenvectors (PCA foundation)
eigenvalues, eigenvectors = np.linalg.eig(A)
print(f"Eigenvalues: {eigenvalues}")

# Singular Value Decomposition (dimensionality reduction)
U, S, Vt = np.linalg.svd(A)
print(f"Singular values: {S}")

# Solving linear systems: Ax = b
b = np.array([5, 11])
x = np.linalg.solve(A, b)
print(f"Solution: {x}")

# Cosine similarity (NLP, recommendations)
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

vec1 = np.array([1, 2, 3])
vec2 = np.array([4, 5, 6])
print(f"Cosine similarity: {cosine_similarity(vec1, vec2):.4f}")
```

### 4. Regression Analysis

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_squared_error
import statsmodels.api as sm

# Multiple Linear Regression with statsmodels
X = np.random.randn(100, 3)
y = 2*X[:, 0] + 3*X[:, 1] - X[:, 2] + np.random.randn(100)*0.5

X_with_const = sm.add_constant(X)
model = sm.OLS(y, X_with_const).fit()

print(model.summary())
print(f"R-squared: {model.rsquared:.4f}")
print(f"Coefficients: {model.params}")
print(f"P-values: {model.pvalues}")

# Regularization comparison
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1)
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name}: R²={r2_score(y_test, y_pred):.4f}, RMSE={np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
```

## Tools & Technologies

| Tool | Purpose | Version (2025) |
|------|---------|----------------|
| **NumPy** | Numerical computing | 1.26+ |
| **SciPy** | Scientific computing | 1.12+ |
| **pandas** | Data manipulation | 2.2+ |
| **statsmodels** | Statistical models | 0.14+ |
| **scikit-learn** | ML algorithms | 1.4+ |

## Troubleshooting Guide

| Issue | Symptoms | Root Cause | Fix |
|-------|----------|------------|-----|
| **Low p-value, small effect** | Significant but meaningless | Large sample size | Check effect size |
| **High variance** | Unstable estimates | Small sample, outliers | More data, robust methods |
| **Multicollinearity** | Inflated coefficients | Correlated features | VIF check, remove features |
| **Heteroscedasticity** | Invalid inference | Non-constant variance | Weighted least squares |

## Best Practices

```python
# ✅ DO: Check assumptions before testing
from scipy.stats import shapiro
stat, p = shapiro(data)
if p > 0.05:
    print("Data is approximately normal")

# ✅ DO: Use effect sizes, not just p-values
# ✅ DO: Correct for multiple comparisons (Bonferroni)
# ✅ DO: Report confidence intervals

# ❌ DON'T: p-hack by trying many tests
# ❌ DON'T: Confuse correlation with causation
# ❌ DON'T: Ignore sample size requirements
```

## Resources

- [Khan Academy Statistics](https://www.khanacademy.org/math/statistics-probability)
- [StatQuest with Josh Starmer](https://statquest.org/)
- "Introduction to Statistical Learning" (ISLR)

---

**Skill Certification Checklist:**
- [ ] Can calculate descriptive statistics
- [ ] Can perform hypothesis tests (t-test, chi-square)
- [ ] Can implement A/B testing
- [ ] Can perform regression analysis
- [ ] Can use matrix operations for ML