---
name: correlation-explorer
description: Find and visualize correlations between variables in datasets. Use for data exploration, feature selection, or identifying relationships between columns.
---

# Correlation Explorer

Analyze correlations between variables in CSV/Excel datasets.

## Features

- **Correlation Matrix**: Compute all pairwise correlations
- **Heatmap Visualization**: Color-coded correlation display
- **Significance Testing**: P-values for correlations
- **Multiple Methods**: Pearson, Spearman, Kendall
- **Strong Correlations**: Find highly correlated pairs
- **Target Analysis**: Correlations with specific variable

## Quick Start

```python
from correlation_explorer import CorrelationExplorer

explorer = CorrelationExplorer()

# Load and analyze
explorer.load_csv("sales_data.csv")
matrix = explorer.correlation_matrix()

# Find strong correlations
strong = explorer.find_strong_correlations(threshold=0.7)
print(strong)

# Generate heatmap
explorer.plot_heatmap("correlation_heatmap.png")
```

## CLI Usage

```bash
# Compute correlation matrix
python correlation_explorer.py --input data.csv --output correlations.csv

# Generate heatmap
python correlation_explorer.py --input data.csv --heatmap heatmap.png

# Find strong correlations
python correlation_explorer.py --input data.csv --strong --threshold 0.7

# Correlations with target variable
python correlation_explorer.py --input data.csv --target sales

# Use Spearman correlation
python correlation_explorer.py --input data.csv --method spearman

# Include p-values
python correlation_explorer.py --input data.csv --pvalues
```

## API Reference

### CorrelationExplorer Class

```python
class CorrelationExplorer:
    def __init__(self)

    # Data loading
    def load_csv(self, filepath: str, **kwargs) -> 'CorrelationExplorer'
    def load_dataframe(self, df: pd.DataFrame) -> 'CorrelationExplorer'

    # Analysis
    def correlation_matrix(self, method: str = "pearson") -> pd.DataFrame
    def correlation_with_pvalues(self, method: str = "pearson") -> tuple
    def correlate_with_target(self, target: str, method: str = "pearson") -> pd.Series

    # Discovery
    def find_strong_correlations(self, threshold: float = 0.7) -> list
    def find_weak_correlations(self, threshold: float = 0.3) -> list

    # Visualization
    def plot_heatmap(self, output: str, **kwargs) -> str
    def plot_scatter(self, var1: str, var2: str, output: str) -> str

    # Export
    def to_csv(self, output: str) -> str
    def to_json(self, output: str) -> str
```

## Correlation Methods

| Method | Best For |
|--------|----------|
| `pearson` | Linear relationships, normal data |
| `spearman` | Non-linear, ordinal data |
| `kendall` | Small samples, ordinal data |

```python
# Pearson (default) - parametric
matrix = explorer.correlation_matrix(method="pearson")

# Spearman - rank-based, non-parametric
matrix = explorer.correlation_matrix(method="spearman")

# Kendall - robust to outliers
matrix = explorer.correlation_matrix(method="kendall")
```

## Output Format

### Correlation Matrix
```python
           sales  marketing  customers
sales      1.000      0.854      0.723
marketing  0.854      1.000      0.612
customers  0.723      0.612      1.000
```

### Strong Correlations
```python
[
    {"var1": "sales", "var2": "marketing", "correlation": 0.854, "abs_corr": 0.854},
    {"var1": "sales", "var2": "customers", "correlation": 0.723, "abs_corr": 0.723}
]
```

### With P-Values
```python
{
    "correlations": DataFrame,
    "pvalues": DataFrame,
    "significant": [...],  # p < 0.05
}
```

## Example Workflows

### Feature Selection
```python
explorer = CorrelationExplorer()
explorer.load_csv("features.csv")

# Find features correlated with target
target_corr = explorer.correlate_with_target("target")
important_features = target_corr[abs(target_corr) > 0.3].index.tolist()
print(f"Important features: {important_features}")

# Find multicollinear features (to potentially drop)
strong = explorer.find_strong_correlations(threshold=0.9)
print("Highly correlated pairs (consider dropping one):")
for pair in strong:
    print(f"  {pair['var1']} <-> {pair['var2']}: {pair['correlation']:.3f}")
```

### Sales Analysis
```python
explorer = CorrelationExplorer()
explorer.load_csv("sales_data.csv")

# What drives sales?
sales_corr = explorer.correlate_with_target("revenue")
print("Factors correlated with revenue:")
for var, corr in sales_corr.sort_values(ascending=False).items():
    if var != "revenue":
        print(f"  {var}: {corr:.3f}")

# Visualize
explorer.plot_heatmap("sales_correlations.png")
```

### Data Exploration
```python
explorer = CorrelationExplorer()
explorer.load_csv("dataset.csv")

# Get full picture
corr, pvals = explorer.correlation_with_pvalues()

# Find all significant correlations
significant = []
for i in range(len(corr.columns)):
    for j in range(i+1, len(corr.columns)):
        if pvals.iloc[i, j] < 0.05:
            significant.append({
                'var1': corr.columns[i],
                'var2': corr.columns[j],
                'r': corr.iloc[i, j],
                'p': pvals.iloc[i, j]
            })
```

## Heatmap Options

```python
explorer.plot_heatmap(
    output="heatmap.png",
    cmap="coolwarm",      # Color scheme
    annot=True,           # Show values
    figsize=(12, 10),     # Figure size
    vmin=-1, vmax=1,      # Color scale
    title="Correlation Matrix"
)
```

## Dependencies

- pandas>=2.0.0
- numpy>=1.24.0
- scipy>=1.10.0
- matplotlib>=3.7.0
- seaborn>=0.12.0