---
name: outlier-detective
description: Detect anomalies and outliers in datasets using statistical and ML methods. Use for data cleaning, fraud detection, or quality control analysis.
---

# Outlier Detective

Detect anomalies and outliers in numeric data using multiple methods.

## Features

- **Statistical Methods**: Z-score, IQR, Modified Z-score
- **ML Methods**: Isolation Forest, LOF, DBSCAN
- **Visualization**: Box plots, scatter plots
- **Multi-Column**: Analyze multiple variables
- **Reports**: Detailed outlier reports
- **Flexible Thresholds**: Configurable sensitivity

## Quick Start

```python
from outlier_detective import OutlierDetective

detective = OutlierDetective()
detective.load_csv("sales_data.csv")

# Detect outliers in a column
outliers = detective.detect("revenue", method="iqr")
print(f"Found {len(outliers)} outliers")

# Get full report
report = detective.analyze("revenue")
print(report)
```

## CLI Usage

```bash
# Detect outliers using IQR method
python outlier_detective.py --input data.csv --column sales --method iqr

# Use Z-score with custom threshold
python outlier_detective.py --input data.csv --column price --method zscore --threshold 3

# Analyze all numeric columns
python outlier_detective.py --input data.csv --all

# Generate visualization
python outlier_detective.py --input data.csv --column revenue --plot boxplot.png

# Export outliers to CSV
python outlier_detective.py --input data.csv --column value --output outliers.csv

# Use Isolation Forest (ML)
python outlier_detective.py --input data.csv --method isolation_forest
```

## API Reference

### OutlierDetective Class

```python
class OutlierDetective:
    def __init__(self)

    # Data loading
    def load_csv(self, filepath: str, **kwargs) -> 'OutlierDetective'
    def load_dataframe(self, df: pd.DataFrame) -> 'OutlierDetective'

    # Detection (single column)
    def detect(self, column: str, method: str = "iqr", **kwargs) -> pd.DataFrame
    def analyze(self, column: str) -> dict

    # Detection (multi-column)
    def detect_multivariate(self, columns: list = None, method: str = "isolation_forest") -> pd.DataFrame
    def analyze_all(self) -> dict

    # Visualization
    def plot_boxplot(self, column: str, output: str) -> str
    def plot_scatter(self, col1: str, col2: str, output: str) -> str
    def plot_distribution(self, column: str, output: str) -> str

    # Export
    def get_outliers(self, column: str, method: str = "iqr") -> pd.DataFrame
    def get_clean_data(self, column: str, method: str = "iqr") -> pd.DataFrame
```

## Detection Methods

### Statistical Methods

#### IQR (Interquartile Range)
- Default and most robust method
- Outliers: values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR
- Multiplier configurable (default: 1.5)

```python
outliers = detective.detect("price", method="iqr", multiplier=1.5)
```

#### Z-Score
- Based on standard deviations from mean
- Assumes normal distribution
- Threshold configurable (default: 3)

```python
outliers = detective.detect("price", method="zscore", threshold=3)
```

#### Modified Z-Score
- Uses median instead of mean
- More robust to existing outliers
- Based on MAD (Median Absolute Deviation)

```python
outliers = detective.detect("price", method="modified_zscore", threshold=3.5)
```

### ML Methods

#### Isolation Forest
- Ensemble method, good for high-dimensional data
- Contamination parameter sets expected outlier fraction

```python
outliers = detective.detect_multivariate(
    method="isolation_forest",
    contamination=0.1
)
```

#### Local Outlier Factor (LOF)
- Density-based method
- Compares local density to neighbors

```python
outliers = detective.detect_multivariate(
    method="lof",
    n_neighbors=20
)
```

## Output Format

### detect() Result
```python
# Returns DataFrame of outlier rows with additional columns:
#   - outlier_score: How extreme the value is
#   - outlier_reason: Description of why it's an outlier

   index  value  outlier_score  outlier_reason
0     15   5000          4.2    Above Q3 + 1.5×IQR
1     42  -1000         -3.8    Below Q1 - 1.5×IQR
```

### analyze() Result
```python
{
    "column": "revenue",
    "total_rows": 1000,
    "outlier_count": 23,
    "outlier_percent": 2.3,
    "methods": {
        "iqr": {"count": 23, "indices": [...]},
        "zscore": {"count": 18, "indices": [...]},
        "modified_zscore": {"count": 20, "indices": [...]}
    },
    "stats": {
        "mean": 5432.10,
        "median": 4890.00,
        "std": 1234.56,
        "min": -1000.00,
        "max": 15000.00,
        "q1": 3500.00,
        "q3": 6200.00,
        "iqr": 2700.00
    },
    "bounds": {
        "lower": -550.00,
        "upper": 10250.00
    }
}
```

## Example Workflows

### Data Cleaning Pipeline
```python
detective = OutlierDetective()
detective.load_csv("raw_data.csv")

# Analyze and visualize
report = detective.analyze("price")
print(f"Found {report['outlier_count']} outliers ({report['outlier_percent']:.1f}%)")

# Get clean data
clean_data = detective.get_clean_data("price", method="iqr")
clean_data.to_csv("clean_data.csv")
```

### Fraud Detection
```python
detective = OutlierDetective()
detective.load_csv("transactions.csv")

# Use multiple methods for consensus
iqr_outliers = set(detective.detect("amount", method="iqr").index)
zscore_outliers = set(detective.detect("amount", method="zscore").index)

# Transactions flagged by both methods
high_confidence = iqr_outliers & zscore_outliers
print(f"High-confidence anomalies: {len(high_confidence)}")
```

### Multi-Variable Analysis
```python
detective = OutlierDetective()
detective.load_csv("sensors.csv")

# Detect multivariate outliers
outliers = detective.detect_multivariate(
    columns=["temp", "pressure", "humidity"],
    method="isolation_forest",
    contamination=0.05
)
print(f"Anomalous readings: {len(outliers)}")
```

## Visualization Examples

```python
# Box plot with outliers highlighted
detective.plot_boxplot("revenue", "revenue_boxplot.png")

# Distribution with bounds
detective.plot_distribution("price", "price_dist.png")

# Scatter plot (2D outliers)
detective.plot_scatter("feature1", "feature2", "scatter.png")
```

## Dependencies

- pandas>=2.0.0
- numpy>=1.24.0
- scipy>=1.10.0
- scikit-learn>=1.3.0
- matplotlib>=3.7.0