---
name: clustering-analyzer
description: Cluster data using K-Means, DBSCAN, and hierarchical clustering. Use for customer segmentation, pattern discovery, or data grouping.
---

# Clustering Analyzer

Analyze and cluster data using multiple algorithms with visualization and evaluation.

## Features

- **K-Means**: Partition-based clustering with elbow method
- **DBSCAN**: Density-based clustering for arbitrary shapes
- **Hierarchical**: Agglomerative clustering with dendrograms
- **Evaluation**: Silhouette scores, cluster statistics
- **Visualization**: 2D/3D plots, dendrograms, elbow curves
- **Export**: Labeled data, cluster summaries

## Quick Start

```python
from clustering_analyzer import ClusteringAnalyzer

analyzer = ClusteringAnalyzer()
analyzer.load_csv("customers.csv")

# K-Means clustering
result = analyzer.kmeans(n_clusters=3)
print(f"Silhouette Score: {result['silhouette_score']:.3f}")

# Visualize
analyzer.plot_clusters("clusters.png")
```

## CLI Usage

```bash
# K-Means clustering
python clustering_analyzer.py --input data.csv --method kmeans --clusters 3

# Find optimal clusters (elbow method)
python clustering_analyzer.py --input data.csv --method kmeans --find-optimal

# DBSCAN clustering
python clustering_analyzer.py --input data.csv --method dbscan --eps 0.5 --min-samples 5

# Hierarchical clustering
python clustering_analyzer.py --input data.csv --method hierarchical --clusters 4

# Generate plots
python clustering_analyzer.py --input data.csv --method kmeans --clusters 3 --plot clusters.png

# Export labeled data
python clustering_analyzer.py --input data.csv --method kmeans --clusters 3 --output labeled.csv

# Select specific columns
python clustering_analyzer.py --input data.csv --columns age,income,spending --method kmeans --clusters 3
```

## API Reference

### ClusteringAnalyzer Class

```python
class ClusteringAnalyzer:
    def __init__(self)

    # Data loading
    def load_csv(self, filepath: str, columns: list = None) -> 'ClusteringAnalyzer'
    def load_dataframe(self, df: pd.DataFrame, columns: list = None) -> 'ClusteringAnalyzer'

    # Clustering methods
    def kmeans(self, n_clusters: int, **kwargs) -> dict
    def dbscan(self, eps: float = 0.5, min_samples: int = 5) -> dict
    def hierarchical(self, n_clusters: int, linkage: str = "ward") -> dict

    # Optimal clusters
    def find_optimal_clusters(self, max_k: int = 10) -> dict
    def elbow_plot(self, output: str, max_k: int = 10) -> str

    # Evaluation
    def silhouette_score(self) -> float
    def cluster_statistics(self) -> dict

    # Visualization
    def plot_clusters(self, output: str, dimensions: list = None) -> str
    def plot_dendrogram(self, output: str) -> str
    def plot_silhouette(self, output: str) -> str

    # Export
    def get_labels(self) -> list
    def to_dataframe(self) -> pd.DataFrame
    def save_labeled(self, output: str) -> str
```

## Clustering Methods

### K-Means

Best for roughly spherical clusters when the number of groups is known in advance:

```python
result = analyzer.kmeans(n_clusters=3)

# Returns:
{
    "labels": [0, 1, 2, 0, ...],
    "n_clusters": 3,
    "silhouette_score": 0.65,
    "inertia": 1234.56,
    "cluster_sizes": {0: 150, 1: 200, 2: 100},
    "centroids": [[...], [...], [...]]
}
```
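
For readers who want to see where these fields could come from, here is a minimal sketch that builds a dict of the same shape directly with scikit-learn. It is not the analyzer's actual implementation: the synthetic data, the `n_init` and `random_state` settings, and the name `kmeans_result` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption): three blobs in 2D.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# Fit K-Means; n_init and random_state are illustrative choices.
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = model.labels_

# Count how many samples landed in each cluster.
ids, counts = np.unique(labels, return_counts=True)

# Hypothetical result dict mirroring the documented fields.
kmeans_result = {
    "labels": labels.tolist(),
    "n_clusters": int(len(ids)),
    "silhouette_score": float(silhouette_score(X, labels)),
    "inertia": float(model.inertia_),  # within-cluster sum of squared distances
    "cluster_sizes": {int(i): int(c) for i, c in zip(ids, counts)},
    "centroids": model.cluster_centers_.tolist(),
}
```

Inertia always decreases as `n_clusters` grows, so it is only meaningful when compared across several values of k; the silhouette score is comparable on its own.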

### DBSCAN

Best for arbitrarily shaped clusters and noisy data; the number of clusters is not specified in advance:

```python
result = analyzer.dbscan(eps=0.5, min_samples=5)

# Returns:
{
    "labels": [0, 0, 1, -1, ...],  # -1 = noise
    "n_clusters": 3,
    "n_noise": 15,
    "silhouette_score": 0.58,
    "cluster_sizes": {0: 150, 1: 200, 2: 100}
}
```
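
The `-1` noise label affects how the other fields are counted. The sketch below shows one way to derive `n_clusters` and `n_noise` from scikit-learn's `DBSCAN` output. It is an illustration under assumptions, not the analyzer's code: the two-moons data is synthetic, and computing the silhouette on non-noise points only is a choice made here, since the source does not say how the analyzer treats noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): two interleaved half-moons.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)  # eps is distance-based, so scale first

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Noise points carry the label -1 and do not count as a cluster.
noise_mask = labels == -1
cluster_ids = sorted(set(labels[~noise_mask].tolist()))
n_clusters = len(cluster_ids)
n_noise = int(noise_mask.sum())
cluster_sizes = {c: int((labels == c).sum()) for c in cluster_ids}

# The silhouette is only defined for 2+ clusters; here it is computed
# on non-noise points (an assumption, not documented behavior).
score = None
if n_clusters >= 2:
    score = float(silhouette_score(X[~noise_mask], labels[~noise_mask]))
```

Because `eps` is a distance threshold, results are sensitive to feature scale; a value that works on standardized data will usually not transfer to raw units.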

### Hierarchical (Agglomerative)

Best for understanding cluster hierarchy:

```python
result = analyzer.hierarchical(n_clusters=4, linkage="ward")

# Returns:
{
    "labels": [0, 1, 2, 3, ...],
    "n_clusters": 4,
    "silhouette_score": 0.62,
    "cluster_sizes": {0: 100, 1: 150, 2: 120, 3: 80}
}
```
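
To make the idea of a cluster hierarchy concrete, the following sketch uses SciPy directly: `linkage` builds the full merge tree and `fcluster` cuts it into flat labels. Whether the analyzer uses SciPy or scikit-learn's `AgglomerativeClustering` internally is not stated in this document, so treat this as an assumption-laden illustration on synthetic data.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption).
X, _ = make_blobs(n_samples=200, centers=4, random_state=1)

# Ward linkage merges the pair of clusters that least increases
# within-cluster variance. Z has one row per merge: (n_samples - 1, 4).
Z = linkage(X, method="ward")

# Cut the tree into at most 4 flat clusters; SciPy labels are 1-based,
# so subtract 1 to match the 0-based labels shown above.
labels = fcluster(Z, t=4, criterion="maxclust") - 1

# Compute the dendrogram layout without drawing it (no_plot=True).
tree = dendrogram(Z, no_plot=True)
leaf_order = tree["leaves"]  # sample indices in left-to-right leaf order
```

The same tree `Z` can be cut at any number of clusters without refitting, which is the main practical advantage over K-Means when the right granularity is unknown.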

## Finding Optimal Clusters

### Elbow Method

```python
optimal = analyzer.find_optimal_clusters(max_k=10)

# Returns:
{
    "optimal_k": 4,
    "inertias": [1000, 800, 500, 300, 280, ...],
    "silhouettes": [0.5, 0.55, 0.6, 0.65, 0.63, ...]
}
```

### Elbow Plot

```python
analyzer.elbow_plot("elbow.png", max_k=10)
```

Generates a plot of inertia versus the number of clusters.
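
The data behind an elbow curve can be computed with a short loop over k. This document does not say how `optimal_k` is selected, so the sketch below swaps in a simple stand-in rule, picking the k with the highest silhouette score, and names it accordingly. The synthetic data, the starting value k=2 (the silhouette is undefined for a single cluster), and all variable names are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption).
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.7, random_state=7)

max_k = 10
ks = list(range(2, max_k + 1))  # silhouette needs at least 2 clusters
inertias, silhouettes = [], []

for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(float(model.inertia_))
    silhouettes.append(float(silhouette_score(X, model.labels_)))

# Stand-in selection rule (not necessarily what find_optimal_clusters does):
# choose the k whose clustering has the highest mean silhouette.
best_k_by_silhouette = ks[silhouettes.index(max(silhouettes))]
```

Plotting `inertias` against `ks` gives the elbow curve; the "elbow" is the point after which adding clusters yields only small reductions in inertia.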

## Cluster Statistics

```python
stats = analyzer.cluster_statistics()

# Returns:
{
    "n_clusters": 3,
    "cluster_sizes": {0: 150, 1: 200, 2: 100},
    "cluster_means": {
        0: {"age": 25.5, "income": 45000, ...},
        1: {"age": 45.2, "income": 75000, ...},
        2: {"age": 35.1, "income": 55000, ...}
    },
    "cluster_std": {
        0: {"age": 5.2, "income": 8000, ...},
        ...
    },
    "overall_silhouette": 0.65
}
```
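
Per-cluster means and standard deviations of this shape can be reproduced from any labeled DataFrame with a pandas `groupby`. The sketch below uses a tiny made-up table and assumes the label column is named `cluster_label`, as in the export examples; it is not the analyzer's own code.

```python
import pandas as pd

# Made-up illustrative data with labels already assigned.
df = pd.DataFrame({
    "age": [22, 25, 47, 51, 33, 36],
    "income": [40000, 42000, 80000, 76000, 55000, 57000],
    "cluster_label": [0, 0, 1, 1, 2, 2],
})

grouped = df.groupby("cluster_label")

# {cluster_id: count}
cluster_sizes = grouped.size().to_dict()

# {cluster_id: {feature: value}}, matching the nested layout above.
cluster_means = grouped.mean().to_dict(orient="index")
cluster_std = grouped.std().to_dict(orient="index")  # sample std (ddof=1)
```

Note that pandas uses the sample standard deviation (`ddof=1`) by default, while NumPy's `std` defaults to the population form (`ddof=0`); which one the analyzer reports is not specified here.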

## Visualization

### Cluster Plot

```python
# 2D plot (uses first 2 features or PCA)
analyzer.plot_clusters("clusters_2d.png")

# Specify dimensions
analyzer.plot_clusters("clusters.png", dimensions=["age", "income"])
```

### Dendrogram

```python
# For hierarchical clustering
analyzer.hierarchical(n_clusters=4)
analyzer.plot_dendrogram("dendrogram.png")
```

### Silhouette Plot

```python
analyzer.plot_silhouette("silhouette.png")
```

Shows the silhouette coefficient for each sample.
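
The values behind such a plot can be obtained with scikit-learn's `silhouette_samples`, which returns one coefficient per sample; the overall silhouette score is their mean. The sketch below is an illustration on synthetic data with hypothetical variable names, not the plotting code used by the analyzer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in data (assumption).
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=3)
labels = KMeans(n_clusters=3, n_init=10, random_state=3).fit_predict(X)

# One coefficient per sample, each in [-1, 1].
per_sample = silhouette_samples(X, labels)

# The overall score is the mean of the per-sample coefficients.
overall = silhouette_score(X, labels)

# Mean coefficient per cluster: low values flag loosely defined clusters.
per_cluster_mean = {
    int(c): float(per_sample[labels == c].mean()) for c in np.unique(labels)
}
```

Samples with coefficients near zero sit on a boundary between clusters, and negative values suggest a sample may be closer to a neighboring cluster than to its own.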

## Export Results

### Get Labels

```python
labels = analyzer.get_labels()
# [0, 1, 2, 0, 1, ...]
```

### Save Labeled Data

```python
analyzer.save_labeled("labeled_data.csv")
# Original data + cluster_label column
```

### Get Full DataFrame

```python
df = analyzer.to_dataframe()
# DataFrame with cluster_label column
```

## Example Workflows

### Customer Segmentation

```python
analyzer = ClusteringAnalyzer()
analyzer.load_csv("customers.csv", columns=["age", "income", "spending_score"])

# Find optimal number of segments
optimal = analyzer.find_optimal_clusters(max_k=8)
print(f"Optimal segments: {optimal['optimal_k']}")

# Cluster with optimal k
result = analyzer.kmeans(n_clusters=optimal['optimal_k'])

# Get segment characteristics
stats = analyzer.cluster_statistics()
for cluster_id, means in stats["cluster_means"].items():
    print(f"\nSegment {cluster_id}:")
    for feature, value in means.items():
        print(f"  {feature}: {value:.2f}")

# Save segmented data
analyzer.save_labeled("customer_segments.csv")
```

### Anomaly Detection with DBSCAN

```python
analyzer = ClusteringAnalyzer()
analyzer.load_csv("transactions.csv", columns=["amount", "frequency"])

# DBSCAN identifies noise points as potential anomalies
result = analyzer.dbscan(eps=0.3, min_samples=10)

print(f"Found {result['n_noise']} potential anomalies")

# Get anomalous records
df = analyzer.to_dataframe()
anomalies = df[df["cluster_label"] == -1]
```

### Document Clustering

```python
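# After TF-IDF transformation. load_dataframe expects a pd.DataFrame, so
# tfidf_matrix is assumed to be a dense DataFrame here, not a sparse matrix.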
analyzer = ClusteringAnalyzer()
analyzer.load_dataframe(tfidf_matrix)

# Hierarchical clustering to see document relationships
result = analyzer.hierarchical(n_clusters=5)
analyzer.plot_dendrogram("doc_dendrogram.png")
```

## Data Preprocessing

The analyzer automatically:
- Handles missing values (imputation)
- Scales features (standardization)
- Reduces dimensions for visualization (PCA)

For custom preprocessing:

```python
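import pandas as pd
from sklearn.preprocessing import StandardScaler

from clustering_analyzer import ClusteringAnalyzer

# Preprocess manually: standardize every column to zero mean, unit variance
df = pd.read_csv("data.csv")
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Load preprocessed data
analyzer = ClusteringAnalyzer()
analyzer.load_dataframe(df_scaled)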
```
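
The three automatic steps listed at the start of this section (imputation, standardization, PCA for visualization) could be approximated with scikit-learn as follows. This is a sketch under assumptions: the imputation strategy is not stated in this document, so mean imputation is used as a placeholder, and the small table is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up illustrative data with some missing values.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46],
    "income": [40000, np.nan, 58000, 91000, 72000],
    "spending": [60, 45, 30, np.nan, 20],
})

# Steps 1 and 2: fill missing values (mean is an assumed strategy),
# then standardize each feature to zero mean and unit variance.
prep = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X = prep.fit_transform(df)

# Step 3: project to 2 components, for plotting only.
coords_2d = PCA(n_components=2).fit_transform(X)
```

Clustering would run on `X` (all features), while `coords_2d` is only a convenient 2D view for a scatter plot; clusters that overlap in the projection may still be well separated in the full feature space.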

## Dependencies

- scikit-learn>=1.3.0
- pandas>=2.0.0
- numpy>=1.24.0
- matplotlib>=3.7.0
- scipy>=1.10.0