---
name: duplicate-report
description: >
  Run a comprehensive duplicate analysis on an Immich photo library using perceptual hashing.
  Finds cross-source duplicates (e.g. Apple Photos vs Google Photos exports), internal duplicates,
  and generates a detailed report with removal recommendations.
  Use when the user says "find duplicates", "duplicate report", "how many duplicates",
  "library health check", "photo dedup report", "run duplicate analysis",
  "compare my photo sources", or any variation of wanting to understand duplicate photos
  across import sources.
version: 1.2.0
---

# Duplicate Report

## ⚠️ Connection Required — ALWAYS CHECK FIRST

**Before doing ANYTHING else in this skill, call `ping` on the Immich MCP server.**

- If `ping` succeeds → proceed with the skill normally.
- If `ping` fails or the MCP tools are not available → **STOP. Do not continue.** Tell the user:

> ❌ **Immich is not connected.** This plugin needs a running Immich MCP server to work.
>
> Run **/setup-immich-photo-manager** to configure your Immich connection. You'll need:
> 1. Your Immich server URL (e.g., `http://192.168.1.100:2283`)
> 2. An Immich API key ([how to create one](https://immich.app/docs/features/command-line-interface#obtain-the-api-key))
> 3. The MCP server configured (see **/setup-immich-photo-manager**)
>
> Nothing in this plugin will work until the connection is configured.

**Do NOT skip this check. Do NOT try to run any other tool first. Always ping, always block if it fails.**

Generate a comprehensive duplicate analysis of an Immich photo library. Uses perceptual hashing to find visually identical photos even when they have different checksums (common when photos are exported from Apple Photos and Google Photos).

## Why Perceptual Hashing?

When users import the same photo library from multiple sources (Apple Photos export, Google Takeout, manual folder copies), the files are often **re-encoded** by each platform. This means:

- **Checksums differ** — same photo, different binary → SHA/MD5 won't match
- **Immich's built-in CLIP duplicate detection** uses a threshold that can be too strict for re-encoded content, so it may miss these copies
- **Filename matching** catches only a fraction (filenames often differ across platforms)

Perceptual hashing (pHash) computes a fingerprint based on the **visual content** of the image, not the binary data. Two re-encoded copies of the same photo usually produce the same perceptual hash, which is why the workflow below matches on hash equality.

## Prerequisites

The user's machine needs:

```bash
pip3 install Pillow imagehash pillow-heif --break-system-packages
```

- `Pillow` — image loading
- `imagehash` — perceptual hashing
- `pillow-heif` — HEIC/HEIF support (critical for Apple Photos)

## Analysis Workflow

### Step 0: ML-Based Duplicate Detection (Quick)

Before running the full perceptual hash scan, check Immich's built-in ML duplicate detection:

```
result = get_duplicates()
```

This returns groups of visually similar assets detected by Immich's ML engine. Present the count and let the user resolve obvious duplicates immediately using `resolve_duplicates`.

This is fast (no disk scan needed) but may miss re-encoded copies across import sources. For comprehensive cross-source analysis, proceed to Step 1.

> Note: `resolve_duplicates` handles Immich ML duplicates natively. Perceptual hashing (Steps 1–3 below) catches cross-source re-encoded duplicates that ML may miss.

### Step 1: Discover Import Sources

Query Immich to identify distinct import sources from asset paths:

```sql
SELECT
  CASE
    WHEN "originalPath" LIKE '%Apple Fotos%' OR "originalPath" LIKE '%Apple Photos%' THEN 'Apple Photos'
    WHEN "originalPath" LIKE '%Google Fotos%' OR "originalPath" LIKE '%Google Photos%' THEN 'Google Photos'
    ELSE split_part("originalPath", '/', 5)  -- or whatever level gives the source folder
  END as source,
  count(*) as total
FROM asset WHERE "deletedAt" IS NULL
GROUP BY source ORDER BY total DESC;
```

Present the sources to the user and ask which ones to compare.

### Step 2: Run Perceptual Hash Scan

For each source directory, scan all image files and compute 256-bit perceptual hashes:

```python
from pillow_heif import register_heif_opener
register_heif_opener()

from PIL import Image
import imagehash

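def compute_phash(filepath):
    """Return the 256-bit perceptual hash of an image as a hex string."""
    with Image.open(filepath) as img:
        # Normalize palette, grayscale, and alpha images to RGB before hashing
        if img.mode != 'RGB':
            img = img.convert('RGB')
        # hash_size=16 gives a 16x16 = 256-bit hash
        return str(imagehash.phash(img, hash_size=16))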
```

**Key parameters:**
- `hash_size=16` → 256-bit hash (high accuracy, very few false positives)
- Use `ThreadPoolExecutor` (NOT `ProcessPoolExecutor` — native HEIF libs deadlock on fork)
- 4 workers is a reasonable default for most machines
- Report progress every 500 files

**Expected performance:** ~500 files/30 seconds on Apple Silicon, ~200 files/30 seconds on Intel.
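
The parameters above could be wired together roughly as follows. This is a minimal sketch, not the plugin's actual scanner: `scan_source`, `IMAGE_EXTENSIONS`, and the `hash_fn` parameter are hypothetical names, and the extension list is an assumption. The hash function is passed in so that `compute_phash` from the block above can be supplied by the caller.

```python
import os
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed

# Assumed set of image extensions; adjust to the library being scanned.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".heic", ".heif"}

def scan_source(root, hash_fn, workers=4, progress_every=500):
    """Hash every image under root.

    Returns ({hash: [paths]}, {path: error message}).
    """
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _dirs, files in os.walk(root)
        for name in files
        if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
    ]
    hashes = defaultdict(list)
    errors = {}
    # Threads, not processes, to avoid forking native HEIF libraries.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(hash_fn, path): path for path in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            try:
                hashes[future.result()].append(path)
            except Exception as exc:  # unreadable or corrupt file
                errors[path] = str(exc)
            if done % progress_every == 0:
                print(f"{done}/{len(paths)} files hashed")
    return dict(hashes), errors
```

A call might look like `source_a_hashes, source_a_errors = scan_source(source_a_dir, compute_phash)`, where `source_a_dir` is whatever directory the user selected in Step 1. Collecting errors separately makes it easy to report how many files could not be read.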

### Step 3: Compute Overlap

Compare hash sets between sources:

```python
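# Each *_hashes dict maps a perceptual hash string to the list of file paths with that hash.
# Dict key views support set operations directly.
common = source_a_hashes.keys() & source_b_hashes.keys()
a_only = source_a_hashes.keys() - source_b_hashes.keys()
b_only = source_b_hashes.keys() - source_a_hashes.keys()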
```

For internal duplicates within a single source:
```python
internal_dupes = sum(len(v) - 1 for v in hashes.values() if len(v) > 1)
```
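
As a worked illustration of the two snippets above, here is a toy example. The hash strings and file paths are made up, and the `{hash: [paths]}` layout is an assumption inferred from the `len(v)` check in the internal-duplicates snippet.

```python
# Hypothetical scan results: {perceptual_hash: [file paths]}
source_a_hashes = {
    "aa11": ["/a/IMG_0001.HEIC"],
    "bb22": ["/a/IMG_0002.HEIC", "/a/IMG_0002 (1).HEIC"],  # internal duplicate
    "cc33": ["/a/IMG_0003.HEIC"],
}
source_b_hashes = {
    "aa11": ["/b/photo_1.jpg"],
    "bb22": ["/b/photo_2.jpg"],
    "dd44": ["/b/photo_4.jpg"],
}

common = source_a_hashes.keys() & source_b_hashes.keys()
a_only = source_a_hashes.keys() - source_b_hashes.keys()
b_only = source_b_hashes.keys() - source_a_hashes.keys()

def count_internal(hashes):
    """Number of surplus files: every copy beyond the first for each hash."""
    return sum(len(paths) - 1 for paths in hashes.values() if len(paths) > 1)

# If Source A is the one being kept, every Source B file whose hash
# also appears in Source A is a removal candidate.
removal_candidates = sorted(p for h in common for p in source_b_hashes[h])

print(len(common), len(a_only), len(b_only))
print(count_internal(source_a_hashes), count_internal(source_b_hashes))
print(removal_candidates)
```

Here two hashes are shared, each source has one hash the other lacks, and Source A contains one internal duplicate.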

### Step 4: Generate Report

Present findings in a structured report:

```
DUPLICATE ANALYSIS REPORT

Library: [total] assets ([photos] photos + [videos] videos)
Sources analyzed: [Source A] ([count] files), [Source B] ([count] files)

CROSS-SOURCE DUPLICATES
  [Source A] <-> [Source B] visual matches:    [count] ([pct]% overlap)

UNIQUE TO EACH SOURCE
  [Source A]-only photos:               [count]
  [Source B]-only photos:               [count]

INTERNAL DUPLICATES
  Within [Source A]:                    [count]
  Within [Source B]:                    [count]

TOTAL REMOVABLE
  Cross-source duplicates:         [count]
  Internal duplicates:             [count]
  TOTAL:                           [count] files

RECOMMENDATION
  Keep: [Source with better metadata/folder structure]
  Remove: [Other source] copies where match exists
  Review: [count] [other]-only photos are NOT duplicates — keep them
```
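
The counts in the template could be derived along these lines. This is a sketch under stated assumptions: `summarize` is a hypothetical helper, the inputs use the `{hash: [paths]}` layout from Step 3, and the overlap percentage is measured against the smaller source, which is only one possible definition. Counting cross-source duplicates as one per shared hash keeps the TOTAL additive, so no file is counted twice.

```python
def summarize(a_hashes, b_hashes):
    """Compute report counts from two {hash: [paths]} dicts."""
    common = a_hashes.keys() & b_hashes.keys()
    internal_a = sum(len(paths) - 1 for paths in a_hashes.values())
    internal_b = sum(len(paths) - 1 for paths in b_hashes.values())
    smaller = min(len(a_hashes), len(b_hashes))
    return {
        "cross": len(common),
        "a_only": len(a_hashes.keys() - b_hashes.keys()),
        "b_only": len(b_hashes.keys() - a_hashes.keys()),
        "internal_a": internal_a,
        "internal_b": internal_b,
        # Assumption: overlap is relative to the smaller source's distinct hashes
        "pct": round(100 * len(common) / smaller, 1) if smaller else 0.0,
        # One removal per shared hash plus every surplus copy within each source
        "total_removable": len(common) + internal_a + internal_b,
    }
```

With this definition, `total_removable` equals the total number of files minus the number of distinct hashes, that is, everything except one copy of each distinct photo.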

### Step 5: Removal (User-Approved)

**NEVER auto-remove.** Always:

1. Present the report with counts
2. Ask user which categories to remove
3. Confirm the exact count
4. Execute removal in stages:
   a. Move to Immich trash: `delete_assets(asset_ids=[...], force=False)`. This is the safer option, recoverable via `restore_assets` or `restore_trash`.
   b. Remove the physical files from disk (`os.remove()`) only after the user confirms the trash contents are correct.
   c. Delete permanently only if the user explicitly requests it: `delete_assets(asset_ids=[...], force=True)`. This is irreversible.
5. Log everything to a JSON file for audit

Batch Immich deletions in groups of 100 assets per call. For ML-detected duplicates, prefer `resolve_duplicates` which handles them natively in Immich.
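
The batching and audit logging could be sketched as below. `delete_fn` is a hypothetical Python wrapper around the `delete_assets` tool (the tool itself is called through MCP, not imported), and writing one JSON object per line is an assumed log layout.

```python
import json
from datetime import datetime, timezone

def batched(ids, size=100):
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def trash_in_batches(asset_ids, delete_fn, log_path, size=100):
    """Call delete_fn once per batch and append one JSON line per call."""
    with open(log_path, "a", encoding="utf-8") as log:
        for batch in batched(asset_ids, size):
            # force=False moves assets to the trash instead of deleting them
            result = delete_fn(asset_ids=batch, force=False)
            entry = {
                "time": datetime.now(timezone.utc).isoformat(),
                "action": "trash",
                "asset_ids": batch,
                "result": result,
            }
            log.write(json.dumps(entry, default=str) + "\n")
```

Appending to the log after every batch means a partial run still leaves a record of exactly which assets were trashed.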

### Step 6: Verify

After removal, query Immich statistics to confirm the new count and present before/after comparison.

## Report Variations

### Quick Report (no disk scan)
Uses only the Immich database (checksums, filenames, timestamps). Fast, but misses re-encoded duplicates.

```sql
-- Exact checksum duplicates
SELECT checksum, count(*) FROM asset
WHERE "deletedAt" IS NULL
GROUP BY checksum HAVING count(*) > 1;

-- Filename overlap between sources
SELECT count(*) FROM (
  SELECT "originalFileName" FROM asset
  WHERE "originalPath" LIKE '%Source A%' AND "deletedAt" IS NULL
  INTERSECT
  SELECT "originalFileName" FROM asset
  WHERE "originalPath" LIKE '%Source B%' AND "deletedAt" IS NULL
) t;
```

### Full Report (perceptual hash)
Scans actual files on disk and catches re-encoded duplicates. Requires filesystem access and the Python dependencies. At the Step 2 rate of ~500 files/30 seconds on Apple Silicon, ~40K photos take roughly 40 minutes.

### Year-by-Year Breakdown
Shows which source dominates each year — helps users understand their photo ecosystem history:

```sql
SELECT year, source_a_count, source_b_count,
  CASE WHEN source_a_count > source_b_count THEN 'Source A' ELSE 'Source B' END as dominant
FROM (
  SELECT extract(year from "localDateTime") as year,
    count(*) FILTER (WHERE "originalPath" LIKE '%Source A%') as source_a_count,
    count(*) FILTER (WHERE "originalPath" LIKE '%Source B%') as source_b_count
  FROM asset WHERE "deletedAt" IS NULL
  GROUP BY year
) t ORDER BY year;
```

## Important Notes

- **Perceptual hashing has rare false positives** — two visually very similar (but different) photos may share a hash. The 256-bit hash size minimizes this, but users should spot-check a few matches before bulk removal.
- **Videos are excluded** from perceptual hashing — they need a different approach (frame extraction + hashing).
- **HEIC support is essential** — without `pillow-heif`, Apple Photos libraries will have massive error rates (50%+ of files).
- **ThreadPoolExecutor, not ProcessPoolExecutor** — native HEIF libraries deadlock when forked on macOS. Always use threads.
- **Background Immich scanning** may add new assets during analysis. Note this in the report if the post-cleanup count seems off.
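
The spot-check recommended in the first note could be done by pulling a random sample of matched pairs for manual review. A minimal sketch, assuming the `{hash: [paths]}` dicts from Step 3; `sample_matches` is a hypothetical helper name.

```python
import random

def sample_matches(a_hashes, b_hashes, k=10, seed=None):
    """Return up to k (hash, path_in_a, path_in_b) tuples for manual review."""
    common = sorted(a_hashes.keys() & b_hashes.keys())
    rng = random.Random(seed)
    picked = rng.sample(common, min(k, len(common)))
    # Show the first file from each source for every sampled hash
    return [(h, a_hashes[h][0], b_hashes[h][0]) for h in picked]
```

Opening each returned pair side by side before bulk removal gives a quick sanity check that matches really are the same photo.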