---
name: progress
description: Check pipeline processing progress — all phases including OCR, translation, image extraction, enrichment, and more. Shows real page-level verification, not just job counters. Use when asked "how's it going?", "progress?", "status?", "now?", or any progress check.
---

# Pipeline Progress Check

Report on processing progress with **verified page-level data** across ALL pipeline phases. Job counters lie — always cross-reference against actual page records.

Also remind the user: full pipeline dashboard is live at https://sourcelibrary.org/analytics (Pipeline tab).

## Critical Diagnostic Rules

1. **Never trust job counters alone.** A job showing `status: 'processing'` with `progress.completed: 0` for hours is STUCK, not "in progress".
2. **Never attribute global throughput to a specific operation.** Translation throughput from `pages.translation.updated_at` includes ALL sources (pipeline cron, Lambda workers, backfill jobs). Filter by book IDs to attribute correctly.
3. **Spot-check actual pages.** For any "processing" job, check if its pages actually have `translation.data` or `ocr.data`. This is ground truth.
4. **Flag stale jobs explicitly.** Any job in `processing` for >2 hours with 0 progress is stuck. Say so clearly.

## Snapshot-First Approach (PREFERRED)

A pre-computed snapshot exists at `system_config._id: 'enrichment_snapshot'` (updated every 2h by Hetzner cron). **Always try this first** — it's a single document read vs 15+ slow aggregations.

```javascript
const snapshot = await db.collection('system_config').findOne({ _id: 'enrichment_snapshot' });
```

If the snapshot exists and `computed_at` is within the last 3 hours, use it directly. Display the `computed_at` timestamp so the user knows how fresh the data is.

The snapshot contains: `funnel`, `enrichment`, `milestones`, `gallery`, `active_jobs`, `stuck_jobs`, `throughput`, `paused`, `paused_phases`, and `computation_ms`.

**Only fall back to live queries** if:
- The snapshot doesn't exist
- The snapshot is older than 3 hours
- The user explicitly asks for "live" or "fresh" data

For live queries, prefer running the enrichment-snapshot script directly (`node scripts/workers/enrichment-snapshot.mjs`) rather than inline queries — it writes the snapshot and you can read it back.

## What to Check (Live Fallback)

Run a single MongoDB script that reports ALL of the following. Use `set -a; source .env.production.local; set +a; node -e "..."` to run.

### 1. Pipeline Funnel (bird's eye)

```javascript
// Pipeline status distribution — full lifecycle
const funnel = await db.collection('books').aggregate([
  { $match: { 'pipeline_auto.status': { $exists: true } } },
  { $group: { _id: '$pipeline_auto.status', count: { $sum: 1 } } },
]).toArray();
```

**IMPORTANT: Display in lifecycle order, not sorted by count.** Use this fixed order (skip stages with 0 books):

```
queued → archiving → archive_complete → ocr_submitted → ocr_complete →
metadata_enriched → ft_verifying → ft_verified → translate_submitted →
translate_complete → enriching → enriched → chapters → chapters_complete →
images_submitted → images_complete → complete
```

Then separately: `needs_attention`, `failed`, plus any other statuses (e.g. `empty_shell`, `paused`, `replaced_by_*`).

This matches the Pipeline Funnel chart in the /analytics dashboard.

### 2. Enrichment Coverage (all phases)

```javascript
// Single aggregation for all enrichment phases
const enrichment = await db.collection('books').aggregate([
  { $match: { status: { $ne: 'deleted' }, pages_count: { $gt: 0 } } },
  { $group: {
    _id: null,
    total: { $sum: 1 },
    has_ocr: { $sum: { $cond: [{ $gt: ['$pages_ocr', 0] }, 1, 0] } },
    has_translation: { $sum: { $cond: [{ $gt: ['$pages_translated', 0] }, 1, 0] } },
    has_metadata: { $sum: { $cond: [{ $ifNull: ['$ai_metadata.enriched_at', false] }, 1, 0] } },
    has_ft_verification: { $sum: { $cond: [{ $ifNull: ['$translation_verification.verified_at', false] }, 1, 0] } },
    has_summary: { $sum: { $cond: [{ $ifNull: ['$index.generatedAt', false] }, 1, 0] } },
    has_chapters: { $sum: { $cond: [{ $ifNull: ['$chapters', false] }, 1, 0] } },
    has_collections: { $sum: { $cond: [{ $ifNull: ['$collection_scores', false] }, 1, 0] } },
    has_quality_score: { $sum: { $cond: [{ $ifNull: ['$quality_score', false] }, 1, 0] } },
    has_faceted_tags: { $sum: { $cond: [{ $ifNull: ['$faceted_tags', false] }, 1, 0] } },
    has_author_entity: { $sum: { $cond: [{ $ifNull: ['$author_entity_id', false] }, 1, 0] } },
    fully_translated: { $sum: { $cond: [{ $and: [{ $gt: ['$pages_translated', 0] }, { $gte: ['$pages_translated', { $subtract: [{ $ifNull: ['$pages_ocr', 0] }, { $ifNull: ['$pages_blank', 0] }] }] }] }, 1, 0] } },
    pipeline_complete: { $sum: { $cond: [{ $eq: ['$pipeline_auto.status', 'complete'] }, 1, 0] } },
  }}
]).toArray();
```

### 2b. First Translation Milestones

```javascript
// First translations into English (root-level boolean, set by Stage 2 verification)
const firstTranslations = await db.collection('books').countDocuments({
  is_first_translation: true
});
console.log(`First translations into English: ${firstTranslations}`);

// >90% translated books
const near90 = await db.collection('books').countDocuments({
  status: { $ne: 'deleted' },
  pages_count: { $gt: 0 },
  pages_translated: { $gt: 0 },
  $expr: { $gte: ['$pages_translated', { $multiply: [{ $subtract: [{ $ifNull: ['$pages_ocr', 0] }, { $ifNull: ['$pages_blank', 0] }] }, 0.9] }] }
});
console.log(`>90% translated: ${near90}`);
```

**IMPORTANT:** The first-translation flag is `is_first_translation` at the book root level (NOT `translation_verification.is_first_translation`). It's a boolean derived from the `translation_verification.disposition` field — `true` for `confirmed_first`, `first_complete_translation`, and `first_modern_translation`.

### 3. Image Extraction & Gallery

```javascript
// Books with extracted images — use gallery_images (70K docs), NOT pages.distinct (times out on Atlas)
const booksWithImages = await db.collection('gallery_images').distinct('book_id');
console.log(`Books with extracted images: ${booksWithImages.length}`);

// Gallery images count
const galleryCount = await db.collection('gallery_images').estimatedDocumentCount();
console.log(`Gallery images: ${galleryCount}`);

// Image extraction jobs
const imageJobs = await db.collection('jobs').aggregate([
  { $match: { type: 'image_extraction', status: { $in: ['pending', 'processing'] } } },
  { $group: { _id: '$status', count: { $sum: 1 }, pages: { $sum: '$progress.total' }, done: { $sum: '$progress.completed' } } }
]).toArray();
```

### 4. Active Jobs by Type (all types)

```javascript
const activeJobs = await db.collection('jobs').aggregate([
  { $match: { status: { $in: ['pending', 'processing'] } } },
  { $group: {
    _id: { type: '$type', note: '$note' },
    count: { $sum: 1 },
    totalPages: { $sum: '$progress.total' },
    completedPages: { $sum: '$progress.completed' },
    failedPages: { $sum: '$progress.failed' },
    oldest: { $min: '$created_at' }
  }}
]).toArray();
```

### 5. Stuck Job Detection (CRITICAL)

```javascript
const twoHoursAgo = new Date(Date.now() - 2 * 3600000);
const stuckJobs = await db.collection('jobs').countDocuments({
  status: 'processing',
  'progress.completed': 0,
  created_at: { $lt: twoHoursAgo }
});
// If stuckJobs > 0, these are DEAD. Report them prominently.
```

### 6. Real Throughput (page-level verification)

```javascript
// Translation throughput - last 1h, 3h, 6h
for (const hours of [1, 3, 6]) {
  const since = new Date(Date.now() - hours * 3600000);
  const count = await db.collection('pages').countDocuments({
    'translation.updated_at': { $gte: since }
  });
  console.log(`Translations last ${hours}h: ${count} (${Math.round(count/hours)}/hr)`);
}

// OCR throughput
for (const hours of [1, 3, 6]) {
  const since = new Date(Date.now() - hours * 3600000);
  const count = await db.collection('pages').countDocuments({
    'ocr.updated_at': { $gte: since }
  });
  console.log(`OCR last ${hours}h: ${count} (${Math.round(count/hours)}/hr)`);
}
```

### 7. Backfill-Specific Progress (when applicable)

For any named backfill operation (identified by `note` field on jobs), verify progress against actual pages:

```javascript
// Get a sample of "processing" backfill jobs
const sampleJobs = await db.collection('jobs').find({
  note: 'near-complete backfill', // or whatever the current operation is
  status: 'processing'
}).limit(5).toArray();

// For each, verify pages actually got translated
for (const job of sampleJobs) {
  const bookPages = await db.collection('pages').countDocuments({
    book_id: job.book_id,
    'translation.data': { $exists: true, $ne: '' }
  });
  const totalPages = await db.collection('pages').countDocuments({ book_id: job.book_id });
  console.log(`${job.book_title?.substring(0,40)} — ${bookPages}/${totalPages} pages actually translated, job says ${job.progress.completed}/${job.progress.total}`);
}
```

### 8. ETA Calculation

```javascript
// Only use VERIFIED throughput (page-level, not job-level)
const oneHourAgo = new Date(Date.now() - 3600000);
const hourlyRate = await db.collection('pages').countDocuments({
  'translation.updated_at': { $gte: oneHourAgo }
});

// Remaining work
const remaining = activeJobs
  .filter(j => j._id.type === 'translation')
  .reduce((sum, j) => sum + (j.totalPages - j.completedPages - j.failedPages), 0);

const etaHours = hourlyRate > 0 ? remaining / hourlyRate : Infinity;
```

### 9. Pause Status

```javascript
const control = await db.collection('system_config').findOne({ _id: 'processing_control' });
if (control?.paused) console.log('PIPELINE PAUSED');
if (control?.paused_phases?.length) console.log('Paused phases: ' + control.paused_phases.join(', '));
```

### 10. Batch API Health (IMPORTANT)

The batch-collector writes `system_config._id: 'batch_health'` every 10 minutes with real Gemini-side state. **Always read this.**

```javascript
const batchHealth = await db.collection('system_config').findOne({ _id: 'batch_health' });
if (batchHealth) {
  const age = ((Date.now() - new Date(batchHealth.updated_at).getTime()) / 60000).toFixed(0);
  console.log(`Batch health (${age}m old):`);
  console.log(`  Gemini active: ${batchHealth.geminiActive} (by key: ${batchHealth.geminiActiveByKey?.join(', ')})`);
  console.log(`  DB active: ${batchHealth.dbActive} | Zombies: ${batchHealth.dbZombies}`);
  console.log(`  Completions: ${batchHealth.recentCompletions1h}/1h, ${batchHealth.recentCompletions6h}/6h`);
  console.log(`  OCR pages saved: ${batchHealth.recentPagesSaved1h}/1h, ${batchHealth.recentPagesSaved6h}/6h`);
  console.log(`  Healthy: ${batchHealth.healthy}`);
  if (batchHealth.issues?.length) console.log(`  Issues: ${batchHealth.issues.join('; ')}`);
}
```

**Key signals:**
- `healthy: false` = something needs attention. Check `issues` array.
- `geminiActive > 0` but `recentCompletions1h === 0` = pipeline may be frozen
- `dbZombies > 0` = jobs created but never submitted (auto-cleaned if >1h old)
- `orphansCancelled > 0` = drift detected and auto-fixed

## Output Format

Present results as a concise status report:

```
Pipeline Status — [timestamp]
Dashboard: https://sourcelibrary.org/analytics (Pipeline tab)

Pipeline Funnel (in lifecycle order, skip 0s):
  archiving:          13,869
  archive_complete:   12,014
  ocr_submitted:         118
  metadata_enriched:   5,914
  translate_submitted:     7
  translate_complete:     19
  complete:            3,713
  ---
  needs_attention:     2,624
  failed:                 57
  other:                  50  (empty_shell, paused, replaced_by_*)

Milestones:
  First translations (EN):   X books
  >90% translated:           X books
  Fully translated:          X books

Enrichment coverage (of Y total books):
  OCR:              X (Z%)
  Translation:      X (Z%)
  Metadata:         X (Z%)
  FT Verification:  X (Z%)
  Summary & Index:  X (Z%)
  Chapters:         X (Z%)
  Collections:      X (Z%)
  Quality Score:    X (Z%)
  Image Extraction: X books  |  Gallery: N images
  Faceted Tags:     X (Z%)
  Author Entities:  X (Z%)
  Pipeline Complete: X (Z%)

Active jobs:
  Translation (backfill): X jobs, Y/Z pages done [rate/hr, ETA]
  Translation (pipeline): X jobs, Y/Z pages done
  OCR: X jobs
  Image Extraction: X jobs
  ...

STUCK JOBS: X jobs processing with 0 progress for >2h  ← only if > 0

Throughput (verified from pages):
  Translation: X/hr (1h), Y/hr (3h), Z/hr (6h)
  OCR: X/hr (1h), Y/hr (3h)

Spot check (5 random processing jobs):
  "Book Title" — 45/50 pages translated (job says 43/50)
  "Book Title" — 0/30 pages translated (job says 0/30) STUCK

Batch API Health (updated Xm ago):
  Gemini active: X (key0: X, key1: X, key2: X)  ← real Gemini state
  DB active: X | Zombies: X
  Completions: X/1h, X/6h | OCR pages: X/1h, X/6h
  Status: HEALTHY  (or: ISSUES — list problems)

Pauses: none (or list)
```

## Common Issues

- **Jobs stuck at 0 progress:** SQS messages consumed but Lambda never processed them. Cancel stuck jobs, clear `book.job` locks, re-submit.
- **Throughput looks good but backfill isn't moving:** You're seeing pipeline cron throughput, not backfill. Filter by backfill book IDs.
- **`pages_translated` stale:** `sync-page-counts` cron runs every 6h. For real-time count, query pages directly.
- **Book.job lock prevents re-submission:** Clear with `$unset: { job: '' }` on affected books.
- **Image extraction stuck:** Check if `images` phase is in `paused_phases`. Also check Lambda CloudWatch for image-extraction-processor errors.
- **Enrichment coverage low for summaries/chapters:** These run via Hetzner `enrich-worker.mjs` (every 5 min, flock `/tmp/sl-enrich.lock`). Check `tail /var/log/sourcelibrary/enrich.log` on Hetzner, NOT the old Vercel `enrich-books` cron (deprecated since ~March 2026).