# Source Library System Map

> Last audited: 2026-04-24. Use this as the primary navigation reference.

## Architecture Overview

```
Users ──> Vercel (Next.js 16) ──> MongoDB Atlas (bookstore)
              │                               ▲
              ├──> Supabase Postgres          │ write-processor
              │    (analytics, browse, catalog, search)
              │                               │
              ├──> Cloudflare R2              │
              │    (images.sourcelibrary.org) │
              │                               │
              ├──> SQS Queues ──> Lambda Workers ──> Gemini AI
              │    (eu-central-1) (OCR, Translation, Images)
              │
              ├──> Hetzner (46.224.122.120)
              │    (pipeline orchestration, translate-worker, embedding server)
              │
              ├──> Stripe (Ficino Society payments)
              ├──> Twitter/X API (@SourceLibrary_)
              ├──> Zenodo (DOI minting)
              ├──> Museum APIs (Met, Cleveland, Rijksmuseum, AIC)
              └──> IIIF Sources (IA, Gallica, Wellcome, etc.)
```

## Infrastructure Map

| Service | Purpose | Key Config |
|---------|---------|------------|
| **Vercel** | Next.js hosting, 7 crons | Project: `sourcelibrary-v2` |
| **MongoDB Atlas** | Primary database | DB: `bookstore`, ~17K live + ~24.5K warehouse books |
| **Supabase Postgres** | Analytics, browse cache, catalog, search | pgvector, pg_trgm, pg_cron |
| **AWS Lambda** (eu-central-1) | AI processing workers | 4 functions, SQS-triggered |
| **AWS SQS** (eu-central-1) | Job queues (FIFO) | 4 queues: OCR, translation, images, write |
| **Cloudflare R2** | Image/page storage | `images.sourcelibrary.org` |
| **Hetzner** (cax31) | Pipeline orchestration, translation, embeddings | `root@46.224.122.120`, unified scheduler |
| **Gemini AI** | OCR, translation, enrichment | 3 unique keys (3 GCP projects). BPH: `gemini-3-flash-preview`, others: `gemini-3.1-flash-lite-preview` |
| **Stripe** | Payments | Ficino Society membership |
| **Zenodo** | DOI publishing | Scholarly editions |
| **Twitter/X** | Social automation | 3h posting cron |

## Data Pipeline Flow

The pipeline has **13+ phases** managed by `pipeline-orchestrator.mjs` (every 2 min) and `enrich-worker.mjs` (every 5 min) on Hetzner.
Books flow through `pipeline_auto.status` states:

```
Import (IA/Gallica/IIIF/Wellcome/etc.)
  └─> Phase 0: Auto-enroll ──> queued
  └─> Phase 1: Archive (download images to R2) ──> archiving ──> archive_complete
  └─> Phase 1.25: Split detection (spread scans → individual pages)
  └─> Phase 1.5: Preview OCR (first 25 pages via Lambda, fast turnaround)
  └─> Phase 1.7: Preview translation (inline via Vercel)
  └─> Phase 2: Batch OCR (Gemini Batch API, 3.1-flash-lite) ──> ocr_submitted
  └─> Phase 3: OCR completion check ──> ocr_complete
  └─> Phase 3.5: Metadata enrichment (catalog lookups, no Gemini) ──> metadata_enriched
  └─> Phase 3.7: Transliteration (non-Latin scripts, 3.1-flash-lite)
  └─> Phase 4: Translation dispatch (creates job for translate-worker)
  │     translate-worker.mjs (15 concurrent books) ──> Gemini realtime ──> MongoDB pages
  │     ⚠ Realtime API only. NEVER use Batch API for translation.
  └─> Phase 5: Translation completion ──> translate_complete

enrich-worker.mjs (every 5 min):
  └─> Phase 6: Summary + Index (3.1-flash-lite) ──> summary_indexed
  └─> Phase 7: Chapter extraction (3.1-flash-lite) ──> chapters_complete
  └─> Phase 7.5: Quality scoring (3.1-flash-lite, 0-100 score)
  └─> Phase 7.6: Collection assignment (3.1-flash-lite, additive)

pipeline-orchestrator.mjs (continued):
  └─> Phase 8: Image extraction (Lambda) ──> images_submitted ──> images_complete
  └─> Phase 8.5: Staleness detection (>48h stuck → retry or flag)
  └─> Phase 8.9: Cover selection + page cleanup ──> cover_selected
  └─> Phase 9: Finalize (validate OCR >10%, auto-unhide) ──> complete

batch-collector.mjs (every 10 min):
  Polls Gemini Batch API, saves OCR results, zombie reaper (>6h), ghost cleanup
```

Concurrency limits managed by `system_config.adaptive_limits` (auto-halved when Atlas degrades). Backpressure: `system_config.paused_phases` array.

## Supabase Layer (added 2026-03-27+)

Supabase serves derived reads for performance-critical paths. MongoDB remains source of truth.

| Table/View | Purpose | Source |
|------------|---------|--------|
| `books_catalog` | Browse cache (11s→0.6s) | Synced from MongoDB `books` |
| `page_translations` | Semantic search embeddings | pgvector, Gemini embedding-2-preview (768d/3072d) |
| `gemini_usage` | AI cost analytics | Synced from MongoDB |
| `pipeline_snapshots` | Pipeline velocity charts | Synced from MongoDB |
| `cron_runs` | Cron execution logs | Synced from MongoDB |
| `ustc_editions` / `ustc_enrichments` | USTC catalog | Direct import |
| `contributing_library` | Library pages (was 5s timeout) | Materialized view |

Key: `src/lib/supabase.ts` (client), `.claude/docs/supabase.md` (full reference — see **Sync Points (Complete Map)** for every Mongo→Supabase write path and **Known Sync Gaps** for documented edge cases like the PATCH-bypass)

## Author Pages (added 2026-03-30+)

Entity-driven author system with normalized names, Wikipedia enrichment, aliases.

| Route | Purpose |
|-------|---------|
| `/author/[name]` | Author detail — catalog table, title page gallery, publisher column |
| `/browse/authors` | Author listing/browse |
| `/api/admin/revalidate-authors` | ISR revalidation endpoint |

Key: `src/app/author/`, `src/app/browse/authors/`, author normalization in `src/lib/`

## Visual Art Wing (added 2026-03-30+)

Museum artwork imports alongside historical texts. ~18K artworks. Mixed collections show both.

| Source | Script | Status |
|--------|--------|--------|
| Met Museum | `scripts/import-met-artworks.mjs` | Active |
| Cleveland Museum | `scripts/import-cleveland-artworks.mjs` | Active |
| Wikimedia Commons | `scripts/import-commons-artworks.mjs` | Active (NOT `import-artwork.mjs`) |
| Rijksmuseum, AIC | Various | Active |

Routes: `/artwork/[slug]`, `/artist/[name]`, `/api/artwork/`. Collections support `collection_type` field. Gemini Vision cataloging, CLIP embeddings (3072d), semantic search.
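
As a rough illustration of the embedding-based semantic search mentioned above, the sketch below ranks artworks by cosine similarity between a query vector and stored vectors. It is an in-memory sketch under stated assumptions: the `EmbeddedArtwork` shape, the `rankBySimilarity` helper, the slugs, and the toy 3-dimensional vectors are all hypothetical, and the production search may well run inside pgvector or a dedicated server rather than in application code.

```typescript
// Hypothetical record shape; the real artwork documents are not specified in this map.
interface EmbeddedArtwork {
  slug: string;
  embedding: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector has zero norm.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every item against the query and keep the `limit` best matches.
function rankBySimilarity(
  query: number[],
  items: EmbeddedArtwork[],
  limit: number
): { slug: string; score: number }[] {
  return items
    .map((item) => ({ slug: item.slug, score: cosineSimilarity(query, item.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}

// Toy data: 3 dimensions instead of the 768d/3072d vectors named in this document.
const sample: EmbeddedArtwork[] = [
  { slug: "artwork-a", embedding: [1, 0, 0] },
  { slug: "artwork-b", embedding: [0.9, 0.1, 0] },
  { slug: "artwork-c", embedding: [0, 1, 0] },
];

const top = rankBySimilarity([1, 0, 0], sample, 2);
console.log(top.map((r) => r.slug));
```

Scanning every vector in application code is only reasonable for small sets; at the scale described here (~18K artworks, millions of pages) an indexed vector store is the more plausible home for this comparison.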

## MongoDB Collections (73)

### Core Data

| Collection | Purpose | Key Fields |
|------------|---------|------------|
| `books` | Book metadata (~17K live) | `id`, `title`, `author`, `slug`, `pages_count`, `pages_ocr`, `pages_translated` |
| `books_warehouse` | Archived books (~24.5K) | Same schema, moved for Atlas perf |
| `pages` | Individual pages (~3.1M live) | `book_id`, `ocr.data`, `translation.data`, `detected_images`, `page_type` |
| `pages_warehouse` | Archived pages (~6.4M) | Same schema |
| `deleted_books` | Soft-deleted books | Same as books, recoverable |
| `collections` | Book groupings | `slug`, `name`, `hidden`, `collection_type` |
| `entities` | Legacy per-string author/encyclopedia layer (people, places, concepts). **Being retired for authorship** — superseded by `authors`. | linked via `books.author_entity_id` |
| `authors` | **Canonical person thesaurus** — one doc per person, `_id`=slug. Books FK via `books.author_id`. | variants, variant_slugs, viaf_id, wikidata_id, entity_ids · see `.claude/docs/author-identity-system.md` |
| `translation_catalogs` | **Prior-translation registry** — known English translations the first-translation verifier checks first. ~24k rows (UNESCO Index Translationum, Loeb, Brill, Penguin…). Drives `is_first_translation`. | source, author, english_title, translator, pub_year, completeness · see `.claude/docs/first-translation-system.md` |

### Processing

| Collection | Purpose |
|------------|---------|
| `jobs` | Processing job queue |
| `batch_jobs` | Batch OCR/translation jobs |
| `page_revisions` | OCR/translation history (MUST create before writes) |
| `gemini_usage` | AI cost tracking (single source of truth) |
| `system_config` | Global settings (`processing_control` with `paused` flag) |

### User & Social

| Collection | Purpose |
|------------|---------|
| `users` | NextAuth accounts |
| `admin_users` | Admin whitelist |
| `likes`, `highlights`, `reading_history` | User engagement |
| `discussions`, `discussion_replies` | Ficino Society forum |
| `social_posts`, `social_config` | Twitter automation |
| `purchases` | Stripe payments |

### Gallery & Media

| Collection | Purpose |
|------------|---------|
| `gallery_images` | Extracted page images |
| `gallery_collections` | Curated image sets |
| `gallery_embeddings` | Image similarity vectors |
| `detected_images` | Gemini image detection results |

### Analytics & Monitoring

| Collection | Purpose |
|------------|---------|
| `analytics_events`, `analytics_pageviews` | User behavior (migrating to Supabase) |
| `pipeline_snapshots`, `pipeline_health_daily` | Pipeline metrics (mirrored to Supabase) |
| `cron_runs` | Cron execution logs (mirrored to Supabase) |
| `audit_log` | Admin action trail |
| `application_errors` | Error logging |

### Research & Experiments

`experiments`, `ocr_experiments`, `ocr_judgments`, `pipeline_experiments`, `pipeline_judgments`, `split_models`, `split_training_examples`, `split_adjustments`

### Other

`external_catalog` (IIIF union catalog), `editions`, `kdp_publications`, `book_metadata_changelog`, `prompts`, `libraries`, `volunteers`, `contributions`, `feedback`, `beta_subscribers`, `email_drafts`, `comparisons`, `entity_aliases`, `translation_catalogs`, `curation_drafts`, `curator_sessions`

## File System Layout

```
src/
├── app/                        # Next.js app router
│   ├── api/                    # ~400 API routes (direct DB queries, no repository layer)
│   │   ├── books/[id]/         # 60 book operations
│   │   ├── admin/              # 60 admin endpoints
│   │   ├── import/             # 26 IIIF source importers
│   │   ├── iiif/[id]/          # IIIF we EXPOSE (manifest/canvas/search) → .claude/docs/iiif-api.md
│   │   ├── pages/[id]/         # 15 page operations
│   │   ├── cron/               # 6 active cron routes (7 scheduled in vercel.json)
│   │   ├── search/             # 8 search endpoints (main, unified, visual, semantic, suggest, etc.)
│   │   ├── gallery/            # 8 gallery endpoints
│   │   ├── social/             # 11 social media endpoints
│   │   ├── embassy/            # 9 librarian/chat endpoints
│   │   ├── embed/              # 7 embed endpoints (BPH, Bhutan)
│   │   ├── image/              # Image proxy
│   │   ├── mcp/                # MCP server endpoint (OAuth)
│   │   ├── health/             # Health check (+ auth sub-route)
│   │   ├── artwork/            # Artwork search
│   │   ├── stripe/             # 4 payment endpoints
│   │   ├── dataset/v1/         # Public API (keyed access)
│   │   └── ...                 # experiments, analytics, scan, catalog, etc.
│   ├── author/[name]/          # Author detail page (catalog, gallery strip)
│   ├── artist/[name]/          # Artist detail page (artworks)
│   ├── artwork/[slug]/         # Artwork detail page
│   ├── book/[id]/              # Reader pages (guide, summary, QA, pipeline, editions)
│   ├── browse/authors/         # Author listing (+ [letter] sub-route)
│   ├── collections/            # Collection browse & detail (supports mixed art+book)
│   ├── gallery/                # Image gallery
│   ├── librarian/              # Agentic Librarian (room/, thread/, voice/)
│   ├── podcast/                # Podcast player + RSS feed
│   ├── shwep/[number]/         # SHWEP podcast integration
│   ├── search/                 # Search UI (+ visual search)
│   ├── embed/                  # Institutional embeds (bhutan/, bph/ with catalog sub-routes)
│   ├── work/[id]/              # Work-level linking (WEMI)
│   ├── hieroglyphs/            # Egyptian hieroglyphs page
│   ├── tablets/                # Cuneiform tablets page
│   ├── rithmomachia/           # Mathematical board game (guide, scenarios)
│   ├── admin/                  # Admin dashboard pages
│   ├── blog/                   # 39 blog posts (hardcoded JSX, no CMS)
│   ├── press-release/          # Press release page
│   ├── research/               # Research tools (atlas, diffusion, timeline)
│   ├── explore/                # Map & timeline visualizations
│   ├── ficino-society/         # Membership, discussions
│   ├── catalog/, census/, encyclopedia/  # Reference browse pages
│   ├── languages/, timeline/, topics/  # Content browse pages
│   ├── about/, support/, terms/, privacy/  # Static info pages
│   └── ...                     # 68 top-level app directories
│
├── components/                 # 174 React components (.tsx files)
│   ├── book/                   # Book detail, reader, processing
│   ├── layout/                 # GlobalHeader, GlobalFooter, FeaturedCollections
│   ├── gallery/                # Gallery views (+ IconclassFilter)
│   ├── reader/                 # Page reader, zoom, sidebar
│   ├── search/                 # Search results, filters
│   ├── explore/                # Map, timeline
│   ├── ui/                     # Primitives (Button, Dialog, Tabs, etc.)
│   ├── camera/                 # Mobile scanning (6 components, likely unused)
│   ├── rithmomachia/           # Game components (14, live feature)
│   └── ...
│
├── lib/                        # ~95 top-level modules + 10 subdirectories
│   ├── mongodb.ts              # DB connection (singleton, pool management)
│   ├── supabase.ts             # Supabase client (analytics, browse, search)
│   ├── ai.ts                   # Core Gemini operations
│   ├── gemini-client.ts        # API key rotation (3 keys, 3 GCP projects)
│   ├── gemini-batch.ts         # Batch API orchestration
│   ├── semantic-search.ts      # 7-lane search, embedding-2-preview (768d/3072d)
│   ├── semantic-alignment.ts   # Embedding-based quality measurement
│   ├── storage.ts              # R2 + Vercel Blob abstraction
│   ├── sqs-client.ts           # SQS queue client
│   ├── auth.ts                 # NextAuth config (Google + Email/Resend magic links)
│   ├── auth-helpers.ts         # withAuth(), withAdminAuth()
│   ├── slugify.ts              # URL slug generation, bookUrl()
│   ├── book-lookup.ts          # Book query helpers
│   ├── book-index.ts           # Reads from book_indexes collection
│   ├── import-utils.ts         # IIIF manifest parsing
│   ├── page-revisions.ts       # createRevision() — MUST call before page writes
│   ├── adaptive-limits.ts      # system_config.adaptive_limits read/write
│   ├── iconclass-categories.ts  # Iconclass visual classification
│   ├── page-split/             # Split detection (dedup, ghost pages, ML detection)
│   ├── rithmomachia/           # Game engine (35+ files)
│   ├── taxonomy/               # Faceted vocabulary (6 facets), tagging
│   ├── embassy/                # Librarian chat tools
│   ├── api-client/             # Frontend API wrappers
│   ├── types/                  # TypeScript types (ai-models.ts, book.ts, etc.)
│   └── ...
│
├── workers/                    # Lambda function source
│   ├── ocr-processor.ts + ocr-processor-logic.ts
│   ├── translation-processor.ts + translation-processor-logic.ts
│   ├── image-extraction-processor.ts + image-extraction-processor-logic.ts
│   └── write-processor.ts + write-processor-logic.ts
│
└── hooks/                      # 8 React hooks

scripts/                        # Operational scripts
├── analysis/                   # ~50 inspection/reporting scripts
├── batch/                      # Bulk processing scripts
├── enrichment/                 # Metadata enrichment scripts
├── maintenance/                # Data fix scripts
├── import/                     # Bulk import scripts + JSON manifests
├── migration/                  # Data migration scripts (~49 files)
├── eval/                       # Quality evaluation scripts
├── experiments/                # One-off experiments
├── aws-lambda/                 # Lambda build/deploy
├── workers/                    # Hetzner workers (40 files):
│   ├── scheduler.mjs           # Unified cron scheduler
│   ├── pipeline-orchestrator.mjs  # Main pipeline (every 2 min)
│   ├── enrich-worker.mjs       # Enrichment phases (every 5 min)
│   ├── translate-worker.mjs    # Realtime translation (15 concurrent)
│   ├── batch-collector.mjs     # Gemini Batch API results (every 10 min)
│   ├── embed-gemini.mjs        # Embedding generation
│   ├── clip-server.mjs         # CLIP visual search server
│   ├── archive-*.mjs           # Source-specific archivers (7 variants)
│   └── sync-*.mjs              # Supabase sync workers
└── lib/                        # Shared script utilities
```

## Pages Breakdown (68 top-level dirs)

| Category | Count | Content Source | Examples |
|----------|-------|---------------|----------|
| Core library (dynamic) | ~50 | MongoDB + APIs | Book reader, search, collections, author, artist, artwork, work, browse |
| Agentic features | ~5 | Gemini + MongoDB | Librarian room, thread, voice; search v2 |
| Content pages | ~8 | MongoDB | Podcast, SHWEP, hieroglyphs, tablets, encyclopedia, languages |
| Institutional embeds | ~6 | MongoDB | embed/bhutan, embed/bph (each with book, catalog, collections) |
| Admin/ops dashboards | ~15 | MongoDB + APIs | Pipeline control, jobs, analytics, email, KDP |
| Research/experiments | ~20 | MongoDB + APIs | OCR quality, concept diffusion, image atlas |
| Blog posts | 39 | Hardcoded JSX (no CMS) | origin-story, progress-studies, hidden-engineers |
| Press | 1 | Hardcoded JSX | press-release |
| Auth/legal/info | ~10 | Static JSX | signin, terms, privacy, about, support |
| Gallery | ~6 | MongoDB + APIs | Browse, collections, image viewer, curation |
| Community | ~5 | MongoDB + APIs | Ficino Society, discussions, contribute |
| Games | ~4 | Client-side | Rithmomachia (guide, scenarios) |

### Pages to audit

- `/testloader` — debug page, should not be public
- `/scan/auto`, `/scan/opencv` — experimental scanning tools
- `/fulldata` — bulk data export, should be admin-only

## Key Architectural Patterns

1. **No repository/service layer** — API routes query MongoDB directly via `getDb().collection()`. No ORM.
2. **API client for frontend** — `src/lib/api-client/` provides typed wrappers around API routes.
3. **SQS-driven async processing** — All AI work goes through SQS → Lambda → Gemini → write-back.
4. **Page revisions before writes** — Any script modifying `ocr.data` or `translation.data` MUST call `createRevision()` first.
5. **Admin via whitelist** — `admin_users` collection, no RBAC. `withAdminAuth()` wrapper.
6. **Key rotation for Gemini** — 3 API keys (3 GCP projects) with cooldown. `gemini-client.ts` handles rotation.
7. **Hetzner for heavy crons** — Pipeline orchestration moved off Vercel to reduce costs/timeouts. Unified scheduler manages all workers.
8. **Supabase for read-heavy paths** — Browse, analytics, search, and libraries queries hit Supabase for speed. MongoDB remains source of truth; Supabase mirrors derived data via sync crons.
9. **Model routing by source** — BPH books get `gemini-3-flash-preview` (premium), all others get `gemini-3.1-flash-lite-preview` (50% cheaper). See `src/lib/types/ai-models.ts`.
10. **Two quality systems on image extraction** — One Gemini call emits both `gallery_quality` (per illustration, curatorial: "worth showing?") and `scan_quality` (per page, technical: "how cleanly digitized?"). They look similar but answer different questions. Design + extension plan in `.claude/docs/automated-image-quality-system.md`. Public-facing version: `/blog/what-makes-a-good-scan`. The live prompt + rubric live in `scripts/workers/image-extract-worker.mjs:117` and `scripts/workers/pipeline-orchestrator.mjs:1836`; `prompts/image-extraction/image-extraction-v0.md` is an out-of-date archive.

## Known Dead Code & Duplicates

### Confirmed Dead Components (last audit 2026-05-25, zero imports)

Issue #258 closed. These remain with no imports anywhere:

| Component | Path | Notes |
|-----------|------|-------|
| `BookEditModal.tsx` | `components/book/` | Orphaned |
| `JobStatusBanner.tsx` | `components/book/` | Orphaned |
| `PagesGrid.tsx` | `components/book/` | Orphaned |
| `ProcessingPanel.tsx` | `components/book/` | Orphaned |
| `EntityMap.tsx` | `components/explore/` | Orphaned |
| `MapSidebar.tsx` | `components/explore/` | Orphaned |
| `PipelineStageCard.tsx` | `components/pipeline/` | Orphaned |
| `PageTracker.tsx` | `components/reader/` | Orphaned |
| `SessionCard.tsx` | `components/research/` | Orphaned |
| Camera components (6) | `components/camera/` | Mobile scanning — unused, ask before deleting |

**Deleted in #1986** (no longer in this table): `BookPagesActions.tsx`, `BookPagesStats.tsx`, `ReorderModePanel.tsx`, `SparkLine.tsx`, `HideWhenEmbedded.tsx`, `InputWidget.tsx`, plus `_archived/` batch panels and api-client files. See `.claude/handoffs/2026-05-25-pr1980-split.md` for the audit + verification process.

**Before deleting any row above:** grep-verify zero imports across the repo.
Static analysis (graph tools) can miss dynamic requires, framework conventions, and recent additions — see `.claude/docs/code-review-graph.md` "Staleness — the main failure mode."

Note: Rithmomachia is a **live feature** (`/rithmomachia`, guide, scenarios, blog post) — NOT dead code. `Footer.tsx` was previously listed but no longer exists (already deleted).

### Duplicate Functions

| Function | Location A | Location B | Action |
|----------|-----------|-----------|--------|
| `withTimeout()` | `lib/collections-utils.ts` | `api/books/search/route.ts` (local, different signature) | Different impls, both used |
| `sortCollections()` | `lib/collections-utils.ts` | `api/books/search/route.ts` (local) | Consolidate to lib |

### Disabled Cron Routes

7 cron API routes still exist in code but are removed from `vercel.json` (moved to Hetzner): `submit-batch-ocr`, `process-batches`, `sync-page-counts`, `sync-gallery-images`, `enrich-books`, `post-import-pipeline`, `archive-ocr`

### Root tmp Scripts (128)

All `_tmp-*` files at project root. Per convention, these should not be committed.
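
One way to enforce the `_tmp-*` convention is a small check that flags offending names in a root directory listing, for example from a pre-commit hook or a CI step. The sketch below is hypothetical: `findTmpScripts` is not an existing helper in this repo, the file names in `listing` are made up for illustration, and the function takes names as input so it can be wired to whatever listing mechanism is in use.

```typescript
// Hypothetical helper: returns root-level entries whose name starts with "_tmp-".
// Entries containing a path separator are treated as nested and ignored.
function findTmpScripts(rootEntries: string[]): string[] {
  return rootEntries
    .filter((name) => !name.includes("/") && name.startsWith("_tmp-"))
    .sort();
}

// Made-up listing standing in for the project root contents.
const listing = [
  "package.json",
  "_tmp-fix-slugs.mjs",
  "src",
  "scripts/_tmp-nested.mjs",
  "_tmp-audit.ts",
];

const offenders = findTmpScripts(listing);
if (offenders.length > 0) {
  console.log(`Found ${offenders.length} tmp script(s) at project root:`, offenders);
}
```

In a Node script the listing could come from `fs.readdirSync(".")`, and a non-empty result could fail the check with a non-zero exit code.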