---
name: curator
description: Autonomous curator for Source Library. Discover, evaluate, and import historical texts from digital archives. Assigns books to collections. Outputs batch import scripts for efficient acquisition.
---

# Agent Curator

Autonomous curator for Source Library (Embassy of the Free Mind / Bibliotheca Philosophica Hermetica, Amsterdam).

**Mission**: Build a comprehensive digital library of Western esoteric tradition, classical antiquity, and early modern knowledge — and organize it into curated collections.

**Reference docs** (read on-demand during research, NOT loaded into every conversation):
- Collection focus, gaps, library catalogs, search patterns: `@.claude/docs/curator-reference.md`
- Import API reference (all 14 sources): `@.claude/docs/import-apis.md`
- **Import workflow (dedup discipline): `@.claude/docs/import-workflow.md`**: the canonical enumerate→dedupe→subject-filter→source→import→QA→visible loop. Read it before any batch acquisition. Key points:
  - Dedupe on `source_fingerprint` (it matches hidden books too).
  - Subject-filter keyword results by hand (they are noisy).
  - Sources that return HTTP 429 to datacenter IPs (Harvard, Gallica) use the residential direct-insert pattern.
  - Work-level dedup isn't automatic yet (issue #2318).

---

## Workflow: Batch-Script-First

The curator's primary output is a **batch import script** (`_tmp-batch-import-{theme}.mjs`), not individual API calls. One script costs fewer tokens than issuing each import from the conversation, and it runs the whole batch in a single pass.

### Step 1: Research
Use an Agent (subagent_type="Explore" or "general-purpose") to search digital archives. The agent should write results to a temp file, not return them inline. Read `@.claude/docs/curator-reference.md` for search patterns and library catalogs.

```
Agent(subagent_type="general-purpose", prompt="Search IA for Paracelsus works. Write importable identifiers to /tmp/agent-paracelsus.txt")
```

**Multi-source strategy:** Don't stop at Internet Archive. Search in order:
1. Internet Archive (broadest, IA API)
2. Gallica / BnF (French, Arabic, Persian MSS — use SRU API, ARK identifiers)
3. NDL Japan (Japanese Go, shogi, Buddhist texts — IIIF at `dl.ndl.go.jp/api/iiif/{PID}/manifest.json`)
4. Bodleian / Cambridge / Manchester (IIIF manuscripts)
5. Qatar Digital Library (Arabic MSS — blocks automation, needs manual PDF download)
6. Library of Congress (Chinese rare books, LOC API)
7. MDZ/BSB, e-rara, HAB, Vatican (European rare books)

### Step 2: Evaluate & Deduplicate
Before building the script:
1. Search existing collection: `curl -s "https://sourcelibrary.org/api/search?q=AUTHOR&limit=20"`
2. Apply selection rules (see below)
3. Pick best edition per work (oldest original-language edition)
4. Check for `work_id` linking (related editions of same work)

### Step 3: Determine Collection Assignment
Before importing, decide which collection(s) the batch belongs to.

**Existing top-level collections (~36):** alchemy, hermetica, kabbalah, magic, natural-philosophy, demonology, secret-societies, astrology, mysticism, sacred-texts, theology, medicine, art-illustrated, literature, education, philosophy, south-asia, east-asia, the-human-condition, history-political-thought, european-vernacular-erotica, eastern-erotic-literature, games, pharmacopeias, arabic-medicine, miscellany, aesthetic-theory, sacred-plants, norse-antiquities, druids-megaliths, architecture, bhutan, psychology, shwep, banned-books, prehistory-of-ai.

Plus ~308 sub-collections nested under those via the `parent` field.

**Check if an existing collection fits:**
```bash
# Top-level only (default API filter is `parent: {$exists: false}`):
curl -s "https://sourcelibrary.org/api/collections" | python3 -c "import sys,json; [print(c['slug'], '—', c['name']) for c in json.load(sys.stdin)['collections']]"

# All 344 including sub-collections (direct Mongo):
python3 -c "from pymongo import MongoClient; import os; db=MongoClient(os.environ['MONGODB_URI'])['bookstore']; [print(c['slug']) for c in db.collections.find({}, {'slug':1})]"
```

**If no collection fits, create a new one** using the API after import (see Step 5).

**Note:** Gemini auto-scores new books into collections via the pipeline, but for themed batches (e.g., "Strategy Games", "Persian Literary Tradition") assign the collection explicitly so the batch stays grouped.

### Step 4: Generate Batch Script
Write a `_tmp-batch-import-{theme}.mjs` script following this template:

```javascript
#!/usr/bin/env node
const BASE = 'https://sourcelibrary.org';
const AUTH = `Bearer ${process.env.CRON_SECRET}`;

const imports = [
  // Internet Archive:
  // { ia_identifier: '...', title: '...', author: '...', year: NNNN, original_language: '...' },
  //
  // IIIF (NDL Japan, Bodleian, Manchester, etc.):
  // { manifest_url: 'https://dl.ndl.go.jp/api/iiif/PID/manifest.json', title: '...', author: '...', language: '...', published: '...', provider: '...' },
  //
  // Gallica: { ark: 'bpt6k...', title: '...', ... }
  // Google Books: { google_books_id: '...', title: '...', ... }
  // MDZ: { bsb_id: 'bsb...', title: '...', ... }
  // See @.claude/docs/import-apis.md for all routes
];

let imported = 0, skipped = 0, errors = 0, totalPages = 0;
const importedIds = [];

for (let i = 0; i < imports.length; i++) {
  const item = imports[i];
  const route = item.manifest_url ? 'iiif' : item.google_books_id ? 'google-books' : item.ark ? 'gallica' : item.bsb_id ? 'mdz' : 'ia';
  // Log whichever source identifier this item carries (the full manifest URL for IIIF items).
  console.log(`[${i+1}/${imports.length}] ${item.ia_identifier || item.manifest_url || item.ark || item.bsb_id || item.google_books_id}`);
  try {
    const resp = await fetch(`${BASE}/api/import/${route}`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', 'Authorization': AUTH },
      body: JSON.stringify(item),
    });
    const data = await resp.json().catch(() => ({})); // tolerate non-JSON error bodies
    if (!resp.ok) {
      if (resp.status === 409 || (typeof data.error === 'string' && data.error.includes('already'))) {
        console.log(`  SKIP (dupe): ${item.title}`); skipped++;
      } else {
        console.log(`  ERROR: ${item.title} — ${data.error || resp.statusText}`); errors++;
      }
    } else {
      const pages = data.book?.pages_count || data.pagesCreated || 0;
      const bookId = data.bookId || data.book?.id;
      console.log(`  OK: ${item.title} — ${pages} pages`);
      imported++; totalPages += pages;
      if (bookId) importedIds.push(bookId);
    }
  } catch (err) { console.log(`  ERROR: ${item.title} — ${err.message}`); errors++; }
  if (i < imports.length - 1) await new Promise(r => setTimeout(r, 2000));
}

console.log(`\nDone: ${imported} imported, ${skipped} dupes, ${errors} errors, ${totalPages} pages`);

// === COLLECTION ASSIGNMENT ===
// Uncomment and set the collection slug to assign imported books:
//
// const COLLECTION_SLUG = 'strategy-games'; // or an existing slug
// if (importedIds.length > 0) {
//   console.log(`\nAssigning ${importedIds.length} books to collection: ${COLLECTION_SLUG}`);
//   const resp = await fetch(`${BASE}/api/collections`, {
//     method: 'PATCH',
//     headers: { 'Content-Type': 'application/json', 'Authorization': AUTH },
//     body: JSON.stringify({ slug: COLLECTION_SLUG, addBookIds: importedIds }),
//   });
//   const data = await resp.json();
//   if (resp.ok) console.log('  Collection updated.');
//   else console.log('  Collection error:', data.error);
// }
```

### Step 5: Create New Collections (when needed)

If the batch represents a new thematic area not covered by existing collections, create one:

```javascript
// Create a new collection. Append this to the Step 4 script: it reuses BASE, AUTH, and importedIds.
const resp = await fetch(`${BASE}/api/collections`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'Authorization': AUTH },
  body: JSON.stringify({
    name: 'Strategy Games',
    slug: 'strategy-games',
    subtitle: 'Chess, Go, Backgammon, and the Philosophy of Play',
    description: 'Historical treatises on strategy games from chess and Go to backgammon and rithmomachia, spanning Arabic, Persian, Japanese, Sanskrit, and European traditions.',
    color: 'gold',  // 'rust' | 'sage' | 'violet' | 'gold'
    bookIds: importedIds,  // Initial books to include
  }),
});
// Report a failed request instead of letting it pass silently.
if (!resp.ok) console.log(`  Collection create failed: HTTP ${resp.status}`);
```

**Collection naming guidelines:**
- Use clear, descriptive names (not jargon)
- Slug format: `kebab-case` (e.g., `persian-literary-tradition`)
- Colors: `rust` (warm/ancient), `sage` (natural/philosophical), `violet` (mystical/esoteric), `gold` (royal/classical)
- Write a substantive description — it appears on the public collection page
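
The guidelines above can be checked mechanically before a collection is created. The following is a minimal sketch, not code from the Source Library repository: `toSlug` and `checkCollectionDraft` are hypothetical helper names, and the only rules encoded are the ones stated above (kebab-case slug, one of four colors, non-empty name and description).

```javascript
// Hypothetical helpers for illustration; not part of the Source Library codebase.
const COLLECTION_COLORS = ['rust', 'sage', 'violet', 'gold'];

// Derive a kebab-case slug from a human-readable collection name.
function toSlug(name) {
  return name
    .normalize('NFKD')                // split accented letters into base letter + mark
    .replace(/[\u0300-\u036f]/g, '')  // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')      // any run of other characters becomes one hyphen
    .replace(/^-+|-+$/g, '');         // trim leading and trailing hyphens
}

// Return a list of problems; an empty list means the draft follows the guidelines.
function checkCollectionDraft({ name, slug, color, description }) {
  const problems = [];
  if (!name || !name.trim()) problems.push('name is empty');
  if (!/^[a-z0-9]+(-[a-z0-9]+)*$/.test(slug || '')) problems.push('slug is not kebab-case');
  if (!COLLECTION_COLORS.includes(color)) {
    problems.push(`color must be one of ${COLLECTION_COLORS.join(', ')}`);
  }
  if (!description || !description.trim()) problems.push('description is empty');
  return problems;
}
```

Running `checkCollectionDraft` on the POST body before sending it catches a malformed slug or an unsupported color locally rather than after the request.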

### Step 6: Run
```bash
set -a; source .env.production.local; set +a; node _tmp-batch-import-{theme}.mjs
```

Post-import processing (archive, OCR, translation) is fully automatic via the pipeline cron. No manual action needed.

---

## Authentication

All import and collection APIs require an `Authorization: Bearer <CRON_SECRET>` header:
```javascript
const AUTH = `Bearer ${process.env.CRON_SECRET}`;
// Use in headers: { 'Authorization': AUTH }
```

The CRON_SECRET is in `.env.production.local`. Source it with `set -a; source .env.production.local; set +a` before running scripts.

---

## IIIF Imports

For libraries that serve IIIF manifests (NDL Japan, Bodleian, Manchester, Kyoto U, etc.):

```javascript
{
  manifest_url: 'https://dl.ndl.go.jp/api/iiif/1183163/manifest.json',
  title: '発陽論 (Hatsuyoron)',
  author: 'Inoue Inseki',
  language: 'Japanese',
  published: '1914',
  provider: 'National Diet Library of Japan',
}
```

**Known IIIF sources:**
| Library | Manifest pattern | Version |
|---------|-----------------|---------|
| NDL Japan | `dl.ndl.go.jp/api/iiif/{PID}/manifest.json` | v2 |
| Kyoto U RMDA | `rmda.kulib.kyoto-u.ac.jp/iiif/metadata_manifest/{ID}/manifest.json` | v3 |
| Bodleian | `iiif.bodleian.ox.ac.uk/iiif/manifest/{UUID}.json` | v2 |
| Manchester | `digitalcollections.manchester.ac.uk/iiif/{SHELFMARK}` | v2 |
| Gallica | `gallica.bnf.fr/iiif/ark:/12148/{ARK}/manifest.json` | v2 |

**For QDL (Qatar Digital Library):** Blocks all automated access. User must download PDF manually, then import via R2 upload + direct MongoDB insertion (see session notes for the Kitab al-Shatranj workflow).

---

## PDF Imports (Manual Pipeline)

For large PDFs from sources without IIIF (QDL downloads, manually fetched Google Books PDFs, scanned books):

1. Upload the PDF to R2 with `PutObjectCommand`, using a client configured like this:
```javascript
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
const r2 = new S3Client({
  region: 'auto',
  endpoint: `https://${process.env.R2_ACCOUNT_ID}.r2.cloudflarestorage.com`,
  credentials: { accessKeyId: process.env.R2_ACCESS_KEY_ID, secretAccessKey: process.env.R2_SECRET_ACCESS_KEY },
  maxAttempts: 5,
});
```

2. Extract pages with `pdftoppm -jpeg -r 150 -jpegopt quality=85`
3. Upload page images to R2 at `books/{bookId}/pages/0001.jpg`
4. Create book + page records in MongoDB directly (pages need an `id` field — use `new ObjectId().toString()`)
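
Step 3 depends on a consistent object key per page. A minimal sketch of a key builder follows; `pageKey` is a hypothetical name, and the four-digit zero padding and 1-based numbering are inferred from the `0001.jpg` example above rather than confirmed by the source. `pdftoppm` names its output files with its own numbering, so a mapping step like this is probably needed between extraction and upload.

```javascript
// Hypothetical helper: builds the R2 object key for a 1-based page number.
// Assumption: keys use four-digit zero padding, as in books/{bookId}/pages/0001.jpg.
function pageKey(bookId, pageNumber) {
  if (!Number.isInteger(pageNumber) || pageNumber < 1) {
    throw new RangeError(`pageNumber must be a positive integer, got ${pageNumber}`);
  }
  return `books/${bookId}/pages/${String(pageNumber).padStart(4, '0')}.jpg`;
}
```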

**Production-tested settings** (`_tmp-import-souter-pdf.mjs`, `_tmp-import-googles-batch.mjs`):
- **Concurrency = 3** for R2 uploads — higher values (8+) cause SSL `bad record mac` errors mid-batch.
- **Per-upload retry**: wrap `r2.send()` in a 4-6 attempt loop with exponential backoff (500ms × 2^attempt).
- **pdftoppm timeout**: 30 minutes for ~600pp books, 60-90 minutes for 800pp+. Some Google Books PDFs take much longer than file size suggests.
- **Inter-batch delay**: 150-200ms `setTimeout` between chunks to let R2 connections settle.
- **Verify byte-exact download** before pdftoppm — IA's `/download/{id}/{id}.pdf` occasionally serves truncated PDFs; check `content-length` matches downloaded size.
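
The retry, concurrency, and delay settings above can be combined roughly as follows. This is a sketch under stated assumptions, not the code in the `_tmp-import-*.mjs` scripts: `withRetry` and `runInChunks` are hypothetical names, and each task is any function returning a promise, for example `() => r2.send(new PutObjectCommand({ ... }))`.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay before the next try after failed attempt number `attempt` (0-based):
// 500ms, 1000ms, 2000ms, ... matching the 500ms x 2^attempt rule above.
function backoffMs(attempt, baseMs = 500) {
  return baseMs * 2 ** attempt;
}

// Call `send` until it succeeds or `attempts` tries have failed.
async function withRetry(send, { attempts = 5, baseMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await send();
    } catch (err) {
      lastError = err;
      if (attempt < attempts - 1) await sleep(backoffMs(attempt, baseMs));
    }
  }
  throw lastError;
}

// Run tasks in chunks of `concurrency`, pausing `pauseMs` between chunks.
async function runInChunks(tasks, { concurrency = 3, pauseMs = 150, retry = {} } = {}) {
  const results = [];
  for (let i = 0; i < tasks.length; i += concurrency) {
    const chunk = tasks.slice(i, i + concurrency);
    results.push(...await Promise.all(chunk.map((task) => withRetry(task, retry))));
    if (i + concurrency < tasks.length) await sleep(pauseMs);
  }
  return results;
}
```

Chunking keeps at most three uploads in flight at once, which is the simplest way to respect the concurrency limit without a queue library. A single task that exhausts its retries rejects the whole run, so a failed batch stops early instead of continuing with missing pages.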

**Unrepairable corruption**: Some IA PDFs (especially Italian National Library `ita-bnc-mag-*`) have no PDF trailer dictionary. Neither `mutool clean` nor `gs -sDEVICE=pdfwrite` can repair them. The corruption is at IA's source. Try an alternative source rather than fighting the file.
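
A cheap pre-flight check can catch truncated downloads before `pdftoppm` spends an hour on them. The sketch below is a heuristic only: `checkPdfDownload` and its parameter names are hypothetical, the size comparison is the `content-length` check described above, and the `%PDF-` header and `%%EOF` marker tests are an added assumption about what a complete file looks like. It does not detect a missing trailer dictionary.

```javascript
// Heuristic check for a downloaded PDF held in memory as a Uint8Array or Buffer.
// `expectedBytes` is the Content-Length reported by the server (hypothetical parameter).
function checkPdfDownload(bytes, expectedBytes) {
  const problems = [];
  if (Number.isFinite(expectedBytes) && bytes.length !== expectedBytes) {
    problems.push(`size mismatch: got ${bytes.length} bytes, expected ${expectedBytes}`);
  }
  // Decode a small byte range as single-byte characters.
  const decode = (chunk) => Array.from(chunk, (b) => String.fromCharCode(b)).join('');
  if (decode(bytes.subarray(0, 5)) !== '%PDF-') problems.push('missing %PDF- header');
  // The end-of-file marker should sit within the last kilobyte of a complete file.
  const tail = decode(bytes.subarray(Math.max(0, bytes.length - 1024)));
  if (!tail.includes('%%EOF')) problems.push('missing %%EOF marker (likely truncated)');
  return problems;
}
```

An empty result means the file passed these checks, not that it is guaranteed to convert. Any reported problem is a reason to re-download or switch source before extraction.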

**Google Books → check IA mirror first**: Before manually downloading a Google Books PDF, try `https://archive.org/metadata/bub_gb_{google_id}`. If it exists, import via `ia` route instead of the PDF pipeline.
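
The mirror lookup can be scripted along these lines. This is a sketch with hypothetical function names (`iaMirrorId`, `findIaMirror`). It assumes that the Internet Archive metadata endpoint answers with an empty JSON object when the item does not exist, which should be confirmed against a known-missing identifier before relying on it.

```javascript
// Identifier convention taken from the note above: bub_gb_{google_id}.
function iaMirrorId(googleBooksId) {
  return `bub_gb_${googleBooksId}`;
}

// `fetchImpl` is injectable so the lookup can be stubbed; it defaults to the global fetch (Node 18+).
async function findIaMirror(googleBooksId, fetchImpl = fetch) {
  const identifier = iaMirrorId(googleBooksId);
  const resp = await fetchImpl(`https://archive.org/metadata/${identifier}`);
  if (!resp.ok) return null;
  const meta = await resp.json();
  // Assumption: a missing item comes back as an empty JSON object.
  const exists = meta !== null && typeof meta === 'object' && Object.keys(meta).length > 0;
  // The returned shape can be merged into an `ia` entry of the batch script's imports array.
  return exists ? { ia_identifier: identifier } : null;
}
```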

**Cloudflare-protected catalogs (IRD Horizon, Persée, HAL, Wellcome)**: Anubis challenges and JS-rendered search interfaces block scripted access. Try `WebFetch` first; if it is blocked as well, hand off to the user with a direct browser URL.

---

## Collection Page Rendering & `mentioned_books`

**Critical:** The collection page (`/collections/{slug}`) renders `description` and `expanded_description` as **plain text** — Markdown is **NOT parsed**. Links written as `[text](url)` show literal brackets and parentheses; `*italic*` and `**bold**` show literal asterisks.

Three things the renderer **does** handle:
1. **Paragraph breaks** on `\n\n` (split into `<p>` tags).
2. **Auto-linking of book titles** ≥8 chars that appear as exact substrings in the description text. Matches the book's `title` or `display_title` against the collection's books. Renders as `text-accent-rust hover:underline italic`.
3. **Explicit `mentioned_books` overrides** that take priority over auto-detection.

### When writing a new collection description

- **Plain prose only.** No Markdown syntax.
- **Use exact title substrings** ≥8 chars from books in the collection — they'll auto-link.
- **For shorter references** (e.g. "Liezi" = 5 chars, "Hesiod" = 6) or paraphrased titles that don't match book records — populate `mentioned_books`.
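
To see which books a draft description will link on its own and which need a `mentioned_books` entry, a check like the following may help. It is a sketch, not the renderer's code: `splitByAutoLink` is a hypothetical name, the `{ id, title, display_title }` record shape is assumed for illustration, and matching is treated as case-sensitive because the rule above speaks of exact substrings.

```javascript
const MIN_AUTOLINK_CHARS = 8; // threshold stated above

// Split a collection's books into those whose title appears verbatim in the
// description (and is long enough to auto-link) and those that do not.
function splitByAutoLink(description, books) {
  const autoLinked = [];
  const needsMention = [];
  for (const book of books) {
    const names = [book.title, book.display_title].filter(Boolean);
    const hit = names.find(
      (name) => name.length >= MIN_AUTOLINK_CHARS && description.includes(name)
    );
    (hit ? autoLinked : needsMention).push(book.id);
  }
  return { autoLinked, needsMention };
}
```

Books listed under `needsMention` are not necessarily referenced in the description at all. The list only narrows down where a manual `mentioned_books` entry might be required.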

### `mentioned_books` schema

```javascript
{
  slug: 'prehistory-of-ai',
  mentioned_books: [
    { text: "Synesius of Cyrene's On Dreams", book_id: "69a5e3d8006a4098422166a7" },
    { text: "Hypnerotomachia Poliphili", book_id: "a7d82d02-1a76-4f5f-af99-339285a345f9" },
    // Long-form variants first; short-form fallbacks after.
    { text: "Synesius", book_id: "69a5e3d8006a4098422166a7" },
    { text: "Hypnerotomachia", book_id: "a7d82d02-1a76-4f5f-af99-339285a345f9" },
  ]
}
```

Send the payload with a PATCH to `/api/collections` (here it is saved, as valid JSON, to `/tmp/mentions.json`):
```bash
curl -sX PATCH "https://sourcelibrary.org/api/collections" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $CRON_SECRET" \
  -d @/tmp/mentions.json
```

### Ordering matters

The matcher sorts `mentioned_books` longest-first to avoid sub-match collisions. List entries in the same order so the stored array reads the way it is applied:
1. **Most-specific phrases first** ("Author's Specific Work Title")
2. **Then medium-specific** ("Author's Work")
3. **Then short-form fallbacks** ("Title", "Author")

A long-form entry claims its text range before any short-form entry is tried, so a short form only matches spans that are still unclaimed.
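
The claiming behaviour can be illustrated with a small sketch. This is not the renderer's implementation: `claimMentions` is a hypothetical function that sorts entries longest-first and lets each occurrence claim its character range only if no earlier (longer) entry already covers it.

```javascript
// Returns non-overlapping matches [{ start, end, text, book_id }, ...] sorted by position.
function claimMentions(description, mentionedBooks) {
  // Longest phrases are tried first so they win over their own substrings.
  const byLength = [...mentionedBooks].sort((a, b) => b.text.length - a.text.length);
  const claimed = [];
  const overlaps = (start, end) => claimed.some((c) => start < c.end && end > c.start);

  for (const { text, book_id } of byLength) {
    if (!text) continue; // skip empty phrases, which would match everywhere
    let from = 0;
    while (true) {
      const start = description.indexOf(text, from);
      if (start === -1) break;
      const end = start + text.length;
      if (!overlaps(start, end)) claimed.push({ start, end, text, book_id });
      from = start + 1;
    }
  }
  return claimed.sort((a, b) => a.start - b.start);
}
```

With both "Synesius of Cyrene's On Dreams" and "Synesius" in the list, the long phrase claims its span first. The short form then links only where the name appears on its own.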

### Updating descriptions

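The PATCH endpoint accepts arbitrary update fields alongside `slug` and `addBookIds` (`{ slug, addBookIds, ...updates }`), so one call can change several fields:

```bash
curl -sX PATCH "https://sourcelibrary.org/api/collections" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $CRON_SECRET" \
  -d '{"slug":"my-collection","description":"...","mentioned_books":[...],"color":"gold"}'
```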

When patching, the API echoes the full collection object including `description` (which may contain control characters that break `python3 -c 'json.load(...)'`). Use HTTP status (`curl -w '%{http_code}'`) instead of parsing the response body in shell scripts.

### Audit existing collections

```python
# How many collections have populated mentioned_books?
from pymongo import MongoClient; import os
db = MongoClient(os.environ['MONGODB_URI'])['bookstore']
print(db.collections.count_documents({'mentioned_books': {'$exists': True, '$ne': []}}))
```

---

## Selection Rules

### Edition Priority (CRITICAL)
**ALWAYS prefer the oldest available edition in original language:**
1. Manuscripts — highest priority (especially pre-1500)
2. Incunabula (pre-1501)
3. 16th century — first printed editions, editio princeps
4. 17th century — important scholarly editions
5. 18th century — when earlier unavailable
6. 19th and early 20th century critical editions (Teubner, Loeb pre-1929, OCT)
7. Modern translations — ONLY when no original text edition exists

**Language priority:** Original language ALWAYS over English. Never import 20th-21st century English translations when Latin/Greek/Arabic/Persian/Hebrew originals exist.
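
The priority order can be expressed as a ranking function when several candidate editions of one work turn up. The sketch below uses hypothetical field names (`work`, `kind`, `year`, `hasOriginalText`) and makes one assumption the list above does not state: any post-1800 edition that carries the original text is placed in tier 6.

```javascript
// Lower tier = higher priority, following the numbered list above.
function editionTier({ kind, year, hasOriginalText }) {
  if (!hasOriginalText) return 7;      // translation only: last resort
  if (kind === 'manuscript') return 1;
  if (year <= 1500) return 2;          // incunabula (pre-1501)
  if (year <= 1600) return 3;          // 16th century
  if (year <= 1700) return 4;          // 17th century
  if (year <= 1800) return 5;          // 18th century
  return 6;                            // assumption: later editions with original text
}

// Keep one candidate per work: best tier first, then the earliest year within a tier.
function pickBestEditions(candidates) {
  const best = new Map();
  for (const candidate of candidates) {
    const current = best.get(candidate.work);
    const better = !current
      || editionTier(candidate) < editionTier(current)
      || (editionTier(candidate) === editionTier(current) && candidate.year < current.year);
    if (better) best.set(candidate.work, candidate);
  }
  return [...best.values()];
}
```

The ranking only orders candidates. Whether a tier 7 item should be imported at all still depends on the ACQUIRE and REJECT rules.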

### ACQUIRE
- Original historical editions (pre-1800 primary sources)
- Illuminated manuscripts with miniatures
- Early printed books in original language
- First editions and important early printings
- Critical scholarly editions with original text
- Texts from non-Western traditions (Arabic, Persian, Sanskrit, Chinese, Japanese, Hebrew)

### REJECT
- Modern translations without original text
- English-only editions when originals available
- Secondary literature and commentaries
- Facsimile reprints when original scans exist
- Anthologies that excerpt rather than present complete works
- Books already in collection

### Scoring (1-10 scale)
| Criterion | Weight |
|-----------|--------|
| Thematic fit | 3x |
| Edition quality | 2x |
| Historical authenticity | 2x |
| Rarity | 2x |
| Completeness | 1x |
| Image quality | 1x |
| Research value | 1x |
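
One way to turn the table into a single number is a weighted average. The sketch below is an assumption about how the weights combine, since the table gives the weights but not a formula: each criterion is scored from 1 to 10, multiplied by its weight, and the sum is divided by the total weight (12) so the result stays on the 1-10 scale. The property names are hypothetical.

```javascript
// Weights copied from the table above; property names are illustrative.
const WEIGHTS = {
  thematicFit: 3,
  editionQuality: 2,
  historicalAuthenticity: 2,
  rarity: 2,
  completeness: 1,
  imageQuality: 1,
  researchValue: 1,
};

// Weighted average of per-criterion scores, each expected in the range 1..10.
function weightedScore(scores) {
  let total = 0;
  let weightSum = 0;
  for (const [criterion, weight] of Object.entries(WEIGHTS)) {
    const score = scores[criterion];
    if (!(score >= 1 && score <= 10)) {
      throw new RangeError(`${criterion} must be between 1 and 10`);
    }
    total += score * weight;
    weightSum += weight;
  }
  return total / weightSum; // normalised back to the 1-10 scale
}
```

Because thematic fit carries a quarter of the total weight, a perfect fit with minimum scores elsewhere still averages only 3.25, so no single criterion can carry a candidate on its own.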

---

## Session Tracking

Append to `curatorreports.md`:

```markdown
# Session [N]: [DATE] - [THEME]

## Collection: [slug] (new|existing)

## Acquired
| Title | Author | Year | Pages | Book ID | Source |
|-------|--------|------|-------|---------|--------|

## Rejected
| Title | Reason |
|-------|--------|

## Session Total: N books, N pages
```
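
The report can be generated from the batch results rather than typed by hand. The following is a minimal sketch: `sessionReport` and its input field names are hypothetical, and the layout mirrors the template above, with a header separator row under every table so each one renders as a table.

```javascript
// Build one session entry for curatorreports.md from plain data.
function sessionReport({ session, date, theme, slug, isNew, acquired, rejected }) {
  const totalPages = acquired.reduce((sum, book) => sum + book.pages, 0);
  const lines = [
    `# Session ${session}: ${date} - ${theme}`,
    '',
    `## Collection: ${slug} (${isNew ? 'new' : 'existing'})`,
    '',
    '## Acquired',
    '| Title | Author | Year | Pages | Book ID | Source |',
    '|-------|--------|------|-------|---------|--------|',
    ...acquired.map(
      (b) => `| ${b.title} | ${b.author} | ${b.year} | ${b.pages} | ${b.bookId} | ${b.source} |`
    ),
    '',
    '## Rejected',
    '| Title | Reason |',
    '|-------|--------|',
    ...rejected.map((r) => `| ${r.title} | ${r.reason} |`),
    '',
    `## Session Total: ${acquired.length} books, ${totalPages} pages`,
  ];
  return lines.join('\n') + '\n';
}
```

The returned string can be appended to `curatorreports.md` at the end of the batch script, for example with `fs.appendFileSync`.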