---
name: checkpoint-resume-long-job
description: "Persist progress for long-running jobs (batched LLM calls, large ingestions, multi-hour syncs) so that a context reset, crash, or interrupt doesn't lose work. Use whenever a job iterates over N items and completing item K matters independently. Provides a resumable.mjs library pattern plus the skill's invocation heuristics."
format: 2025-10-02
version: 1.0.0
status: active
updated: 2026-04-17
---

# Checkpoint & Resume for Long Jobs

Any job that takes longer than 5 minutes and iterates over N independent
items should checkpoint its progress. Context can reset, processes can
crash, users can Ctrl-C. A re-run shouldn't redo completed work.

## Triggers

Activate when a job:

- Iterates over **≥ 20 items** AND each item takes **≥ 5 seconds**, OR
- Is expected to run **≥ 10 minutes** total, OR
- Calls **external APIs** with rate limits or cost per call (LLM, HTTP), OR
- Is **not naturally idempotent** at the whole-job level
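
The trigger list above can be written as a single predicate. This is a sketch only: `shouldCheckpoint` and its field names are hypothetical and not part of `resumable.mjs`; the thresholds are the ones listed above.

```javascript
// Hypothetical helper: returns true when any trigger from the list applies.
function shouldCheckpoint({ itemCount, secondsPerItem, expectedMinutes, callsExternalApi, idempotent }) {
  const manySlowItems = itemCount >= 20 && secondsPerItem >= 5;
  const longRunning = expectedMinutes >= 10;
  return manySlowItems || longRunning || callsExternalApi === true || idempotent === false;
}
```

Any single trigger is enough, which is why the conditions are joined with `||`.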

## Shape

The simplest checkpoint is a file listing completed item IDs. On job start:
read the file; on each item completion: append its ID; on job restart: skip
any ID in the file.
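
A minimal sketch of that loop, assuming a plain-text checkpoint file with one ID per line. The names `loadCompleted` and `runResumable`, and the `handler` callback, are hypothetical and are not taken from `resumable.mjs`.

```javascript
import fs from 'node:fs';

// Hypothetical helper: read completed IDs (one per line) into a Set.
function loadCompleted(checkpointFile) {
  if (!fs.existsSync(checkpointFile)) return new Set();
  const lines = fs.readFileSync(checkpointFile, 'utf8').split('\n');
  return new Set(lines.filter(line => line.length > 0));
}

// Hypothetical driver: skip finished items, record each ID once its work is done.
async function runResumable(items, keyFn, checkpointFile, handler) {
  const done = loadCompleted(checkpointFile);
  console.error(`checkpoint file: ${checkpointFile}`); // tell the user where to look
  let processed = 0;
  let skipped = 0;
  for (const item of items) {
    const id = String(keyFn(item));
    if (done.has(id)) {
      skipped += 1;                 // completed in an earlier run (or a duplicate)
      continue;
    }
    await handler(item);            // if this throws, the ID is never recorded
    fs.appendFileSync(checkpointFile, id + '\n');
    done.add(id);
    processed += 1;
  }
  return { processed, skipped };
}
```

The ID is appended only after `handler` resolves, so an interrupted item is retried on the next run rather than silently skipped.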

### Reference library: `tools/checkpoint-resume/resumable.mjs`

```javascript
import { processBatches } from './tools/checkpoint-resume/resumable.mjs';

// `lessons` is an array of objects with an `id` field (1713 in this run), loaded elsewhere.
await processBatches({
  items: lessons,
  keyFn: l => l.id,
  checkpointFile: '.planning/sessions/tiebreaker-checkpoint.jsonl',
  batchSize: 5,
  async handler(batch) {
    // your per-batch work
    return batch.map(l => ({ id: l.id, status: 'done' }));
  },
  onProgress({ completed, total, skipped }) {
    console.error(`${completed + skipped}/${total} (${skipped} resumed)`);
  },
});
```

On the first run, `processBatches` processes all items and appends their IDs
to the checkpoint file. On resume, it reads the file and skips items that are
already recorded there.

## Checkpoint Formats

| Format | When |
|--------|------|
| Append-only JSONL | Most jobs. One line = one completed item. Easy to read, easy to resume. |
| Database column | When items already live in a DB. Add a `processed_at TIMESTAMP` column, set it as each item completes, and select with `WHERE processed_at IS NULL` at start. |
| Snapshot file | When checkpoint state is a complex structure (progress trees, partial outputs). Write a whole-state JSON every N items. |

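Prefer append-only JSONL. Lines already written are never modified, so a crash
can at worst leave one truncated final line, which the reader can discard.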
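
To make the crash-safety of append-only JSONL concrete, here is a sketch of a reader that tolerates a torn final line. It assumes each line is a JSON object with an `id` field, like the objects returned by the handler in the example above; the actual record format used by `resumable.mjs` is not shown here, and `readCheckpoint` is a hypothetical name.

```javascript
import fs from 'node:fs';

// Hypothetical reader: collect completed IDs from a JSONL checkpoint file.
function readCheckpoint(checkpointFile) {
  const done = new Set();
  if (!fs.existsSync(checkpointFile)) return done;
  const lines = fs.readFileSync(checkpointFile, 'utf8').split('\n');
  for (const line of lines) {
    if (line.trim() === '') continue;
    try {
      const record = JSON.parse(line);
      if (record && record.id !== undefined) done.add(String(record.id));
    } catch {
      // Unparseable line, most likely a write cut short by a crash.
      // Ignoring it means that item is simply processed again on resume.
    }
  }
  return done;
}
```

If a torn line has no trailing newline, the next appended record lands on the same line and is also unparseable; both items are then reprocessed, which is safe as long as the per-item work can be repeated.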

## The Trade-off

Checkpointing adds file I/O per item. Usually negligible compared to the work
itself. The cost of NOT checkpointing, however, is:

- Wasted LLM calls (money)
- Wasted API quota
- User has to manually figure out where the job stopped
- Worst case: job silently half-completes and corrupts DB state

## Anti-patterns

- **Checkpointing to in-memory arrays only.** If the process dies, so does
  the checkpoint.
- **Non-atomic writes.** Use append-only writes (with `fsync` where durability
  matters), or write to a temp file and rename it over the target.
- **Checkpoint file in `/tmp`.** It WILL get cleaned up. Put it under
  `.planning/sessions/` or a project-local cache dir.
- **Not logging the checkpoint file path at start.** If the user needs to
  resume manually, they need to know where to look.
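
For snapshot-style checkpoints, the write-temp-then-rename approach named above could look like the following sketch. `writeSnapshotAtomic` is a hypothetical helper, and the sketch assumes a POSIX-style filesystem where renaming within one directory replaces the target in a single step.

```javascript
import fs from 'node:fs';

// Hypothetical helper: replace a snapshot file via temp file + rename.
function writeSnapshotAtomic(snapshotFile, state) {
  const tempFile = `${snapshotFile}.tmp-${process.pid}`;
  const fd = fs.openSync(tempFile, 'w');
  try {
    fs.writeSync(fd, JSON.stringify(state));
    fs.fsyncSync(fd);                      // flush contents to disk before the rename
  } finally {
    fs.closeSync(fd);
  }
  fs.renameSync(tempFile, snapshotFile);   // readers see the old or the new file, never a partial one
}
```

The temp file lives next to the target so that the rename stays on the same filesystem.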

## Invocation Heuristic

Before starting any long job, ask:

- "If my process dies halfway, is the user's work gone?"
- "If I'm Ctrl-C'd at item 500 of 1000, can I pick up at 501?"
- "Does item K depend on item K-1, or are they independent?"

If the answers are, in order, "yes, no, independent" → use checkpointing.

## Example — LLM Tiebreaker (v1.49 release-history work)

Situation: 681 lessons to classify via `claude -p`, 5 per batch, ~30 sec per
batch. Total: ~70 minutes. No checkpointing was in place.

Worst-case loss: 136 wasted LLM calls, had the context broken during the last batch (137 of 137).
Actual loss: 0 — but only because the run happened to complete first time.

Fix: wrap the batch loop in `processBatches()` from `resumable.mjs`.
On resume, only unprocessed lessons get classified.
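
The figures in this example follow from the numbers given above; a quick check of the arithmetic (variable names are illustrative):

```javascript
const lessons = 681;
const batchSize = 5;
const secondsPerBatch = 30;      // approximate, from the situation described above

const batches = Math.ceil(lessons / batchSize);          // 137
const totalMinutes = (batches * secondsPerBatch) / 60;   // 68.5, i.e. roughly 70
const worstCaseWastedCalls = batches - 1;                // 136, if the failure hits batch 137

console.log({ batches, totalMinutes, worstCaseWastedCalls });
```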

## Related

- `session-observatory-live` — log a `checkpoint` event at every completion
- `decision-framework-invoker` — long jobs often produce irreversible state
- `batch-rewrite-pattern` — similar batching shape, different domain