---
name: update-dataset
description: End-to-end dataset update workflow with PR creation, snapshot, meadow, garden, and grapher steps. Use when user wants to update a dataset, refresh data, run ETL update, or mentions updating dataset versions.
---

# Update Dataset (PR → snapshot → steps → grapher)

Use this skill to run a complete dataset update with Claude Code subagents, keep a live progress checklist, and pause for approval at a checkpoint **after every numbered workflow step** before continuing.

## Inputs

- `//`
- Get `` as today's date by running `date -u +"%Y-%m-%d"`

Optional trailing args:
- branch: The working branch name (defaults to current branch)

Assumptions:
- All artifacts are written to `workbench//`.
- Persist progress to `workbench//progress.md` and update it after each step.

## Progress checklist (maintain, tick live, and persist to progress.md)

(Checkpoint rule: After you finish each item below that represents a workflow step, immediately run the CHECKPOINT procedure. Do not batch multiple steps before a checkpoint.)
- [ ] Parse inputs and resolve: channel, namespace, version, short_name, old_version, branch
- [ ] Clean workbench directory: delete `workbench/` unless continuing existing update
- [ ] Run ETL update workflow via `etl-update` subagent (help → dry run → approval → real run)
- [ ] Create or reuse draft PR and work branch
- [ ] Update snapshot and compare to previous version; capture summary
- [ ] Meadow step: run + fix + diff + summarize
- [ ] Garden step: run + fix + diff + summarize
- [ ] Grapher step: run + verify (skip diffs), or explicitly mark N/A
- [ ] CHECKPOINT: present consolidated summary and request approval
- [ ] If approved, commit, push, and update PR description
- [ ] Optional: run indicator upgrade on staging and persist report
- [ ] Draft Slack announcement and notify user to post it to #data-updates-comms

Persistence:
- After ticking each item, update `workbench//progress.md` with the current checklist state and a timestamp.

## CHECKPOINT (mandatory user approval)

Always performed **immediately after completing each numbered workflow step** (1–6). Never start the next step until approval is granted.

Procedure (each time):
1. Present a concise summary of what just changed, key diffs/issues resolved, and what the next step will do.
2. Ask exactly: `Proceed? reply: yes/no`
3. Only continue if the user replies exactly `yes` (case-insensitive). Any other reply = no; stop and wait.
4. On approval:
   - Update progress checklist (tick the completed item) and write `workbench//progress.md` with timestamp.
   - Commit related changes (if any), push.
   - Update (or append to) the PR description: add a collapsed section titled with the step name (e.g., "Snapshot Update", "Meadow Update") containing the summary.

## Mandatory per-step checkpoints (rule)

You MUST:
- Stop after each workflow step (1–6) and run CHECKPOINT before starting the next (step 7 is optional and still requires a checkpoint if executed).
- Never chain multiple steps inside a single approval.
- Treat missing or ambiguous replies as no.

## Workflow orchestration

0) Initial setup
   - Check if `workbench//progress.md` exists to determine if continuing existing update
   - If starting fresh: delete `workbench/` directory if it exists
   - Create fresh `workbench/` directory for artifacts

1) Run ETL update command (etl-update subagent)
   - Inputs: `//` plus any required flags
   - **CRITICAL**: Run `etl update` ONCE for the full step URI (e.g., `data://garden/namespace/old_version/short_name`). Do NOT run it separately per channel (snapshot, meadow, garden, grapher). Running it once ensures all cross-step DAG dependencies are updated together. Running it per-channel leaves stale version references in `dag/main.yml` (e.g., garden pointing to old meadow version).
   - Perform help check, dry run, approval, then real execution; capture summary for later PR notes
   - After running, **always verify `dag/main.yml`**: grep for the old version and confirm all internal references between the new steps point to the new version (e.g., garden depends on new meadow, not old meadow).
   - CHECKPOINT (stop → summarize → ask → require yes)

2) Create PR and integrate update via subagent (etl-pr)
   - Inputs: `//`
   - Create or reuse draft PR, set up work branch, and incorporate the ETL update outputs
   - CHECKPOINT

3) Snapshot run & compare (snapshot-runner subagent)
   - Inputs: `//` and ``
   - CHECKPOINT

4) Meadow step repair/verify (step-fixer subagent, channel=meadow)
   - Run, fix, re-run; produce diffs
   - Save diffs and summaries
   - CHECKPOINT

5) Garden step repair/verify (step-fixer subagent, channel=garden)
   - Run, fix, re-run; produce diffs
   - Save diffs and summaries
   - CHECKPOINT

6) Grapher step run/verify (step-fixer subagent, channel=grapher, add --grapher)
   - Skip diff
   - CHECKPOINT

7) Indicator upgrade (optional, staging only)
   - Use indicator-upgrader subagent with ` `
   - **CRITICAL**: After the upgrader finishes, always verify it actually worked by querying staging: `make query SQL="SELECT COUNT(*) FROM chart_dimensions cd JOIN variables v ON cd.variableId = v.id WHERE v.catalogPath LIKE '%/%'"`. If the count is 0, the upgrade did not run; re-run it.
   - CHECKPOINT (if executed)

8) Slack announcement
   - Fill out the template at `.claude/skills/update-dataset/slack-announcement-template.md` using facts gathered during the update (coverage, chart count, key changes, etc.)
   - Ask user if unsure about any details
   - Save the draft to `workbench//slack-announcement.md`
   - Tell the user: "Slack announcement drafted at `workbench//slack-announcement.md`. Please review and post it to **#data-updates-comms**."

## Guardrails and tips

- **DAG consistency**: After `etl update`, always verify that all new steps in `dag/main.yml` reference each other with the new version. A common bug is garden depending on old meadow or old snapshot, which silently loads stale data.
- Never return empty tables or comment out logic as a workaround; fix the parsing/transformations instead.
- Column name changes: update garden processing code and metadata YAMLs (garden/grapher) to match schema changes.
- Indexing: avoid leaking index columns from `reset_index()`; format tables with `tb.format(["country", "year"])` as appropriate.
- Metadata validation errors are guidance: update YAML to add/remove variables as indicated.

## Artifacts (expected)

- `workbench//snapshot-runner.md`
- `workbench//progress.md`
- `workbench//meadow_diff_raw.txt` and `meadow_diff.md`
- `workbench//garden_diff_raw.txt` and `garden_diff.md`
- `workbench//indicator_upgrade.json` (if indicator-upgrader was used)

## Example usage

- Minimal catalog URI with explicit old version:
  - `update-dataset data://snapshot/irena/2024-11-15/renewable_power_generation_costs 2023-11-15 update-irena-costs`

---

### Common issues when data structure changes

- SILENT FAILURES WARNING: Never return empty tables or comment out code as a workaround!
- Column name changes: If columns are renamed/split (e.g., single cost → local currency + PPP), update:
  - Python code references in the garden step
  - Garden metadata YAML (e.g., `food_prices_for_nutrition.meta.yml`)
  - Grapher metadata YAML (if it exists)
- Index issues: Check for unwanted `index` columns from `reset_index()` and ensure proper indexing with `tb.format(["country", "year"])`.
- Metadata validation: Use error messages as a guide; they show exactly which variables to add/remove from YAML files.
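
### Sketch: checking DAG consistency

The DAG consistency guardrail (new steps must not depend on old-version steps) can be approximated with a small script. This is a minimal sketch and not part of the ETL tooling. The function name `find_stale_dependencies` and the sample DAG text are hypothetical, and the code assumes `dag/main.yml` lists each step as a key ending in `:` with its dependencies as `- ` items beneath it. It only flags new-version steps that still depend on the old version of the same dataset.

```python
def find_stale_dependencies(dag_text, namespace, short_name, old_version, new_version):
    """Return (step, dependency) pairs where a new-version step depends on the old version.

    Assumption: each step is a line ending in ':' and each dependency is a '- ' item
    below it. Inline comments and multi-line YAML values are not handled.
    """
    new_marker = f"/{namespace}/{new_version}/{short_name}"
    old_marker = f"/{namespace}/{old_version}/{short_name}"
    stale = []
    current_step = None
    for raw_line in dag_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and full-line comments
        if line.startswith("- "):
            dependency = line[2:].strip()
            if current_step and new_marker in current_step and old_marker in dependency:
                stale.append((current_step, dependency))
        elif line.endswith(":"):
            current_step = line[:-1].strip()  # a step key (or a section key such as 'steps')
    return stale


# Made-up sample for illustration only: garden still points at the old meadow step.
SAMPLE_DAG = """\
steps:
  data://meadow/irena/2024-11-15/renewable_power_generation_costs:
    - snapshot://irena/2024-11-15/renewable_power_generation_costs.xlsx
  data://garden/irena/2024-11-15/renewable_power_generation_costs:
    - data://meadow/irena/2023-11-15/renewable_power_generation_costs
"""

if __name__ == "__main__":
    for step, dependency in find_stale_dependencies(
        SAMPLE_DAG, "irena", "renewable_power_generation_costs", "2023-11-15", "2024-11-15"
    ):
        print(f"STALE: {step} depends on {dependency}")
```

To run this against a real checkout, pass the contents of `dag/main.yml` (for example, read with `pathlib.Path("dag/main.yml").read_text()`) instead of `SAMPLE_DAG`. An empty result only means no stale references matched this simple pattern, so a manual grep for the old version is still worthwhile.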