---
name: projectclownfish-cluster-worker
description: "Use when operating ProjectClownfish cluster workflows in this repo: reviewing GitHub Actions run artifacts, tuning worker/applicator guardrails, importing ghcrawl clusters, dispatching canaries, scaling batches, or deciding whether autonomous write-mode cluster cleanup is safe to ramp."
---

# ProjectClownfish Cluster Workflow

Use this skill for ProjectClownfish operations in this repo. It is not just a one-job worker skill; it is the fast path for running the whole guarded cluster loop.

## Hard Stance

- Execute/autonomous gates default on for live ProjectClownfish work. Set `CLOWNFISH_ALLOW_EXECUTE=0` or `CLOWNFISH_ALLOW_FIX_PR=0` only when intentionally pausing mutations.
- Do not dispatch a broad write-mode batch if the last canary failed, has unreviewed artifacts, or any stale pre-fix run is still active.
- Treat GitHub Actions repository variables as captured at workflow start. Resetting `CLOWNFISH_ALLOW_EXECUTE` does not revoke already-started runs.
- If a stale run was started with old guardrails and write mode enabled, cancel it before scaling.
- Codex workers never mutate GitHub directly. They emit JSON; `scripts/execute-fix-artifact.mjs` owns guarded fix PR execution and `scripts/apply-result.mjs` owns guarded close/merge replay.
- Only the applicator may record `executed`. Worker output containing `executed` is a bug.
- Closed historical refs are evidence only. They must not receive `close_*` actions.
- Security-sensitive refs do not belong in ProjectClownfish mutation. Quarantine vulnerability, advisory, CVE/GHSA, leaked secret, credential/token/API-key, plaintext secret storage, SSRF/XSS/CSRF/RCE, security-class injection, exploitability, or sensitive-data exposure refs with `route_security`, and keep unrelated non-security work moving.

## Recovery Check

Start every session here:

```bash
pwd
git rev-parse --show-toplevel
git branch --show-current
git status --short --branch
df -h .
test -e node_modules && { ls -ld node_modules; test -L node_modules && echo node_modules_symlink=yes || echo node_modules_symlink=no; } || echo node_modules_missing
```

Then check live workflow state and safety vars:

```bash
gh run list --repo openclaw/clownfish --workflow cluster-worker.yml --limit 10 \
  --json databaseId,headSha,status,conclusion,createdAt,updatedAt,url \
  --jq '.[] | {databaseId,headSha,status,conclusion,createdAt,updatedAt,url}'
gh variable list --repo openclaw/clownfish --json name,value \
  --jq 'map(select(.name|test("^CLOWNFISH_"))) | sort_by(.name) | .[] | {name,value}'
```

## Review Results

For every completed run that matters:

```bash
rm -rf /tmp/clownfish-check-RUN_ID
mkdir -p /tmp/clownfish-check-RUN_ID
gh run download RUN_ID --repo openclaw/clownfish --dir /tmp/clownfish-check-RUN_ID
npm run review-results -- /tmp/clownfish-check-RUN_ID
```

Summarize artifacts:

```bash
find /tmp/clownfish-check-RUN_ID -name result.json -print -quit | xargs jq '{status,summary,actions_total:(.actions|length),action_counts:(.actions|group_by(.action)|map({action:.[0].action,count:length}))}'
find /tmp/clownfish-check-RUN_ID -name apply-report.json -print -quit | xargs jq '{totals:{executed:([.actions[]? | select(.status=="executed")]|length),blocked:([.actions[]? | select(.status=="blocked")]|length),skipped:([.actions[]? | select(.status=="skipped")]|length),planned:([.actions[]? | select(.status=="planned")]|length)}}'
find /tmp/clownfish-check-RUN_ID -name fix-execution-report.json -print -quit | xargs jq '{status,actions}'
```

If review fails, inspect the failure class before doing anything else:

- `executed` in worker result: tighten schema, prompts, and `scripts/review-results.mjs`.
- `close action targets closed item`: tune prompts and planner so closed context refs are evidence-only; use `keep_closed`.
- long `Run worker`: reduce prompt size by hydrating canonical + open candidates only; add/verify `CLOWNFISH_CODEX_TIMEOUT_MS`.
- applicator blocked because target changed: rerun the job against fresh state, do not force apply.

## Tune Engine

Use repo scripts and prompts as the control plane:

- `schemas/codex-result.schema.json`: what Codex may emit.
- `prompts/worker-system.md`, `prompts/autonomous.md`, `prompts/execute.md`, `prompts/plan-only.md`: worker behavior.
- `instructions/low-signal-prs.md`: opt-in manual backlog cleanup policy for random docs churn, blank-template PRs, test-only spam, third-party capability PRs that belong on ClawHub, risky infra drive-bys, and dirty branches.
- `scripts/review-results.mjs`: deterministic artifact gate.
- `scripts/plan-cluster.mjs`: what gets hydrated into the prompt.
- `scripts/execute-fix-artifact.mjs`: deterministic branch repair/replacement PR gate.
- `scripts/apply-result.mjs`: deterministic mutation gate.
- `scripts/post-flight.mjs`: deterministic post-execution finalizer for ProjectClownfish fix PRs and post-merge closeouts.
- `scripts/import-ghcrawl-low-signal-prs.mjs`: local ghcrawl open-PR scanner for opt-in low-signal cleanup jobs.
- `.github/workflows/cluster-worker.yml`: runner behavior and env capture.

Current autonomy posture:

- Hydrate comments and PR review comments by default before model execution.
- Hydrate cluster refs and bounded first-hop linked refs so closed representative drift can often be resolved without human review.
- Treat failing checks as a merge/fixed-by-candidate blocker, not a reason to stop classifying the whole cluster.
- Treat security-sensitive refs as scoped quarantine. Emit/expect `route_security` for that ref only; keep processing unrelated non-security duplicates, bugs, provider gaps, and fix artifacts.
- Treat missing `merge_preflight` as a hard merge blocker. Merge preflight must prove security clearance, resolved human comments, resolved review-bot comments, passed Codex `/review`, addressed findings, and validation commands.
- For `openclaw/openclaw` fix artifacts, validation commands must use repo-native `pnpm` lanes such as `pnpm test:serial `, `pnpm -s vitest run `, and `pnpm check:changed`; `npm run validate` is not a valid target-repo gate.
- Let `execute-fix-artifact` run the agentic merge-prep loop for fix PRs: edit, validate, Codex `/review`, address findings, revalidate, then resolve review threads when `CLOWNFISH_RESOLVE_REVIEW_THREADS=1`.
- Prepare the target repo toolchain before the agentic edit/review loop. For OpenClaw this means Node 22+, Corepack, the target `packageManager`, and `pnpm install --frozen-lockfile`; only set `CLOWNFISH_INSTALL_TARGET_DEPS=0` for debugging failed setup.
- Review failed worker artifacts before requeueing. The workflow must upload worker artifacts even when `review-results` fails; a missing artifact after a failed review is a workflow bug, not an acceptable blind retry.
- Replacement fix PR execution must use the recoverable target branch `clownfish/`. If that branch already exists, resume it instead of starting from scratch. After agent edits and review-fix edits, commit and push checkpoint commits to that branch before expensive validation/review gates so a timed-out run can be requeued without losing the patch. Do not open the PR until validation and Codex `/review` pass.
- Resumed replacement branches may be rebased and narrowly refactored onto current `origin/main`. If the rebase conflicts, let the executor run the Codex rebase-repair loop, resolve conflict markers, continue the rebase, then proceed through the normal validation/review gate. Tune `CLOWNFISH_REBASE_REPAIR_ATTEMPTS` instead of disabling the rebase gate.
- Useful but uneditable or unsafe source PRs are replacement candidates, not human blockers. When a canonical PR is draft, stale, unmergeable, has `maintainer_can_modify=false`, or has broad unrelated churn, emit or execute `replace_uneditable_branch` with full source PR credit instead of waiting for a maintainer decision.
- Fix execution should provide Codex actual repo-discovery context before editing; repeated "no target repo changes" means tune `scripts/execute-fix-artifact.mjs` before replaying more jobs. GitHub Actions may block Codex bwrap write/review sandboxes, so write-mode and review execution default to `danger-full-access` there after tokens are stripped from the Codex environment. A Codex write preflight must fail fast before the expensive repair loop if sandbox/auth/write access is broken; do not wait through multi-attempt edits to discover startup failures. Keep canary execution bounded: default worker timeout is 30 minutes, build-PR step timeout is 30 minutes, fix Codex edit budget is 20 minutes with reserve for artifact writing, preflight timeout is 2 minutes, Codex model is `gpt-5.5`, and Codex reasoning effort is `medium`. Worker timeout/failure and exhausted `/review` attempts must write blocked artifacts and keep the workflow reporting path alive. Fix executor runs must copy Codex debug logs into the run artifact so timeout failures are inspectable.
- Match OpenClaw's CI fast lane for fix validation. Use `blacksmith-4vcpu-ubuntu-2404` for cluster planning/review and `blacksmith-16vcpu-ubuntu-2404` for fix/apply execution. The executor sets `OPENCLAW_LOCAL_CHECK=0` and treats `pnpm check:changed` plus diff checks as the default hard gate. It normalizes target validation commands to `pnpm check:changed` unless `CLOWNFISH_TARGET_VALIDATION_MODE=strict` or `CLOWNFISH_STRICT_TARGET_VALIDATION=1` is explicitly set, so unrelated flaky main CI and broad suites do not block narrow ProjectClownfish fixes.
- After fix execution, run post-flight finalization before the final closeout replay. Post-flight may merge only ProjectClownfish-opened/pushed fix PRs, only after merge preflight, security clearance, resolved review threads, and non-ignored checks are clean.
  Default ignored checks are `auto-response`, `Labeler`, and `Stale`; configure `CLOWNFISH_POST_FLIGHT_IGNORE_CHECKS` rather than broadening the hard gate in code.
- Prefer `keep_related`, `keep_independent`, `keep_closed`, `fix_needed`, `route_security`, and subcluster notes over blanket `needs_human`.
- Use `needs_human` only for the exact maintainer decision still unresolved after hydrated evidence is reviewed.
- Worker results must use one action per issue/PR ref. Never emit comma-separated action targets; related follow-up subclusters should be one `keep_related` action per ref or one cluster-scoped `fix_needed` action.
- Close-action `canonical`, `duplicate_of`, and `candidate_fix` refs must come from hydrated preflight items. If a PR is only mentioned in comments or previous ProjectClownfish notes, keep it as evidence/fix-artifact context until a refreshed plan hydrates it.
- Broad feature/config/docs rewrites are not autonomous executor work. If a fix artifact crosses many implementation, config/schema, docs, and test surfaces, split it into narrower follow-up jobs or let `execute-fix-artifact` block it. Override only with `CLOWNFISH_ALLOW_BROAD_FIX_ARTIFACTS=1`.

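
Several of the rules above are mechanical enough to check before an artifact reaches the applicator. The following is a minimal sketch of such checks, not the code in `scripts/review-results.mjs`. The `target` and `status` field names, the `findViolations` helper, and the `closedRefs` set are assumptions made for illustration; only the `actions` array and its `action` field appear in the `jq` summaries in this document.

```javascript
// Hypothetical sketch; the real gate lives in scripts/review-results.mjs.
// Assumed shapes: result.actions[] = { action, target, status? } and
// closedRefs = set of refs the hydrated plan reports as closed.
function findViolations(result, closedRefs = new Set()) {
  const violations = [];
  const seenTargets = new Set();
  for (const entry of result.actions ?? []) {
    const target = String(entry.target ?? "");
    // Only the applicator may record `executed`.
    if (entry.status === "executed") {
      violations.push(`${target}: worker emitted executed`);
    }
    // One action per issue/PR ref, never comma-separated targets.
    if (target.includes(",")) {
      violations.push(`${target}: comma-separated target`);
    }
    if (seenTargets.has(target)) {
      violations.push(`${target}: more than one action for ref`);
    }
    seenTargets.add(target);
    // Closed historical refs are evidence only.
    if (String(entry.action).startsWith("close_") && closedRefs.has(target)) {
      violations.push(`${target}: close action targets closed item`);
    }
  }
  return violations;
}

// Illustrative sample data, not taken from a real run.
const sample = {
  actions: [
    { action: "close_low_signal", target: "#101" },
    { action: "keep_closed", target: "#102" },
    { action: "close_low_signal", target: "#103", status: "executed" },
    { action: "keep_related", target: "#104,#105" },
  ],
};
console.log(findViolations(sample, new Set(["#101", "#102"])));
```

An empty array means none of these particular checks tripped; it does not replace `npm run review-results`.
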

After tuning, run:

```bash
node --check scripts/plan-cluster.mjs
node --check scripts/import-ghcrawl-clusters.mjs
node --check scripts/run-worker.mjs
node --check scripts/post-flight.mjs
npm run validate
git diff --check
```

Do a narrow planner smoke before committing hydration changes:

```bash
rm -rf /tmp/clownfish-plan-check
node scripts/plan-cluster.mjs jobs/openclaw/ghcrawl-143793-autonomous-smoke.md \
  --offline --run-dir /tmp/clownfish-plan-check
jq '{items:(.items|length),seed_refs:(.scope.seed_refs|length),context_refs:(.scope.context_refs|length),hydrate_cluster_refs:.scope.hydrate_cluster_refs}' \
  /tmp/clownfish-plan-check/cluster-plan.json
```

For a needs-human reduction smoke, verify the artifact includes real comment and review-comment excerpts:

```bash
jq '{items:(.items|length), comment_items:([.items[] | select(.comments_hydrated > 0)] | length), review_comment_prs:([.items[] | select(.pull_request.review_comments_hydrated > 0)] | length)}' \
  /tmp/clownfish-plan-check/cluster-plan.json
```

## Generate Batch Jobs

Use ghcrawl read-only inspection first:

```bash
ghcrawl doctor --json
ghcrawl configure --json
ghcrawl clusters openclaw/openclaw --min-size 2 --limit 80 --sort size --json |
  jq -r '.clusters[] | select(.isClosed == false) | [.clusterId,.totalCount,.issueCount,.pullRequestCount,.latestUpdatedAt,.displayTitle] | @tsv'
rg -o 'ghcrawl-[0-9]+' jobs/openclaw -g '*.md' | sed -E 's/.*ghcrawl-([0-9]+).*/\1/' | sort -n | uniq | tr '\n' ' '
```

Pick the largest active clusters not already imported, then generate autonomous job files:

```bash
node scripts/import-ghcrawl-clusters.mjs --from-ghcrawl --limit 40 \
  --repo openclaw/openclaw \
  --mode autonomous \
  --suffix autonomous-smoke \
  --allow-instant-close \
  --allow-merge \
  --allow-fix-pr \
  --allow-post-merge-close
```

The importer skips existing ghcrawl IDs and fully security-sensitive clusters by default.
Mixed clusters are allowed so the worker can route security refs and continue ordinary bug/dedupe work.

Validate before committing:

```bash
npm run validate
```

Commit engine changes separately from generated job batches when practical:

```bash
git add prompts schemas scripts .github
git commit -m "fix: scope autonomous cluster workflow"
git add jobs/openclaw/ghcrawl-*-autonomous-smoke.md
git commit -m "chore: add next autonomous cluster jobs"
git push origin main
```

## Dispatch Policy

Do not jump straight to 20 write-mode jobs. Sequence:

1. Ensure no stale active runs on old SHAs.
2. Ensure `CLOWNFISH_ALLOW_EXECUTE=1` and `CLOWNFISH_ALLOW_FIX_PR=1` unless the operator intentionally paused live work.
3. Dispatch 2-3 canaries on the latest pushed SHA.
4. Review artifacts and applicator reports.
5. Only then dispatch a wider batch.

Canary dispatch:

```bash
npm run dispatch -- \
  jobs/openclaw/ghcrawl-ID1-autonomous-smoke.md \
  jobs/openclaw/ghcrawl-ID2-autonomous-smoke.md \
  --mode autonomous \
  --runner blacksmith-4vcpu-ubuntu-2404 \
  --execution-runner blacksmith-16vcpu-ubuntu-2404
```

Important: after dispatch, already-started runs keep the write gate they captured. If a new bug is found, cancel those runs.

Single-job requeue after calibration:

```bash
npm run requeue -- 24947178021
npm run requeue -- 24947178021 --execute --open-execute-window \
  --runner blacksmith-4vcpu-ubuntu-2404 \
  --execution-runner blacksmith-16vcpu-ubuntu-2404
```

Use a run id when you want to replay the same source job from an artifact, or a job path when you already know the file. The script opens both mutation gates for live execute/autonomous requeues and closes them after the queued run starts.

For plan-only scaling, keep write gate off and dispatch with `--mode plan` or `--dry-run` where appropriate.

## Low-Signal PR Sweeps

Use this only for manual backlog cleanup and random drive-by PR triage.
It is not dedupe and it must stay separate from duplicate/superseded/fixed-by-candidate closeouts.

Generate staged jobs from local ghcrawl data:

```bash
npm run import-low-signal -- --limit 20 --batch-size 5 --mode autonomous --sort stale
```

Generated jobs set `triage_policy: low_signal_prs` and `allow_low_signal_pr_close: true`. The worker may emit `close_low_signal` only for open pull requests that pass `instructions/low-signal-prs.md`.

Before live dispatch:

- inspect the generated job candidates;
- commit and push the jobs so Actions can read them;
- dispatch 1-2 canaries first;
- review artifacts before scaling the next batches.

## Self-Heal Failed Jobs

Use self-heal after reviewing the failed artifacts and tuning obvious deterministic guardrail issues.

Dry-run candidate selection:

```bash
npm run self-heal
```

This selects only the latest failed run per source job, skips jobs that have a later success, and skips jobs already retried in `results/self-heal.json`.

Live one-attempt retry:

```bash
npm run self-heal -- --execute --open-execute-window --max-jobs 5 \
  --runner blacksmith-4vcpu-ubuntu-2404 \
  --execution-runner blacksmith-16vcpu-ubuntu-2404
```

The local live path temporarily opens gates only when needed, dispatches the retry jobs, waits until the new runs have started, records the ledger, and restores the prior gate values.

If using the manual `self-heal failed clusters` workflow, keep it dry-run by default. For execute mode, open the execution gate before triggering it or it should fail before dispatching write-mode jobs.

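
The dry-run selection rule in this section (latest failed run per source job, skip jobs with a later success, skip jobs already retried) can be sketched as a small pure function. This is an illustration under stated assumptions, not the implementation behind `npm run self-heal`: the run record shape (`id`, `job`, `conclusion`, `createdAt`), the `selectSelfHealCandidates` name, and the idea that the ledger reduces to a set of job paths are all hypothetical.

```javascript
// Hypothetical sketch of the selection rule; not the repo's self-heal script.
// Assumed shapes: runs[] = { id, job, conclusion, createdAt (ISO string) },
// retriedJobs = set of job paths already recorded in the ledger.
function selectSelfHealCandidates(runs, retriedJobs = new Set()) {
  const latestFailure = new Map();
  const latestSuccess = new Map();
  for (const run of runs) {
    const bucket =
      run.conclusion === "failure" ? latestFailure :
      run.conclusion === "success" ? latestSuccess : null;
    if (!bucket) continue; // ignore cancelled, in-progress, and other states
    const time = Date.parse(run.createdAt);
    const current = bucket.get(run.job);
    if (!current || time > current.time) {
      bucket.set(run.job, { id: run.id, time });
    }
  }
  const candidates = [];
  for (const [job, failed] of latestFailure) {
    const success = latestSuccess.get(job);
    if (success && success.time > failed.time) continue; // later success
    if (retriedJobs.has(job)) continue; // already retried once
    candidates.push({ job, runId: failed.id });
  }
  return candidates;
}

// Illustrative sample data, not taken from a real ledger.
const sampleRuns = [
  { id: 1, job: "jobs/a.md", conclusion: "failure", createdAt: "2025-01-01T00:00:00Z" },
  { id: 2, job: "jobs/a.md", conclusion: "failure", createdAt: "2025-01-02T00:00:00Z" },
  { id: 3, job: "jobs/b.md", conclusion: "failure", createdAt: "2025-01-01T00:00:00Z" },
  { id: 4, job: "jobs/b.md", conclusion: "success", createdAt: "2025-01-03T00:00:00Z" },
  { id: 5, job: "jobs/c.md", conclusion: "failure", createdAt: "2025-01-01T00:00:00Z" },
];
console.log(selectSelfHealCandidates(sampleRuns, new Set(["jobs/c.md"])));
```

With this sample, only `jobs/a.md` is selected, using its newer failed run: `jobs/b.md` has a later success and `jobs/c.md` is already in the retried set.
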
## Ramp Decision

Say "safe to ramp" only when all are true:

- latest canaries run on the current SHA;
- no worker result uses `executed`;
- no close action targets a closed item;
- applicator executed only planned duplicate/superseded/fixed-by-candidate close actions or guarded clean merge actions;
- every merge action had passing `merge_preflight`, and live GitHub review threads were resolved before merge;
- useful contributor PRs were either repaired when maintainer-editable or have a replacement fix artifact with source PR credit before superseded closeout;
- `CLOWNFISH_ALLOW_EXECUTE` and `CLOWNFISH_ALLOW_FIX_PR` are back to their intended default values;
- active runs are expected and on the intended SHA;
- artifacts are downloaded or easy to retrieve by run URL.

If not, say exactly what blocked the ramp and patch that first.
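
The checklist above can also be kept as data, so that a "not safe" answer always names its blockers. The sketch below is one possible shape, not part of the repo: the condition keys, the `RAMP_CONDITIONS` table, and the `rampDecision` helper are hypothetical names, and the boolean inputs are assumed to come from the operator's own review of the canary artifacts.

```javascript
// Hypothetical keys; each maps a ramp condition to the blocker text to report.
const RAMP_CONDITIONS = {
  canariesOnCurrentSha: "latest canaries did not run on the current SHA",
  noWorkerExecuted: "a worker result uses `executed`",
  noCloseOnClosedItem: "a close action targets a closed item",
  applicatorActionsExpected: "applicator executed an unplanned or unguarded action",
  mergePreflightPassed: "a merge lacked passing `merge_preflight` or resolved review threads",
  contributorPrsHandled: "a useful contributor PR has no repair or credited replacement",
  gatesAtDefaults: "mutation gate variables are not at their intended values",
  activeRunsExpected: "an active run is unexpected or on the wrong SHA",
  artifactsRetrievable: "artifacts are not downloaded or retrievable by run URL",
};

// Any condition that is not explicitly true counts as a blocker.
function rampDecision(observed) {
  const blockers = Object.entries(RAMP_CONDITIONS)
    .filter(([key]) => observed[key] !== true)
    .map(([, reason]) => reason);
  return { safe: blockers.length === 0, blockers };
}

const allClear = Object.fromEntries(
  Object.keys(RAMP_CONDITIONS).map((key) => [key, true])
);
console.log(rampDecision(allClear));
console.log(rampDecision({ ...allClear, noWorkerExecuted: false }));
```

Treating a missing or unknown condition as a blocker keeps the default conservative, in line with the canary-first dispatch policy.
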