# Drift monitor Dario's bundled CC template (`src/cc-template-data.json`) is the wire-shape fallback the proxy uses when it can't fingerprint a live CC install. For that fallback to be honest, the bundle has to keep up with what real CC is actually sending on the wire. CC drifts in two distinct ways, and there are two distinct watchers — one of which needs a self-hosted runner. ## Two classes of drift **Class A — npm-release drift.** Anthropic ships a new `@anthropic-ai/claude-code` to npm. The binary changes; the wire shape usually changes with it (new tools, new system-prompt slots, beta-header swaps). This is the visible kind. > Watched by `.github/workflows/cc-drift-watch.yml`. Runs on a GitHub-hosted > runner, no auth required — polls npm, diffs the dist file, opens an issue > when the bundled `_version` is behind latest. **Class B — same-binary remote-config drift.** Anthropic does *not* ship a new npm version, but the wire shape changes anyway. We documented an instance of this in [CHANGELOG `4.2.1`](../CHANGELOG.md#421---2026-05-17): same CC `2.1.143` binary, same machine, captures 24h apart produced materially different `/v1/messages` bodies (different anthropic-beta header, +355 char system prompt). The npm watcher can't see this; the binary is unchanged. > Watched by `.github/workflows/cc-drift-template-watch.yml`. Runs on a > **self-hosted** runner because it needs a live, authenticated CC install > to capture against. Runs `node scripts/capture-and-bake.mjs --check` every > 30 min and opens (or comments on) a `cc-drift-template`-labeled issue when > the captured template diverges from the committed bundle. > > Watched-by-the-watcher: `.github/workflows/cc-drift-watcher-liveness.yml` > runs every 2 hours on a github-hosted runner and opens a > `cc-watcher-liveness`-labeled alert if the class-B watcher has not had a > successful run within 8 hours (≥ 16 missed cycles). Catches "runner went > offline silently" — the failure mode where class-B drift goes uncaught > because the watcher itself is down. The liveness watcher lives on > github-hosted infrastructure deliberately so it survives the exact failure > modes it's designed to detect. **Class C — billing-classifier drift** *(v4.6.0)*. Different signal again: Anthropic changes the classifier *rules* — adds a new signal, tightens an existing one, flips a threshold — and dario's canonical-rebuild output no longer scores as `subscription` even though CC's wire shape is unchanged. The template-drift watcher cannot see this because nothing in CC's outbound has moved; only an end-to-end "send a real request, inspect the billing bucket" probe catches it. > Watched by `.github/workflows/cc-billing-classifier-canary.yml`. Runs daily > at 06:30 UTC on the same self-hosted runner. Sends one tiny haiku request > through `dario proxy` (canonical-rebuild mode, NOT `--passthrough`), > captures the `representative-claim` response header, opens a > `cc-billing-canary`-labeled alert when it flips to `overage` / `api` > / `unknown`. Auto-closes when it next returns a subscription bucket. > Cost: ~1 small subscription request per day. ## What --check considers drift The `--check` mode in `scripts/capture-and-bake.mjs` deliberately ignores fields that always differ between runs (`_captured` timestamp, user-agent string, `_version` / `_supportedMaxTested` labels). It flags **shape** drift in: - **tools** added or removed (by name set) - **anthropic_beta** header values added or removed - **system_prompt** content (any character delta) - **body_field_order** (top-level JSON key order) - **header_order** - **agent_identity** content Exit codes: | Code | Meaning | |---|---| | 0 | Full match — wire shape AND `_version` label both current | | 1 | Infrastructure failure (CC not on PATH, capture timeout, scrub leak) | | 2 | **Shape** drift vs current bundled template (needs a real re-bake) | | 3 | **Label-only** drift — wire shape matches but `_version` lags the live CC version | The workflow swallows exit 2 and 3 (continues to the next step) so the remediation steps can run; exit 1 fails the job. ### Exit 2 vs exit 3 — why the split, and why only one auto-merges Because `--check` ignores `_version`, a CC release whose wire shape is *unchanged* (the common case for a patch bump) produces a bundle whose shape matches live CC but whose `_version` label is stale. The shape-only detector sees no drift (exit 0 territory), yet `sdk-drift-watch.yml` — which compares the `_version` label against `@anthropic-ai/claude-code@latest` on npm — flags it, with **nothing to re-bake**. That mismatch used to require a hand PR every time (issues #418, #426/#427, #445/#451). Exit 3 captures exactly that case (`computeDrift` empty **and** `bundled._version !== live._version`) and writes the live version to `label-target.txt`. The **Label-sync** workflow step then runs `scripts/label-sync.mjs`, which bumps only the three version-label fields (`_version`, `_supportedMaxTested`, and the `claude-cli/` token in the user-agent header) — never the wire shape — patch-bumps `package.json`, promotes the CHANGELOG, opens a `bot/template-label-*` PR, and turns on **auto-merge**. Auto-merge is safe for exit 3 but **not** for exit 2: an empty `computeDrift` is a proof that the tools / system_prompt / beta headers / field orders are byte-identical at the live version, so only the version string moves — the same deterministic-bump risk class `cc-drift-watch.yml` already auto-merges for `SUPPORTED_CC_RANGE.maxTested`. Auto-merge still gates on the required checks (build ×3, compat, test, docker-cap-drop-smoke); a red check leaves the PR open with the bot branch preserved. A shape rebake (exit 2) changes the wire-shape contract, so a human reviews compat-test + the diff before merging. ## Setting up the self-hosted runner Any dedicated Linux host works. Hetzner / DO / EC2 / etc. The runner needs Node 22, a logged-in `claude` CLI, and disk for a clone of the repo (~200 MB including `node_modules`). ### Prerequisites ```bash # 1. Node 22 + npm curl -fsSL https://deb.nodesource.com/setup_22.x | sudo bash - sudo apt-get install -y nodejs # 2. GitHub CLI (`gh`) — the workflow's issue-open step shells out to it. # Without this, --check correctly detects drift but the "Open / update # drift issue" step fails with `gh: command not found`. sudo apt-get install -y gh # or follow https://github.com/cli/cli#installation # 3. CC + dario CLI (dario provides the headless OAuth flow) sudo npm i -g @anthropic-ai/claude-code @askalf/dario # 4. OAuth — manual flow for headless boxes. Run from the host's shell, # follow the printed URL in any browser, paste the post-login callback # URL back into the SSH session. # # SHARE the credential with any other CC clients that auto-refresh # (e.g. a platform dario container running 24/7). Just run the standard # `dario login --manual` against /root/.claude/.credentials.json and let # that long-running refresh authority keep the token fresh. The workflows # read whatever's current at fire time and never attempt a refresh from # the runner side. # # History: v4.4.1 isolated the runner's credential at /root/.claude-runner # to avoid OAuth refresh-token rotation races between the runner and other # CC clients on the host. The isolation worked for races but introduced a # different failure: the isolated token had no refresh authority between # workflow fires (each <10 min, often hours apart), and Anthropic invalidates # refresh tokens that idle too long. Result: `invalid_grant` on every # workflow fire, recoverable only via interactive `dario login --manual`. # # Sharing with a 24/7 refresh authority (typical setup: the docker-stack # dario container) fixes that. The race the isolation was protecting # against is rare in practice — platform dario refreshes proactively # when the access token has ~1h life remaining, so workflow runs hit a # fresh token and don't need to refresh themselves. dario login --manual # 5. Smoke test: echo "Reply with PONG" | claude --print # should print PONG ``` ### One-time repo setup The workflow's issue-open step calls `gh issue create --label cc-drift-template` which fails if the label doesn't exist in the repo. Create it once before the runner's first execution: ```bash gh label create cc-drift-template \ --description "Bundled CC template has drifted from live capture" \ --color FBCA04 ``` ### Register the runner In a browser: `https://github.com///settings/actions/runners/new`. Pick Linux x64. GitHub prints a `mkdir`/`curl`/`tar`/`./config.sh` snippet. Paste it into the host. At the labels prompt, type **`dario-drift`** — the workflow gates on `runs-on: [self-hosted, dario-drift]`, so the label is load-bearing. ### Install as a systemd service ```bash cd ~/actions-runner sudo ./svc.sh install $(whoami) # run as the same user that owns ~/.claude sudo ./svc.sh start sudo ./svc.sh status # → "active (running)" ``` If the runner runs as `root` and `~/.claude/.credentials.json` lives under `/root/`, `RUNNER_ALLOW_RUNASROOT=1 ./config.sh ...` lets `./config.sh` run as root; GitHub's runner otherwise refuses root by default. ### Trigger once to verify GitHub UI → Actions → **CC template drift watch (self-hosted)** → "Run workflow." It should pick up the labeled runner within seconds and finish within ~60s. Exit 0 (no drift) or exit 2 (issue auto-opened with the drift report) both mean the pipeline is healthy. Exit 1 means the capture broke; check the workflow logs. After the first successful run, the `*/30 * * * *` cron takes over. ## When --check fires an issue The workflow opens (or comments on) an issue labeled `cc-drift-template` containing the `[bake]` output — the list of differing slots, sizes, tool names. From there: ```bash # On a maintainer machine with CC + a logged-in OAuth credential: npm run build node scripts/capture-and-bake.mjs # rewrites src/cc-template-data.json git diff src/cc-template-data.json # review # Open a PR with the re-bake; the auto-release pipeline publishes a patch # version. The next clean --check cycle auto-closes the drift issue. ``` ## Optional: PAT for downstream workflow triggers Since v4.4.0, the watcher auto-opens a `bot/template-rebake-*` PR on detection. Since v4.3.0, `compat-test-self-hosted.yml` is supposed to run on PRs touching `src/cc-template-data.json`. **Without the setup below, it doesn't.** GitHub Actions has a deliberate restriction: workflows authenticated by the default `GITHUB_TOKEN` cannot trigger downstream workflow runs ([docs](https://docs.github.com/en/actions/security-for-github-actions/security-guides/automatic-token-authentication#using-the-github_token-in-a-workflow)). The auto-rebake PR is therefore invisible to compat-test, and the validation gate the v4.4.0 design promised is effectively bypassed. To close the gap, create a fine-grained personal access token (PAT) scoped to this repo and expose it to the watcher as `DARIO_DRIFT_BOT_PAT`: 1. **Generate** at `https://github.com/settings/personal-access-tokens/new`: - Resource owner: your user (or org) - Repository access: select `dario` only - Permissions: **Contents: read & write**, **Pull requests: read & write**, **Issues: read & write** - Expiration: whatever your security policy mandates (90 days / 1 year) 2. **Store** at `Settings → Secrets and variables → Actions → New repository secret`: - Name: `DARIO_DRIFT_BOT_PAT` - Value: the PAT from step 1 3. **Verify** on the next watcher cycle that detects drift. The auto-rebake PR's "Checks" tab should now include the `compat` job (which it didn't pre-v4.6.5). The watcher workflow uses `GH_TOKEN: ${{ secrets.DARIO_DRIFT_BOT_PAT || secrets.GITHUB_TOKEN }}` for `gh` CLI ops, so the PAT is **optional** — the watcher keeps working without it, you just don't get compat-test gating on auto-rebake PRs (same behavior as v4.4.0 through v4.6.4). The fallback exists so a maintainer can defer the PAT setup without breaking the loop. ## Runner credential rate-limit headroom Workflows that exercise live `dario proxy` paths (compat-test, billing canary, future end-to-end probes) all consume against the runner credential's subscription pool. The cadence assumptions are: | Workflow | Declared cadence | Observed cadence | Requests per fire | |---|---|---|---| | `cc-drift-template-watch.yml` (`--check`) | every 30 min (`*/30 * * * *`) | typically every 2-4h | 1 capture (no /v1/messages traffic — MITM-only) | | `cc-billing-classifier-canary.yml` | daily 06:30 UTC | daily | 1 small haiku request | | `compat-test-self-hosted.yml` | per qualifying PR | per qualifying PR | ~11 small requests | **Cron scheduler reality.** GitHub Actions' free-tier cron scheduler is best-effort, not guaranteed. The class-B watcher declares `*/30 * * * *` but in practice GitHub honors it every 2-4 hours on this repo. The liveness alarm (added v4.4.2) has its threshold set to 8h (raised from 3h in v4.7.1) to absorb this skew — anything past that is signal, not scheduler noise. If you need a tighter SLA (sub-hour), self-host the runner *and* the cron driver (e.g. a cron entry on the same Hetzner box invoking `gh workflow run` directly). At steady state, this is comfortably inside Pro/Max headroom. The failure mode to watch for is **batched firing** — manually re-triggering the same workflow several times in a single hour, or PRs landing in rapid succession that each fire compat-test. We tripped this during the v4.6.x rollout: a half-dozen manual re-runs in a 2-hour window 429'd the runner credential. Pro/Max accounts have per-hour rate caps as well as per-5h / per-7d pools, and the per-hour cap is what surfaces first. If the runner credential is rate-limited and a workflow run reports 429s across the board, the right diagnosis order is: (a) check `claude --print` directly — if it 429s, the credential pool is dry, just wait an hour; (b) check the credential is still on a subscription account (`dario doctor`); (c) check workflow cadence assumptions haven't changed. The runner shares its OAuth credential with any other long-running CC client on the box (typically the platform dario container, which auto-refreshes 24/7). Sharing is intentional: a workflow that fires sparsely cannot keep its own refresh token alive — Anthropic invalidates idle refresh tokens, and `invalid_grant` then breaks every subsequent run. Letting a 24/7 refresh authority own the token rotation eliminates that failure mode at the cost of competing for the same Pro/Max headroom. With current cadence (drift-template-watch every 30 min + compat per PR + canary daily), runner-side burn on the shared account is a manageable fraction of the headroom available on a Max plan; reducing cadence further is a knob if a particular workload needs more of it. ## Why a self-hosted runner GitHub-hosted runners can't capture CC. They have no Pro/Max subscription session, no MITM cert trust for CC's loopback proxy, no way to authenticate against `claude.ai/oauth`. Anything that needs real CC running against real Anthropic has to live on a host you control with an account you've logged in to. The runner is read-only against the repo (`contents: read`) and only writes to issues (`issues: write`). It cannot push, tag, or release. ## Platform-superset preservation CC ships different tools on different platforms — currently just `PowerShell` on Windows, but the surface grows over time. The bundled template is meant to be a **union** across platforms, and `filterToolsForPlatform()` strips it down at request time. So a bake on Linux must not silently drop the Windows tool set, or Windows dario users would lose those tools on the next release. `scripts/capture-and-bake.mjs` preserves tools from the previous bundle whose names are listed in `PLATFORM_ONLY_TOOLS` for a platform other than the baking host's. The merged set is re-sorted alphabetically to match CC's wire order. The runner can therefore bake from Linux without regressing Windows users.