# pulselog — adopter contract

A scheduled **external** watcher for the apps you run — the outside sibling to [flightlog](https://github.com/hamr0/flightlog) (which records errors from *inside* your app). You point a cron job or systemd timer at it. It has three modes — the triad **health** (is it up) · **stats** (how's it trending) · **backup** (is it safe):

- **health** — "is it up right now?" Probe HTTP/TCP/TLS/disk/backups/systemd on a schedule; **stay silent when green**; email **one** summary when something breaks.
- **digest** — "how is it trending?" Once a week, collect a few **foundational numbers you declare**, append one snapshot line to a history log, and email a week-over-week table — optionally with a flightlog error summary.
- **backup** — "is it safe?" Stage **curated DB dumps** (sqlite/postgres/mysql) plus static **includes** (certs, configs, keys) into one archive, tar atomically, enforce a size floor, and roll retention. A failed run **exits `1` (loud)**.

Every signal is **one JSON line** in flightlog's core dialect (`ts`, `kind`, …), so `tail`/`jq`/an uploader work across all your streams. Zero production dependencies (`node:*` + global `fetch`). No daemon, no SaaS, no telemetry.

This file is the complete contract: every option, all three modes, what pulselog deliberately does **not** do, the privacy model, and the gotchas. For a copy-paste, VPS-ready **deployment** walkthrough — mail deliverability (SPF/DKIM/PTR), health/digest/backup config, systemd timers, an off-box watchdog, and the optional flightlog pairing — see the [Implementation Guide](IMPLEMENTATION_GUIDE.md) (Part B is pulselog; it ships in the package and needs no flightlog).

> **Status:** `0.7.2` is published — all three modes (health + digest + **`backup`**)
> are on npm, and `0.7.2` ships the deployment
> **[Implementation Guide](IMPLEMENTATION_GUIDE.md)** in the package.
> `0.4.0` added a per-check `timeoutMs` and opt-in in-run `retries`, and a
> security pass (config-perms gate, backup dir/umask tightening, per-engine password env,
> name-escape guards). `0.4.1` refines the config-ownership gate to allow a **root-owned**
> config (not just self-owned), matching `ssh`. `0.6.0` aligns the `command` check's
> timeout reason with the others (`timeout after Ns`, not the misleading
> `exit 1 (timeout)`). `0.7.0` adds an opt-in **`alert.fallback`** sink (a second, out-of-band
> delivery path so a dead MTA can't silence the tool). Defaults are unchanged.

## What pulselog is and is NOT

- It is a **lightweight wrapper** of what your server/OS already offers (`curl`, `systemctl`, `df`, a SQL `count`), generalized into config-driven mechanism.
- It is **not** a daemon/scheduler (you bring cron/systemd), **not** a log aggregator/shipper/SIEM, **not** a metrics database/dashboard, **not** an uptime SaaS, **not** an alerting platform (one email, no paging/routing), **not** a backup *engine* (it wraps your dump + tars/rotates; off-host copy, encryption, and restore-testing stay the operator's job), and **not** a transport (it never uploads — shipping the JSONL is a separate layer you build).
- **Mechanism is in pulselog; policy and data are yours.** You choose which checks run and which numbers to watch; pulselog never invents either, and never stores anything you didn't ask it to.

## Which mode do I need?

| You want… | Mode | Cadence |
|---|---|---|
| To be emailed when the app/DB/cert/backup breaks | `health` | often (e.g. every 5 min) |
| A weekly "is it growing?" stats email + error summary | `digest` | weekly |
| Safe, rotated archives of your DBs + certs/configs | `backup` | nightly |

All read **one** config file (`pulselog.config.json` — one source of truth) with separate sections (`checks` / `digest` / `backup`); the mode flag picks which runs.
```
pulselog --config ./pulselog.config.json            # health (default)
pulselog --digest --config ./pulselog.config.json   # digest
pulselog --digest --dry-run --config …              # render the digest, don't send/append
```

---

## Health mode

```jsonc
{
  "output": {
    "file": "/var/lib/myapp/health.jsonl",    // its OWN file — never flightlog's errors.jsonl
    "maxBytes": 5000000,                      // rotate to .1 at this size; 0 disables
    "heartbeat": false                        // also log one "all ok" line per run
  },
  "alert": {
    "email": "ops@myapp.com",                 // omit → log only, no email
    "from": "alerts@myapp.com",
    "app": "myapp",
    "logTail": "/var/lib/myapp/errors.jsonl"  // optional: paste recent flightlog errors into the alert
  },
  "checks": [
    { "type": "http", "name": "api", "enabled": true, "url": "http://127.0.0.1:3000/api/health", "expectStatus": 200 },
    { "type": "tcp", "name": "db", "enabled": true, "host": "127.0.0.1", "port": 5432 },
    { "type": "ssl", "name": "cert", "enabled": true, "host": "myapp.com", "warnDays": 14 },
    { "type": "disk", "name": "disk", "enabled": true, "path": "/var/lib/myapp", "maxPercent": 85 },
    { "type": "file-age", "name": "backup", "enabled": true, "path": "/var/lib/myapp-backups", "maxAgeHours": 26, "pattern": ".sqlite", "recursive": true },
    { "type": "service", "name": "postfix", "enabled": false, "unit": "postfix.service" },
    { "type": "command", "name": "mailq", "enabled": false, "command": "sh", "args": ["-c", "test $(mailq | grep -c '^[A-F0-9]') -lt 50"] }
  ]
}
```

`enabled: false` switches a check off — each app turns on only what it needs.

| Check | Passes when | Key fields (defaults) |
|---|---|---|
| `http` | endpoint returns the expected **status code** | `url`, `expectStatus` (200), `timeoutMs` (5000) |
| `tcp` | host:port accepts a connection | `host`, `port`, `timeoutMs` (5000) |
| `ssl` | TLS cert is not near expiry | `host`, `port` (443), `warnDays` (14) |
| `disk` | path is below a usage threshold | `path`, `maxPercent` (85), `timeoutMs` (5000) |
| `file-age` | newest file in a dir is fresh (backups ran) | `path`, `maxAgeHours`, `pattern`, `recursive` (false — set true for date-stamped `daily//` layouts) |
| `service` | a systemd unit is `active` | `unit`, `timeoutMs` (5000) |
| `command` | any command exits `0` — the escape hatch | `command`, `args`, `timeoutMs` (10000) |

> `http` checks the **status code only** — by design. App-specific body assertions
> (e.g. a `/health` JSON field) go through `command` (`curl … | jq -e …`). pulselog
> core never grows body parsing.

> `service` tests `systemctl is-active` — correct for a **long-running unit** or an
> **armed timer** (active while waiting). It is **wrong for a `oneshot` `.service`**: a
> healthy oneshot finishes `inactive (dead)`, so `service` would report it DOWN. For
> "did the last oneshot/timer run **succeed**?", use `command`. `systemctl is-failed`
> exits **0 when the unit *is* failed**, so invert it through a shell (the `command`
> check is healthy on exit 0):
> `{ "type": "command", "command": "sh", "args": ["-c", "! systemctl is-failed --quiet my.service"] }`
> — healthy whenever the unit is **not** failed, which includes a clean
> dead-after-success. Add `systemctl show -p Result,ActiveExitTimestamp` if you also
> want last-run recency. pulselog core stays `is-active`; oneshot semantics live in
> your `command`.

### Retry — don't page on a transient blip

A single timed-out probe on a loaded or shared host shouldn't alert.
Set `retries` (default `0`) and `retryDelayMs` (default `1000`) to re-probe a **failing** check in the same run before it's recorded — per-check, or globally via a top-level `retry` block that each check can override:

```jsonc
{
  "retry": { "retries": 2, "retryDelayMs": 2000 },    // default for every check
  "checks": [
    { "type": "http", "name": "api", "url": "…" },    // inherits 2×/2s
    { "type": "service", "name": "worker", "unit": "worker.service",
      "retries": 0 },                                  // opt OUT for this one
    { "type": "tcp", "name": "db", "host": "…", "port": 5432,
      "retries": 4, "retryDelayMs": 500 }              // tune per service
  ]
}
```

- A check that **recovers** on a retry is treated as green (no line, no email). One that fails **every** attempt is recorded **once** (never one line per attempt), its reason noting `(after N attempts)`.
- **Stateless on purpose.** Retry decides whether a probe is *really* failing **within one run** — it never remembers failures across runs. "Page only after N consecutive *runs* fail" is alert **policy** and stays in the layer that consumes the JSONL (see the refusals); pulselog keeps no cross-run health state.
- Pair with per-check `timeoutMs` (now on every check incl. `service`/`disk`): loosen the timeout where a probe is legitimately slow, retry where it's flaky — different knobs.

On a failure it appends one JSONL line **per failing check** and sends **one** summary email. Silent on success.

> **`alert.logTail` carries payloads — by design.** When set, the alert email includes
> the **last 20 raw lines** of that file verbatim (messages, stacks, whatever it holds).
> That's the opposite stance from the digest's flightlog rollup (counts + names only):
> an *actionable* alert wants the detail, but it means the alert email may contain
> PII/secrets. Point it at a file you're willing to email, and send to a trusted
> recipient. Omit it and the alert stays summary-only.
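
The in-run retry rules above (per-check values override the top-level `retry` block, a recovery counts as green, and a persistent failure is recorded exactly once) can be sketched in a few lines of Node. This is an illustration only, not pulselog's source: `resolveRetry`, `runCheckWithRetry`, and the `probe()` result shape `{ ok, reason }` are hypothetical names, and it assumes the `N` in `(after N attempts)` counts total attempts (the first probe plus retries).

```javascript
// Sketch under stated assumptions: these helpers are hypothetical and only
// mirror the documented behaviour of `retries` / `retryDelayMs`.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Per-check settings win over the top-level `retry` block; documented
// defaults are retries = 0 and retryDelayMs = 1000.
function resolveRetry(check, globalRetry = {}) {
  return {
    retries: check.retries ?? globalRetry.retries ?? 0,
    retryDelayMs: check.retryDelayMs ?? globalRetry.retryDelayMs ?? 1000,
  };
}

// probe(check) is assumed to resolve to { ok: boolean, reason?: string }.
// Returns null when the check is green (nothing to record), or one failure
// record after every attempt in this run has failed. No state is kept
// between calls, so nothing is remembered across runs.
async function runCheckWithRetry(check, probe, globalRetry = {}, wait = sleep) {
  const { retries, retryDelayMs } = resolveRetry(check, globalRetry);
  const attempts = retries + 1;
  let last = { ok: false, reason: "not run" };
  for (let i = 1; i <= attempts; i++) {
    last = await probe(check);
    if (last.ok) return null; // recovered on this attempt: treated as green
    if (i < attempts) await wait(retryDelayMs);
  }
  const suffix = attempts > 1 ? ` (after ${attempts} attempts)` : "";
  return { name: check.name, reason: `${last.reason}${suffix}` };
}

// Example: a probe that always times out, retried twice with no delay.
runCheckWithRetry(
  { name: "api", retries: 2, retryDelayMs: 0 },
  async () => ({ ok: false, reason: "timeout after 5s" })
).then((record) => console.log(record));
```
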
---

## Digest mode

**pulselog asks you one question: _what foundational numbers do you want to watch weekly?_** Declare them as `metrics` — each a name and a `command` that prints one integer. Everything else (collect → snapshot → history → week-over-week → render → email) is built in. You never write a `stats.js` again.

```jsonc
{
  "digest": {
    "app": "addypin",
    "history": "/var/lib/addypin/stats.jsonl",   // pulselog writes + reads this (one line/week, the record)
    "email": "ops@addypin.com",                  // omit / --dry-run → print, no send
    "from": "alerts@addypin.com",
    "weeks": 4,                                  // rows in the table (default 4)
    "skipIfFlat": false,                         // true → no email when every Δ=0 and nothing flagged
    "metrics": [                                 // ← the ONLY per-app customization
      { "name": "users", "command": "sqlite3", "args": ["/var/lib/addypin/addypin.db", "select count(distinct customer) from pins"] },
      { "name": "pins", "command": "sqlite3", "args": ["/var/lib/addypin/addypin.db", "select count(*) from pins"] }
    ],
    "flightlog": { "file": "/var/lib/addypin/errors.jsonl", "groupBy": "name", "flagAtLeast": 20 }   // optional
  }
}
```

| Option | Default | Meaning |
|---|---|---|
| `app` | `"app"` | Label in the snapshot line, email subject, and header. |
| `history` | — | The snapshot JSONL pulselog appends to (one line/week) and reads back for the table. **Its own file.** |
| `email` / `from` | — | Recipient/sender. Omit → no email; the history line is the artifact. |
| `weeks` | `4` | How many weeks the table shows. |
| `skipIfFlat` | `false` | `true` → skip the email when every metric is unchanged vs last week **and** nothing is flagged. |
| `metrics[]` | — | `{ name, command?, args?, timeoutMs? }`. A metric with its own `command` prints **one integer**; anything else records `null` for that metric (noted, never fatal). Run **without a shell** (`command` + `args` array, like the health `command` check) — for a pipe/shell metric, use `"command": "sh", "args": ["-c", "… \| …"]`. A metric with **no** `command` is filled by name from `metricsCommand` (below). |
| `metricsCommand` | — | Optional. `{ command, args?, timeoutMs? }` — one command that prints a **flat JSON object of named integers** in a single pass; each `metrics[]` entry without its own `command` takes its value by `name` from that object. See "Batch metrics" below. |
| `flightlog` | — | Optional. `{ file, groupBy?, flagAtLeast? }` — see below. |

**The snapshot line** appended each week (the record — metrics *and* any error summary, kept for trend):

```json
{"ts":"2026-05-31T06:00:00Z","kind":"stats","app":"addypin","users":2,"pins":5,"errors":{"total_7d":31,"top":{"ApiTimeout":24,"SmtpAuthError":7},"flagged":["ApiTimeout"]}}
```

**The email** (rendered from `history`):

```
addypin weekly stats — 2026-W19 → 2026-W22
weeks in log: 4

week     | users  Δ | pins  Δ
2026-W22 |     3 +1 |    7 +2
2026-W21 |     2    |    5 +1
…

flightlog (last 7d): 31 errors. top: ApiTimeout×24, SmtpAuthError×7. ≥flag: ApiTimeout
```

### flightlog enrichment (optional)

If you point `digest.flightlog.file` at a flightlog `errors.jsonl`, the digest adds one line: the **7-day error count**, the **top error names with counts**, and a flag for any group whose 7-day count reached `flagAtLeast` (default 20). So you see *which* area is noisy (e.g. `ApiTimeout` vs `SmtpAuthError`), not just *that* something broke.

- `groupBy` (default `"name"`) — the field to group by. If your apps distinguish areas via flightlog **context** (e.g. `capture(err, { where: 'mail-auth' })`), set `groupBy: "where"`.
- **Counts and names only.** pulselog never copies error **messages or stacks** into the digest or email — those can carry payloads/PII. flightlog stays private on the box; you read the detail there.

### Batch metrics (one command, many numbers)

By default each metric is its own `command` → one integer. If a single pass already computes *several* numbers (e.g.
one scan over an event log yields `events`, `completed`, `pending`, `orgs`…), declare `metricsCommand` and let each metric pick its value by name instead of paying one spawn per metric:

```jsonc
"digest": {
  "app": "gitdone",
  "history": "/var/lib/gitdone/stats.jsonl",
  "metricsCommand": { "command": "node", "args": ["bin/stats.js", "--metrics-json"] },
  // stdout: {"events":42,"completed":18,"pending":3,"orgs":7}
  "metrics": [
    { "name": "events" },
    { "name": "completed" },
    { "name": "pending" },
    { "name": "orgs" }
  ]
}
```

- The command must print a **flat JSON object** (`{"name": , …}`) on stdout. An array, a scalar, non-JSON, a non-zero exit, or a timeout records `null` for every batch-sourced metric — same "never sinks the run" guard as a single metric.
- **You still declare every name.** Only names in `metrics[]` reach the snapshot — a key the batch emits but you didn't declare is ignored; a name you declared but the batch omits (or whose value isn't a whole number — a float, bool, or string) records `null`. The "store only what you read" contract is unchanged; this only amortizes an expensive computation.
- **Mix freely.** A `metrics[]` entry *with* its own `command` runs that command (and overrides any same-named batch key), so you can pair one batch pass with a couple of standalone metrics.

### Cadence vs the table (run daily if you want)

The table groups history **by ISO week — the latest snapshot in each week wins** — so cadence and the table are decoupled. Run `--digest` **daily** and you get finer history (every line is kept; `history` is never rotated) *and* free proof-of-life, while the WoW table still collapses to one row per week. The week's row simply reflects its most recent run. (If you run daily for proof-of-life, leave `skipIfFlat` **off** — otherwise an unchanged day sends no email, though the history line is still written.)

---

## Backup mode

> Shipped in `0.2.0`. `pulselog --backup --config ./backup.config.json`.

One scheduled run stages your state into a fresh, private staging dir (`$PULSELOG_STAGE`), tars it to one archive, **publishes atomically**, enforces a size floor, and rolls retention. pulselog owns the **envelope**; you declare the **sources**. At least one source (`db` / `include` / `command`) is required.

```jsonc
"backup": {
  "app": "myapp",
  "dir": "/var/lib/myapp/backups",            // archives live here, its OWN dir → -.tar.gz
  "name": "myapp-backup",                     // archive prefix (also the retention key)
  "db": [                                     // (A) curated safe-default dumps — see the table below
    { "engine": "sqlite", "path": "/var/lib/myapp/app.db", "name": "app" },
    { "engine": "sqlite", "path": "/var/lib/myapp/cache.db", "optional": true },   // absent file → skip+record
    { "engine": "postgres", "url": "postgres://u@/app", "passwordEnv": "PGPASSWORD" },
    { "engine": "mysql", "url": "mysql://u@host:3306/app", "passwordEnv": "MYSQL_PWD" }
  ],
  "include": [                                // (B) static copy into the stage (symlinks preserved)
    "/etc/letsencrypt",                       // string = REQUIRED (missing → fail loud, exit 1)
    { "path": "/etc/myapp/optional.d", "optional": true }   // {path,optional} = skip + record if missing
  ],
  "command": "node", "args": ["dump.mjs"],    // (C) opt-out: your own dump writes into $PULSELOG_STAGE
  "timeoutMs": 600000,                        // optional cap on the command
  "keepLast": 7,                              // retention: keep newest N …
  "keepDays": 30,                             // … and/or newer-than-D days (≥1 required; union — never deletes what a rule keeps)
  "minBytes": 1024,                           // integrity floor — a smaller archive fails the run (no publish, no rotation)
  "history": "/var/lib/myapp/backup.jsonl",   // one kind:"backup" line per run, its OWN file (0600)
  "email": "ops@example.com", "from": "alerts@myapp.com"   // alert on FAILURE only; omit → the line is the record
}
```

**Built-in `db` engines** — the safe default *encodes the consistency opinion* (the value over a hand-rolled command).
The tool must be on `PATH` (else the run fails loud) except `sqlite`, which is in-process:

| `engine` | pulselog runs | Output in stage | Connection |
|---|---|---|---|
| `sqlite` | `node:sqlite` `VACUUM INTO` (online, checkpoints WAL; **needs Node ≥ 22.5**) | `