--- title: Monitoring and health description: Read PgQue's observability functions, and know what to alert on — ticker lag, consumer lag, a stuck consumer, and dead-letter depth. --- PgQue exposes its health through a small set of read-only functions. This page explains the columns that matter operationally, the one failure mode you must catch early — a stuck consumer that blocks table rotation — and the queries to wire into your monitoring. All of the `get_*_info` functions and `pgque.version()` are granted to `pgque_reader`, so a read-only monitoring role can run everything here. `pgque.status()` is admin-only. For role setup see [Installation and operations](installation.md); for vocabulary see [Concepts](concepts.md). The examples assume: ```bash PAGER=cat psql --no-psqlrc -d yourdb ``` ## The observability surface ### pgque.status() — is the engine wired up `pgque.status()` returns `(component, status, detail)` rows. It is the one-stop check that the ticker and maintenance jobs are scheduled. If `pg_cron` is installed and `pgque.start()` has run, you will see `ticker` and `maintenance` rows with a `scheduled` status and their cron job ids. This function is admin-only. ```sql select * from pgque.status(); ``` If `status()` shows nothing scheduled, no ticks are being created, and every `pgque.receive()` returns zero rows forever. That is the first thing to rule out. ### pgque.get_queue_info([queue]) — is the queue flowing Call with no argument for all queues, or pass a queue name for one. The operationally important output columns: | column | meaning | watch for | |---|---|---| | `ticker_lag` | wall time since this queue's last tick | grows without bound when the ticker is not running | | `ev_per_sec` | recent event throughput (float8, from the last ~20 ticks) | sudden drop to zero, or unexpected spikes | | `ev_new` | events sent but not yet covered by a tick | climbs and stays high if ticking stalls | | `last_tick_id` | id of the most recent tick | should keep advancing | | `queue_ticker_paused` | whether ticking is paused on this queue | `true` means no delivery by design | | `queue_ticker_max_count` / `queue_ticker_max_lag` / `queue_ticker_idle_period` | the tick-trigger thresholds | context for interpreting `ticker_lag` | | `queue_rotation_period` / `queue_switch_time` | rotation period and last rotation time | stale `queue_switch_time` hints rotation is stuck | ```sql select queue_name, ticker_lag, ev_per_sec, ev_new, last_tick_id from pgque.get_queue_info('orders'); ``` `ticker_lag` is the single most useful queue signal. With the default settings, the queue ticks at least every `ticker_idle_period` (1 minute) even when idle, so a `ticker_lag` that keeps climbing past that means the ticker has stopped. ### pgque.get_consumer_info([queue[, consumer]]) — is the consumer keeping up Call with no arguments for every consumer on every queue, with a queue name to scope to one queue, or with both to inspect a single consumer. Output columns: | column | meaning | watch for | |---|---|---| | `lag` | age of the events the consumer is currently positioned at | grows when the consumer falls behind | | `last_seen` | elapsed time since the consumer last processed a batch | grows when the consumer has stopped calling `receive` | | `pending_events` | events waiting past the consumer's position, not yet consumed | a growing backlog | | `last_tick` | tick id of the consumer's last processed tick | should advance; a frozen value is the stuck-consumer signal | | `current_batch` | active batch id, or NULL if none open | a long-lived non-NULL value means a batch is never being acked | | `next_tick` | final tick of the active batch, if one is open | — | ```sql select queue_name, consumer_name, lag, last_seen, pending_events, last_tick from pgque.get_consumer_info('orders', 'processor'); ``` In a healthy system `lag` and `last_seen` both stay low and `pending_events` stays near zero. A consumer whose `last_tick` stops advancing while `last_seen` keeps climbing is stuck — see the next section. ### pgque.get_batch_info(batch_id) — inspect one in-flight batch Given a batch id (the `batch_id` on a `pgque.message`, or `current_batch` from `get_consumer_info`), this returns one row describing the batch: `queue_name`, `consumer_name`, `batch_start`, `batch_end`, `prev_tick_id`, `tick_id`, `lag`, `seq_start`, `seq_end`. Use it to debug a specific batch that seems stalled — `lag` is `now()` minus the batch's end-tick time, and `seq_end - seq_start` approximates the batch's event span. ```sql select queue_name, consumer_name, lag, seq_start, seq_end from pgque.get_batch_info(12345); ``` ## What to alert on ### The critical one: a stuck consumer blocks rotation This is the headline operational risk in PgQue, and it is worth understanding before any other alert. PgQue stores events in a set of inherited tables and reclaims space by **rotating** them: periodically it advances to the next table in the set and `TRUNCATE`s the one it is reusing. Rotation is the only thing that frees disk — there are no per-row deletes. Rotation is gated on the slowest consumer. Step one of rotation finds the lowest `sub_last_tick` across all subscriptions on the queue; if the slowest consumer still needs the table about to be truncated, rotation returns zero and skips. A consumer that has stopped — crashed, deadlocked, deploy gone wrong, or simply far too slow — pins that lowest tick and **blocks the TRUNCATE indefinitely.** The event tables then grow without bound until the consumer recovers or is unsubscribed. So the alert that protects your disk is not a disk alert — it is a stuck-consumer alert. Catch it by watching `get_consumer_info`: - `last_seen` keeps growing for a consumer that should be active, and - its `last_tick` is not advancing while `last_tick_id` on the queue is, - typically with `pending_events` climbing alongside. When you confirm a consumer is wedged and will not come back, unsubscribe it so rotation can proceed: ```sql select pgque.unsubscribe('orders', 'dead_consumer'); ``` (Or `pgque.drop_queue('orders', true)` to unregister all consumers, if you are tearing the queue down.) A dead consumer that you do not intend to restart must be unsubscribed, or it will hold the queue's storage forever. ### Threshold table Frame these relatively — PgQue ships no SLA. Alert on trends across several sampling intervals, not on a single reading, and tune absolute thresholds to your own tick rate and traffic. | signal | source | alert when | why it matters | |---|---|---|---| | ticker lag | `get_queue_info.ticker_lag` | climbs and stays above `ticker_idle_period` (default 1 minute) across intervals | ticker not running → no batches → no delivery | | consumer lag | `get_consumer_info.lag` / `pending_events` | `lag` and `pending_events` keep growing across intervals | a consumer is falling behind real-time | | stuck consumer | `get_consumer_info.last_seen` + frozen `last_tick` | `last_seen` grows while `last_tick` stays put and the queue's `last_tick_id` advances | pins the lowest tick → blocks `TRUNCATE` rotation → event tables grow unbounded (the critical one) | | DLQ depth | `dlq_inspect` row count / `pgque.dead_letter` | the dead-letter backlog grows or is non-empty when you expect zero | events are exhausting retries; a downstream is failing | ### Dead-letter depth Events that exhaust their retries (5 by default) land in `pgque.dead_letter`. A growing dead-letter backlog means a downstream is failing repeatedly. Count it two ways — directly on the table, or via `dlq_inspect` (both granted to `pgque_reader`): ```sql -- depth per queue, straight from the table select dl_queue_id, count(*) as dlq_depth from pgque.dead_letter group by dl_queue_id order by dlq_depth desc; -- inspect the most recent dead-lettered events for one queue select dl_id, ev_id, dl_time, dl_reason, ev_type from pgque.dlq_inspect('orders', 20); ``` To replay or purge dead-letter entries, see the DLQ functions in the [Reference](reference.md) and the patterns in [Examples](examples.md). ## Read-only monitoring queries Everything below runs as `pgque_reader`. Confirm the installed version: ```sql select pgque.version(); ``` Queue health across all queues at a glance: ```sql select queue_name, ticker_lag, ev_per_sec, ev_new, last_tick_id from pgque.get_queue_info() order by ticker_lag desc; ``` Every consumer's lag and liveness, worst first: ```sql select queue_name, consumer_name, lag, last_seen, pending_events, last_tick from pgque.get_consumer_info() order by last_seen desc nulls last; ``` Stuck-consumer hunt — join consumer position against the queue's latest tick so a frozen `last_tick` stands out against an advancing `last_tick_id`: ```sql select c.queue_name, c.consumer_name, c.last_seen, c.last_tick, q.last_tick_id, q.last_tick_id - c.last_tick as ticks_behind, c.pending_events from pgque.get_consumer_info() c join pgque.get_queue_info() q using (queue_name) order by ticks_behind desc nulls last; ``` Dead-letter depth per queue: ```sql select dl_queue_id, count(*) as dlq_depth, max(dl_time) as latest from pgque.dead_letter group by dl_queue_id order by dlq_depth desc; ``` ## Related - [Concepts](concepts.md) — tick, batch, rotation, and the snapshot rule. - [Installation and operations](installation.md) — `pg_cron` setup, the ticker cadence, and roles. - [Latency and tuning](latency-and-tuning.md) — how `tick_period_ms` and the ticker thresholds trade latency against overhead. - [Reference](reference.md) — full signatures, return columns, and role grants. - [Examples](examples.md) — DLQ replay, fan-out, and exactly-once patterns.