---
title: WAL and Checkpoint Operations
description: Write-ahead log internals, checkpoint tuning, durability guarantees, and WAL disk management
tags: postgres, wal, checkpoints, durability, crash-recovery, fsync, operations
---

# WAL and Checkpoint Operations

## WAL Fundamentals

Write-Ahead Logging: changes are logged to `pg_wal/` **before** the data files are modified. WAL segments are 16MB (fixed at initdb). On COMMIT, PostgreSQL fsyncs the WAL to disk and returns success to the client — data files are updated lazily. WAL records are written for all changes, including uncommitted transactions and rollbacks.

**Never disable `fsync` in production** — a power loss without fsync risks unrecoverable data loss.

`wal_level`: `minimal` (crash recovery only), `replica` (default; replication + archiving), `logical` (logical replication).

## Dirty Pages and Checkpoints

A dirty page has been modified in shared_buffers but not yet written to the data files. A checkpoint flushes all dirty pages to disk and writes a checkpoint record to WAL; recovery only replays WAL written since the last checkpoint.

- `checkpoint_timeout` (default 5 min) and `max_wal_size` (default 1GB) — a checkpoint starts when whichever limit is hit first.
- `checkpoint_completion_target=0.9` spreads the I/O over 90% of the interval to avoid spikes.
- "Checkpoints are occurring too frequently" in the logs → increase `max_wal_size`.
- **Target: >90% of checkpoints should be time-based** (`num_timed` in `pg_stat_checkpointer`), not size-based (`num_requested`). If num_requested / (num_timed + num_requested) > 10%, raise `max_wal_size`. A query for this ratio is sketched at the end of this page.

## WAL Disk Management

Replication slots prevent WAL deletion even when standbys are offline — they can fill the disk. WAL archiving failures also block recycling. `max_wal_size` is a *soft* limit; WAL can grow beyond it under heavy load.

WAL size: `SELECT count(*) AS files, pg_size_pretty(sum(size)) AS total FROM pg_ls_waldir();`

Slot lag: `SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes FROM pg_replication_slots;`

## Checkpoint Monitoring

PG17+ moved checkpoint stats from `pg_stat_bgwriter` to `pg_stat_checkpointer` and renamed the columns.

`SELECT num_timed, num_requested, write_time, sync_time, buffers_written FROM pg_stat_checkpointer;`

Backend-direct writes (formerly `buffers_backend` in `pg_stat_bgwriter`) are now tracked in `pg_stat_io`:

`SELECT writes FROM pg_stat_io WHERE backend_type = 'client backend' AND object = 'relation';`

## Crash Recovery

On crash, PostgreSQL replays WAL from the last checkpoint. Longer checkpoint intervals → more WAL to replay → longer recovery. The trade-off: frequent checkpoints (faster recovery, more I/O) vs. infrequent checkpoints (less I/O, slower recovery). For most workloads, `checkpoint_timeout=5min` with `max_wal_size` tuned high enough to keep checkpoints time-based is the right balance. The sketches below tie these monitoring numbers together.
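
Checking the time-based vs. size-based target from the checkpoint section — a minimal sketch, assuming PostgreSQL 17+ (on older versions the equivalent counters are `checkpoints_timed` / `checkpoints_req` in `pg_stat_bgwriter`):

```sql
-- Share of checkpoints forced by WAL volume rather than checkpoint_timeout.
-- PG17+: counters live in pg_stat_checkpointer; adjust view/column names on older releases.
SELECT num_timed,
       num_requested,
       round(100.0 * num_requested / nullif(num_timed + num_requested, 0), 1)
           AS pct_requested
FROM pg_stat_checkpointer;
```

If `pct_requested` stays above roughly 10%, raising `max_wal_size` and re-checking after a few checkpoint cycles is the usual first step; on PG17+, `pg_stat_reset_shared('checkpointer')` clears the counters if you want a fresh baseline.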
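
The WAL-size and slot-lag queries from the disk-management section can be folded into one disk-pressure check — a sketch only; the layout and aliases are illustrative, and `pg_ls_waldir()` is restricted to superusers / `pg_monitor` members by default:

```sql
-- One-row WAL disk-pressure summary: total pg_wal size plus the furthest-behind slot.
-- pg_ls_waldir() requires superuser or pg_monitor membership unless EXECUTE is granted.
SELECT
    (SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir())            AS wal_total,
    (SELECT max(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
       FROM pg_replication_slots)                                     AS max_slot_lag_bytes;
```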
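
For the recovery-time trade-off, a rough way to see how much WAL a crash would have to replay right now is the distance from the last checkpoint's REDO point to the current WAL position — a sketch to run on the primary; the output alias is just illustrative:

```sql
-- Approximate WAL volume crash recovery would replay if the server died right now:
-- distance from the last checkpoint's REDO location to the current insert position.
SELECT pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(),
                           (SELECT redo_lsn FROM pg_control_checkpoint()))
       ) AS wal_since_last_checkpoint;
```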