---
name: pg-health-check
description: Performs comprehensive PostgreSQL database health assessments covering connections, cache performance, locks, replication, wait events, active queries, and configuration review. Produces a scored health report with prioritized recommendations.
---

# PostgreSQL Health Check

This skill guides a comprehensive health assessment of a PostgreSQL database using pgtuner-mcp tools, producing a scored report with prioritized action items.

## When to Use This Skill

Use this skill when the user:

- Asks for a database health check or status report
- Reports general performance degradation without specific symptoms
- Wants a routine database checkup or audit
- Asks "is my database healthy?" or "what should I fix first?"
- Is setting up monitoring or wants to establish a baseline
- Mentions connection issues, high CPU, or unexplained slowness

## MCP Resources Available

Before running full tool scans, the agent can use targeted resources:

- `pgtuner://health/{check_type}` -- Targeted health checks: `connections`, `cache`, `locks`, `replication`, `bloat`, `all` (lightweight, avoids running the full health tool)
- `pgtuner://settings/{category}` -- Settings by category: `memory`, `checkpoint`, `wal`, `autovacuum`, `connections`
- `pgtuner://docs/workflows` -- Recommended workflow patterns

**Smart resource usage:** After the full health check in Step 1, use targeted resources for follow-up instead of re-running the full tool. For example, if the locks dimension scores low, read `pgtuner://health/locks` for quick re-checks after remediation.

## Related MCP Prompt

This skill corresponds to the **`health_check`** MCP Prompt. If the user triggers that prompt, follow this skill's workflow.

## Prerequisites

- The pgtuner-mcp server must be connected with a valid `DATABASE_URI`
- Required: `pg_stat_statements` extension
- Recommended: monitoring user with access to `pg_stat_activity`, `pg_stat_bgwriter`, `pg_locks`

## Agent Decision Tree

```
User asks for health check
  |
  +--> Quick check or full check?
  |    QUICK --> Read pgtuner://health/all resource (fast, no tool call)
  |    FULL  --> Start with Step 1 (check_database_health)
  |
  +--> After Step 1, which dimensions scored low?
       |
       +--> Connections < 90 --> Prioritize Step 2, then pg-connection-analysis skill
       +--> Cache < 90      --> Prioritize Step 5 (I/O), then pg-io-deep-dive skill
       +--> Locks < 90      --> Prioritize Step 2 + Step 3, then pg-lock-diagnosis skill
       +--> Wraparound < 90 --> URGENT: check pg-vacuum-tuning skill
       +--> Checkpoints < 90 --> Prioritize Step 5, then pg-config-tuning skill
       +--> All >= 90       --> Summary report, suggest pg-performance-baseline
```

## Health Check Workflow

### Step 1: Overall Health Assessment

```
Tool: check_database_health
Parameters:
  include_recommendations: true
  verbose: true
```

**What this checks (8 health dimensions):**
1. **Connections**: Usage ratio vs `max_connections`
2. **Cache hit ratio**: `shared_buffers` effectiveness
3. **Lock contention**: Blocking locks and deadlocks
4. **Replication**: Lag and slot status
5. **Transaction wraparound**: XID age approaching limits
6. **Disk usage**: Tablespace utilization
7. **Background writer**: Checkpoint and bgwriter efficiency
8. **Checkpoint performance**: Timing and frequency

**Interpret the scores:**
- **90-100 (Healthy)**: No immediate action needed
- **70-89 (Warning)**: Issues that should be addressed soon
- **< 70 (Critical)**: Immediate attention required

Record the overall score and any dimensions scoring below 90.

### Step 2: Active Query Analysis

```
Tool: get_active_queries
Parameters:
  include_idle: true          # Include idle-in-transaction
  include_system: false
  min_duration_seconds: 0     # Show all active queries
```

**What to look for:**

| Condition | Severity | Action |
|-----------|----------|--------|
| Query running > 5 minutes | Warning | Investigate the query, may need optimization |
| Query running > 30 minutes | Critical | Consider canceling unless it's a known batch job |
| idle in transaction > 60s | Warning | Application not closing transactions properly |
| idle in transaction > 5 min | Critical | Leaking connections, blocks vacuum |
| Blocked queries waiting on locks | Critical | Identify the blocker and resolve |
| Connection count > 80% of max | Warning | Add connection pooling (PgBouncer) |

### Step 3: Wait Event Analysis

```
Tool: analyze_wait_events
Parameters:
  active_only: true
```

**Interpret wait event types:**

| Wait Type | Meaning | Common Solutions |
|-----------|---------|-----------------|
| **Lock** | Row-level or table-level contention | Optimize transactions, reduce lock duration |
| **IO/DataFileRead** | Reading data from disk | Increase `shared_buffers`, add RAM, or optimize queries |
| **IO/WALWrite** | WAL write bottleneck | Use faster disk for pg_wal, tune `wal_buffers` |
| **BufferPin** | Buffer contention | Reduce concurrent access to hot pages |
| **LWLock/buffer_mapping** | Shared buffer contention | Increase `shared_buffers` or reduce concurrent access |
| **Client/ClientRead** | Waiting for client response | Network latency or slow application |
| **Activity/WalSenderMain** | Replication sender idle | Normal for streaming replication |

### Step 4: Configuration Review

```
Tool: review_settings
Parameters:
  category: "all"
  include_all_settings: false
```

**Key settings to validate:**

| Setting | Rule of Thumb | Red Flag |
|---------|--------------|----------|
| `shared_buffers` | 25% of RAM | < 128 MB |
| `effective_cache_size` | 50-75% of RAM | < 1 GB |
| `work_mem` | 32-256 MB depending on `max_connections` | < 4 MB |
| `maintenance_work_mem` | 256 MB - 2 GB | < 64 MB |
| `checkpoint_completion_target` | 0.9 | < 0.7 |
| `random_page_cost` | 1.1 for SSD, 4.0 for HDD | Default 4.0 on SSD |
| `effective_io_concurrency` | 200 for SSD, 2 for HDD | Default 1 on SSD |
| `max_connections` | Based on actual need | > 200 without pooler |
| `autovacuum` | on | off |
| `wal_level` | replica or logical | minimal (can't do replication) |
| `max_wal_size` | 2-8 GB | < 1 GB (frequent checkpoints) |

### Step 5: Disk I/O Assessment (If health score < 90 or IO waits detected)

Use targeted analysis types based on which health dimension scored low:

For cache-related issues:
```
Tool: analyze_disk_io_patterns
Parameters:
  analysis_type: "buffer_pool"
  top_n: 20
```

For checkpoint-related issues:
```
Tool: analyze_disk_io_patterns
Parameters:
  analysis_type: "checkpoints"
  top_n: 10
```

For a comprehensive I/O overview:
```
Tool: analyze_disk_io_patterns
Parameters:
  analysis_type: "all"
  top_n: 20
```

**What to look for:**
- Buffer pool hit ratio < 99% for OLTP workloads
- Tables with high `heap_blks_read` relative to `heap_blks_hit`
- Excessive temp file usage (queries spilling to disk)
- Backend fsync count > 0 (not all writes going through checkpointer)
- Requested checkpoints > timed checkpoints (increase `max_wal_size`)

### Step 6: Quick Bloat Scan (If vacuum metrics are concerning)

```
Tool: get_bloat_summary
Parameters:
  schema_name: "public"
  top_n: 10
```

**What to look for:**
- Tables with dead tuple ratio > 10%
- Tables approaching transaction wraparound
- Large tables that haven't been vacuumed recently

## Output Format

### Health Report Card

| Dimension | Score | Status | Key Finding |
|-----------|-------|--------|-------------|
| Connections | 95 | Healthy | 23/100 connections used |
| Cache | 99.8 | Healthy | Cache hit ratio: 99.8% |
| Locks | 85 | Warning | 2 blocking locks detected |
| Replication | 100 | Healthy | No replication configured |
| Wraparound | 92 | Healthy | Oldest XID age: 450M |
| Disk | 78 | Warning | /data at 82% capacity |
| Background Writer | 88 | Warning | High buffer allocation rate |
| Checkpoints | 65 | Critical | 40% requested checkpoints |
| **Overall** | **88** | **Warning** | |

### Critical Issues (Fix Now)

Numbered list with specific remediation steps and SQL commands.

### Warnings (Fix Soon)

Issues that should be addressed within days/weeks.

### Observations (Monitor)

Things that are currently OK but worth watching.

### Recommended Configuration Changes

```sql
-- Present as ALTER SYSTEM commands with explanations
ALTER SYSTEM SET shared_buffers = '4GB';        -- Currently 128MB, server has 16GB RAM
ALTER SYSTEM SET max_wal_size = '4GB';           -- Reduce checkpoint frequency
-- Requires restart: shared_buffers
-- Reload only: max_wal_size (SELECT pg_reload_conf();)
```

## Health Check Cheatsheet

### Quick Triage Priority
1. **Connection saturation** > 90%: Immediate risk of outage
2. **Transaction wraparound** XID age > 1 billion: Risk of forced shutdown
3. **Blocking locks** > 30 seconds: Active user impact
4. **Long idle-in-transaction**: Blocks autovacuum, causes bloat
5. **Cache hit ratio** < 95%: Widespread performance degradation
6. **Checkpoint frequency**: Too many requested checkpoints waste I/O

### When to Escalate
- XID age > 1.5 billion: Emergency, run manual VACUUM FREEZE
- Connections at 100%: Emergency, consider `pg_terminate_backend()` for idle connections
- Replication lag growing continuously: Replica may never catch up
- OOM killer active: Reduce `shared_buffers` or `work_mem`

## Important Notes

- The health check is read-only and safe to run on production databases
- Health scores are relative: a score of 85 is acceptable for many workloads
- Run health checks regularly (weekly) to detect trends
- Compare results against previous baselines for meaningful trending
- Some settings require a PostgreSQL restart to take effect (marked with `(restart)` in `pg_settings`)

## Iterative Verification

After applying remediation for any health dimension:

1. Read `pgtuner://health/{check_type}` for the specific dimension to verify improvement
2. For settings changes that require reload: verify with `SHOW setting_name` after `pg_reload_conf()`
3. For lock contention fixes: re-run `get_active_queries` with `include_idle: true` to confirm

## When to Stop and Ask the User

- **If all dimensions score > 90**: "Your database is healthy. Would you like me to generate a baseline report for future comparison?"
- **If wraparound risk is detected (XID age > 1 billion)**: "URGENT: Transaction wraparound risk detected. This requires immediate VACUUM FREEZE. Do you want me to provide the emergency procedure?"
- **If configuration changes require restart**: "The recommended `shared_buffers` change requires a PostgreSQL restart. When is your next maintenance window?"
- **If the issue is outside database scope** (e.g., network, application): "The database health metrics look fine. The performance issue may be in the application layer or infrastructure."

## Cross-References to Other Skills

- **Deep connection analysis**: Use `pg-connection-analysis` skill
- **Lock contention investigation**: Use `pg-lock-diagnosis` skill
- **I/O deep-dive**: Use `pg-io-deep-dive` skill
- **Configuration tuning**: Use `pg-config-tuning` skill
- **Vacuum/wraparound issues**: Use `pg-vacuum-tuning` skill
- **Bloat detected**: Use `pg-bloat-analysis` skill
- **Establishing a baseline**: Use `pg-performance-baseline` skill

## PostgreSQL Version Notes

| Feature | Version |
|---------|---------|
| `pg_stat_io` (detailed I/O stats) | PG16+ |
| `pg_stat_wal` (WAL statistics) | PG14+ |
| `pg_stat_progress_vacuum` | PG9.6+ |
| `idle_in_transaction_session_timeout` | PG9.6+ |
| `client_connection_check_interval` | PG14+ |

## Production Safety

- All tools used in this workflow are read-only
- No data or configuration is modified
- Health checks can be run during peak traffic without impact