--- name: pg-slow-query-diagnosis description: Diagnoses slow PostgreSQL queries by analyzing pg_stat_statements, execution plans, index opportunities, and table statistics. Guides the agent through a systematic investigation workflow using pgtuner-mcp tools to produce actionable optimization recommendations. --- # PostgreSQL Slow Query Diagnosis This skill guides systematic diagnosis and optimization of slow queries in PostgreSQL databases using the pgtuner-mcp server tools. ## When to Use This Skill Use this skill when the user: - Reports slow queries or poor query performance - Asks "why is my query slow?" or "how can I speed up this query?" - Wants to find and fix the top resource-consuming queries - Mentions high query latency, timeouts, or slow API responses - Asks to investigate `pg_stat_statements` data ## MCP Resources Available Before calling tools, the agent can read lightweight resources for quick context: - `pgtuner://docs/tools` -- Full tool reference documentation - `pgtuner://docs/workflows` -- Recommended workflow patterns - `pgtuner://query/{query_hash}/stats` -- Detailed stats for a specific query by `queryid` (from `get_slow_queries` output) - `pgtuner://table/{schema}/{table_name}/stats` -- Quick table statistics without running the full tool - `pgtuner://table/{schema}/{table_name}/indexes` -- Indexes on a specific table ## Related MCP Prompt This skill corresponds to the **`diagnose_slow_queries`** MCP Prompt and partially to **`query_tuning`**. If the user triggers either prompt, follow this skill's workflow. ## Prerequisites - The pgtuner-mcp server must be connected with a valid `DATABASE_URI` - Required: `pg_stat_statements` extension (check availability early -- if missing, tool will return an error) - Optional: `hypopg` extension for hypothetical index testing - Optional: `pglast` library on the server for SQL parsing in index recommendations ## Pre-Flight Checks Before starting the workflow, verify extension availability: - If `get_slow_queries` returns an error about `pg_stat_statements`, inform the user it must be enabled and provide setup instructions. - If the user wants hypothetical index testing, use `manage_hypothetical_indexes` with `action: "check"` to verify HypoPG is available. If not, skip Step 4 and rely on heuristic recommendations. ## Agent Decision Tree ``` User reports slow queries | +--> User has a SPECIFIC query? | YES --> Skip Step 1, go directly to Step 2 (analyze_query) | NO --> Start with Step 1 (get_slow_queries) | +--> Step 1 returns 0 results? | --> Check if pg_stat_statements was recently reset | --> ASK user: "pg_stat_statements has limited data. | Can you provide a specific query, or should we wait | for more workload data to accumulate?" | +--> Step 2 shows temp_blks_written > 0? | YES --> work_mem issue. Go to Step 6 (I/O) before indexes. | NO --> Continue to Step 3 (index recommendations) | +--> Step 3 shows < 10% improvement from indexes? | YES --> Query structure may be the issue. | Redirect to pg-query-rewrite skill. | NO --> Continue to Step 4 (verify with HypoPG) | +--> Step 2 shows row estimate mismatch > 10x? YES --> Run ANALYZE on the table FIRST, then re-check NO --> Continue with index recommendations ``` ## Diagnosis Workflow Follow these steps in order. At each step, analyze the output before proceeding. ### Step 1: Identify the Slowest Queries Call the `get_slow_queries` tool to retrieve the most expensive queries: ``` Tool: get_slow_queries Parameters: limit: 10 # Start with top 10 min_calls: 5 # Focus on queries called frequently enough to matter order_by: mean_time # Sort by average execution time ``` Also check for over-fetching queries (returning too many rows): ``` Tool: get_slow_queries Parameters: limit: 10 min_calls: 5 order_by: rows # Find queries returning the most rows ``` **What to look for:** - Queries with high `mean_time` (> 100ms warrants investigation) - Queries with high `total_time` (high total even if individual calls are fast) - High `rows` returned vs actual need (over-fetching -- use `order_by: rows`) - Queries with low `shared_blks_hit / (shared_blks_hit + shared_blks_read)` cache hit ratio - Queries with `temp_blks_written > 0` (spilling to disk -- `work_mem` issue) **Decision point:** If no slow queries are found, check if `pg_stat_statements` has been recently reset. Suggest the user wait for workload data to accumulate, or ask them to provide a specific query. **Cross-reference with resources:** For any interesting `queryid` found, read `pgtuner://query/{queryid}/stats` for deeper per-query statistics without re-running the tool. ### Step 2: Analyze Execution Plans For each problematic query identified in Step 1, run EXPLAIN ANALYZE: ``` Tool: analyze_query Parameters: query: "" analyze: true # Execute the query to get actual timing buffers: true # Show buffer usage (I/O information) settings: true # Show GUC settings that affect the plan verbose: false # Keep output readable format: "text" # Human-readable indented plan ``` The `settings: true` parameter reveals if session-level settings (e.g., modified `work_mem`) affect the plan. The `format: "text"` produces the classic readable plan; use `format: "json"` for structured programmatic analysis when needed. **What to look for in the execution plan:** - **Sequential scans on large tables** (> 10,000 rows): Indicates missing indexes - **Row estimate mismatches**: When `actual rows` differs from `estimated rows` by 10x or more, statistics are stale (run `ANALYZE`) - **Nested Loop with high iterations**: Consider if a Hash Join or Merge Join would be better - **Sort operations using disk** (`external merge`): Increase `work_mem` or add an index for the sort key - **Hash Batches > 1**: Hash table spilling to disk, consider increasing `work_mem` - **Bitmap Heap Scan with many recheck cond rows**: The index is not selective enough **Present findings** as a table: | Issue | Location in Plan | Impact | Recommended Fix | |-------|-----------------|--------|-----------------| | Sequential scan | Seq Scan on orders | High - 2.3M rows | Create index on filter columns | | Row estimate mismatch | ... | Medium | Run ANALYZE on table | ### Step 3: Get Index Recommendations Use the index advisor to find optimization opportunities: ``` Tool: get_index_recommendations Parameters: workload_queries: ["", ""] # The slow queries from Step 1 max_recommendations: 10 min_improvement_percent: 10 include_hypothetical_testing: true ``` **What to look for:** - Indexes with > 50% estimated improvement should be prioritized - Check if recommended indexes overlap with existing ones - Consider the write overhead: each index slows INSERT/UPDATE/DELETE ### Step 4: Verify with Hypothetical Indexes For the most promising index recommendations, verify using HypoPG: ``` Tool: explain_with_indexes Parameters: query: "" hypothetical_indexes: - table: "orders" columns: ["customer_id", "created_at"] index_type: "btree" analyze: false # Use false for hypothetical testing ``` **What to look for:** - Cost reduction percentage (aim for > 30% to justify the index) - Plan changes: Sequential Scan -> Index Scan or Bitmap Index Scan - If improvement is minimal, the index may not be the right solution ### Step 5: Check Table Health For tables involved in slow queries, check their maintenance status: ``` Tool: get_table_stats Parameters: table_name: "" schema_name: "public" include_indexes: true ``` **What to look for:** - **High dead tuple ratio** (> 10%): Table needs `VACUUM` - **Stale statistics** (`last_analyze` is old): Run `ANALYZE` - **Sequential scan ratio** (seq_scan >> idx_scan): Missing indexes - **Large table size** with no indexes: Index candidates ### Step 6: Check Disk I/O (Optional, for complex cases) If buffer/IO issues were found in the execution plan: ``` Tool: analyze_disk_io_patterns Parameters: analysis_type: "all" top_n: 20 ``` **What to look for:** - Low buffer cache hit ratio (< 99% is a warning for OLTP) - Tables with high heap_blks_read (cold data) - Excessive temp file usage (increase `work_mem`) ## Output Format Present the final diagnosis as: ### Summary Brief overview of findings (2-3 sentences). ### Critical Issues (Fix Immediately) Numbered list of urgent issues with specific remediation SQL. ### Recommended Optimizations Prioritized list with: - **Issue**: What was found - **Impact**: Estimated improvement - **Fix**: Exact SQL command (CREATE INDEX, ANALYZE, VACUUM, config change) ### Configuration Suggestions Any `postgresql.conf` changes that would help (e.g., `work_mem`, `effective_cache_size`). ## Common Patterns and Solutions | Symptom | Likely Cause | Solution | |---------|-------------|----------| | Seq Scan on large table | Missing index | CREATE INDEX on WHERE/JOIN columns | | Row estimate 1 vs actual 50000 | Stale statistics | ANALYZE table_name | | Sort Method: external merge | work_mem too low | SET work_mem = '256MB' (session) | | Nested Loop (actual loops=10000) | Bad join strategy | Check join column indexes, statistics | | Hash Batches: 16 | work_mem too low for hash | Increase work_mem | | Bitmap Heap Scan (lossy) | Index not selective | Consider composite index | ## Iterative Verification After applying any fix, always verify: 1. **After creating an index**: Re-run `analyze_query` on the slow query to confirm the plan now uses the new index and timing improved. 2. **After running ANALYZE**: Re-run `analyze_query` to confirm row estimates are now accurate. 3. **After increasing work_mem**: Re-run the query to confirm temp files are eliminated (check `buffers: true` output). 4. **After query rewrite**: Compare before/after execution plans and timing. ## When to Stop and Ask the User - **Before `analyze_query` with `analyze: true` on write queries**: "This will execute the query. For INSERT/UPDATE/DELETE, the tool wraps it in READ ONLY/ROLLBACK for safety, but please confirm you want to proceed." - **If no clear performance issue is found**: "The database-side query performance appears normal. The issue may be in the application layer (N+1 queries, missing caching, network latency). Would you like me to investigate further?" - **If recommendations require superuser**: "Creating this extension requires superuser privileges. Do you have access, or should I provide instructions for your DBA?" - **If index recommendations have high write overhead**: "This table has X inserts/sec. Adding Y new indexes will increase write latency. Is this acceptable?" ## Cross-References to Other Skills - **Query needs structural rewrite** (subqueries, CTEs, OR patterns): Use `pg-query-rewrite` skill - **Index optimization beyond this query**: Use `pg-index-optimization` skill - **work_mem / configuration tuning needed**: Use `pg-config-tuning` skill - **I/O patterns need investigation**: Use `pg-io-deep-dive` skill - **Table bloat causing slow scans**: Use `pg-bloat-analysis` skill ## PostgreSQL Version Notes | Feature | Version | Impact on This Workflow | |---------|---------|------------------------| | `pg_stat_statements` `total_exec_time` | PG13+ (renamed from `total_time`) | Column name may differ | | `compute_query_id` setting | PG14+ | Required for `queryid` in `pg_stat_statements` | | `pg_stat_statements.track_planning` | PG13+ | Enables planning time tracking | | CTE inlining | PG12+ | Non-recursive CTEs may be optimized differently | | Memoize plan node | PG14+ | Improves nested loop performance | | Incremental sort | PG13+ | Optimizer may choose different sort strategies | ## Important Notes - `analyze_query` with `analyze: true` actually executes the query. Use caution with INSERT/UPDATE/DELETE statements (they are wrapped in READ ONLY transaction for safety). - Always verify index recommendations with `explain_with_indexes` before applying in production. - Consider write amplification: each new index adds overhead to every write operation on the table. - If `pg_stat_statements` data is sparse, ask the user to collect more workload data first. - The `PGTUNER_EXCLUDE_USERIDS` environment variable can filter out monitoring user queries from `get_slow_queries` results. Suggest configuring it if internal monitoring queries pollute the results. ## Production Safety - All diagnostic tools (`get_slow_queries`, `get_table_stats`, `analyze_disk_io_patterns`) are read-only. - `analyze_query` with `analyze: true` executes the query but wraps it in `BEGIN TRANSACTION READ ONLY` / `ROLLBACK`. - `explain_with_indexes` creates temporary HypoPG indexes that exist only in the session and are never persisted. - No tool in this workflow modifies data or schema. All CREATE INDEX / ANALYZE / VACUUM commands are recommendations for the user to execute.