---
name: altinity-expert-clickhouse-kafka
description: Diagnose ClickHouse Kafka engine health, consumer status, thread pool capacity, and consumption issues. Use for Kafka lag, consumer errors, and thread starvation.
---

## Diagnostics

Run all queries from the file checks.sql and analyze the results.

---

## Interpreting Results

### Consumer Health

Check if consumers are stuck by comparing exception time vs activity times:

- `last_exception_time >= last_poll_time` OR `last_exception_time >= last_commit_time` → consumer stuck on error, not progressing
- Otherwise → consumer healthy

The `exceptions` column is a tuple of arrays with matching indices: `exceptions.time[-1]` and `exceptions.text[-1]` give the most recent error.

### Thread Pool Capacity

- `kafka_consumers > mb_pool_size` → thread starvation: consumers are waiting for available threads
- Fix: increase `background_message_broker_schedule_pool_size` (default: 16)
- Sizing: total Kafka + RabbitMQ/NATS consumers + 25% buffer

### Slow Materialized Views (Poll Interval Risk)

- MV avg duration > 30s → consumer may exceed `max.poll.interval.ms` and get kicked from the group
- MV executions with error status → likely consumer rebalances (consumer kicked, MV interrupted mid-batch)
- **Most common root cause for slow MVs:** multiple `JSONExtract` calls re-parsing the same JSON blob
- **Fix:** rewrite to one-pass `JSONExtract(json, 'Tuple(...)') AS parsed` + `tupleElement()`; see [troubleshooting.md](troubleshooting.md)

### Pool Utilization Trends (12h)

- Sustained high values near pool size → capacity pressure
- Spikes correlating with lag → temporary overload
- Flat zero → Kafka consumers may not be active

---

## Advanced Diagnostics

For deeper investigation, run queries from advanced_checks.sql:

- **Consumer exception drill-down**: filter to a specific problematic Kafka table
- **Consumption speed measurement**: snapshot-based rate calculation
- **Topic lag via rdkafka_stat**: total lag per table and
per-partition breakdown
- **Broker connection health**: connection state, errors, disconnects

**Important:** `rdkafka_stat` is **not enabled by default** in ClickHouse. It requires `statistics_interval_ms` to be set in the Kafka engine settings. See advanced_checks.sql for setup instructions.

---

## Common Issues

For troubleshooting common errors and configuration guidance, see [troubleshooting.md](troubleshooting.md):

- Topic authorization / ACL errors
- Poll interval exceeded (slow MV / JSON parsing optimization)
- Thread pool starvation
- Parsing errors / dead letter queue
- Data loss with multiple materialized views
- Offset rewind / replay
- Parallel consumption tuning

---

## Cross-Module Triggers

| Finding | Load Module | Reason |
|---------|-------------|--------|
| Slow MV inserts | `altinity-expert-clickhouse-ingestion` | Insert pipeline analysis |
| High merge memory | `altinity-expert-clickhouse-merges` | Merge patterns |
| Query-level issues | `altinity-expert-clickhouse-reporting` | Query optimization |
| Schema concerns | `altinity-expert-clickhouse-schema` | Table design |

---

## Settings Reference

| Setting | Scope | Notes |
|---------|-------|-------|
| `background_message_broker_schedule_pool_size` | Server | Thread pool for Kafka/RabbitMQ/NATS consumers (default: 16) |
| `kafka_num_consumers` | Table | Parallel consumers per table (limited by cores) |
| `kafka_thread_per_consumer` | Table | Required for parallel inserts (`= 1`) |
| `kafka_handle_error_mode` | Table | `stream` (21.6+) or `dead_letter` (25.8+) |
| `max_poll_interval_ms` | librdkafka | Max time between polls before consumer is kicked (default: 300s) |
| `statistics_interval_ms` | librdkafka | Enable rdkafka_stat collection (disabled by default) |
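
The stuck-consumer rule from Consumer Health and the sizing rule from Thread Pool Capacity can be sketched in a few lines of Python, for example to post-process rows exported from checks.sql. This is a minimal illustration, not part of the skill's queries: the function names `is_consumer_stuck` and `recommended_pool_size` are hypothetical, treating a missing exception time as "healthy" is an assumption, and rounding the 25% buffer up to a whole thread is also an assumption.

```python
import math
from datetime import datetime
from typing import Optional


def is_consumer_stuck(
    last_exception_time: Optional[datetime],
    last_poll_time: datetime,
    last_commit_time: datetime,
) -> bool:
    """Apply the Consumer Health rule to one consumer row.

    A consumer counts as stuck when its most recent exception is at or
    after its last poll OR at or after its last commit.
    Assumption: None means the consumer has recorded no exception.
    """
    if last_exception_time is None:
        return False
    return (
        last_exception_time >= last_poll_time
        or last_exception_time >= last_commit_time
    )


def recommended_pool_size(
    kafka_consumers: int,
    other_broker_consumers: int = 0,
    buffer: float = 0.25,
) -> int:
    """Apply the sizing rule: all broker consumers plus a 25% buffer.

    other_broker_consumers covers RabbitMQ/NATS consumers.
    Assumption: the result is rounded up to the next whole thread.
    """
    total = kafka_consumers + other_broker_consumers
    return math.ceil(total * (1 + buffer))


if __name__ == "__main__":
    # 16 Kafka consumers plus 4 RabbitMQ/NATS consumers.
    print(recommended_pool_size(16, 4))
    # Exception recorded after the last poll and commit: stuck.
    print(
        is_consumer_stuck(
            datetime(2024, 1, 1, 12, 0, 5),
            datetime(2024, 1, 1, 12, 0, 0),
            datetime(2024, 1, 1, 12, 0, 0),
        )
    )
```

If the recommended value exceeds the current `background_message_broker_schedule_pool_size`, that matches the `kafka_consumers > mb_pool_size` starvation condition described above.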