---
name: clickhouse-materialized-views
description: |
  Guides for implementing real-time data aggregation and transformation pipelines using
  ClickHouse materialized views. Use when building real-time dashboards, streaming
  analytics, metrics aggregation, event processing, or continuous data transformation.
  Covers incremental aggregation, State/Merge functions, chaining views, schema
  evolution, and troubleshooting streaming pipelines.
allowed-tools: Read, Bash, Glob, Grep
---

## Table of Contents

1. [Purpose](#purpose)
2. [When to Use This Skill](#when-to-use-this-skill)
3. [Quick Start](#quick-start)
4. [Instructions](#instructions)
   - [Step 1: Choose Your Aggregation Pattern](#step-1-choose-your-aggregation-pattern)
   - [Step 2: Select the Right Destination Engine](#step-2-select-the-right-destination-engine)
   - [Step 3: Design Your Aggregation Query](#step-3-design-your-aggregation-query)
   - [Step 4: Implement State/Merge Functions](#step-4-implement-statemerge-functions-for-complex-aggregates)
   - [Step 5: Chain Views for Multi-Level Aggregation](#step-5-chain-views-for-multi-level-aggregation)
   - [Step 6: Handle Schema Evolution](#step-6-handle-schema-evolution-and-versioning)
   - [Step 7: Monitor Performance](#step-7-monitor-and-optimize-view-performance)
   - [Step 8: Backfill Historical Data](#step-8-backfill-historical-data)
5. [Examples & Patterns](#examples--patterns)
   - [Real-Time Dashboards](#real-time-dashboard-metrics)
   - [User Segmentation](#user-segmentation)
   - [Multi-Destination Routing](#multi-destination-routing)
   - [Complete Examples](#complete-examples)
6. [Requirements](#requirements)
7. [References](#references)
   - [Engine Comparison](#engine-comparison)
   - [State/Merge Functions Reference](#statemerge-functions-reference)
   - [Kafka Streaming Pipeline](#kafka-streaming-pipeline)
   - [Aggregation Patterns](#aggregation-patterns)
   - [Complex Aggregates with State/Merge](#complex-aggregates-with-statemerge)
   - [Troubleshooting Guide](#troubleshooting-guide)

# ClickHouse Materialized Views Skill

## Purpose

Materialized views in ClickHouse are incremental insert triggers that automatically transform and aggregate incoming data in real time. Unlike materialized views in traditional databases, ClickHouse views process only new data as it arrives, enabling efficient streaming analytics, real-time dashboards, and continuous aggregation without scheduled batch jobs.
## When to Use This Skill

Use when asked to:

- Build real-time data aggregation pipelines ("create real-time metrics")
- Implement streaming analytics or continuous aggregation
- Create pre-aggregated tables for fast dashboard queries
- Set up automatic data transformation as events arrive
- Chain multi-level aggregations (hourly → daily → monthly)

Do NOT use when:

- Data is batch-processed infrequently (use scheduled jobs instead)
- Aggregation logic changes often (materialized views are harder to modify)
- Source data is small and queries are already fast

## Quick Start

**Create a simple real-time aggregation pipeline:**

```sql
-- Step 1: Create source table (raw events)
CREATE TABLE events (
    user_id UInt32,
    event_type String,
    timestamp DateTime,
    revenue Decimal(10, 2)
) ENGINE = MergeTree()
ORDER BY (user_id, timestamp);

-- Step 2: Create destination table (aggregated)
CREATE TABLE daily_revenue (
    date Date,
    total_revenue Decimal(18, 2),
    event_count UInt64
) ENGINE = SummingMergeTree()
ORDER BY date;

-- Step 3: Create materialized view (automatic aggregation)
CREATE MATERIALIZED VIEW daily_revenue_mv TO daily_revenue AS
SELECT
    toDate(timestamp) as date,
    SUM(revenue) as total_revenue,
    COUNT() as event_count
FROM events
GROUP BY date;

-- Now every INSERT into events automatically updates daily_revenue
INSERT INTO events VALUES (12345, 'purchase', '2024-01-15 10:30:00', 99.99);

SELECT * FROM daily_revenue;
-- Result: 2024-01-15 | 99.99 | 1
```

Because `SummingMergeTree` collapses rows with the same key only during asynchronous background merges, re-aggregate at query time (`SELECT date, sum(total_revenue) ... GROUP BY date`, or use `FINAL`) for exact results once multiple inserts have landed.

## Instructions

### Step 1: Choose Your Aggregation Pattern

Materialized views support three main patterns depending on your use case:

**Incremental Views (Most Common)**
- Process only new data as it arrives
- Best for streaming event data and continuous updates
- Pair with `SummingMergeTree` to automatically roll up rows that share a key

**Refreshable Views**
- Recalculate the entire view on a schedule (e.g., hourly)
- Best for complex queries that need full recalculation
- Syntax: `CREATE MATERIALIZED VIEW ... REFRESH EVERY 1 HOUR ... AS ...` (a minimal sketch appears at the end of Step 2)

**Populate-Based Views**
- Backfill existing data when the view is created
- Warning: can be expensive on large tables
- Syntax: `CREATE MATERIALIZED VIEW ... POPULATE AS ...`

See `examples/aggregation-patterns.md` for complete pattern examples.

### Step 2: Select the Right Destination Engine

The destination table engine determines how your aggregated data behaves:

**SummingMergeTree** (Simple Aggregates)
- Use for counts and sums; for averages, store the sum and the count separately and divide at query time
- Automatically sums numeric columns during background merges
- Rows sharing a key are collapsed during merges, so re-aggregate at query time for exact results
- Example: Hourly sales totals, daily revenue

**AggregatingMergeTree** (Complex Aggregates)
- Use with State/Merge functions for uniq, quantile, topK
- Stores intermediate aggregation states
- Combine with `uniqState()`, `quantileState()`, `topKState()` in views
- Query with `uniqMerge()`, `quantileMerge()`, `topKMerge()` functions
- Example: Unique user counts, percentile calculations

**ReplacingMergeTree** (Deduplication)
- Use when you need to replace/update existing rows (see the sketch below)
- Uses a version column to determine the latest record
- Example: User profile updates, entity state snapshots

**MergeTree** (Raw Events)
- Use when storing raw events without aggregation
- No automatic deduplication
- Example: Event audit trail, raw clickstream data

See `references/engine-comparison.md` for detailed comparison.
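To make the `ReplacingMergeTree` choice concrete, here is a minimal sketch. The `user_profiles` table and its columns are illustrative, not part of the pipeline above:

```sql
-- Keep only the latest snapshot per user. The engine argument (updated_at)
-- is the version column: during merges, the row with the highest
-- updated_at for each user_id survives.
CREATE TABLE user_profiles (
    user_id UInt32,
    plan String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

-- Deduplication happens at merge time, so use FINAL for exact reads
-- before merges have run:
SELECT * FROM user_profiles FINAL WHERE user_id = 12345;
```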
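And for the refreshable pattern from Step 1, a minimal sketch assuming a hypothetical `product_summary` destination table fed from the `orders` table used later in this skill. On releases where the feature is still experimental, the `allow_experimental_refreshable_materialized_view` setting may need to be enabled first:

```sql
-- Recompute the whole summary every hour instead of maintaining it
-- incrementally; useful when the query cannot be updated incrementally.
CREATE MATERIALIZED VIEW product_summary_mv
REFRESH EVERY 1 HOUR
TO product_summary AS
SELECT
    product_id,
    uniqExact(user_id) AS unique_buyers,
    sum(revenue) AS total_revenue
FROM orders
GROUP BY product_id;
```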
### Step 3: Design Your Aggregation Query

Follow these principles:

**GROUP BY Cardinality**
- Keep cardinality moderate (not millions of unique combinations)
- Example: Good = `GROUP BY date, product_id` | Bad = `GROUP BY user_id, timestamp`

**ORDER BY for Query Patterns**
- Design the destination ORDER BY around your most common query filters
- If queries filter by date first: `ORDER BY (date, product_id)`
- If queries filter by product first: `ORDER BY (product_id, date)`

**Include Necessary Columns**
- Include columns needed for post-aggregation filtering
- Store raw values alongside aggregates when needed for flexibility
- Example: store `timestamp` and `revenue` alongside hourly aggregates

**Use Conditional Aggregation**
- Use `countIf()`, `sumIf()`, `avgIf()` for event filtering
- Filter inside the aggregate function rather than with a WHERE clause, so one view can compute several filtered metrics from the same rows
- Example: `countIf(event_type = 'purchase') as purchase_count`

See `examples/aggregation-patterns.md` for SQL patterns.

### Step 4: Implement State/Merge Functions for Complex Aggregates

When using `AggregatingMergeTree`, use special State/Merge function pairs:

**In Materialized View (State Functions)**

```sql
CREATE MATERIALIZED VIEW user_stats_mv TO user_stats AS
SELECT
    toDate(timestamp) as date,
    uniqState(user_id) as unique_users,
    quantileState(0.95)(response_time) as p95_latency
FROM events
GROUP BY date;
```

**In Queries (Merge Functions)**

```sql
SELECT
    date,
    uniqMerge(unique_users) as total_unique_users,
    quantileMerge(0.95)(p95_latency) as latency_p95
FROM user_stats
GROUP BY date;
```

Common function pairs:

- `uniq()` ↔ `uniqState()` / `uniqMerge()`
- `sum()` ↔ `sumState()` / `sumMerge()`
- `avg()` ↔ `avgState()` / `avgMerge()`
- `quantile(0.95)()` ↔ `quantileState(0.95)()` / `quantileMerge(0.95)()`
- `topK(10)()` ↔ `topKState(10)()` / `topKMerge(10)()`

See `references/state-merge-reference.md` for all function pairs.

### Step 5: Chain Views for Multi-Level Aggregation

Build hierarchical aggregations to reduce redundant computation:

```sql
-- Level 1: Raw events (source)
CREATE TABLE events (...) ENGINE = MergeTree()
ORDER BY (user_id, timestamp);

-- Level 2: Hourly aggregation
CREATE TABLE hourly_stats (...) ENGINE = SummingMergeTree()
ORDER BY (hour, product_id);

CREATE MATERIALIZED VIEW hourly_stats_mv TO hourly_stats AS
SELECT
    toStartOfHour(timestamp) as hour,
    product_id,
    COUNT() as event_count,
    SUM(revenue) as total_revenue
FROM events
GROUP BY hour, product_id;

-- Level 3: Daily aggregation (chain from hourly, not raw events)
CREATE TABLE daily_stats (...)
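-- (column list elided above; it must match the columns produced by
--  daily_stats_mv below: date, product_id, event_count, total_revenue)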
ENGINE = SummingMergeTree()
ORDER BY (date, product_id);

CREATE MATERIALIZED VIEW daily_stats_mv TO daily_stats AS
SELECT
    toDate(hour) as date,
    product_id,
    SUM(event_count) as event_count,
    SUM(total_revenue) as total_revenue
FROM hourly_stats
GROUP BY date, product_id;
```

Benefits:

- Reduces redundant computation (daily stats are built from hourly rows rather than by re-processing raw events)
- Each level is optimized for different query patterns
- Easier to debug and modify intermediate aggregations

### Step 6: Handle Schema Evolution and Versioning

When modifying aggregation logic:

```sql
-- Create new versioned view with updated schema
CREATE MATERIALIZED VIEW hourly_stats_v2_mv TO hourly_stats_v2 AS
SELECT
    toStartOfHour(timestamp) as hour,
    product_id,
    COUNT() as sales_count,
    SUM(revenue) as total_revenue,
    COUNT(DISTINCT user_id) as unique_customers  -- New column
FROM orders
GROUP BY hour, product_id;

-- Migrate queries gradually to v2, then drop the old view:
DROP VIEW hourly_stats_v1_mv;
DROP TABLE hourly_stats_v1;
```

Avoid `ALTER TABLE` on materialized view destinations when possible (it can cause data loss).

### Step 7: Monitor and Optimize View Performance

**Check View Execution Performance**

Per-view execution metrics are recorded in `system.query_views_log` (populated when the `log_query_views` setting is enabled, which is the default):

```sql
SELECT
    view_name,
    view_duration_ms,
    read_rows,
    peak_memory_usage,
    status,
    exception
FROM system.query_views_log
WHERE view_type = 'Materialized'
  AND view_duration_ms > 1000
ORDER BY view_duration_ms DESC;
```

**Common Performance Issues and Solutions**

- **High memory usage**: Reduce GROUP BY cardinality or add WHERE filters
- **Slow execution**: Simplify the aggregation logic, or chain views instead of one complex view
- **Data lag**: Increase `kafka_max_block_size` if consuming from Kafka

See `references/troubleshooting.md` for debugging guide.

### Step 8: Backfill Historical Data

When adding new views over existing data:

**Option 1: Direct Insert (Safest)**

```sql
-- Backfill historical data directly into the destination
INSERT INTO hourly_stats
SELECT
    toStartOfHour(timestamp) as hour,
    product_id,
    COUNT() as sales_count,
    SUM(revenue) as total_revenue
FROM orders
WHERE date < '2024-01-01'
GROUP BY hour, product_id;
```

**Option 2: Use POPULATE (for empty tables)**

```sql
-- Only use when starting from scratch. Note: POPULATE cannot be combined
-- with an explicit TO destination in most ClickHouse releases; without TO,
-- the view stores its data in its own implicit table.
CREATE MATERIALIZED VIEW hourly_stats_mv
ENGINE = SummingMergeTree() ORDER BY (hour, product_id)
POPULATE AS
SELECT ... FROM orders;
```

⚠️ Avoid `POPULATE` on large source tables: it can be very slow, and rows inserted into the source while the view is being populated are not captured.
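After a backfill, it is worth checking that the destination agrees with the source. A minimal sanity-check sketch, reusing the `orders` and `hourly_stats` tables and the date boundary from Option 1 above:

```sql
-- Totals in the destination should match a direct aggregation over the
-- source for the backfilled range; a mismatch suggests double-counting
-- (e.g., POPULATE combined with a manual backfill) or missed rows.
SELECT
    (SELECT sum(total_revenue) FROM hourly_stats WHERE hour < '2024-01-01') AS mv_total,
    (SELECT sum(revenue) FROM orders WHERE date < '2024-01-01') AS source_total;
```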
## Examples & Patterns

The following patterns are common use cases for materialized views:

### Real-Time Dashboard Metrics

Pre-aggregate metrics to enable instant dashboard queries:

```sql
CREATE MATERIALIZED VIEW dashboard_metrics_mv TO dashboard_metrics AS
SELECT
    toStartOfHour(timestamp) as hour,
    COUNT() as total_events,
    uniq(user_id) as unique_users,
    countIf(event_type = 'purchase') as purchases
FROM events
GROUP BY hour;
```

Note: a plain `uniq()` value computed for one insert block cannot be correctly combined with another later; if the destination must merge rows across inserts, use `uniqState()` with `AggregatingMergeTree` as shown in Step 4.

### User Segmentation

Automatically segment users based on behavior:

```sql
CREATE MATERIALIZED VIEW user_segments_mv TO user_segments AS
SELECT
    user_id,
    COUNT() as event_count,
    SUM(revenue) as lifetime_value,
    CASE
        WHEN lifetime_value > 1000 THEN 'vip'
        WHEN lifetime_value > 100 THEN 'regular'
        ELSE 'casual'
    END as segment
FROM events
GROUP BY user_id;
```

Keep in mind that the view aggregates each inserted block independently, so `lifetime_value` and `segment` reflect per-insert batches; recompute the segment at query time (or use State/Merge functions) when exact lifetime values matter.

### Multi-Destination Routing

Route data to multiple aggregations:

```sql
-- High-value orders
CREATE MATERIALIZED VIEW high_value_orders_mv TO high_value_orders AS
SELECT * FROM orders WHERE total_amount > 1000;

-- All orders
CREATE MATERIALIZED VIEW all_orders_mv TO all_orders AS
SELECT * FROM orders;
```

### Complete Examples

For comprehensive, production-ready examples covering real-world scenarios, see the [examples/](./examples/) directory:

- Real-time dashboards and metrics aggregation
- User segmentation and cohort analysis
- Funnel analysis and conversion tracking
- Time-series downsampling and rollups
- Kafka streaming pipelines with end-to-end examples

## Requirements

- ClickHouse: any recent release supports incremental materialized views and State/Merge combinators
- ClickHouse 23.12+: refreshable materialized views (`REFRESH EVERY ...`; experimental in the earliest releases)
- Python client (for querying views): `pip install clickhouse-driver`
- Kafka/Redpanda: for streaming pipelines (optional)

## References

### Engine Comparison

[references/engine-comparison.md](./references/engine-comparison.md) - Detailed comparison of ClickHouse table engines for materialized views:

- **SummingMergeTree vs AggregatingMergeTree**: When to use each for aggregation
- **ReplacingMergeTree**: For deduplication and entity updates
- **Engine characteristics**: Memory usage, merge behavior, query performance
- **Decision matrix**: Choosing the right engine for your use case

### State/Merge Functions Reference

[references/state-merge-reference.md](./references/state-merge-reference.md) - Complete reference for all State/Merge function pairs:

- Function pair mapping (State ↔ Merge)
- Supported aggregation functions
- Parameter specifications and usage
- Performance considerations
- Examples for each function

### Kafka Streaming Pipeline

[examples/kafka-streaming-pipeline.md](./examples/kafka-streaming-pipeline.md) - Real-world Kafka integration patterns:

- Consuming from Kafka with the Kafka table engine
- Chaining materialized views with Kafka sources
- Error handling and data validation
- Complete end-to-end pipeline example
- Monitoring Kafka consumption in ClickHouse

### Aggregation Patterns

[examples/aggregation-patterns.md](./examples/aggregation-patterns.md) - SQL pattern examples and common scenarios:

- Real-time dashboards and metrics
- User segmentation and cohort analysis
- Funnel analysis and conversion tracking
- Time-series downsampling and rollups
- Copy-paste ready SQL examples

### Complex Aggregates with State/Merge

[examples/state-merge-aggregates.md](./examples/state-merge-aggregates.md) - Advanced aggregation techniques:

- Using uniq, quantile, and topK in views
- Multi-stage aggregation with State/Merge
- Accuracy vs performance tradeoffs
- Real-world performance tuning examples
- Working with approximate functions

### Troubleshooting Guide

[references/troubleshooting.md](./references/troubleshooting.md) - Comprehensive debugging guide for common issues:

- Performance problems and optimization strategies
- Data correctness and consistency issues
- Schema evolution and migration challenges
- Monitoring and health checks
- Common error messages and solutions