--- name: clickhouse-schema-design description: "15+ schema rules for ClickHouse tables. ALWAYS LOAD when creating or modifying tables. Achieves sub-second queries, 10x compression, and automated data lifecycle." --- # ClickHouse Schema Design **ALWAYS LOAD** when creating or modifying ClickHouse tables. ## Goals - Sub-second query response on billion-row tables - 10x+ compression ratios vs raw data - Zero-maintenance data retention and tiered storage - Queries that scan <1% of data instead of full table scans ## Reference Documentation - [MergeTree Engine](https://clickhouse.com/docs/engines/table-engines/mergetree-family/mergetree) - [Primary Keys and Indexes](https://clickhouse.com/docs/optimize/sparse-primary-indexes) - [Data Types](https://clickhouse.com/docs/sql-reference/data-types) - [TTL](https://clickhouse.com/docs/sql-reference/statements/alter/ttl) - [Altinity: Pick Keys Guide](https://kb.altinity.com/engines/mergetree-table-engine-family/pick-keys/) **Search terms:** schema design, ORDER BY, PRIMARY KEY, PARTITION BY, data types, TTL, index_granularity ## Critical Rules ### [CRITICAL] 1. **ORDER BY determines query performance.** Put most-filtered columns first, low cardinality before high. 2. **PARTITION BY is for data management, not query speed.** Use ORDER BY to speed up queries. 3. **Use smallest data types possible.** UInt32 not UInt64 if values fit. ### [HIGH] 4. **LowCardinality for strings with <10K unique values.** Saves storage and speeds queries. 5. **Avoid Nullable when possible.** Use DEFAULT values instead. 6. **PRIMARY KEY can be a prefix of ORDER BY.** Reduces index size. ### [MEDIUM] 7. **3-5 columns in ORDER BY is typical.** More columns rarely help, fewer may miss optimization opportunities. ## Engine Selection | Engine | Use Case | Key Behavior | | ---------------------- | ----------------------------------- | ----------------------------------------------------------- | | `MergeTree` | Append-only data (events, logs) | Standard storage, no special merge logic | | `ReplacingMergeTree` | Deduplication, upserts | Keeps latest row per ORDER BY key; query with `argMax` pattern | | `SummingMergeTree` | Pre-aggregated counters | Automatically sums numeric columns on merge | | `AggregatingMergeTree` | Complex aggregates (uniq, quantile) | Stores intermediate states; use `-State`/`-Merge` functions | | `CollapsingMergeTree` | Mutable rows | Uses +1/-1 sign column to cancel/update rows | **For engine examples and Materialized Views, see `clickhouse-materialized-views` skill.** ## ORDER BY Selection ### The 3-5 Column Rule ```sql ORDER BY ( tenant_id, -- 1. Lowest cardinality, most filtered event_date, -- 2. Time component (common filter) user_id, -- 3. Medium cardinality event_type -- 4. Higher cardinality (optional) ) ``` ### Decision Process 1. **Which columns appear in WHERE clauses?** - These go first 2. **What's the cardinality?** - Lower cardinality columns before higher 3. **Is there a time dimension?** - Usually included for range queries ### Good vs Bad ```sql -- BAD: High cardinality first, rarely-filtered columns ORDER BY (user_id, timestamp, tenant_id) -- GOOD: Low cardinality first, commonly-filtered columns ORDER BY (tenant_id, toDate(timestamp), user_id) ``` ### PRIMARY KEY as Prefix ```sql -- Full sorting key for data layout ORDER BY (tenant_id, event_date, user_id, event_type) -- Shorter primary key for smaller index (optional) PRIMARY KEY (tenant_id, event_date) ``` ## PARTITION BY Guidelines ### Size Targets | Scenario | Target Partition Size | | ------------------------------------- | --------------------- | | General MergeTree tables | 1-300 GB | | SummingMergeTree / ReplacingMergeTree | 400 MB - 40 GB | | Small tables (<5 GB total) | No partitioning | ### Common Patterns ```sql -- Monthly (most common for analytics) PARTITION BY toYYYYMM(event_date) -- Daily (high volume, >1TB/month) PARTITION BY toDate(event_date) -- Multi-tenant with time PARTITION BY (tenant_id, toYYYYMM(event_date)) -- No partitioning (small tables) -- Simply omit PARTITION BY clause ``` ### Anti-Pattern ```sql -- BAD: Over-partitioning creates thousands of small parts PARTITION BY (toDate(event_date), user_id) -- BAD: Partitioning by high-cardinality column PARTITION BY user_id ``` ## Data Type Optimization ### Numbers ```sql -- Use smallest type that fits your data count UInt16, -- Max 65,535 (instead of UInt64) percentage Float32, -- Instead of Float64 if 6-7 digits precision is enough flags UInt8, -- For small integers, booleans ``` ### Strings ```sql -- LowCardinality for <10K unique values country LowCardinality(String), status LowCardinality(String), event_type LowCardinality(String), -- Regular String for high cardinality user_agent String, url String, ``` ### Dates and Times ```sql -- Use simplest type that meets requirements event_date Date, -- If you only need date event_time DateTime, -- If you need seconds event_time_precise DateTime64(3), -- Only if you need milliseconds ``` ### Avoiding Nullable ```sql -- BAD: Nullable adds overhead and complexity user_id Nullable(UInt64), -- GOOD: Use DEFAULT for missing values user_id UInt64 DEFAULT 0, -- GOOD: Use empty string for missing text name String DEFAULT '', ``` ## TTL Configuration ### Delete Old Data ```sql CREATE TABLE events ( event_date Date, ... ) ENGINE = MergeTree() ORDER BY (...) TTL event_date + INTERVAL 90 DAY DELETE; ``` ### Tiered Storage ```sql CREATE TABLE events ( event_date Date, ... ) ENGINE = MergeTree() ORDER BY (...) TTL event_date + INTERVAL 7 DAY TO VOLUME 'hot', event_date + INTERVAL 30 DAY TO VOLUME 'warm', event_date + INTERVAL 365 DAY DELETE; ``` ### Column-Level TTL ```sql CREATE TABLE events ( event_date Date, user_id UInt64, -- Delete PII after 30 days, keep aggregated data email String TTL event_date + INTERVAL 30 DAY, ip_address String TTL event_date + INTERVAL 30 DAY ) ENGINE = MergeTree() ORDER BY (event_date, user_id); ``` ## Complete Example ```sql CREATE TABLE analytics_events ( -- Time dimension event_date Date, event_time DateTime, -- Identifiers (low to high cardinality) tenant_id UInt32, user_id UInt64, session_id String, -- Categorical data (use LowCardinality) event_type LowCardinality(String), country LowCardinality(String), device_type LowCardinality(String), -- Metrics duration_ms UInt32, -- Flexible data properties String, -- JSON as string -- Skip indices for secondary lookups INDEX idx_user user_id TYPE bloom_filter(0.01) GRANULARITY 3, INDEX idx_session session_id TYPE bloom_filter(0.01) GRANULARITY 3 ) ENGINE = MergeTree() PARTITION BY toYYYYMM(event_date) ORDER BY (tenant_id, event_date, user_id) TTL event_date + INTERVAL 12 MONTH DELETE SETTINGS index_granularity = 8192; ``` ## Anti-Patterns | Anti-Pattern | Problem | Solution | | ------------------------------------------ | ---------------------------------- | -------------------------------------- | | `ORDER BY (uuid, timestamp)` | High cardinality first | Put low cardinality columns first | | `PARTITION BY toDate(ts)` for small tables | Too many small partitions | Omit partitioning or use monthly | | `Nullable(UInt64)` everywhere | Storage and query overhead | Use DEFAULT values | | `String` for status codes | Wastes space | Use `LowCardinality(String)` or `Enum` | | `DateTime64(9)` always | Nanosecond precision rarely needed | Use `DateTime` or `DateTime64(3)` | | Putting timestamp first in ORDER BY | Poor compression and filtering | Put categorical columns first | ## Verification Queries ```sql -- Check table size and compression SELECT table, formatReadableSize(sum(bytes_on_disk)) AS size, sum(rows) AS rows, round(sum(bytes_on_disk) / sum(rows), 2) AS bytes_per_row FROM system.parts WHERE active AND database = 'default' GROUP BY table; -- Check partition sizes SELECT partition, count() AS parts, sum(rows) AS rows, formatReadableSize(sum(bytes_on_disk)) AS size FROM system.parts WHERE active AND table = 'your_table' GROUP BY partition ORDER BY partition; -- Verify column types SELECT name, type, compression_codec FROM system.columns WHERE database = 'default' AND table = 'your_table'; ``` ## Troubleshooting **Always ask for user confirmation before applying schema changes (ALTER TABLE, recreating tables).** ### "Too Many Parts" Error **Problem:** `DB::Exception: Too many parts`, inserts rejected, merge queue growing **Diagnose:** ```sql SELECT table, partition, count() AS parts FROM system.parts WHERE active AND database = 'default' GROUP BY table, partition HAVING parts > 300 ORDER BY parts DESC; ``` **Solutions:** | Cause | Fix | |-------|-----| | Over-partitioning (daily + high cardinality) | Use monthly partitions: `PARTITION BY toYYYYMM(date)` | | Too many small inserts | Batch inserts: 1000+ rows per INSERT | | High-cardinality partition key | Remove high-cardinality columns from PARTITION BY | ```sql -- Fix: Change from daily to monthly partitioning (requires table recreation) CREATE TABLE events_new (...) PARTITION BY toYYYYMM(event_date) ...; INSERT INTO events_new SELECT * FROM events; RENAME TABLE events TO events_old, events_new TO events; DROP TABLE events_old; ``` ### Poor Compression (<3x Ratio) **Problem:** Table using more disk than expected, compression ratio below 3x **Diagnose:** ```sql SELECT column, type, formatReadableSize(sum(column_data_compressed_bytes)) AS compressed, round(sum(column_data_uncompressed_bytes) / sum(column_data_compressed_bytes), 2) AS ratio FROM system.parts_columns WHERE active AND table = 'your_table' GROUP BY column, type ORDER BY ratio ASC; ``` **Solutions:** | Cause | Fix | |-------|-----| | High-cardinality column first in ORDER BY | Reorder: low cardinality columns first | | String for low-cardinality data | Use `LowCardinality(String)` | | Wrong codec for data pattern | Use DoubleDelta for timestamps, Gorilla for floats | | Random UUIDs in ORDER BY | Move UUID later in ORDER BY, or use different key | ```sql -- Check cardinality to decide LowCardinality usage SELECT uniq(status) FROM events; -- If <10K, use LowCardinality -- Fix column type (requires recreation or new column) ALTER TABLE events ADD COLUMN status_new LowCardinality(String); ALTER TABLE events UPDATE status_new = status WHERE 1; -- Then migrate queries to use status_new ``` ### Slow Queries Despite Good Schema **Problem:** Queries slow even with proper ORDER BY and partitioning **Diagnose:** ```sql EXPLAIN indexes = 1 SELECT ... FROM your_table WHERE ...; -- Check: Are granules being skipped? Is partition pruning happening? ``` **Solutions:** | Cause | Fix | |-------|-----| | Query doesn't filter on ORDER BY prefix | Add ORDER BY columns to WHERE clause | | Function on filter column | Store computed column, filter on that | | Missing skip index for secondary lookups | Add bloom_filter index | | Selecting too many columns | Select only needed columns | ```sql -- Example: Table ORDER BY (tenant_id, event_date, user_id) -- BAD: Skips ORDER BY prefix SELECT * FROM events WHERE user_id = 123; -- GOOD: Include prefix SELECT * FROM events WHERE tenant_id = 1 AND event_date = today() AND user_id = 123; -- ALT: Add skip index for direct user_id lookups ALTER TABLE events ADD INDEX idx_user user_id TYPE bloom_filter(0.01) GRANULARITY 4; ```