--- name: estimation-techniques description: Back-of-envelope calculations for system design. Use when estimating QPS, storage, bandwidth, or latency for capacity planning. Includes latency numbers every programmer should know and common estimation patterns. allowed-tools: Read, Glob, Grep --- # Estimation Techniques This skill provides frameworks for back-of-envelope calculations essential for system design and capacity planning. ## When to Use This Skill **Keywords:** back-of-envelope, estimation, QPS, storage calculation, bandwidth, latency, capacity planning, scale estimation **Use this skill when:** - Estimating system capacity requirements - Calculating storage needs for a feature - Determining bandwidth requirements - Sizing infrastructure for expected load - Justifying architectural decisions with numbers - Preparing for system design interviews ## Core Principle **Estimation is not about precision, it's about order of magnitude.** Getting within 10x is usually good enough for architectural decisions. The goal is to identify if you need: - 1 server or 100 servers - 1 GB or 1 TB of storage - 10 ms or 1 second latency ## Essential Numbers to Know ### Powers of 2 | Power | Value | Approximate | | ----- | ----- | ----------- | | 2^10 | 1,024 | ~1 Thousand (KB) | | 2^20 | 1,048,576 | ~1 Million (MB) | | 2^30 | 1,073,741,824 | ~1 Billion (GB) | | 2^40 | 1,099,511,627,776 | ~1 Trillion (TB) | ### Time Conversions | Unit | Seconds | Useful For | | ---- | ------- | ---------- | | 1 minute | 60 | Short operations | | 1 hour | 3,600 | Batch jobs | | 1 day | 86,400 (~100K) | Daily aggregations | | 1 month | 2,592,000 (~2.5M) | Monthly calculations | | 1 year | 31,536,000 (~30M) | Annual projections | ### Availability Targets | Availability | Downtime/Year | Downtime/Month | Downtime/Day | | ------------ | ------------- | -------------- | ------------ | | 99% (two 9s) | 3.65 days | 7.31 hours | 14.4 min | | 99.9% (three 9s) | 8.76 hours | 43.8 min | 1.44 min | | 99.99% (four 9s) | 52.6 min | 4.38 min | 8.64 sec | | 99.999% (five 9s) | 5.26 min | 26.3 sec | 864 ms | ### Latency Numbers Every Programmer Should Know **See full reference:** `references/latency-numbers.md` Quick reference: | Operation | Latency | Relative | | --------- | ------- | -------- | | L1 cache reference | 0.5 ns | 1x | | L2 cache reference | 7 ns | 14x | | Main memory reference | 100 ns | 200x | | SSD random read | 16 us | 32,000x | | HDD seek | 2 ms | 4,000,000x | | Round trip same datacenter | 0.5 ms | 1,000,000x | | Round trip CA to Netherlands | 150 ms | 300,000,000x | ## Estimation Patterns ### Pattern 1: QPS (Queries Per Second) #### Formula ```text QPS = (Number of Users) x (Actions per User per Day) / (Seconds per Day) ``` #### Example: Twitter-like service ```text Given: - 300 million monthly active users - 50% are daily active = 150M DAU - Average user reads 20 tweets/day QPS = 150M * 20 / 86,400 = 3 billion / 100,000 = 30,000 QPS Peak load (typically 2-3x average): Peak QPS = 30,000 * 3 = 90,000 QPS ``` ### Pattern 2: Storage Estimation #### Formula ```text Storage = (Number of Items) x (Size per Item) x (Replication Factor) x (Time Period) ``` #### Example: Photo storage service ```text Given: - 100 million users - 10% upload daily = 10M uploads/day - Average photo size = 2 MB - Keep 5 years of data - Replication factor = 3 Daily storage = 10M * 2 MB = 20 TB Yearly storage = 20 TB * 365 = 7.3 PB 5-year storage = 7.3 * 5 = 36.5 PB With replication = 36.5 * 3 = ~110 PB ``` ### Pattern 3: Bandwidth Estimation #### Formula ```text Bandwidth = (QPS) x (Request Size or Response Size) ``` #### Example: Video streaming service ```text Given: - 1 million concurrent viewers - Average bitrate = 5 Mbps - Peak hours: 8 PM - 11 PM Bandwidth = 1M * 5 Mbps = 5 Tbps With 20% overhead: ~6 Tbps CDN egress cost (rough): $0.02/GB * 6 Tbps * 3 hours * 3600 sec/hour / 8 bits/byte = massive cost (hence why Netflix built their own CDN) ``` ### Pattern 4: Cache Size Estimation #### Formula ```text Cache Size = (QPS) x (Cache TTL) x (Response Size) x (Unique Ratio) ``` #### Example: API response cache ```text Given: - 10,000 QPS - Cache TTL = 5 minutes = 300 seconds - Average response = 10 KB - 20% of requests are unique Cache entries = 10,000 * 300 * 0.20 = 600,000 entries Cache size = 600,000 * 10 KB = 6 GB With overhead (keys, metadata): ~10 GB ``` ### Pattern 5: Database Sizing #### Formula ```text DB Size = (Number of Rows) x (Row Size) x (Index Overhead) x (Replication) ``` #### Example: User profile database ```text Given: - 500 million users - Average profile = 1 KB (name, email, settings, etc.) - Index overhead = 30% - Primary + 2 replicas = 3x Data size = 500M * 1 KB = 500 GB With indexes = 500 GB * 1.3 = 650 GB With replication = 650 GB * 3 = ~2 TB Memory for hot data (20%): ~400 GB ``` ## Common Estimation Scenarios ### Scenario 1: URL Shortener ```text Requirements: - 100M new URLs/month - 10:1 read:write ratio Writes: - 100M / (30 * 24 * 3600) = ~40 writes/second - Peak: ~100 writes/second Reads: - 40 * 10 = 400 reads/second - Peak: ~1000 reads/second Storage (5 years): - 100M URLs/month * 60 months = 6 billion URLs - Average URL = 100 bytes (short) + 500 bytes (long) = 600 bytes - 6B * 600 bytes = 3.6 TB - With indexes and overhead: ~5 TB ``` ### Scenario 2: Chat Application ```text Requirements: - 10M daily active users - Average 50 messages sent/day - Average 200 messages received/day Message throughput: - Sends: 10M * 50 / 86,400 = ~6,000 messages/second - Peak: ~20,000 messages/second Connections: - Each user maintains 1-3 connections (phone, laptop, tablet) - Peak concurrent: 10M * 0.1 (10% online) * 2 = 2M connections Storage (1 year): - 10M users * 50 msgs/day * 365 days = 182B messages/year - Average message = 200 bytes - 182B * 200 bytes = 36.4 TB/year ``` ### Scenario 3: Video Streaming ```text Requirements: - 100M monthly active users - 30% watch daily = 30M DAU - Average 1 hour/day viewing Concurrent viewers (peak): - 30M DAU / 24 hours * 3 (peak factor) = ~4M concurrent Bandwidth: - Average stream: 5 Mbps - 4M * 5 Mbps = 20 Tbps peak bandwidth Storage (library of 10K titles): - Average video = 2 hours - Multiple qualities: 480p (1GB), 720p (3GB), 1080p (5GB), 4K (20GB) - Per title: ~30 GB - Library: 10K * 30 GB = 300 TB ``` ## Estimation Tips ### Round Aggressively ```text Instead of: Use: 86,400 seconds ~100,000 (10^5) 2.5 million ~3 million 7.3 petabytes ~10 petabytes ``` ### Use Orders of Magnitude Think in powers of 10: - Thousands (10^3) - Millions (10^6) - Billions (10^9) - Trillions (10^12) ### State Your Assumptions Always verbalize: - "I'm assuming 10% of users are active at peak" - "I'm estimating average message size at 200 bytes" - "I'm using a 3x replication factor" ### Sanity Check Results After calculating, ask: - "Does this make sense?" - "Is this in the right order of magnitude?" - "What would change if my assumption is off by 10x?" ## Common Mistakes ### Mistake 1: Ignoring Peak vs Average #### Problem Sizing for average load. ```text Average QPS: 10,000 Peak QPS: 30,000 (often 2-3x average) If you size for 10,000, you'll fail at peak. ``` ### Mistake 2: Forgetting Replication #### Problem Calculating raw storage without copies. ```text Data: 1 TB With 3 replicas: 3 TB With backups: 4-5 TB ``` ### Mistake 3: Not Accounting for Growth #### Problem Sizing for current, not future. ```text Current users: 10M Expected growth: 50%/year Year 3: 10M * 1.5^3 = 34M users Size for at least 2x current to avoid near-term issues. ``` ### Mistake 4: Over-Precision #### Problem Calculating to 3 decimal places. ```text Bad: "We need exactly 3,456,789 IOPS" Good: "We need roughly 3-4 million IOPS" ``` ## Quick Reference Calculations | Need | Formula | | ---- | ------- | | QPS | users * actions/day / 86400 | | Storage/day | items/day * size/item | | Bandwidth | QPS * response_size | | Cache hit rate | 1 - (DB_QPS / total_QPS) | | Servers needed | QPS / QPS_per_server | | Shards needed | data_size / max_shard_size | ## Related Skills - `design-interview-methodology` - Overall interview framework - `quality-attributes-taxonomy` - NFR definitions (scalability, performance) - `database-scaling` - Database capacity planning (Phase 3) - `caching-strategies` - Cache sizing and hit rates (Phase 3) ## Related Commands - `/sd:estimate ` - Calculate capacity interactively ## Related Agents - `capacity-planner` - Guided estimation with calculations ## References - `references/latency-numbers.md` - Complete latency reference table --- ## Version History - **v1.0.0** (2025-12-26): Initial release --- ## Last Updated **Date:** 2025-12-26 **Model:** claude-opus-4-5-20251101