--- name: design-system type: reference description: "Decomposes a product concept into architectural components, domain systems, data models, and integration boundaries. Use when starting system architecture or when the user mentions system design or component breakdown." effort: 3 allowed-tools: Read, Glob, Grep, Write, Edit, Bash user-invocable: true when_to_use: "When designing system architecture, defining domain boundaries, or creating a component breakdown for a new product or feature" --- # System Design ## Phase 1: Clarify requirements (always do this first) Ask before designing: 1. **Scale**: How many users/requests/day? Read-heavy or write-heavy? 2. **Consistency**: Strong (banking) or eventual (social feed)? 3. **Availability target**: 99.9% (8.7h/yr downtime) or 99.99% (52min/yr)? 4. **Latency budget**: p99 < 100ms? < 1s? 5. **Geography**: Single region or multi-region? ## Capacity estimation shortcuts ``` 1M users/day active → ~12 req/s avg, ~120 req/s peak (10x) 1KB per request → 1M req/day = ~1GB/day = ~365GB/year Read:write ratio 10:1 (typical social) → optimize read path first 1 server handles ~1000 req/s (rule of thumb for I/O-bound services) ``` ## Component breakdown template ``` Client layer → Web / Mobile / API consumers CDN → Static assets, edge caching API Gateway → Rate limiting, auth, routing, SSL termination Services → Domain-specific services (User, Order, Payment, Notification) Cache → Redis for hot data (sessions, rate limits, computed results) Database → Primary DB + Read replicas Message queue → Async operations, event-driven decoupling Storage → Object storage for files (S3/GCS) Monitoring → Metrics, logs, traces, alerts ``` ## Database selection guide | Need | Choose | |---|---| | ACID transactions, relations | PostgreSQL | | High-scale document store | MongoDB | | Key-value, cache, pub/sub | Redis | | Time-series data | TimescaleDB / InfluxDB | | Graph relationships | Neo4j | | Full-text search | Elasticsearch | | Analytical/OLAP | ClickHouse / BigQuery | ## Caching strategies ``` Cache-aside (read): App checks cache → miss → DB → write to cache Write-through: Write to cache AND DB simultaneously (consistent, slower writes) Write-behind: Write to cache → async flush to DB (fast writes, risk of loss) Read-through: Cache handles DB reads automatically TTL guidelines: - Sessions: 15-30 min - User profile: 5 min - Product catalog: 1 hour - Config/settings: 24 hours ``` ## Message queue patterns ``` When to use queues: ✓ Async processing (email, PDF generation, notifications) ✓ Rate-limiting downstream services ✓ Decoupling services (order → payment → shipping) ✓ Fan-out (1 event → multiple consumers) Queue selection: - RabbitMQ: complex routing, request-reply, low latency - Kafka: high throughput, event log/replay, stream processing - SQS: managed, simple, AWS-native, at-least-once delivery - Redis Streams: lightweight, same infra as cache ``` ## API design decisions ``` REST: Standard CRUD, simple clients, team familiarity (default choice) GraphQL: Multiple clients with different data needs, reduce over-fetching gRPC: Internal service-to-service, binary protocol, streaming needed WebSocket: Real-time bidirectional (chat, live updates, collaborative tools) ``` ## Scaling patterns ``` Vertical (scale up): More CPU/RAM — quick, limited ceiling Horizontal (scale out): More instances — requires stateless services Database read replicas: Offload read traffic (good for 80%+ read workloads) Database sharding: Shard by user_id, geography — last resort, complex CQRS: Separate read/write models — when read/write patterns diverge heavily ``` ## Common design mistakes | Mistake | Better approach | |---|---| | Over-engineering for scale you don't have | Start monolith, extract services at clear pain points | | Synchronous calls to all dependencies | Use async queues for non-critical paths | | No caching strategy | Cache at API layer + DB query results | | Storing sessions in DB | Use Redis; DB sessions don't scale horizontally | | Single point of failure | Redundancy at every critical layer |