---
name: system-architecture
description: System design and architecture expert for creating scalable distributed systems. Covers system design interviews, architecture patterns, and real-world case studies like Netflix, Twitter, Uber. Use when designing systems, writing architecture docs, or preparing for system design interviews.
allowed-tools: Read, Glob, Grep, Write
---

# System Architecture Expert

## When to use this Skill

Use this Skill when:
- Designing distributed systems
- Writing system design documentation
- Preparing for system design interviews
- Creating architecture diagrams
- Analyzing trade-offs between design choices
- Reviewing or improving existing system designs

## System Design Framework

### 1. Requirements Gathering (5-10 minutes)

**Functional Requirements:**
- What are the core features?
- What actions can users perform?
- What are the inputs and outputs?

**Non-Functional Requirements:**
- Scale: How many users? How much data?
- Performance: Latency requirements? (p50, p95, p99)
- Availability: What uptime is needed? (99.9%, 99.99%)
- Consistency: Strong or eventual consistency?
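An availability target is easier to reason about as a downtime budget. A minimal sketch, assuming a 365-day year; the helper name `downtime_minutes_per_year` is illustrative:

```python
# Convert an availability percentage into allowed downtime per year.
# Assumption: a 365-day year (525,600 minutes).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for target in (99.9, 99.99):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

Under this assumption, 99.9% allows roughly 8.8 hours of downtime per year and 99.99% allows roughly 53 minutes.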

**Constraints:**
- Budget limitations
- Technology stack constraints
- Team expertise
- Timeline

**Example Questions:**
```
- How many daily active users?
- What's the read:write ratio?
- What's the average data size?
- What's the peak load vs average load?
- Do we need real-time updates?
- Can we tolerate data loss?
```

### 2. Capacity Estimation (Back-of-the-envelope)

**Calculate:**
```
Traffic:
- DAU = 100M users
- Each user makes 10 requests/day
- QPS = 100M * 10 / 86400 ≈ 11,574 QPS
- Peak QPS = 2-3x average ≈ 23,000-35,000 QPS

Storage:
- 100M users * 1KB of new data per user per day = 100GB/day
- With 3x replication = 300GB/day
- Growth: 300GB/day * 365 days = 109.5TB/year

Bandwidth:
- QPS * average request size
- 11,574 * 10KB = 115.74MB/s
```

**Memory/Cache:**
- 80-20 rule: 20% of data gets 80% of traffic
- Cache size ≈ 20% of the total data set, enough to hold the hot data
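The arithmetic above can be kept in one small helper so the inputs are easy to change. This is a rough sketch: the function name, the parameter names, the choice of 1 KB of new data per user per day, the 3x peak factor, and sizing the cache from one day's unreplicated data are all assumptions for illustration.

```python
# Back-of-the-envelope capacity estimate (decimal units: 1 KB = 1,000 bytes).
SECONDS_PER_DAY = 86_400


def estimate(dau, requests_per_user_per_day, bytes_per_user_per_day,
             avg_request_bytes, replication=3, peak_factor=3, hot_fraction=0.2):
    """Return rough traffic, storage, bandwidth, and cache figures."""
    avg_qps = dau * requests_per_user_per_day / SECONDS_PER_DAY
    daily_raw = dau * bytes_per_user_per_day          # new data per day
    daily_replicated = daily_raw * replication        # after replication
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,
        "storage_per_day_bytes": daily_replicated,
        "storage_per_year_bytes": daily_replicated * 365,
        "bandwidth_bytes_per_s": avg_qps * avg_request_bytes,
        "cache_bytes": daily_raw * hot_fraction,       # 80-20 rule
    }


result = estimate(
    dau=100_000_000,
    requests_per_user_per_day=10,
    bytes_per_user_per_day=1_000,
    avg_request_bytes=10_000,
)
for name, value in result.items():
    print(f"{name}: {value:,.0f}")
```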

### 3. High-Level Design

**Core Components:**
1. **Client Layer** (Web, Mobile, Desktop)
2. **API Gateway / Load Balancer**
3. **Application Servers** (Business logic)
4. **Cache Layer** (Redis, Memcached)
5. **Database** (SQL, NoSQL, or both)
6. **Message Queue** (Kafka, RabbitMQ)
7. **Object Storage** (S3, GCS)
8. **CDN** (CloudFront, Akamai)

**Draw Architecture:**
```
[Clients] → [CDN]
            ↓
        [Load Balancer]
            ↓
    [Application Servers]
        ↙     ↓     ↘
   [Cache] [DB] [Queue] → [Workers]
                            ↓
                      [Object Storage]
```

### 4. Database Design

**SQL vs NoSQL Decision:**

**Use SQL when:**
- ACID transactions required
- Complex queries with JOINs
- Structured data with relationships
- Examples: PostgreSQL, MySQL

**Use NoSQL when:**
- Massive scale (horizontal scaling)
- Flexible schema
- High write throughput
- Examples: Cassandra, DynamoDB, MongoDB

**Sharding Strategy:**
- Hash-based: `user_id % num_shards` (simple, but changing `num_shards` remaps most keys)
- Range-based: Users 1-100M on shard 1
- Geographic: US users on US shard
- Consistent hashing: Minimizes key movement when shards are added or removed
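To compare hash-based sharding with consistent hashing, the sketch below counts how many keys change shard when a fourth shard is added. It is a minimal illustration, not a production ring: the `HashRing` class, the shard names, the SHA-256 hash, and the 100 virtual nodes per shard are assumptions.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable integer hash (Python's built-in hash() is salted per process)."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes for smoother distribution."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._ring[idx][1]


keys = [f"user:{i}" for i in range(10_000)]
before = HashRing(["shard-a", "shard-b", "shard-c"])
after = HashRing(["shard-a", "shard-b", "shard-c", "shard-d"])

ring_moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
modulo_moved = sum(_hash(k) % 3 != _hash(k) % 4 for k in keys)
print(f"consistent hashing moved {ring_moved / len(keys):.0%} of keys")
print(f"modulo hashing moved     {modulo_moved / len(keys):.0%} of keys")
```

With the ring, keys only ever move to the newly added shard; with modulo hashing, most keys are reassigned.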

**Schema Design:**
```sql
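-- Example: URL Shortener (PostgreSQL syntax)
CREATE TABLE urls (
    id BIGSERIAL PRIMARY KEY,
    short_url VARCHAR(10) UNIQUE NOT NULL,  -- UNIQUE already creates an index
    long_url TEXT NOT NULL,
    user_id BIGINT,
    created_at TIMESTAMP DEFAULT NOW(),
    expires_at TIMESTAMP,
    click_count INT DEFAULT 0
);

-- PostgreSQL has no inline INDEX clause; create secondary indexes separately
CREATE INDEX idx_urls_user_id ON urls (user_id);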
```

### 5. Deep Dive Components

**Caching Strategy:**
- **Cache-Aside**: App reads from cache, loads from DB on miss
- **Write-Through**: Write to cache and DB together
- **Write-Behind**: Write to cache, async write to DB
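A minimal cache-aside sketch, assuming an in-process dict stands in for the cache and the database is reached through caller-supplied functions. The `CacheAside` class and its method names are hypothetical. On update it writes to the database and invalidates the cached entry, which is one common choice among several.

```python
class CacheAside:
    """Read-through on miss, invalidate on write."""

    def __init__(self, load_from_db):
        self._cache = {}                  # stand-in for Redis/Memcached
        self._load_from_db = load_from_db

    def get(self, key):
        if key in self._cache:            # cache hit
            return self._cache[key]
        value = self._load_from_db(key)   # cache miss: load from the DB
        self._cache[key] = value          # populate for later reads
        return value

    def update(self, key, value, write_to_db):
        write_to_db(key, value)           # DB is the source of truth
        self._cache.pop(key, None)        # invalidate; next get() reloads


# Usage with a dict playing the role of the database.
database = {"user:1": "Ada"}
users = CacheAside(load_from_db=database.__getitem__)
print(users.get("user:1"))                          # miss, then cached
users.update("user:1", "Grace", database.__setitem__)
print(users.get("user:1"))                          # reloaded after invalidation
```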

**Eviction Policies:**
- LRU (Least Recently Used) - Most common
- LFU (Least Frequently Used)
- TTL (Time To Live) - Expires entries after a fixed lifetime; usually combined with LRU or LFU
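LRU is straightforward to sketch with the standard library's `OrderedDict`. The `LRUCache` class below is an illustrative in-memory version, not how any particular cache server implements eviction.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()       # oldest entry first

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)      # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict least recently used


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" is now the most recently used
cache.put("c", 3)      # evicts "b"
print(cache.get("b"), cache.get("a"), cache.get("c"))
```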

**Load Balancing:**
- Round Robin: Simple, equal distribution
- Least Connections: Route to least busy server
- Consistent Hashing: Minimize redistribution
- Weighted: Based on server capacity
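Two of the strategies above, sketched as in-process selectors. The class names and the server labels are hypothetical; a real load balancer would track connections from live traffic rather than explicit `acquire`/`release` calls.

```python
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotation."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Route each request to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Ties go to the server listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1


rr = RoundRobin(["app-1", "app-2", "app-3"])
print([rr.pick() for _ in range(4)])

lc = LeastConnections(["app-1", "app-2"])
first = lc.acquire()
second = lc.acquire()
lc.release(first)           # app-1 finishes its request
print(lc.acquire())         # goes back to the idle server
```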

**Message Queue Patterns:**
- **Pub/Sub**: One-to-many (notifications)
- **Work Queue**: Task distribution (job processing)
- **Fan-out**: Broadcast to multiple queues
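The difference between a work queue and pub/sub is who receives each message. A toy in-memory sketch, with hypothetical class names; real brokers such as Kafka or RabbitMQ add persistence, acknowledgements, and delivery guarantees that are not modeled here.

```python
from collections import deque


class WorkQueue:
    """Each message is consumed by exactly one worker."""

    def __init__(self):
        self._queue = deque()

    def publish(self, message):
        self._queue.append(message)

    def consume(self):
        return self._queue.popleft() if self._queue else None


class PubSub:
    """Each message is copied to every subscriber's own queue."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, name):
        self._subscribers[name] = deque()
        return self._subscribers[name]

    def publish(self, message):
        for queue in self._subscribers.values():
            queue.append(message)


jobs = WorkQueue()
jobs.publish("resize-image-1")
print(jobs.consume(), jobs.consume())     # second consume finds nothing

events = PubSub()
email_inbox = events.subscribe("email")
push_inbox = events.subscribe("push")
events.publish("order-shipped")
print(list(email_inbox), list(push_inbox))
```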

### 6. Scalability Patterns

**Horizontal Scaling:**
- Add more servers
- Use load balancers
- Stateless application servers
- Session stored in cache/DB

**Vertical Scaling:**
- Add more CPU/RAM to servers
- Limited by hardware
- Simpler but has limits

**Microservices:**
```
Monolith:
[Single App] → [DB]

Microservices:
[User Service] → [User DB]
[Post Service] → [Post DB]
[Feed Service] → [Feed DB]
```

**Benefits:**
- Independent scaling
- Technology flexibility
- Fault isolation

**Drawbacks:**
- Increased complexity
- Network latency
- Distributed transactions

### 7. Reliability & Availability

**Replication:**
- Primary-Replica (Master-Slave): One writer, multiple readers
- Multi-Primary (Master-Master): Multiple writers (conflict resolution needed)
- Multi-region: Geographic redundancy

**Failover:**
- Active-Passive: Standby server takes over
- Active-Active: Both servers handle traffic

**Rate Limiting:**
- Token bucket algorithm
- Leaky bucket algorithm
- Fixed window counter
- Sliding window log
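A token bucket refills at a steady rate and lets bursts through up to its capacity. This single-process sketch uses hypothetical names; the injectable `clock` parameter is an assumption added to make the behavior easy to test. A distributed limiter would keep the same state in a shared store.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)    # start full
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the elapsed time, never exceeding capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


limiter = TokenBucket(rate=5, capacity=10)   # 5 requests/s, bursts of 10
accepted = sum(limiter.allow() for _ in range(20))
print(f"accepted {accepted} of 20 back-to-back requests")
```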

**Circuit Breaker:**
```
States:
Closed → Normal operation
Open → Reject requests immediately
Half-Open → Test if service recovered
```
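The three states can be sketched as a small wrapper around any callable. The class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are assumptions for illustration; production libraries typically add per-exception policies and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open")   # reject immediately
            self.state = "half-open"                        # allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, opens the circuit.
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        self._failures = 0            # success closes the circuit
        self.state = "closed"
        return result


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=5.0)
print(breaker.call(lambda: "ok"), breaker.state)
```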

### 8. Common System Design Patterns

**Content Delivery:**
- Use CDN for static assets
- Geo-distributed edge servers
- Cache at edge locations

**Data Consistency:**
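- **Strong Consistency**: Every read reflects the latest committed write (typical of ACID databases)
- **Eventual Consistency**: Replicas converge over time, so reads may briefly return stale data (BASE)
- **CAP Theorem**: Often summarized as "choose 2 of 3" (Consistency, Availability, Partition Tolerance); because partitions are unavoidable in a distributed system, the real choice during a partition is between consistency and availability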

**API Design:**
```
RESTful:
GET    /api/users/{id}
POST   /api/users
PUT    /api/users/{id}
DELETE /api/users/{id}

GraphQL:
query {
  user(id: "123") {
    name
    posts {
      title
    }
  }
}
```

### 9. System Design Template

Use this structure (based on `system_design/00_template.md`):

```markdown
# {System Name}

## 1. Requirements
### Functional
- [List core features]

### Non-Functional
- Scale: [Users, QPS, Data]
- Performance: [Latency requirements]
- Availability: [Uptime target]

## 2. Capacity Estimation
- Traffic: [QPS calculations]
- Storage: [Data size, growth]
- Bandwidth: [Network requirements]

## 3. API Design
~~~
[endpoint] - [description]
~~~

## 4. High-Level Architecture
[Diagram]

## 5. Database Schema
[Tables and relationships]

## 6. Detailed Design
### Component 1
[Deep dive]

### Component 2
[Deep dive]

## 7. Scalability
[How to scale each component]

## 8. Trade-offs
[Decisions and alternatives]
```

### 10. Real-World Examples

**Reference case studies in `system_design/`:**
- Netflix: Video streaming, recommendation
- Twitter: Timeline, tweet storage, trending
- Uber: Real-time matching, location tracking
- Instagram: Image storage, feed generation
- WhatsApp: Message delivery, presence

**Common Patterns:**
- **News Feed**: Fan-out on write vs fan-out on read
- **Rate Limiter**: Token bucket with Redis
- **URL Shortener**: Base62 encoding, hash collision handling
- **Chat System**: WebSocket, message queue
- **Notification**: Push notification service, APNs/FCM
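For the URL shortener pattern, Base62 turns a numeric ID into a short alphanumeric code. A minimal sketch: the alphabet order (digits, lowercase, uppercase) is an assumption, and any fixed ordering works as long as encoding and decoding agree.

```python
import string

# 62 symbols: 0-9, a-z, A-Z (the ordering is a convention, not a standard).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def encode_base62(number: int) -> str:
    """Encode a non-negative integer ID as a Base62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number:
        number, remainder = divmod(number, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode_base62(code: str) -> int:
    """Decode a Base62 string back to the integer ID."""
    number = 0
    for char in code:
        number = number * BASE + ALPHABET.index(char)
    return number


short_code = encode_base62(125)
print(short_code, decode_base62(short_code))
```

Encoding an auto-incrementing ID this way avoids collisions entirely, while hashing the long URL requires a collision check; seven Base62 characters cover about 3.5 trillion IDs.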

## Interview Tips

**Time Management:**
- Requirements: 10%
- High-level design: 25%
- Deep dive: 50%
- Wrap up: 15%

**Communication:**
- Think out loud
- Ask clarifying questions
- Discuss trade-offs
- Acknowledge limitations

**What interviewers look for:**
- Problem-solving approach
- Technical depth
- Trade-off analysis
- Scale awareness
- Communication skills

## Common Mistakes to Avoid

- Jumping to a solution before clarifying requirements
- Over-engineering simple problems
- Underestimating scale requirements
- Ignoring single points of failure
- Not considering monitoring/alerting
- Forgetting about data consistency
- Missing security considerations

## Project Context

- Templates in `system_design/00_template.md`
- Case studies in `system_design/*.md`
- Reference materials in `doc/system_design/`
- Follow the established documentation pattern