--- name: architecture-design description: Use when designing system architecture, making high-level technical decisions, or planning major system changes. Focuses on structure, patterns, and long-term strategy. allowed-tools: - Read - Grep - Glob - Bash --- # Architecture Design Skill Design system architecture and make strategic technical decisions. ## Core Principle **Good architecture enables change while maintaining simplicity.** ## Architecture vs Planning **Architecture Design (this skill):** - Strategic: "How should the system be structured?" - Component interactions and boundaries - Technology and pattern choices - Long-term implications - System-level decisions **Technical Planning (technical-planning skill):** - Tactical: "How do I implement feature X?" - Specific implementation tasks - Execution details - Short-term focus **Use architecture when:** - Designing new systems or subsystems - Major refactors affecting multiple components - Technology selection decisions - Defining system boundaries and interfaces - Making decisions with long-term impact **Use planning when:** - Implementing within existing architecture - Breaking down specific features - Task sequencing and execution ## Architecture Process ### 1. Understand Context **Business context:** - What problem are we solving? - Who are the users? - What are the business goals? - What are the success metrics? **Technical context:** - What exists today? - What constraints exist? - What must we integrate with? - What scale must we support? **Team context:** - What's our expertise? - What can we maintain? - What's our velocity? ### 2. Gather Requirements **Functional requirements:** - What must the system do? - What are the features? - What are the user scenarios? **Non-functional requirements:** - **Performance**: Response time, throughput - **Scalability**: Expected load, growth - **Availability**: Uptime requirements - **Security**: Compliance, data protection - **Maintainability**: Team size, skills - **Cost**: Budget constraints **Example:** ```markdown ## Requirements ### Functional - Users can search products by name/category - Users can add items to cart - Users can checkout and pay ### Non-Functional - Search response time < 200ms (p95) - Support 10,000 concurrent users - 99.9% uptime - PCI DSS compliant for payments - Team of 5 developers can maintain ``` ### 3. Identify Constraints **Technical constraints:** - Must use existing authentication system - Must integrate with legacy inventory system - Database must be PostgreSQL (existing infrastructure) **Business constraints:** - Must launch in 3 months - Budget of $50k for infrastructure - Must support EU data residency **Team constraints:** - Team experienced in Python, less in Go - No DevOps specialist on team - Remote team across timezones ### 4. Consider Alternatives **Never design in a vacuum - consider options:** **Example: Data storage choice** **Option 1: PostgreSQL** - Pros: Team knows it, ACID guarantees, rich query support - Cons: Vertical scaling limits, setup complexity **Option 2: MongoDB** - Pros: Flexible schema, horizontal scaling - Cons: Team unfamiliar, eventual consistency **Option 3: DynamoDB** - Pros: Fully managed, auto-scaling - Cons: Vendor lock-in, query limitations, cost at scale **Decision: PostgreSQL** - Team expertise outweighs scaling concerns - Can re-evaluate if scale becomes issue - Faster initial development ### 5. Design System Structure **Define components and their responsibilities:** ``` ┌─────────────────────────────────────────────┐ │ Client Apps │ │ (Web, iOS, Android) │ └────────────────┬────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────┐ │ API Gateway / Load Balancer │ └────────────────┬────────────────────────────┘ │ ┌────────┴────────┐ ▼ ▼ ┌───────────────┐ ┌───────────────┐ │ Auth │ │ Core API │ │ Service │ │ Service │ └───────┬───────┘ └───────┬───────┘ │ │ │ ┌────────┴────────┐ │ ▼ ▼ │ ┌──────────────┐ ┌──────────────┐ │ │ PostgreSQL │ │ Redis │ │ │ (Primary) │ │ (Cache) │ │ └──────────────┘ └──────────────┘ │ ▼ ┌───────────────┐ │ User DB │ └───────────────┘ ``` **Component descriptions:** ```markdown ## Components ### API Gateway **Responsibility:** Route requests, rate limiting, authentication **Technology:** Nginx **Dependencies:** Auth Service, Core API Service **Scale:** 2-3 instances behind load balancer ### Auth Service **Responsibility:** User authentication, session management, JWT issuing **Technology:** Python (Flask), PostgreSQL **API:** REST **Scale:** Stateless, 2-N instances ### Core API Service **Responsibility:** Business logic, data access, external integrations **Technology:** Python (FastAPI), PostgreSQL, Redis **API:** REST **Scale:** Stateless, 2-N instances ### PostgreSQL **Responsibility:** Primary data store **Scale:** Primary with read replica ### Redis **Responsibility:** Session storage, caching, rate limiting **Scale:** Cluster mode (3 nodes) ``` ### 6. Define Interfaces **API contracts:** ```markdown ## API Design ### POST /api/auth/login **Purpose:** Authenticate user, issue JWT **Request:** ```json { "email": "user@example.com", "password": "secure_password" } ``` **Response (200):** ```json { "token": "eyJ...", "user": { "id": "123", "email": "user@example.com", "name": "John Doe" } } ``` **Errors:** - 400: Invalid request - 401: Invalid credentials - 429: Rate limit exceeded ``` ### 7. Plan for Failure **What can go wrong?** - Database unavailable - External API down - Network partition - High load - Data corruption **Mitigation strategies:** - Retry with exponential backoff - Circuit breakers for external services - Graceful degradation - Health checks and monitoring - Database backups **Example:** ```markdown ## Failure Scenarios ### Database Unavailable **Impact:** Cannot read/write data **Mitigation:** - Read replica failover (automated) - Circuit breaker after 3 failures - Cache serves stale data for 5 minutes - User sees degraded experience message **Recovery:** Manual failover to replica, fix primary ### External Payment API Down **Impact:** Cannot process payments **Mitigation:** - Retry 3 times with exponential backoff - Queue payments for later processing - User notified of delay - Alert on-call engineer **Recovery:** Process queued payments once API recovers ``` ### 8. Document Decisions **Architecture Decision Record (ADR):** ```markdown # ADR-001: Use PostgreSQL for Primary Database **Status:** Accepted **Date:** 2024-01-15 **Deciders:** Tech Lead, Backend Team ## Context We need to choose a primary database for user data, products, and orders. Requirements: - Strong consistency (ACID) - Complex queries (joins, aggregations) - < 200ms query time for 90% of queries - Support 100k users initially ## Decision Use PostgreSQL as primary database. ## Alternatives Considered ### MongoDB - **Pros:** Flexible schema, horizontal scaling - **Cons:** Team unfamiliar, eventual consistency issues - **Why not:** Team expertise more valuable than flexibility ### DynamoDB - **Pros:** Managed service, auto-scaling - **Cons:** Vendor lock-in, limited query capability, cost - **Why not:** Query limitations would hurt development velocity ### MySQL - **Pros:** Similar to PostgreSQL, team knows it - **Cons:** Less feature-rich than PostgreSQL - **Why not:** PostgreSQL offers JSON support, better full-text search ## Consequences **Positive:** - Team can be productive immediately - Strong consistency guarantees - Rich query capabilities - JSON support for flexible data **Negative:** - Vertical scaling limits (mitigated: can add read replicas) - More complex than managed services (mitigated: use RDS) - Higher operational overhead **Trade-offs:** - Chose familiarity over horizontal scaling - Chose rich queries over eventual consistency - Can re-evaluate if scale requirements change ## Validation - Team confirmed expertise in PostgreSQL - Load testing shows meets performance requirements - Cost analysis shows acceptable for first year ``` ## Architecture Principles ### 1. Simplicity **Start simple, add complexity only when needed.** ``` ❌ BAD: Microservices from day 1 with 20 services ✅ GOOD: Start with monolith, split when needed ``` **Apply YAGNI:** You Aren't Gonna Need It - Don't build for hypothetical future - Add when actually needed - Simpler is easier to maintain ### 2. Separation of Concerns **Each component has one clear responsibility.** ``` ✅ GOOD: - Auth Service: Authentication only - User Service: User profile management - Order Service: Order processing ❌ BAD: - God Service: Does everything ``` **Apply SOLID principles:** - Single Responsibility - Open/Closed - Liskov Substitution - Interface Segregation - Dependency Inversion ### 3. Loose Coupling **Components depend on interfaces, not implementations.** ```typescript // ❌ BAD: Tight coupling class OrderService { constructor(private db: PostgresDatabase) {} } // ✅ GOOD: Loose coupling class OrderService { constructor(private db: Database) {} // Interface } ``` **Benefits:** - Easier to test (mock interface) - Easier to swap implementations - Components can evolve independently ### 4. High Cohesion **Related functionality stays together.** ``` ✅ GOOD: user/ - create_user.ts - update_user.ts - delete_user.ts - user_repository.ts ❌ BAD: create/ - create_user.ts - create_order.ts update/ - update_user.ts - update_order.ts ``` ### 5. Explicit Over Implicit **Make dependencies and contracts clear.** ```typescript // ❌ BAD: Implicit dependency function processOrder(orderId: string) { const db = global.database // Where does this come from? // ... } // ✅ GOOD: Explicit dependency function processOrder( orderId: string, db: Database, logger: Logger ) { // Dependencies are clear } ``` ### 6. Fail Fast **Detect and report errors early.** ```typescript // ❌ BAD: Silent failure function divide(a: number, b: number) { if (b === 0) return 0 // Wrong! return a / b } // ✅ GOOD: Fail fast function divide(a: number, b: number) { if (b === 0) { throw new Error('Division by zero') } return a / b } ``` ### 7. Design for Testability **Make it easy to test.** ```typescript // ❌ BAD: Hard to test class OrderService { processOrder(orderId: string) { const db = new PostgresDatabase() // Can't mock const api = new PaymentAPI() // Can't mock // ... } } // ✅ GOOD: Easy to test class OrderService { constructor( private db: Database, // Can inject mock private api: PaymentAPI // Can inject mock ) {} processOrder(orderId: string) { // ... } } ``` ## Common Architecture Patterns ### Layered Architecture ``` ┌─────────────────────┐ │ Presentation │ (UI, API controllers) ├─────────────────────┤ │ Business Logic │ (Domain, services) ├─────────────────────┤ │ Data Access │ (Repositories, ORMs) ├─────────────────────┤ │ Database │ (Storage) └─────────────────────┘ ``` **When to use:** Simple to moderate complexity ### Hexagonal Architecture (Ports & Adapters) ``` ┌───────────────────────┐ │ External Systems │ │ (UI, DB, APIs) │ └──────────┬────────────┘ │ ┌──────────▼────────────┐ │ Adapters │ (Implementation) │ (REST, PostgreSQL) │ └──────────┬────────────┘ │ ┌──────────▼────────────┐ │ Ports │ (Interfaces) │ (IUserRepo, IAuth) │ └──────────┬────────────┘ │ ┌──────────▼────────────┐ │ Core Domain │ (Business logic) │ (Pure logic) │ └───────────────────────┘ ``` **When to use:** Want to isolate business logic, multiple frontends ### Microservices ``` ┌─────────┐ ┌─────────┐ ┌─────────┐ │ User │ │ Order │ │ Payment │ │ Service │ │ Service │ │ Service │ └────┬────┘ └────┬────┘ └────┬────┘ │ │ │ └────────────┴────────────┘ │ ┌───────▼────────┐ │ Message Bus │ │ (Event-driven)│ └────────────────┘ ``` **When to use:** Large team, need independent deploy, clear boundaries **Avoid when:** Small team, unclear boundaries, early stage ### Event-Driven Architecture ``` ┌─────────┐ ┌─────────────┐ ┌─────────┐ │Producer │──────▶│ Event Bus │──────▶│Consumer │ └─────────┘ └─────────────┘ └─────────┘ ``` **When to use:** Async processing, decoupled systems, audit trails ## Anti-Patterns ### ❌ Premature Optimization **Don't optimize for scale you don't have.** ``` BAD: Build microservices for 100 users GOOD: Start with monolith, split when needed ``` ### ❌ Resume-Driven Architecture **Don't choose technology to pad resume.** ``` BAD: "I want to learn Kubernetes, let's use it" GOOD: "Kubernetes fits our scale needs" ``` ### ❌ Distributed Monolith **Microservices that are tightly coupled.** ``` BAD: Service A can't deploy without Service B GOOD: Services are independently deployable ``` ### ❌ Big Ball of Mud **No structure, everything depends on everything.** ``` BAD: Any code can call any other code GOOD: Clear layers and boundaries ``` ### ❌ Analysis Paralysis **Over-analyzing, never shipping.** ``` BAD: Spend 6 months on perfect architecture GOOD: Design enough to start, iterate ``` ## Architecture Review Checklist - [ ] Business goals clearly understood - [ ] Functional requirements documented - [ ] Non-functional requirements defined - [ ] Constraints identified - [ ] Multiple alternatives considered - [ ] Trade-offs explicitly stated - [ ] Component responsibilities clear - [ ] Interfaces well-defined - [ ] Failure scenarios planned for - [ ] Security considered - [ ] Scalability addressed - [ ] Testability designed in - [ ] Decisions documented (ADRs) - [ ] Team can implement and maintain ## Integration with Other Skills - Apply **solid-principles** - Guide component design - Apply **simplicity-principles** - KISS, YAGNI - Apply **orthogonality-principle** - Independent components - Apply **structural-design-principles** - Composition patterns - Use **technical-planning** - For implementation after design ## Remember 1. **Simplicity first** - Start simple, add complexity when needed 2. **Document decisions** - Future you will thank you 3. **Consider alternatives** - Never the first idea only 4. **State trade-offs** - Every decision has consequences 5. **Design for change** - Systems evolve **The best architecture is the one that's simple enough to ship and flexible enough to evolve.**