---
name: architecture-discipline
description: Use when designing/modifying system architecture or evaluating technology choices. Enforces 7-section TodoWrite with 22+ items. Triggers: "design architecture", "system design", "architectural decision", "should we use [tech]", "compare [A] vs [B]", "add new service", "microservices", "database choice", "API design", "scale to [X] users", "infrastructure decision". If thinking ANY of these, USE THIS SKILL: "quick recommendation is fine", "obvious choice", "we already know the answer", "just need to pick one", "simple architecture question".
---

# Architecture Discipline

## When to Use

**Use when decisions affect:**
- Data models or schema changes
- Service boundaries or new services
- Deployment topology or infrastructure
- Scale characteristics (10x growth implications)
- Technology stack choices
- API contracts or external integrations

**Skip for:**
- Parameter tweaks or configuration changes
- UI/styling changes
- Localization or copy changes
- Bug fixes within existing architecture
- Adding fields to existing models (unless the change requires a schema migration)

**Threshold:** If the decision could cause a 3+ month re-architecture project if wrong, use this skill.

## CRITICAL: This Is Reasoning Discipline, Not a Checklist

The 7 sections are a **reasoning sequence**, not boxes to check:
- Alternatives BEFORE choosing
- Scale requirements BEFORE design
- Failure modes built in, not bolted on

🚨 **If you wrote architecture without starting with all 7 sections:** DELETE and restart. Retrofitting analysis is rationalization, not evaluation.

---

## MANDATORY FIRST STEP

**CREATE TodoWrite** with these 7 sections (22+ items total):

| Section | Minimum Items |
|---------|---------------|
| Scale Analysis | 4+ |
| Architectural Options | 3+ |
| Ripple Effect Analysis | 5+ |
| Failure Modes | 3+ |
| Observability | 3+ |
| Documentation | 2+ |
| Migration/Compatibility | 2+ |

**Do not design, propose solutions, or implement until TodoWrite is verified.**
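
As a rough illustration of the table above, the per-section minimums can be checked mechanically. This is a minimal sketch only: representing the TodoWrite plan as a `dict` of section name to item list is an assumption made for the example, and `missing_items` is a hypothetical helper, not part of any TodoWrite API.

```python
# Minimums copied from the section table; they sum to the 22-item floor.
SECTION_MINIMUMS = {
    "Scale Analysis": 4,
    "Architectural Options": 3,
    "Ripple Effect Analysis": 5,
    "Failure Modes": 3,
    "Observability": 3,
    "Documentation": 2,
    "Migration/Compatibility": 2,
}


def missing_items(plan: dict[str, list[str]]) -> dict[str, int]:
    """Return how many items each section still needs.

    `plan` is a hypothetical in-memory view of the TodoWrite list
    (section name -> item texts). Sections that meet their minimum
    are omitted from the result.
    """
    shortfalls = {}
    for section, minimum in SECTION_MINIMUMS.items():
        have = len(plan.get(section, []))
        if have < minimum:
            shortfalls[section] = minimum - have
    return shortfalls
```

An empty result means the count requirement is met; it says nothing about item quality, which the verification checkpoint covers.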

---

## Verification Checkpoint

After creating TodoWrite, pick 3 items at random and verify that each one passes this test:

**Each item must have ALL THREE:**
- ✓ Concrete numbers/thresholds ("100K users", "$500/mo", "P95 < 500ms")
- ✓ Specific tools/technologies ("PostgreSQL", "Redis", "CloudWatch")
- ✓ Measurable outcome ("handles 1M req/sec", "costs $X at 10x")

| ❌ FAILS | ✅ PASSES |
|----------|-----------|
| "Add monitoring" | "CloudWatch: `websocket.connections.active`, alert if >5% error rate via PagerDuty" |
| "Evaluate caching" | "Compare: Redis (1ms, $300/mo) vs In-memory LRU (0.1ms, $0) vs No cache (100ms)" |
| "Analyze scale" | "Current: 100K DAU, 50 req/sec. 10x: 1M users, 500 req/sec. Bottleneck: PostgreSQL connection pool" |

**DO NOT PROCEED until 22+ items AND quality check passes.**
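
The spot check can be sketched as follows, assuming each item is recorded with its three required parts as separate fields. The `TodoItem` class and `spot_check` function are hypothetical names introduced for this example; whether a threshold or outcome is actually meaningful still needs human judgment.

```python
import random
from dataclasses import dataclass


@dataclass
class TodoItem:
    """Hypothetical structured form of one TodoWrite item."""
    text: str
    threshold: str = ""  # concrete number, e.g. "P95 < 500ms"
    tool: str = ""       # specific technology, e.g. "CloudWatch"
    outcome: str = ""    # measurable outcome, e.g. "alert if >5% error rate"

    def passes(self) -> bool:
        # An item passes only when all three parts are present.
        return all(part.strip() for part in (self.threshold, self.tool, self.outcome))


def spot_check(items, sample_size=3, rng=None):
    """Sample items at random and return the ones that fail the test."""
    rng = rng or random.Random()
    sample = rng.sample(items, min(sample_size, len(items)))
    return [item for item in sample if not item.passes()]
```

If `spot_check` returns anything, the failing items (and likely their neighbors) need rewriting before design work starts.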

---

## Section Requirements

### 1. Scale Analysis (4+ items)

**NEVER design for current scale only.** Before proposing any solution:
- Current scale: Users (DAU/MAU), requests/sec, data volume, read/write ratio
- 10x scale: What numbers at 10x? When expected?
- Bottlenecks: What breaks at 10x? (DB connections, API limits, memory)
- Mitigation: Specific solution for each bottleneck
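
A back-of-the-envelope projection for one such bottleneck, database connections, might look like the sketch below. It uses Little's law (average concurrency = arrival rate × hold time). The 50 req/sec figure comes from the example earlier in this document; the 100 ms hold time and pool size of 20 are assumed values chosen only to make the arithmetic visible.

```python
def connections_in_use(requests_per_sec: int, avg_hold_ms: int) -> float:
    """Little's law: average concurrent connections = rate x hold time."""
    return requests_per_sec * avg_hold_ms / 1000


def scale_report(current_rps: int, avg_hold_ms: int, pool_size: int, factor: int = 10) -> dict:
    """Compare connection demand now and at `factor`x against the pool size."""
    projected_rps = current_rps * factor
    needed_projected = connections_in_use(projected_rps, avg_hold_ms)
    return {
        "projected_rps": projected_rps,
        "connections_now": connections_in_use(current_rps, avg_hold_ms),
        "connections_projected": needed_projected,
        "pool_is_bottleneck": needed_projected > pool_size,
    }


# Assumed inputs: 100 ms average hold time, pool of 20 connections.
report = scale_report(current_rps=50, avg_hold_ms=100, pool_size=20)
```

Under these assumptions the pool is comfortable today (about 5 connections in use) but undersized at 10x (about 50), which is the kind of concrete finding a Scale Analysis item should record along with its mitigation.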

### 2. Architectural Options (3+ items)

**NEVER present a single solution.** Minimum 3 distinct options, each with:
- Performance: Latency (P50/P95/P99), throughput, scale limit
- Complexity: LOC estimate, services involved, operational burden
- Cost: Infrastructure ($X/mo current, $Y/mo at 10x), development (engineer-weeks)
- Trade-offs: Specific advantages (✅) and disadvantages (❌)

**If a stakeholder suggests a solution:** Add it as Option A and evaluate it with the SAME rigor as the alternatives.
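
One way to keep every option held to the same rigor is to record them in a common shape and list what is still missing. The sketch below is illustrative: the `Option` fields and the `evaluation_gaps` helper are hypothetical and cover only a subset of the dimensions listed above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Option:
    """Hypothetical record for one architectural option."""
    name: str
    p95_latency_ms: Optional[float] = None
    monthly_cost_usd: Optional[float] = None      # at current scale
    monthly_cost_10x_usd: Optional[float] = None  # at 10x scale
    engineer_weeks: Optional[float] = None
    pros: list[str] = field(default_factory=list)
    cons: list[str] = field(default_factory=list)


REQUIRED_NUMBERS = ("p95_latency_ms", "monthly_cost_usd", "monthly_cost_10x_usd", "engineer_weeks")


def evaluation_gaps(options: list[Option]) -> list[str]:
    """List what is missing before the options can be compared fairly."""
    gaps = []
    if len(options) < 3:
        gaps.append(f"need at least 3 options, have {len(options)}")
    for opt in options:
        for attr in REQUIRED_NUMBERS:
            if getattr(opt, attr) is None:
                gaps.append(f"{opt.name}: missing {attr}")
        if not opt.pros or not opt.cons:
            gaps.append(f"{opt.name}: needs at least one advantage and one disadvantage")
    return gaps
```

A stakeholder's preferred option goes through `evaluation_gaps` like any other, so an unmeasured favorite shows up as a list of gaps rather than a foregone conclusion.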

### 3. Ripple Effect Analysis (5+ items)

Changes propagate across layers. Analyze ALL:
- Data layer: Schema changes, migrations, indexes, query performance
- Services: Which need updates? API contracts changed?
- API: Breaking changes? Version bump? Backward compatibility?
- Clients: Mobile updates? Web UI changes?
- Operations: Deployment changes? New monitoring? Cost changes?

### 4. Failure Modes (3+ items)

For each mode:
- Scenario: [Component] fails because [reason]
- Detection: How we know (metrics drop, error rate spike)
- Impact: What breaks (user features, data integrity)
- Mitigation: Circuit breaker, fallback, redundancy
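
To make the "circuit breaker plus fallback" mitigation concrete, here is a minimal single-threaded sketch. The class name, thresholds, and the injectable `clock` parameter are assumptions for illustration; a production system would more likely use an established resilience library and add locking.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and serve a fallback instead."""

    def __init__(self, failure_threshold=3, reset_after_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                return fallback()  # open: skip the dependency entirely
            self.opened_at = None  # timeout elapsed: allow one trial call
        try:
            result = operation()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return fallback()
        self.consecutive_failures = 0  # success closes the circuit fully
        return result
```

In a Failure Modes item, the matching Detection entry would be a metric on how often the breaker is open, and the Impact entry would state what users see while the fallback is being served.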

### 5. Observability (3+ items)

- Metrics: Specific (latency P95, error rate %, throughput)
- Alerts: Conditions (error rate > 5%, latency P95 > 500ms)
- Dashboards: Key visualizations
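
The two example alert conditions above (error rate > 5%, latency P95 > 500ms) can be expressed as a small evaluation function. This is a sketch using the nearest-rank percentile definition; the function names and alert labels are hypothetical, and a real deployment would define these rules in its monitoring system rather than in application code.

```python
import math


def percentile(samples, pct: int):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def firing_alerts(latencies_ms, error_count, request_count,
                  p95_limit_ms=500, error_rate_limit=0.05):
    """Return the names of alert conditions that currently hold."""
    alerts = []
    if request_count and error_count / request_count > error_rate_limit:
        alerts.append("error_rate")
    if latencies_ms and percentile(latencies_ms, 95) > p95_limit_ms:
        alerts.append("latency_p95")
    return alerts
```

Writing the condition down this precisely forces the choices that vague items hide: which percentile, over which window of samples, and whether the threshold is inclusive.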

### 6. Documentation (2+ items)

- ADR: Chosen option, rejected alternatives, trade-offs, constraints
- Diagram: New components, data flows, failure paths

### 7. Migration/Compatibility (2+ items)

- Backward compatibility: Old clients work? API versioning?
- Migration path: Phased rollout, feature flags, rollback procedure
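
A phased rollout behind a feature flag is often implemented with deterministic percentage bucketing, sketched below. The `in_rollout` function and the flag name in the comment are hypothetical; real feature-flag services provide their own equivalents.

```python
import hashlib


def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Decide whether a user sees the new path at the given rollout percentage.

    Hashing flag + user ID gives each user a stable bucket in 0..99, so the
    same user stays enabled as `percent` grows, and setting `percent` to 0
    is an immediate rollback.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


# Example phases (assumed): 1% -> 10% -> 50% -> 100%, with 0% as the rollback.
# in_rollout("user-123", "new-storage-path", 10)
```

Because bucketing is deterministic, a Migration/Compatibility item can state exact phase percentages and the rollback step as a single config change.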

---

## Red Flags - STOP When You Think:

| Thought | Reality |
|---------|---------|
| "Analysis paralysis" | This IS the analysis that prevents expensive mistakes |
| "We'll add scale/alternatives/failure modes later" | Retrofitting costs 5-10x more |
| "CTO already decided" | Still needs independent evaluation |
| "Being pragmatic not dogmatic" | These requirements ARE pragmatic |
| "Just a simple feature" | Simple becomes complex at scale |
| "We already know the solution" | Compare 3 alternatives first |
| "Keep it simple" | Simple for current scale = complex re-architecture at 10x |
| "I can add missing sections to existing work" | DELETE and restart |

---

## Override Requirements

To skip ANY requirement, you MUST provide ALL 4:
1. Specific retrofit date (not "later")
2. Budget allocated (engineer-weeks)
3. Risk acceptance signed by decision maker
4. Interim mitigation plan

| Skipped | Risk | Cost |
|---------|------|------|
| Scale Analysis | Re-architecture in 6-12 months | 3-6 month project, 5-10x cost |
| Alternatives | Optimize wrong dimension | 2-4 month migration |
| Failure Modes | Production incidents | $5-50K per incident |
| Ripple Effects | Broken clients, data issues | Deployment failures |

---

## Verification Before Complete

| Category | Requirements |
|----------|-------------|
| Scale | ✓ Current + 10x projected ✓ Bottlenecks ✓ Mitigations |
| Trade-offs | ✓ 3+ options ✓ Performance/complexity/cost ✓ Rationale |
| Impact | ✓ All layers analyzed ✓ Breaking changes identified |
| Failure | ✓ Specific modes ✓ Detection ✓ Mitigation ✓ Rollback |
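| Observability | ✓ Metrics ✓ Alerts ✓ Dashboards |
| Documentation | ✓ ADR ✓ Diagram updated |
| Migration | ✓ Backward compatibility ✓ Migration path ✓ Rollback procedure |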

**If any item missing, do not proceed to implementation.**