---
name: infrastructure-documenter
description: Expert guide for documenting infrastructure including architecture diagrams, runbooks, system documentation, and operational procedures. Use when creating technical documentation for systems and deployments.
---

# Infrastructure Documenter Skill

## Overview

This skill helps you create clear, maintainable infrastructure documentation. Covers architecture diagrams, runbooks, system documentation, operational procedures, and documentation-as-code practices.

## Documentation Philosophy

### Principles

1. **Living documentation**: Keep it in sync with reality
2. **Audience-aware**: Different docs for different readers
3. **Actionable**: Every doc should help someone do something
4. **Version-controlled**: Documentation changes tracked with code

### Document Types

| Type | Audience | Purpose |
|------|----------|---------|
| Architecture | Engineers | Understand system design |
| Runbooks | Ops/SRE | Handle incidents |
| API Docs | Developers | Integrate with system |
| Onboarding | New hires | Get up to speed |
| Decision Records | Future you | Understand why |

## Architecture Documentation

### System Architecture Overview

```markdown
# System Architecture

## Overview
[Project Name] is a [type] application that [purpose].

## High-Level Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                            Users                            │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                         Vercel Edge                         │
│  ┌─────────────────┐  ┌─────────────────┐                   │
│  │   Next.js App   │  │ Edge Functions  │                   │
│  └─────────────────┘  └─────────────────┘                   │
└─────────────────────────────────────────────────────────────┘
                               │
         ┌─────────────────────┼─────────────────────┐
         ▼                     ▼                     ▼
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│    Supabase     │   │      Redis      │   │     Stripe      │
│  - PostgreSQL   │   │  - Session      │   │  - Payments     │
│  - Auth         │   │  - Cache        │   │  - Webhooks     │
│  - Realtime     │   │                 │   │                 │
│  - Storage      │   │                 │   │                 │
└─────────────────┘   └─────────────────┘   └─────────────────┘
```

## Components

### Frontend (Next.js App)
- **Location**: Vercel Edge Network
- **Framework**: Next.js 14 (App Router)
- **Styling**: Tailwind CSS + shadcn/ui
- **State**: Zustand + React Query

### Backend Services

| Service | Provider | Purpose |
|---------|----------|---------|
| Database | Supabase | PostgreSQL with RLS |
| Auth | Supabase Auth | User authentication |
| Storage | Supabase Storage | File uploads |
| Cache | Upstash Redis | Session & API cache |
| Payments | Stripe | Subscriptions |
| Email | Resend | Transactional emails |

### Data Flow
1. User request → Vercel Edge
2. SSR/API Route processes request
3. Database queries via Supabase client
4. Response cached at edge (when applicable)
5. Response returned to user

## Security

### Authentication Flow
1. User signs in via Supabase Auth
2. JWT token issued and stored in cookie
3. Server validates token on each request
4. RLS policies enforce data access

### Data Protection
- All data encrypted at rest (AES-256)
- TLS 1.3 for data in transit
- Secrets stored in Vercel environment variables
- PII fields encrypted in database
```

### Mermaid Diagrams

```markdown
## Request Flow

```mermaid
sequenceDiagram
    participant U as User
    participant V as Vercel
    participant N as Next.js
    participant S as Supabase
    participant R as Redis

    U->>V: HTTPS Request
    V->>N: Route to App
    N->>R: Check Cache

    alt Cached Response
        R-->>N: Cache Hit
        N-->>U: Return Cached
    else Cache Miss
        R-->>N: Cache Miss
        N->>S: Query Database
        S-->>N: Data
        N->>R: Store in Cache
        N-->>U: Return Response
    end
```

## Database Schema

```mermaid
erDiagram
    users ||--o{ projects : owns
    users {
        uuid id PK
        text email
        text name
        timestamp created_at
    }
    projects ||--o{ tasks : contains
    projects {
        uuid id PK
        uuid user_id FK
        text name
        text status
    }
    tasks {
        uuid id PK
        uuid project_id FK
        text title
        boolean completed
    }
```
```

## Runbooks

### Runbook Template

```markdown
# Runbook: [Service Name] - [Issue Type]

## Overview
Brief description of the issue and when this runbook applies.

## Severity
- **P1 (Critical)**: Complete outage
- **P2 (High)**: Degraded service
- **P3 (Medium)**: Minor impact
- **P4 (Low)**: No user impact

## Detection
How this issue is typically detected:
- [ ] Alert from [monitoring system]
- [ ] User report
- [ ] Automated check failure

## Impact Assessment
- **Users affected**: All / Segment / None
- **Data at risk**: Yes / No
- **Revenue impact**: High / Medium / Low / None

## Prerequisites
- [ ] Access to [system/dashboard]
- [ ] Credentials for [service]
- [ ] Contact info for [team/person]

## Resolution Steps

### Step 1: Verify the Issue
```bash
# Check service status
curl -I https://api.example.com/health

# Check logs
vercel logs --follow
```

### Step 2: Identify Root Cause
Common causes:
- [ ] Database connection pool exhausted
- [ ] Memory limit reached
- [ ] External service down
- [ ] Bad deployment

### Step 3: Apply Fix

#### If Database Issue:
```sql
-- Check connection count
SELECT count(*) FROM pg_stat_activity;

-- Kill connections that have been idle for more than an hour
-- (state_change is when the connection entered its current state)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '1 hour';
```

#### If Bad Deployment:
```bash
# Rollback to previous deployment
vercel rollback
```

### Step 4: Verify Fix
```bash
# Check service health
curl https://api.example.com/health

# Monitor error rates for 15 minutes
```

## Escalation
If unable to resolve within 30 minutes:
1. Page on-call engineer: [contact]
2. Notify stakeholders in #incidents
3. Update status page

## Post-Incident
- [ ] Create incident report
- [ ] Schedule post-mortem (P1/P2 only)
- [ ] Update this runbook if needed

## Related Links
- [Dashboard](https://dashboard.example.com)
- [Logs](https://logs.example.com)
- [Metrics](https://metrics.example.com)
```

### Database Runbooks

```markdown
# Runbook: Database Performance Issues

## Symptoms
- Slow API responses (>1s)
- Timeout errors in logs
- High database CPU in dashboard

## Quick Checks

### 1. Check Active Connections
```sql
SELECT state, count(*), max(now() - query_start) AS max_duration
FROM pg_stat_activity
GROUP BY state;
```

### 2. Find Long-Running Queries
```sql
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '30 seconds'
ORDER BY duration DESC;
```

### 3. Check Table Sizes
```sql
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC
LIMIT 10;
```

### 4. Check Missing Indexes
```sql
-- idx_scan is NULL for tables with no indexes, so treat it as 0
SELECT relname, seq_scan, idx_scan,
       seq_scan - coalesce(idx_scan, 0) AS difference
FROM pg_stat_user_tables
WHERE seq_scan > coalesce(idx_scan, 0)
ORDER BY difference DESC;
```

## Resolution

### Kill Problematic Queries
```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid = [PID_FROM_ABOVE];
```

### Add Missing Index
```sql
CREATE INDEX CONCURRENTLY idx_table_column
ON table_name (column_name);
```
```

## Decision Records (ADRs)

### ADR Template

```markdown
# ADR-001: Choose Supabase for Database

## Status
Accepted

## Context
We need a database solution for [Project Name] that supports:
- PostgreSQL compatibility
- Real-time subscriptions
- Built-in authentication
- Easy local development
- Generous free tier

## Decision
We will use Supabase as our primary database and auth provider.

## Alternatives Considered

### PlanetScale
**Pros:**
- Excellent scaling
- Branching for schema changes
- MySQL compatible

**Cons:**
- No built-in auth
- No real-time subscriptions
- Additional services needed

### Firebase
**Pros:**
- Real-time built-in
- Mature platform
- Good mobile SDKs

**Cons:**
- NoSQL (not ideal for our use case)
- Vendor lock-in concerns
- Complex security rules

## Consequences

### Positive
- Single provider for DB + Auth + Storage
- Great developer experience
- Row Level Security for data protection
- Local development with supabase CLI

### Negative
- PostgreSQL-specific features tie us to provider
- Supabase still maturing (some rough edges)
- Limited to their managed offering

### Risks
- Supabase scaling limitations at high traffic
- Migration cost if we need to move

## References
- [Supabase Documentation](https://supabase.com/docs)
- [Comparison: Supabase vs Firebase](https://...)
```

## API Documentation

### Endpoint Documentation

```markdown
# API Reference

## Base URL
```
Production: https://api.example.com/v1
Staging: https://staging-api.example.com/v1
```

## Authentication
All API requests require authentication via Bearer token.

```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
  https://api.example.com/v1/users
```

## Endpoints

### Users

#### Get Current User
```
GET /users/me
```

**Response:**
```json
{
  "id": "usr_123",
  "email": "user@example.com",
  "name": "John Doe",
  "created_at": "2024-01-01T00:00:00Z"
}
```

#### Update User
```
PATCH /users/me
```

**Request Body:**

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| name | string | No | Display name |
| avatar_url | string | No | Profile image URL |

**Example:**
```bash
curl -X PATCH \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "Jane Doe"}' \
  https://api.example.com/v1/users/me
```

### Error Responses

| Status | Code | Description |
|--------|------|-------------|
| 400 | BAD_REQUEST | Invalid request body |
| 401 | UNAUTHORIZED | Missing or invalid token |
| 403 | FORBIDDEN | Insufficient permissions |
| 404 | NOT_FOUND | Resource not found |
| 429 | RATE_LIMITED | Too many requests |
| 500 | INTERNAL_ERROR | Server error |

**Error Response Format:**
```json
{
  "error": {
    "code": "NOT_FOUND",
    "message": "User not found"
  }
}
```
```

## Environment Documentation

### Environment Matrix

```markdown
# Environments

## Overview

| Environment | URL | Purpose | Deploy |
|-------------|-----|---------|--------|
| Production | https://myapp.com | Live users | Manual (main) |
| Staging | https://staging.myapp.com | Pre-release testing | Auto (main) |
| Preview | https://pr-*.vercel.app | PR review | Auto (PR) |
| Development | http://localhost:3000 | Local dev | Manual |

## Configuration

### Production
```env
NODE_ENV=production
DATABASE_URL=[Supabase Production]
NEXT_PUBLIC_APP_URL=https://myapp.com
```

### Staging
```env
NODE_ENV=production
DATABASE_URL=[Supabase Staging Branch]
NEXT_PUBLIC_APP_URL=https://staging.myapp.com
```

### Development
```env
NODE_ENV=development
DATABASE_URL=[Local Supabase]
NEXT_PUBLIC_APP_URL=http://localhost:3000
```

## Access

### Production
- **Vercel**: Admin only
- **Database**: Read-only for devs, write for admin
- **Logs**: All engineers

### Staging
- **Vercel**: All engineers
- **Database**: All engineers
- **Logs**: All engineers

## Secrets Rotation

| Secret | Rotation | Last Rotated |
|--------|----------|--------------|
| Database password | 90 days | 2024-01-15 |
| API keys | 90 days | 2024-01-15 |
| JWT secret | Never | Initial setup |
```

## Documentation-as-Code

### Documentation Structure

```
docs/
├── README.md                 # Documentation index
├── architecture/
│   ├── overview.md           # System architecture
│   ├── data-flow.md          # Data flow diagrams
│   └── decisions/            # ADRs
│       ├── 001-database.md
│       └── 002-hosting.md
├── runbooks/
│   ├── README.md             # Runbook index
│   ├── database.md           # Database issues
│   ├── deployment.md         # Deployment issues
│   └── outage.md             # Service outage
├── api/
│   └── reference.md          # API documentation
└── onboarding/
    ├── setup.md              # Local setup
    └── contributing.md       # How to contribute
```

### Auto-Generated Documentation

```yaml
# .github/workflows/docs.yml
name: Generate Docs

on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'docs/**'

# Lets GITHUB_TOKEN push the generated site to the gh-pages branch
permissions:
  contents: write

jobs:
  generate-docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Generate API docs from OpenAPI
        run: |
          npx @redocly/cli build-docs openapi.yaml \
            --output docs/api/index.html

      - name: Generate TypeDoc
        run: npx typedoc --out docs/api/typescript

      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./docs
```

## Documentation Checklist

### Architecture Docs
- [ ] System overview diagram
- [ ] Component descriptions
- [ ] Data flow documentation
- [ ] Security architecture
- [ ] Technology decisions (ADRs)

### Operational Docs
- [ ] Runbooks for common issues
- [ ] Deployment procedures
- [ ] Monitoring and alerting
- [ ] Incident response plan
- [ ] On-call procedures

### Developer Docs
- [ ] Local setup guide
- [ ] API reference
- [ ] Contributing guidelines
- [ ] Code conventions
- [ ] Testing guide

### Maintenance
- [ ] Documentation review schedule
- [ ] Ownership assigned
- [ ] Change process defined
- [ ] Versioning strategy

## When to Use This Skill

Invoke this skill when:
- Creating architecture documentation
- Writing runbooks for operations
- Documenting decision rationale (ADRs)
- Setting up documentation structure
- Creating onboarding materials
- Building automated documentation
- Planning incident response procedures
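
## Example: Checking the Documentation Structure

The "living documentation" principle and the Documentation Structure tree above suggest one check that is easy to automate: confirming that the expected files exist and are not empty. The sketch below is one possible way to do that, not a prescribed tool. The `missing_docs` helper and the `REQUIRED_DOCS` list are hypothetical names introduced here for illustration; the paths are copied from the example tree (ADR files are left out because their names vary per project), so adjust them to your repository.

```python
from pathlib import Path

# Assumption: paths mirror the example "Documentation Structure" tree in this
# skill. They are illustrative, not a required layout.
REQUIRED_DOCS = [
    "README.md",
    "architecture/overview.md",
    "architecture/data-flow.md",
    "runbooks/README.md",
    "runbooks/database.md",
    "runbooks/deployment.md",
    "runbooks/outage.md",
    "api/reference.md",
    "onboarding/setup.md",
    "onboarding/contributing.md",
]


def missing_docs(docs_root: str) -> list[str]:
    """Return required paths (relative to docs_root) that are missing or empty."""
    root = Path(docs_root)
    missing = []
    for rel in REQUIRED_DOCS:
        path = root / rel
        # A zero-byte placeholder counts as missing: it documents nothing.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing
```

A CI step could call `missing_docs("docs")` and fail the build when the returned list is non-empty, which keeps gaps in the documentation visible in the same place as code changes.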