--- name: cto-risk-resilience description: Expert methodology for identifying, assessing, and mitigating technical and operational risks including security, incidents, compliance, and disaster recovery. --- # CTO Risk & Resilience Skill ## Purpose This skill provides a comprehensive framework for managing technical risk and building resilient systems. Use it to conduct risk assessments, plan incident response, achieve compliance certifications, and ensure business continuity. ## When to Use Trigger this skill when you need to: - Conduct security risk assessments - Plan for compliance certifications (SOC 2, ISO 27001, GDPR, HIPAA) - Design incident response processes and runbooks - Create disaster recovery and business continuity plans - Assess vendor and third-party risks - Define SLAs, SLOs, and SLIs - Prepare for security audits or penetration testing - Implement chaos engineering and resilience testing - Respond to security incidents or breaches ## Core Methodology Follow this systematic approach to risk and resilience: ### Phase 1: Risk Identification 1. **Categorize Risks** **Technical Risks**: - System failures and outages - Data loss or corruption - Performance degradation - Scalability limitations - Technical debt accumulation **Security Risks**: - Unauthorized access - Data breaches - DDoS attacks - Malware and ransomware - Supply chain vulnerabilities **Operational Risks**: - Human error - Process failures - Knowledge gaps (bus factor) - Inadequate monitoring - Poor change management **Compliance Risks**: - Regulatory violations (GDPR, CCPA, HIPAA) - Certification failures (SOC 2, ISO 27001) - Contract breaches (SLA violations) - Audit findings **Business Risks**: - Vendor dependencies - Technology obsolescence - Team attrition - Budget constraints 2. **Conduct Risk Assessment** Use `references/frameworks/risk-assessment-matrix.md` to systematically identify and score risks. For each risk: - Describe the risk scenario - Assess probability (High/Medium/Low) - Assess impact (Critical/High/Medium/Low) - Calculate risk score - Identify existing controls - Define mitigation strategies --- ### Phase 2: Risk Prioritization 1. **Create Risk Matrix** ``` High Impact (Critical) | | [Low Priority] | [Medium Priority] | [HIGH PRIORITY] | Low Prob | Medium Prob | High Prob | High Impact | High Impact | High Impact (5)|___________________|_____________________|____________________ | | | | [Low Priority] | [Medium Priority] | [High Priority] | Low Prob | Medium Prob | High Prob | Medium Impact | Medium Impact | Medium Impact (3)|___________________|_____________________|____________________ | | | | [Very Low] | [Low Priority] | [Medium Priority] | Low Prob | Medium Prob | High Prob | Low Impact | Low Impact | Low Impact Low (1)|___________________|_____________________|____________________ Impact Low (1) Medium (3) High (5) Probability ``` 2. **Priority Levels** **Critical (Priority 1)**: Address immediately - High probability + High impact - Example: Known security vulnerability in production **High (Priority 2)**: Address within 30 days - Medium probability + High impact, OR High probability + Medium impact - Example: Single point of failure in critical system **Medium (Priority 3)**: Address within 90 days - Low probability + High impact, OR Medium probability + Medium impact - Example: Lack of disaster recovery testing **Low (Priority 4)**: Monitor and address opportunistically - Any low impact scenarios - Example: Minor technical debt in non-critical systems Use `references/templates/risk-register.md` to maintain ongoing risk inventory. --- ### Phase 3: Security & Compliance #### Security Framework Implement security controls across multiple layers: 1. **Preventive Controls** (Stop threats before they happen) - Access control (MFA, RBAC, least privilege) - Network security (firewalls, VPNs, segmentation) - Encryption (at rest and in transit) - Secure coding practices - Security awareness training 2. **Detective Controls** (Identify threats quickly) - Logging and monitoring - Intrusion detection systems (IDS) - Security information and event management (SIEM) - Vulnerability scanning - Penetration testing 3. **Responsive Controls** (React to incidents) - Incident response plan - Automated threat response - Backup and recovery procedures - Communication protocols Use `references/frameworks/security-controls-framework.md` for comprehensive checklist. --- #### Compliance Roadmaps **SOC 2 Type II Certification** Timeline: 12-18 months (9 months preparation + 3-6 months audit period + 3 months report) Use `references/templates/soc2-roadmap.md` for detailed plan: **Phases**: 1. Gap assessment (Month 1-2) 2. Control implementation (Month 3-9) 3. Evidence collection period (Month 10-15) 4. Audit (Month 16-18) **Key Areas**: - Security: Access controls, encryption, monitoring - Availability: Uptime, incident response, disaster recovery - Confidentiality: Data protection, privacy controls - Processing Integrity: Data accuracy, error handling - Privacy: GDPR/CCPA compliance (if applicable) --- **ISO 27001 Certification** Timeline: 12-24 months Use `references/templates/iso27001-roadmap.md`: **Phases**: 1. Gap analysis (Month 1-3) 2. ISMS implementation (Month 4-15) 3. Internal audit (Month 16-18) 4. Certification audit (Month 19-24) **Key Requirements**: - 114 controls across 14 domains - Risk assessment methodology - Information security policies - Employee training and awareness - Continuous improvement process --- **GDPR / CCPA Compliance** Timeline: 6-12 months Use `references/templates/data-privacy-compliance.md`: **Key Areas**: - Data inventory and mapping - Consent management - Data subject rights (access, deletion, portability) - Data processing agreements - Privacy policy and notices - Breach notification procedures --- ### Phase 4: Incident Response Create structured incident response capability: #### Incident Response Framework **1. Preparation** - Define incident severity levels - Create on-call rotation - Develop runbooks for common scenarios - Set up communication channels - Train team on procedures **2. Detection** - Monitoring and alerting - User reports - Security scanning - Automated anomaly detection **3. Triage** - Assess severity (P0/P1/P2/P3) - Assign incident commander - Assemble response team - Begin communication **4. Investigation** - Gather data and logs - Identify root cause - Assess blast radius - Document timeline **5. Containment** - Stop the bleeding - Isolate affected systems - Prevent further damage - Implement temporary fixes **6. Resolution** - Deploy permanent fix - Verify resolution - Monitor for recurrence - Update systems **7. Post-Mortem** - Blameless retrospective - Document what happened - Identify improvements - Create action items Use `references/templates/incident-response-playbook.md` for detailed procedures. --- #### Incident Severity Levels | Level | Definition | Response Time | Escalation | | ----------------- | ------------------------------------------------------- | ----------------- | ------------------------------ | | **P0 - Critical** | Complete service outage, data breach, security incident | Immediate | All-hands, exec team notified | | **P1 - High** | Major feature broken, significant degradation | <15 minutes | On-call team, manager notified | | **P2 - Medium** | Partial functionality impaired, workaround exists | <2 hours | On-call team | | **P3 - Low** | Minor issue, minimal customer impact | Next business day | Normal ticket queue | --- #### On-Call Best Practices **Structure**: - Primary and secondary on-call rotation - 1-week rotations (avoid burnout) - Compensate with time off or pay - Maximum 2-3 incidents per week on average **Tools**: - PagerDuty, Opsgenie, or similar - Automated escalation - Mobile app for notifications - Integration with monitoring systems **Health Metrics**: - On-call incidents per week - Time spent on-call - Sleep disruption frequency - On-call satisfaction score Use `references/frameworks/on-call-framework.md` for detailed guidance. --- ### Phase 5: Business Continuity Ensure critical business functions can continue during disruptions: #### Disaster Recovery Planning **Recovery Objectives**: **RTO (Recovery Time Objective)**: How long can we be down? - Critical systems: 1 hour - Important systems: 4 hours - Standard systems: 24 hours **RPO (Recovery Point Objective)**: How much data can we lose? - Financial data: 0 (real-time replication) - Customer data: 15 minutes (frequent backups) - Analytics data: 24 hours (daily backups) **Disaster Scenarios**: 1. Single server failure 2. Availability zone outage 3. Regional outage 4. Complete cloud provider outage 5. Ransomware attack 6. Accidental data deletion 7. Key personnel unavailable For each scenario: - Detection method - Recovery procedure - Responsible team - Expected RTO/RPO - Testing frequency Use `references/templates/disaster-recovery-plan.md` for comprehensive planning. --- #### Business Continuity Testing **Regular Drills**: - Quarterly: Tabletop exercises (discuss scenarios) - Bi-annually: Simulated incidents (practice procedures) - Annually: Full disaster recovery test (actual failover) **Game Days**: - Chaos engineering exercises - Intentional failure injection - Test recovery procedures - Identify weaknesses **Documentation**: - Keep runbooks updated - Document lessons learned - Update procedures based on findings - Share knowledge across team --- ### Phase 6: Resilience Engineering Build systems that gracefully handle failures: #### Resilience Patterns **1. Circuit Breakers** - Detect failing services - Prevent cascade failures - Automatic recovery attempts - Fallback behavior **2. Retry with Exponential Backoff** - Handle transient failures - Avoid overwhelming systems - Progressive delay between attempts - Maximum retry limits **3. Timeout and Bulkheads** - Prevent resource exhaustion - Isolate failures - Limit blast radius - Protect critical paths **4. Graceful Degradation** - Continue with reduced functionality - Non-critical features can fail - User experience maintained - Clear communication to users **5. Rate Limiting and Load Shedding** - Prevent overload - Protect system stability - Prioritize critical requests - Fair resource allocation Use `references/frameworks/resilience-patterns.md` for implementation guidance. --- #### Service Level Objectives (SLOs) Define reliability targets: **SLI (Service Level Indicator)**: What we measure - Availability: % of successful requests - Latency: % of requests under threshold - Throughput: Requests per second handled - Error rate: % of failed requests **SLO (Service Level Objective)**: Our target - Availability: 99.9% (43 minutes downtime/month allowed) - Latency: 95% of requests < 200ms - Error rate: < 0.1% **SLA (Service Level Agreement)**: Promise to customers - Usually more lenient than internal SLO - Has financial consequences if missed - Example: 99.5% uptime guarantee with credits **Error Budget**: - Amount of allowed unreliability - 99.9% SLO = 0.1% error budget = 43 minutes/month - Can "spend" budget on releases, changes, experiments - When exhausted, focus shifts to reliability Use `references/templates/slo-definition.md` for framework. --- ## Key Principles - **Prevention Over Reaction**: Build security and resilience from the start - **Defense in Depth**: Multiple layers of security controls - **Assume Breach**: Plan for when (not if) defenses fail - **Blameless Culture**: Learn from incidents without blame - **Continuous Improvement**: Regular testing and refinement - **Clear Communication**: Transparent about risks and incidents ## Bundled Resources **Frameworks** (`references/frameworks/`): - `risk-assessment-matrix.md` - Systematic risk identification and scoring - `security-controls-framework.md` - Comprehensive security checklist - `on-call-framework.md` - Sustainable on-call practices - `resilience-patterns.md` - Architecture patterns for resilience - `chaos-engineering.md` - Controlled failure testing **Templates** (`references/templates/`): - `risk-register.md` - Ongoing risk tracking - `incident-response-playbook.md` - Step-by-step incident procedures - `soc2-roadmap.md` - SOC 2 certification plan - `iso27001-roadmap.md` - ISO 27001 certification plan - `data-privacy-compliance.md` - GDPR/CCPA compliance guide - `disaster-recovery-plan.md` - DR procedures and testing - `slo-definition.md` - Service level objective framework - `security-audit-checklist.md` - Pre-audit preparation - `post-mortem-template.md` - Incident analysis format **Examples** (`references/examples/`): - Real post-mortems from major incidents (anonymized) - Security audit results and remediation - Compliance certification timelines - DR testing scenarios and results ## Usage Patterns **Example 1**: User says "We need to get SOC 2 certified for enterprise sales" → Load `references/templates/soc2-roadmap.md` → Conduct gap assessment against SOC 2 requirements → Create 12-18 month roadmap with phases → Identify control implementations needed → Estimate costs (audit fees, tools, consulting) → Assign ownership and timeline → Provide monthly checklist for evidence collection --- **Example 2**: User says "Create incident response process for my 30-person team" → Load `references/templates/incident-response-playbook.md` → Define severity levels (P0-P3) with examples → Design on-call rotation structure → Create runbooks for common scenarios → Set up communication channels (Slack, status page) → Define escalation paths → Schedule incident response training --- **Example 3**: User says "Conduct security risk assessment for Series B due diligence" → Load `references/frameworks/risk-assessment-matrix.md` → Inventory all systems and data → Identify risks across security, compliance, operational → Score by probability and impact → Document existing controls → Create risk mitigation roadmap → Prepare executive summary for investors --- **Example 4**: User says "We had a major outage, help with post-mortem" → Load `references/templates/post-mortem-template.md` → Document incident timeline → Identify root cause(s) → Analyze what went well and poorly → Create blameless narrative → Generate action items with owners → Share with team and stakeholders → Track action item completion --- ## Risk Management by Company Stage ### Early Stage Startup (Pre-PMF) **Focus**: Security basics, avoid catastrophic risks **Priorities**: 1. Basic security (encryption, access control) 2. Data backup and recovery 3. Privacy compliance basics 4. Simple incident response **Avoid**: Over-investing in compliance certifications too early --- ### Growth Stage (Post-PMF, Scaling) **Focus**: Scalability, reliability, security hardening **Priorities**: 1. SOC 2 preparation (if selling B2B) 2. Comprehensive monitoring and alerting 3. Incident response process 4. On-call rotation structure 5. DR planning and testing **Investment**: 10-15% of engineering time on resilience --- ### Scale Stage (Enterprise) **Focus**: Compliance, resilience, enterprise security **Priorities**: 1. Multiple compliance certifications (SOC 2, ISO 27001) 2. Advanced security (SIEM, threat detection) 3. Chaos engineering and resilience testing 4. Comprehensive BC/DR 5. Security team and CISO **Investment**: 20-25% of engineering time on reliability/security --- ## Warning Signs | Indicator | Risk | Action | | ----------------------------------- | ----------------------------------- | -------------------------------------------- | | No monitoring on production | High - can't detect issues | Immediate: Implement basic monitoring | | No backup/DR tested in 6+ months | High - recovery may fail | Test DR procedures this quarter | | Single person knows critical system | High - bus factor = 1 | Document and cross-train immediately | | Increasing incident frequency | Medium-High - system degrading | Root cause analysis, resilience improvements | | Failed security scan findings | High - vulnerable to attack | Remediate critical/high findings in 30 days | | Compliance deadline <6 months | High - may not certify in time | Accelerate roadmap, consider consultant | | On-call team burned out | Medium - quality and retention risk | Reduce incident load, improve tooling | --- ## Communication Templates ### For Board/Investors **Security & Risk Update** **Status**: 🟢 Secure and compliant **Key Metrics**: - System uptime: 99.87% (target: 99.9%) - Security incidents: 0 critical, 2 low (resolved) - Compliance: SOC 2 on track for Q3 certification **Top Risks & Mitigations**: 1. Single cloud provider dependency → Implementing multi-region DR 2. Growing on-call burden → Hiring SRE, improving automation 3. Compliance timeline tight → Weekly checkpoint, external consultant engaged **Investment Request**: $150K for penetration testing and SOC 2 audit --- ### For Engineering Team **Incident Review - Service Outage Feb 15** **What Happened**: Database connection pool exhaustion caused 45-minute outage **Timeline**: - 2:15 PM: Increased load from marketing campaign - 2:22 PM: First alerts fired - 2:25 PM: Team paged, investigation started - 2:40 PM: Root cause identified - 3:00 PM: Service restored **What Went Well**: - ✅ Alerts fired within 7 minutes - ✅ Team assembled quickly - ✅ Clear communication to customers **What We'll Improve**: - ⚠️ Auto-scaling for connection pool - ⚠️ Load testing before campaigns - ⚠️ Better runbook documentation **Action Items**: [See detailed list] No blame - systems fail, we learn and improve. --- ## Writing Style All outputs should be: - **Risk-Aware**: Identify and communicate risks clearly - **Action-Oriented**: Focus on concrete mitigation steps - **Balanced**: Realistic about risk vs. cost/effort - **Empathetic**: Blameless culture, learning mindset - **Transparent**: Honest about gaps and limitations --- **Version**: 1.0.0 **Philosophy**: Prevent where possible, detect quickly, respond effectively, learn continuously