---
name: aws-well-architected-framework
description: Use when reviewing AWS architecture, designing cloud systems, addressing operational issues, security concerns, reliability problems, performance bottlenecks, cost overruns, or sustainability goals
---

# AWS Well-Architected Framework

Expert guidance for designing, reviewing, and improving AWS architectures using the six pillars of the Well-Architected Framework.

## When to Use

Use this skill when:

- Reviewing existing AWS architecture for best practices
- Designing new cloud systems or applications
- Troubleshooting operational issues, security vulnerabilities, or reliability problems
- Optimizing costs or improving performance
- Preparing for architecture reviews or audits
- Migrating workloads to AWS
- Addressing compliance or sustainability requirements
- User asks "is my architecture good?" or "how can I improve my AWS setup?"

## Core Principle

**Systematic architecture evaluation across 6 pillars ensures balanced, well-designed systems that meet business objectives.**

The AWS Well-Architected Framework provides a consistent approach for evaluating cloud architectures and implementing scalable designs.

## The Six Pillars

| Pillar | Focus | Key Question |
|--------|-------|--------------|
| **Operational Excellence** | Run and monitor systems | How do we operate effectively? |
| **Security** | Protect information and systems | How do we protect data and resources? |
| **Reliability** | Recover from failures | How do we ensure workload availability? |
| **Performance Efficiency** | Use resources effectively | How do we meet performance requirements? |
| **Cost Optimization** | Avoid unnecessary costs | How do we achieve cost-effective outcomes? |
| **Sustainability** | Minimize environmental impact | How do we reduce carbon footprint? |

## Architecture Review Workflow

**CRITICAL: You MUST review ALL 6 pillars systematically. Never skip a pillar because it "seems not applicable" - every workload has considerations across all pillars.**

```dot
digraph review_flow {
    "Architecture review needed" [shape=doublecircle];
    "Identify workload scope" [shape=box];
    "Review each pillar systematically" [shape=box];
    "Document findings per pillar" [shape=box];
    "Prioritize improvements" [shape=box];
    "Create action plan" [shape=box];
    "All pillars reviewed?" [shape=diamond];
    "Complete" [shape=doublecircle];

    "Architecture review needed" -> "Identify workload scope";
    "Identify workload scope" -> "Review each pillar systematically";
    "Review each pillar systematically" -> "Document findings per pillar";
    "Document findings per pillar" -> "All pillars reviewed?";
    "All pillars reviewed?" -> "Review each pillar systematically" [label="no"];
    "All pillars reviewed?" -> "Prioritize improvements" [label="yes"];
    "Prioritize improvements" -> "Create action plan";
    "Create action plan" -> "Complete";
}
```

**Red Flags - You're Skipping the Framework:**
- "This pillar doesn't apply to this workload" - WRONG, every pillar applies
- Jumping straight to recommendations without documenting current state
- Only reviewing 3-4 pillars instead of all 6
- Providing generic advice instead of workload-specific assessment

## Pillar 1: Operational Excellence

**Goal:** Support development and run workloads effectively, gain insight into operations, and continuously improve processes.

### Design Principles

- Perform operations as code (IaC)
- Make frequent, small, reversible changes
- Refine operations procedures frequently
- Anticipate failure
- Learn from operational events and failures

### Key Areas

**Organization:**
- How do teams share architecture knowledge?
- Are there clear ownership and accountability models?

**Prepare:**
- How do you design workloads for observability?
- Infrastructure as code implementation?
- Deployment practices (CI/CD)?

**Operate:**
- What's the runbook for common operations?
- How do you understand workload health?
- How do you respond to events?

**Evolve:**
- How do you learn from operational events?
- Process for continuous improvement?

### Common Issues & Solutions

| Issue | Solution |
|-------|----------|
| Manual deployments | Implement CI/CD with CloudFormation/CDK/Terraform |
| No visibility into system health | Add CloudWatch dashboards, metrics, alarms |
| Operational procedures outdated | Regular runbook reviews, post-incident learning |
| Slow incident response | Create automated remediation with Lambda/Systems Manager |

### Quick Implementation Checklist

- [ ] Infrastructure defined as code (CloudFormation/CDK/Terraform)
- [ ] CI/CD pipeline implemented
- [ ] CloudWatch dashboards for key metrics
- [ ] Alarms for critical thresholds
- [ ] Runbooks documented and accessible
- [ ] Regular game days to test procedures
- [ ] Post-incident review process

## Pillar 2: Security

**Goal:** Protect data, systems, and assets through cloud security practices.

### Design Principles

- Implement strong identity foundation
- Enable traceability
- Apply security at all layers
- Automate security best practices
- Protect data in transit and at rest
- Keep people away from data
- Prepare for security events

### Key Areas

**Security Foundations:**
- How do you manage credentials and authentication?
- IAM roles and policies following least privilege?

**Identity and Access Management:**
- How do you manage identities for people and machines?
- MFA enabled for all human access?

**Detection:**
- How do you detect and investigate security events?
- CloudTrail, GuardDuty, Security Hub configured?

**Infrastructure Protection:**
- How do you protect networks and compute?
- VPC configuration, security groups, NACLs?

**Data Protection:**
- How do you classify and protect data?
- Encryption at rest and in transit?

**Incident Response:**
- How do you respond to security incidents?
- Incident response plan tested?

### Critical Security Patterns

**Never Do:**
```typescript
// ❌ DANGEROUS: Hardcoded credentials
const AWS = require('aws-sdk');
const s3 = new AWS.S3({
  accessKeyId: 'AKIAIOSFODNN7EXAMPLE',
  secretAccessKey: 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
});
```

**Always Do:**
```typescript
// ✅ CORRECT: Use IAM roles
const AWS = require('aws-sdk');
const s3 = new AWS.S3(); // Credentials from IAM role

// Lambda function with IAM role
const lambda = new lambda.Function(this, 'MyFunction', {
  // IAM role with least privilege
  role: myRole,
  // ...
});
```

### Security Checklist

- [ ] No hardcoded credentials anywhere (check git history!)
- [ ] IAM roles follow least privilege principle
- [ ] MFA enabled for root and privileged accounts
- [ ] CloudTrail enabled in all regions
- [ ] VPC with proper public/private subnet architecture
- [ ] Security groups with minimal inbound rules
- [ ] Encryption at rest for all data stores
- [ ] HTTPS/TLS for all data in transit
- [ ] Secrets Manager or Parameter Store for secrets
- [ ] Regular security patching process
- [ ] AWS Config for compliance monitoring
- [ ] GuardDuty for threat detection

## Pillar 3: Reliability

**Goal:** Ensure workload performs its intended function correctly and consistently.

### Design Principles

- Automatically recover from failure
- Test recovery procedures
- Scale horizontally
- Stop guessing capacity
- Manage change through automation

### Key Areas

**Foundations:**
- How do you manage service quotas and constraints?
- Network topology designed for HA?

**Workload Architecture:**
- How do you design workload service architecture?
- Microservices vs monolith considerations?

**Change Management:**
- How do you monitor workload resources?
- How are changes deployed safely?

**Failure Management:**
- How do you back up data?
- How do you design for resilience?
- DR plan and RTO/RPO defined?

### High Availability Patterns

**Multi-AZ Deployment:**
```
Region
├── AZ-1: Application + Database
├── AZ-2: Application + Database (standby)
└── AZ-3: Application + Database (standby)
```

**Multi-Region Deployment:**
```
Primary Region          Secondary Region
├── Active workload    ├── Standby/Active
├── Database (primary) ├── Database (replica)
└── Route 53 health check monitoring
```

### Backup Strategy

| Data Type | Solution | RPO | RTO |
|-----------|----------|-----|-----|
| **RDS** | Automated backups + snapshots | < 5 min | < 30 min |
| **DynamoDB** | Point-in-time recovery | Seconds | Minutes |
| **S3** | Versioning + cross-region replication | Real-time | Immediate |
| **EBS** | Snapshots via AWS Backup | Hours | Hours |

### Reliability Checklist

- [ ] Multi-AZ deployment for critical components
- [ ] Health checks configured (ELB, Route 53)
- [ ] Auto Scaling groups with proper sizing
- [ ] RDS automated backups enabled
- [ ] DynamoDB point-in-time recovery enabled
- [ ] S3 versioning for critical buckets
- [ ] Disaster recovery plan documented and tested
- [ ] Chaos engineering tests (failure injection)
- [ ] Graceful degradation strategies
- [ ] Circuit breaker patterns implemented

## Pillar 4: Performance Efficiency

**Goal:** Use computing resources efficiently to meet requirements and maintain efficiency as demand changes.

### Design Principles

- Democratize advanced technologies
- Go global in minutes
- Use serverless architectures
- Experiment more often
- Consider mechanical sympathy

### Key Areas

**Selection:**
- How do you select appropriate resource types and sizes?
- Compute: EC2, Lambda, Fargate, ECS, EKS?
- Database: RDS, DynamoDB, Aurora, ElastiCache?
- Storage: S3, EFS, EBS, Glacier?

**Review:**
- How do you evolve workload to use new resources?
- Regular review of AWS new features?

**Monitoring:**
- How do you monitor resources?
- CloudWatch, X-Ray for distributed tracing?

**Trade-offs:**
- How do you use trade-offs to improve performance?
- Caching, consistency models, compression?

### Performance Patterns

**Caching Strategy:**
```
Client → CloudFront (edge cache)
  → API Gateway
    → Lambda
      → ElastiCache (data cache)
        → DynamoDB/RDS
```

**Database Selection:**

| Use Case | Recommended Service |
|----------|-------------------|
| Relational, complex queries | RDS (PostgreSQL/MySQL) |
| High throughput, simple queries | DynamoDB |
| Graph relationships | Neptune |
| Search and analytics | OpenSearch |
| Time-series data | Timestream |
| In-memory cache | ElastiCache (Redis/Memcached) |

### Performance Checklist

- [ ] Right-sized compute instances (not over-provisioned)
- [ ] Content delivery through CloudFront
- [ ] Database read replicas for read-heavy workloads
- [ ] Caching layer (ElastiCache, DAX, CloudFront)
- [ ] Asynchronous processing with SQS/SNS/EventBridge
- [ ] Auto Scaling configured appropriately
- [ ] Database indexes optimized
- [ ] Monitoring with CloudWatch and X-Ray
- [ ] Regular performance testing under load

## Pillar 5: Cost Optimization

**Goal:** Run systems to deliver business value at lowest price point.

### Design Principles

- Implement cloud financial management
- Adopt consumption model
- Measure overall efficiency
- Stop spending on undifferentiated heavy lifting
- Analyze and attribute expenditure

### Key Areas

**Practice Cloud Financial Management:**
- Cost allocation tags implemented?
- Budgets and alerts configured?

**Expenditure and Usage Awareness:**
- How do you govern usage?
- Cost Explorer and AWS Budgets configured?

**Cost-Effective Resources:**
- How do you evaluate cost when selecting services?
- Reserved Instances or Savings Plans for predictable workloads?

**Manage Demand:**
- How do you manage demand and supply resources?
- Throttling, caching to reduce demand?

**Optimize Over Time:**
- How do you evaluate new services?
- Regular review of cost optimization opportunities?

### Cost Optimization Strategies

| Strategy | Implementation | Potential Savings |
|----------|---------------|-------------------|
| **Right-sizing** | Use Compute Optimizer recommendations | 20-40% |
| **Reserved Instances** | 1-year or 3-year commitments | 30-75% |
| **Savings Plans** | Flexible compute commitments | 30-70% |
| **Spot Instances** | Fault-tolerant workloads | 50-90% |
| **S3 Intelligent-Tiering** | Automatic storage class optimization | 40-60% |
| **Auto Scaling** | Scale resources with demand | 30-50% |
| **Lambda instead of EC2** | For appropriate workloads | Varies |

### Cost Monitoring

```typescript
// CDK Example: Set up budget alerts
import * as budgets from 'aws-cdk-lib/aws-budgets';

new budgets.CfnBudget(this, 'MonthlyBudget', {
  budget: {
    budgetType: 'COST',
    timeUnit: 'MONTHLY',
    budgetLimit: {
      amount: 1000,
      unit: 'USD',
    },
  },
  notificationsWithSubscribers: [{
    notification: {
      notificationType: 'ACTUAL',
      comparisonOperator: 'GREATER_THAN',
      threshold: 80, // Alert at 80%
    },
    subscribers: [{
      subscriptionType: 'EMAIL',
      address: 'team@example.com',
    }],
  }],
});
```

### Cost Optimization Checklist

- [ ] Cost allocation tags applied consistently
- [ ] AWS Budgets configured with alerts
- [ ] Cost Explorer reviewed monthly
- [ ] Reserved Instances or Savings Plans for stable workloads
- [ ] Spot Instances for fault-tolerant workloads
- [ ] Unused resources identified and terminated
- [ ] S3 lifecycle policies for data management
- [ ] Right-sized instances (not over-provisioned)
- [ ] Lambda memory optimization
- [ ] DynamoDB on-demand vs provisioned analysis
- [ ] Data transfer costs analyzed and optimized

## Pillar 6: Sustainability

**Goal:** Minimize environmental impact of running cloud workloads.

### Design Principles

- Understand your impact
- Establish sustainability goals
- Maximize utilization
- Anticipate and adopt new, more efficient offerings
- Use managed services
- Reduce downstream impact

### Key Areas

**Region Selection:**
- Choose regions with renewable energy
- AWS regions with lower carbon intensity

**User Behavior Patterns:**
- Scale resources with demand
- Remove unused resources

**Software and Architecture:**
- Optimize code for efficiency
- Use appropriate services (serverless over provisioned)

**Data Patterns:**
- Minimize data movement
- Use data compression
- Implement lifecycle policies

**Hardware Patterns:**
- Use minimum necessary hardware
- Use instance types with best performance per watt

**Development Process:**
- Test sustainability improvements
- Measure and report carbon footprint

### Sustainability Checklist

- [ ] Workloads in regions with renewable energy
- [ ] Auto Scaling to match demand (no idle resources)
- [ ] Unused resources regularly cleaned up
- [ ] Graviton processors considered for better efficiency
- [ ] Managed services used where appropriate
- [ ] Data lifecycle policies to reduce storage
- [ ] Efficient code (async processing, optimized queries)
- [ ] Monitoring resource utilization
- [ ] Carbon footprint tracked (AWS Customer Carbon Footprint Tool)

## Review Process

### 1. Scoping Phase

**Questions to ask:**
- What is the workload scope? (entire system vs specific component)
- What are the business objectives?
- What are the compliance requirements?
- What are the current pain points?

### 2. Review Each Pillar

For each pillar, use this template:

**Current State:**
- Document what exists today

**Gaps:**
- What's missing or needs improvement?

**Risks:**
- What are the high/medium/low priority risks?

**Recommendations:**
- Specific, actionable improvements

### 3. Prioritization Matrix

| Priority | Criteria |
|----------|----------|
| **High** | Security vulnerabilities, critical availability risks, major cost waste |
| **Medium** | Performance issues, moderate cost optimization, operational improvements |
| **Low** | Nice-to-haves, future considerations, minor optimizations |

### 4. Action Plan Template

```markdown
## Pillar: [Name]

### Issue: [Description]
- **Risk Level:** High/Medium/Low
- **Impact:** [Business impact]
- **Effort:** Low/Medium/High

### Recommendation:
[Specific actions]

### Implementation Steps:
1. [Step 1]
2. [Step 2]
3. [Step 3]

### Success Criteria:
- [Measurable outcome 1]
- [Measurable outcome 2]

### Resources:
- [AWS documentation links]
- [Blog posts or examples]
```

## Common Anti-Patterns

| Anti-Pattern | Issue | Better Approach |
|--------------|-------|-----------------|
| **Single AZ deployment** | No fault tolerance | Multi-AZ architecture |
| **No IaC** | Manual config, drift | CloudFormation/CDK/Terraform |
| **Hardcoded secrets** | Security vulnerability | Secrets Manager/Parameter Store |
| **No monitoring** | Blind operation | CloudWatch dashboards + alarms |
| **No backups** | Data loss risk | Automated backup strategy |
| **Over-provisioning** | Cost waste | Right-sizing + Auto Scaling |
| **No cost tracking** | Budget overruns | Tags + Budgets + Cost Explorer |
| **Monolithic architecture** | Hard to scale | Microservices or serverless |

## Real-World Example

**Scenario:** Serverless API with authentication

**Architecture Review:**

**Operational Excellence:**
- ✅ Lambda functions deployed via CDK
- ✅ CloudWatch logs enabled
- ❌ Missing: Distributed tracing (X-Ray), dashboards

**Security:**
- ❌ CRITICAL: Hardcoded API keys in Lambda environment variables
- ✅ API Gateway with IAM authorization
- ❌ Missing: Secrets Manager, encryption at rest

**Reliability:**
- ✅ Multi-AZ DynamoDB table
- ❌ Single region deployment
- ❌ Missing: Backup strategy, DR plan

**Performance:**
- ✅ CloudFront for static assets
- ❌ No caching for API responses
- ❌ Lambda cold starts not optimized

**Cost:**
- ❌ DynamoDB provisioned capacity, but traffic is spiky
- ✅ Lambda usage-based pricing
- ❌ Missing: Budget alerts, cost allocation tags

**Sustainability:**
- ✅ Serverless architecture (good utilization)
- ❌ Unused dev/test resources running 24/7

**Priority Actions:**
1. **HIGH**: Move API keys to Secrets Manager (Security)
2. **HIGH**: Implement DynamoDB backups (Reliability)
3. **MEDIUM**: Add X-Ray tracing (Operational Excellence)
4. **MEDIUM**: Switch DynamoDB to on-demand (Cost)
5. **LOW**: Add API Gateway caching (Performance)

## Resources

- [AWS Well-Architected Framework Whitepaper](https://docs.aws.amazon.com/wellarchitected/latest/framework/welcome.html)
- [AWS Well-Architected Tool](https://console.aws.amazon.com/wellarchitected) (Interactive reviews)
- [Well-Architected Labs](https://wellarchitectedlabs.com/)
- [AWS Architecture Center](https://aws.amazon.com/architecture/)
- [Sustainability Pillar Whitepaper](https://docs.aws.amazon.com/wellarchitected/latest/sustainability-pillar/sustainability-pillar.html)

## Common Mistakes When Using This Framework

| Mistake | Why It's Wrong | Correct Approach |
|---------|---------------|------------------|
| "Sustainability doesn't apply to this workload" | Every workload consumes resources and energy | Review all 6 pillars, even if findings are minimal |
| Skipping current state documentation | Can't measure improvement without baseline | Always document "Current State" before recommendations |
| Generic recommendations | Not actionable or specific to this workload | Provide specific AWS services, code examples, priorities |
| No prioritization | Everything seems equally important | Use HIGH/MEDIUM/LOW risk levels, create phased plan |
| Forgetting about trade-offs | Optimizing one pillar at expense of others | Explicitly call out trade-offs (e.g., multi-region cost vs reliability) |

## Using This Skill

When conducting architecture reviews:

1. **Start with context** - understand business objectives and constraints
2. **Review systematically** - go through all 6 pillars, don't skip ANY
3. **Document findings** - use consistent format per pillar (Current State → Gaps → Recommendations)
4. **Prioritize ruthlessly** - security and availability issues first
5. **Be specific** - actionable recommendations with examples and AWS service names
6. **Provide resources** - link to AWS docs and examples
7. **Create action plan** - clear next steps with success criteria and effort estimates
8. **Call out trade-offs** - be explicit about costs and benefits of each recommendation

**Remember:** Architecture is about trade-offs. A perfect architecture doesn't exist - aim for a well-balanced one that meets business needs.

**No exceptions to reviewing all 6 pillars** - even if a pillar seems "not applicable", document why and what the current state is.