--- name: senior-cloud-architect description: license: MIT + Commons Clause metadata: version: 1.0.0 author: borghei category: engineering domain: cloud-architecture updated: 2026-03-31 tags: [cloud, aws, gcp, azure, architecture, infrastructure, terraform] --- # Senior Cloud Architect Expert cloud architecture and infrastructure design across AWS, GCP, and Azure. ## Keywords cloud, aws, gcp, azure, terraform, infrastructure, vpc, eks, ecs, lambda, cost-optimization, disaster-recovery, multi-region, iam, security, migration --- ## Quick Start ```bash # Analyze infrastructure costs python scripts/cost_analyzer.py --account production --period monthly # Run DR validation python scripts/dr_test.py --region us-west-2 --type failover # Audit security posture python scripts/security_audit.py --framework cis --output report.html # Generate resource inventory python scripts/inventory.py --accounts all --format csv ``` --- ## Tools | Script | Purpose | |--------|---------| | `scripts/cost_analyzer.py` | Analyze cloud spend by service, environment, and tag | | `scripts/dr_test.py` | Validate disaster recovery failover procedures | | `scripts/security_audit.py` | Audit against CIS benchmarks and compliance frameworks | | `scripts/inventory.py` | Inventory all resources across accounts and regions | --- ## Cloud Platform Comparison | Service | AWS | GCP | Azure | |---------|-----|-----|-------| | Compute | EC2, ECS, EKS | GCE, GKE | VMs, AKS | | Serverless | Lambda | Cloud Functions | Azure Functions | | Storage | S3 | Cloud Storage | Blob Storage | | Database | RDS, DynamoDB | Cloud SQL, Spanner | SQL DB, CosmosDB | | ML | SageMaker | Vertex AI | Azure ML | | CDN | CloudFront | Cloud CDN | Azure CDN | --- ## Workflow 1: Design a Production AWS Architecture 1. **Define requirements** -- Identify compute, storage, database, and networking needs. Determine RTO/RPO targets. 2. **Provision VPC with Terraform:** ```hcl module "vpc" { source = "terraform-aws-modules/vpc/aws" version = "~> 5.0" name = "${var.project}-${var.environment}" cidr = var.vpc_cidr azs = ["${var.region}a", "${var.region}b", "${var.region}c"] private_subnets = var.private_subnets public_subnets = var.public_subnets enable_nat_gateway = true single_nat_gateway = var.environment != "production" enable_dns_hostnames = true tags = local.common_tags } ``` 3. **Deploy compute** -- ECS/EKS in private subnets behind an ALB in public subnets. Use at least 2 AZs for redundancy. 4. **Configure database** -- RDS Multi-AZ for production, single-AZ for staging. Set backup retention to 30 days (production) or 7 days (non-production). 5. **Add caching layer** -- ElastiCache (Redis) between application and database. 6. **Layer security** -- WAF on CloudFront, NACLs on subnets, security groups on instances. Apply least-privilege IAM. 7. **Validate** -- Run `python scripts/security_audit.py --framework cis` and resolve all high-severity findings. ### Reference Architecture ``` Route 53 (DNS) -> CloudFront + WAF -> ALB -> ECS/EKS Cluster (AZ-a) + ECS/EKS Cluster (AZ-b) -> ElastiCache (Redis) -> RDS Multi-AZ (Primary + Standby) ``` ## Workflow 2: Optimize Cloud Costs 1. **Audit current spend** -- `python scripts/cost_analyzer.py --account production --period monthly` 2. **Right-size instances** -- Identify instances with avg CPU <10% and max CPU <30% as downsize candidates: ```python # Pseudocode for right-sizing logic if avg_cpu < 10 and max_cpu < 30: recommendation = 'downsize' elif avg_cpu > 80: recommendation = 'upsize' else: recommendation = 'optimal' ``` 3. **Convert steady-state workloads** to Reserved Instances or Savings Plans: | Type | Discount | Commitment | Use Case | |------|----------|------------|----------| | On-Demand | 0% | None | Variable workloads | | Reserved | 30-72% | 1-3 years | Steady-state | | Savings Plans | 30-72% | 1-3 years | Flexible compute | | Spot | 60-90% | None | Fault-tolerant batch | 4. **Enforce cost allocation tags** -- Require `Environment`, `Project`, `Owner`, `CostCenter` on all resources. Alert on untagged resources after 24 hours. 5. **Validate** -- Re-run cost analyzer and confirm savings target achieved. ## Workflow 3: Plan Disaster Recovery 1. **Select DR strategy** based on RTO/RPO requirements: | Strategy | RTO | RPO | Cost | |----------|-----|-----|------| | Backup & Restore | Hours | Hours | $ | | Pilot Light | Minutes | Minutes | $$ | | Warm Standby | Minutes | Seconds | $$$ | | Multi-Site Active | Seconds | Near-zero | $$$$ | 2. **Configure cross-region replication** -- Database replication to secondary region. S3 cross-region replication for object storage. 3. **Set up Route 53 failover routing** -- Health checks on primary. Automatic DNS failover to secondary. 4. **Define backup policy:** - Database: continuous replication, 35-day retention, cross-region, encrypted - Application data: daily, 90-day retention, lifecycle to IA at 30d, Glacier at 90d - Configuration: on-change via git + S3, unlimited retention 5. **Test** -- `python scripts/dr_test.py --region us-west-2 --type failover` and confirm RTO/RPO targets met. ## Workflow 4: Audit Security Posture 1. **Run audit** -- `python scripts/security_audit.py --framework cis --output report.html` 2. **Review network segmentation** -- Public subnets contain only NAT GW, ALB, bastion. Private subnets contain application tier. Data subnets contain RDS, Redis, Elasticsearch. 3. **Enforce least-privilege IAM** -- Every policy scoped to specific resources and conditions: ```json { "Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"], "Resource": "arn:aws:s3:::my-bucket/uploads/*", "Condition": { "StringEquals": { "aws:PrincipalTag/Team": "engineering" }, "IpAddress": { "aws:SourceIp": ["10.0.0.0/8"] } } } ``` 4. **Verify encryption** -- Data encrypted at rest (KMS) and in transit (TLS 1.2+). 5. **Validate** -- Re-run audit and confirm all critical and high findings resolved. --- ## AWS Well-Architected Pillars (Decision Checklist) - **Operational Excellence**: IaC everywhere? Monitoring and alerting? Runbooks for incidents? - **Security**: Least-privilege IAM? Encryption at rest and in transit? VPC segmentation? - **Reliability**: Multi-AZ? Auto-scaling? DR tested? - **Performance**: Right-sized instances? Caching layer? CDN for static assets? - **Cost Optimization**: Reserved capacity for steady-state? Spot for batch? Unused resources cleaned? - **Sustainability**: Efficient regions? Right-sized compute? Data lifecycle policies? --- ## Reference Materials | Document | Path | |----------|------| | AWS Patterns | [references/aws_patterns.md](references/aws_patterns.md) | | GCP Patterns | [references/gcp_patterns.md](references/gcp_patterns.md) | | Multi-Cloud Strategies | [references/multi_cloud.md](references/multi_cloud.md) | | Cost Optimization Guide | [references/cost_optimization.md](references/cost_optimization.md) | --- ## Troubleshooting | Problem | Cause | Solution | |---------|-------|----------| | Cross-region latency exceeds 200ms | No regional caching or CDN configured | Deploy CloudFront/Cloud CDN with edge locations closest to user base; enable regional API Gateway caches | | Terraform state lock conflicts across teams | Shared state backend without proper locking | Use DynamoDB (AWS) or GCS (GCP) state locking with per-team state file partitioning via workspaces | | Multi-cloud DNS failover not triggering | Health check thresholds too lenient or misconfigured endpoints | Set health check interval to 10s, failure threshold to 3, and verify endpoint returns 200 on the exact path monitored | | IAM permission errors after cross-account migration | Trust policies not updated for new account IDs | Update AssumeRole trust policies with correct account principals and external IDs; validate with `aws sts assume-role` | | Cloud costs spike unexpectedly after scaling event | Auto-scaling max limits set too high or no budget alerts | Set hard max instance counts per ASG, configure billing alerts at 80%/100%/120% thresholds, and review Spot fallback behavior | | VPC peering routes not propagating between clouds | Route tables missing entries for peered CIDR ranges | Add explicit route entries in both VPCs pointing peered CIDRs to the peering connection; verify no overlapping CIDRs | | DR failover test fails with data inconsistency | Replication lag between primary and secondary regions | Switch to synchronous replication for critical databases or implement application-level consistency checks pre-failover | --- ## Success Criteria - **99.99% availability SLA met** across all production workloads with documented uptime reports - **Cost optimization savings above 25%** compared to on-demand baseline through Reserved Instances, Savings Plans, and right-sizing - **RTO < 15 minutes and RPO < 1 minute** validated through quarterly DR failover tests - **Zero critical CIS benchmark findings** in production accounts after security audit remediation - **Infrastructure drift < 2%** measured by Terraform plan diffs on scheduled compliance scans - **Cross-region failover completes within 60 seconds** with automated Route 53 health check validation - **100% resource tagging compliance** enforced via automated policy checks with no untagged resources older than 24 hours --- ## Scope & Limitations **This skill covers:** - Multi-cloud architecture design and comparison across AWS, GCP, and Azure - Infrastructure-as-Code with Terraform including VPC, compute, database, and networking - Disaster recovery planning, cross-region replication, and failover strategies - Cloud cost optimization, right-sizing, and reserved capacity planning **This skill does NOT cover:** - Application-level code architecture or microservice design patterns (see `senior-architect`) - Kubernetes cluster internals, pod scheduling, or service mesh configuration (see `senior-devops`) - Security compliance frameworks beyond CIS benchmarks such as SOC 2, HIPAA, or GDPR (see `ra-qm-team/` compliance skills) - CI/CD pipeline design, build automation, or deployment workflows (see `senior-devops`) --- ## Integration Points | Skill | Integration | Data Flow | |-------|-------------|-----------| | `senior-devops` | Infrastructure provisioning feeds into CI/CD deployment pipelines | Terraform outputs (endpoints, ARNs) → deployment configs | | `senior-secops` | Security audit findings inform cloud hardening decisions | CIS benchmark results → security remediation tasks | | `senior-architect` | Application architecture requirements drive cloud resource selection | Capacity requirements → compute/storage/network sizing | | `aws-solution-architect` | AWS-specific deep dives complement multi-cloud strategy | Cloud platform comparison → AWS implementation details | | `ra-qm-team/soc2-compliance` | Compliance requirements shape infrastructure security controls | Compliance matrices → IAM policies, encryption configs, audit logging | | `senior-fullstack` | Fullstack application stacks deploy onto cloud infrastructure | Application stack definitions → ECS/EKS task definitions, RDS configs |