---
name: optimizing-costs
description: Optimize cloud infrastructure costs through FinOps practices, commitment discounts, right-sizing, and automated cost management. Use when reducing cloud spend, implementing budget controls, or establishing cost visibility across AWS, Azure, GCP, and Kubernetes environments.
---

# Cost Optimization

## Purpose

Cloud cost optimization transforms uncontrolled spending into strategic resource allocation through the FinOps lifecycle: Inform, Optimize, and Operate. This skill provides decision frameworks for commitment-based discounts (Reserved Instances, Savings Plans), right-sizing strategies, Kubernetes cost management, and automated cost governance across multi-cloud environments.

## When to Use This Skill

Invoke this skill when:
- Reducing cloud spend by 15-40% through systematic optimization
- Implementing cost visibility dashboards and allocation tracking
- Establishing budget alerts and anomaly detection
- Optimizing Kubernetes resource requests and cluster efficiency
- Managing Reserved Instances, Savings Plans, or Committed Use Discounts
- Automating idle resource cleanup and right-sizing recommendations
- Setting up showback/chargeback models for internal teams
- Preventing cost overruns through CI/CD cost estimation (Infracost)
- Responding to finance team requests for cloud cost reduction

## FinOps Principles

### The FinOps Lifecycle

```
┌─────────────────────────────────────────────────────┐
│  INFORM  →  OPTIMIZE  →  OPERATE  (continuous loop) │
│    ↓           ↓            ↓                       │
│ Visibility   Action     Automation                  │
└─────────────────────────────────────────────────────┘
```

**Inform Phase:** Establish cost visibility
- Enable cost allocation tags (Owner, Project, Environment)
- Deploy real-time cost dashboards for engineering teams
- Integrate cloud billing data (AWS CUR, Azure Consumption API, GCP billing export to BigQuery)
- Set up Kubernetes cost monitoring (Kubecost, OpenCost)

**Optimize Phase:** Take action on cost drivers
- Purchase commitment-based discounts (40-72% savings)
- Right-size over-provisioned resources (target 60-80% utilization)
- Implement spot/preemptible instances for fault-tolerant workloads
- Clean up idle resources (unattached volumes, old snapshots)

**Operate Phase:** Automate and govern
- Budget alerts with cascading notifications (50%, 75%, 90%, 100%)
- Automated cleanup scripts for idle resources
- CI/CD cost estimation to prevent surprise increases
- Continuous monitoring with anomaly detection

### Core FinOps Principles

1. **Collaboration:** Cross-functional teams (finance, engineering, operations, product)
2. **Accountability:** Teams own the cost of their services
3. **Transparency:** All costs visible and understandable to stakeholders
4. **Optimization:** Continuous improvement of cost efficiency

For detailed FinOps maturity models and organizational structures, see `references/finops-foundations.md`.

## Cost Optimization Strategies

### 1. Commitment-Based Discounts

**Reserved Instances (RIs):** 40-72% discount for 1-3 year commitments; the exact rate varies by instance type, region, and payment option
- **Standard RI:** Instance type locked, highest discount (about 60% for 3-year)
- **Convertible RI:** Flexible instance types, moderate discount (about 54% for 3-year)
- **Use for:** Databases (RDS, ElastiCache), stable production EC2 workloads

**Savings Plans:** Flexible compute commitments
- **Compute Savings Plans:** Apply to EC2, Fargate, and Lambda (about 54% discount for 3-year)
- **EC2 Instance Savings Plans:** Tied to an instance family in one region (about 66% discount for 3-year)
- **Use for:** Workloads that change instance types or regions (Compute Savings Plans cover both)

**GCP Committed Use Discounts (CUDs):** 25-70% discount
- **Resource-based CUDs:** Commit to vCPU, memory, GPUs
- **Spend-based CUDs:** Commit to dollar amount (flexible)
- **Sustained Use Discounts:** Automatic 20-30% discount for sustained usage (no commitment)

**Decision Framework:**
```
Reserve when:
├─ Workload is production-critical (24/7 uptime required)
├─ Usage is predictable (stable baseline over 6+ months)
├─ Architecture is stable (unlikely to change instance types)
└─ Financial commitment acceptable (1-3 year lock-in)

Use On-Demand when:
├─ Development/testing environments
├─ Unpredictable spiky workloads
├─ Short-term projects (<6 months)
└─ Evaluating new instance types
```

For detailed commitment strategies and RI coverage analysis, see `references/commitment-strategies.md`.

### 2. Spot and Preemptible Instances

**Discount:** 70-90% off on-demand pricing. Instances are interruptible: AWS gives a 2-minute warning, while Azure and GCP give about 30 seconds.

**Use Spot For:** CI/CD workers, batch jobs, ML training (with checkpointing), Kubernetes workers, data analytics

**Avoid Spot For:** Stateful databases, real-time services, long-running jobs without checkpointing

**Best Practices:**
- Diversify instance types and spread across Availability Zones
- Implement graceful shutdown handlers
- Auto-fallback to on-demand when capacity is unavailable
- Kubernetes: Mix 70% spot + 30% on-demand nodes with taints/tolerations

### 3. Right-Sizing Strategies

**Target Utilization:** 60-80% average (leave headroom for spikes)

**Compute Right-Sizing:**
- Analyze actual CPU/memory utilization over 30+ days
- Downsize instances with <40% average utilization
- Consolidate underutilized workloads
- Switch instance families (compute-optimized vs. memory-optimized)

**Database Right-Sizing:**
- Analyze connection pool usage (max connections vs. allocated)
- Downgrade storage IOPS if utilization <50%
- Evaluate read replica necessity (can caching replace it?)
- Consider serverless options (Aurora Serverless, Azure SQL Serverless)

**Kubernetes Right-Sizing:**
- Set requests = average usage (not peak)
- Set limits = 2-3x requests (allow bursting)
- Use Vertical Pod Autoscaler (VPA) for automated recommendations
- Identify pods with 0% CPU usage (candidates for consolidation)

**Storage Right-Sizing:**
- Delete unattached volumes (EBS, Azure Disks, GCP Persistent Disks)
- Delete snapshots older than 90 days that no retention policy requires
- Implement lifecycle policies (S3 Intelligent-Tiering, Azure Blob Lifecycle)
- Compress/deduplicate data

**Right-Sizing Tools:**
- **AWS Compute Optimizer:** ML-based EC2, Lambda, EBS recommendations
- **Azure Advisor:** VM rightsizing, reserved instance advice
- **GCP Recommender:** VM, disk, commitment recommendations
- **VPA (Vertical Pod Autoscaler):** Automated container resource requests

### 4. Kubernetes Cost Management

**Resource Requests and Limits:**
```yaml
# Set requests = average usage (enables efficient bin-packing)
resources:
  requests:
    cpu: 500m      # 0.5 CPU cores (average usage)
    memory: 1Gi    # 1 GiB memory (average usage)
  limits:
    cpu: 1500m     # 1.5 CPU cores (3x requests, allows bursting)
    memory: 3Gi    # 3 GiB memory (3x requests)
```

**Namespace Quotas:** Prevent runaway resource consumption
- ResourceQuota: Limit total CPU/memory per namespace
- LimitRange: Default and maximum requests/limits per container or pod
- PriorityClass: Ensure critical pods get resources first (cluster-scoped, not per namespace)

**Cluster Autoscaling:**
- Scale down idle nodes to reduce costs
- Scale dev clusters to zero during off-hours
- Use multiple node pools (spot + on-demand mix)
- Set max node limits to prevent overspend

**Cost Visibility:**
- Deploy Kubecost or OpenCost for namespace-level cost tracking
- Allocate costs by labels (team, project, environment)
- Track idle cost (cluster capacity not allocated to workloads)
- Generate showback/chargeback reports

For detailed Kubernetes cost optimization patterns, see `references/kubernetes-cost-optimization.md`.
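
The request/limit rule above (requests = average usage, limits = 2-3x requests) could be sketched as follows. This is a minimal illustration, not part of the skill's own scripts: the function name `recommend_resources`, the `limit_factor` parameter, and the sample values are all assumptions, and the usage samples are presumed to come from whatever metrics source is already in place.

```python
from statistics import mean


def recommend_resources(cpu_millicores, memory_mib, limit_factor=3.0):
    """Derive a Kubernetes-style resources mapping from observed usage samples.

    Hypothetical helper: requests are set to average usage and limits to
    limit_factor x requests, following the 2-3x guidance in this section.
    """
    if not cpu_millicores or not memory_mib:
        raise ValueError("need at least one CPU sample and one memory sample")
    if not 2.0 <= limit_factor <= 3.0:
        raise ValueError("limit_factor is expected to stay within 2-3x")

    cpu_request = round(mean(cpu_millicores))   # millicores
    mem_request = round(mean(memory_mib))       # MiB

    return {
        "requests": {"cpu": f"{cpu_request}m", "memory": f"{mem_request}Mi"},
        "limits": {
            "cpu": f"{round(cpu_request * limit_factor)}m",
            "memory": f"{round(mem_request * limit_factor)}Mi",
        },
    }


# Illustrative samples only (not real measurements).
print(recommend_resources([400, 500, 600], [900, 1024, 1148]))
```

With these sample values the result matches the YAML above: 500m / 1024Mi requests and 1500m / 3072Mi limits. In practice VPA recommendations would replace the hand-rolled average.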
## Cost Visibility and Monitoring

### Tagging for Cost Allocation

**Required Tags:**
- `Owner` or `Team` - Responsible team/department
- `Project` or `Application` - Business unit or application name
- `Environment` - prod, staging, dev, test
- `CostCenter` - Finance cost center code

**Enable Cost Allocation Tags:**
- **AWS:** Activate tags in Cost Allocation Tags console
- **Azure:** Apply tags via Azure Policy enforcement
- **GCP:** Use labels on all resources, export to BigQuery

For comprehensive tagging strategies, see `references/tagging-for-cost-allocation.md`.

### Monitoring and Dashboards

**Native Cloud Tools:**
- **AWS Cost Explorer:** Analyze spending patterns, forecast costs
- **Azure Cost Management + Billing:** Budget tracking, cost analysis
- **GCP Cloud Billing:** BigQuery export for custom analysis

**Third-Party Platforms:**
- **Kubecost:** Kubernetes cost visibility and optimization
- **CloudZero:** Unit cost economics, anomaly detection
- **CloudHealth:** Multi-cloud cost management
- **Infracost:** Terraform cost estimation in CI/CD

**Key Metrics to Track:**
- Total monthly cloud spend (trend over time)
- Cost per service/team/project (allocation accuracy)
- Unit cost metrics (cost per customer, cost per transaction)
- Reserved Instance/Savings Plan utilization (target >95%)
- Idle resource waste (target <5% of total spend)
- Budget variance (forecasted vs. actual)

### Budget Alerts and Anomaly Detection

**Cascading Budget Alerts:**
```
50% of budget  → Email to team lead (informational)
75% of budget  → Email + Slack to team (warning)
90% of budget  → Email + Slack + PagerDuty (urgent)
100% of budget → Automated shutdown (non-prod only) or escalation
```

**Anomaly Detection:** Alert on unexpected cost spikes
- Cost increase of more than 20% week-over-week
- Unexpected daily cost spike of more than $500
- New resource types (unusual spend patterns)

**Budget Granularity:**
- Organization-level (total cloud spend)
- Department-level (engineering, data, marketing)
- Project-level (per application/service)
- Environment-level (prod vs. dev/staging)

## Decision Frameworks

### Framework 1: Commitment Discount Decision Tree

```
Should we purchase Reserved Instances / Savings Plans?

STEP 1: Analyze Historical Usage (6-12 months)
├─ Identify steady-state baseline (minimum usage)
├─ Exclude spiky/seasonal workloads
└─ Calculate: (baseline usage) / (total usage) = commitment %

STEP 2: Choose Commitment Type
├─ RESERVED INSTANCES
│  ├─ Pros: Highest discount (up to 72%)
│  ├─ Cons: Instance type locked (unless convertible)
│  └─ Use for: Databases, stable production workloads
│
├─ SAVINGS PLANS
│  ├─ Pros: Flexible (across instance types, regions)
│  ├─ Cons: Slightly lower discount than RI
│  └─ Use for: Compute workloads, Lambda, Fargate
│
└─ COMMITTED USE DISCOUNTS (GCP)
   ├─ Resource-based: vCPU/memory commitments
   └─ Spend-based: Dollar amount commitments

STEP 3: Determine Commitment Period
├─ 1-year commitment
│  ├─ Lower discount (40-50%)
│  └─ Less risk if architecture changes
│
└─ 3-year commitment
   ├─ Higher discount (60-72%)
   └─ Only for mature, stable workloads

STEP 4: Monitor and Optimize
├─ Target >95% RI/Savings Plan utilization
├─ Sell unused Standard RIs on AWS Reserved Instance Marketplace
└─ Adjust commitments quarterly based on usage trends
```

### Framework 2: Right-Sizing Priority Matrix

**Cost Impact vs. Effort:**

**High Impact, Low Effort (DO FIRST):**
- Idle resources (100% waste): Stopped instances, unattached volumes, old snapshots
- Unused NAT Gateways (roughly $32/month each in hourly charges)
- Over-provisioned databases (<20% CPU for 30 days)
- Kubernetes pods with no resource requests set

**High Impact, Medium Effort (DO SECOND):**
- Over-provisioned compute (<40% CPU/memory for 30 days)
- Lambda functions with max memory >2x used memory
- Storage optimization (S3 Intelligent-Tiering, gp3 vs. gp2)

**Low Impact, High Effort (DO LAST):**
- Application code optimization (requires profiling, refactoring)
- Architecture redesign (serverless migration, multi-region optimization)

**Weekly Optimization Routine:**
1. Delete idle resources (automated script)
2. Review top 10 cost drivers (manual analysis)
3. Right-size 3-5 instances/week (incremental approach)
4. Monitor impact (cost trend over 4 weeks)

### Framework 3: Spot vs. On-Demand Decision

```
Should this workload use Spot/Preemptible instances?

├─ Is the workload fault-tolerant?
│  ├─ NO → Use On-Demand
│  └─ YES → Continue
│
├─ Is the workload stateless (or has checkpointing)?
│  ├─ NO → Use On-Demand (data loss risk)
│  └─ YES → Continue
│
├─ Can the workload handle interruptions gracefully?
│  ├─ NO → Use On-Demand
│  └─ YES → Continue
│
└─ Workload Type Assessment:
   ├─ Batch Jobs / CI/CD → ✅ Use Spot (70-90% savings)
   ├─ ML Training → ✅ Use Spot (with checkpointing)
   ├─ Kubernetes Workers → ✅ Use Spot (mixed with on-demand)
   ├─ Production API Servers → ⚠️ Mixed fleet (70% spot, 30% on-demand)
   ├─ Databases → ❌ Use On-Demand (or Reserved)
   └─ Real-time Services → ❌ Use On-Demand (or Reserved)
```

## Tool Selection Guide

### By Platform

| Platform | Cost Visibility | Right-Sizing | Automation |
|----------|----------------|--------------|------------|
| **AWS** | Cost Explorer, CUR | Compute Optimizer | AWS Budgets, Lambda cleanup |
| **Azure** | Cost Management | Azure Advisor | Azure Policy, Automation |
| **GCP** | Cloud Billing | Recommender | Budget Alerts, Cloud Functions |
| **Kubernetes** | Kubecost, OpenCost | VPA | Cluster Autoscaler |
| **Multi-Cloud** | CloudZero, CloudHealth | Densify | ParkMyCloud |

### By Use Case

| Use Case | Recommended Tool | Key Feature |
|----------|------------------|-------------|
| K8s cost visibility | Kubecost | Real-time namespace cost allocation |
| Terraform cost estimation | Infracost | PR comments with cost diffs |
| Multi-cloud aggregation | CloudHealth | Unified cost view across AWS/Azure/GCP |
| Automated optimization | nOps (AWS), CAST AI (K8s) | ML-based automation |
| Unit cost economics | CloudZero | Cost per customer/transaction tracking |
| Spot instance management | Spot.io | Automated spot orchestration |

For detailed tool comparisons and selection criteria, see `references/tools-comparison.md`.

## Cloud-Specific Tactics

### AWS Optimization Tactics

1. **Enable Cost & Usage Reports (CUR):** Export detailed billing to S3
2. **Use AWS Compute Optimizer:** ML-based EC2 rightsizing recommendations
3. **Implement Savings Plans:** More flexible than Reserved Instances
4. **S3 Intelligent-Tiering:** Automatic storage class optimization
5. **Lambda Right-Sizing:** Adjust memory allocation (CPU scales proportionally)
6. **EBS gp3 Migration:** Up to 20% cheaper per GB than gp2, with a 3,000 IOPS baseline that does not depend on volume size

### Azure Optimization Tactics

1. **Enable Azure Advisor:** VM rightsizing and reserved instance recommendations
2. **Azure Hybrid Benefit:** Bring Windows Server licenses for discounts
3. **Dev/Test Pricing:** Reduced rates for non-production workloads
4. **Azure Spot VMs:** Up to 90% discount for interruptible workloads
5. **Storage Lifecycle Management:** Auto-tier blobs to cool/archive tiers

### GCP Optimization Tactics

1. **Export Billing to BigQuery:** Custom cost analysis with SQL
2. **Sustained Use Discounts:** Automatic 20-30% discount (no commitment)
3. **Committed Use Discounts:** 52-70% savings for 3-year commitments
4. **Spot (Preemptible) VMs:** Up to 91% discount for batch workloads
5. **GCP Recommender:** Idle VM detection and rightsizing advice

For cloud-specific deep dives, see `references/cloud-specific-tactics.md`.

## Implementation Checklist

### Phase 1: Establish Visibility (Week 1-2)

- [ ] Define and apply cost allocation tags (Owner, Project, Environment)
- [ ] Activate cost allocation tags in cloud billing console
- [ ] Deploy Kubecost for Kubernetes cost visibility (if using K8s)
- [ ] Create cost dashboards (Grafana, CloudWatch, Azure Monitor, GCP)
- [ ] Set up weekly cost reports (emailed to team leads)

### Phase 2: Set Up Governance (Week 2-3)

- [ ] Create budget alerts (50%, 75%, 90%, 100% thresholds)
- [ ] Enable anomaly detection (>20% WoW increase)
- [ ] Implement tagging policy enforcement (Azure Policy, AWS Config, GCP Org Policy)
- [ ] Establish showback reports (cost by team/project)
- [ ] Document cost ownership (who owns which services)

### Phase 3: Quick Wins (Week 3-4)

- [ ] Delete idle resources (unattached volumes, old snapshots)
- [ ] Stop/terminate unused development instances
- [ ] Right-size top 10 over-provisioned instances (<40% utilization)
- [ ] Implement S3 Intelligent-Tiering or lifecycle policies
- [ ] Evaluate Reserved Instance/Savings Plan coverage

### Phase 4: Commitment Discounts (Month 2)

- [ ] Analyze 6-12 months usage history
- [ ] Calculate baseline usage for commitment sizing
- [ ] Purchase Reserved Instances for databases
- [ ] Purchase Savings Plans for compute workloads
- [ ] Monitor RI/SP utilization (target >95%)

### Phase 5: Automation (Month 2-3)

- [ ] Deploy automated cleanup scripts (weekly schedule)
- [ ] Integrate Infracost into CI/CD pipelines
- [ ] Implement auto-shutdown for dev/test environments (off-hours)
- [ ] Enable Vertical Pod Autoscaler (VPA) for K8s rightsizing
- [ ] Set up Spot instance automation (Spot.io, CAST AI, or native)

### Phase 6: Continuous Optimization (Ongoing)

- [ ] Weekly cost reviews with engineering teams
- [ ] Monthly optimization sprints (top cost drivers)
- [ ] Quarterly commitment adjustments (RI/SP coverage)
- [ ] Annual FinOps maturity assessment

## Common Pitfalls

### Pitfall 1: No Cost Visibility

❌ **Problem:** Finance team sees the cloud bill only at the end of the month, with surprises everywhere

✅ **Solution:** Deploy real-time cost dashboards, daily Slack reports to engineering teams

### Pitfall 2: Reserved Instance Underutilization

❌ **Problem:** Purchased 100 RIs, only using 60 (40% wasted commitment)

✅ **Solution:** Monitor RI utilization weekly (target >95%), sell unused Standard RIs on the Reserved Instance Marketplace

### Pitfall 3: Missing Kubernetes Resource Requests

❌ **Problem:** Pods with no requests set → inefficient bin-packing → wasted nodes

✅ **Solution:** Use VPA to auto-generate recommendations, enforce via admission control

### Pitfall 4: Idle Resources Not Cleaned Up

❌ **Problem:** 50 stopped EC2 instances (still paying for EBS), 200 unattached volumes

✅ **Solution:** Weekly automated cleanup of resources that have been idle for more than 7 days

### Pitfall 5: No Budget Alerts

❌ **Problem:** Accidentally left test cluster running, $10K bill surprise

✅ **Solution:** Budget alerts at 50%, 75%, 90%, 100% with Slack/PagerDuty notifications
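
The cascading thresholds that Pitfall 5 relies on, together with the 20% week-over-week anomaly rule, could be sketched as follows. This is an illustrative outline only: `ALERT_TIERS`, `triggered_alerts`, and `is_weekly_anomaly` are hypothetical names, and the actual notification delivery (email, Slack, PagerDuty) is assumed to be handled elsewhere, for example by the cloud provider's native budget service.

```python
# Thresholds and actions mirror the cascading budget alert table in this skill.
ALERT_TIERS = [
    (0.50, "email team lead (informational)"),
    (0.75, "email + Slack to team (warning)"),
    (0.90, "email + Slack + PagerDuty (urgent)"),
    (1.00, "automated shutdown (non-prod only) or escalation"),
]


def triggered_alerts(spend, budget):
    """Return every alert action whose threshold the current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [action for threshold, action in ALERT_TIERS if ratio >= threshold]


def is_weekly_anomaly(last_week, this_week, max_increase=0.20):
    """Flag a week-over-week increase above max_increase (20% by default)."""
    if last_week <= 0:
        # Assumption: any spend appearing where there was none counts as anomalous.
        return this_week > 0
    return (this_week - last_week) / last_week > max_increase


# Illustrative numbers only.
print(triggered_alerts(spend=8_000, budget=10_000))
print(is_weekly_anomaly(last_week=1_000, this_week=1_250))
```

Returning all reached tiers, not only the highest one, keeps the sketch simple; a real integration would normally track which tiers have already fired in the current budget period.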
## Related Skills

- **resource-tagging:** Cost allocation tags enable showback/chargeback models
- **kubernetes-operations:** K8s rightsizing, VPA, cluster autoscaling for cost optimization
- **infrastructure-as-code:** Infracost for Terraform cost estimation and policy-as-code
- **aws-patterns:** AWS-specific cost optimization tactics (EC2, RDS, S3, Lambda)
- **gcp-patterns:** GCP-specific optimizations (Compute Engine, BigQuery, Cloud Storage)
- **azure-patterns:** Azure-specific optimizations (VMs, Storage, App Service, Functions)
- **platform-engineering:** Internal FinOps platforms and self-service cost dashboards
- **disaster-recovery:** Balance cost vs. RTO/RPO (warm standby vs. cold standby)

## Examples

See `examples/` directory for:
- **terraform/**: AWS, Azure, GCP cost optimization infrastructure (budgets, alerts)
- **kubernetes/**: Kubecost deployment, resource quotas, VPA configurations
- **ci-cd/**: Infracost GitHub Actions, cost approval workflows
- **dashboards/**: Grafana cost dashboards, CloudWatch alarms

## Scripts

See `scripts/` directory for:
- **cleanup_idle_resources.py:** Automated AWS/Azure/GCP idle resource cleanup
- **ri_coverage_report.py:** Reserved Instance coverage analysis
- **cost_allocation_report.py:** Generate showback/chargeback reports
- **spot_savings_calculator.py:** Estimate savings from spot instances
- **k8s_rightsizing_audit.py:** Find K8s pods with missing resource requests

## Key Takeaways

1. **FinOps is a Culture:** Collaboration between finance, engineering, and operations
2. **Visibility First:** You can't optimize what you can't measure (tags + dashboards mandatory)
3. **Commitment = Savings:** Reserved Instances/Savings Plans provide 40-72% discounts
4. **Right-Size Continuously:** Target 60-80% utilization (leave headroom for spikes)
5. **Automate Cleanup:** Idle resources are 100% waste (weekly automated deletion)
6. **Kubernetes Costs Are Hidden:** Use Kubecost/OpenCost for namespace-level visibility
7. **Shift-Left Cost Awareness:** Infracost in CI/CD prevents surprise cost increases
8. **Budget Alerts Prevent Overspend:** Cascading notifications at 50%, 75%, 90%, 100%
9. **Spot for Fault-Tolerant Workloads:** 70-90% discount (CI/CD, batch jobs, ML training)
10. **Unit Cost Metrics Drive Value:** Track cost per customer, cost per transaction
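
Takeaway 3 only pays off when the commitment is sized against a real baseline. A minimal sketch of the Framework 1 calculation, (baseline usage) / (total usage) = commitment %, is shown here under stated assumptions: the function name `commitment_share` is hypothetical, the input is presumed to be hourly instance counts already exported from billing data, and the steady-state baseline is taken as the minimum observed usage, as in Step 1.

```python
def commitment_share(hourly_usage):
    """Return (baseline, share of total usage the baseline would cover).

    hourly_usage: instance counts (or vCPU counts) per hour over the
    analysis window, with spiky/seasonal workloads already excluded.
    """
    if not hourly_usage:
        raise ValueError("need at least one usage sample")
    total = sum(hourly_usage)
    if total <= 0:
        raise ValueError("total usage must be positive")

    baseline = min(hourly_usage)                 # steady-state floor
    covered = baseline * len(hourly_usage)       # usage a baseline commitment covers
    return baseline, covered / total


# Illustrative numbers only: six hourly samples of running instances.
baseline, share = commitment_share([10, 10, 14, 20, 10, 16])
print(f"commit to {baseline} instances, covering {share:.0%} of usage")
```

Committing only to the floor keeps utilization of the purchased commitment at 100% by construction, which is consistent with the >95% utilization target; the remaining usage stays on on-demand or spot capacity.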