---
name: infrastructure-cost-optimization
description: Optimize cloud infrastructure costs through resource rightsizing, reserved instances, spot instances, and waste reduction strategies.
---

# Infrastructure Cost Optimization

## Overview

Reduce infrastructure costs through intelligent resource allocation, reserved instances, spot instances, and continuous optimization without sacrificing performance.

## When to Use

- Cloud cost reduction
- Budget management and tracking
- Resource utilization optimization
- Multi-environment cost allocation
- Waste identification and elimination
- Reserved instance planning
- Spot instance integration

## Implementation Examples

### 1. **AWS Cost Optimization Configuration**

```yaml
# cost-optimization-setup.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cost-optimization-scripts
  namespace: operations
data:
  analyze-costs.sh: |
    #!/bin/bash
    set -euo pipefail

    echo "=== AWS Cost Analysis ==="

    # Get daily cost trend (GNU date syntax).
    # No --group-by here: grouped results leave Total empty, which would blank the table.
    echo "Daily costs for last 7 days:"
    aws ce get-cost-and-usage \
      --time-period "Start=$(date -d '7 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d)" \
      --granularity DAILY \
      --metrics "BlendedCost" \
      --query 'ResultsByTime[*].[TimePeriod.Start,Total.BlendedCost.Amount]' \
      --output table

    # Find unattached resources
    echo -e "\n=== Unattached EBS Volumes ==="
    aws ec2 describe-volumes \
      --filters Name=status,Values=available \
      --query 'Volumes[*].[VolumeId,Size,CreateTime]' \
      --output table

    # Unassociated addresses have no AssociationId, so filter client-side
    echo -e "\n=== Unattached Elastic IPs ==="
    aws ec2 describe-addresses \
      --query 'Addresses[?AssociationId==`null`].[PublicIp,AllocationId]' \
      --output table

    # "available" only means the instance is running; review these for rightsizing
    echo -e "\n=== Running RDS Instances (review for rightsizing) ==="
    aws rds describe-db-instances \
      --query 'DBInstances[?DBInstanceStatus==`available`].[DBInstanceIdentifier,DBInstanceClass,Engine,AllocatedStorage]' \
      --output table

    # Estimate savings with Reserved Instances
    # (Cost Explorer expects the full service name, not "EC2")
    echo -e "\n=== Reserved Instance Savings Potential ==="
    aws ce get-reservation-purchase-recommendation \
      --service "Amazon Elastic Compute Cloud - Compute" \
      --lookback-period THIRTY_DAYS \
      --query 'Recommendations[0].[RecommendationSummary.TotalEstimatedMonthlySavingsAmount,RecommendationSummary.TotalEstimatedMonthlySavingsPercentage]' \
      --output table

  optimize-resources.sh: |
    #!/bin/bash
    set -euo pipefail

    # WARNING: this script deletes resources. Review the output of
    # analyze-costs.sh before running it.
    echo "Starting resource optimization..."

    # Remove unattached volumes.
    # Text output is tab-separated on one line, so split it into one ID per line.
    echo "Removing unattached volumes..."
    aws ec2 describe-volumes \
      --filters Name=status,Values=available \
      --query 'Volumes[*].VolumeId' \
      --output text | tr '\t' '\n' | \
    while read -r volume_id; do
      [ -n "$volume_id" ] || continue
      echo "Deleting volume: $volume_id"
      aws ec2 delete-volume --volume-id "$volume_id" \
        || echo "WARN: could not delete $volume_id" >&2
    done

    # Release unused Elastic IPs
    echo "Releasing unused Elastic IPs..."
    aws ec2 describe-addresses \
      --query 'Addresses[?AssociationId==`null`].AllocationId' \
      --output text | tr '\t' '\n' | \
    while read -r alloc_id; do
      [ -n "$alloc_id" ] || continue
      echo "Releasing EIP: $alloc_id"
      aws ec2 release-address --allocation-id "$alloc_id" \
        || echo "WARN: could not release $alloc_id" >&2
    done

    # Modify RDS to smaller instances
    echo "Analyzing RDS for downsizing..."
    # Implement logic to check CloudWatch metrics and downsize if needed

    echo "Optimization complete"
```

```hcl
# Terraform cost optimization
resource "aws_instance" "spot" {
  ami           = "ami-0c55b159cbfafe1f0" # example AMI ID; replace with one valid in your region
  instance_type = "t3.medium"

  # Use spot instances for non-critical workloads
  instance_market_options {
    market_type = "spot"

    spot_options {
      max_price          = "0.05" # maximum hourly price in USD
      spot_instance_type = "persistent"
      # Persistent requests must stop or hibernate on interruption, not terminate
      instance_interruption_behavior = "stop"
      valid_until                    = "2025-12-31T23:59:59Z" # request expiry; set a future date
    }
  }

  tags = {
    Name       = "spot-instance"
    CostCenter = "engineering"
  }
}

# On-demand instance for baseline capacity. A Reserved Instance is a billing
# discount, not a resource: it applies automatically to running instances that
# match its instance type, platform, tenancy, and scope. The tag below is for
# reporting only.
resource "aws_instance" "reserved" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"

  tags = {
    Name            = "reserved-instance"
    ReservationType = "reserved"
  }
}

resource "aws_ec2_fleet" "mixed" {
  # aws_launch_template.app is assumed to be defined elsewhere
  launch_template_config {
    launch_template_specification {
      launch_template_id = aws_launch_template.app.id
      version            = aws_launch_template.app.latest_version
    }

    # Lower priority numbers rank higher, and priority only takes effect with a
    # prioritized allocation strategy. Whether capacity launches as On-Demand or
    # Spot is decided by target_capacity_specification, not per override.
    override {
      instance_type     = "t3.medium"
      weighted_capacity = 1
      priority          = 1
    }

    override {
      instance_type     = "t3.large"
      weighted_capacity = 2
      priority          = 2
    }

    override {
      instance_type     = "t3a.medium"
      weighted_capacity = 1
      priority          = 3
    }

    override {
      instance_type     = "t3a.large"
      weighted_capacity = 2
      priority          = 4
    }
  }

  target_capacity_specification {
    default_target_capacity_type = "on-demand"
    total_target_capacity        = 10
    on_demand_target_capacity    = 6
    spot_target_capacity         = 4
  }

  type = "maintain"

  tags = {
    Name = "mixed-capacity"
  }
}
```

### 2. **Kubernetes Cost Optimization**

```yaml
# k8s-cost-optimization.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cost-optimization-policies
  namespace: kube-system
data:
  policies.yaml: |
    # Resource quotas per namespace
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: compute-quota
      namespace: production
    spec:
      hard:
        requests.cpu: "100"
        requests.memory: "200Gi"
        limits.cpu: "200"
        limits.memory: "400Gi"
        pods: "500"
      scopeSelector:
        matchExpressions:
          - operator: In
            scopeName: PriorityClass
            values: ["high", "medium"]
    ---
    # Pod Disruption Budget: keep at least one backend pod running while
    # nodes are drained during scale-down or spot/preemptible reclamation
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: cost-optimized-pdb
      namespace: production
    spec:
      minAvailable: 1
      selector:
        matchLabels:
          tier: backend
    ---
    # Taint spot/preemptible nodes so only tolerant workloads land on them.
    # In practice the taint is usually set on the node pool or with
    # `kubectl taint`, not by applying a Node manifest.
    apiVersion: v1
    kind: Node
    metadata:
      name: spot-node-1
    spec:
      taints:
        - key: cloud.google.com/gke-preemptible
          value: "true"
          effect: NoSchedule
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: cost-optimized-app
      namespace: production
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: myapp
      template:
        metadata:
          labels:
            app: myapp
        spec:
          # Tolerate spot instances
          tolerations:
            - key: cloud.google.com/gke-preemptible
              operator: Equal
              value: "true"
              effect: NoSchedule
          # Prefer lower-cost preemptible nodes. The label is provider-specific:
          # GKE sets cloud.google.com/gke-preemptible, while Karpenter on AWS
          # uses karpenter.sh/capacity-type with the value "spot".
          affinity:
            nodeAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 100
                  preference:
                    matchExpressions:
                      - key: cloud.google.com/gke-preemptible
                        operator: In
                        values: ["true"]
          containers:
            - name: app
              image: myapp:latest
              resources:
                requests:
                  cpu: 100m
                  memory: 128Mi
                limits:
                  cpu: 500m
                  memory: 512Mi
```

### 3. **Cost Monitoring Dashboard**

```python
# cost-monitoring.py
from datetime import datetime, timedelta, timezone

import boto3

# Cost Explorer expects the full service name, not the "EC2" abbreviation
EC2_SERVICE_NAME = 'Amazon Elastic Compute Cloud - Compute'


class CostOptimizer:
    def __init__(self):
        self.ce_client = boto3.client('ce')
        self.ec2_client = boto3.client('ec2')
        self.rds_client = boto3.client('rds')
        self.cloudwatch = boto3.client('cloudwatch')

    def get_daily_costs(self, days=30):
        """Get daily costs per service for the past N days"""
        end_date = datetime.now(timezone.utc).date()
        start_date = end_date - timedelta(days=days)

        response = self.ce_client.get_cost_and_usage(
            TimePeriod={
                'Start': str(start_date),
                'End': str(end_date)
            },
            Granularity='DAILY',
            Metrics=['BlendedCost'],
            GroupBy=[
                {'Type': 'DIMENSION', 'Key': 'SERVICE'}
            ]
        )
        return response

    def find_underutilized_instances(self):
        """Find running EC2 instances with low average CPU over the last 7 days"""
        instances = []
        now = datetime.now(timezone.utc)

        # Paginate so accounts with many instances are fully covered
        paginator = self.ec2_client.get_paginator('describe_instances')
        pages = paginator.paginate(
            Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
        )

        for page in pages:
            for reservation in page['Reservations']:
                for instance in reservation['Instances']:
                    instance_id = instance['InstanceId']

                    # Check CPU utilization (hourly averages)
                    response = self.cloudwatch.get_metric_statistics(
                        Namespace='AWS/EC2',
                        MetricName='CPUUtilization',
                        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                        StartTime=now - timedelta(days=7),
                        EndTime=now,
                        Period=3600,
                        Statistics=['Average']
                    )

                    datapoints = response['Datapoints']
                    if datapoints:
                        avg_cpu = sum(d['Average'] for d in datapoints) / len(datapoints)
                        if avg_cpu < 10:  # Less than 10% average
                            instances.append({
                                'InstanceId': instance_id,
                                'Type': instance['InstanceType'],
                                'AverageCPU': avg_cpu,
                                'Recommendation': 'Downsize or terminate'
                            })

        return instances

    def estimate_reserved_instance_savings(self):
        """Estimate potential monthly savings from reserved instances"""
        response = self.ce_client.get_reservation_purchase_recommendation(
            Service=EC2_SERVICE_NAME,
            LookbackPeriod='THIRTY_DAYS',
            PageSize=100
        )

        total_savings = 0.0
        for recommendation in response.get('Recommendations', []):
            summary = recommendation.get('RecommendationSummary', {})
            total_savings += float(summary.get('TotalEstimatedMonthlySavingsAmount', 0))

        return total_savings

    def generate_report(self):
        """Generate comprehensive cost optimization report"""
        print("=== Cost Optimization Report ===\n")

        # Daily costs
        print("Daily Costs:")
        costs = self.get_daily_costs(7)
        for result in costs['ResultsByTime']:
            date = result['TimePeriod']['Start']
            # Results are grouped by service, so sum the groups for the daily total
            total = sum(
                float(group['Metrics']['BlendedCost']['Amount'])
                for group in result['Groups']
            )
            print(f"  {date}: ${total:.2f}")

        # Underutilized instances
        print("\nUnderutilized Instances:")
        underutilized = self.find_underutilized_instances()
        for instance in underutilized:
            print(f"  {instance['InstanceId']}: {instance['AverageCPU']:.1f}% CPU - {instance['Recommendation']}")

        # Reserved instance savings
        print("\nReserved Instance Savings Potential:")
        savings = self.estimate_reserved_instance_savings()
        print(f"  Estimated Monthly Savings: ${savings:.2f}")


# Usage
if __name__ == '__main__':
    optimizer = CostOptimizer()
    optimizer.generate_report()
```

## Cost Optimization Strategies

### ✅ DO

- Use reserved instances for baseline capacity
- Leverage spot instances for fault-tolerant workloads
- Right-size resources based on utilization metrics
- Monitor cost trends
- Implement auto-scaling
- Compare pricing across regions
- Tag resources consistently
- Schedule non-essential resources to stop outside working hours

### ❌ DON'T

- Over-provision resources
- Ignore unused resources
- Neglect cost monitoring
- Run everything on-demand
- Forget to release unassociated Elastic IPs
- Mix cost centers on shared, untagged resources
- Ignore savings opportunities
- Deploy without budgets

## Cost Saving Opportunities

Typical ranges relative to on-demand pricing; actual savings depend on provider, commitment term, and workload.

- **Reserved Instances**: 40-70% savings
- **Spot Instances**: 70-90% savings
- **Committed Use Discounts**: 25-55% savings
- **Right-sizing**: 10-30% savings
- **Resource cleanup**: 5-20% savings

## Resources

- [AWS Cost Optimization](https://aws.amazon.com/architecture/cost-optimization/)
- [GCP Cost Optimization](https://cloud.google.com/cost-management)
- [Azure Cost Management](https://docs.microsoft.com/en-us/azure/cost-management-billing/)
- [Kubernetes Cost Optimization](https://kubernetes.io/docs/tasks/debug-application-cluster/resource-cost/)
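
The ranges listed under Cost Saving Opportunities apply to different slices of spend, so they should not be stacked on the same dollar. As a rough sketch of how they might be combined, the example below assigns each lever its own non-overlapping share of monthly on-demand spend and sums the low and high ends. The `Lever` class, the `estimate_savings` function, and all spend figures are hypothetical placeholders, not output from any cloud provider's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lever:
    """One savings lever applied to its own slice of monthly spend (hypothetical)."""
    name: str
    monthly_spend: float  # on-demand spend this lever applies to, in USD
    low: float            # lower bound of the savings range, as a fraction
    high: float           # upper bound of the savings range, as a fraction


def estimate_savings(levers):
    """Return (low_total, high_total) monthly savings summed across all levers."""
    low_total = sum(lever.monthly_spend * lever.low for lever in levers)
    high_total = sum(lever.monthly_spend * lever.high for lever in levers)
    return low_total, high_total


if __name__ == "__main__":
    # Placeholder split of a $10,000 monthly bill; replace with real figures
    levers = [
        Lever("Reserved Instances (baseline)", 6000.0, 0.40, 0.70),
        Lever("Spot Instances (batch jobs)", 2000.0, 0.70, 0.90),
        Lever("Right-sizing (remaining on-demand)", 2000.0, 0.10, 0.30),
    ]
    low, high = estimate_savings(levers)
    total = sum(lever.monthly_spend for lever in levers)
    print(f"Estimated savings: ${low:,.0f} to ${high:,.0f} of ${total:,.0f} per month")
```

With these placeholder figures the estimate works out to $4,000 to $6,600 per month on $10,000 of spend. Keeping the slices disjoint is the main design choice: applying a reserved-instance discount and a spot discount to the same workload would overstate the result.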