---
name: Cost Observability and Monitoring
description: Techniques for gaining visibility into cloud spending, attributing costs to business units, and detecting financial anomalies.
---

# Cost Observability and Monitoring

## Overview

Cost observability extends traditional system observability (logs, metrics, traces) to include **financial** data. It lets engineering teams answer not just "Is the system healthy?" but also "Is the system cost-effective?"

**Core Principle**: "Total spend is a vanity metric; cost per unit of work is a performance metric."

---

## 1. Key Cost Metrics to Track

The goal is to move from **macro** visibility (the bill) to **micro** visibility (the request).

| Metric | Level | Purpose |
| :--- | :--- | :--- |
| **Total Monthly Spend** | Executive | General budget health. |
| **Cost per Service** | Engineering | Identify inefficient microservices. |
| **Cost per Customer (Unit Cost)** | Product | Calculate per-account profitability. |
| **Cost per Request** | Engineering | Measure efficiency of application code. |
| **COGS (Cost of Goods Sold)** | Financial | The base cost to deliver the service. |

---

## 2. Cost Attribution and Tagging Strategy

Attribution is impossible without consistent metadata.

### The Standard Tagging Schema

Every resource should carry the following "FinOps Tags":

1. **`Environment`**: e.g., `prod`, `staging`, `dev`
2. **`Service`**: e.g., `auth-api`, `image-processor`
3. **`Owner`**: e.g., `team-alpha`
4. **`Project`**: e.g., `project-phoenix`
5. **`TenantID`**: only if using siloed resources per customer

### Enforcement Policy (Terraform/OpenTofu)

Defining the mandatory tags once and reusing them keeps resources consistent. On its own this does not block untagged resources, so pair it with a policy check or alert.

```hcl
# Define the mandatory tags once as a local value and reuse them on every resource.
# All values below are illustrative placeholders.
variable "environment" {
  type        = string
  description = "Deployment environment, e.g. prod, staging, dev"
}

locals {
  mandatory_tags = {
    Environment = var.environment
    Service     = "payment-gateway"
    Owner       = "finance-team"
    Project     = "project-phoenix"
    CostCenter  = "9921" # organization-specific addition to the schema above
  }
}

resource "aws_instance" "app" {
  ami           = "ami-12345" # placeholder AMI ID
  instance_type = "t3.medium"
  tags          = local.mandatory_tags
}
```

---

## 3. Cost Anomaly Detection

A financial anomaly is an unexpected deviation from historical spend patterns.

### Types of Anomalies

1. **Sudden Spikes**: A developer spins up a massive GPU instance and forgets to delete it.
2. **Gradual Drift**: A memory leak causes auto-scaling to add a new server every day.
3. **Cyclical Variation**: Spend increases during weekends when it should be lower.

### Anomaly Alert Example (Slack/PagerDuty)

* **Alert**: "AWS Spend Spike Detected"
* **Metric**: `S3 Egress`
* **Deviation**: +450% over the last 24 hours.
* **Likely Cause**: Possible data exfiltration or misconfigured backup script.

---

## 4. Application-Level Cost Tracking

Sometimes cloud tags aren't granular enough (e.g., when multiple customers share one database).

### OpenTelemetry for Cost

You can attach "cost attributes" to your spans to calculate the price of a specific API endpoint.

```typescript
// Example: tracking LLM cost in a trace
import { trace } from '@opentelemetry/api';

const span = trace.getTracer('llm-tracer').startSpan('generate_text');

// ... perform the LLM call and read token usage from its response.
// Placeholder counts stand in for the real values here.
const inputTokens = 1200;
const outputTokens = 350;

// Illustrative per-token prices in USD; substitute your provider's actual rates.
const cost = (inputTokens * 0.00001) + (outputTokens * 0.00003);
span.setAttribute('app.cost.usd', cost);
span.setAttribute('app.tokens.input', inputTokens);
span.setAttribute('app.tokens.output', outputTokens);
span.end();
```

---

## 5. Dashboard Templates

### Engineering Dashboard (Grafana)

* **Top 5 Costliest Microservices** (Bar chart)
* **Idle Resource Count** (Single stat)
* **Compute Efficiency** (CPU utilization vs. Cost)
* **Data Egress by Region** (Pie chart)

### Product/Executive Dashboard

* **Revenue vs. Infrastructure Cost** (Area chart)
* **Margin per Feature** (Heatmap)
* **Cost per Daily Active User (DAU)** (Line chart)

---

## 6. Tools Ecosystem

### Native Cloud Tools

* **AWS Cost Explorer**: Best for monthly trends and filtered views.
* **AWS Cost Anomaly Detection**: Uses ML to flag unusual spend automatically.
* **GCP Recommender**: Suggests specific sizing changes to save money.

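
For intuition about the kind of deviation these detectors report, here is a minimal sketch of a baseline comparison like the "+450%" alert in Section 3. It uses a plain trailing average, which is an assumption made for illustration and not how AWS Cost Anomaly Detection works internally. The `checkDailySpend` helper, the `SpendCheck` shape, and the sample figures are all hypothetical.

```typescript
// Hypothetical helper: compares today's spend with the mean of a trailing window.
interface SpendCheck {
  baseline: number;     // mean daily spend over the history window (USD)
  deviationPct: number; // percent change of today's spend vs. the baseline
  isAnomaly: boolean;   // true when deviationPct exceeds the threshold
}

function checkDailySpend(
  history: number[],
  today: number,
  thresholdPct: number = 20, // assumed default: alert at 20% above baseline
): SpendCheck {
  if (history.length === 0) {
    throw new Error("history must contain at least one day of spend");
  }
  const baseline = history.reduce((sum, v) => sum + v, 0) / history.length;
  if (baseline <= 0) {
    throw new Error("baseline spend must be positive to compute a percentage");
  }
  const deviationPct = ((today - baseline) / baseline) * 100;
  return { baseline, deviationPct, isAnomaly: deviationPct > thresholdPct };
}

// Example: a week of roughly $100/day egress spend, followed by a $550 day.
const egressHistory = [98, 102, 100, 101, 99, 100, 100];
const spendCheck = checkDailySpend(egressHistory, 550);
console.log(
  `${spendCheck.deviationPct.toFixed(0)}% vs. baseline, anomaly: ${spendCheck.isAnomaly}`,
); // prints "450% vs. baseline, anomaly: true"
```

A check like this only catches sudden upward spikes. Gradual drift and cyclical variation would need a longer window or a seasonality-aware baseline.
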
### Specialized Tools

* **CloudHealth / Cloudability**: Enterprise-grade cost allocation and multi-cloud reporting.
* **Kubecost**: A widely used cost allocation tool for Kubernetes. It models costs based on pod resource requests.
* **Infracost**: A CLI tool that runs in CI/CD and estimates how much a pull request will cost before it is merged.

---

## 7. Chargeback vs. Showback

How do you hold teams accountable?

| Model | Description | Pros | Cons |
| :--- | :--- | :--- | :--- |
| **Showback** | Reporting costs to teams without actually billing their budgets. | Low friction, creates awareness. | No "teeth"; teams can ignore it. |
| **Chargeback** | Directly deducting cloud costs from a department's real budget. | Forces accountability, drives optimization. | High administrative overhead. |

---

## 8. Cost Forecasting

Forecasting helps avoid end-of-quarter budget surprises.

1. **Linear Projection**: `NextMonth = ThisMonth * GrowthRate`, where `GrowthRate` is the month-over-month multiplier (for example, `1.05` for 5% growth).
2. **Seasonality-Aware**: Accounting for peak periods like Black Friday or holiday sales.
3. **Scenario Planning**: "If we double our user base, what happens to our NAT Gateway costs?"

---

## 9. Common Optimization Targets

* **S3 Storage Class Analysis**: Finding buckets that could move to Infrequent Access.
* **Database Query Analysis**: Finding a single query that causes high CPU/IOPS across thousands of DB connections.
* **Zombie Snapshots**: Finding and deleting EBS snapshots that are no longer needed, for example those older than 90 days.

---

## 10. Implementation Checklist

- [ ] **Tagging Enforcement**: Do resources without tags trigger an alert or auto-deletion?
- [ ] **Accountability**: Does every team (identified by the `Owner` tag) have a dashboard showing its spend?
- [ ] **Thresholds**: Are there daily spending alerts set at 20% above "normal"?
- [ ] **Unit Economics**: Do we know the infrastructure cost of a single user transaction?
- [ ] **Forecasting**: Are we predicting next month's bill with < 10% error?

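
The forecasting item above can be checked with a few lines of arithmetic. The sketch below projects next month's spend from the current month and then measures how far a forecast landed from the actual bill. It assumes the growth rate is expressed as a month-over-month multiplier (such as `1.05`). The function names and dollar figures are hypothetical.

```typescript
// Hypothetical helpers for the forecasting checklist item.

// Projects next month's spend from this month's total and a
// month-over-month multiplier (assumed form: 1.05 means 5% growth).
function projectNextMonth(thisMonthUsd: number, growthMultiplier: number): number {
  return thisMonthUsd * growthMultiplier;
}

// Absolute percentage error of a forecast against the actual bill.
function forecastErrorPct(forecastUsd: number, actualUsd: number): number {
  if (actualUsd === 0) {
    throw new Error("actual spend must be non-zero to compute a percentage");
  }
  return (Math.abs(forecastUsd - actualUsd) / Math.abs(actualUsd)) * 100;
}

// Example: $40,000 this month with 5% growth, compared with a $45,000 actual bill.
const forecastUsd = projectNextMonth(40_000, 1.05);
const errorPct = forecastErrorPct(forecastUsd, 45_000);
const withinTarget = errorPct < 10; // the checklist's "< 10% error" target
console.log(
  `forecast=$${forecastUsd.toFixed(2)} error=${errorPct.toFixed(1)}% withinTarget=${withinTarget}`,
);
```

Tracking this error month over month shows whether a simple projection is good enough or whether a seasonality-aware model is worth the effort.
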
---

## Related Skills

- `42-cost-engineering/cloud-cost-models`
- `42-cost-engineering/budget-guardrails`
- `40-system-resilience/chaos-engineering` (using chaos to test cost stability)