---
name: FinOps AI Expert
description: Cost optimization for AI workloads - model selection, GPU sizing, commitment strategies, and multi-cloud cost management
version: 1.1.0
last_updated: 2026-01-06
external_version: "2026 Cloud Pricing"
triggers:
  - cost optimization
  - FinOps
  - AI costs
  - GPU costs
  - token pricing
---

# FinOps AI Expert

You are an expert in Financial Operations (FinOps) for AI workloads, specializing in cost optimization across model selection, infrastructure sizing, commitment strategies, and multi-cloud cost management.

## AI Cost Components

### Cost Breakdown Framework

```
┌─────────────────────────────────────────────────────────────────┐
│                    AI WORKLOAD COST STACK                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  INFERENCE COSTS (60-80% typical)                               │
│  ├── Token costs (input + output)                               │
│  ├── GPU compute time                                           │
│  └── API call overhead                                          │
│                                                                  │
│  INFRASTRUCTURE COSTS (15-30%)                                  │
│  ├── GPU/Compute instances                                      │
│  ├── Storage (models, vectors, data)                           │
│  ├── Networking (egress, load balancers)                       │
│  └── Supporting services (DBs, queues, caches)                 │
│                                                                  │
│  DEVELOPMENT COSTS (5-15%)                                      │
│  ├── Training/Fine-tuning compute                              │
│  ├── Experimentation                                           │
│  └── Development environments                                  │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

## LLM Pricing Comparison

### API Pricing (Per 1M Tokens)

| Provider | Model | Input | Output | Context |
|----------|-------|-------|--------|---------|
| **OpenAI** | GPT-4o | $2.50 | $10.00 | 128K |
| **OpenAI** | GPT-4o-mini | $0.15 | $0.60 | 128K |
| **OpenAI** | GPT-4 Turbo | $10.00 | $30.00 | 128K |
| **Anthropic** | Claude 3.5 Sonnet | $3.00 | $15.00 | 200K |
| **Anthropic** | Claude 3 Haiku | $0.25 | $1.25 | 200K |
| **Google** | Gemini 1.5 Pro | $1.25 | $5.00 | 1M |
| **Google** | Gemini 1.5 Flash | $0.075 | $0.30 | 1M |
| **AWS Bedrock** | Claude 3.5 Sonnet | $3.00 | $15.00 | 200K |
| **AWS Bedrock** | Llama 3.1 70B | $2.65 | $3.50 | 128K |
| **Azure OpenAI** | GPT-4o | $5.00 | $15.00 | 128K |
| **OCI GenAI** | Command R+ (DAC) | Included | Included | - |

### Cost Per Query Estimation

```python
class LLMCostCalculator:
    PRICING = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
        "claude-3-haiku": {"input": 0.25, "output": 1.25},
        "llama-3-70b": {"input": 2.65, "output": 3.50},
    }

    def calculate_query_cost(
        self,
        model: str,
        input_tokens: int,
        output_tokens: int
    ) -> float:
        """Calculate cost for a single query in dollars"""
        pricing = self.PRICING[model]
        input_cost = (input_tokens / 1_000_000) * pricing["input"]
        output_cost = (output_tokens / 1_000_000) * pricing["output"]
        return input_cost + output_cost

    def calculate_monthly_cost(
        self,
        model: str,
        queries_per_day: int,
        avg_input_tokens: int,
        avg_output_tokens: int
    ) -> dict:
        """Estimate monthly costs"""
        daily_cost = self.calculate_query_cost(
            model,
            queries_per_day * avg_input_tokens,
            queries_per_day * avg_output_tokens
        )
        monthly_cost = daily_cost * 30

        return {
            "model": model,
            "daily_queries": queries_per_day,
            "daily_cost": f"${daily_cost:.2f}",
            "monthly_cost": f"${monthly_cost:.2f}",
            "annual_cost": f"${monthly_cost * 12:.2f}"
        }

# Example
calc = LLMCostCalculator()

# RAG chatbot: 10K queries/day, 2000 input tokens, 500 output tokens
calc.calculate_monthly_cost("gpt-4o", 10000, 2000, 500)
# {'monthly_cost': '$300.00'}  # GPT-4o

calc.calculate_monthly_cost("claude-3-haiku", 10000, 2000, 500)
# {'monthly_cost': '$33.75'}  # 89% savings with Haiku
```

## Model Selection for Cost Optimization

### Decision Matrix

```
┌─────────────────────────────────────────────────────────────────┐
│                MODEL SELECTION BY USE CASE                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  TASK COMPLEXITY     │ RECOMMENDED           │ COST/1K QUERIES  │
│  ────────────────────┼───────────────────────┼─────────────────  │
│  Simple Q&A          │ GPT-4o-mini, Haiku    │ $0.05 - $0.20    │
│  Classification      │ Haiku, Gemini Flash   │ $0.02 - $0.10    │
│  Summarization       │ GPT-4o-mini, Sonnet   │ $0.10 - $0.50    │
│  RAG (retrieval)     │ Sonnet, GPT-4o-mini   │ $0.20 - $1.00    │
│  Code generation     │ Sonnet, GPT-4o        │ $0.50 - $2.00    │
│  Complex reasoning   │ GPT-4o, Claude Opus   │ $1.00 - $5.00    │
│  Agent tasks         │ Sonnet, GPT-4o        │ $2.00 - $10.00   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

### Model Cascading Pattern

```python
class ModelCascade:
    """Route to cheapest model that can handle the task"""

    def __init__(self):
        self.models = [
            {"name": "claude-3-haiku", "cost": 0.25, "capability": 0.7},
            {"name": "gpt-4o-mini", "cost": 0.15, "capability": 0.75},
            {"name": "claude-3-5-sonnet", "cost": 3.00, "capability": 0.95},
            {"name": "gpt-4o", "cost": 2.50, "capability": 0.98},
        ]

    async def route(self, query: str, complexity_score: float) -> str:
        """Route to appropriate model based on complexity"""
        for model in sorted(self.models, key=lambda x: x["cost"]):
            if model["capability"] >= complexity_score:
                return model["name"]
        return self.models[-1]["name"]  # Fallback to most capable

    async def cascade_with_fallback(self, query: str) -> dict:
        """Try cheap model first, escalate if needed"""
        # Start with cheapest
        response = await self.call_model("claude-3-haiku", query)

        # Check confidence
        if response.confidence < 0.8:
            # Escalate to better model
            response = await self.call_model("claude-3-5-sonnet", query)

        return response
```

## GPU Cost Optimization

### GPU Pricing Comparison

| Provider | GPU | vCPU | Memory | Hourly | Monthly |
|----------|-----|------|--------|--------|---------|
| **AWS** | A10G | 4 | 24GB | $1.21 | $870 |
| **AWS** | A100 40GB | 12 | 192GB | $3.67 | $2,640 |
| **AWS** | H100 | 192 | 2TB | $12.36 | $8,900 |
| **Azure** | A10 | 6 | 112GB | $1.14 | $820 |
| **Azure** | A100 80GB | 24 | 220GB | $3.40 | $2,450 |
| **GCP** | A100 40GB | 12 | 85GB | $3.67 | $2,640 |
| **OCI** | A10 | 15 | 240GB | $1.00 | $720 |
| **Lambda** | A100 | 30 | 200GB | $1.29 | $930 |
| **RunPod** | A100 | - | 80GB | $1.89 | $1,360 |

### Right-Sizing GPU Workloads

```python
class GPUSizer:
    """Recommend GPU size based on model and workload"""

    GPU_MEMORY = {
        "A10G": 24,
        "L4": 24,
        "A100-40GB": 40,
        "A100-80GB": 80,
        "H100": 80,
    }

    MODEL_MEMORY = {
        # Model: (FP16 size GB, Quantized GB)
        "llama-3.1-8B": (16, 6),
        "llama-3.1-70B": (140, 42),
        "llama-3.1-405B": (810, 250),
        "mistral-7B": (14, 5),
        "mixtral-8x7B": (96, 32),
    }

    def recommend_gpu(
        self,
        model: str,
        batch_size: int = 1,
        use_quantization: bool = True
    ) -> dict:
        """Recommend GPU configuration"""
        base_mem, quant_mem = self.MODEL_MEMORY.get(model, (10, 4))
        model_mem = quant_mem if use_quantization else base_mem

        # Add overhead for KV cache and batch
        kv_cache_per_batch = 2  # GB per batch slot
        total_mem = model_mem + (kv_cache_per_batch * batch_size) + 2  # 2GB overhead

        # Find suitable GPU
        suitable_gpus = []
        for gpu, mem in self.GPU_MEMORY.items():
            if mem >= total_mem:
                suitable_gpus.append(gpu)

        if not suitable_gpus:
            # Need multi-GPU
            return {
                "recommendation": "multi-gpu",
                "min_gpus": (total_mem // 80) + 1,
                "gpu_type": "A100-80GB or H100"
            }

        return {
            "recommendation": suitable_gpus[0],
            "memory_required": f"{total_mem:.1f}GB",
            "batch_size": batch_size,
            "quantization": use_quantization
        }
```

## Commitment Strategies

### Reserved Capacity Comparison

| Provider | Commitment | Discount | Term |
|----------|------------|----------|------|
| **Azure PTU** | Provisioned Throughput | ~30% | Monthly |
| **OCI DAC** | Dedicated AI Cluster | Flat rate | Monthly |
| **AWS Savings Plans** | Compute | 20-30% | 1-3 years |
| **GCP CUDs** | Committed Use | 20-57% | 1-3 years |

### Break-Even Analysis

```python
def commitment_breakeven(
    on_demand_monthly: float,
    committed_monthly: float,
    commitment_term_months: int,
    upfront_cost: float = 0
) -> dict:
    """Calculate break-even point for commitments"""

    monthly_savings = on_demand_monthly - committed_monthly
    total_commitment_cost = (committed_monthly * commitment_term_months) + upfront_cost
    total_on_demand_cost = on_demand_monthly * commitment_term_months

    break_even_months = upfront_cost / monthly_savings if monthly_savings > 0 else float('inf')

    return {
        "monthly_savings": f"${monthly_savings:.2f}",
        "total_savings": f"${total_on_demand_cost - total_commitment_cost:.2f}",
        "break_even_months": round(break_even_months, 1),
        "roi_percentage": f"{((total_on_demand_cost - total_commitment_cost) / total_commitment_cost) * 100:.1f}%"
    }

# Example: Azure PTU commitment
commitment_breakeven(
    on_demand_monthly=5000,  # Pay-as-you-go
    committed_monthly=3500,  # PTU pricing
    commitment_term_months=12,
    upfront_cost=0
)
# {'monthly_savings': '$1500.00', 'total_savings': '$18000.00', 'roi_percentage': '42.9%'}
```

## Cost Monitoring & Alerts

### Tagging Strategy

```yaml
# Required tags for AI workloads
ai_cost_tags:
  mandatory:
    - project: "ai-platform"
    - environment: "prod/staging/dev"
    - cost_center: "engineering"
    - workload_type: "inference/training/embedding"
    - model: "gpt-4o/claude-3/llama-3"

  recommended:
    - team: "ml-platform"
    - owner: "email@company.com"
    - budget_code: "AI-2024-Q1"
```

### Budget Alerts

```hcl
# Terraform for AWS Budget Alert
resource "aws_budgets_budget" "ai_monthly" {
  name         = "ai-platform-monthly"
  budget_type  = "COST"
  limit_amount = "10000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "TagKeyValue"
    values = ["user:project$ai-platform"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_email_addresses = ["finops@company.com"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type            = "FORECASTED"
    notification_type         = "FORECASTED"
    subscriber_email_addresses = ["finops@company.com", "engineering@company.com"]
  }
}
```

### Cost Dashboard Metrics

```python
FINOPS_METRICS = {
    # Cost metrics
    "cost_per_query": "Total cost / number of queries",
    "cost_per_token": "Total cost / tokens processed",
    "cost_per_user": "Total cost / active users",
    "cost_efficiency": "Output value / total cost",

    # Utilization metrics
    "gpu_utilization": "Active GPU time / provisioned GPU time",
    "api_efficiency": "Successful calls / total calls",
    "cache_hit_rate": "Cached responses / total requests",

    # Optimization metrics
    "model_routing_savings": "Baseline cost - actual cost",
    "commitment_utilization": "Committed capacity used / purchased",
    "spot_savings": "On-demand equivalent - actual spot cost"
}
```

## Cost Optimization Techniques

### 1. Prompt Engineering for Cost

```python
class CostAwarePrompting:
    """Optimize prompts for cost efficiency"""

    def optimize_prompt(self, prompt: str, max_tokens: int = None) -> str:
        """Reduce prompt tokens while maintaining quality"""
        # Remove redundant whitespace
        optimized = ' '.join(prompt.split())

        # Use abbreviations for common patterns
        optimized = optimized.replace("Please provide", "Provide")
        optimized = optimized.replace("I would like you to", "")
        optimized = optimized.replace("Can you please", "")

        return optimized

    def batch_similar_requests(self, requests: list) -> list:
        """Batch similar requests to reduce overhead"""
        # Group by similar prompts
        batches = {}
        for req in requests:
            key = self.get_prompt_signature(req)
            if key not in batches:
                batches[key] = []
            batches[key].append(req)

        return list(batches.values())
```

### 2. Caching Strategy

```python
import hashlib
from functools import lru_cache

class SemanticCache:
    """Cache LLM responses by semantic similarity"""

    def __init__(self, similarity_threshold: float = 0.95):
        self.cache = {}
        self.threshold = similarity_threshold

    def get_cache_key(self, prompt: str) -> str:
        """Generate cache key from prompt"""
        return hashlib.sha256(prompt.encode()).hexdigest()

    async def get_or_generate(
        self,
        prompt: str,
        generate_fn,
        ttl_seconds: int = 3600
    ):
        """Return cached response or generate new one"""
        cache_key = self.get_cache_key(prompt)

        # Check exact match
        if cache_key in self.cache:
            return self.cache[cache_key]

        # Check semantic similarity
        similar = await self.find_similar(prompt)
        if similar:
            return similar

        # Generate new response
        response = await generate_fn(prompt)
        self.cache[cache_key] = response
        return response

    # Cache hit rates: 30-60% typical for production workloads
    # Cost savings: 30-50% on inference costs
```

### 3. Spot/Preemptible Instances

```python
class SpotInstanceStrategy:
    """Manage spot instances for AI workloads"""

    SPOT_SAVINGS = {
        "aws": 0.70,  # 70% savings typical
        "azure": 0.60,
        "gcp": 0.65,
    }

    def recommend_spot_strategy(self, workload_type: str) -> dict:
        """Recommend spot usage based on workload"""
        strategies = {
            "batch_inference": {
                "spot_eligible": True,
                "percentage": 100,
                "reason": "Interruptible, can retry"
            },
            "training": {
                "spot_eligible": True,
                "percentage": 80,
                "reason": "Checkpoint frequently, retry on interrupt"
            },
            "real_time_inference": {
                "spot_eligible": False,
                "percentage": 0,
                "reason": "Latency-sensitive, needs reliability"
            },
            "dev_environment": {
                "spot_eligible": True,
                "percentage": 100,
                "reason": "Non-critical, cost optimization priority"
            }
        }
        return strategies.get(workload_type, {"spot_eligible": False})
```

## Multi-Cloud Cost Arbitrage

### Provider Selection by Cost

```python
class MultiCloudCostRouter:
    """Route workloads to cheapest provider"""

    PROVIDER_COSTS = {
        "embedding": {
            "aws_titan": 0.0001,
            "azure_ada": 0.0001,
            "cohere": 0.0001,
            "openai": 0.00013,
        },
        "chat": {
            "aws_claude_haiku": 0.00025,
            "azure_gpt35": 0.0005,
            "openai_gpt4o_mini": 0.00015,
        }
    }

    def get_cheapest_provider(self, task_type: str) -> tuple:
        """Return cheapest provider for task"""
        costs = self.PROVIDER_COSTS.get(task_type, {})
        if not costs:
            return None, None

        cheapest = min(costs.items(), key=lambda x: x[1])
        return cheapest

    def calculate_arbitrage_savings(
        self,
        current_provider: str,
        current_cost: float,
        volume: int
    ) -> dict:
        """Calculate savings from switching providers"""
        alternatives = []
        for task, providers in self.PROVIDER_COSTS.items():
            for provider, cost in providers.items():
                if cost < current_cost:
                    monthly_savings = (current_cost - cost) * volume * 30
                    alternatives.append({
                        "provider": provider,
                        "cost": cost,
                        "monthly_savings": f"${monthly_savings:.2f}"
                    })

        return sorted(alternatives, key=lambda x: float(x["monthly_savings"].replace("$", "")), reverse=True)
```

## FinOps Maturity Model

```
┌─────────────────────────────────────────────────────────────────┐
│                  AI FINOPS MATURITY LEVELS                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  LEVEL 1: CRAWL                                                 │
│  ├── Basic cost visibility                                      │
│  ├── Manual cost tracking                                       │
│  └── Simple tagging                                             │
│                                                                  │
│  LEVEL 2: WALK                                                  │
│  ├── Automated cost allocation                                  │
│  ├── Budget alerts                                              │
│  ├── Model selection guidelines                                 │
│  └── Basic optimization (caching, batching)                     │
│                                                                  │
│  LEVEL 3: RUN                                                   │
│  ├── Real-time cost dashboards                                  │
│  ├── Automated cost anomaly detection                           │
│  ├── Commitment management                                      │
│  ├── Multi-cloud cost optimization                              │
│  └── Cost-aware model routing                                   │
│                                                                  │
│  LEVEL 4: FLY                                                   │
│  ├── Predictive cost modeling                                   │
│  ├── Automated scaling based on cost/performance                │
│  ├── Business value attribution                                 │
│  └── Continuous optimization loops                              │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

## Resources

- [FinOps Foundation](https://www.finops.org/)
- [AWS Cost Management](https://aws.amazon.com/aws-cost-management/)
- [Azure Cost Management](https://azure.microsoft.com/en-us/products/cost-management)
- [GCP Cost Management](https://cloud.google.com/cost-management)
- [Anthropic Pricing](https://www.anthropic.com/pricing)
- [OpenAI Pricing](https://openai.com/pricing)