---
name: cache-cost-tracking
description: LLM cost tracking with Langfuse for cached responses. Use when monitoring cache effectiveness, tracking cost savings, or attributing costs to agents in multi-agent systems.
tags: [llm, cost, caching, langfuse, observability]
context: fork
agent: metrics-architect
version: 1.0.0
author: OrchestKit
user-invocable: false
---

# Cache Cost Tracking

Monitor LLM costs and cache effectiveness.

## Langfuse Automatic Tracking

```python
from uuid import UUID

from langfuse.decorators import observe, langfuse_context

# lru_cache (in-process dict), semantic_cache, and llm are
# module-level dependencies, elided here.


@observe(as_type="generation")
async def call_llm_with_cache(
    prompt: str,
    agent_type: str,
    analysis_id: UUID,
) -> str:
    """LLM call with automatic cost tracking."""
    # Link to parent trace
    langfuse_context.update_current_trace(
        name=f"{agent_type}_generation",
        session_id=str(analysis_id),
    )

    # Check caches: L1 in-process first, then L2 semantic
    cache_key = (prompt, agent_type)
    if cache_key in lru_cache:
        langfuse_context.update_current_observation(
            metadata={"cache_layer": "L1", "cache_hit": True}
        )
        return lru_cache[cache_key]

    similar = await semantic_cache.get(prompt, agent_type)
    if similar:
        langfuse_context.update_current_observation(
            metadata={"cache_layer": "L2", "cache_hit": True}
        )
        return similar

    # LLM call - Langfuse tracks tokens/cost automatically
    response = await llm.generate(prompt)
    langfuse_context.update_current_observation(
        metadata={
            "cache_layer": "L4",
            "cache_hit": False,
            "prompt_cache_hit": response.usage.cache_read_input_tokens > 0,
        }
    )
    return response.content
```

## Hierarchical Cost Rollup

```python
class AnalysisWorkflow:
    @observe()  # top-level @observe creates the parent trace
    async def run_analysis(self, url: str, analysis_id: UUID):
        """Parent trace aggregates child costs.

        Trace Hierarchy:
        run_analysis (trace)
        ├── security_agent (generation)
        ├── tech_agent (generation)
        └── synthesis (generation)
        """
        langfuse_context.update_current_trace(
            name="content_analysis",
            session_id=str(analysis_id),
            tags=["multi-agent"],
        )
        content = await self.fetch_content(url)  # fetch helper elided
        for agent in self.agents:
            await self.run_agent(agent, content, analysis_id)

    @observe(as_type="generation")
    async def run_agent(self, agent, content, analysis_id):
        """Child generation - costs roll up to parent."""
        langfuse_context.update_current_observation(
            name=f"{agent.name}_generation",
            metadata={"agent_type": agent.name},
        )
        return await agent.analyze(content)
```

## Cost Queries

```python
from datetime import datetime, timedelta
from uuid import UUID

from langfuse import Langfuse

langfuse = Langfuse()


async def get_analysis_costs(analysis_id: UUID) -> dict:
    traces = langfuse.get_traces(session_id=str(analysis_id), limit=1)
    if traces.data:
        trace = traces.data[0]
        return {
            "total_cost": trace.total_cost,
            "input_tokens": trace.usage.input_tokens,
            "output_tokens": trace.usage.output_tokens,
            "cache_read_tokens": trace.usage.cache_read_input_tokens,
        }
    return {}


async def get_costs_by_agent() -> list[dict]:
    generations = langfuse.get_generations(
        from_timestamp=datetime.now() - timedelta(days=7),
        limit=1000,
    )
    costs: dict[str, dict] = {}
    for gen in generations.data:
        agent = (gen.metadata or {}).get("agent_type", "unknown")
        if agent not in costs:
            # Keep the agent name in each record so attribution
            # survives list(costs.values()).
            costs[agent] = {"agent": agent, "total": 0.0, "calls": 0, "cache_hits": 0}
        costs[agent]["total"] += gen.calculated_total_cost or 0
        costs[agent]["calls"] += 1
        if (gen.metadata or {}).get("cache_hit"):
            costs[agent]["cache_hits"] += 1
    return list(costs.values())
```

## Cache Effectiveness

```python
cache_hits = 0
cache_misses = 0
cost_saved = 0.0

for gen in generations:
    if gen.metadata.get("cache_hit"):
        cache_hits += 1
        # What the call would have cost without the cache
        cost_saved += estimate_full_cost(gen)
    else:
        cache_misses += 1

total = cache_hits + cache_misses
hit_rate = cache_hits / total if total else 0.0
print(f"Cache Hit Rate: {hit_rate:.1%}")
print(f"Cost Saved: ${cost_saved:.2f}")
```
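The snippet above leaves `estimate_full_cost` undefined. A minimal sketch, assuming Langfuse observations expose `input`/`output` and using the rough 4-characters-per-token heuristic (a cache hit records no token usage, since no LLM call was made); the model name and rates here are illustrative, not authoritative pricing:

```python
# Illustrative $ per million tokens; substitute your provider's real pricing.
PRICING = {"claude-sonnet-4": {"input": 3.00, "output": 15.00}}


def estimate_full_cost(gen, model: str = "claude-sonnet-4") -> float:
    """Approximate cost a cache hit avoided, from recorded input/output length."""
    rates = PRICING[model]
    input_tokens = len(str(gen.input or "")) // 4    # ~4 chars/token heuristic
    output_tokens = len(str(gen.output or "")) // 4
    return (
        input_tokens * rates["input"] / 1_000_000
        + output_tokens * rates["output"] / 1_000_000
    )
```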
## Key Decisions

| Decision | Recommendation |
|----------|----------------|
| Trace grouping | session_id = analysis_id |
| Cost attribution | metadata.agent_type |
| Query window | 7-30 days |
| Dashboard | Langfuse web UI |

## Common Mistakes

- Not linking child to parent trace
- Missing metadata for attribution
- Not tracking cache hits separately
- Ignoring prompt cache savings

## Related Skills

- `semantic-caching` - Redis caching
- `prompt-caching` - Provider caching
- `langfuse-observability` - Full observability

## Capability Details

### prompt-caching

**Keywords:** prompt cache, cache prompt, prefix caching, cache breakpoints

**Solves:**
- Reduce token costs with cached prompts
- Configure cache breakpoints
- Implement provider-native caching

### response-caching

**Keywords:** response cache, semantic cache, cache response, LLM cache

**Solves:**
- Cache LLM responses for repeated queries
- Implement semantic similarity caching
- Reduce API calls with cached responses

### cost-calculation

**Keywords:** cost, token cost, calculate cost, pricing, usage cost

**Solves:**
- Calculate token costs by model
- Track input/output token pricing
- Estimate cost before execution

### usage-tracking

**Keywords:** usage, track usage, token usage, API usage, metrics

**Solves:**
- Track LLM API usage over time
- Monitor token consumption
- Generate usage reports

### cache-invalidation

**Keywords:** invalidate, cache invalidation, TTL, expire, refresh

**Solves:**
- Implement cache invalidation strategies
- Configure TTL for cached responses
- Handle stale cache entries
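On the "Ignoring prompt cache savings" item under Common Mistakes: provider-level savings can be derived directly from usage. A minimal sketch, assuming Anthropic-style usage fields (`cache_read_input_tokens`) and Anthropic's published ~0.1x cache-read pricing; the base rate is illustrative:

```python
BASE_INPUT_RATE = 3.00 / 1_000_000        # $/token, illustrative Sonnet-class rate
CACHE_READ_RATE = 0.1 * BASE_INPUT_RATE   # Anthropic bills cache reads at ~0.1x


def prompt_cache_savings(usage) -> float:
    """Dollars saved by prompt caching on a single call."""
    cached = getattr(usage, "cache_read_input_tokens", 0) or 0
    return cached * (BASE_INPUT_RATE - CACHE_READ_RATE)
```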
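For the `cache-invalidation` capability, the simplest strategy is a TTL on each cached response. A minimal sketch, assuming Redis via `redis.asyncio`; the key scheme and one-hour TTL are illustrative:

```python
import redis.asyncio as redis

r = redis.Redis()


async def cache_response(agent_type: str, prompt_hash: str, value: str) -> None:
    """Store a response with a TTL so stale entries expire on their own."""
    await r.set(f"llmcache:{agent_type}:{prompt_hash}", value, ex=3600)
```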