---
name: llm-call-tracing
description: Instrument LLM API calls with proper spans, tokens, and latency
triggers:
  - "trace LLM calls"
  - "instrument model calls"
  - "LLM observability"
  - "track model latency"
priority: 1
---

# LLM Call Tracing

Instrument LLM API calls to track latency, tokens, costs, and errors.

## Core Principle

Every LLM call should capture:

1. **What model** was called
2. **How long** it took
3. **How many tokens** were used
4. **Did it succeed** or fail
5. **Why it failed** (if applicable)

## Essential Span Attributes

```python
# Required (P0)
span.set_attribute("llm.model", "claude-3-opus-20240229")
span.set_attribute("llm.provider", "anthropic")
span.set_attribute("llm.latency_ms", 2340)
span.set_attribute("llm.success", True)

# Token tracking (P1)
span.set_attribute("llm.tokens.input", 1500)
span.set_attribute("llm.tokens.output", 350)
span.set_attribute("llm.tokens.total", 1850)

# Cost (P1)
span.set_attribute("llm.cost_usd", 0.025)

# Configuration (P2)
span.set_attribute("llm.temperature", 0.7)
span.set_attribute("llm.max_tokens", 4096)
span.set_attribute("llm.stop_reason", "end_turn")

# Error context (when applicable)
span.set_attribute("llm.error.type", "rate_limit")
span.set_attribute("llm.error.message", "Rate limit exceeded")
span.set_attribute("llm.retry_count", 2)
```

## What NOT to Log

**Never log full prompts/responses:**

```python
# BAD - PII risk, storage explosion
span.set_attribute("llm.prompt", messages)
span.set_attribute("llm.response", completion.content)

# GOOD - Safe metadata
span.set_attribute("llm.prompt.message_count", len(messages))
span.set_attribute("llm.prompt.system_length", len(system_prompt))
span.set_attribute("llm.response.length", len(completion.content))
```

## Streaming Considerations

For streaming responses:

- Start span when stream begins
- Update token counts as chunks arrive
- End span when stream completes
- Track time-to-first-token (TTFT)

```python
span.set_attribute("llm.streaming", True)
span.set_attribute("llm.ttft_ms", 145)  # Time to first token
span.set_attribute("llm.chunks", 47)    # Number of chunks
```
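Putting those steps together: a minimal sketch of a generator wrapper, assuming the OpenTelemetry Python API (`opentelemetry-api`). The `traced_stream` name is illustrative, not part of any SDK; the `llm.*` keys follow the conventions above.

```python
import time

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def traced_stream(chunks, model: str):
    """Wrap an LLM chunk iterator with a span covering the whole stream."""
    # start_span (not start_as_current_span) so the span stays open
    # across yields and is ended explicitly when the stream finishes.
    span = tracer.start_span("llm.call")
    span.set_attribute("llm.model", model)
    span.set_attribute("llm.streaming", True)
    start = time.monotonic()  # clock starts when iteration begins
    count = 0
    try:
        for chunk in chunks:
            if count == 0:
                # Time-to-first-token: delay before the first chunk arrives
                span.set_attribute(
                    "llm.ttft_ms", int((time.monotonic() - start) * 1000)
                )
            count += 1
            yield chunk
        span.set_attribute("llm.success", True)
    finally:
        # Runs on normal completion, errors, and abandoned streams alike
        span.set_attribute("llm.chunks", count)
        span.set_attribute("llm.latency_ms", int((time.monotonic() - start) * 1000))
        span.end()

# Usage (Anthropic SDK shown; any chunk iterator works):
# stream = client.messages.create(model=..., messages=..., stream=True)
# for chunk in traced_stream(stream, "claude-3-opus-20240229"):
#     ...
```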
## Cost Calculation

Calculate cost from tokens and model pricing:

```python
PRICING = {
    "claude-3-opus": {"input": 15.00, "output": 75.00},  # USD per 1M tokens
    "claude-3-sonnet": {"input": 3.00, "output": 15.00},
    "claude-3-haiku": {"input": 0.25, "output": 1.25},
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    "gpt-4o": {"input": 5.00, "output": 15.00},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    pricing = PRICING.get(model, {"input": 0, "output": 0})
    input_cost = (input_tokens / 1_000_000) * pricing["input"]
    output_cost = (output_tokens / 1_000_000) * pricing["output"]
    return round(input_cost + output_cost, 6)
```

## Framework-Specific Patterns

### LangChain

```python
from langfuse.callback import CallbackHandler

handler = CallbackHandler()
chain.invoke(input, config={"callbacks": [handler]})
```

### Direct Anthropic SDK

```python
from langfuse.decorators import observe

@observe(as_type="generation")
def call_claude(messages):
    response = client.messages.create(...)
    return response
```

### OpenAI SDK

```python
from langfuse.openai import openai

# Automatic instrumentation
client = openai.OpenAI()
```

## Error Handling

Capture errors with context:

```python
from anthropic import APIError, RateLimitError

try:
    response = client.messages.create(...)
except RateLimitError as e:
    span.set_attribute("llm.error.type", "rate_limit")
    # Retry-After arrives as a response header, not an exception attribute
    span.set_attribute("llm.error.retry_after", e.response.headers.get("retry-after", ""))
    raise
except APIError as e:
    span.set_attribute("llm.error.type", "api_error")
    # status_code only exists on APIStatusError subclasses
    span.set_attribute("llm.error.status", getattr(e, "status_code", 0))
    raise
```

## Anti-Patterns

See `references/anti-patterns/llm-tracing.md`:

- Logging full prompts (PII, storage)
- Blocking on telemetry (latency)
- Missing token counts (cost blindness)
- No retry tracking (hidden failures)

## Related Skills

- `token-cost-tracking` - Detailed cost attribution
- `error-retry-tracking` - Error handling patterns