---
name: langfuse-observability
description: LLM observability platform for tracing, evaluation, prompt management, and cost tracking. Use when setting up Langfuse, monitoring LLM costs, tracking token usage, or implementing prompt versioning.
context: fork
agent: metrics-architect
version: 1.0.0
author: OrchestKit AI Agent Hub
tags: [langfuse, llm, observability, tracing, evaluation, prompts, 2026]
user-invocable: false
---

# Langfuse Observability

## Overview

**Langfuse** is the open-source LLM observability platform that OrchestKit uses for tracing, monitoring, evaluation, and prompt management. Unlike LangSmith (now deprecated in OrchestKit), Langfuse is self-hosted, free, and designed for production LLM applications.

**When to use this skill:**

- Setting up LLM observability from scratch
- Debugging slow or incorrect LLM responses
- Tracking token usage and costs
- Managing prompts in production
- Evaluating LLM output quality
- Migrating from LangSmith to Langfuse

**OrchestKit Integration:**

- **Status**: Migrated from LangSmith (Dec 2025)
- **Location**: `backend/app/shared/services/langfuse/`
- **MCP Server**: `orchestkit-langfuse` (optional)

---

## Quick Start

### Setup

```python
# backend/app/shared/services/langfuse/client.py
from langfuse import Langfuse
from app.core.config import settings

langfuse_client = Langfuse(
    public_key=settings.LANGFUSE_PUBLIC_KEY,
    secret_key=settings.LANGFUSE_SECRET_KEY,
    host=settings.LANGFUSE_HOST,  # Self-hosted or cloud
)
```

### Basic Tracing with @observe

```python
from langfuse.decorators import observe, langfuse_context

@observe()  # Automatic tracing
async def analyze_content(content: str):
    langfuse_context.update_current_observation(
        metadata={"content_length": len(content)}
    )
    return await llm.generate(content)
```

### Session & User Tracking

```python
langfuse_client.trace(
    name="analysis",
    user_id="user_123",
    session_id="session_abc",
    metadata={"content_type": "article", "agent_count": 8},
    tags=["production", "orchestkit"],
)
```

---

## Core Features Summary

| Feature | Description | Reference |
|---------|-------------|-----------|
| Distributed Tracing | Track LLM calls with parent-child spans | `references/tracing-setup.md` |
| Cost Tracking | Automatic token & cost calculation | `references/cost-tracking.md` |
| Prompt Management | Version control for prompts | `references/prompt-management.md` |
| LLM Evaluation | Custom scoring with G-Eval | `references/evaluation-scores.md` |
| Session Tracking | Group related traces | `references/session-tracking.md` |
| Experiments API | A/B testing & benchmarks | `references/experiments-api.md` |
| Multi-Judge Eval | Ensemble LLM evaluation | `references/multi-judge-evaluation.md` |

---

## References

### Tracing Setup

**See: `references/tracing-setup.md`**

Key topics covered:

- Initializing the Langfuse client with the @observe decorator
- Creating nested traces and spans
- Tracking LLM generations with metadata
- LangChain/LangGraph CallbackHandler integration
- Workflow integration patterns

### Cost Tracking

**See: `references/cost-tracking.md`**

Key topics covered:

- Automatic cost calculation from token usage
- Custom model pricing configuration
- Monitoring dashboard SQL queries
- Cost tracking per analysis/user
- Daily cost trend analysis

### Prompt Management

**See: `references/prompt-management.md`**

Key topics covered:

- Prompt versioning and labels (production/staging/draft)
- Template variables with Jinja2 syntax
- A/B testing prompt versions
- OrchestKit 4-level caching architecture (L1-L4)
- Linking prompts to generation spans (see the sketch below)
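The following is a minimal sketch of the fetch-compile-link pattern, assuming the Langfuse v2 Python SDK used in the Quick Start above; the prompt name `content-analysis`, its template variables, and the model name are illustrative placeholders rather than OrchestKit's actual configuration.

```python
from langfuse import Langfuse

langfuse_client = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / SECRET_KEY / HOST from env

# Fetch the version currently labeled "production"; the SDK caches it client-side.
prompt = langfuse_client.get_prompt("content-analysis", label="production")

# Fill the template variables defined on the prompt.
compiled = prompt.compile(content_type="article", tone="neutral")

# Passing the prompt object links this generation to the prompt version in the UI.
trace = langfuse_client.trace(name="analysis")
generation = trace.generation(
    name="analyze",
    model="gpt-4o",
    input=compiled,
    prompt=prompt,
)
generation.end(output="...model response...")
langfuse_client.flush()
```

Fetching by label rather than by version number lets a deployment pick up a newly promoted prompt without a code change.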
### LLM Evaluation

**See: `references/evaluation-scores.md`**

Key topics covered:

- Custom scoring with numeric/categorical values
- G-Eval automated quality assessment
- Score trends and comparisons
- Filtering traces by score thresholds

### Session Tracking

**See: `references/session-tracking.md`**

Key topics covered:

- Grouping traces by session_id
- Multi-turn conversation tracking
- User and metadata analytics

### Experiments API

**See: `references/experiments-api.md`**

Key topics covered:

- Creating test datasets in Langfuse
- Running automated evaluations
- Regression testing for LLMs
- Benchmarking prompt versions

### Multi-Judge Evaluation

**See: `references/multi-judge-evaluation.md`**

Key topics covered:

- Multiple LLM judges for quality assessment
- Weighted scoring across judges
- OrchestKit langfuse_evaluators.py integration

---

## Best Practices

1. **Always use the @observe decorator** for automatic tracing
2. **Set user_id and session_id** for better analytics
3. **Add meaningful metadata** (content_type, analysis_id, etc.)
4. **Score all production traces** for quality monitoring
5. **Use prompt management** instead of hardcoded prompts
6. **Monitor costs daily** to catch spikes early
7. **Create datasets** for regression testing
8. **Tag production vs. staging** traces

---

## LangSmith Migration Notes

**Key Differences:**

| Aspect | Langfuse | LangSmith |
|--------|----------|-----------|
| Hosting | Self-hosted, open-source | Cloud-only, proprietary |
| Cost | Free | Paid |
| Prompts | Built-in management | External storage needed |
| Decorator | `@observe` | `@traceable` |
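In practice the decorator row above is the main code change at call sites. A hedged before-and-after sketch follows; the `analyze` function and `llm` client mirror the Quick Start example and are placeholders.

```python
# Before: LangSmith tracing
from langsmith import traceable

@traceable
async def analyze(text: str) -> str:
    return await llm.generate(text)
```

```python
# After: Langfuse tracing; the decorator swap is the only change at the call site
from langfuse.decorators import observe

@observe()
async def analyze(text: str) -> str:
    return await llm.generate(text)
```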
---

## External References

- [Langfuse Docs](https://langfuse.com/docs)
- [Python SDK](https://langfuse.com/docs/sdk/python)
- [Decorators Guide](https://langfuse.com/docs/sdk/python/decorators)
- [Prompt Management](https://langfuse.com/docs/prompts)
- [Self-Hosting](https://langfuse.com/docs/deployment/self-host)

---

## Related Skills

- `observability-monitoring` - General observability patterns for metrics, logging, and alerting
- `llm-evaluation` - Evaluation patterns that integrate with Langfuse scoring
- `llm-streaming` - Streaming response patterns with trace instrumentation
- `prompt-caching` - Caching strategies that reduce costs tracked by Langfuse

## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Observability platform | Langfuse (not LangSmith) | Open-source, self-hosted, free, built-in prompt management |
| Tracing approach | @observe decorator | Automatic, low-overhead instrumentation |
| Cost tracking | Automatic token counting | Built-in model pricing with custom overrides |
| Prompt management | Langfuse native | Version control, A/B testing, labels in one place |

## Capability Details

### distributed-tracing

**Keywords:** trace, tracing, observability, span, nested, parent-child, observe

**Solves:**

- How do I trace LLM calls across my application?
- How do I debug slow LLM responses?
- Track execution flow in multi-agent workflows
- Create nested trace spans

### cost-tracking

**Keywords:** cost, token usage, pricing, budget, spend, expense

**Solves:**

- How do I track LLM costs?
- Calculate token usage and pricing
- Monitor AI budget and spending
- Track cost per user or session

### prompt-management

**Keywords:** prompt version, prompt template, prompt control, prompt registry

**Solves:**

- How do I version control prompts?
- Manage prompts in production
- A/B test different prompt versions
- Link prompts to traces

### llm-evaluation

**Keywords:** score, quality, evaluation, rating, assessment, g-eval

**Solves:**

- How do I evaluate LLM output quality?
- Score responses with custom metrics
- Track quality trends over time
- Compare prompt versions by quality

### session-tracking

**Keywords:** session, user tracking, conversation, group traces

**Solves:**

- How do I group related traces?
- Track multi-turn conversations
- Monitor per-user performance
- Organize traces by session

### langchain-integration

**Keywords:** langchain, callback, handler, langgraph integration

**Solves:**

- How do I integrate Langfuse with LangChain?
- Use CallbackHandler for tracing
- Automatic LangGraph workflow tracing
- LangChain observability setup

### datasets-evaluation

**Keywords:** dataset, test set, evaluation dataset, benchmark

**Solves:**

- How do I create test datasets in Langfuse?
- Run automated evaluations
- Regression testing for LLMs
- Benchmark prompt versions

### ab-testing

**Keywords:** a/b test, experiment, compare prompts, variant testing

**Solves:**

- How do I A/B test prompts?
- Compare two prompt versions
- Experimental prompt evaluation
- Statistical prompt testing

### monitoring-dashboard

**Keywords:** dashboard, analytics, metrics, monitoring, queries

**Solves:**

- What are the most expensive traces?
- Average cost by agent type
- Quality score trends
- Custom monitoring queries

### orchestkit-integration

**Keywords:** orchestkit, migration, setup, workflow integration

**Solves:**

- How does OrchestKit use Langfuse?
- Migrate from LangSmith to Langfuse
- OrchestKit workflow tracing patterns
- Cost tracking per analysis

### multi-judge-evaluation

**Keywords:** multi judge, g-eval, multiple evaluators, ensemble evaluation, weighted scoring

**Solves:**

- How do I use multiple LLM judges to evaluate quality?
- Set up G-Eval criteria evaluation
- Configure weighted scoring across judges
- Wire OrchestKit's existing langfuse_evaluators.py

### experiments-api

**Keywords:** experiment, dataset, benchmark, regression test, prompt testing

**Solves:**

- How do I run experiments across datasets?
- A/B test models and prompts systematically
- Track quality regression over time
- Compare experiment results
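As a companion to the experiments-api capability, here is a minimal regression-run sketch, assuming the Langfuse v2 Python SDK; the dataset name `qa-regression`, the `summarize` stub, and the `exact_match` score are hypothetical placeholders for the application code under test, and dataset items are assumed to store string inputs and expected outputs.

```python
from langfuse import Langfuse

langfuse_client = Langfuse()  # reads LANGFUSE_* credentials from the environment


def summarize(text: str) -> str:
    # Stand-in for the application code under test.
    return text[:100]


def run_experiment(run_name: str) -> None:
    dataset = langfuse_client.get_dataset("qa-regression")
    for item in dataset.items:
        # One trace per dataset item so each result stays inspectable on its own.
        trace = langfuse_client.trace(name="experiment-item", input=item.input)
        output = summarize(item.input)
        trace.update(output=output)
        # Attach the trace to the dataset item under this experiment run.
        item.link(trace, run_name)
        # Score the trace so experiment runs can be compared in the Langfuse UI.
        langfuse_client.score(
            trace_id=trace.id,
            name="exact_match",
            value=float(output == item.expected_output),
        )
    langfuse_client.flush()


run_experiment(run_name="baseline-v1")
```

Each run name groups its traces in the dataset view, so comparing `baseline-v1` against a later run surfaces regressions per item.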