---
name: distributed-tracing
description: Use when implementing distributed tracing, understanding trace propagation, or debugging cross-service issues. Covers OpenTelemetry, span context, and trace correlation.
allowed-tools: Read, Glob, Grep
---

# Distributed Tracing

Patterns and practices for implementing distributed tracing across microservices and understanding request flows in distributed systems.

## When to Use This Skill

- Implementing distributed tracing in microservices
- Debugging cross-service request issues
- Understanding trace propagation
- Choosing tracing infrastructure
- Correlating logs, metrics, and traces

## Why Distributed Tracing?

```text
Problem: Request flows through multiple services
How do you debug when something fails?

Without tracing:
User → API → ??? → ??? → Error somewhere

With tracing:
User → API (50ms) → OrderService (20ms) → PaymentService (ERROR: timeout)
         └── Full visibility into request flow
```

## Core Concepts

### Traces, Spans, and Context

```text
Trace: End-to-end request journey
├── Span: Single operation within a service
│   ├── SpanID: Unique identifier
│   ├── ParentSpanID: Link to parent span
│   ├── TraceID: Shared across all spans
│   ├── Operation Name: What is being done
│   ├── Start/End Time: Duration
│   ├── Status: Success/Error
│   ├── Attributes: Key-value metadata
│   └── Events: Point-in-time annotations
│
└── Context: Propagated across service boundaries
    ├── TraceID
    ├── SpanID
    ├── Trace Flags
    └── Trace State
```

### Trace Visualization

```text
TraceID: abc123

Service A (API Gateway)
├──────────────────────────────────────────────────────┤ 200ms
    │
    └─► Service B (Order Service)
        ├───────────────────────────────────┤ 150ms
            │
            ├─► Service C (Inventory)
            │   ├───────────────┤ 50ms
            │
            └─► Service D (Payment)
                ├───────────────────────┤ 80ms
                    │
                    └─► External API
                        ├─────────┤ 60ms
```
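The durations in the diagram are inclusive: each bar contains the time spent in its children. A minimal sketch of how to derive per-span self time from that data follows. It assumes child spans run one after another without overlapping (the diagram does not say), and `Span`, `self_times`, and `TRACE` are illustrative names, not part of any tracing library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: int  # inclusive duration, as drawn in the diagram


def self_times(spans: list[Span]) -> dict[str, int]:
    """Return each span's duration minus the summed duration of its direct children."""
    child_total: dict[str, int] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = child_total.get(span.parent_id, 0) + span.duration_ms
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


# The trace from the diagram, with made-up span ids.
TRACE = [
    Span("a", None, "API Gateway", 200),
    Span("b", "a", "Order Service", 150),
    Span("c", "b", "Inventory", 50),
    Span("d", "b", "Payment", 80),
    Span("e", "d", "External API", 60),
]

if __name__ == "__main__":
    for name, ms in self_times(TRACE).items():
        print(f"{name}: {ms}ms self time")
```

Under the no-overlap assumption, the span with the largest self time is the first place to look when the whole request is slow.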

## OpenTelemetry

### Overview

```text
OpenTelemetry = Unified observability framework

Components:
┌─────────────────────────────────────────────────────┐
│  Application                                        │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │    SDK      │  │   Tracer    │  │   Meter     │ │
│  │             │  │   Provider  │  │   Provider  │ │
│  └─────────────┘  └─────────────┘  └─────────────┘ │
└─────────────────────────────────────────────────────┘
           │               │               │
           └───────────────┼───────────────┘
                           ▼
              ┌─────────────────────────┐
              │    OTLP Exporter        │
              └─────────────────────────┘
                           │
                           ▼
              ┌─────────────────────────┐
              │    Collector            │
              │  (Optional)             │
              └─────────────────────────┘
                           │
           ┌───────────────┼───────────────┐
           ▼               ▼               ▼
      ┌─────────┐    ┌─────────┐    ┌─────────┐
      │ Jaeger  │    │  Zipkin │    │  Tempo  │
      └─────────┘    └─────────┘    └─────────┘
```

### Trace Context Propagation

```text
HTTP Headers (W3C Trace Context):
traceparent: 00-{trace-id}-{span-id}-{flags}
tracestate: vendor1=value1,vendor2=value2

Example:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
              │   │                               │                └─ sampled
              │   │                               └─ parent span id
              │   └─ trace id (128-bit)
              └─ version

Propagation across services:
┌─────────────┐                      ┌─────────────┐
│  Service A  │  ─── HTTP ──────────►│  Service B  │
│             │  traceparent: 00-... │             │
│ Create Span │                      │ Extract     │
│ Inject      │                      │ Create Span │
└─────────────┘                      └─────────────┘
```
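To make the header layout concrete, here is a minimal sketch of parsing and re-emitting `traceparent` with only the standard library. It handles the four-field form shown above and nothing else; `TraceParent`, `parse_traceparent`, and `child_traceparent` are hypothetical helper names. In a real service the tracing SDK's propagator would normally do this for you.

```python
import re
from typing import NamedTuple, Optional

# version - trace id (32 hex) - parent span id (16 hex) - flags, all lowercase hex
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent, new_span_id: str) -> str:
    """Build the header a service injects on its outgoing call."""
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"
```

Service B would call `parse_traceparent` on the incoming request, start its own span with the extracted trace id, and pass its new span id to `child_traceparent` for any downstream call.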

### Span Attributes

```text
Semantic conventions (standard attributes):

HTTP:
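- http.request.method: GET, POST, etc. (formerly http.method)
- url.full: Full URL (formerly http.url)
- http.response.status_code: 200, 404, 500 (formerly http.status_code)
- http.route: /users/{id}

Database:
- db.system.name: postgresql, mysql (formerly db.system)
- db.query.text: SELECT * FROM... (formerly db.statement)
- db.operation.name: SELECT, INSERT (formerly db.operation)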

RPC:
- rpc.system: grpc
- rpc.service: OrderService
- rpc.method: CreateOrder

Custom:
- user.id: 12345
- order.total: 99.99
- feature.flag: experiment_v2
```

## Tracing Backends

### Jaeger

```text
Features:
- Open source (CNCF)
- Built-in UI
- Multiple storage backends
- OpenTelemetry native

Architecture:
┌─────────────┐  ┌─────────────┐  ┌────────────────┐
│   Agent     │─►│  Collector  │─►│    Storage     │
│ (optional)  │  │             │  │ (Cassandra/    │
└─────────────┘  └─────────────┘  │ Elasticsearch) │
                       │          └────────────────┘
                       ▼
                ┌─────────────┐
                │    Query    │
                │   Service   │
                └─────────────┘
                       │
                       ▼
                ┌─────────────┐
                │     UI      │
                └─────────────┘
```

### Zipkin

```text
Features:
- Mature, battle-tested
- Simple architecture
- Low resource overhead
- Good ecosystem support

Best for:
- Simpler setups
- Lower resource environments
- Teams familiar with Zipkin
```

### Grafana Tempo

```text
Features:
- Object storage backend (cheap)
- Deep Grafana integration
- Log-based trace discovery
- Exemplars support

Best for:
- Grafana-heavy environments
- Cost-sensitive deployments
- Large-scale traces
```

### Managed Services

| Provider | Service | Integration |
| -------- | ------- | ----------- |
| AWS | X-Ray | Native AWS services |
| GCP | Cloud Trace | Native GCP services |
| Azure | Application Insights | Native Azure services |
| Datadog | APM | Full-stack observability |

## Sampling Strategies

### Why Sample?

```text
High-traffic systems generate millions of spans.
Storing all spans is expensive and often unnecessary.

Sampling: Collect a subset of traces

Goal: Keep enough data to debug issues
      while managing costs
```

### Sampling Types

```text
1. Head-based sampling (at trace start):
   - Decision made when trace begins
   - Consistent across services
   - Simple but may miss rare events

2. Tail-based sampling (after trace complete):
   - Decision made after seeing full trace
   - Can keep interesting traces (errors, slow)
   - Requires buffering spans
   - More complex infrastructure

3. Priority sampling:
   - Assign priority based on attributes
   - Keep all errors, sample normal traffic
```
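One way head-based sampling stays consistent across services is to derive the decision from the trace id itself. The sketch below illustrates that idea with a hypothetical `should_sample` function; it is not the API of any SDK, and in practice downstream services usually just honor the sampled flag propagated in the trace context.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic head-based decision derived from the trace id.

    Every service that evaluates the same trace id with the same ratio
    reaches the same answer, so a trace is kept or dropped as a whole.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the trace id as a uniformly distributed number.
    low_bits = int(trace_id_hex[-16:], 16)
    return low_bits < int(ratio * (1 << 64))
```

Because the decision ignores what happens later in the request, a trace that ends in a rare error is dropped as readily as any other, which is the weakness noted in the list above.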

### Common Sampling Policies

```text
Rate-based:
- Sample 10% of all traces
- Simple, predictable cost

Priority-based:
- 100% of errors
- 100% of slow requests (>1s)
- 5% of normal requests

Adaptive:
- Adjust rate based on traffic
- Target specific traces/second
- Handle traffic spikes
```
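The priority-based policy above can be sketched as a single decision function. This is an illustration under assumptions: `TraceSummary` and `keep_trace` are hypothetical names, and the function needs to know whether the trace had an error and how long it took, so it only works where the full trace is available (tail-based sampling).

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TraceSummary:
    has_error: bool
    duration_ms: float


def keep_trace(
    summary: TraceSummary,
    slow_ms: float = 1000.0,
    baseline_rate: float = 0.05,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Keep all errors, all slow traces, and a fixed fraction of the rest."""
    if summary.has_error:
        return True
    if summary.duration_ms > slow_ms:
        return True
    # rng is injectable so the policy can be tested deterministically.
    return rng() < baseline_rate
```

An adaptive variant would adjust `baseline_rate` at runtime to hold a target number of kept traces per second.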

## Correlation Patterns

### Logs-Traces-Metrics

```text
Three Pillars of Observability:

Logs ◄──────────► Traces ◄──────────► Metrics
  │                  │                   │
  │ trace_id         │ exemplars         │
  │ span_id          │                   │
  └──────────────────┴───────────────────┘

Correlation:
1. Add trace_id/span_id to log entries
2. Add exemplars (trace links) to metrics
3. Click from metric → trace → logs
```

### Log Correlation

```text
Structured log with trace context:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "ERROR",
  "message": "Payment failed",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331",
  "service": "payment-service",
  "user_id": "12345",
  "error": "Card declined"
}

Query in log aggregator:
trace_id:"0af7651916cd43dd8448eb211c80319c"
→ See all logs for this request
```
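A minimal sketch of stamping trace context onto every log line, using only the standard library, might look like this. The `current_trace` context variable is a stand-in for whatever your tracing SDK exposes as the active span context, and `TraceContextFilter`, `JsonFormatter`, and `build_logger` are hypothetical names.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active trace context; a tracing SDK would normally own this.
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace_id and span_id onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = current_trace.get() or {}
        record.trace_id = ctx.get("trace_id", "")
        record.span_id = ctx.get("span_id", "")
        return True


class JsonFormatter(logging.Formatter):
    """Render records as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
            "service": record.name,
        })


def build_logger(service: str, stream) -> logging.Logger:
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceContextFilter())  # runs before the formatter
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    out = io.StringIO()
    log = build_logger("payment-service", out)
    current_trace.set({"trace_id": "0af7651916cd43dd8448eb211c80319c",
                       "span_id": "b7ad6b7169203331"})
    log.error("Payment failed")
    print(out.getvalue(), end="")
```

Attaching the filter to the handler rather than to call sites means application code never has to pass trace ids around by hand.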

### Exemplars (Metrics to Traces)

```text
Metric with exemplar:
http_request_duration{service="api"} = 2.5s
  └── exemplar: trace_id=abc123

When latency spikes:
1. See metric spike in dashboard
2. Click on data point
3. Jump directly to slow trace
4. See exactly what caused latency
```
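Conceptually, an exemplar is a sample observation stored next to an aggregated metric, carrying the trace id that produced it. The toy histogram below sketches that idea; `ExemplarHistogram` and its bucket bounds are assumptions made for illustration, not the data model of any metrics library.

```python
import bisect
from typing import Optional


class ExemplarHistogram:
    """Latency histogram that remembers one trace id per bucket."""

    def __init__(self, bounds=(0.1, 0.5, 1.0, 2.5)):
        self.bounds = list(bounds)                  # upper bucket bounds in seconds
        self.counts = [0] * (len(self.bounds) + 1)  # the last bucket is +Inf
        self.exemplars = {}                         # bucket index -> (value, trace_id)

    def observe(self, seconds: float, trace_id: Optional[str] = None) -> None:
        # Index of the smallest upper bound that is >= the observed value.
        bucket = bisect.bisect_left(self.bounds, seconds)
        self.counts[bucket] += 1
        if trace_id is not None:
            # Keep the most recent sampled trace for this bucket.
            self.exemplars[bucket] = (seconds, trace_id)
```

A dashboard that renders the slowest bucket can then link straight to the trace id stored for it, which is the jump from step 2 to step 3 above.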

## Instrumentation Patterns

### Automatic Instrumentation

```text
Zero-code instrumentation:
- HTTP clients/servers
- Database clients
- Message queues
- gRPC

Pros: Easy, comprehensive
Cons: Less control, more noise
```

### Manual Instrumentation

```text
Add spans for business logic:

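from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

# start_as_current_span makes this span the parent of any spans created inside it
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.items", len(items))

    result = process(order)

    if result.error:
        span.set_status(Status(StatusCode.ERROR, str(result.error)))
        span.record_exception(result.error)  # expects an exception instance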

Pros: Precise, business-relevant
Cons: More code, maintenance
```

### Hybrid Approach (Recommended)

```text
1. Auto-instrument infrastructure:
   - HTTP, database, queue calls

2. Manual instrument business logic:
   - Key operations
   - Business metrics
   - Error context
```

## Best Practices

### Span Design

```text
Good span names:
- HTTP GET /api/orders/{id}
- ProcessPayment
- db.query users

Bad span names:
- Handler (too generic)
- /api/orders/12345 (cardinality explosion)
- doStuff (meaningless)
```
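When the framework does not expose the route template, span names can be kept low-cardinality by masking id-like path segments. The following is a rough sketch under that assumption; `span_name` is a hypothetical helper, and the pattern only recognizes numeric ids and UUIDs. Prefer the matched route when your framework provides one.

```python
import re

# Path segments that look like ids: all digits, or a canonical UUID.
_ID_SEGMENT = re.compile(
    r"\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def span_name(method: str, path: str) -> str:
    """Build a span name such as 'HTTP GET /api/orders/{id}' from a raw request path."""
    path = path.split("?", 1)[0]  # query strings never belong in a span name
    segments = ["{id}" if _ID_SEGMENT.fullmatch(seg) else seg for seg in path.split("/")]
    return f"HTTP {method.upper()} {'/'.join(segments)}"
```

The raw path and the concrete id still belong on the span, but as attributes rather than in the name.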

### Attribute Guidelines

```text
Do:
- Use semantic conventions
- Add business context (order_id, tenant, internal user ID)
- Keep span names and metric labels low-cardinality; put IDs in attributes
- Include error details

Don't:
- Add PII (personally identifiable info)
- Put high-cardinality values in span names or metric labels
- Add large payloads
- Include sensitive data (credentials, tokens, card numbers)
```
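One way to enforce the "Don't" list is to pass attributes through a small filter before they reach the span. The sketch below is illustrative only: the key list, the length limit, and the name `clean_attributes` are assumptions to adapt to your own data.

```python
# Hypothetical deny-list, matched against the last dot-separated part of the key.
SENSITIVE_KEYS = {"password", "authorization", "card_number", "email", "ssn"}
MAX_VALUE_LENGTH = 256
TRUNCATION_MARKER = "...[truncated]"


def clean_attributes(attrs: dict) -> dict:
    """Drop sensitive attributes and truncate oversized string values."""
    cleaned = {}
    for key, value in attrs.items():
        leaf = key.rsplit(".", 1)[-1].lower()
        if leaf in SENSITIVE_KEYS:
            continue  # never record PII or secrets on a span
        if isinstance(value, str) and len(value) > MAX_VALUE_LENGTH:
            value = value[:MAX_VALUE_LENGTH] + TRUNCATION_MARKER
        cleaned[key] = value
    return cleaned
```

A deny-list like this is only a safety net; it does not catch sensitive values stored under unexpected keys.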

### Performance Considerations

```text
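1. Export spans asynchronously, off the request path
2. Sample appropriately for your traffic volume
3. Limit attribute count and value size
4. Batch spans before export (e.g. BatchSpanProcessor)
5. Set span limits (max attributes, events, and links per span)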
```
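The idea behind batching can be sketched as a bounded queue that is drained in fixed-size chunks. This is a deliberately simplified, single-threaded illustration with hypothetical names (`BatchBuffer`, `on_end`, `flush`). Real SDK batch processors also flush on a timer from a background thread, which is left out here.

```python
from collections import deque
from typing import Callable


class BatchBuffer:
    """Bounded span queue that exports in batches and drops spans when full."""

    def __init__(self, export: Callable[[list], None],
                 max_queue_size: int = 2048, max_batch_size: int = 512):
        self._export = export
        self._queue = deque()
        self._max_queue_size = max_queue_size
        self._max_batch_size = max_batch_size
        self.dropped = 0

    def on_end(self, span) -> None:
        """Called when a span finishes; never blocks the request path."""
        if len(self._queue) >= self._max_queue_size:
            self.dropped += 1  # shed load instead of growing without bound
            return
        self._queue.append(span)

    def flush(self) -> int:
        """Export everything queued, one batch at a time; return the span count."""
        exported = 0
        while self._queue:
            size = min(self._max_batch_size, len(self._queue))
            batch = [self._queue.popleft() for _ in range(size)]
            self._export(batch)
            exported += len(batch)
        return exported
```

Dropping spans when the queue is full trades completeness for bounded memory, so tracing cannot take the service down during a traffic spike.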

## Troubleshooting with Traces

### Common Patterns

```text
Finding slow requests:
1. Query traces by duration > threshold
2. Identify slow spans
3. Check span attributes for context

Finding errors:
1. Query traces by status = ERROR
2. See error span and context
3. Check exception details

Finding dependencies:
1. View service map from traces
2. Identify critical paths
3. Find hidden dependencies
```
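The three query patterns above reduce to simple filters over span data. The sketch below runs them against an in-memory list to show the logic; `SpanRecord` and the function names are hypothetical, and a real backend would express the same filters in its own query language.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanRecord:
    trace_id: str
    service: str
    caller: Optional[str]  # service that invoked this span, None for the root
    duration_ms: float
    status: str            # "OK" or "ERROR"


def slow_traces(spans, threshold_ms):
    """Trace ids containing at least one span slower than the threshold."""
    return sorted({s.trace_id for s in spans if s.duration_ms > threshold_ms})


def error_traces(spans):
    """Trace ids containing at least one span with ERROR status."""
    return sorted({s.trace_id for s in spans if s.status == "ERROR"})


def service_edges(spans):
    """Caller -> callee pairs, the raw material for a service map."""
    return sorted({(s.caller, s.service) for s in spans if s.caller is not None})
```

An edge returned by `service_edges` that nobody on the team expected is a hidden dependency worth investigating.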

## Related Skills

- `observability-patterns` - Three pillars overview
- `slo-sli-error-budget` - Using traces for SLIs
- `incident-response` - Using traces in incidents