# OpenTelemetry Observability

The kubernetes-mcp-server supports distributed tracing and metrics via OpenTelemetry (OTEL). Observability is **optional** and disabled by default.

## What Gets Traced

The server automatically traces all operations through middleware without requiring any code changes to individual tools:

1. **MCP Tool Calls** - Every tool invocation with details:
   - Tool name
   - Success/failure status
   - Duration
   - Error details (when applicable)

2. **HTTP Requests** - All HTTP endpoints when running in HTTP mode:
   - Request method and path
   - Response status
   - Client information
   - Duration

**Note**: When running in STDIO mode only MCP tool calls are traced since there is no HTTP server.

## Metrics

The server collects and exposes metrics through two mechanisms:

1. **Stats Endpoint** (`/stats`) - JSON endpoint for real-time statistics:
   - Tool call counts by name
   - Tool call errors
   - HTTP request counts by method/path/status
   - Server uptime

2. **OTLP Export** - When an endpoint is configured, metrics are also exported to your OTLP backend every 30 seconds.

## Quick Start

### 1. Run an OTLP Backend Locally

**Option A: Jaeger (traces only)**

```bash
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  docker.io/jaegertracing/all-in-one:latest
```

Access the Jaeger UI at http://localhost:16686

> **Note**: Jaeger only supports traces, not metrics. To disable metrics export and avoid warnings about `MetricsService` being unimplemented, set `OTEL_METRICS_EXPORTER=none`.

**Option B: Grafana LGTM Stack (traces + metrics + logs)**

For full observability with metrics support:

```bash
docker run -d --name lgtm \
  -p 3000:3000 \
  -p 4317:4317 \
  -p 4318:4318 \
  docker.io/grafana/otel-lgtm:latest
```

Access Grafana at http://localhost:3000 (default credentials: admin/admin)

### 2. Enable Tracing

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# Run the server
npx -y kubernetes-mcp-server@latest
```

### 3. View Traces

Make some tool calls through your MCP client, then view traces in the Jaeger UI.

### Example Trace

When you call `resources_get` for a Pod, you'll see a trace like this in Jaeger:

```
Trace ID: abc123def456789
Duration: 145ms

└─ tools/call resources_get [145ms]
   ├─ mcp.method.name: tools/call
   ├─ gen_ai.tool.name: resources_get
   ├─ gen_ai.operation.name: execute_tool
   ├─ rpc.jsonrpc.version: 2.0
   ├─ network.transport: pipe
   └─ Status: OK
```

If the tool call triggers an HTTP request (in HTTP mode), you'll also see:

```
Trace ID: abc123def456789
Duration: 150ms

├─ POST /message [150ms]
│  ├─ http.request.method: POST
│  ├─ url.path: /message
│  ├─ http.response.status_code: 200
│  ├─ client.address: 192.168.1.100
│  │
│  └─ tools/call resources_get [145ms]
       ├─ mcp.method.name: tools/call
       ├─ gen_ai.tool.name: resources_get
       ├─ gen_ai.operation.name: execute_tool
       ├─ rpc.jsonrpc.version: 2.0
       ├─ network.transport: tcp
       └─ Status: OK
```

## Configuration

OpenTelemetry can be configured via **TOML config file** or **environment variables**. Environment variables take precedence over TOML config values.

**Note**: Telemetry is automatically enabled when an endpoint is configured. Use `enabled = false` in TOML to explicitly disable it.

### Configuration Reference

| TOML Field | Environment Variable | Description |
|------------|---------------------|-------------|
| `enabled` | - | Explicit enable/disable (overrides all) |
| `endpoint` | `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP endpoint URL |
| `protocol` | `OTEL_EXPORTER_OTLP_PROTOCOL` | Protocol: `grpc` or `http/protobuf` |
| `traces_sampler` | `OTEL_TRACES_SAMPLER` | Sampling strategy |
| `traces_sampler_arg` | `OTEL_TRACES_SAMPLER_ARG` | Sampling ratio (0.0-1.0) |

### TOML Configuration

Add a `[telemetry]` section to your config file:

```toml
[telemetry]
# Optional: explicitly enable/disable (omit to auto-enable when endpoint is set)
enabled = true

endpoint = "http://localhost:4317"

# Protocol: "grpc" (default) or "http/protobuf"
protocol = "grpc"

# Trace sampling strategy
# Options: "always_on", "always_off", "traceidratio", "parentbased_always_on", "parentbased_always_off", "parentbased_traceidratio"
traces_sampler = "traceidratio"

# Sampling ratio for ratio-based samplers (0.0 to 1.0)
traces_sampler_arg = 0.1
```

#### TOML Examples

**Enable with endpoint:**
```toml
[telemetry]
endpoint = "http://localhost:4317"
```

**Production with sampling:**
```toml
[telemetry]
endpoint = "http://tempo-distributor:4317"
traces_sampler = "traceidratio"
traces_sampler_arg = 0.05  # 5% sampling
```

**Explicitly disable:**
```toml
[telemetry]
enabled = false
```

### Environment Variables

Environment variables take precedence over TOML config. This allows you to override config file settings at runtime.

#### Endpoint

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
```

**Note**: The server gracefully handles failures. If the endpoint is unreachable, the server logs a warning and continues without tracing.

#### Optional Variables

```bash
# Service name (defaults to "kubernetes-mcp-server")
export OTEL_SERVICE_NAME=kubernetes-mcp-server

# Service version (auto-detected from binary, rarely needs manual override)
export OTEL_SERVICE_VERSION=1.0.0

# Additional resource attributes (useful for multi-environment deployments)
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,team=platform"
```

#### Endpoint Protocols

The server supports both gRPC and HTTP/protobuf protocols:

```bash
# gRPC (default, port 4317)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# HTTP/protobuf (port 4318)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf

# Secure endpoints (HTTPS/gRPC with TLS)
export OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-secure.example.com:4317

# Custom CA certificate (for self-signed certificates)
export OTEL_EXPORTER_OTLP_CERTIFICATE=/path/to/ca.crt
```

#### Sampling Configuration

By default, the server uses **`ParentBased(AlwaysSample)`** sampling:
- **Root spans** (no parent): Always sampled (100%)
- **Child spans**: Inherit parent's sampling decision

This is ideal for development but may generate high trace volumes in production.

#### Production Sampling

For production with high traffic, use ratio-based sampling:

```bash
# Sample 10% of traces
export OTEL_TRACES_SAMPLER=traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1
```

#### Available Samplers

- `always_on` - Sample everything (default for root spans)
- `always_off` - Disable tracing entirely
- `traceidratio` - Sample a percentage (requires `OTEL_TRACES_SAMPLER_ARG` between 0.0 and 1.0)
- `parentbased_always_on` - Respect parent span, default to always_on
- `parentbased_always_off` - Respect parent span, default to always_off
- `parentbased_traceidratio` - Respect parent span, default to ratio

#### Sampling Examples

```bash
# Development: Sample everything
export OTEL_TRACES_SAMPLER=always_on

# Production: 5% sampling (good for high-traffic services)
export OTEL_TRACES_SAMPLER=traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.05

# Temporarily disable tracing
export OTEL_TRACES_SAMPLER=always_off

# Or just unset the endpoint
unset OTEL_EXPORTER_OTLP_ENDPOINT
```

## Deployment Examples

### Claude Code (STDIO Mode)

Add the MCP server to your project's `.mcp.json` or global `~/.claude/settings.json`:

```json
{
  "mcpServers": {
    "kubernetes": {
      "command": "npx",
      "args": ["-y", "kubernetes-mcp-server@latest"],
      "env": {
        "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
        "OTEL_TRACES_SAMPLER": "always_on"
      }
    }
  }
}
```

**For Jaeger (traces only)**: Add `"OTEL_METRICS_EXPORTER": "none"` to disable metrics export.

**Note**: In STDIO mode, only MCP tool calls are traced (no HTTP request spans).

### Kubernetes Deployment (HTTP Mode)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubernetes-mcp-server
spec:
  template:
    spec:
      containers:
      - name: kubernetes-mcp-server
        image: quay.io/containers/kubernetes_mcp_server:latest
        env:
        # OTLP endpoint (required to enable tracing)
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://tempo-distributor.observability:4317"

        # Sampling (recommended for production)
        - name: OTEL_TRACES_SAMPLER
          value: "traceidratio"
        - name: OTEL_TRACES_SAMPLER_ARG
          value: "0.1"  # 10% sampling

        # Resource attributes (helps identify this deployment)
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: "deployment.environment=production,k8s.cluster.name=prod-us-west-2"

        # Kubernetes metadata (optional, helps correlate traces with K8s resources)
        - name: KUBERNETES_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: KUBERNETES_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
```

**Note**: The Kubernetes metadata environment variables are optional but recommended for production deployments. They help correlate traces with specific pods, namespaces, and nodes.

### Docker

```bash
docker run \
  -e OTEL_EXPORTER_OTLP_ENDPOINT=http://host.docker.internal:4317 \
  -e OTEL_TRACES_SAMPLER=always_on \
  quay.io/containers/kubernetes_mcp_server:latest
```

## Trace Attributes

### MCP Tool Call Spans

Each tool call creates a span following MCP and OpenTelemetry semantic conventions:

**Span Name Format**: `{mcp.method.name} {target}` (e.g., "tools/call resources_get")

**Attributes**:
- `mcp.method.name` - MCP protocol method (e.g., "tools/call") **[Required]**
- `gen_ai.tool.name` - Name of the tool being called (e.g., "resources_get", "helm_install") **[Required for tool calls]**
- `gen_ai.operation.name` - Set to "execute_tool" for tool calls **[Recommended]**
- `rpc.jsonrpc.version` - JSON-RPC version (typically "2.0") **[Recommended]**
- `network.transport` - Transport protocol: "pipe" for STDIO, "tcp" for HTTP **[Recommended]**
- `error.type` - Error classification: "tool_error" for tool failures, "_OTHER" for other errors **[Conditional]**

### HTTP Request Spans

HTTP requests create spans following [OpenTelemetry HTTP semantic conventions](https://opentelemetry.io/docs/specs/semconv/http/http-spans/):

**Span Name Format**: `{METHOD} {path}` (e.g., "POST /message")

**Attributes**:
- `http.request.method` - Request method (GET, POST, etc.) **[Required]**
- `url.path` - URL path **[Required]**
- `url.scheme` - URL scheme (http or https) **[Required]**
- `server.address` - Server host **[Recommended]**
- `network.protocol.name` - Protocol name (http) **[Recommended]**
- `network.protocol.version` - Protocol version (HTTP/1.1, HTTP/2) **[Recommended]**
- `client.address` - Client IP address **[Recommended]**
- `http.route` - Normalized route pattern (when different from path) **[Conditional]**
- `user_agent.original` - User agent string (when present) **[Conditional]**
- `http.request.body.size` - Request body size (when present) **[Conditional]**
- `http.response.status_code` - Response status code **[Required]**
- `error.type` - HTTP status code for 4xx/5xx responses **[Conditional]**

**Note**: HTTP spans only appear when running in HTTP mode. STDIO mode (Claude Code) only creates MCP tool call spans. The `/healthz` endpoint is not traced to reduce noise.

## Stats Endpoint

When running in HTTP mode, the server exposes a `/stats` endpoint that returns real-time statistics as JSON:

```bash
curl http://localhost:8080/stats
```

Example response:
```json
{
  "total_tool_calls": 42,
  "tool_call_errors": 2,
  "tool_calls_by_name": {
    "resources_list": 15,
    "pods_get": 12,
    "helm_list": 10,
    "resources_get": 5
  },
  "total_http_requests": 100,
  "http_requests_by_path": {
    "/mcp": 50,
    "/sse": 30,
    "/message": 20
  },
  "uptime_seconds": 3600.5
}
```

The stats endpoint is useful for:
- Health monitoring and alerting
- Quick debugging without a full observability stack
- Integration with simple monitoring systems

**Note**: The `/stats` endpoint is only available in HTTP mode. In STDIO mode, use OTLP export for metrics.

## Metrics Endpoint

When running in HTTP mode, the server exposes a `/metrics` endpoint for Prometheus scraping:

```bash
curl http://localhost:8080/metrics
```

This endpoint returns metrics in OpenMetrics/Prometheus text format, suitable for scraping by Prometheus or compatible systems.

### Available Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `k8s_mcp_tool_calls_total` | Counter | Total MCP tool calls (labeled by `tool_name`) |
| `k8s_mcp_tool_errors_total` | Counter | Total MCP tool errors (labeled by `tool_name`) |
| `k8s_mcp_tool_duration_seconds` | Histogram | Tool call duration in seconds |
| `k8s_mcp_http_requests_total` | Counter | HTTP requests (labeled by `http_request_method`, `url_path`, `http_response_status_class`) |
| `k8s_mcp_server_info` | Gauge | Server info (labeled by `version`, `go_version`) |

### Prometheus Scrape Configuration

```yaml
scrape_configs:
  - job_name: 'kubernetes-mcp-server'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /metrics
```

### Kubernetes ServiceMonitor

When deployed in Kubernetes with the Helm chart, enable the ServiceMonitor:

```yaml
metrics:
  serviceMonitor:
    enabled: true
    interval: 30s
```

**Note**: The `/metrics` endpoint is only available in HTTP mode.

## Troubleshooting

### Tracing not working?

1. **Check endpoint is set**:
   ```bash
   echo $OTEL_EXPORTER_OTLP_ENDPOINT
   ```

2. **Check server logs** (increase verbosity):
   ```bash
   # Look for "OpenTelemetry tracing initialized successfully"
   kubernetes-mcp-server -v 2
   ```

   If tracing fails to initialize, you'll see:
   ```
   Failed to create OTLP exporter, tracing disabled: <error details>
   ```

3. **Verify OTLP collector is reachable**:
   ```bash
   # For gRPC endpoint (port 4317)
   telnet localhost 4317

   # For HTTP endpoint (port 4318)
   curl http://localhost:4318/v1/traces
   ```

### No traces appearing in backend?

1. **Check sampling** - you might be sampling at 0% or using `always_off`:
   ```bash
   echo $OTEL_TRACES_SAMPLER
   echo $OTEL_TRACES_SAMPLER_ARG
   ```

2. **Verify service name**:
   ```bash
   echo $OTEL_SERVICE_NAME
   ```
   Search for this service name in your tracing UI (defaults to "kubernetes-mcp-server").

3. **Check backend configuration** - ensure your OTLP collector is forwarding to the right backend.

4. **Verify protocol compatibility**:
   - If using HTTP-based backends, ensure you set `OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf`
   - Check if you need port 4317 (gRPC) or 4318 (HTTP)

### TLS/Certificate Issues

If using HTTPS/secure endpoints:

1. **Certificate errors**:
   ```bash
   # Provide custom CA certificate
   export OTEL_EXPORTER_OTLP_CERTIFICATE=/path/to/ca.crt
   ```

2. **Self-signed certificates**:
   ```bash
   # For testing only - not recommended for production
   export OTEL_EXPORTER_OTLP_INSECURE=true
   ```

## Performance Impact

Tracing has minimal performance overhead:

- **Middleware tracing**: Typically 1-2ms per tool call
- **Network overhead**: Spans are batched and exported every 5 seconds
- **Memory**: Approximately 1-5MB for span buffers
- **CPU**: Negligible (<1% for most workloads)

For production deployments with high traffic, use ratio-based sampling to reduce costs while maintaining observability.

## Advanced Topics

### Resource Detection

The OpenTelemetry SDK automatically detects and adds resource attributes from the environment:

- **Host information**: hostname, OS, architecture
- **Process information**: PID, executable name
- **Container information**: container ID (when running in containers)
- **Kubernetes information**: pod name, namespace (when K8s env vars are present)

These are merged with any attributes you set via `OTEL_RESOURCE_ATTRIBUTES`.

### Distributed Tracing

When the kubernetes-mcp-server is part of a distributed system:

1. **Parent spans** are automatically detected and respected
2. **Trace context** is propagated via standard W3C Trace Context headers
3. **Sampling decisions** from parent spans are inherited (via ParentBased sampler)

This means traces can span multiple services seamlessly.

### Custom Resource Attributes

Add custom attributes to help identify and filter traces:

```bash
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=staging,team=platform,region=us-west-2,version=v1.2.3"
```

These attributes appear on **all spans** from this service instance and are useful for:
- Filtering traces by environment (prod vs staging)
- Analyzing performance by region or deployment
- Tracking issues to specific versions or teams