# Prometheus Metrics

OlliteRT exposes a `GET /metrics` endpoint in [Prometheus exposition format](https://prometheus.io/docs/instrumenting/exposition_formats/) (`text/plain; version=0.0.4`). Always enabled, no authentication required.

## Table of Contents

- [Quick Setup](#quick-setup)
- [Metrics Reference](#metrics-reference)
- [Example Queries](#example-queries)
- [Notes](#notes)

---

## Quick Setup

### Prometheus

```yaml
# prometheus.yml
scrape_configs:
  - job_name: "ollitert"
    scrape_interval: 15s
    static_configs:
      - targets: ["PHONE_IP:8000"]
    metrics_path: "/metrics"
```

### Grafana

1. Add your Prometheus instance as a data source
2. Import a dashboard or create panels using the metrics below
3. Suggested panels: decode speed over time, TTFB histogram, request count, error rate, context utilization

## Metrics Reference

### Counters (10)

Cumulative values that increase monotonically. Reset when the server stops.

| Metric | Description |
|:-------|:------------|
| `ollitert_requests_total` | Total requests processed |
| `ollitert_prompt_tokens_total` | Total prompt tokens (estimated) |
| `ollitert_generation_tokens_total` | Total generated tokens (estimated) |
| `ollitert_prompt_seconds_total` | Cumulative prefill time (seconds) |
| `ollitert_generation_seconds_total` | Cumulative decode time (seconds) |
| `ollitert_errors_total` | Total request errors |
| `ollitert_errors_by_category_total{category="..."}` | Errors by category (`model_load`, `inference`, `network`, `system`) |
| `ollitert_request_text_total` | Text-only requests |
| `ollitert_request_image_total` | Image multimodal requests |
| `ollitert_request_audio_total` | Audio multimodal requests |

### Gauges (19)

Point-in-time values that can go up or down.

| Metric | Description |
|:-------|:------------|
| `ollitert_uptime_seconds` | Time since server entered RUNNING state |
| `ollitert_model_load_time_seconds` | Model load/warmup time |
| `ollitert_prompt_tokens_per_second` | Last request prefill throughput |
| `ollitert_generation_tokens_per_second` | Last request decode throughput |
| `ollitert_generation_tokens_per_second_peak` | Peak decode throughput since start |
| `ollitert_time_to_first_token_ms` | Last TTFB |
| `ollitert_time_to_first_token_avg_ms` | Average TTFB |
| `ollitert_inter_token_latency_ms` | Last inter-token latency |
| `ollitert_request_latency_ms` | Last request total latency |
| `ollitert_request_latency_avg_ms` | Average request latency |
| `ollitert_request_latency_peak_ms` | Peak request latency |
| `ollitert_context_utilization_percent` | Last request context window usage (%) |
| `ollitert_requests_processing` | Currently inferring (0 or 1) |
| `ollitert_model_speculative_decoding_enabled` | Speculative decoding (MTP) enabled (0 or 1) |
| `ollitert_model_idle_unloaded` | Model unloaded due to keep-alive idle timeout (0 or 1) |
| `ollitert_memory_native_heap_bytes` | Native heap allocated bytes (LiteRT model weights) |
| `ollitert_memory_app_heap_used_bytes` | JVM heap used bytes |
| `ollitert_memory_app_total_pss_bytes` | Total process PSS (JVM + native + mmap'd pages) |
| `ollitert_memory_device_available_bytes` | Device available RAM |
| `ollitert_memory_device_total_bytes` | Device total RAM |

## Example Queries

**Average decode speed over the last 5 minutes:**
```promql
avg_over_time(ollitert_generation_tokens_per_second[5m])
```

**Request error rate:**
```promql
rate(ollitert_errors_total[5m]) / rate(ollitert_requests_total[5m])
```

**Modality breakdown:**
```promql
ollitert_request_text_total
ollitert_request_image_total
ollitert_request_audio_total
```

**Native heap memory (model weights):**
```promql
ollitert_memory_native_heap_bytes
```

**Error breakdown by category:**
```promql
ollitert_errors_by_category_total
```

## Notes

> [!NOTE]
> - **No authentication** — matches the convention used by llama.cpp, vLLM, and TGI. Prometheus expects unauthenticated scrape targets.
> - **No histograms** — OlliteRT processes one request at a time, making histograms less useful than in multi-request servers.
> - **Token counts are estimated** — the LiteRT runtime doesn't expose a tokenizer API, so token counts use a character-based approximation (~4 characters per token).

For an explanation of what these metrics mean in practice, see [FAQ → What do the benchmark numbers mean?](../FAQ.md#what-do-the-benchmark-numbers-mean).