---
name: build-grafana-dashboards
description: >
  Create production-ready Grafana dashboards with reusable panels, template variables,
  annotations, and provisioning for version-controlled dashboard deployment. Use when
  creating visual representations of Prometheus, Loki, or other data source metrics,
  building operational dashboards for SRE teams, migrating from manual dashboard creation
  to version-controlled provisioning, or establishing executive-level SLO compliance
  reporting.
license: MIT
allowed-tools: Read Write Edit Bash Grep Glob
metadata:
  author: Philipp Thoss
  version: "1.0"
  domain: observability
  complexity: intermediate
  language: multi
  tags: grafana, dashboards, visualization, panels, provisioning
---

# Build Grafana Dashboards

Design and deploy Grafana dashboards with best practices for maintainability, reusability, and version control.

## When to Use

- Creating visual representations of Prometheus, Loki, or other data source metrics
- Building operational dashboards for SRE teams and incident responders
- Establishing executive-level reporting dashboards for SLO compliance
- Migrating dashboards from manual creation to version-controlled provisioning
- Standardizing dashboard layouts across teams with template variables
- Creating drill-down experiences from high-level overviews to detailed metrics

## Inputs

- **Required**: Data source configuration (Prometheus, Loki, Tempo, etc.)
- **Required**: Metrics or logs to visualize with their query patterns
- **Optional**: Template variables for multi-service or multi-environment views
- **Optional**: Existing dashboard JSON for migration or modification
- **Optional**: Annotation queries for event correlation (deployments, incidents)

## Procedure

> See [Extended Examples](references/EXAMPLES.md) for complete configuration files and templates.


### Step 1: Design Dashboard Structure

Plan dashboard layout and organization before building panels.

Create a dashboard specification document:

```markdown
# Service Overview Dashboard

## Purpose
Real-time operational view for on-call engineers monitoring the API service.

## Rows
1. High-Level Metrics (collapsed by default)
   - Request rate, error rate, latency (RED metrics)
   - Service uptime, instance count
2. Detailed Metrics (expanded by default)
   - Per-endpoint latency breakdown
   - Error rate by status code
   - Database connection pool status
3. Resource Utilization
   - CPU, memory, disk usage per instance
   - Network I/O rates
4. Logs (collapsed by default)
   - Recent errors from Loki
   - Alert firing history

## Variables
- `environment`: production, staging, development
- `instance`: all instances or specific instance selection
- `interval`: aggregation window (5m, 15m, 1h)

## Annotations
- Deployment events from CI/CD system
- Alert firing/resolving events
```

Key design principles:
- **Most important metrics first**: Critical metrics at the top, details below
- **Consistent time ranges**: Synchronize time across all panels
- **Drill-down paths**: Link from high-level to detailed dashboards
- **Responsive layout**: Use rows and panel widths that work on various screens

**Expected:** Clear dashboard structure documented, stakeholders aligned on metrics and layout priorities.

**On failure:**
- Conduct dashboard design review with end users (SREs, developers)
- Benchmark against industry standards (USE method, RED method, Four Golden Signals)
- Review existing dashboards in team for consistency patterns

### Step 2: Create Dashboard with Template Variables

Build the dashboard foundation with reusable variables for filtering.

Create dashboard JSON structure (or use UI, then export):

```json
{
  "dashboard": {
    "title": "API Service Overview",
    "uid": "api-service-overview",
    "version": 1,
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": {
      "from": "now-6h",
      "to": "now"
    },
    "refresh": "30s",
    "templating": {
      "list": [
        {
          "name": "environment",
          "type": "query",
          "datasource": "Prometheus",
          "query": "label_values(up{job=\"api-service\"}, environment)",
          "multi": false,
          "includeAll": false,
          "refresh": 1,
          "sort": 1,
          "current": {
            "selected": false,
            "text": "production",
            "value": "production"
          }
        },
        {
          "name": "instance",
          "type": "query",
          "datasource": "Prometheus",
          "query": "label_values(up{job=\"api-service\",environment=\"$environment\"}, instance)",
          "multi": true,
          "includeAll": true,
          "refresh": 1,
          "allValue": ".*",
          "current": {
            "selected": true,
            "text": "All",
            "value": "$__all"
          }
        },
        {
          "name": "interval",
          "type": "interval",
          "options": [
            {"text": "1m", "value": "1m"},
            {"text": "5m", "value": "5m"},
            {"text": "15m", "value": "15m"},
            {"text": "1h", "value": "1h"}
          ],
          "current": {
            "text": "5m",
            "value": "5m"
          },
          "auto": false
        }
      ]
    },
    "annotations": {
      "list": [
        {
          "name": "Deployments",
          "datasource": "Prometheus",
          "enable": true,
          "expr": "changes(app_version{job=\"api-service\",environment=\"$environment\"}[5m]) > 0",
          "step": "60s",
          "iconColor": "rgba(0, 211, 255, 1)",
          "tagKeys": "version"
        }
      ]
    }
  }
}
```

Variable types and use cases:
- **Query variables**: Dynamic lists from data source (`label_values()`, `query_result()`)
- **Interval variables**: Aggregation windows for queries
- **Custom variables**: Static lists for non-metric selections
- **Constant variables**: Shared values across panels (data source names, thresholds)
- **Text box variables**: Free-form input for filtering

**Expected:** Variables populate correctly from data source, cascading filters work (environment filters instances), default selections appropriate.

**On failure:**
- Test variable queries independently in Prometheus UI
- Check for circular dependencies (variable A depends on B depends on A)
- Verify regex patterns in `allValue` field for multi-select variables
- Review variable refresh settings (on dashboard load vs on time range change)

### Step 3: Build Visualization Panels

Create panels for each metric with appropriate visualization types.

**Time series panel** (request rate):

```json
{
  "type": "timeseries",
  "title": "Request Rate",
  "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{job=\"api-service\",environment=\"$environment\",instance=~\"$instance\"}[$interval])) by (method)",
      "legendFormat": "{{method}}",
      "refId": "A"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "reqps",
      "color": {
        "mode": "palette-classic"
      },
      "custom": {
        "drawStyle": "line",
        "lineInterpolation": "smooth",
        "fillOpacity": 10,
        "spanNulls": true
      },
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"value": null, "color": "green"},
          {"value": 1000, "color": "yellow"},
          {"value": 5000, "color": "red"}
        ]
      }
    }
  },
  "options": {
    "tooltip": {
      "mode": "multi",
      "sort": "desc"
    },
    "legend": {
      "displayMode": "table",
      "placement": "right",
      "calcs": ["mean", "max", "last"]
    }
  }
}
```

**Stat panel** (error rate):

```json
{
  "type": "stat",
  "title": "Error Rate",
  "gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
  "targets": [
    {
# ... (see EXAMPLES.md for complete configuration)
```

**Heatmap panel** (latency distribution):

```json
{
  "type": "heatmap",
  "title": "Request Duration Heatmap",
  "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
  "targets": [
    {
# ... (see EXAMPLES.md for complete configuration)
```

Panel selection guide:
- **Time series**: Trends over time (rates, counts, durations)
- **Stat**: Single current value with threshold coloring
- **Gauge**: Percentage values (CPU, memory, disk usage)
- **Bar gauge**: Comparing multiple values at a point in time
- **Heatmap**: Distribution of values over time (latency percentiles)
- **Table**: Detailed breakdown of multiple metrics
- **Logs**: Raw log lines from Loki with filtering

**Expected:** Panels render correctly with data, visualizations match intended metric types, legends descriptive, thresholds highlight problems.

**On failure:**
- Test queries in Explore view with same time range and variables
- Check for metric name typos or incorrect label filters
- Verify aggregation functions match metric type (rate for counters, avg for gauges)
- Review unit configurations (bytes, seconds, requests per second)
- Enable "Show query inspector" to debug empty results

### Step 4: Configure Rows and Layout

Organize panels into collapsible rows for logical grouping.

```json
{
  "panels": [
    {
      "type": "row",
      "title": "High-Level Metrics",
      "collapsed": false,
# ... (see EXAMPLES.md for complete configuration)
```

Layout best practices:
- Grid is 24 units wide, each panel specifies `w` (width) and `h` (height)
- Use rows to group related panels, collapse less critical sections by default
- Place most critical metrics in first visible area (y=0-8)
- Maintain consistent panel heights within rows (typically 4, 8, or 12 units)
- Use full width (24) for time series, half width (12) for comparisons

**Expected:** Dashboard layout organized logically, rows collapse/expand correctly, panels align visually without gaps.

**On failure:**
- Validate gridPos coordinates don't overlap
- Check that row panels array contains panels (not null)
- Verify y-coordinates increment logically down the page
- Use Grafana UI "Edit JSON" to inspect grid positions

### Step 5: Add Links and Drill-Downs

Create navigation paths between related dashboards.

Dashboard-level links in JSON:

```json
{
  "links": [
    {
      "title": "Service Details",
      "type": "link",
      "icon": "external link",
# ... (see EXAMPLES.md for complete configuration)
```

Panel-level data links:

```json
{
  "fieldConfig": {
    "defaults": {
      "links": [
        {
          "title": "View Logs for ${__field.labels.instance}",
# ... (see EXAMPLES.md for complete configuration)
```

Link variables:
- `$service`, `$environment`: Dashboard template variables
- `${__field.labels.instance}`: Label value from clicked data point
- `${__from}`, `${__to}`: Current dashboard time range
- `$__url_time_range`: Encoded time range for URL

**Expected:** Clicking panel elements or dashboard links navigates to related views with context preserved (time range, variables).

**On failure:**
- URL encode special characters in query parameters
- Test links with various variable selections (All vs specific value)
- Verify target dashboard UIDs exist and are accessible
- Check that `includeVars` and `keepTime` flags work as expected

### Step 6: Set Up Dashboard Provisioning

Version control dashboards as code for reproducible deployments.

Create provisioning directory structure:

```bash
mkdir -p /etc/grafana/provisioning/{dashboards,datasources}
```

Datasource provisioning (`/etc/grafana/provisioning/datasources/prometheus.yml`):

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
# ... (see EXAMPLES.md for complete configuration)
```

Dashboard provisioning (`/etc/grafana/provisioning/dashboards/default.yml`):

```yaml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: 'Services'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
```

Store dashboard JSON files in `/var/lib/grafana/dashboards/`:

```bash
/var/lib/grafana/dashboards/
├── api-service/
│   ├── overview.json
│   └── details.json
├── database/
│   └── postgres.json
└── infrastructure/
    ├── nodes.json
    └── kubernetes.json
```

Using Docker Compose:

```yaml
version: '3.8'
services:
  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
```

**Expected:** Dashboards automatically loaded on Grafana startup, changes to JSON files reflected after update interval, version control tracks dashboard changes.

**On failure:**
- Check Grafana logs: `docker logs grafana | grep -i provisioning`
- Verify JSON syntax: `python -m json.tool dashboard.json`
- Ensure file permissions allow Grafana to read: `chmod 644 *.json`
- Test with `allowUiUpdates: false` to prevent UI modifications
- Validate provisioning config: `curl http://localhost:3000/api/admin/provisioning/dashboards/reload -X POST -H "Authorization: Bearer $GRAFANA_API_KEY"`

## Validation

- [ ] Dashboard loads without errors in Grafana UI
- [ ] All template variables populate with expected values
- [ ] Variable cascading works (selecting environment filters instances)
- [ ] Panels display data for configured time ranges
- [ ] Panel queries use variables correctly (no hardcoded values)
- [ ] Thresholds highlight problem states appropriately
- [ ] Legend formatting descriptive and not cluttered
- [ ] Annotations appear for relevant events
- [ ] Links navigate to correct dashboards with context preserved
- [ ] Dashboard provisioned from JSON file (version controlled)
- [ ] Responsive layout works on different screen sizes
- [ ] Tooltip and hover interactions provide useful context

## Common Pitfalls

- **Variable not updating panels**: Ensure queries use `$variable` syntax, not hardcoded values. Check variable refresh settings.
- **Empty panels with correct query**: Verify time range includes data points. Check scrape interval vs aggregation window (5m rate needs >5m of data).
- **Legend too verbose**: Use `legendFormat` to show only relevant labels, not full metric name. Example: `{{method}} - {{status}}` instead of default.
- **Inconsistent time ranges**: Set dashboard time sync so all panels share the same time window. Use "Sync cursor" for correlated investigation.
- **Performance issues**: Avoid queries returning high cardinality series (>1000). Use recording rules or pre-aggregation. Limit time ranges for expensive queries.
- **Dashboard drift**: Without provisioning, manual UI changes create version control conflicts. Use `allowUiUpdates: false` in production.
- **Missing data links**: Data links require exact label names. Use `${__field.labels.labelname}` carefully, verify label exists in query result.
- **Annotation overload**: Too many annotations clutter the view. Filter annotations by importance or use separate annotation tracks.

## Related Skills

- `setup-prometheus-monitoring` - Configure Prometheus data sources that feed Grafana dashboards
- `configure-log-aggregation` - Set up Loki for log panel queries and log-based annotations
- `define-slo-sli-sla` - Visualize SLO compliance and error budgets with Grafana stat and gauge panels
- `instrument-distributed-tracing` - Add trace ID links from metrics panels to Tempo trace views