---
name: slo-workshop
description: Interactive SLO definition workshop - guides through defining SLIs, setting SLO targets, and establishing error budget policies for a service
allowed-tools: Read, Glob, Grep, Task, AskUserQuestion
argument-hint: [service-name-or-context]
---

# SLO Workshop Command

This command runs an interactive workshop to help define SLOs (Service Level Objectives) for a service.

## Purpose

Guide teams through the complete SLO definition process:

1. Identifying critical user journeys
2. Selecting appropriate SLIs (Service Level Indicators)
3. Setting realistic SLO targets
4. Establishing error budget policies
5. Designing alerting strategies

## Workflow

### Phase 1: Service Understanding

First, understand the service context:

**If a service name or file is provided:**

- Search the codebase for the service
- Identify endpoints, dependencies, and user-facing functionality
- Look for existing metrics, SLOs, or monitoring configuration

**Gather context through questions:**

1. What does this service do for users?
2. Who are the primary users (internal/external)?
3. What are the critical user journeys?
4. What does "working correctly" mean for users?

### Phase 2: SLI Selection

Guide through selecting meaningful SLIs:

**Present SLI categories:**

```text
Common SLI Types:

1. Availability
   "Can users access the service?"
   Measurement: Successful requests / Total requests

2. Latency
   "How fast does the service respond?"
   Measurement: Request duration at percentile (p50, p90, p99)

3. Correctness
   "Does the service return correct results?"
   Measurement: Correct responses / Total responses

4. Throughput
   "Can the service handle the load?"
   Measurement: Requests processed per time unit

5. Freshness
   "How current is the data?"
   Measurement: Age of data served to users
```

**For each relevant SLI type, define:**

- What counts as a "good" event
- What counts as a "valid" event (denominator)
- How it will be measured (metrics, logs, synthetic)

### Phase 3: SLO Target Setting

Help set appropriate targets:

**Consider factors:**

- Current baseline (what are we achieving today?)
- User expectations (what do users need?)
- Engineering capacity (what can we sustain?)
- Business requirements (what's contractually required?)

**Provide guidance:**

```text
SLO Target Guidance:

Starting Point Recommendations:
- Availability: Start at current baseline - 0.1%
- Latency: Start at current p99 + 20% buffer

Common Targets:
- 99.9% = 43 minutes downtime/month
- 99.5% = 3.6 hours downtime/month
- 99% = 7.3 hours downtime/month

Tips:
- Don't start at 100% (impossible to maintain)
- Don't set targets you can't measure
- Conservative targets are easier to achieve
- You can tighten targets over time
```

### Phase 4: Error Budget Policy

Define what happens when the error budget is consumed:

**Error budget calculation:**

```text
Error Budget = 100% - SLO Target

Example:
SLO = 99.9% availability
Error Budget = 0.1% = 43.2 minutes/month
```

**Policy framework:**

```text
Error Budget Policy Template:

Budget > 50%:
- Normal development velocity
- Standard change process

Budget 25-50%:
- Increased review for risky changes
- Prioritize reliability improvements

Budget < 25%:
- Pause non-critical feature work
- Focus on reliability improvements

Budget exhausted:
- Stop all non-critical deployments
- All hands on reliability
- Postmortem for budget-burning incidents
```

### Phase 5: Alerting Strategy

Design multi-window burn rate alerting:

**Explain burn rate concept:**

```text
Burn Rate Alerting:

Burn rate = Rate of consuming error budget

1x burn rate = Exactly consuming monthly budget
2x burn rate = Will exhaust budget in 15 days
10x burn rate = Will exhaust budget in 3 days

Multi-window alerts:
- Fast burn: 14.4x rate over 1 hour (page)
- Slow burn: 3x rate over 3 days (ticket)
```

#### Define alert thresholds based on SLO targets

### Phase 6: Documentation

Generate SLO documentation:

```text
# [Service Name] SLO Definition

## Service Overview
[Description from workshop]

## Critical User Journeys
1. [Journey 1]
2. [Journey 2]

## SLIs

### [SLI Name]
- Type: [Availability/Latency/etc.]
- Definition: [How measured]
- Good event: [What counts as good]
- Valid event: [What counts as valid]

## SLO Targets

| SLI | Target | Window | Error Budget |
|-----|--------|--------|--------------|
| [SLI 1] | [%] | [days] | [time] |

## Error Budget Policy

### Budget > 50%
[Actions]

### Budget 25-50%
[Actions]

### Budget < 25%
[Actions]

### Budget Exhausted
[Actions]

## Alerting

| Alert | Burn Rate | Window | Severity |
|-------|-----------|--------|----------|
| [Name] | [rate]x | [time] | [Page/Ticket] |

## Review Schedule
- Quarterly SLO review
- Monthly error budget review
- After significant incidents
```

## Usage Examples

```bash
# Start workshop for a specific service
/sd:slo-workshop order-service

# Start workshop with context file
/sd:slo-workshop @docs/services/payment-api.md

# Start general workshop
/sd:slo-workshop
```

## Interactive Elements

Throughout the workshop, use `AskUserQuestion` to:

- Gather service context
- Validate SLI selections
- Confirm target appropriateness
- Review error budget policies

## Output

The workshop produces:

1. **SLO Definition Document** - Complete SLO specification
2. **Implementation Checklist** - Steps to implement the SLOs
3. **Review Schedule** - When to revisit and adjust

## Related Skills

This command leverages:

- `slo-sli-error-budget` - SLO methodology details
- `observability-patterns` - Measurement approaches
- `distributed-tracing` - Trace-based SLIs

## Related Agent

For SLO consultation without interactive workshop:

- `observability-consultant` - General observability guidance