---
status: proposed
job-id: restore-service-fast
persona: tech-lead
date-created: 2026-04-16
human-oversight: confirmed
oversight-date: 2026-05-25
---

# JTBD-201: Restore Service Fast with an Audit Trail

## Job Statement

When production breaks, I want an evidence-first workflow that gets service restored quickly and hands the root-cause work to problem management, so I can separate "stop the bleeding" from "stop it happening again" without losing either.

## Desired Outcomes

- Incident lifecycle is explicit: investigating → mitigating → restored → closed
- Incidents use a separate `I###` namespace in `docs/incidents/` so they are not conflated with persistent problems in `docs/problems/`
- Hypotheses cite evidence (logs, repro, diff, metric) before any mitigation is attempted
- Reversible mitigations (rollback, feature flag, restart) are preferred over forward fixes
- Restoration triggers an explicit handoff to `wr-itil:manage-problem`, linking the incident to a new or existing `P###`
- Timeline, observations, mitigations, and verification signals are captured as an audit trail
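
The outcomes above could be enforced in tooling roughly as follows. This is a minimal sketch, not the actual `wr-itil` implementation. The names `Incident`, `Hypothesis`, `Status`, `advance`, and `link_problem` are hypothetical. It also assumes three things the outcomes do not state: IDs are exactly three digits (`I001`, `P012`), the lifecycle is strictly linear, and an incident cannot close until it is linked to a problem.

```python
import re
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Status(Enum):
    INVESTIGATING = "investigating"
    MITIGATING = "mitigating"
    RESTORED = "restored"
    CLOSED = "closed"


# Assumption: the lifecycle only moves forward, one step at a time.
NEXT = {
    Status.INVESTIGATING: Status.MITIGATING,
    Status.MITIGATING: Status.RESTORED,
    Status.RESTORED: Status.CLOSED,
}

# Evidence kinds named in the desired outcomes.
EVIDENCE_KINDS = {"logs", "repro", "diff", "metric"}


@dataclass
class Hypothesis:
    claim: str
    # Each entry is (kind, reference), e.g. ("diff", "link to the commit").
    evidence: List[Tuple[str, str]] = field(default_factory=list)

    def is_grounded(self) -> bool:
        return any(kind in EVIDENCE_KINDS for kind, _ in self.evidence)


@dataclass
class Incident:
    incident_id: str
    status: Status = Status.INVESTIGATING
    hypotheses: List[Hypothesis] = field(default_factory=list)
    problem_id: Optional[str] = None
    timeline: List[str] = field(default_factory=list)  # audit trail entries

    def __post_init__(self) -> None:
        # Assumption: "I###" means the letter I followed by three digits.
        if not re.fullmatch(r"I\d{3}", self.incident_id):
            raise ValueError("incident IDs use the I### namespace")

    def link_problem(self, problem_id: str) -> None:
        # Records the handoff to problem management in the audit trail.
        if not re.fullmatch(r"P\d{3}", problem_id):
            raise ValueError("problem IDs use the P### namespace")
        self.problem_id = problem_id
        self.timeline.append(f"handoff to {problem_id}")

    def advance(self) -> Status:
        if self.status is Status.CLOSED:
            raise ValueError("incident is already closed")
        target = NEXT[self.status]
        # Evidence-first: no mitigation without a grounded hypothesis.
        if target is Status.MITIGATING and not any(
            h.is_grounded() for h in self.hypotheses
        ):
            raise ValueError("cite evidence before attempting a mitigation")
        # Assumption: closing is blocked until the problem handoff exists.
        if target is Status.CLOSED and self.problem_id is None:
            raise ValueError("link a P### problem before closing")
        self.timeline.append(f"{self.status.value} -> {target.value}")
        self.status = target
        return target
```

A typical flow would be: create `Incident("I001")`, append a `Hypothesis` that cites a log or diff, call `advance()` through to `restored`, call `link_problem("P012")`, then call `advance()` once more to close. Raising an error at each gate keeps the "evidence before mitigation" and "handoff before close" rules explicit rather than leaving them to convention.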

## Persona Constraints

- Needs consistent incident-response standards across teams and client engagements
- Requires auditability of AI-assisted incident work for post-incident review
- Cross-links to developer: the workflow stays lightweight, so low-severity incidents can skip the full template without breaking the lifecycle

## Current Solutions

Ad-hoc incident response in chat, post-mortems written from memory, and root causes either lost or merged into problem tickets by hand.