---
name: ai-bug-triage
description: >-
  Hybrid fingerprint + LLM pipeline for bug classification, deduplication, and ticket
  generation. Normalizes CI logs, creates stable fingerprints, clusters near-duplicates,
  then uses LLM for severity classification and ticket writing. Includes bug reporting
  templates and severity/priority matrix. Use when: "bug triage," "classify bugs,"
  "failure analysis," "auto-classify," "CI failures," "bug report," "defect template."
  Related: qa-metrics, qa-dashboard, ci-cd-integration, qa-project-context.
license: MIT
metadata:
  author: kindlmann
  version: "1.0"
  category: ai-qa
---
A hybrid pipeline for bug classification, deduplication, and ticket generation. Deterministic fingerprinting handles deduplication (what LLMs are bad at); the LLM handles explanation, severity assessment, and ticket writing (what LLMs are good at).
**Key reframe:** The LLM is best at explaining and routing, not deduplication. Teach agents to DESIGN the pipeline, not BE the pipeline.
**Before starting:** Check for `.agents/qa-project-context.md` in the project root. It contains tech stack, component mapping, and known flaky areas that improve classification accuracy.
---
## Discovery Questions
Before building or using a triage pipeline, clarify:
1. **What is the failure source?**
   - CI pipeline logs (GitHub Actions, GitLab CI, Jenkins, CircleCI)
   - Test framework output (Playwright, Jest, pytest, Vitest)
   - Production error monitoring (Sentry, Datadog, Bugsnag)
   - Manual bug reports from QA or users
2. **What is the ticket destination?**
   - Jira, Linear, GitHub Issues, Azure DevOps, Shortcut
   - What fields are required? (component, severity, priority, labels)
   - What workflows exist? (triage board, auto-assignment rules)
3. **What is the deduplication scope?**
   - Same test run? Same sprint? Same release? All time?
   - Do you already have fingerprinting? What is the current duplicate rate?
4. **What approval workflow is needed?**
   - Auto-create tickets with human review?
   - Suggest tickets for human approval before creation?
   - Auto-close duplicates? (dangerous -- require approval)
5. **What historical data exists?**
   - Past bug reports with resolution data?
   - Flaky test history? Known environment issues?
   - Component ownership mapping?
---
## Core Principles
1. **Deterministic first, LLM second.** Use stable, reproducible fingerprinting for deduplication and clustering. Use LLM only for tasks requiring understanding: severity classification, root cause hypothesis, and human-readable ticket writing.
2. **Normalize before comparing.** Raw CI logs are full of timestamps, port numbers, process IDs, and random suffixes that make identical failures look different. Strip all noise before fingerprinting.
3. **Fingerprints are anchored to stable elements.** Exception type, top stack frames, test name, error message template, and URL pattern are stable. Timestamps, request IDs, and ephemeral ports are not.
4. **Human approval before destructive actions.** Auto-closing a ticket as duplicate or auto-merging reports requires human confirmation. False deduplication wastes more time than manual triage.
5. **Classification drives routing.** The value of triage is not the label itself but the routing decision it enables: which team, what priority, what SLA.
6. **Track triage accuracy.** Measure how often auto-classification matches human judgment. Below 85% accuracy, the pipeline needs tuning.
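Principle 6 can be measured with very little code. The sketch below assumes you keep pairs of (auto label, human label) for reviewed failures; the function names and the pair format are illustrative, not part of any existing tool.

```python
from typing import Iterable, Tuple

ACCURACY_FLOOR = 0.85  # threshold from principle 6


def triage_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of reviewed failures where the auto label equals the human label."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no reviewed failures to score")
    matches = sum(1 for auto, human in pairs if auto == human)
    return matches / len(pairs)


def needs_tuning(pairs: Iterable[Tuple[str, str]]) -> bool:
    """True when agreement with human judgment drops below the floor."""
    return triage_accuracy(pairs) < ACCURACY_FLOOR
```

Feed it the labels collected at the human approval step, so the metric reflects real review decisions instead of a separate audit.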
---
## The Pipeline
```
CI Log / Error Report
│
▼
Step 1: NORMALIZE
Strip timestamps, process IDs, ports, random suffixes, ANSI codes
│
▼
Step 2: EXTRACT STABLE ANCHORS
Exception type, top N stack frames, test name, error message template, URL pattern
│
▼
Step 3: HASH CANONICAL FORM
Deterministic fingerprint from ordered anchors
│
▼
Step 4: CLUSTER NEAR-DUPLICATES
Similarity scoring for non-identical but related failures
│
▼
Step 5: LLM CLASSIFY
Severity, component, suspected root cause, failure category
│
▼
Step 6: LLM GENERATE TICKET
Title, description, repro steps, evidence, suggested assignee
│
▼
Step 7: HUMAN APPROVAL
Review before create/close/merge
```
### Step 1: Normalize
Strip noise that makes identical failures look different.
**Normalization rules (apply in order):**
```
1. Strip ANSI color codes: \x1b\[[0-9;]*m → ""
2. Strip timestamps: \d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}[.\d]*Z? → ""
3. Strip UUIDs: [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} → ""
4. Strip process IDs: pid[=: ]\d+ → "pid="
5. Strip port numbers: :\d{4,5}(?=[\s/]) → ":"
6. Strip temp file paths: /tmp/[^\s]+ → ""
7. Strip memory addresses: 0x[0-9a-f]{8,16} → ""
8. Strip random suffixes: [-_][a-z0-9]{6,8}(?=\.) → ""
9. Strip request IDs: (?:request[_-]?id|trace[_-]?id|correlation[_-]?id)[=: ]["']?[a-zA-Z0-9-]+ → ""
10. Collapse whitespace: \s+ → " "
```
**Example:**
```
Before: 2025-03-22T14:32:01.456Z [pid=42891] Error: Connection refused at 127.0.0.1:54321
request_id=abc-123-def-456
After: [pid=] Error: Connection refused at 127.0.0.1:
```
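The rules above translate directly into an ordered list of regex substitutions. This is a minimal sketch that copies the patterns and replacements as listed; the names `RULES` and `normalize` are illustrative.

```python
import re

# (pattern, replacement) pairs in the same order as rules 1-10 above.
RULES = [
    (re.compile(r"\x1b\[[0-9;]*m"), ""),                                   # 1. ANSI colors
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}[.\d]*Z?"), ""),   # 2. timestamps
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"), ""),  # 3. UUIDs
    (re.compile(r"pid[=: ]\d+"), "pid="),                                  # 4. process IDs
    (re.compile(r":\d{4,5}(?=[\s/])"), ":"),                               # 5. port numbers
    (re.compile(r"/tmp/[^\s]+"), ""),                                      # 6. temp paths
    (re.compile(r"0x[0-9a-f]{8,16}"), ""),                                 # 7. memory addresses
    (re.compile(r"[-_][a-z0-9]{6,8}(?=\.)"), ""),                          # 8. random suffixes
    (re.compile(r"(?:request[_-]?id|trace[_-]?id|correlation[_-]?id)[=: ][\"']?[a-zA-Z0-9-]+"), ""),  # 9. request IDs
    (re.compile(r"\s+"), " "),                                             # 10. whitespace
]


def normalize(log: str) -> str:
    """Apply every rule in order, then trim the leading/trailing space left behind."""
    for pattern, replacement in RULES:
        log = pattern.sub(replacement, log)
    return log.strip()
```

Order matters: timestamps are removed before the port rule runs, so `14:32:01` is never mistaken for a port. Note that the port pattern only fires when the port is followed by whitespace or `/`, so a port at the very end of the input is left in place.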
### Step 2: Extract Stable Anchors
From the normalized log, extract elements that identify the failure regardless of environment or timing.
**Anchor types (in priority order):**
| Anchor | Example | Stability |
|--------|---------|-----------|
| Exception type | `TypeError`, `AssertionError`, `HTTP 500` | Very high |
| Error message template | `Cannot read property 'X' of undefined` | High |
| Top 3 stack frames | `at processOrder (order.ts:142)` | High |
| Test name | `checkout.spec.ts > completes payment` | Very high |
| URL pattern | `POST /api/orders` | High |
| HTTP status code | `500`, `429`, `503` | Very high |
| Exit code | `exit code 1`, `SIGKILL` | High |
| Assertion diff | `Expected: 200, Received: 500` | Medium |
**Extraction rules:**
- Keep function names but strip line numbers (they change with edits)
- Keep URL paths but strip query parameters and IDs in paths (`/api/orders/`)
- Keep error message structure but replace dynamic values with placeholders
- Keep test file and test name exactly as-is
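The extraction rules can be sketched as three small helpers. Several things here are assumptions: stack frames are in the `at fn (file:line)` style shown in the table, path IDs are numeric or UUID-shaped, and the only dynamic values replaced in messages are digit runs (placeholder `N`). All helper names are hypothetical.

```python
import re
from typing import List

# Assumes V8-style frames: "at functionName (file:line)".
FRAME_RE = re.compile(r"^\s*at\s+([\w.$<>]+)\s*\(", re.MULTILINE)
# Assumes IDs in paths are all digits or 36-character UUIDs.
ID_SEGMENT_RE = re.compile(r"\d+|[0-9a-f-]{36}")


def top_frames(stack: str, n: int = 3) -> List[str]:
    """Function names of the first n frames; file names and line numbers are dropped."""
    return FRAME_RE.findall(stack)[:n]


def url_pattern(method: str, url: str) -> str:
    """Keep the path, drop the query string, and blank out ID segments."""
    path = url.split("?", 1)[0]
    segments = ["" if ID_SEGMENT_RE.fullmatch(s) else s for s in path.split("/")]
    return f"{method.upper()} {'/'.join(segments)}"


def message_template(message: str) -> str:
    """Replace digit runs with N so that only the message structure remains."""
    return re.sub(r"\d+", "N", message)
```

Which values count as dynamic is a project decision. Property names such as `vendorId` are usually stable and worth keeping, while counts, durations, and record IDs are not.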
### Step 3: Hash Canonical Form
Create a deterministic fingerprint from the extracted anchors.
**Algorithm:**
```
1. Arrange anchors in a fixed order: exception_type, message_template, top_frames, test_name
2. Concatenate with "|" as separator: exception_type + "|" + message_template + "|" + top_frames + "|" + test_name
3. SHA-256 hash the UTF-8 bytes of the concatenated string
4. Take the first 16 hex characters as the fingerprint
```
**Fingerprint properties:**
- Same failure always produces the same fingerprint (deterministic)
- Different failures produce different fingerprints in practice (16 hex characters keep 64 bits of the SHA-256 digest, so accidental collisions are very unlikely)
- Log format changes that leave the anchors untouched do not change the fingerprint (stable)
- Fingerprint is short enough for Jira labels and GitHub labels
**Example:**
```
Anchors:
exception_type: "TypeError"
message_template: "Cannot read property 'vendorId' of undefined"
top_frames: "processOrder|groupByVendor|checkout"
test_name: "checkout.spec.ts > multi-vendor checkout"
Canonical: "TypeError|Cannot read property 'vendorId' of undefined|processOrder|groupByVendor|checkout|checkout.spec.ts > multi-vendor checkout"
Fingerprint: a3f8b2c1e9d04567   (illustrative value)
```
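A minimal sketch of the hashing step using Python's standard `hashlib`. The anchor keys follow the example above; the function names and the choice to treat a missing anchor as an empty string are assumptions.

```python
import hashlib
from typing import Dict

# Fixed anchor order, matching the canonical string in the example above.
ANCHOR_ORDER = ("exception_type", "message_template", "top_frames", "test_name")


def canonical_form(anchors: Dict[str, str]) -> str:
    """Join anchors with '|' in a fixed order; a missing anchor becomes an empty field."""
    return "|".join(anchors.get(key, "") for key in ANCHOR_ORDER)


def fingerprint(anchors: Dict[str, str]) -> str:
    """First 16 hex characters of the SHA-256 digest of the canonical form."""
    digest = hashlib.sha256(canonical_form(anchors).encode("utf-8")).hexdigest()
    return digest[:16]
```

Because the order lives in `ANCHOR_ORDER` and not in the caller's dictionary, two runs that collect the same anchors in a different order still produce the same fingerprint.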
### Step 4: Cluster Near-Duplicates
Exact fingerprint matching catches identical failures. Similarity scoring catches related failures that differ slightly (same root cause, different manifestation).
**Similarity dimensions:**
| Dimension | Weight | Match Criteria |
|-----------|--------|---------------|
| Exception type | 0.30 | Exact match |
| Error message | 0.25 | Levenshtein distance < 20% of message length |
| Stack frames | 0.25 | Jaccard similarity of top 5 frames > 0.6 |
| Component/file | 0.10 | Same directory or module |
| Test name | 0.10 | Same describe block or test file |
**Clustering threshold:** similarity score > 0.75 = likely duplicate, suggest merge.
**Human review required for:**
- Scores between 0.60 and 0.75 (ambiguous)
- First occurrence of a new fingerprint (no history to compare)
- Failures in components with known intermittent issues
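The weighted score can be sketched in plain Python. The weights and thresholds come from the table above; the `Failure` record, the reading of "message length" as the longer of the two messages, and the "same directory" and "same test file" checks are assumptions made for illustration.

```python
from dataclasses import dataclass
from posixpath import dirname
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Failure:  # hypothetical record holding the extracted anchors
    exception_type: str
    message: str
    frames: Tuple[str, ...]
    file: str
    test_name: str


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Size of the intersection over size of the union; 0.0 when both are empty."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


def similarity(x: Failure, y: Failure) -> float:
    score = 0.0
    if x.exception_type == y.exception_type:
        score += 0.30
    longest = max(len(x.message), len(y.message))
    if longest and levenshtein(x.message, y.message) < 0.20 * longest:
        score += 0.25
    if jaccard(x.frames[:5], y.frames[:5]) > 0.6:
        score += 0.25
    if dirname(x.file) == dirname(y.file):
        score += 0.10
    if x.test_name.split(" > ")[0] == y.test_name.split(" > ")[0]:
        score += 0.10
    return round(score, 2)  # rounding hides float noise from summing the weights


def verdict(score: float) -> str:
    if score > 0.75:
        return "likely-duplicate"
    if score >= 0.60:
        return "needs-review"
    return "distinct"
```

Every step is deterministic, so comparing the same two failures twice always gives the same score, which is the property the pipeline relies on for deduplication.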
### Step 5: LLM Classify
After deterministic fingerprinting and clustering, use LLM to classify the failure.
**LLM classification prompt:**
```
Given this normalized failure:
Exception: [TYPE]
Message: [MESSAGE]
Stack trace (top 5 frames): [FRAMES]
Test name: [TEST]
CI context: [branch, commit, runner OS]
Classify this failure:
1. **Failure category:** test bug | application bug | environment issue | flaky test | build failure
2. **Severity:** critical | major | minor | trivial (see severity matrix below)
3. **Component:** [infer from stack trace and file paths]
4. **Suspected root cause:** [1-2 sentence hypothesis]
5. **Confidence:** high | medium | low
If confidence is low, explain what additional information would help.
```
**Failure categories (see references/ci-failure-analysis.md for detail):**
| Category | Description | Typical Action |
|----------|-------------|---------------|
| Application bug | The app is broken | File bug ticket, assign to owning team |
| Test bug | The test is wrong | Fix the test, no app change needed |
| Environment issue | CI infra / network / service down | Retry, notify infra team |
| Flaky test | Intermittent, non-deterministic | Quarantine, investigate root cause |
| Build failure | Compilation, dependency, config | Fix build, usually blocking |
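LLM output should be validated before it drives routing. The sketch below assumes the model is additionally asked to answer as a JSON object with the keys `category`, `severity`, `component`, `root_cause`, and `confidence`; that reply format and all names in the code are hypothetical, while the allowed values come from the prompt above.

```python
import json
from dataclasses import dataclass

# Allowed values, taken from the classification prompt above.
ALLOWED = {
    "category": {"test bug", "application bug", "environment issue", "flaky test", "build failure"},
    "severity": {"critical", "major", "minor", "trivial"},
    "confidence": {"high", "medium", "low"},
}


@dataclass(frozen=True)
class Classification:
    category: str
    severity: str
    component: str
    root_cause: str
    confidence: str


def parse_classification(raw: str) -> Classification:
    """Parse the JSON reply and reject labels outside the allowed sets."""
    data = json.loads(raw)
    # A missing key raises KeyError, which is also a reason to reject the reply.
    fields = {name: str(data[name]).strip() for name in Classification.__dataclass_fields__}
    for name, allowed in ALLOWED.items():
        fields[name] = fields[name].lower()
        if fields[name] not in allowed:
            raise ValueError(f"unexpected {name}: {fields[name]!r}")
    return Classification(**fields)
```

Rejecting an out-of-vocabulary label is safer than mapping it to the nearest valid one, because a silently corrected severity would skew the accuracy tracking described in the core principles.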
### Step 6: LLM Generate Ticket
Once classified, use LLM to generate a human-quality bug ticket.
**Ticket generation prompt:**
```
Generate a bug ticket from this classified failure:
Failure category: [CATEGORY]
Severity: [SEVERITY]
Component: [COMPONENT]
Fingerprint: [HASH]
Suspected root cause: [HYPOTHESIS]
Normalized error:
[NORMALIZED ERROR WITH CONTEXT]
Original log excerpt (last 30 lines before failure):
[LOG EXCERPT]
Related failures (same cluster):
[LIST OF SIMILAR FINGERPRINTS WITH DATES]
Generate:
1. **Title:** concise, searchable, includes component name (under 80 chars)
2. **Description:** what happened, in plain language
3. **Steps to reproduce:** derived from test name and log context
4. **Evidence:** relevant log lines, assertion diffs, screenshots if available
5. **Suggested labels:** [component, severity, failure-category]
6. **Suggested assignee:** based on component ownership (if known)
```
### Step 7: Human Approval
**No automated action without review.** The pipeline suggests; humans decide.
**Approval decisions:**
- **Create ticket** — New failure, clear root cause, assign to team
- **Merge into existing** — Duplicate of known issue, add evidence to existing ticket
- **Quarantine test** — Flaky test, not an app bug, quarantine and schedule investigation
- **Retry and monitor** — Environment issue, retry CI, alert if persists
- **Dismiss** — Known issue already fixed in pending deploy, or test bug with obvious fix
---
## Severity/Priority Matrix
Severity measures impact. Priority measures urgency. They are independent dimensions.
### Severity Definitions
| Severity | Definition | Examples |
|----------|-----------|---------|
| **Critical** | System unusable, data loss, security breach, no workaround | Payment processing fails, user data exposed, app crashes on launch |
| **Major** | Core feature broken, degraded experience, workaround exists | Search returns wrong results, checkout requires page reload, form data lost on back-button |
| **Minor** | Non-core feature affected, cosmetic with functional impact | Sorting does not persist, tooltip clipped on mobile, secondary action fails |
| **Trivial** | Cosmetic only, no functional impact | Typo in label, 1px alignment, inconsistent capitalization |
### Priority Definitions
| Priority | Definition | SLA (example) |
|----------|-----------|---------------|
| **P0** | Fix immediately, blocks release or production | Same day |
| **P1** | Fix this sprint, significant user impact | This sprint |
| **P2** | Fix next sprint, moderate impact | Next sprint |
| **P3** | Fix when convenient, low impact | Backlog |
### Severity x Priority Decision Guide
Rows are user reach, columns are severity; each cell is the suggested priority.
| | Critical | Major | Minor | Trivial |
|---|---------|-------|-------|---------|
| **Affects all users** | P0 | P0 | P1 | P2 |
| **Affects segment (≥10%)** | P0 | P1 | P2 | P3 |
| **Affects few users (<10%)** | P1 | P1 | P2 | P3 |
| **Edge case only** | P1 | P2 | P3 | P3 |
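The decision guide is a lookup table, so it can be encoded as one. A minimal sketch follows; the reach keys `all`, `segment`, `few`, and `edge` are shorthand invented here for the four rows.

```python
# Rows: user reach. Columns: severity. Values copied from the decision guide above.
PRIORITY = {
    "all":     {"critical": "P0", "major": "P0", "minor": "P1", "trivial": "P2"},
    "segment": {"critical": "P0", "major": "P1", "minor": "P2", "trivial": "P3"},
    "few":     {"critical": "P1", "major": "P1", "minor": "P2", "trivial": "P3"},
    "edge":    {"critical": "P1", "major": "P2", "minor": "P3", "trivial": "P3"},
}


def suggested_priority(severity: str, reach: str) -> str:
    """Look up the suggested priority; unknown inputs raise KeyError."""
    return PRIORITY[reach.lower()][severity.lower()]
```

Keeping the matrix as data makes it easy to review alongside the table and to adjust per project without touching the lookup logic.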
---
## Bug Report Template
Use this template for any bug report, whether auto-generated or human-written.
```markdown
## [Component] Brief description of the defect
**Severity:** Critical | Major | Minor | Trivial
**Priority:** P0 | P1 | P2 | P3
**Component:** [module/service/page]
**Environment:** [OS, browser, deploy environment]
**Fingerprint:** [if auto-generated: hash ID]
**Reporter:** [person or "auto-triage pipeline"]
### Description
[1-3 sentences: what is broken, who is affected, what is the business impact]
### Steps to Reproduce
1. [Precondition: user role, data state]
2. [Navigate to / call endpoint]
3. [Perform action]
4. [Observe failure]
### Expected Behavior
[What should happen]
### Actual Behavior
[What actually happens — include error messages verbatim]
### Evidence
- **Error log:** [relevant lines]
- **Screenshot:** [if applicable]
- **Assertion diff:** [expected vs actual values]
- **Trace/request ID:** [for distributed tracing]
### Frequency
- [Always | Intermittent (N/M runs) | Once observed]
- First seen: [date/commit]
- Last seen: [date/commit]
### Suggested Root Cause
[Hypothesis based on evidence — helps developer investigation]
### Related Issues
- [Links to similar/duplicate tickets]
- [Links to related PRs or deployments]
```
---
## Deduplication Patterns
| Pattern | Detection | Action |
|---------|-----------|--------|
| **Exact duplicate** | Same fingerprint | Merge into existing ticket, add evidence |
| **Near-duplicate** | Same cluster (similarity > 0.75) | Link tickets, suggest merge for human review |
| **Same root cause, different symptom** | Same exception type + overlapping frames in different tests | Create parent ticket linking symptom tickets |
| **Regression of fixed bug** | Fingerprint matches closed ticket | Reopen ticket, flag as regression, increase priority |
| **Flaky recurrence** | Same fingerprint intermittently across CI runs | Tag as flaky, quarantine if rate > 10% |
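Four of the patterns above reduce to simple checks. The sketch below is one possible encoding: the order of the checks, the ticket dictionaries keyed by fingerprint, and all names are assumptions, and the parent-ticket pattern is left out because it needs frame overlap data. The result is a suggestion for the human approval step, not an action to run automatically.

```python
from enum import Enum
from typing import Dict, List


class Action(Enum):
    MERGE_INTO_EXISTING = "merge into existing ticket, add evidence"
    REOPEN_AS_REGRESSION = "reopen ticket, flag as regression"
    SUGGEST_MERGE = "link tickets, suggest merge"
    CREATE_TICKET = "create new ticket"


def suggest_action(fingerprint: str,
                   open_tickets: Dict[str, str],
                   closed_tickets: Dict[str, str],
                   best_similarity: float = 0.0) -> Action:
    """Map a new failure to a suggested deduplication action."""
    if fingerprint in open_tickets:        # exact duplicate
        return Action.MERGE_INTO_EXISTING
    if fingerprint in closed_tickets:      # regression of a fixed bug
        return Action.REOPEN_AS_REGRESSION
    if best_similarity > 0.75:             # near-duplicate
        return Action.SUGGEST_MERGE
    return Action.CREATE_TICKET


def flaky_tags(failed_runs: int, total_runs: int) -> List[str]:
    """Tag intermittent fingerprints; add quarantine when the failure rate exceeds 10%."""
    if not 0 < failed_runs < total_runs:
        return []  # never fails or always fails: not intermittent
    tags = ["flaky"]
    if failed_runs / total_runs > 0.10:
        tags.append("quarantine")
    return tags
```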
---
## CI Failure Analysis
See `references/ci-failure-analysis.md` for comprehensive patterns. Key decision: consistent failure = test bug or app bug; intermittent failure = flaky test or environment; multiple failures at once = environment or shared component; build failure = code or dependency issue.
---
## Integration Patterns
### GitHub Issues
```bash
# Create issue with labels from pipeline output
gh issue create \
  --title "[Checkout] Payment fails for multi-vendor carts" \
  --body "$(cat ticket-body.md)" \
  --label "bug,severity:critical,component:checkout" \
  --assignee "@me"
# Check for duplicate by fingerprint (use the full 16-character fingerprint as the label)
gh issue list --label "fingerprint:a3f8b2c1e9d04567" --state all
```
### CI Pipeline Integration
```yaml
# GitHub Actions: run triage on test failure
- name: Triage failures
  if: failure()
  env:
    GH_TOKEN: ${{ github.token }}  # gh needs a token when run inside Actions
  run: |
    node scripts/extract-failures.js test-results/
    node scripts/triage-pipeline.js --input failures.json --output tickets/
    for ticket in tickets/*.json; do
      gh issue create --title "$(jq -r .title "$ticket")" \
        --body "$(jq -r .body "$ticket")" \
        --label "$(jq -r '.labels | join(",")' "$ticket")"
    done
```
For Jira, Linear, and Azure DevOps integration, use their respective REST/GraphQL APIs with the same ticket data generated by Step 6. The pipeline output is tracker-agnostic -- it produces title, description, labels, severity, and component that map to any tracker's fields.
---
## Anti-Patterns
### 1. Using LLM for Deduplication
LLMs are non-deterministic. The same two errors compared twice may get different similarity scores. Use deterministic fingerprinting for deduplication; use LLM only for explaining and classifying.
### 2. Auto-Closing Without Review
Automatically closing a ticket as "duplicate" based on fingerprint matching can merge distinct issues. Always require human confirmation for close/merge actions.
### 3. Over-Classifying Severity
If everything is "critical," nothing is. Follow the severity matrix strictly. A cosmetic typo is trivial even if it annoys someone.
### 4. Ignoring Environment Failures
Labeling all failures as "app bug" when many are CI infrastructure issues (Docker OOM, network timeout, disk full). Classify environment issues separately -- they need different remediation.
### 5. No Feedback Loop
Building the pipeline once and never measuring accuracy. Track: auto-classification accuracy, false duplicate rate, ticket quality ratings from developers.
### 6. Raw Logs in Tickets
Pasting 500 lines of raw CI output into a bug ticket. Normalize, extract relevant lines, and present the 5-10 lines that matter.
### 7. Fingerprinting Without Normalization
Hashing raw log lines produces unstable fingerprints that change every run. Normalization (Step 1) is mandatory before fingerprinting.
### 8. No Component Ownership Mapping
Classification without routing is useless. Maintain a component-to-team mapping so that classified bugs reach the right people.
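A component-to-team mapping can start as a dictionary of path prefixes. In this sketch the prefixes, team names, and the fallback `triage-rotation` are made up for illustration; the longest matching prefix wins so that specific components override broad ones.

```python
from typing import Dict

# Hypothetical ownership map: path prefix -> owning team.
OWNERS: Dict[str, str] = {
    "src/checkout/": "team-payments",
    "src/search/": "team-discovery",
    "src/": "team-platform",
}


def owner_for(path: str, owners: Dict[str, str] = OWNERS,
              default: str = "triage-rotation") -> str:
    """Return the team owning the longest matching prefix, or the default."""
    matches = [prefix for prefix in owners if path.startswith(prefix)]
    return owners[max(matches, key=len)] if matches else default
```

The top stack frame's file path is a natural input, since it is already extracted as a stable anchor.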
---
## Done When
- Each triaged bug has severity, component, and root cause labels assigned
- Duplicates merged or linked with references to the canonical ticket
- CI failure analysis report generated summarizing failure categories and counts
- Actionable tickets created for all P0 and P1 issues with assigned owners
- Triage session findings summarized and shared with the team
---
## Related Skills
- **`qa-metrics`** — Track triage accuracy, duplicate rates, mean time to classification, and defect escape rates.
- **`ci-cd-integration`** — Pipeline configuration for running triage on test failures, parallel execution, and reporting.
- **`test-reliability`** — Flaky test classification, quarantine management, and root cause analysis.
- **`qa-project-context`** — Project context that improves classification accuracy: component map, known issues, ownership.
- **`ai-test-generation`** — Generate regression tests from triaged bug reports.
---
## References
- `references/classification-taxonomy.md` — Bug categories, severity definitions, component mapping rules, and root cause categories.
- `references/ci-failure-analysis.md` — CI log parsing patterns, failure category decision tree, fingerprinting algorithm detail.