# Evidence Methodology

This document defines how `a11y-shiftleft-cli` should collect and report
evidence for finding quality, developer trust, and review effort.

The project does not claim complete accessibility conformance. Automated scans
are evidence for risk detection and remediation tracking, not a replacement for
manual review.

## Finding Types

Reports separate three kinds of evidence:

| Type | Meaning |
|---|---|
| `wcag` | The rule is mapped to one or more WCAG success criteria. |
| `best-practice` | The scanner identifies useful guidance without claiming a WCAG failure. |
| `unmapped` | The finding needs review because no supported standards mapping is available. |

For example, axe maps `color-contrast` to WCAG 1.4.3. Axe tags
`heading-order`, `region`, and `page-has-heading-one` as best-practice rules,
so the reports must not present those findings as confirmed WCAG violations.

## Likely Root Causes

One shared component can produce the same finding on several routes. Reports
group matching rule and target patterns into likely root causes while retaining
every original occurrence. For example, five contrast findings on the same
active-navigation class may represent one design-token fix across five pages.

This grouping is deterministic but heuristic. It estimates remediation units;
it does not prove that two DOM nodes share the same source implementation.

## Why Confidence Exists

Severity and confidence answer different questions:

| Field | Question | Example |
|---|---|---|
| `severity` | How risky is the issue if it is real? | Missing button name can block screen reader users. |
| `confidence` | How strong is the tooling evidence? | axe found a concrete DOM node and mapped it to WCAG. |

This lets teams triage in a healthier order:

1. High-confidence critical findings.
2. High-confidence warnings.
3. Medium-confidence findings that need source review.
4. Low-confidence findings and adapter health issues.

## Current Confidence Policy

| Source evidence | Confidence | Score | Reason |
|---|---:|---:|---|
| axe finding with selector and WCAG mapping | high | 95 | Rendered DOM evidence plus standards mapping. |
| axe finding with selector but no WCAG mapping | medium | 75 | Concrete DOM evidence, but best-practice or unmapped rule. |
| ESLint accessibility rule with file, line, and WCAG mapping | medium | 80 | Static source evidence plus standards mapping. |
| ESLint accessibility rule with file and line only | medium | 70 | Static source evidence, but no standards mapping. |
| Adapter scan health finding | low | 40 | Useful operational signal, not a validated accessibility violation. |
| Unknown source | low | 50 | Review manually before treating as confirmed. |

These scores are deterministic and intentionally conservative. They are not
machine-learning predictions.

## Issue Categories

Findings are grouped into accessibility families so reports are easier to scan:

```txt
aria
contrast
focus
forms
headings
images
keyboard
landmarks
structure
widgets
best-practice
adapter
other
```

Categories are inferred from WCAG criteria, rule IDs, tags, and messages. They
are meant for triage and reporting, not for legal classification.

## Validation Dataset

Use a small but reproducible corpus before claiming quality improvements:

| Dimension | Minimum |
|---|---:|
| Demo repositories | 3 |
| Frameworks | React, Vue, Angular |
| Pull requests | 20+ |
| Sprints | 4 |
| Reviewers | 2 independent reviewers where possible |

Each reviewed finding should be labeled:

```csv
finding_id,rule_id,source,category,severity,confidence,confidence_score,review_label,review_reason
```

Allowed `review_label` values:

```txt
confirmed
false_positive
duplicate
needs_manual_review
out_of_scope
```

## Metrics

False positive rate:

```txt
false_positive_rate = false_positive_count / max(unique_findings, 1)
```

Confirmed issue rate:

```txt
confirmed_issue_rate = confirmed_issue_count / max(unique_findings, 1)
```

High-confidence precision:

```txt
high_confidence_precision =
  high_confidence_confirmed_count / max(high_confidence_reviewed_count, 1)
```

Developer review load:

```txt
review_load = unique_findings + needs_manual_review_count
```

## Related Work Notes

Paradise on `a11ybob.com` is useful related work because it separates severity
from confidence, documents limitations clearly, and explains findings through
an analyser taxonomy. `a11y-shiftleft-cli` should not copy Paradise code or its
source-level analyser architecture. The practical takeaway for this project is
the reporting discipline: confidence, issue families, suggested fixes, and
honest limitations.

## Reporting Rules

- Report confidence as evidence strength, not severity.
- Keep adapter failures visible but low-confidence.
- Do not claim that automated scans prove complete WCAG conformance.
- Show low-confidence findings as review leads, not confirmed defects.
- Keep raw JSON available so external analysis can reproduce summary numbers.