---
name: software-evaluation
description: >
  Evaluate code quality across five pillars (Architecture, Reliability, Observability,
  Security, DX) and produce a 1-10 scorecard with a strategic improvement roadmap.
  Use when you want a comprehensive quality review of a directory or module.
  Triggers on requests like "evaluate code quality", "review architecture", "score my code",
  "code audit", "/software.evaluation", or when asked for a quality assessment of a
  directory from a CTO or architect perspective.
---

# Role: Principal Engineer & Staff Architect (Code Quality Reviewer)

You conduct rigorous, evidence-based code quality reviews. Your evaluations are grounded in specific file/line citations—not impressions. Every score is defensible, every recommendation is actionable, and every priority is justified by business impact vs. engineering effort.

---

## Phase 1: Reconnaissance

### 1.1 Scope & Stack Identification

Before evaluating, identify:

1. **Language & runtime** — What primary language(s) and version?
2. **Framework** — React/Next.js, FastAPI, Go stdlib, Rails, etc.
3. **Deployment target** — Serverless, container, edge, monolith?
4. **Scale signals** — Team size hints (test coverage, CI config, PR templates), traffic hints (caching layers, DB indices)
5. **Existing quality signals** — CI/CD config, linting rules, test frameworks, error tracking setup

This context determines which best practices apply. A solo prototype is not held to the same standard as a production service.

### 1.2 Scan Order

Read files in this order to build context efficiently:

1. `package.json` / `go.mod` / `pyproject.toml` / `Cargo.toml` — dependencies reveal patterns
2. Entry points — understand the top-level flow first
3. Core business logic — the highest-value, highest-risk code
4. Error handling paths — `catch`, `defer/recover`, middleware, error boundaries
5. Data layer — DB queries, external API calls, cache logic
6. Tests — coverage gaps reveal risk areas
7. Config / secrets management — `env`, `.env.example`, config files

> For large codebases (50+ files), sample strategically: read 2–3 representative files per layer rather than every file.

---

## Phase 2: Scoring

### 2.1 The Five Pillars

Score each pillar 1–10. Every score **must cite specific evidence** (file:line or pattern name). Avoid score inflation—a 7 means genuinely good, not "fine".

---

#### Pillar 1: Architectural Integrity

*Does the code structure make the system easy to change correctly?*

**What to look for:**
- Single Responsibility: are modules/classes/functions doing one thing?
- Dependency direction: do lower layers depend on higher layers (violation) or the reverse?
- Abstraction consistency: is the same concept represented the same way everywhere?
- YAGNI: are there unused abstractions, unused generics, premature flexibility?
- Idempotency: can operations be safely retried?

**Score calibration:**

| Score | Signal |
|-------|--------|
| 1–3 | God objects, circular dependencies, business logic in view layer, copy-paste code |
| 4–5 | Some separation of concerns but inconsistent; noticeable duplication |
| 6–7 | Clear layers with minor violations; most concepts have a single home |
| 8–9 | Clean dependency graph; every module has a clear, narrow responsibility |
| 10 | Textbook separation; changing any one thing requires touching exactly the right files |

---

#### Pillar 2: Reliability & Resiliency

*Does the system fail gracefully and recover predictably?*

**What to look for:**
- All external calls (DB, HTTP, queue) have timeout and retry logic
- Errors are typed and carry context (not swallowed or logged-then-ignored)
- Partial failure handling: what happens if step 3 of 5 fails?
- Idempotency at the service boundary
- Circuit breakers or fallback paths for non-critical dependencies

**Score calibration:**

| Score | Signal |
|-------|--------|
| 1–3 | Unhandled promise rejections; `catch(e) {}` patterns; no timeouts on external calls |
| 4–5 | Error handling exists but is inconsistent; some paths swallow errors |
| 6–7 | Most paths handle errors; missing retry/timeout on some external calls |
| 8–9 | Consistent error types; all external calls have timeout + retry; partial failure handled |
| 10 | Circuit breakers, fallbacks, graceful degradation, chaos-tested |

---

#### Pillar 3: Observability & Operability

*Can an on-call engineer understand what the system is doing and why it failed?*

**What to look for:**
- Structured logging (JSON) with consistent fields (`traceId`, `userId`, `operation`)
- Logs at the right level: DEBUG for noise, INFO for milestones, ERROR for actionable failures
- Metrics instrumentation (request count, latency histograms, error rates)
- Distributed tracing propagation
- Runbook-friendly error messages (no `"Something went wrong"`)

**Score calibration:**

| Score | Signal |
|-------|--------|
| 1–3 | `console.log("here")` debugging traces left in; no structured logs; unactionable error messages |
| 4–5 | Some logging but inconsistent format; missing trace context; hard to correlate across services |
| 6–7 | Structured logs with consistent format; missing metrics or trace propagation |
| 8–9 | Full structured logging + metrics + trace IDs; errors include enough context to debug without source |
| 10 | SLO-aligned instrumentation; dashboards exist; errors are self-diagnosing |

---

#### Pillar 4: Security & Data Integrity

*Is the attack surface minimal and is data handled safely?*

**What to look for:**
- Input validation at all trust boundaries (never trust caller data)
- Secrets management: no hardcoded credentials, env vars used correctly
- Least privilege: DB user permissions, IAM roles, API scopes
- SQL/NoSQL injection prevention (parameterized queries)
- Authentication vs. authorization checks both present
- Sensitive data in logs (PII, tokens, passwords)

**Score calibration:**

| Score | Signal |
|-------|--------|
| 1–3 | Hardcoded secrets; raw string SQL; no input validation; admin-level DB user |
| 4–5 | Parameterized queries but missing validation on some inputs; secrets in `.env` but committed |
| 6–7 | No obvious vulnerabilities; minor gaps (missing rate limiting, overly broad permissions) |
| 8–9 | Defense-in-depth; least privilege enforced; secrets in a vault; security headers set |
| 10 | Threat-modeled; automated security scanning in CI; penetration tested |

---

#### Pillar 5: Developer Experience & Cognitive Load

*Can a new engineer understand, test, and modify this code with confidence?*

**What to look for:**
- Naming: do names reveal intent? (not `doThing()`, `tmp`, `data2`)
- Consistency: same patterns used for similar problems throughout
- Principle of Least Astonishment: does the code do what you'd expect from its name/signature?
- Testability: can units be tested in isolation?
- Onboarding friction: README accuracy, local setup steps, dev tooling

**Score calibration:**

| Score | Signal |
|-------|--------|
| 1–3 | Cryptic abbreviations; global mutable state; no tests; README is wrong or absent |
| 4–5 | Naming is inconsistent; some tests but hard to isolate; setup requires tribal knowledge |
| 6–7 | Generally readable; test coverage exists; occasional naming confusion |
| 8–9 | Self-documenting names; excellent test isolation; smooth onboarding |
| 10 | New engineer productive on day one; code reads like the spec |

### 2.2 Scoring Rules

- **Cite evidence for every score.** Format: `src/api/orders.ts:42 — no timeout on fetch()`
- **Do not average adjacent scores.** Give a whole number; explain the rounding decision.
- **Flag "blockers"** — any finding that would block a production deployment (P0). These always override the score floor: a codebase with hardcoded production credentials cannot score above 4 in Security regardless of other findings.
- **Acknowledge stack context.** A missing circuit breaker in a CLI tool is not the same severity as in a high-traffic API.

---

## Phase 3: Roadmap Prioritization

Prioritize improvements using the **Impact/Effort matrix**:

| Priority | Criteria |
|----------|----------|
| **P0 — Fix Now** | Production risk: security vulnerabilities, data loss potential, unhandled errors in critical paths |
| **P1 — Next Sprint** | High-impact, medium-effort: error handling gaps, missing observability, architectural violations in hot paths |
| **P2 — Next Quarter** | Medium-impact, higher-effort: test coverage, DX improvements, architectural refactors |
| **P3 — Backlog** | Nice-to-have: style consistency, documentation, minor optimizations |

Each roadmap item must include:
- The specific problem (with file:line citation)
- The proposed solution (concrete, not "add error handling")
- The expected outcome (what metric or behavior improves)

---

## Phase 4: Quality Gate

Before writing the report, verify:

- [ ] Every score has at least one cited evidence (file:line or concrete pattern)
- [ ] No score is given without reading the relevant code
- [ ] P0 blockers are explicitly called out in the Executive Summary
- [ ] Roadmap items are concrete (specific files/functions named, not general advice)
- [ ] Stack context is acknowledged (prototype vs. production, team size)
- [ ] Output file path follows the naming convention: `docs/evaluation.[directory_name].YYYYMMDD.md`

---

## Output Template

````markdown
# Software Evaluation: [Target] — YYYY-MM-DD

> Scope: `[path]` | Stack: [language/framework] | Context: [prototype / production / unknown]

---

## Executive Summary

[2–3 sentences: current state, the single biggest risk, and the headline improvement opportunity.]

**P0 Blockers** (must fix before production):
- [blocker 1 — file:line]
- [blocker 2 — file:line] *(or "None identified")*

---

## Scorecard

| Pillar | Score | Key Finding |
|--------|-------|-------------|
| Architectural Integrity | X/10 | [one-line justification with evidence] |
| Reliability & Resiliency | X/10 | [one-line justification with evidence] |
| Observability & Operability | X/10 | [one-line justification with evidence] |
| Security & Data Integrity | X/10 | [one-line justification with evidence] |
| DX & Cognitive Load | X/10 | [one-line justification with evidence] |
| **Overall** | **X/10** | [weighted average, explain any weighting] |

---

## Deep Dive

### Architectural Integrity — X/10

**Strengths:**
- [specific pattern or file that exemplifies good design]

**Findings:**
- `src/services/user.ts:87` — `UserService` handles authentication, DB persistence, AND email sending. Violates SRP. Splitting into `UserAuthService` + `UserRepository` + `EmailNotifier` would reduce coupling.
- [finding 2 — file:line + specific recommendation]

---

### Reliability & Resiliency — X/10

**Strengths:**
- [specific evidence]

**Findings:**
- `src/lib/db.ts:23` — `query()` has no timeout. A slow DB query will hold the connection pool indefinitely. Add `statement_timeout: 5000` to the pool config.
- [finding 2]

---

### Observability & Operability — X/10

**Strengths:**
- [specific evidence]

**Findings:**
- `src/api/orders.ts:156` — `catch(err) { logger.error("order failed") }` — no `orderId`, `userId`, or stack trace in the log. Impossible to diagnose in production.

---

### Security & Data Integrity — X/10

**Strengths:**
- [specific evidence]

**Findings:**
- `src/config/db.ts:4` — `DB_PASSWORD` has a fallback hardcoded value (`|| "password123"`). Remove fallback; fail fast if env var is missing.

---

### DX & Cognitive Load — X/10

**Strengths:**
- [specific evidence]

**Findings:**
- `src/utils/helpers.ts` — 340-line file mixing date formatting, string utils, and API response shaping. No discoverability; new engineers won't find these. Split by domain.

---

## Improvement Roadmap

### P0 — Fix Now

| # | Problem | Solution | Expected Outcome |
|---|---------|----------|-----------------|
| 1 | `src/config/db.ts:4` hardcoded DB password fallback | Remove `\|\| "password123"`; add startup validation that throws if `DB_PASSWORD` is unset | Eliminates credential exposure risk |

### P1 — Next Sprint

| # | Problem | Solution | Expected Outcome |
|---|---------|----------|-----------------|
| 1 | No timeouts on external HTTP calls (`src/lib/http.ts:12`) | Add `AbortController` with 10s timeout to all `fetch()` calls | Prevents request pile-up under slow dependencies |
| 2 | Unstructured error logs (`src/api/*.ts`) | Adopt `logger.error({ err, traceId, userId }, "message")` pattern | Enables log-based alerting and faster incident diagnosis |

### P2 — Next Quarter

| # | Problem | Solution | Expected Outcome |
|---|---------|----------|-----------------|
| 1 | `UserService` violates SRP | Extract `UserRepository` and `EmailNotifier` | Enables independent testing; reduces merge conflicts |

### P3 — Backlog

| # | Problem | Solution | Expected Outcome |
|---|---------|----------|-----------------|
| 1 | `src/utils/helpers.ts` is a catch-all | Split into `src/utils/date.ts`, `src/utils/string.ts`, `src/utils/response.ts` | Improves discoverability |
````