---
id: ins_open-coding-then-axial-coding
operator: Hamel Husain
operator_role: Independent ML consultant and Berkeley PhD researcher
source_url: https://www.lennysnewsletter.com/p/why-ai-evals-are-the-hottest-new-skill
source_type: podcast
source_title: Evals as error analysis, the benevolent dictator, LLM judges
source_date: 2026-04-28
captured_date: 2026-05-01
domain: [ai-native, research-discovery, engineering]
lifecycle: [ai-workflow]
maturity: applied
artifact_class: playbook
score: { originality: 4, specificity: 5, evidence: 4, transferability: 5, source: 5 }
tier: A
related: [ins_evals-are-data-analysis-on-llm-apps, ins_benevolent-dictator-not-committee]
raw_ref: raw/podcasts/hamel-husain-shreya-shankar--evals-error-analysis--2026-04-28.md
---

# Sample 100+ traces, write one free-form note per trace, let an LLM cluster the notes: humans first, machines second

## Claim

Run trace review as a two-stage pipeline: a human samples 100+ traces and writes a free-form note on the first thing wrong with each (open coding); then an LLM groups those notes into failure-mode buckets (axial coding). An LLM cannot do the open-coding pass for you because it lacks product context; humans cannot scale the categorisation pass.

## Mechanism

Open coding captures domain-specific failures that an LLM judge would miss because the LLM has no privileged access to product reality (e.g., "we don't actually offer virtual tours": a hallucination invisible without that context). Axial coding is pure clustering, which LLMs do reliably. The split assigns each task to the actor that can do it, and the resulting pivot table converts qualitative review into quantitative priority. A minimal sketch of the full loop appears at the end of this card.

## Conditions

Holds when:

- A domain expert can read traces and recognise wrong outputs.
- The traces are recent enough that current product context applies.
- The team can tolerate the upfront human time for the first pass.

Fails when:

- Reviewers are not domain experts and their notes mislead the categorisation.
- The traces span product changes such that "wrong" varies by date.
- The team cuts the human step to save time and the LLM hallucinates the categories.

## Evidence

> "When you're doing this open coding... appoint one person whose taste that you trust."

> "I would bet money... if I put that into ChatGPT and asked, 'Is there an error?' it would say, 'No, did a great job.'"

Stopping rule: theoretical saturation, not a fixed count. You stop once new traces stop yielding new categories, which in practice happens after 15–60 traces (a toy formalisation appears in the sketch below).

· Hamel Husain & Shreya Shankar on Lenny's Podcast, 2026-04-28

## Signals

- Failure-mode pivot table covers >80% of observed problems with a small set of clusters.
- Engineering work is prioritised by category counts, not by squeaky-wheel reports.
- The same pipeline runs weekly without each cycle starting from scratch.

## Counter-evidence

Coding-agent teams (Claude Code, Codex) operate with much lighter eval discipline because the developer is also the user; the dogfood loop closes inside one head. That pattern does not generalise to products where the buyer is not the builder.

## Cross-references

- `ins_evals-are-data-analysis-on-llm-apps`, why this pipeline matters
- `ins_benevolent-dictator-not-committee`, who runs the human step
- `ins_llm-as-judge-binary-not-likert`, what the categories become
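
## Sketch

A minimal sketch of the two-stage pipeline from the Claim and Mechanism sections, not an implementation from the source. The trace-note shape, the `llm_complete` callable, and the tab-separated output format are all illustrative assumptions; swap in your own trace store and model client.

```python
from collections import Counter

# Stage 1 (open coding): a human reads each sampled trace and records one
# free-form note on the first thing wrong. These notes are human input to
# the pipeline, never LLM output. The record shape is a hypothetical example.
open_codes = [
    {"trace_id": "t-014", "note": "offered a virtual tour we don't provide"},
    {"trace_id": "t-021", "note": "asked for the move-in date twice"},
    {"trace_id": "t-038", "note": "invented a tour option again"},
    # ...one note per reviewed trace; 100+ in a full pass
]

AXIAL_PROMPT = (
    "You will receive one free-form error note per line as <trace_id>\\t<note>.\n"
    "Group the notes into a small set of failure-mode categories and return\n"
    "one line per note in the form <trace_id>\\t<category>."
)


def axial_code(notes, llm_complete):
    """Stage 2 (axial coding): an LLM clusters the human notes into buckets.

    `llm_complete` is a placeholder: any callable that takes a prompt string
    and returns the model's text response (hosted API, local model, etc.).
    """
    body = "\n".join(f"{n['trace_id']}\t{n['note']}" for n in notes)
    response = llm_complete(f"{AXIAL_PROMPT}\n\n{body}")
    pairs = []
    for line in response.strip().splitlines():
        trace_id, category = line.split("\t", 1)
        pairs.append((trace_id, category.strip()))
    return pairs


def failure_mode_pivot(pairs):
    """The pivot table: category counts become the prioritisation signal."""
    return Counter(category for _, category in pairs)
```

Sorting `failure_mode_pivot(...)` by count gives the priority order: the category covering the largest share of notes is the next engineering target, which is what converts the qualitative review into quantitative priority.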
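
The stopping rule under Evidence can be formalised as a toy check, assuming the reviewer tags each note with a rough category as they go; in practice the saturation call is informal. The window size here is an illustrative default, not a number from the source.

```python
def reached_saturation(categories_in_order, window=15):
    """True when the last `window` reviewed traces added no new category.

    `categories_in_order` lists the rough category of each reviewed trace,
    in review order. The source's guidance is to stop when new traces stop
    yielding new categories, which typically happens after 15-60 traces.
    """
    if len(categories_in_order) <= window:
        return False
    seen_before_window = set(categories_in_order[:-window])
    return set(categories_in_order) == seen_before_window
```

For example, `reached_saturation(["hallucination", "repetition"] + ["hallucination"] * 15)` returns True: fifteen straight traces re-hit known failure modes, so the open-coding pass can stop.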