---
name: finetune-generate
description: Use when generating synthetic training data for multi-turn conversation fine-tuning. Triggers - have design artifacts ready, need to generate conversations, ready to assess quality. Requires finetune-design first.
---

# Fine-tune Generate

Iteratively generate and filter training data until quality stabilizes.

## Prerequisites

Complete [finetune-design](../finetune-design/SKILL.md) first. You need:

- [ ] Model choice and token constraints
- [ ] Input taxonomy
- [ ] Evaluation rubric with calibration examples
- [ ] Persona template
- [ ] User simulator, assistant, and system prompts

## Outputs

By the end of this phase, you will have:

- [ ] `training_data.jsonl` — Filtered, sliced training examples
- [ ] `generation_stats.md` — Pass rates, criterion breakdown, iterations
- [ ] `prompt_versions/` — History of prompt iterations

---

## The Core Loop

**This is the most important part of the entire pipeline.**

```
┌────────────────────────────────────────────────────────┐
│  TIGHT LOOP (5 transcripts per iteration)              │
│                                                        │
│  1. Generate 5 transcripts                             │
│  2. Assess with rubric (all backends)                  │
│  3. HUMAN REVIEWS both transcripts AND assessments     │
│  4. Iterate based on human judgment                    │
│  5. Repeat until ≥70% pass rate AND human satisfied    │
│                                                        │
│  Then: Scale to full volume                            │
└────────────────────────────────────────────────────────┘
```

### Why 5 Transcripts?

- Small enough for a human to actually READ each one carefully
- Fast feedback (minutes, not hours)
- See patterns without wasting compute
- Iterate while context is fresh

### Why Human-in-the-Loop? (Non-Negotiable)

**Human review is required, not optional.** The human reviews BOTH transcripts AND assessment results:

| Human reviews... | Looking for... |
|------------------|----------------|
| Transcripts | Quality issues the rubric might miss |
| Assessment results | False positives (passed but shouldn't have) |
| Assessment results | False negatives (failed but seems fine) |
| Both together | Gaps in what the rubric even checks |

**Without human review:**

- You're optimizing against a potentially broken metric
- False positives silently corrupt training data
- Rubric blind spots never get discovered

### Red Flags: Rationalizations to Resist

| Rationalization | Reality |
|-----------------|---------|
| "Human review slows us down" | Skipping review = optimizing against a broken metric. 1 hour of review saves days of bad data. |
| "Pass rate is high, must be fine" | A high pass rate with a single backend misses 20-30% of issues. Multi-backend + human review required. |
| "We can add calibration examples later" | Without calibration examples, backends disagree silently. Add them during design. |
| "The rubric is complete" | Rubrics evolve (e.g., 12→18 criteria). New failure modes emerge. |
| "One assessor backend is enough" | A single backend gave transcript 1000 a perfect 1.0; other backends caught 4 failures. |
| "Let's just scale and filter later" | Scaling before a 70% pass rate wastes compute. Fix prompts first. |

**If you catch yourself using any of these rationalizations: STOP. Follow the gates.**

### Dual Iteration

You iterate on TWO things, not one:

| When you see... | Iterate on... |
|-----------------|---------------|
| Transcript quality issues | Generation prompts (user-sim, assistant) |
| Assessment seems wrong | Assessor prompt, criteria wording |
| Backend disagreement | Calibration examples for that criterion |
| Missing failure mode | Add new criterion to rubric |
| Pass rates high but something feels off | Run expert role-play critique |

**The rubric is never "done."** In one project, criteria evolved: 12 → 14 → 16 → 17 → 18.

**Expert role-play critique is required** — periodically have Claude role-play domain experts to critique your rubric and a small transcript batch directly. This catches blind spots invisible from your own perspective. See [assessment-guide.md#expert-role-play-critique](assessment-guide.md#expert-role-play-critique).

---

## Workflow

### Step 1: Tight Iteration Loop

For each batch of 5 transcripts (a code sketch follows the list):

1. **Generate** 5 transcripts using two-agent simulation
2. **Assess** with rubric using multiple backends (Claude, Gemini, GPT-5)
3. **Human reviews** both transcripts and assessments:
   - Read each transcript: Is this actually good?
   - Read each assessment: Did the rubric catch what matters?
   - Note: false positives, false negatives, missing criteria
4. **Iterate** based on human judgment:
   - Fix generation prompts (if transcript quality issues)
   - Fix assessor prompt/criteria (if assessment issues)
   - Add calibration examples (if edge cases found)
5. **Repeat** until quality stabilizes
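
A minimal sketch of one pass through this loop, assuming hypothetical helpers: `generate_transcript` and `assess_transcript` stand in for your own generation and assessment scripts, personas are dicts with an `id` field, and each assessment is assumed to return a dict with a `pass` flag. See [code/SETUP-REFERENCE.md](../code/SETUP-REFERENCE.md) for the real script templates.

```python
import json
from pathlib import Path

BATCH_SIZE = 5                            # small enough for a human to read every transcript
BACKENDS = ["claude", "gemini", "gpt-5"]  # assess with multiple backends, not just one

def run_iteration(personas: list[dict], out_dir: Path) -> float:
    """One pass of the tight loop. Returns the batch pass rate; a human reviews next."""
    out_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for persona in personas[:BATCH_SIZE]:
        # 1. Generate one transcript via two-agent simulation (hypothetical helper)
        transcript = generate_transcript(persona)
        # 2. Assess against the rubric with every backend (hypothetical helper)
        assessments = {b: assess_transcript(transcript, backend=b) for b in BACKENDS}
        passed = all(a["pass"] for a in assessments.values())
        record = {"persona": persona["id"], "pass": passed,
                  "assessments": assessments, "transcript": transcript}
        results.append(record)
        # Checkpoint immediately so a crash never loses finished work
        (out_dir / f"{persona['id']}.json").write_text(json.dumps(record, indent=2))
    pass_rate = sum(r["pass"] for r in results) / len(results)
    # 3./4. Stop here: read the transcripts AND the assessments before touching any prompts
    print(f"Batch pass rate: {pass_rate:.0%}. Review {out_dir} before the next iteration.")
    return pass_rate
```
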
**Gate (before scaling):**

| Condition | Action |
|-----------|--------|
| ≥70% pass rate AND human satisfied | Proceed to scale |
| 50-70% OR human sees issues | Continue iterating |
| <50% | Major revision needed |

**Reference:** [generation-guide.md](generation-guide.md), [assessment-guide.md](assessment-guide.md)

---

### Step 2: Scale Generation

Once the tight loop stabilizes:

1. **Generate target volume** (100+ transcripts)
2. **Continue assessment** with the same multi-backend approach
3. **Periodic human spot-checks** (every 20-50 transcripts)
4. **Track statistics** (pass rate, criterion breakdown)

**Warning signs during scaling:**

- Pass rate drifting down → Revisit prompts
- New failure patterns emerging → Add criteria
- Perfect scores (1.0) → Suspiciously high, investigate

---

### Step 3: Audit Patterns

Run quantitative analysis on the full dataset to catch issues invisible in spot-checks:

| Check | Red Flag | Action |
|-------|----------|--------|
| Phrase repetition | Any phrase in >50% of responses | Add to anti-patterns, regenerate |
| Structural rigidity | 100% same format | Vary response structure |
| Response length ratio | Avg >2x user length | Tighten length constraints |
| Praise distribution | Late responses 2x more praise | Adjust tone consistency |

**Gate:** No audit red flags

**Reference:** [assessment-guide.md#audit-patterns](assessment-guide.md#audit-patterns)

---

### Step 4: Fixup or Reject

For failing transcripts, decide whether to fix or reject:

| Failure Type | Action |
|--------------|--------|
| Soft failures (language, tone) | Attempt fixup with entailment constraint |
| Safety gate failures | Truncate at failure point or reject entirely |
| Structural issues | Usually reject |

**Entailment constraint:** The fixed response must naturally lead to the user's next message. If the fix breaks continuity → truncate instead.

**If >30% need fixup:** Generation prompts need revision.

**Reference:** [assessment-guide.md#fixup-strategy](assessment-guide.md#fixup-strategy)

---

### Step 5: Slice for Training

Create training examples from full transcripts:

```
50-turn transcript → ~8-10 training examples via slicing
```

**Slicing strategy** (sketched in code below):

- Random slice points (seeded by transcript ID for reproducibility)
- Minimum 3 exchanges before first slice
- 2-5 exchange gaps between slices
- Always include final turn

**Token validation:**

- Each slice must be under your token limit (e.g., 16K)
- Long transcripts may need truncation

**Leakage prevention:**

- Split by transcript/persona FIRST
- Then slice within each split
- Never let slices from same transcript in both train and validation
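
One way the rules above might be implemented, as a minimal sketch: messages are assumed to be `{"role", "content"}` dicts, an "exchange" is a user message plus the assistant reply, each slice is a conversation prefix ending on an assistant turn, and `count_tokens` is a hypothetical helper standing in for your tokenizer. See the slice.py template in [code/SETUP-REFERENCE.md](../code/SETUP-REFERENCE.md).

```python
import random

MIN_EXCHANGES_BEFORE_FIRST_SLICE = 3
GAP_RANGE = (2, 5)        # exchanges between consecutive slice points
TOKEN_LIMIT = 16_000      # example limit; use the constraint from your design phase

def slice_transcript(transcript_id: str, messages: list[dict]) -> list[list[dict]]:
    """Cut one full transcript into several prefix slices for training."""
    rng = random.Random(transcript_id)  # seeded by transcript ID for reproducibility
    n_exchanges = sum(1 for m in messages if m["role"] == "assistant")
    # Pick slice points: start after 3 exchanges, step 2-5 exchanges, keep the final turn.
    points, point = [], MIN_EXCHANGES_BEFORE_FIRST_SLICE
    while point < n_exchanges:
        points.append(point)
        point += rng.randint(*GAP_RANGE)
    points.append(n_exchanges)
    examples = []
    for p in points:
        # Keep everything up to and including the p-th assistant reply.
        seen, cut = 0, 0
        for i, m in enumerate(messages):
            if m["role"] == "assistant":
                seen += 1
            if seen == p:
                cut = i + 1
                break
        example = messages[:cut]
        if example and count_tokens(example) <= TOKEN_LIMIT:  # hypothetical token counter
            examples.append(example)
    return examples
```
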
**Reference:** [assessment-guide.md#slicing-strategy](assessment-guide.md#slicing-strategy)

**Optional:** Use `hugging-face-dataset-creator` skill when ready to push `training_data.jsonl` to HF Hub.

---

## Infrastructure

### Checkpointing

Write progress after each transcript, not at the end:

```python
for persona in personas:
    transcript = generate_transcript(persona)
    save_immediately(transcript)  # Don't batch
```

### Retry with Backoff

API failures will happen. Use exponential backoff:

- Claude: 7 attempts, 1-hour max wait
- Google: Extract retry delay from error message
- OpenAI: Standard exponential backoff

### Progress Tracking

Track throughout generation:

- Transcripts generated / target
- Transcripts assessed / generated
- Pass rate (rolling and cumulative)
- Criterion failure breakdown

**Reference:** [assessment-guide.md#infrastructure](assessment-guide.md#infrastructure)

---

## Resources

| Resource | What It Contains |
|----------|------------------|
| [code/SETUP-REFERENCE.md](../code/SETUP-REFERENCE.md) | Script templates: generate.py, assess.py, slice.py |
| [code/infrastructure.py](../code/infrastructure.py) | Copy-paste ready: LLM backend, retry strategies, checkpointing |
| [examples/therapy-domain.md](../examples/therapy-domain.md) | Complete therapy example: prompts, flaw patterns, criteria |

---

## Done When

- [ ] Target training example count reached
- [ ] Pass rate stable across last 2-3 batches (≥70%)
- [ ] Human satisfied with transcript quality
- [ ] Audit patterns within thresholds
- [ ] `training_data.jsonl` validated

---

## Next Phase

→ [finetune-train](../finetune-train/SKILL.md)