---
name: langfuse-optimization
description: Analyzes writing-ecosystem traces to fix style.yaml, template.yaml, and tools.yaml based on quality issues found in production runs.
allowed-tools: "*"
---

# Writing Ecosystem Config Optimizer

Analyzes Langfuse traces to identify what's wrong with your **style.yaml**, **template.yaml**, and **tools.yaml** files, then tells you exactly how to fix them.

## When to Use This Skill

- "Analyze traces and fix my config files"
- "My checks are failing - what's wrong with style.yaml?"
- "Optimize case 0001 configuration"
- "Why is the research node selecting wrong tools?"

## Required Environment Variables

- `LANGFUSE_PUBLIC_KEY`: Your Langfuse public API key
- `LANGFUSE_SECRET_KEY`: Your Langfuse secret API key
- `LANGFUSE_HOST`: Langfuse host URL (default: https://cloud.langfuse.com)

## What This Skill Does

**Input**: User request + case ID

**Output**: Specific fixes for style.yaml, template.yaml, tools.yaml

**3-Step Process**:
1. **Retrieve traces** from Langfuse for the specified case
2. **Extract problems** from trace data (check failures, tool errors, structure issues)
3. **Generate fixes** with exact YAML changes to make

## Workflow

### Step 1: Get User Request & Case ID

Ask for:
- **Case ID** (e.g., "0001", "0002", "The Prep")
- **Time range** (default: last 7 days)
- **Specific focus** (optional: "just style checks", "just tools", "everything")

### Step 2: Retrieve Trace Data

#### Option A: Unified Retrieval (Recommended - Simpler)

Use the unified helper to get traces and observations in one command:

```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Get last 5 traces with observations for a case (using tags - RECOMMENDED)
python3 helpers/retrieve_traces_and_observations.py \
  --limit 5 \
  --tags "case:0001" \
  --output /tmp/langfuse_analysis/bundle.json

# Filter by metadata (e.g., specific case_id)
python3 helpers/retrieve_traces_and_observations.py \
  --limit 3 \
  --metadata case_id=0001 \
  --output /tmp/langfuse_analysis/case_0001_bundle.json

# Get traces only (skip observations for faster retrieval)
python3 helpers/retrieve_traces_and_observations.py \
  --limit 10 \
  --no-observations \
  --output /tmp/langfuse_analysis/traces_only.json

# Save separate files + unified bundle
python3 helpers/retrieve_traces_and_observations.py \
  --limit 5 \
  --output /tmp/langfuse_analysis/bundle.json \
  --traces-output /tmp/langfuse_analysis/traces.json \
  --observations-output /tmp/langfuse_analysis/observations.json

# **RECOMMENDED**: Strip bloat for 95% size reduction
python3 helpers/retrieve_traces_and_observations.py \
  --tags "case:0001" \
  --limit 1 \
  --filter-essential \
  --output /tmp/langfuse_analysis/filtered_bundle.json
```

**Output**: Single JSON bundle with:
- Query parameters (for reproducibility)
- Traces list
- Observations grouped by trace_id
- Trace count and IDs

**Size Optimization Flags**:

**`--filter-essential`** (Config Optimization):
- Strips: `facts_pack` (391KB+), `validation_report` (45KB+), long text fields
- Replaces with compact summaries (facts count, size, failed checks)
- **Reduction**: ~95% (4.2MB → 200KB)
- **Use case**: Analyzing style.yaml, template.yaml, tools.yaml

**`--filter-research-details`** (Additional Reduction):
- Strips: `structured_citations` (34KB → 700B), `step_status` (8KB → 200B)
- Replaces with counts, domains, tools used, success/failure stats
- **Reduction**: ~70% additional (on top of essential)
- **Use case**: When citation URLs and detailed step logs are not needed

**`--filter-all`** (Maximum Reduction):
- Convenience flag: enables both `--filter-essential` + `--filter-research-details`
- **Total reduction**: ~96% (4.2MB → 30KB per trace)
- **Use case**: Large-scale trace collection, config optimization

**Comparison**:
- Without filtering: 4.2MB per trace (slow, all raw data)
- With `--filter-essential`: 200KB per trace (fast, config analysis)
- With `--filter-all`: 30KB per trace (fastest, minimal size)

#### Option A.1: Single Trace Retrieval (Fastest for Individual Analysis)

When you know the exact trace ID you want to analyze, use the single trace helper:

```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Essential filtering only (95% reduction)
python3 helpers/retrieve_single_trace.py 8fda46d7ac626327396d1a7962690807 --filter-essential

# Maximum filtering (96% reduction) - RECOMMENDED for most cases
python3 helpers/retrieve_single_trace.py 8fda46d7ac626327396d1a7962690807 --filter-all

# Essential + Research details (custom combination)
python3 helpers/retrieve_single_trace.py 8fda46d7ac626327396d1a7962690807 \
  --filter-essential --filter-research-details \
  --output /tmp/langfuse_analysis/single_trace.json

# Without filtering (keep all raw data)
python3 helpers/retrieve_single_trace.py abc123 --output /tmp/langfuse_analysis/trace.json
```

**Benefits over multi-trace retrieval:**
- **10-20x faster**: Only fetches one trace instead of all traces for a case
- **Lower API usage**: Fewer API calls, less rate limiting
- **Cleaner workflow**: No need for client-side extraction
- **Same structure**: Output is identical to `retrieve_traces_and_observations.py`

**Output**: Same bundle structure as Option A (compatible with all analysis tools)

**When to use:**
- Analyzing a specific trace ID from the Langfuse dashboard
- Deep-diving into one workflow run
- Following up on a specific error or issue
- Comparing before/after changes to config

#### Option B: Two-Step Retrieval (Advanced - More Control)

For scenarios where you need separate retrieval stages:

```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Step 1: Get traces for a specific case (using tags)
python3 helpers/retrieve_traces.py \
  --tags "case:0001" \
  --days 7 \
  --limit 10 \
  --output /tmp/langfuse_analysis/traces.json

# Step 2: Get observations for those traces
python3 helpers/retrieve_observations.py \
  --trace-ids-file /tmp/langfuse_analysis/traces.json \
  --output /tmp/langfuse_analysis/observations.json

# Step 2 (with filtering): Strip bloat for 95% size reduction
python3 helpers/retrieve_observations.py \
  --trace-ids-file /tmp/langfuse_analysis/traces.json \
  --filter-essential \
  --output /tmp/langfuse_analysis/filtered_observations.json
```

### Step 2B: Retrieve Annotation Queue Data (Optional)

If you have human annotations/feedback in Langfuse annotation queues:

```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Get all annotated items from a queue
python3 helpers/retrieve_annotations.py \
  --queue-id \
  --output /tmp/langfuse_analysis/annotations.json

# Get only completed annotations (reviewed items)
python3 helpers/retrieve_annotations.py \
  --queue-id \
  --status completed \
  --output /tmp/langfuse_analysis/annotations.json

# Limit to the 100 most recent items
python3 helpers/retrieve_annotations.py \
  --queue-id \
  --limit 100
```

**What you get**:
- Human comments/notes on traces
- Manual scores assigned by reviewers
- Issues flagged during quality review
- Trace IDs linked to annotations

**How to use in analysis**:
- Cross-reference annotation comments with trace data
- Identify patterns in human-flagged issues
- Prioritize fixes based on manual feedback frequency
- Check whether automated checks catch the same issues humans flag

### Step 2.5: Using Metadata Filters

Filter traces by metadata fields to focus analysis on specific subsets:

```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Single metadata filter - analyze specific case
python3 helpers/retrieve_traces_and_observations.py \
  --metadata case_id=0001 \
  --limit 10 \
  --output /tmp/langfuse_analysis/case_0001.json

# Multiple filters (AND logic - trace must match ALL)
python3 helpers/retrieve_traces_and_observations.py \
  --metadata case_id=0001 profile_name="Stock Deep Dive" \
  --limit 5 \
  --output /tmp/langfuse_analysis/filtered.json

# Filter on a different metadata field (here: workflow version)
python3 helpers/retrieve_traces_and_observations.py \
  --metadata workflow_version=1 \
  --output /tmp/langfuse_analysis/v1_workflows.json
```

**How it works**:
- Retrieves all traces from Langfuse within the time range
- Applies client-side filtering by metadata fields
- Returns only traces matching ALL specified filters
- Limit is applied AFTER filtering (so you get the requested number of matching traces)

**Common Use Cases**:
- **Analyze specific case**: `--metadata case_id=0001`
- **Compare workflow versions**: `--metadata workflow_version=1` vs `--metadata workflow_version=2`
- **Profile-specific issues**: `--metadata profile_name="The Prep"`
- **Combine filters**: `--metadata case_id=0001 workflow_version=2` (both must match)

**Tips**:
- Metadata values are case-sensitive strings
- Use exact matches only (no wildcards/regex)
- Check available metadata: run without filter first, inspect trace metadata
- Common fields: `case_id`, `profile_name`, `workflow_version`

### Step 3: Extract Problems from Traces

Read `/tmp/langfuse_analysis/bundle.json` (or `observations.json` if using two-step retrieval) and extract:

#### A. Style Check Failures (for style.yaml)

From **edit node** observations, find:
- Which checks failed
- Failure rates (how often each check fails)
- Scores vs thresholds
- Example content that failed

**Map to style.yaml issues**:
- **Vague rubric**: Check description unclear, LLM can't grade consistently
- **Wrong threshold**: Check fails too often (>30%) or never fails
- **Missing check**: Quality issue exists but no check catches it
- **Wrong weight**: Check importance (MINOR/MAJOR/CRITICAL) doesn't match impact

#### B. Template Problems (for template.yaml)

From **write node** observations, find:
- Missing required sections
- Word count violations
- Structure mismatches (bullets vs narrative)

**Map to template.yaml issues**:
- Unclear section descriptions
- Unrealistic word limits
- Missing section definitions

#### C. Tool Selection Issues (for tools.yaml)

From **research node** observations, find:
- Which tools were selected
- Tool failures (API errors, timeouts)
- Wrong tool for topic (should have used X but used Y)
- Loop expansion failures (`for_each` errors)

**Map to tools.yaml issues**:
- Tool not available in pattern
- Wrong research pattern selected
- Loop directive path incorrect
- Missing fallback configuration

### Step 4: Generate Config Fixes

For each problem, create a recommendation:

````markdown
## Fix #N: [Problem description]

**File**: `writing_ecosystem/config/cases/XXXX/[style|template|tools].yaml`

**Problem**:
- [Specific issue found in traces]
- [Evidence: X failures in Y traces]

**Current Config**:
```yaml
[Show current YAML]
```

**Fixed Config**:
```yaml
[Show corrected YAML with inline comments explaining changes]
```

**Why this fixes it**:
- [Explanation of root cause]
- [Expected improvement]
````

### Step 5: Present Simple Report

````markdown
# Config Optimization Report - Case XXXX

**Traces Analyzed**: X traces from [date range]

---

## Problems Found

### style.yaml Issues
1. ❌ `tone_consistency` check failing 30% (vague rubric)
2. ❌ `ttr_constraint` threshold too strict (16% failures)
3. ⚠️ `formality` check never fails (threshold too loose)

### template.yaml Issues
1. ❌ "Context" section missing description
2. ❌ Word limit conflict: max 100 words but needs 5 bullets

### tools.yaml Issues
1. ❌ Research pattern missing `finnhub` for financial topics
2. ❌ Loop directive path wrong: `user.portfolio.symbols` (should be `user.portfolio.summary.symbols`)

---

## Recommended Fixes

### Fix #1: Improve tone_consistency Rubric (style.yaml)

**Problem**: Failing 30% of traces - rubric too vague

**Current**:
```yaml
signatures:
  tone_consistency:
    rubric: "Assess whether tone is consistent. Score 1-10."
    threshold: 7.0
```

**Fixed**:
```yaml
signatures:
  tone_consistency:
    rubric: |
      Check tone consistency across:
      1. FORMALITY: Professional terms only (not "pretty big", "kinda")
      2. OBJECTIVITY: Neutral facts (not "shocked markets")
      3. EXPERTISE: Assumes financial literacy

      Score 9-10: Perfect consistency
      Score 7-8: 1-2 minor lapses
      Score 5-6: Noticeable shifts
      Score <5: Multiple violations
    threshold: 7.0
```

**Why**: Specific dimensions + examples → LLM can grade consistently

---

### Fix #2: Lower TTR Threshold (style.yaml)

**Problem**: Failing 16% - too strict for financial jargon

**Current**:
```yaml
constraints:
  ttr_constraint:
    threshold: 0.55
```

**Fixed**:
```yaml
constraints:
  ttr_constraint:
    threshold: 0.50  # Financial terms naturally repeat
```

**Why**: Domain terminology (Fed, QE, yield curve) lowers lexical diversity

---

### Fix #3: Add Finnhub to Research Pattern (tools.yaml)

**Problem**: Financial topics not getting market data

**Current**:
```yaml
research_patterns:
  default: general_research
  patterns:
    general_research:
      steps:
        - tool: perplexity
```

**Fixed**:
```yaml
research_patterns:
  default: financial_research  # Changed default for case 0001
  patterns:
    financial_research:
      steps:
        - tool: perplexity
          save_as: news
        - tool: finnhub  # Added for market data
          input:
            endpoint: company_news
            symbol: "{{topic}}"  # Extract symbol from topic
          save_as: market_data
```

**Why**: Financial topics need both news (perplexity) + data (finnhub)

---

## Implementation

**1. Backup configs**:
```bash
cd writing_ecosystem/config/cases/0001
cp style.yaml style.yaml.backup
cp template.yaml template.yaml.backup
cp tools.yaml tools.yaml.backup
```

**2. Apply fixes**:
- Open each file in editor
- Apply changes from recommendations above
- Save files

**3. Test**:
```bash
python run_workflow.py --case 0001 --topic "Test topic"
# Check Langfuse trace for improvements
```

**4. Monitor**:
- Run 20-30 workflows
- Re-run this analysis
- Compare before/after failure rates

---

## Expected Results

- `tone_consistency` failures: 30% → ~15%
- `ttr_constraint` failures: 16% → ~8%
- Research quality: +20% (adding finnhub)
- Overall pre-flight score: 7.8 → 8.2

---

**Ready to implement?** Let me know which fixes to apply first, or if you want to see more detail on any issue.
````

## Analysis Patterns

### For style.yaml Issues

**Look for**:
1. **High failure rate** (>30%) → Vague rubric or wrong threshold
2. **Zero failures** → Threshold too loose or check not working
3. **Inconsistent scores** → Rubric needs examples and clear criteria
4. **Low edit fix rate** (<50%) → Check unclear about what to fix

**Common fixes**:
- Add specific dimensions to rubrics
- Provide good/bad examples
- Adjust thresholds based on domain (finance vs tech vs general)
- Add deterministic pre-checks for obvious violations

### For template.yaml Issues

**Look for**:
1. **Missing sections** in write node output
2. **Word count violations** (consistently over or under)
3. **Structure mismatches** (bullets vs narrative)

**Common fixes**:
- Add clear section descriptions
- Adjust word limits to realistic values
- Clarify format requirements (when to use bullets vs prose)

### For tools.yaml Issues

**Look for**:
1. **Wrong tool selected** for topic type
2. **Missing tools** for domain (finance needs finnhub)
3. **Loop expansion failures** (path errors in `for_each`)
4. **Tool errors** (API failures, timeouts)

**Common fixes**:
- Add domain-specific tools to patterns
- Fix loop directive paths
- Add fallback patterns
- Update default pattern for case

## Key Principles

### 1. Evidence-Based

Every recommendation must show:
- How many traces failed
- Example content that failed
- Why current config caused the failure

### 2. Specific

No generic advice like "improve rubric" - show EXACT YAML changes with inline comments

### 3. Prioritized

Focus on:
- High-frequency issues first (affects >30% of traces)
- Quick wins (threshold adjustments)
- High-impact changes (missing tools for domain)

### 4. Actionable

Every fix includes:
- Exact file path
- Before/after YAML
- Expected improvement
- How to test

## Troubleshooting

**"No traces found"**:
- Verify case ID is correct
- Check trace naming: `writing-workflow-0001` vs `writing-workflow-001`
- Try a broader filter: `--name "writing-workflow"` to see all cases

**"No check failures in traces"**:
- Workflow may be in fallback mode (no LLM)
- Edit node may have been skipped (pre-flight score >8.5)
- Verify edit node ran in observations

**"Can't identify issue"**:
- Read the actual style.yaml/template.yaml/tools.yaml files
- Compare trace output to config requirements
- Look for mismatches

**"Metadata filter returning no traces"**:
- Verify metadata fields exist in your traces (check raw trace JSON)
- Metadata values are case-sensitive strings
- Use exact matches only (no wildcards/regex)
- Try without metadata filter first to see available metadata fields
- Common fields: `case_id`, `profile_name`, `workflow_version`

## Success Criteria

Good recommendations should:
1. ✅ Show exact YAML before/after
2. ✅ Explain WHY issue occurred (root cause)
3. ✅ Quantify impact (X% failure rate → Y% expected)
4. ✅ Be implementable in <5 min per fix
5. ✅ Focus on top 3-5 issues (not 50 minor ones)

---

**Remember**: This skill is about **fixing config files**, not analyzing architecture. Keep it simple:
1. What's broken in the YAML?
2. Here's the fix
3. Here's why it works
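
The failure-rate heuristics under Analysis Patterns (more than 30% failing suggests a vague rubric or wrong threshold; zero failures suggests a loose threshold or a check that is not running) can be sketched in a few lines of Python. This is a minimal illustration only: the record shape `{"check": ..., "passed": ...}` and the function names `failure_rates` and `triage` are hypothetical, and are not the schema or API of the helper scripts, so real bundle data would first need to be mapped into this form.

```python
from collections import Counter

HIGH_FAILURE_RATE = 0.30  # the ">30%" heuristic from Analysis Patterns


def failure_rates(results):
    """Return {check_name: failed_fraction} for a list of check results.

    Each result is a dict such as {"check": "tone_consistency", "passed": False}.
    This record shape is an assumption made for illustration.
    """
    totals = Counter()
    failures = Counter()
    for result in results:
        totals[result["check"]] += 1
        if not result["passed"]:
            failures[result["check"]] += 1
    return {name: failures[name] / totals[name] for name in totals}


def triage(rate):
    """Map a failure rate to the matching style.yaml suspicion."""
    if rate > HIGH_FAILURE_RATE:
        return "high failure rate: review rubric wording or threshold"
    if rate == 0:
        return "never fails: threshold may be too loose or check not running"
    return "within expected range"


if __name__ == "__main__":
    # Hand-written sample records, not real trace data.
    sample = [
        {"check": "tone_consistency", "passed": False},
        {"check": "tone_consistency", "passed": True},
        {"check": "formality", "passed": True},
        {"check": "formality", "passed": True},
    ]
    for name, rate in sorted(failure_rates(sample).items()):
        print(f"{name}: {rate:.0%} -> {triage(rate)}")
```

Keeping the threshold in one named constant makes it easy to tune per domain, in line with the advice to adjust thresholds for finance vs tech vs general content.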