---
name: create-inspect-task
description: Create custom inspect-ai evaluation tasks through an interactive, guided workflow.
---

# Create Inspect Task

You help users create custom inspect-ai evaluation tasks through an interactive, guided workflow. Create well-documented, reusable evaluation scripts that follow inspect-ai best practices.

## Your Task

Guide the user through designing and implementing a custom inspect-ai evaluation task. Create a complete, runnable task file and comprehensive documentation that explains the design decisions and usage.

## Operating Modes

This skill supports two modes:

### Mode 1: Experiment-Guided (Recommended)

When an `experiment_summary.yaml` file exists (created by the `design-experiment` skill), extract configuration to pre-populate:
- Dataset path and format
- Model information
- Evaluation objectives
- System prompts
- Common parameters

**Usage:** Run the skill from the experiment directory or provide a path to experiment_summary.yaml

### Mode 2: Standalone

Create evaluation tasks from scratch without experiment context. The user provides all configuration manually.

**Usage:** Run the skill when no experiment exists or when creating general-purpose evaluation tasks

## Workflow

### Initial Setup (Both Modes)

1. **Check for experiment context**
   - Look for `experiment_summary.yaml` in the current directory (see the detection sketch after the workflow lists below)
   - If found, ask the user: "I found an experiment summary. Would you like me to use it to configure the evaluation task?"
   - If the user says yes, proceed with Mode 1
   - If no or not found, proceed with Mode 2

### Mode 1: Experiment-Guided Workflow

1. **Read experiment_summary.yaml** - Extract configuration
2. **Confirm extracted info** - Show user what was found (dataset, models, etc.)
3. **Understand evaluation objective** - What specific aspect to evaluate?
4. **Configure task-specific details** - Solver chain, scorers (guided by experiment context)
5. **Add task parameters** - Make the task flexible and reusable
6. **Generate code** - Create the complete task file with experiment integration
7. **Create documentation** - Write design documentation with experiment context
8. **Create log** - Document all decisions in `create-inspect-task.log`
9. **Provide usage guidance** - Show user how to run the task with their models

### Mode 2: Standalone Workflow

1. **Understand the objective** - What does the user want to evaluate?
2. **Configure dataset** - Guide dataset format selection and loading
3. **Design solver chain** - Build the solver pipeline (prompts, generation, etc.)
4. **Select scorers** - Choose appropriate scoring mechanisms
5. **Add task parameters** - Make the task flexible and reusable
6. **Generate code** - Create the complete task file
7. **Create documentation** - Write design documentation with rationale
8. **Create log** - Document all decisions in `create-inspect-task.log`
9. **Provide usage guidance** - Show user how to run the task
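The mode-selection step above turns on whether `experiment_summary.yaml` is present. A minimal sketch of that check, assuming the skill is invoked from (or pointed at) the candidate experiment directory; the function name is illustrative only:

```python
from pathlib import Path


def detect_mode(experiment_dir: str = ".") -> str:
    """Return "experiment-guided" if an experiment summary is present, else "standalone"."""
    summary = Path(experiment_dir) / "experiment_summary.yaml"
    return "experiment-guided" if summary.is_file() else "standalone"


if __name__ == "__main__":
    print(f"Selected mode: {detect_mode()}")
```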
## Extracting Information from experiment_summary.yaml (Mode 1)

When operating in experiment-guided mode, extract the following information from the YAML structure:

### YAML Structure Overview

```yaml
experiment:
  name: string
  type: string
  question: string

data:
  training:
    path: string
    label: string
    format: string
    splits:
      train: int
      validation: int
      test: int

models:
  base:
    - name: string
      path: string

evaluation:
  system_prompt: string
  temperature: float

runs:
  - name: string
    type: string  # "fine-tuned" or "control"
    model: string
```

### Extraction Algorithm

```python
import yaml
from pathlib import Path


def extract_from_experiment_summary(path):
    """Extract configuration from experiment_summary.yaml"""
    with open(path, 'r') as f:
        config = yaml.safe_load(f)

    # Extract dataset configuration
    dataset_path = config['data']['training']['path']
    dataset_format = config['data']['training']['format']
    dataset_splits = config['data']['training']['splits']

    # Extract system prompt from evaluation section
    system_prompt = config['evaluation']['system_prompt']

    # Extract research question
    research_question = config['experiment']['question']
    experiment_type = config['experiment']['type']

    # Extract model information (first base model)
    base_models = config['models']['base']
    model_name = base_models[0]['name'] if base_models else None
    model_path = base_models[0]['path'] if base_models else None

    # Extract run names for documentation examples
    run_names = [run['name'] for run in config['runs']]
    control_runs = [run['name'] for run in config['runs'] if run['type'] == 'control']

    return {
        'dataset_path': dataset_path,
        'dataset_format': dataset_format,
        'dataset_splits': dataset_splits,
        'system_prompt': system_prompt,
        'research_question': research_question,
        'experiment_type': experiment_type,
        'model_name': model_name,
        'model_path': model_path,
        'run_names': run_names,
        'control_runs': control_runs
    }
```

### Key Fields to Extract

**From `experiment` section:**
- `question` → Research question/objective (informs evaluation goal)
- `type` → Experiment type (helps understand what's being compared)

**From `data.training` section:**
- `path` → Dataset path for evaluation
- `format` → Dataset format (json, parquet)
- `splits` → Sample counts (use test split for evaluation)

**From `models.base[]` section:**
- `name` → Model identifier
- `path` → Full path to base model (for usage examples)

**From `evaluation` section:**
- `system_prompt` → Use same prompt for consistency
- `temperature` → Default temperature setting

**From `runs[]` section:**
- `name` → Run identifiers (for documentation)
- `type` → Filter for "control" runs that need evaluation

### Presenting Extracted Information

After extraction, show the user what was found:

````markdown
## Configuration Extracted from Experiment

I found the following configuration in your experiment:

**Dataset:**
- Path: `/scratch/gpfs/.../data/green/capitalization/words_4L_80P_300.json`
- Format: JSON
- Splits: train (240), test (60)

**Models:**
- Llama-3.2-1B-Instruct
- Path: `/scratch/gpfs/.../pretrained-llms/Llama-3.2-1B-Instruct`

**System Prompt:**
```
{extracted_prompt or "(none)"}
```

**Research Question:**
{extracted_question}

I'll use this information to help configure your evaluation task. You can override any of these settings if needed.
````
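A brief sketch of how the extracted dictionary might drive the confirmation message above, assuming the `extract_from_experiment_summary` helper defined earlier is in scope; the file path and printed layout are illustrative, not fixed by the skill:

```python
# Illustrative usage of the extraction helper defined above.
config = extract_from_experiment_summary("experiment_summary.yaml")

print("## Configuration Extracted from Experiment")
print(f"Dataset: {config['dataset_path']} ({config['dataset_format']})")
print(f"Splits: {config['dataset_splits']}")
print(f"Model: {config['model_name']} at {config['model_path']}")
print(f"System prompt: {config['system_prompt'] or '(none)'}")
print(f"Research question: {config['research_question']}")
print(f"Control runs to evaluate: {', '.join(config['control_runs']) or '(none)'}")
```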
### Validation

Check extracted information:
- ✓ Dataset path exists (verify with `ls`)
- ✓ Dataset format is supported (.json, .parquet, .jsonl)
- ✓ Model path exists (verify with `ls`)
- ✓ System prompt is properly formatted (string, not list)

If validation fails:
- Warn user but continue
- Ask user to provide correct information
- Log validation failures

## Logging

**IMPORTANT:** Create a detailed log file at `{task_directory}/create-inspect-task.log` that records all questions, answers, and decisions made during task creation.

### Log Format

```
[YYYY-MM-DD HH:MM:SS] ACTION: Description
  Details: {specifics}
  Result: {outcome}
```

### What to Log

- User's evaluation objective
- Dataset selection and configuration decisions
- Solver chain composition choices
- Scorer selection rationale
- Task parameter decisions
- File creation
- Any validation performed

### Example Log Entries

#### Mode 1: Experiment-Guided

```
[2025-10-24 14:30:00] MODE_SELECTION: Experiment-guided mode
  Details: Found experiment_summary.yaml at /scratch/gpfs/MSALGANIK/mjs3/cap_4L_lora_lr_sweep/experiment_summary.yaml
  Result: User confirmed to use experiment configuration

[2025-10-24 14:30:05] EXTRACT_CONFIG: Reading experiment_summary.yaml
  Details: Parsing YAML structure: experiment, data, models, evaluation sections
  Result: Successfully extracted configuration

[2025-10-24 14:30:10] EXTRACTED_DATASET: Dataset configuration
  Details: Path: /scratch/gpfs/MSALGANIK/niznik/GitHub/cruijff_kit/data/green/capitalization/words_4L_80P_300.json, Format: JSON, Splits: train (240), test (60)
  Result: Verified dataset exists (43KB)

[2025-10-24 14:30:15] EXTRACTED_SYSTEM_PROMPT: System prompt from experiment
  Details: Prompt: "" (empty - no system message)
  Result: Will use empty system prompt for consistency with training

[2025-10-24 14:30:20] EXTRACTED_RESEARCH_QUESTION: Scientific objective
  Details: Compare LoRA ranks and learning rates for capitalization task
  Result: Will design evaluation to measure exact match accuracy

[2025-10-24 14:30:25] EVALUATION_OBJECTIVE: User wants to evaluate capitalization accuracy
  Details: Exact match (case-sensitive), using experiment dataset
  Result: Will use match(location="exact", ignore_case=False) scorer for strict evaluation

[2025-10-24 14:30:30] SOLVER_CONFIG: Designing solver chain
  Details: system_message(""), prompt_template("{prompt}"), generate(temp=0.0)
  Result: Matches training configuration for consistency
```

#### Mode 2: Standalone

```
[2025-10-24 14:30:00] MODE_SELECTION: Standalone mode
  Details: No experiment_summary.yaml found
  Result: User will provide all configuration manually

[2025-10-24 14:30:05] EVALUATION_OBJECTIVE: User wants to evaluate sentiment classification
  Details: Binary classification (positive/negative), using custom dataset in JSON format
  Result: Will use match() scorer for exact matching, temperature=0.0 for consistency

[2025-10-24 14:30:15] DATASET_CONFIG: Selected JSON dataset format
  Details: Dataset path: /scratch/gpfs/MSALGANIK/niznik/data/sentiment_test.json, Field mapping: input="text", target="sentiment"
  Result: Will use hf_dataset with json format and custom record_to_sample function
```
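The log format above is plain text with timestamped entries. A minimal helper sketch for appending entries in that format; the function name and signature are illustrative rather than part of the skill:

```python
from datetime import datetime
from pathlib import Path


def log_decision(task_directory: str, action: str, details: str, result: str) -> None:
    """Append one timestamped entry to create-inspect-task.log in the task directory."""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    entry = f"[{timestamp}] {action}\n  Details: {details}\n  Result: {result}\n\n"
    with open(Path(task_directory) / "create-inspect-task.log", "a") as f:
        f.write(entry)


# Example with hypothetical values:
log_decision(".", "MODE_SELECTION: Standalone mode",
             "No experiment_summary.yaml found",
             "User will provide all configuration manually")
```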
## Questions to Ask

### 1. Evaluation Objective

**What do you want to evaluate?**
- Classification task? (sentiment, topic, entity type, etc.)
- Generation quality? (summarization, translation, etc.)
- Factual accuracy? (question answering, fact checking)
- Reasoning ability? (math, logic, chain-of-thought)
- Task-specific capability? (code generation, instruction following)

**What defines a correct answer?**
- Exact match with target?
- Contains specific information?
- Model-graded quality assessment?
- Multiple acceptable answers?

### 2. Dataset Configuration

**What dataset format do you have?**
- JSON file (`.json` or `.jsonl`)
- Parquet files (`.parquet`)
- HuggingFace dataset (specify dataset name)
- CSV file
- Custom format (will need conversion)

**Where is the dataset located?**
- Get full path to dataset
- Verify file exists if possible
- Check file size for sanity

**What are the field names?**
- Input field name (e.g., "question", "text", "prompt")
- Target/answer field name (e.g., "answer", "label", "output")
- Any metadata fields to preserve? (e.g., "category", "difficulty")

**Dataset structure specifics:**
- For JSON: Is it a single JSON file with nested structure or JSONL?
- For JSON with splits: Which field contains the test split?
- For Parquet: Is it a directory of parquet files?
- For HuggingFace: Dataset name and split to use?

**Example questions:**
- "Does your JSON file have a structure like `{'train': [...], 'test': [...]}`?"
- "Is each line a separate JSON object (JSONL format)?"
- "Do you need to load from a specific split like 'test' or 'validation'?"

### 3. Solver Configuration

**System message:**
- Do you want to provide instructions to the model via system message?
- What role should the model play? (e.g., "You are a helpful assistant", "You are an expert classifier")
- Default: empty string (no system message)

**Prompt template:**
- Should we use the input directly or wrap it in a template?
- Do you need chain-of-thought prompting?
- Default: `"{prompt}"` (direct input)

**Generation parameters:**
- **Temperature**:
  - 0.0 for deterministic, consistent answers (recommended for most evals)
  - Higher values (0.7-1.0) for creative tasks
- **Max tokens**: Maximum length of model response (default: model's default)
- **Top-p**: Nucleus sampling parameter (default: 1.0)

**Common solver patterns:**
- Simple generation: `[system_message(""), prompt_template("{prompt}"), generate()]`
- Chain-of-thought: `[chain_of_thought(), generate()]`
- Multiple-choice: `[multiple_choice()]` (don't add separate generate())
- Custom template: `[prompt_template("Answer: {prompt}\n"), generate()]`

### 4. Scorer Selection

**Based on evaluation objective, suggest scorers:**

**For exact matching:**
- `match()` - Target appears at beginning/end; ignores case, whitespace, punctuation
  - Options: `location="begin"/"end"/"any"/"exact"`, `ignore_case=True/False`
- `exact()` - Precise matching after normalization
- `includes()` - Target appears anywhere in output
  - Options: `ignore_case=True/False`

**For multiple choice:**
- `choice()` - Works with `multiple_choice()` solver
  - Returns letter of selected answer (A, B, C, D, etc.)

**For pattern extraction:**
- `pattern()` - Extract answer using regex
  - Requires regex pattern parameter

**For model-graded evaluation:**
- `model_graded_qa()` - Another model assesses answer quality
  - Options: `partial_credit=True/False`, custom `template`
- `model_graded_fact()` - Checks if specific facts appear
- Note: Requires additional model, adds latency and cost

**For numeric/F1 scoring:**
- `f1()` - F1 score for text overlap

**Multiple scorers:**
- Can use a list: `[match(), includes()]` to get multiple scores
- Helpful for comparing scoring methods
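When it is unclear which scoring rule best captures "correct", one option noted above is to attach several scorers and compare metrics from a single run. A small sketch of that idea, assuming the trade-off is between strict matching, substring matching, and token-overlap F1:

```python
from inspect_ai.scorer import match, includes, f1

# Each scorer in the list is reported separately in the eval log,
# so one run shows how sensitive the result is to the scoring rule.
scorers = [
    match(ignore_case=False),  # strict: target at the start of the output, case-sensitive
    includes(),                # lenient: target anywhere in the output
    f1(),                      # graded: token-overlap F1 gives partial credit
]
```

The list can then be passed as the `scorer` argument of `Task`, as in the scorer patterns later in this document.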
### 5. Task Parameters

**Should the task accept parameters for flexibility?**

**Common parameters to expose:**
- `system_prompt` - Allow different system messages
- `temperature` - Enable temperature tuning
- `dataset_path` - Support different datasets
- `grader_model` - For model-graded scoring
- `config_dir` - (legacy) For runtime config reading; scaffold-inspect uses direct params instead

**Benefits of parameters:**
- Run variations without code changes
- Easier experimentation
- Better reusability

**How to pass parameters:**
```bash
inspect eval task.py -T param_name=value
```

### 6. Model Specification

**How will the model be specified?**

**Option 1: CLI specification (most flexible)**
- User provides model at runtime
- `inspect eval task.py --model hf/local -M model_path=/path/to/model`
- Recommended for most cases

**Option 2: Integration with fine-tuning config (legacy)**
- Like existing `cap_task` example
- Reads from `setup_finetune.yaml` at runtime via `config_dir` parameter
- Note: scaffold-inspect now bakes values into SLURM instead of using this pattern

**Option 3: Hard-coded in task**
- Less flexible but simpler
- Can specify model inside task definition
- Better for benchmarking specific models

## Output Files

Create two files:

### 1. Task Script: `{task_name}_task.py`

The complete, runnable inspect-ai task following best practices.

**File naming convention:**
- Descriptive name: `sentiment_classification_task.py`
- Include domain: `math_reasoning_task.py`
- Follow pattern: `{domain}_{type}_task.py`

**Required components:**
```python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset, hf_dataset, FieldSpec
from inspect_ai.solver import chain, generate, prompt_template, system_message
from inspect_ai.scorer import match, includes


@task
def my_task(param1: str = "default"):
    """
    Brief description of what this task evaluates.

    Args:
        param1: Description of parameter

    Returns:
        Task: Configured inspect-ai task
    """
    # Dataset loading
    dataset = ...

    # Solver chain
    solver = chain(
        system_message("..."),
        prompt_template("{prompt}"),
        generate(temperature=0.0)
    )

    # Return task
    return Task(
        dataset=dataset,
        solver=solver,
        scorer=...
    )
```

**Best practices to follow:**
- Use type hints for parameters
- Include docstring explaining purpose
- Add comments explaining non-obvious choices
- Handle errors gracefully (try/except for file operations)
- Validate required parameters
- Use descriptive variable names
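A hedged, filled-in instance of the skeleton above, assuming a hypothetical JSONL sentiment dataset with `text` and `label` fields; every path, field name, and prompt here is illustrative:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample, json_dataset
from inspect_ai.solver import chain, generate, prompt_template, system_message
from inspect_ai.scorer import match


def record_to_sample(record: dict) -> Sample:
    return Sample(input=record["text"], target=record["label"])


@task
def sentiment_classification_task(
    dataset_path: str = "/path/to/sentiment_test.jsonl",  # placeholder path
    system_prompt: str = "You are a sentiment classifier. Respond with 'positive' or 'negative'.",
    temperature: float = 0.0,
):
    """
    Classify short texts as positive or negative and score by label match.

    Args:
        dataset_path: Path to a JSONL file with "text" and "label" fields.
        system_prompt: System message given to the model.
        temperature: Generation temperature (0.0 for deterministic output).

    Returns:
        Task: Configured inspect-ai task
    """
    return Task(
        dataset=json_dataset(dataset_path, record_to_sample),
        solver=chain(
            system_message(system_prompt),
            prompt_template("{prompt}"),
            generate(temperature=temperature),
        ),
        scorer=match(ignore_case=True),
    )
```

Run it as `inspect eval sentiment_classification_task.py --model hf/local -M model_path=/path/to/model`, overriding parameters with `-T` as described above.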
### 2. Design Documentation: `{task_name}_design.md`

Comprehensive documentation of design decisions.

**Required sections:**
````markdown
# {Task Name} Evaluation Task

**Created:** {timestamp}
**Inspect-AI Version:** {version if known}

## Evaluation Objective

{What this task evaluates and why}

## Dataset Configuration

**Format:** {JSON/Parquet/HuggingFace/etc.}
**Location:** `{full_path_to_dataset}`
**Size:** {number of samples if known}

**Field Mapping:**
- Input field: `{field_name}`
- Target field: `{field_name}`
- Metadata fields: `{field_names or "none"}`

**Loading Method:** {Description of how dataset is loaded}

**Data Structure:** {Explanation of JSON structure, splits, etc.}

## Solver Chain

**Components:**
1. {Solver 1}: {Purpose}
2. {Solver 2}: {Purpose}
3. ...

**System Message:**
```
{system message text or "none"}
```

**Prompt Template:**
```
{template or "direct input"}
```

**Generation Parameters:**
- Temperature: {value} - {rationale}
- Max tokens: {value or "default"} - {rationale}
- {Other parameters if any}

**Rationale:** {Why this solver chain was chosen}

## Scorer Configuration

**Primary Scorer:** `{scorer_name}()`

**Options:**
- {option1}: {value} - {reason}
- {option2}: {value} - {reason}

**Additional Scorers:** {List if multiple scorers used, or "none"}

**Rationale:** {Why this scorer is appropriate for the task}

## Task Parameters

| Parameter | Type | Default | Purpose |
|-----------|------|---------|---------|
| {param1} | {type} | {default} | {description} |

**Parameter Usage:**
```bash
inspect eval {task_file}.py -T {param}={value}
```

## Model Specification

**Recommended usage:**
```bash
inspect eval {task_file}.py --model hf/local -M model_path=/path/to/model
```

{Any specific notes about model compatibility}

## Example Usage

**Basic evaluation:**
```bash
inspect eval {task_name}_task.py --model hf/local -M model_path=/path/to/model
```

**With parameters:**
```bash
inspect eval {task_name}_task.py --model hf/local -M model_path=/path/to/model -T temperature=0.5
```

**Evaluating fine-tuned model:** {if applicable}
```bash
cd /path/to/experiment/run/epoch_0
inspect eval {task_name}_task.py --model hf/local -M model_path=$PWD -T config_dir=$PWD
```

## Output Files

Inspect-ai will create:
- `logs/{task_name}_{timestamp}.eval` - Evaluation results log
- Console output with accuracy and metrics

## Expected Performance

{If known, describe expected baseline performance or what good performance looks like}

## Notes

{Any additional considerations, limitations, or future improvements}

## References

- Inspect-AI documentation: https://inspect.aisi.org.uk/
- {Any other relevant references}
````

## Code Generation Guidelines

### Dataset Loading Patterns

**JSON with nested splits:**
```python
from inspect_ai.dataset import Sample, hf_dataset


def record_to_sample(record):
    return Sample(
        input=record["input"],
        target=record["output"]
    )


dataset = hf_dataset(
    path="json",
    data_files="/path/to/data.json",
    field="test",   # Access the "test" field within the JSON file
    split="train",  # The HF "json" builder exposes everything under a single top-level "train" split
    sample_fields=record_to_sample
)
```

**JSONL (one JSON object per line):**
```python
from inspect_ai.dataset import Sample, json_dataset


def record_to_sample(record):
    return Sample(
        input=record["question"],
        target=record["answer"]
    )


dataset = json_dataset(
    "/path/to/data.jsonl",
    record_to_sample
)
```

**Parquet directory:**
```python
from inspect_ai.dataset import hf_dataset, FieldSpec

dataset = hf_dataset(
    path="parquet",
    data_dir="/path/to/parquet_dir",
    split="test",
    sample_fields=FieldSpec(
        input="question",
        target="answer"
    )
)
```

**HuggingFace dataset:**
```python
from inspect_ai.dataset import hf_dataset, FieldSpec

dataset = hf_dataset(
    path="username/dataset-name",
    split="test",
    sample_fields=FieldSpec(
        input="question",
        target="answer",
        metadata=["category", "difficulty"]  # Preserve metadata
    )
)
```
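The `FieldSpec` examples above preserve metadata declaratively; when a custom `record_to_sample` function is used instead, metadata can be carried through explicitly. A sketch with hypothetical field names (`question`, `answer`, `category`, `difficulty`):

```python
from inspect_ai.dataset import Sample, json_dataset


def record_to_sample(record):
    """Map a record to a Sample, keeping extra fields as metadata for later analysis."""
    return Sample(
        input=record["question"],
        target=record["answer"],
        metadata={
            "category": record.get("category"),
            "difficulty": record.get("difficulty"),
        },
    )


# Hypothetical path; substitute the dataset confirmed with the user.
dataset = json_dataset("/path/to/data.jsonl", record_to_sample)
```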
### Solver Chain Patterns

**Simple generation:**
```python
from inspect_ai.solver import chain, generate, prompt_template, system_message

solver = chain(
    system_message(""),           # Empty if no system message needed
    prompt_template("{prompt}"),  # Direct input
    generate(temperature=0.0)
)
```

**With system message and custom template:**
```python
solver = chain(
    system_message("You are an expert classifier. Respond with only the category label."),
    prompt_template("Text: {prompt}\n\nCategory:"),
    generate(temperature=0.0, max_tokens=50)
)
```

**Chain-of-thought:**
```python
from inspect_ai.solver import chain, chain_of_thought, generate

solver = chain(
    chain_of_thought(),  # Adds "Let's think step by step" prompt
    generate(temperature=0.0)
)
```

**Multiple choice:**
```python
from inspect_ai.solver import multiple_choice

solver = multiple_choice()  # Don't add generate() separately

# Or with chain-of-thought:
solver = multiple_choice(cot=True)
```

### Scorer Patterns

**Exact matching (case-insensitive):**
```python
from inspect_ai.scorer import match

scorer = match()  # Default: ignore case, whitespace, punctuation

# Or customize:
scorer = match(location="exact", ignore_case=False)
```

**Substring matching:**
```python
from inspect_ai.scorer import includes

scorer = includes()  # Default: case-insensitive

# Or, for case-sensitive matching:
scorer = includes(ignore_case=False)
```

**Multiple scorers:**
```python
scorer = [
    match("exact", ignore_case=False),
    includes(ignore_case=False)
]
# Results will show scores from both
```

**Model-graded:**
```python
from inspect_ai.scorer import model_graded_qa

scorer = model_graded_qa(
    partial_credit=True,   # Allow 0.5 scores
    model="openai/gpt-4o"  # Specify grading model
)
```
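`pattern()` is mentioned under Scorer Selection but not shown above; a brief sketch, assuming the prompt asks the model to finish with a line of the form `Answer: <value>` (the regex is illustrative and should match however your prompt formats answers):

```python
from inspect_ai.scorer import pattern

# The first capture group is extracted from the model output and compared to the target.
scorer = pattern(r"Answer:\s*(.+?)\s*$")
```

Pairing this with a prompt template that instructs the model to end with `Answer: ...` keeps extraction reliable.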
## Integration with Fine-Tuning Workflow

### Experiment-Guided Task Creation (Recommended)

When creating tasks for an experiment:

1. **Run from experiment directory:**
   ```bash
   cd /scratch/gpfs/MSALGANIK/mjs3/my_experiment/
   # Invoke create-inspect-task skill
   ```

2. **Skill automatically extracts from experiment_summary.yaml:**
   - Dataset path and format
   - System prompt (ensures eval matches training)
   - Model information
   - Research objectives

3. **Task parameter modes:**
   - **Direct parameters (preferred)**: `data_path`, `prompt`, `system_prompt` passed via `-T` flags. scaffold-inspect bakes these into SLURM scripts at scaffolding time.
   - **config_dir mode (legacy)**: Reads from `setup_finetune.yaml` at runtime. Not used by scaffold-inspect but supported for backwards compatibility.

### Generated Task Pattern

**For tasks integrated with experiments:**
```python
import yaml
from pathlib import Path
from typing import Optional

from inspect_ai import Task, task
from inspect_ai.solver import chain, generate, prompt_template, system_message


@task
def my_task(
    config_dir: Optional[str] = None,
    dataset_path: Optional[str] = None,
    system_prompt: str = "",
    temperature: float = 0.0,
    split: str = "test"
) -> Task:
    """
    Evaluate model using configuration from fine-tuning setup or direct paths.

    Args:
        config_dir: Path to epoch directory (contains ../setup_finetune.yaml).
            If provided, reads dataset path and system prompt from config.
        dataset_path: Direct path to dataset JSON file. Used if config_dir not provided.
        system_prompt: System message for the model. Overrides config if both provided.
        temperature: Generation temperature (default: 0.0 for deterministic output).
        split: Which data split to use (default: "test").

    Returns:
        Task: Configured inspect-ai task
    """
    # Determine configuration source
    if config_dir:
        # Mode 1: Read from fine-tuning configuration
        config_path = Path(config_dir).parent / "setup_finetune.yaml"
        with open(config_path, 'r') as f:
            config = yaml.safe_load(f)

        # Extract settings from fine-tuning config
        dataset_path = config['input_dir_base'] + config['dataset_label'] + config['dataset_ext']

        # Use system prompt from config unless overridden
        if not system_prompt:
            system_prompt = config.get('system_prompt', '')
    elif dataset_path:
        # Mode 2: Direct dataset path
        # system_prompt and other params used as provided
        pass
    else:
        raise ValueError("Must provide either config_dir or dataset_path")

    # Load dataset
    dataset = ...  # Load using dataset_path

    return Task(
        dataset=dataset,
        solver=chain(
            system_message(system_prompt),
            prompt_template("{prompt}"),
            generate(temperature=temperature)
        ),
        scorer=...
    )
```

### Usage Examples

**Evaluating fine-tuned model from experiment:**
```bash
cd /path/to/experiment/run_dir/epoch_0
inspect eval /path/to/my_task.py --model hf/local -M model_path=$PWD -T config_dir=$PWD
```

**Evaluating base model (control run):**
```bash
inspect eval my_task.py \
  --model hf/local \
  -M model_path=/scratch/gpfs/MSALGANIK/pretrained-llms/Llama-3.2-1B-Instruct \
  -T dataset_path=/path/to/dataset.json
```

### Integration with setup_inspect.py (Future)

This task pattern enables integration with the `setup_inspect.py` tool (when implemented):
```bash
python tools/inspect/setup_inspect.py --finetune_epoch_dir /path/to/experiment/run/epoch_0
```

## Validation Before Completion

### Common Validation (Both Modes)

Before finishing, verify:
- ✓ Task file is syntactically correct Python
- ✓ All imports are present
- ✓ Task decorated with `@task`
- ✓ Dataset loading code matches format
- ✓ Solver chain follows inspect-ai patterns
- ✓ Scorer is appropriate for task
- ✓ Design documentation includes all sections
- ✓ Example usage commands are correct
- ✓ Log file documents all decisions

### Mode 1 Specific Validation

Additional checks for experiment-guided mode:
- ✓ experiment_summary.yaml was successfully parsed
- ✓ Extracted dataset path exists and format matches
- ✓ System prompt matches training configuration
- ✓ Task supports both `config_dir` and `dataset_path` parameters
- ✓ Documentation includes experiment context (research question, runs)
- ✓ Usage examples show both fine-tuned and base model evaluation
- ✓ Log includes extraction details and validation results

## Next Steps After Creation

After creating the task, guide the user:

1. **Test the task:**
   ```bash
   # Validate syntax
   python -m py_compile {task_file}.py

   # Test with small sample
   inspect eval {task_file}.py --model {model} --limit 5
   ```

2. **Run full evaluation:**
   ```bash
   inspect eval {task_file}.py --model {model}
   ```

3. **View results:**
   ```bash
   inspect view  # Opens web UI to browse evaluation logs
   ```

4. **Iterate if needed:**
   - Adjust scorer settings
   - Modify prompts
   - Change generation parameters
   - Use `inspect score` to re-score without re-running
## Important Notes

### General Best Practices

- Follow inspect-ai best practices from https://inspect.aisi.org.uk/
- Always include docstrings and comments
- Make tasks parameterized for flexibility
- Create comprehensive documentation for reproducibility
- Use type hints for parameters
- Handle errors gracefully
- Validate dataset paths when possible
- Keep generation temperature at 0.0 for consistency unless user needs creativity
- Prefer simple scorers (match, includes) over model-graded when possible
- Test with small samples first (`--limit 5`)

### Experiment Integration

- **Prefer Mode 1 (experiment-guided)** when working with designed experiments
- Always check for experiment_summary.yaml before starting
- Extract and validate all configuration before proceeding
- **System prompt consistency is critical** - eval must match training
- Generated tasks should work for both fine-tuned and base models
- Include experiment context in documentation (research question, runs)
- Use `config_dir` parameter pattern for experiment integration
- Log all extraction and validation steps for reproducibility

## Error Handling

**If dataset file not found:**
- Warn user but proceed with code generation
- Note in documentation that path should be verified
- Include validation suggestion in next steps

**If unsure about dataset format:**
- Ask for example record
- Offer to help convert to supported format
- Suggest user examine file structure

**If scorer choice unclear:**
- Recommend starting with simple scorers
- Suggest using multiple scorers for comparison
- Note that scorers can be changed later without re-running generation