--- name: "hypogenic-hypothesis-generation" description: "LLM-driven hypothesis generation/testing on tabular data. Three methods: HypoGeniC (data-driven), HypoRefine (literature+data), Union. Iterative refinement, Redis caching, multi-hypothesis inference. Manual: hypothesis-generation; ideation: scientific-brainstorming." license: "MIT" --- # HypoGeniC Hypothesis Generation ## Overview HypoGeniC automates scientific hypothesis generation and testing using LLMs on tabular datasets. Given labeled data (e.g., deception detection, AI-content identification), it generates testable hypotheses, iteratively refines them against validation performance, and runs inference to classify new samples. It supports three approaches: purely data-driven (HypoGeniC), literature-integrated (HypoRefine), and mechanistic union of both. ## When to Use - Generating testable hypotheses from labeled observational datasets without prior theory - Systematically testing multiple competing hypotheses on empirical data - Combining insights from research papers with data-driven pattern discovery - Accelerating hypothesis ideation in domains like deception detection, content analysis, mental health indicators - Benchmarking LLM-based hypothesis generation methods against few-shot baselines - For manual hypothesis formulation frameworks, use **hypothesis-generation** knowhow - For general-purpose ML classification without hypothesis interpretability, use **scikit-learn-machine-learning** ## Prerequisites - **Python packages**: `hypogenic` - **Optional**: Redis server (port 6832) for LLM response caching; GROBID for PDF literature processing - **API keys**: OpenAI, Anthropic, or compatible LLM API key in environment - **Data**: Labeled JSON datasets in HypoGeniC format (see Key Concepts) ```bash pip install hypogenic # Optional: clone example datasets git clone https://github.com/ChicagoHAI/HypoGeniC-datasets.git ./data git clone https://github.com/ChicagoHAI/Hypothesis-agent-datasets.git ./data_lit ``` ## Quick Start ```python from hypogenic import BaseTask import re # Custom label extractor (must match dataset label format) def extract_label(text: str) -> str: match = re.search(r'final answer:\s+(.*)', text, re.IGNORECASE) return match.group(1).strip() if match else text.strip() # 1. Load task from config task = BaseTask( config_path="./data/your_task/config.yaml", extract_label=extract_label ) # 2. Generate hypotheses (data-driven) task.generate_hypotheses( method="hypogenic", num_hypotheses=20, output_path="./output/hypotheses.json" ) # 3. Run inference on test set results = task.inference( hypothesis_bank="./output/hypotheses.json", test_data="./data/your_task/your_task_test.json" ) print(f"Accuracy: {results['accuracy']:.3f}") ``` ## Workflow ### Step 1: Prepare Dataset Create train/val/test JSON files with text features and labels. ```python import json # Dataset: each key maps to a list of equal length dataset = { "headline_1": [ "What Up, Comet? You Just Got *PROBED*", "Scientists Made a Breakthrough in Quantum Computing" ], "headline_2": [ "Scientists Were Holding Their Breath Today. 
## Quick Start

```python
from hypogenic import BaseTask
import re

# Custom label extractor (must match dataset label format)
def extract_label(text: str) -> str:
    match = re.search(r'final answer:\s+(.*)', text, re.IGNORECASE)
    return match.group(1).strip() if match else text.strip()

# 1. Load task from config
task = BaseTask(
    config_path="./data/your_task/config.yaml",
    extract_label=extract_label
)

# 2. Generate hypotheses (data-driven)
task.generate_hypotheses(
    method="hypogenic",
    num_hypotheses=20,
    output_path="./output/hypotheses.json"
)

# 3. Run inference on test set
results = task.inference(
    hypothesis_bank="./output/hypotheses.json",
    test_data="./data/your_task/your_task_test.json"
)
print(f"Accuracy: {results['accuracy']:.3f}")
```

## Workflow

### Step 1: Prepare Dataset

Create train/val/test JSON files with text features and labels.

```python
import json

# Dataset: each key maps to a list of equal length
dataset = {
    "headline_1": [
        "What Up, Comet? You Just Got *PROBED*",
        "Scientists Made a Breakthrough in Quantum Computing"
    ],
    "headline_2": [
        "Scientists Were Holding Their Breath Today. Here's Why.",
        "New Quantum Computer Achieves Milestone"
    ],
    "label": [
        "Headline 2 has more clicks than Headline 1",
        "Headline 1 has more clicks than Headline 2"
    ]
}

# All lists must have equal length; labels must match extract_label output
for split in ["train", "val", "test"]:
    with open(f"my_task_{split}.json", "w") as f:
        json.dump(dataset, f, indent=2)

print(f"Created dataset with {len(dataset['label'])} samples")
```

### Step 2: Create Task Configuration

Write a `config.yaml` defining dataset paths and prompt templates.

```python
# config.yaml structure (write as YAML file)
config = """
task_name: my_task

train_data_path: ./my_task_train.json
val_data_path: ./my_task_val.json
test_data_path: ./my_task_test.json

prompt_templates:
  observations: |
    Feature 1: ${text_features_1}
    Feature 2: ${text_features_2}
    Observation: ${label}

  batched_generation:
    system: "You are a research scientist generating hypotheses."
    user: "Generate ${num_hypotheses} testable hypotheses from these observations."

  inference:
    system: "You are evaluating a hypothesis against data."
    user: "Hypothesis: ${hypothesis}\\nSample: ${sample_text}\\nFinal answer: ${label}"

  is_relevant:
    system: "Check hypothesis relevance."
    user: "Is this hypothesis relevant? ${hypothesis}"
"""

with open("config.yaml", "w") as f:
    f.write(config)
print("Configuration written to config.yaml")
```

### Step 3: Implement Label Extraction

Define a custom `extract_label` function matching your label format.

```python
import re

def extract_label(llm_output: str) -> str:
    """Parse LLM output to extract predicted label.

    Must return labels matching the 'label' field values in the dataset.
    Default: searches for 'final answer:' and returns the text that follows.
    """
    match = re.search(r'final answer:\s+(.*)', llm_output, re.IGNORECASE)
    return match.group(1).strip() if match else llm_output.strip()
```
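Because a label outside the dataset's label set would silently break accuracy scoring, it is worth sanity-checking the extractor before running full inference. The model outputs below are hypothetical strings shaped like the inference prompt's `Final answer: ${label}` format, and the expected labels come from the Step 1 dataset.

```python
# Hypothetical model outputs in the 'Final answer: <label>' format
samples = [
    "The concrete headline draws readers. Final answer: Headline 1 has more clicks than Headline 2",
    "FINAL ANSWER: Headline 2 has more clicks than Headline 1",
]

# Label set from the Step 1 dataset
expected = {
    "Headline 1 has more clicks than Headline 2",
    "Headline 2 has more clicks than Headline 1",
}

for output in samples:
    label = extract_label(output)
    assert label in expected, f"Unexpected label: {label!r}"
print("extract_label matches the dataset's label format")
```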