---
name: openthoughts-data-recipes
title: "OpenThoughts: Data Recipes for Reasoning Models"
version: 0.0.2
engine: skillxiv-v0.0.2-claude-opus-4.6
license: MIT
url: "https://arxiv.org/abs/2506.04178"
keywords: [data-curation, reasoning, distillation, mathematics, code-generation]
description: "Design data generation pipelines for reasoning models through systematic experimentation with answer sampling, teacher selection, and source quality optimization."
---

# OpenThoughts: Data Recipes for Reasoning Models

## Core Concept

OpenThoughts demonstrates that state-of-the-art reasoning performance depends critically on data curation strategy rather than raw model capacity. Through over 1,000 controlled experiments, the researchers found that answer diversity sampled from a single high-quality teacher outweighs question-source diversity, that selecting a few high-quality sources beats mixing many, and that LLM-based filtering outperforms traditional filtering methods.

## Architecture Overview

- **Question Sourcing**: Evaluate 27 code, 21 math, and 14 science question sources
- **Mixing Strategy**: Select the top 1-2 sources per domain based on correlation with downstream performance
- **Question Filtering**: Apply LLM-based difficulty and response-length scoring
- **Deduplication**: Exact deduplication, followed by 16× answer sampling per question
- **Answer Filtering**: Counterintuitively, keeping all teacher answers outperforms selective filtering
- **Teacher Model Selection**: QwQ-32B yields the best distilled 7B performance despite lower standalone benchmark scores

## Implementation

### Step 1: Source Evaluation Framework

```python
class SourceEvaluator:
    """Rank candidate question sources by how well they transfer to reasoning benchmarks."""

    # Candidate question sources evaluated per domain: 27 code, 21 math, 14 science
    DOMAIN_SOURCE_COUNTS = {'code': 27, 'math': 21, 'science': 14}

    def __init__(self, sources, domains=('math', 'code', 'science')):
        # `sources` maps each domain to its list of candidate source names
        self.sources = {domain: sources.get(domain, []) for domain in domains}

    def evaluate_source_quality(self, source_name, domain, sample_size=100):
        """Measure a source's correlation with downstream task performance."""
        # Placeholder hooks: sample_from_source, train_student_on_source, and
        # evaluate_on_benchmark stand in for a real training/evaluation harness.
        samples = self.sample_from_source(source_name, sample_size)

        # Train a small student model on samples from this source alone
        model = train_student_on_source(samples)

        # Evaluate on held-out reasoning benchmarks
        aime_score = evaluate_on_benchmark(model, 'AIME2025')
        livecodebench_score = evaluate_on_benchmark(model, 'LiveCodeBench')

        return {
            'name': source_name,
            'domain': domain,
            'aime_correlation': aime_score,
            'code_correlation': livecodebench_score,
            'combined_score': (aime_score + livecodebench_score) / 2,
        }

    def select_top_sources(self, domain, num_sources=2):
        """Keep only the top 1-2 sources per domain."""
        evaluated = [self.evaluate_source_quality(src, domain)
                     for src in self.sources[domain]]
        ranked = sorted(evaluated, key=lambda x: x['combined_score'], reverse=True)
        return [s['name'] for s in ranked[:num_sources]]
```
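The Architecture Overview lists exact deduplication as its own pipeline step before 16× answer sampling, but no code is shown for it. A minimal sketch, assuming duplicates are detected on normalized question text (the normalization choices are illustrative, not the paper's exact procedure):

```python
import hashlib

def exact_dedup(questions):
    """Drop exact-duplicate questions before answer sampling (illustrative sketch)."""
    seen = set()
    unique = []
    for question in questions:
        # Normalize whitespace and case so trivially reformatted copies collide
        key = hashlib.sha256(" ".join(question.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(question)
    return unique
```

Each surviving question is then sampled 16 times by the answer generator in Step 2.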
### Step 2: Answer Sampling Multiplicity

```python
class AnswerDiversityGenerator:
    def __init__(self, teacher_model_name, multiplicity=16):
        # `load_model` is a placeholder for loading the teacher model, e.g. QwQ-32B
        self.teacher = load_model(teacher_model_name)
        self.multiplicity = multiplicity
        self.sampling_config = {
            'temperature': 1.0,
            'top_p': 0.95,
            'max_new_tokens': 2048,
        }

    def generate_multiple_answers(self, question, num_samples=16):
        """Generate diverse answers to the same question via stochastic sampling."""
        answers = []
        for i in range(num_samples):
            # Stochastic decoding (do_sample=True) is what creates answer diversity
            answer = self.teacher.generate(
                question,
                temperature=self.sampling_config['temperature'],
                top_p=self.sampling_config['top_p'],
                do_sample=True,
            )
            answers.append({
                'answer': answer,
                'sample_id': i,
                'question': question,
            })
        return answers

    def create_dataset_via_multiplication(self, questions, multiplicity=16):
        """Scale the dataset by generating multiple answers per question."""
        dataset = []
        for question in questions:
            dataset.extend(
                self.generate_multiple_answers(question, num_samples=multiplicity)
            )
        print(f"Dataset scaled from {len(questions)} to {len(dataset)} samples")
        print(f"Scaling factor: {len(dataset) / len(questions)}×")
        return dataset
```

### Step 3: LLM-Based Question Filtering

```python
class LLMQuestionFilter:
    def __init__(self, scoring_model_name='GPT-4'):
        # `load_model` is a placeholder for an LLM-judge client
        self.scorer = load_model(scoring_model_name)

    def compute_difficulty_score(self, question, ground_truth):
        """Use an LLM judge to assess question difficulty."""
        prompt = f"""Rate the difficulty of this question on a scale of 1-10:
Question: {question}
Answer: {ground_truth}
Consider: problem complexity, reasoning depth, required knowledge."""
        difficulty_response = self.scorer.generate(prompt)
        # `extract_numeric_score` is a placeholder that parses the judge's rating
        return extract_numeric_score(difficulty_response)  # 1 = trivial, 10 = hard

    def compute_response_length(self, answer):
        """Approximate expected solution complexity via whitespace token count."""
        return len(answer.split())

    def filter_by_llm_criteria(self, questions_with_answers,
                               min_difficulty=3,
                               target_length_range=(100, 2000)):
        """Keep only questions that meet the difficulty and length criteria."""
        filtered = []
        for qa_pair in questions_with_answers:
            difficulty = self.compute_difficulty_score(qa_pair['question'], qa_pair['answer'])
            length = self.compute_response_length(qa_pair['answer'])
            # LLM-based filtering outperforms embedding-based approaches
            if (min_difficulty <= difficulty
                    and target_length_range[0] <= length <= target_length_range[1]):
                filtered.append(qa_pair)
        print(f"Filtered from {len(questions_with_answers)} to {len(filtered)} samples")
        return filtered
```

### Step 4: Teacher Model Selection Strategy

```python
class TeacherSelectionFramework:
    def __init__(self):
        self.candidate_teachers = [
            'QwQ-32B',              # Best distillation performance
            'DeepSeek-R1-Distill',
            'LLaMA-3.1-70B',
            'Mistral-Large',
        ]

    def evaluate_teacher_distillation(self, teacher_name, student_size='7B'):
        """Measure how well a teacher distills into a 7B student."""
        # Placeholder hooks: generate_reasoning_dataset_with_teacher,
        # train_student_model, and evaluate_on_benchmark stand in for a real pipeline.
        dataset = self.generate_reasoning_dataset_with_teacher(teacher_name)
        student = train_student_model(dataset, student_size)

        # Evaluate the distilled student on multiple benchmarks
        aime_2025 = evaluate_on_benchmark(student, 'AIME2025')
        livecodebench = evaluate_on_benchmark(student, 'LiveCodeBench')

        return {
            'teacher': teacher_name,
            'aime_2025_score': aime_2025,
            'livecodebench_score': livecodebench,
            'avg_downstream_performance': (aime_2025 + livecodebench) / 2,
        }

    def select_best_teacher(self):
        """Key finding: distillation quality, not standalone scores, picks the teacher."""
        results = [self.evaluate_teacher_distillation(name)
                   for name in self.candidate_teachers]
        # QwQ-32B distilled better than teachers with higher standalone benchmark scores
        best_teacher = max(results, key=lambda x: x['avg_downstream_performance'])
        print(f"Best teacher for distillation: {best_teacher['teacher']}")
        return best_teacher['teacher']
```
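The `train_student*` calls above are placeholders. As a bridge between data generation and training, here is a minimal sketch of how the sampled (question, answer) pairs might be packaged into chat-style SFT records; the `messages`/JSONL schema below is an illustrative assumption, not the format of the released dataset:

```python
import json

def to_sft_records(answer_samples, output_path="distill_sft.jsonl"):
    """Package (question, teacher answer) pairs into chat-style SFT records (sketch)."""
    records = []
    for sample in answer_samples:
        records.append({
            "messages": [
                {"role": "user", "content": sample["question"]},
                # The teacher's full trace is kept verbatim -- per the
                # answer-filtering finding (Step 5), nothing is dropped.
                {"role": "assistant", "content": sample["answer"]},
            ]
        })
    with open(output_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return records
```

Because answer filtering is skipped, every generated trace is written out, so the record count equals questions × multiplicity.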
### Step 5: Answer Filtering (Optional but Counterintuitive)

```python
class AnswerFilteringStrategy:
    def __init__(self):
        self.filter_config = {
            'apply_filtering': False,  # Key finding: no filtering > filtering
            'reason': 'Keeping all teacher answers provides a better training signal',
        }

    def analyze_filtering_impact(self, dataset_with_answers):
        """Compare distilled-student performance with and without answer filtering."""
        # Placeholder hooks: train_student, evaluate, and filter_low_quality_answers
        # stand in for a real training/evaluation harness.

        # Train a student on the unfiltered dataset
        student_no_filter = train_student(dataset_with_answers)
        perf_no_filter = evaluate(student_no_filter)

        # Train a student on the filtered dataset
        filtered_dataset = self.filter_low_quality_answers(dataset_with_answers)
        student_filtered = train_student(filtered_dataset)
        perf_filtered = evaluate(student_filtered)

        print(f"Performance without filtering: {perf_no_filter}")
        print(f"Performance with filtering: {perf_filtered}")

        # Empirical finding: no filtering wins
        return perf_no_filter >= perf_filtered
```

## Practical Guidance

1. **Source Concentration Strategy**: Focus on the 1-2 highest-quality sources per domain rather than mixing many sources. Correlation with downstream task performance matters more than source diversity.

2. **Multiplicity Over Diversity**: Generate 16 answers per question from the same teacher and question distribution instead of seeking more diverse question sources. Answer-level diversity provides a better training signal.

3. **Teacher Selection Matters**: Benchmark candidate teachers by the downstream performance of students distilled from them, not by their standalone benchmark scores. QwQ-32B proved superior despite lower intrinsic performance on some metrics.

4. **LLM-Based Filtering**: Use LLM judges to score question difficulty and expected response length rather than embedding-based or statistical methods; this outperforms traditional filtering approaches.

5. **Keep All Answers**: Don't filter on answer correctness or quality thresholds. The surprising finding is that keeping all teacher outputs (including wrong answers) provides a richer learning signal than keeping only high-confidence answers.

6. **Data Scaling Pattern**: Expect consistent improvements with data scaling across the math, code, and science domains. The pipeline enables training 7B reasoning models to 53% on AIME 2025 and 51% on LiveCodeBench. An end-to-end sketch composing the components above appears after the reference section.

## Reference

- Paper: OpenThoughts: Data Recipes for Reasoning Models (arXiv:2506.04178)
- Result: OpenThinker3-7B trained on 1.2M examples
- Key Metric: 53% on AIME 2025, 51% on LiveCodeBench
- Open Source: Released on Hugging Face (OpenThoughts3-1.2M dataset)
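For orientation, a hedged end-to-end sketch that composes the illustrative classes defined in this document (SourceEvaluator, LLMQuestionFilter, AnswerDiversityGenerator); the wiring, argument names, and `raw_qa_by_source` input format are assumptions for this sketch, not the paper's released pipeline code:

```python
def build_openthoughts_style_dataset(evaluator, question_filter, generator,
                                     raw_qa_by_source, multiplicity=16):
    """Compose the recipe: source selection -> question filtering -> dedup -> 16x sampling.

    `raw_qa_by_source` maps a source name to a list of
    {'question': ..., 'answer': ...} dicts, where 'answer' is the source's
    reference answer used for difficulty scoring.
    """
    final_dataset = []
    for domain in ("math", "code", "science"):
        # 1. Source concentration: keep only the top 1-2 sources per domain
        top_sources = evaluator.select_top_sources(domain, num_sources=2)
        qa_pairs = [qa for src in top_sources for qa in raw_qa_by_source[src]]

        # 2. LLM-based question filtering on difficulty and response length
        qa_pairs = question_filter.filter_by_llm_criteria(qa_pairs, min_difficulty=3)

        # 3. Exact deduplication on normalized question text
        seen, unique = set(), []
        for qa in qa_pairs:
            key = " ".join(qa["question"].lower().split())
            if key not in seen:
                seen.add(key)
                unique.append(qa)

        # 4. 16x answer multiplicity from a single teacher, no answer filtering
        questions = [qa["question"] for qa in unique]
        final_dataset.extend(
            generator.create_dataset_via_multiplication(questions, multiplicity)
        )
    return final_dataset
```

The resulting flat list can then be packaged for supervised fine-tuning, for example with the `to_sft_records` sketch shown after Step 4.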