---
name: design-of-experiments
description: Expert guidance for Design of Experiments (DOE) in Python - interactive goal-driven design selection, classical DOE (factorial, response surface, screening), Bayesian optimization with Gaussian processes, model-driven optimal designs, active learning, and sequential experimentation; includes pyDOE3, pycse, GPyOpt, scikit-optimize, statsmodels
allowed-tools: ["*"]
---

# Design of Experiments (DOE) - Interactive Expert

## Overview

Master experimental design through **interactive, goal-driven guidance** that asks the right questions to recommend the best approach for your situation. This skill covers classical DOE, Bayesian optimization, model-driven designs, and active learning, helping you choose between batch and sequential strategies, screening and optimization, and exploration and exploitation.

**Core value:** Don't start with a method; start with questions. Based on your goals, budget, and constraints, get personalized recommendations for experimental design strategies that maximize information gain per experiment.

## CRITICAL: Start with Questions, Not Methods

**When a user mentions DOE or experimental design, ASK THESE QUESTIONS FIRST:**

### Primary Questions (Ask Before Recommending):

1. **"What is your primary goal?"**
   - **Screening**: Identify which factors matter (many factors → few important)
   - **Optimization**: Find best settings for known factors
   - **Exploration**: Understand the system / build a model
   - **Model discrimination**: Choose between competing models
   - **Robustness**: Minimize sensitivity to noise

2. **"Can you run experiments sequentially, or must they be done in a batch?"**
   - **Sequential**: Run one (or few), get results, decide next experiment(s)
   - **Batch**: Must plan all experiments upfront
   - **Hybrid**: Sequential with occasional batches

3. **"How expensive is each experiment?"**
   - **Very expensive**: >$1000 or >1 day per experiment
   - **Moderate**: $100-$1000 or hours per experiment
   - **Cheap**: <$100 or minutes per experiment

4. **"How many factors are you investigating?"**
   - **1-5 factors**: Full factorial, RSM feasible
   - **6-15 factors**: Need screening or fractional designs
   - **>15 factors**: Definitive screening, regularization needed

5. **"Do you have prior knowledge, existing data, or a mechanistic model?"**
   - **Have model**: Use model-driven optimal designs
   - **Have data**: Warm-start Bayesian optimization
   - **No knowledge**: Need exploration, space-filling designs

### Based on Answers, Recommend:

```
Sequential + Expensive + Unknown landscape      → Bayesian Optimization (GP + Expected Improvement)
Sequential + Have model + Parameter estimation  → Model-driven sequential optimal design
Batch only + Screening goal + Many factors      → Fractional factorial or Definitive Screening Design
Batch only + Optimization + Few factors         → Central Composite Design or Box-Behnken
Sequential + Moderate cost + Want model         → Active learning with Gaussian Process
Batch only + No prior knowledge + Exploration   → Latin Hypercube Sampling
```

## Quick Decision Tree

```
Can experiments be run sequentially with feedback?
│
├─ YES (Sequential possible)
│  │
│  ├─ Expensive experiments (>$1000 or >1 day each)?
│  │  ├─ YES → Bayesian Optimization
│  │  │       • Expected Improvement for optimization
│  │  │       • Upper Confidence Bound for exploration
│  │  │       • Tools: GPyOpt, scikit-optimize, BoTorch
│  │  │
│  │  └─ NO → Have a model?
│  │         ├─ YES → Model-driven sequential design
│  │         │       • D-optimal for parameter estimation
│  │         │       • Update design as parameters learned
│  │         │
│  │         └─ NO → Active learning
│  │                 • GP with uncertainty sampling
│  │                 • Tools: modAL, custom GP
│
└─ NO (Batch only)
   │
   ├─ What's your goal?
   │
   ├─ SCREENING (identify important factors)
   │  ├─ <20 runs available → Plackett-Burman, Fractional Factorial
   │  ├─ 20-50 runs → Definitive Screening Design
   │  └─ Tools: pyDOE3, dexpy
   │
   ├─ OPTIMIZATION (find best settings)
   │  ├─ 2-5 factors → Central Composite Design, Box-Behnken
   │  ├─ >5 factors → Sequential screening → then optimize
   │  ├─ Have model → D-optimal or I-optimal
   │  └─ Tools: pyDOE3, pycse, statsmodels
   │
   ├─ EXPLORATION (understand system)
   │  ├─ No prior → Latin Hypercube Sampling
   │  ├─ Build surrogate → Space-filling + modeling
   │  └─ Tools: pyDOE3, pycse, scipy
   │
   └─ ROBUSTNESS (minimize variability)
      └─ Control + noise factors → Taguchi / Robust Parameter Design
```

## When to Use Each Approach

### Classical DOE (Batch Experiments)

**Use when:**
- Must plan all experiments upfront
- Well-understood system or standard optimization
- Moderate number of factors (<15)
- Experiments are cheap enough for comprehensive coverage

**Best for:**
- Screening many factors (Plackett-Burman, fractional factorial; see the sketch below)
- Response surface modeling (CCD, Box-Behnken)
- Standard process optimization
- Teaching/learning DOE concepts

**Tools:** pyDOE3, dexpy, pycse, statsmodels
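As a concrete batch-screening starting point, here is a minimal sketch using pyDOE3's `pbdesign`; the seven factor names are placeholders for your own:

```python
import pandas as pd
import pyDOE3 as pyd

# Plackett-Burman screening: 7 two-level factors in only 8 runs
factors = ['A', 'B', 'C', 'D', 'E', 'F', 'G']  # placeholder names
design = pyd.pbdesign(len(factors))            # coded levels -1 / +1

df = pd.DataFrame(design, columns=factors)
print(f"{len(df)} runs to screen {len(factors)} factors")
print(df)

# After running the experiments, estimate each main effect as
#   mean(response | factor = +1) - mean(response | factor = -1)
# and carry the few largest factors into an RSM or BO follow-up.
```

Plackett-Burman designs alias main effects with interactions, so treat the effect ranking as a shortlist for follow-up work, not a final model.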
### Bayesian Optimization (Sequential)

**Use when:**
- Can run experiments one at a time (or in small batches)
- Experiments are expensive (time, money, resources)
- Black-box system (no mechanistic model)
- Want to minimize total number of experiments

**Best for:**
- Optimizing expensive processes ($1000+ per run)
- Materials discovery, drug screening
- Hyperparameter tuning for ML models
- Autonomous experimentation / self-driving labs

**Tools:** GPyOpt, scikit-optimize, BoTorch, Ax

### Model-Driven DOE (Optimal Designs)

**Use when:**
- Have a mechanistic or empirical model
- Goal is parameter estimation or model discrimination
- Want statistically optimal experiments
- Can update designs based on current parameter estimates

**Best for:**
- Chemical kinetics (rate constant estimation)
- Pharmacokinetics (PK parameter estimation)
- Systems biology (model calibration)
- Comparing competing models

**Tools:** Custom implementation with the Fisher Information Matrix, pyoptex

### Active Learning (Sequential Model Building)

**Use when:**
- Want to build an accurate surrogate model efficiently
- Can run experiments sequentially
- Need to balance exploration and exploitation
- Moderate experiment cost

**Best for:**
- Adaptive sampling for complex landscapes
- Building GP surrogates for later optimization
- Uncertainty quantification
- Smart data collection

**Tools:** modAL, scikit-learn, GPy

## Quick Reference Table

| Situation | Recommended Approach | Key Tool | Typical # Runs |
|-----------|---------------------|----------|----------------|
| Batch, <5 factors, screening | Full/Fractional Factorial | pyDOE3 | 8-32 |
| Batch, 6-15 factors, screening | Plackett-Burman, DSD | pyDOE3, dexpy | 12-50 |
| Batch, optimization, 2-5 factors | CCD, Box-Behnken | pyDOE3, pycse | 13-50 |
| Batch, exploration, unknown | Latin Hypercube | pyDOE3, pycse | ~10 per factor |
| **Sequential, expensive, optimize** | **Bayesian Optimization** | **GPyOpt, skopt** | **10-50** |
| Sequential, build model | Active Learning + GP | modAL | 20-100 |
| Have mechanistic model | Model-driven D-optimal | Custom FIM | 10-30 |
| Parameter estimation | D-optimal, A-optimal | pyoptex | Varies |
| Model discrimination | T-optimal, KL-optimal | Custom | 10-20 |
| Multiple objectives | Multi-objective BO | BoTorch | 30-100 |
| Mixture components | Simplex designs | pyDOE3 | 10-30 |

## Quick Start Examples

### Example 1: Classical Response Surface (Batch)

**Situation:** Optimize 3 factors, batch experiments, moderate cost

```python
import pyDOE3 as pyd
import pandas as pd

# Generate Central Composite Design with center points
n_factors = 3
design = pyd.ccdesign(n_factors, center=(0, 4))

# Create DataFrame with factor names
factors = ['Temperature', 'Pressure', 'Catalyst']
df = pd.DataFrame(design, columns=factors)

# Scale coded levels (-1, +1) to actual ranges
# Note: the default circumscribed CCD places axial points beyond ±1, so
# those runs land outside these ranges; pass face='inscribed' or
# face='faced' to ccdesign() if the bounds are hard limits.
ranges = {'Temperature': (300, 400),
          'Pressure': (1, 5),
          'Catalyst': (0.1, 1.0)}

for factor, (low, high) in ranges.items():
    df[factor] = df[factor] * (high - low) / 2 + (high + low) / 2

print(f"Generated {len(df)} experiments")
print(df.head())

# Export for lab work
df.to_csv('experimental_design.csv', index=False)
```

**Next steps:**
1. Run experiments and collect responses
2. Analyze with statsmodels or pycse
3. Optimize using the fitted model

### Example 2: pycse Surface Response

**Situation:** Quick RSM with integrated analysis

```python
import numpy as np
from pycse import design_sr, analyze_sr, sr_parity

# Generate design
bounds = np.array([[300, 400],   # Temperature
                   [1, 5],       # Pressure
                   [0.1, 1.0]])  # Catalyst

design = design_sr(bounds,
                   inputs=['Temperature', 'Pressure', 'Catalyst'],
                   outputs=['Yield'])
print(design)

# After running experiments, add results
design['Yield'] = [78, 82, 75, 88, 91, 85, 79, ...]  # Your data, one value per run

# Analyze
anova_table = analyze_sr(design)
print(anova_table)

# Parity plot (model fit)
sr_parity(design, show=True)
```

**Advantages:** Integrated workflow, automatic ANOVA, quick visualization

### Example 3: Bayesian Optimization (Sequential)

**Situation:** Expensive experiments, 4 factors, want to minimize total runs

```python
import numpy as np
from skopt import gp_minimize
from skopt.space import Real
from skopt.plots import plot_convergence

# Define parameter space
space = [
    Real(300, 400, name='Temperature'),
    Real(1, 5, name='Pressure'),
    Real(0.1, 1.0, name='Catalyst'),
    Real(10, 60, name='Time')
]

# Your black-box function (returns value to minimize)
def expensive_experiment(params):
    temp, pressure, catalyst, time = params
    # Run the actual experiment here; run_experiment is a placeholder
    # for your lab procedure or simulation
    yield_value = run_experiment(temp, pressure, catalyst, time)
    return -yield_value  # Minimize negative = maximize yield

# Bayesian optimization
result = gp_minimize(
    expensive_experiment,
    space,
    n_calls=30,          # Maximum experiments
    n_random_starts=5,   # Initial random exploration
    acq_func='EI',       # Expected Improvement
    random_state=42
)

print(f"Best parameters: {result.x}")
print(f"Best yield: {-result.fun:.2f}")

# Visualize convergence
plot_convergence(result)
```

**Workflow:**
1. Start with 5 random experiments (exploration)
2. Fit GP surrogate model
3. Suggest next experiment (maximize EI)
4. Run experiment, update model
5. Repeat until convergence (an early-stopping sketch follows)
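To avoid paying for runs after the search has stalled, scikit-optimize accepts stopping callbacks. A minimal sketch reusing `expensive_experiment` and `space` from Example 3; the 0.5 tolerance is an arbitrary illustrative choice:

```python
from skopt import gp_minimize
from skopt.callbacks import DeltaYStopper

# Stop once the 5 best objective values found so far agree to within 0.5
stopper = DeltaYStopper(delta=0.5, n_best=5)

result = gp_minimize(
    expensive_experiment,   # objective from Example 3
    space,                  # search space from Example 3
    n_calls=30,             # hard budget cap
    n_random_starts=5,
    acq_func='EI',
    callback=[stopper],     # checked after every experiment
    random_state=42
)

print(f"Stopped after {len(result.func_vals)} experiments")
```

The `n_calls` budget still caps total spend; the callback simply ends the run sooner once new experiments stop finding meaningfully better points.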
### Example 4: Model-Driven D-Optimal (Parameter Estimation)

**Situation:** Have kinetic model, need to estimate 3 rate constants

```python
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import det

# Your mechanistic model (shown for reference; the design calculation
# only needs its parameter sensitivities below)
def model(t, k1, k2, k3):
    """Concentration vs time model"""
    return k1 * (1 - np.exp(-k2 * t)) + k3 * t

# Fisher Information Matrix for given design points
def fisher_information(t_design, k_guess):
    """Calculate FIM for design points t"""
    k1, k2, k3 = k_guess

    # Jacobian (sensitivity matrix): derivatives of the model with
    # respect to each parameter, evaluated at each design time
    J = np.zeros((len(t_design), 3))
    for i, t in enumerate(t_design):
        J[i, 0] = 1 - np.exp(-k2 * t)       # d model / d k1
        J[i, 1] = k1 * t * np.exp(-k2 * t)  # d model / d k2
        J[i, 2] = t                         # d model / d k3

    # FIM = J^T J (assuming constant measurement variance)
    FIM = J.T @ J
    return FIM

# D-optimal criterion: maximize determinant of FIM
def d_optimal_criterion(t_design, k_guess):
    FIM = fisher_information(t_design, k_guess)
    return -np.log(det(FIM))  # Maximize det = minimize -log(det)

# Find optimal design points
k_initial_guess = [1.0, 0.1, 0.05]
n_points = 6

result = minimize(
    lambda t: d_optimal_criterion(t, k_initial_guess),
    x0=np.linspace(0, 100, n_points),
    bounds=[(0, 100)] * n_points,
    method='L-BFGS-B'
)

optimal_times = result.x
print(f"Optimal measurement times: {optimal_times}")

# Run experiments at these times
# After getting data, update k_guess and re-optimize the design if sequential
```

**Sequential workflow:**
1. Initial guess for parameters
2. Calculate D-optimal design
3. Run experiments
4. Fit model, update parameter estimates
5. Recalculate optimal design with new estimates
6. Repeat until parameters converge

### Example 5: Active Learning (Build Surrogate)

**Situation:** Want accurate surrogate model, can run sequentially

```python
import numpy as np
from modAL.models import ActiveLearner
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Stand-in for your expensive experiment or simulation
def expensive_function(x):
    return np.sin(x[0]) * np.cos(x[1])

# Query strategy for regression: sample where the GP is most uncertain.
# (modAL's built-in uncertainty_sampling is for classifiers, so
# regression needs a custom strategy like this one.)
def max_std_sampling(regressor, X):
    _, std = regressor.predict(X, return_std=True)
    idx = np.argmax(std)
    return idx, X[idx]

# Initial random sample
X_initial = np.random.uniform([0, 0], [10, 10], size=(5, 2))
y_initial = np.array([expensive_function(x) for x in X_initial])

# Candidate pool of possible experiments
X_pool = np.random.uniform([0, 0], [10, 10], size=(500, 2))

# Create active learner with GP
regressor = GaussianProcessRegressor(kernel=RBF())
learner = ActiveLearner(
    estimator=regressor,
    query_strategy=max_std_sampling,  # Query where the model is uncertain
    X_training=X_initial,
    y_training=y_initial
)

# Active learning loop
n_queries = 20
for i in range(n_queries):
    # Query next experiment (highest predictive uncertainty)
    query_idx, query_instance = learner.query(X_pool)
    x_new = X_pool[query_idx].reshape(1, -1)

    # Run experiment
    y_new = np.array([expensive_function(x_new[0])])

    # Update model
    learner.teach(x_new, y_new)

    _, std = learner.estimator.predict(x_new, return_std=True)
    print(f"Iteration {i+1}: GP std = {std[0]:.4f}")

# Final model is in learner.estimator
```
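### Example 6: Latin Hypercube Sampling (Batch Exploration)

**Situation:** No prior knowledge, batch-only, want space-filling coverage to build a surrogate

A minimal sketch using pyDOE3's `lhs`; the factor names and ranges reuse the illustrative reactor setup from Example 1:

```python
import numpy as np
import pandas as pd
import pyDOE3 as pyd

# Space-filling design: 30 samples over 3 factors (~10 per factor)
n_factors, n_samples = 3, 30
unit_design = pyd.lhs(n_factors, samples=n_samples, criterion='maximin')  # points in [0, 1]^3

# Scale from [0, 1] to the real factor ranges
bounds = np.array([[300, 400],   # Temperature
                   [1, 5],       # Pressure
                   [0.1, 1.0]])  # Catalyst
design = bounds[:, 0] + unit_design * (bounds[:, 1] - bounds[:, 0])

df = pd.DataFrame(design, columns=['Temperature', 'Pressure', 'Catalyst'])
df.to_csv('lhs_design.csv', index=False)
print(df.describe())
```

**Next step:** fit a GP or polynomial surrogate to the measured responses, then refine the interesting region with RSM or Bayesian optimization.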
## Workflow: How to Use This Skill

### Step 1: Answer Questions

When starting a DOE project, expect these questions:
1. What's your experimental goal?
2. Can you run sequentially or batch only?
3. How expensive are experiments?
4. How many factors?
5. Do you have a model or prior data?
6. Any constraints on factor combinations?
7. What's your total budget (# experiments)?

### Step 2: Get Recommendation

Based on answers, receive:
- **Recommended approach** with rationale
- **Alternative options** with trade-offs
- **Estimated number of runs** needed
- **Library/tool recommendations**
- **Example code** to get started

### Step 3: Generate Design

Use provided code to:
- Generate the design matrix
- Visualize the design in factor space
- Check design properties
- Export to CSV for lab work

### Step 4: Run Experiments

Execute experiments according to the design:
- Randomize run order (unless sequential)
- Record all results
- Note any deviations or issues

### Step 5: Analyze Results

Analyze data with appropriate methods:
- **Classical**: ANOVA, regression, diagnostics (see the sketch after Step 6)
- **Bayesian**: Update GP, check convergence
- **Model-driven**: Fit model, assess parameters

### Step 6: Make Decisions

Based on analysis:
- **Classical**: Optimize response, validate
- **Bayesian**: Continue or stop? Next experiment?
- **Model-driven**: Parameters converged? Need more data?
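For the classical branch of Step 5, here is a minimal analysis sketch with statsmodels; it assumes a results file with the three factor columns from Example 1 plus a measured `Yield` column:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Design plus measured responses (assumed file and column names)
df = pd.read_csv('experimental_results.csv')

# Full quadratic response-surface model in the three factors
model = smf.ols(
    'Yield ~ Temperature + Pressure + Catalyst'
    ' + I(Temperature**2) + I(Pressure**2) + I(Catalyst**2)'
    ' + Temperature:Pressure + Temperature:Catalyst + Pressure:Catalyst',
    data=df,
).fit()

print(model.summary())                  # coefficients, R^2, p-values
print(sm.stats.anova_lm(model, typ=2))  # Type II ANOVA table
```

Run the diagnostic checks listed under Best Practices below (residual plots, Q-Q plots, lack-of-fit) before optimizing on the fitted surface.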
## Detailed References

For deep dives into specific topics:

- **references/DESIGN_SELECTION_GUIDE.md** - Complete decision trees with questions
- **references/CLASSICAL_DOE.md** - Factorial, RSM, screening designs
- **references/BAYESIAN_OPTIMIZATION.md** - BO theory, acquisition functions, GP models
- **references/MODEL_DRIVEN_DOE.md** - Optimal designs, Fisher Information
- **references/ACTIVE_LEARNING.md** - Sequential strategies, uncertainty sampling
- **references/ANALYSIS_METHODS.md** - Statistical analysis, ANOVA, diagnostics
- **references/PYCSE_INTEGRATION.md** - Using pycse for RSM workflows

## Common Patterns by Industry

**Chemical Engineering:**
- Reactor optimization → CCD or Bayesian Optimization
- Catalyst screening → Fractional factorial → BO on hits
- Process development → Sequential model-driven DOE

**Materials Science:**
- Composition optimization → Mixture designs or BO
- Property mapping → Latin Hypercube + GP surrogate
- Alloy discovery → High-throughput with active learning

**Pharmaceutical:**
- Formulation → Mixture designs, response surface
- Dose optimization → Bayesian optimization
- PK/PD modeling → Model-driven D-optimal designs

**Manufacturing:**
- Process parameter tuning → CCD or Box-Behnken
- Quality improvement → Taguchi, robust parameter design
- Continuous improvement → Sequential BO

**Machine Learning:**
- Hyperparameter tuning → Bayesian optimization
- Architecture search → BO with discrete/categorical variables
- Neural network training → Adaptive sampling

## Installation

```bash
# Classical DOE
pip install pyDOE3           # Factorial, RSM, LHS
pip install dexpy            # Modern DOE library

# pycse for integrated RSM
pip install pycse

# Bayesian Optimization
pip install scikit-optimize  # skopt - easiest to use
pip install GPyOpt           # Comprehensive BO
pip install ax-platform      # Meta's adaptive experimentation
# pip install botorch torch  # Advanced BO (requires PyTorch)

# Active Learning
pip install modAL            # Active learning framework

# Gaussian Processes
pip install GPy              # GP models
# scikit-learn has GP built-in

# Analysis
pip install statsmodels      # ANOVA, regression
pip install scipy            # Optimization, statistics
pip install scikit-learn     # ML models, cross-validation

# Visualization
pip install matplotlib seaborn plotly

# Sensitivity Analysis
pip install SALib

# All in one
pip install pyDOE3 dexpy pycse scikit-optimize statsmodels scipy scikit-learn modAL GPy matplotlib seaborn
```

## Best Practices

### 1. Always Start with Questions

**❌ Don't:** "I'll use a Box-Behnken design"

**✅ Do:** "My goal is optimization, I have 3 factors, I can only batch, so Box-Behnken makes sense"

### 2. Match Method to Constraints

- **Budget limited** → Bayesian optimization
- **Time limited** → Batch classical DOE
- **Knowledge limited** → Space-filling exploration
- **Have model** → Model-driven optimal design

### 3. Sequential When Possible

Sequential designs (BO, active learning, model-driven):
- **Adapt** based on results
- **Stop early** if converged
- **Avoid wasted** experiments
- **Total runs** usually fewer than batch

**When batch is better:**
- Parallel equipment available
- Setup cost dominates run cost
- Well-understood system

### 4. Start Simple, Add Complexity

- **Initial phase:** Simple screening (fractional factorial, LHS)
- **Refinement:** Focus on important factors (RSM, BO)
- **Validation:** Confirmation runs

### 5. Validate Assumptions

**Classical DOE assumes:**
- Normal residuals
- Constant variance
- Independent observations
- Linear/quadratic model adequate

**Check diagnostics:** Residual plots, Q-Q plots, lack-of-fit tests

**Bayesian optimization assumes:**
- GP model appropriate
- Kernel choice reasonable
- Enough initial exploration

**Check:** GP posterior uncertainty, kernel hyperparameters

### 6. Use Confirmation Experiments

After finding an "optimum":
- Run additional experiments at the predicted best settings
- Verify predictions match reality
- Account for prediction uncertainty

## Common Pitfalls

### 1. Using Wrong Design Type

**Problem:** Applying batch RSM to expensive experiments
**Solution:** Ask questions first, consider sequential approaches

### 2. Too Few Initial Points (BO)

**Problem:** GP needs diverse data to model the landscape
**Solution:** Start with 5-10 space-filling points (LHS)

### 3. Ignoring Constraints

**Problem:** Design suggests infeasible experiments
**Solution:** Specify constraints upfront, use constrained optimization

### 4. Overfitting Surrogate Models

**Problem:** GP fits noise, poor predictions
**Solution:** Cross-validation, hold-out test sets, regularization

### 5. Not Randomizing Run Order

**Problem:** Systematic effects confounded with factors
**Solution:** Randomize execution order except in sequential designs (see the sketch after this list)

### 6. Stopping Too Early (Sequential)

**Problem:** Declaring convergence before the search has truly converged
**Solution:** Use stopping criteria (EI < threshold, GP uncertainty low)
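For Pitfall 5, randomizing a batch design takes one pandas call. A minimal sketch, assuming the design CSV written in Example 1:

```python
import pandas as pd

# Planned design from Example 1
df = pd.read_csv('experimental_design.csv')

# Shuffle execution order; keep the original design row for traceability
run_sheet = (df.sample(frac=1, random_state=42)
               .reset_index()
               .rename(columns={'index': 'design_row'}))
run_sheet.index.name = 'run_order'
run_sheet.to_csv('run_sheet.csv')
```

Fixing `random_state` makes the run order reproducible for your lab notebook while still decorrelating it from the design structure.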
## Cost-Benefit Analysis: Sequential vs Batch

### When Sequential is Worth It:

**Expensive experiments:**
- Batch CCD: 20 runs at $1000 = $20,000
- Sequential BO: 15 runs (converge early) = $15,000
- **Savings: $5,000 + better result**

**Complex landscapes:**
- Batch might miss the global optimum
- Sequential adapts to findings
- **Better final result**

### When Batch is Better:

**Cheap experiments:**
- Batch: 30 runs in 1 day (parallel)
- Sequential: 30 runs over 30 days
- **Time savings dominate**

**Setup costs:**
- Setup takes hours, run time minutes
- Batch amortizes the setup
- **Sequential overhead too high**

## Response Format

When helping with DOE, Claude should:

1. **Ask questions first** - Don't assume a method
2. **Explain recommendation** - Why this approach for their situation
3. **Provide complete code** - Working examples, not pseudocode
4. **Show alternatives** - "BO is best, but if you must batch, try..."
5. **Guide through workflow** - Design → Execute → Analyze → Optimize
6. **Interpret results** - Statistical significance AND practical significance

## Additional Resources

- **pyDOE3 Documentation:** https://pydoe3.readthedocs.io/
- **pycse Examples:** https://kitchingroup.cheme.cmu.edu/pycse/
- **scikit-optimize:** https://scikit-optimize.github.io/
- **GPyOpt:** https://sheffieldml.github.io/GPyOpt/
- **Bayesian Optimization Book:** https://bayesoptbook.com/
- **Design of Experiments (Montgomery):** Classic textbook
- **Statistics for Experimenters (Box, Hunter, Hunter):** Comprehensive reference

## Related Skills

- `python-optimization` - Response optimization after DOE
- `pycse` - Includes regression, ANOVA, confidence intervals for analysis
- `python-multiobjective-optimization` - Multi-response optimization
- `python-plotting` - Visualization of results and diagnostics