---
name: evolution-strategies-llm-finetuning
title: "Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning"
version: 0.0.2
engine: skillxiv-v0.0.2-claude-opus-4.6
license: MIT
url: "https://arxiv.org/abs/2509.24372"
keywords: [evolution-strategies, llm-finetuning, parameter-optimization, backpropagation-free, reward-modeling, population-based-search, training-stability]
description: "Scale Evolution Strategies to billion-parameter LLMs without backpropagation for superior robustness and stability across diverse models, reward horizons, and evaluation tasks. Outperforms RL methods while eliminating gradient computation overhead."
---

## Evolution Strategies Fine-Tuning: Direct Parameter Optimization at Billion Scale

### Outcome

Fine-tune large language models through population-based direct parameter search, achieving robust improvements across diverse architectures with 15.5× lower training variance than gradient-based RL methods and resistance to reward hacking without explicit penalties.

### Problem Context

Current LLM fine-tuning relies on backpropagation through gradient-based reinforcement learning (PPO, GRPO), which struggles with:

- **Sparse, long-horizon rewards**: Intermediate supervision is often unavailable for reasoning tasks; gradients through long sequences become unstable
- **Reward hacking**: Without explicit KL constraints, gradient-based optimization exploits loopholes (e.g., short but nonsensical outputs)
- **Cross-model brittleness**: Fine-tuning success varies dramatically across base model architectures; GRPO failed entirely on certain models
- **Training instability**: High variance across runs (15.5× higher than ES) makes expensive fine-tuning unreliable for large deployments
- **Computational overhead**: Backpropagation and KL-penalty computation add substantial memory and compute burden

Evolution Strategies offer an alternative: direct search in parameter space using only reward signals, with no gradients required.

### Core Concept

Evolution Strategies treat the model parameters as a genome subject to evolutionary pressure. The algorithm repeatedly:

1. Sample parameter perturbations from a normal distribution
2. Evaluate the perturbed models on the target task to obtain rewards
3. Update the parameters in the direction of high-reward perturbations (a natural-gradient step)

Key insight: ES needs only reward values, enabling response-level supervision (did the model solve the problem?) instead of token-level loss gradients. This decouples optimization from model architecture and enables effective search in sparse-reward regimes.
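The toy sketch below illustrates this three-step loop on a small synthetic objective rather than an LLM; the objective, variable names, and hyperparameters are illustrative only and are not taken from the paper or its code.

```python
# Toy ES loop: sample perturbations, evaluate a reward, update the mean parameters.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=10)          # stand-in for "model parameters"
target = np.ones(10)                 # optimum of the toy reward
sigma, lr, population = 0.1, 0.05, 32

def reward(params):
    # Toy reward: higher is better; no gradients are used anywhere.
    return -np.sum((params - target) ** 2)

for generation in range(200):
    noise = rng.normal(size=(population, theta.size))                    # 1) sample perturbations
    rewards = np.array([reward(theta + sigma * eps) for eps in noise])   # 2) evaluate each member
    shaped = (rewards - rewards.mean()) / (rewards.std() + 1e-8)         #    normalize rewards
    theta += (lr / (population * sigma)) * (noise.T @ shaped)            # 3) move toward high reward

print(f"final reward: {reward(theta):.4f}")
```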
At billion-parameter scale, seven engineering optimizations make ES tractable: noise reproducibility via random seeds, parallel GPU evaluation, in-place perturbation, reward normalization, greedy decoding, decomposed updates, and simplified learning rates.

### Architecture Overview

**Population-Based Search**

- Small fixed population (30 members vs. 10,000+ in prior work) evaluates perturbations in parallel
- Each member: base weights + scaled Gaussian noise reconstructed from a seed
- Parallel evaluation across GPUs; single machines or distributed clusters via Hugging Face Accelerate

**Reward-Driven Parameter Updates**

- Collect a reward signal (scalar, delayed is fine) from each population member
- Normalize rewards to zero mean and unit variance per generation
- Compute a weighted average of perturbations, Δθ ∝ Σ_i (weight_i × noise_i), where each member's weight comes from its normalized reward (or a rank-based utility)
- Apply the learning rate: θ_new = θ_old + α × Δθ

**Memory & Compute Efficiency**

- Noise retrieval: reconstruct perturbations from random seeds on the fly (no storage overhead)
- Layer-level in-place perturbation: modify weights sequentially, evaluate, restore (single copy in memory)
- Batch GPU evaluation: evaluate multiple perturbed models per GPU via threading
- No backpropagation: roughly 50% memory reduction vs. gradient methods

**Stability Properties**

- The ES update weights members by normalized reward or rank-based utility, making it robust to reward outliers and scale
- No explicit KL penalties; ES naturally avoids reward hacking through population diversity
- Variance reduction: 15.5× lower than GRPO across runs on identical problems

### Implementation

#### 1. Environment Setup

Prepare the Python environment and install dependencies for distributed GPU evaluation.

```bash
# Create and activate virtual environment
python3.10 -m venv es_env
source es_env/bin/activate

# Install dependencies (from repository)
pip install -r requirements.txt
# Key packages:
# - torch>=2.0.0
# - transformers>=4.40.0
# - accelerate>=0.27.0 (distributed training)
# - datasets>=2.18.0 (data loading)
# - numpy, pandas (utilities)
```

#### 2. Define the Reward Function

The reward function takes a model and returns a scalar score. ES optimizes this score directly; no gradients are needed.

```python
import torch


def compute_reward(model, tokenizer, examples):
    """
    Evaluate the model on a task and return a scalar reward.

    Args:
        model: LLM instance (already loaded)
        tokenizer: Tokenizer for the model
        examples: List of {input, expected_output} dicts

    Returns:
        float: Aggregated reward (0-1 range recommended)
    """
    correct = 0
    for example in examples:
        # Generate a response with greedy decoding (deterministic rewards)
        inputs = tokenizer(example["input"], return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=256,
                do_sample=False,  # greedy
                pad_token_id=tokenizer.eos_token_id
            )
        # Decode only the newly generated tokens, not the echoed prompt
        prompt_len = inputs["input_ids"].shape[1]
        response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

        # Check correctness (task-specific)
        if is_correct(response, example["expected_output"]):
            correct += 1

    # Return the fraction of correct answers
    return correct / len(examples)


def is_correct(response, expected):
    """Task-specific correctness check."""
    # Example: exact match
    return response.strip() == expected.strip()
```
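Exact string matching is brittle for free-form model output. The sketch below shows one hypothetical, slightly more forgiving checker for numeric answers; `extract_final_number` and `is_correct_numeric` are illustrative helpers, not part of the reference implementation.

```python
import re


def extract_final_number(text):
    """Return the last number appearing in the text, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None


def is_correct_numeric(response, expected):
    """Compare final numeric answers instead of requiring an exact string match."""
    predicted = extract_final_number(response)
    target = extract_final_number(expected)
    return predicted is not None and target is not None and abs(predicted - target) < 1e-6


# Usage: swap this checker into compute_reward for tasks with numeric answers, e.g.
# examples = [{"input": "What is 12 * 7?", "expected_output": "84"}, ...]
# reward = compute_reward(model, tokenizer, examples)
```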
#### 3. Initialize Population and State

Set up the ES state: mean parameters, step size, and population utilities.

```python
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Flatten parameters into a single vector (the ES search state)
params_init = torch.nn.utils.parameters_to_vector(model.parameters()).detach().clone()
num_params = params_init.numel()

# ES hyperparameters
population_size = 30   # small population suffices thanks to the engineering optimizations
learning_rate = 0.001
sigma = 0.017          # standard deviation of perturbations (tune per task)

print(f"Model parameters: {num_params:,} | Population: {population_size}")

# Optional rank-based utility weights (a robust alternative to z-scored rewards;
# if used, assign each member the utility matching its reward rank)
utilities = np.array([
    max(0, np.log(population_size / 2 + 1) - np.log(i + 1))
    for i in range(population_size)
])
utilities /= np.sum(utilities)  # normalize so the utilities sum to 1
```

#### 4. Main ES Loop: Mutation, Evaluation, and Update

Run ES for multiple generations, accumulating rewards and updating parameters.

```python
def es_train_loop(
    model, tokenizer, params_init, reward_fn,
    generations=100, population_size=30,
    sigma=0.017, lr=0.001, seed_base=42, device="cuda"
):
    """
    Main Evolution Strategies training loop.

    Args:
        model: LLM to fine-tune
        tokenizer: Model tokenizer
        params_init: Initial parameter vector
        reward_fn: Function(model, tokenizer) -> float
        generations: Number of ES iterations
        population_size: Population members per iteration
        sigma: Perturbation std dev (controls exploration)
        lr: Natural gradient step size
        seed_base: RNG seed for reproducibility
        device: "cuda" or "cpu"
    """
    params_current = params_init.clone()
    rewards_history = []

    for gen in range(generations):
        gen_rewards = []

        # Generate and evaluate the population
        for member_id in range(population_size):
            # Deterministic noise from a seed (no storage overhead)
            seed = seed_base + gen * population_size + member_id
            np.random.seed(seed)
            noise = torch.tensor(
                np.random.randn(params_init.numel()),
                dtype=params_init.dtype,
                device=device
            )

            # Perturbed parameters
            params_perturbed = params_current + sigma * noise

            # Write perturbed weights into the model in place (layer by layer)
            offset = 0
            for param in model.parameters():
                param_size = param.numel()
                param.data = params_perturbed[offset:offset + param_size].reshape(param.shape)
                offset += param_size

            # Evaluate (reward only, no gradients)
            reward = reward_fn(model, tokenizer)
            gen_rewards.append(reward)

        # Normalize rewards to zero mean / unit variance; per-generation
        # normalization keeps the update independent of raw reward scale
        rewards_array = np.array(gen_rewards)
        rewards_normalized = (rewards_array - np.mean(rewards_array)) / (np.std(rewards_array) + 1e-8)

        # ES update: θ ← θ + (α / (n·σ)) · Σ_i r̃_i · ε_i, regenerating each member's
        # noise ε_i from its seed instead of storing it. Rank-based utilities (see the
        # initialization block) can be substituted for r̃_i for extra outlier robustness.
        param_update = np.zeros(params_init.numel())
        for member_id in range(population_size):
            seed = seed_base + gen * population_size + member_id
            np.random.seed(seed)
            noise_update = np.random.randn(params_init.numel())
            param_update += rewards_normalized[member_id] * noise_update

        params_current = params_current.cpu() + (lr / (population_size * sigma)) * torch.tensor(
            param_update, dtype=params_current.dtype
        )
        params_current = params_current.to(device)

        # Log progress
        best_reward = np.max(gen_rewards)
        mean_reward = np.mean(gen_rewards)
        rewards_history.append(best_reward)

        if (gen + 1) % 10 == 0:
            print(f"Gen {gen+1:3d} | Best: {best_reward:.4f} | "
                  f"Mean: {mean_reward:.4f} | Std: {np.std(gen_rewards):.4f}")

    return params_current, rewards_history
```
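The loop above materializes a full flat copy of the parameter vector for every perturbation. A minimal sketch of the layer-level in-place perturbation described in the Architecture Overview is shown below: it perturbs and restores weights module by module from a seeded generator, so no flat copy is ever built. The helper name and generator-based noise are assumptions for illustration, not the reference implementation, and restoration by subtraction is exact only up to floating-point rounding in low precision.

```python
import torch


def perturb_in_place(model, sigma, seed, sign=+1.0):
    """Add (sign=+1) or remove (sign=-1) seeded Gaussian noise, layer by layer."""
    gen = torch.Generator(device="cpu").manual_seed(seed)
    with torch.no_grad():
        for param in model.parameters():
            # Regenerate the same noise for this layer from the seeded generator
            noise = torch.randn(param.shape, generator=gen, dtype=torch.float32)
            param.add_(sign * sigma * noise.to(param.device, param.dtype))


# Usage per population member:
# perturb_in_place(model, sigma, seed, +1.0)   # apply the perturbation
# reward = reward_fn(model, tokenizer)         # evaluate the perturbed model
# perturb_in_place(model, sigma, seed, -1.0)   # restore the original weights
```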
#### 5. Save and Evaluate Fine-Tuned Model

After training, restore the final parameters and test performance.

```python
def save_finetuned_model(model, params_final, output_path):
    """
    Write the final parameters back into the model and save it to disk.

    Args:
        model: LLM with the architecture to save
        params_final: Final parameter vector from ES
        output_path: Directory to save to (created via model.save_pretrained)
    """
    # Restore final parameters
    offset = 0
    for param in model.parameters():
        param_size = param.numel()
        param.data = params_final[offset:offset + param_size].reshape(param.shape)
        offset += param_size

    # Save to disk
    model.save_pretrained(output_path)
    print(f"Fine-tuned model saved to {output_path}")


# Example usage
if __name__ == "__main__":
    # Load data (example: math reasoning)
    train_examples = [
        {"input": "Solve: 2x + 3 = 7", "expected_output": "x = 2"},
        # ... more examples
    ]

    # Define the reward function
    def reward_fn(m, t):
        return compute_reward(m, t, train_examples[:20])  # subset for speed

    # Run ES fine-tuning
    params_final, history = es_train_loop(
        model, tokenizer, params_init, reward_fn,
        generations=100, population_size=30, sigma=0.017, lr=0.001
    )

    # Save and evaluate
    save_finetuned_model(model, params_final, "./model_finetuned")
```

#### 6. Distributed Multi-GPU Setup (via Accelerate)

For large models, distribute population evaluation across multiple GPUs or machines.

```python
from accelerate import Accelerator

def es_train_distributed(
    model_name, reward_fn,
    generations=100, population_size=30,
    num_processes=2, gpu_threads=15
):
    """
    Multi-GPU ES training skeleton using Hugging Face Accelerate.
    Total parallel evaluations = num_processes * gpu_threads.

    Args:
        model_name: HuggingFace model ID
        reward_fn: Reward function (called per process)
        num_processes: Number of GPUs (or machines)
        gpu_threads: Threads per GPU (model copies per GPU)
    """
    accelerator = Accelerator()

    # Each process loads the model independently
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = accelerator.prepare(model)

    # Each process evaluates a subset of the population
    local_pop_size = population_size // num_processes

    # Main ES loop (same as single-GPU, but rewards aggregated across processes)
    # ...

    print(f"Rank {accelerator.process_index}: evaluating {local_pop_size} members")
```
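The skeleton above leaves the reward aggregation step as an ellipsis. One possible shape for it, sketched below under the assumption that the population divides evenly across processes, is to gather each rank's local reward tensor with `accelerator.gather` so every rank sees the full reward vector, regenerates all noise from seeds, and applies an identical update. The helper name and calling convention are illustrative assumptions, not the reference implementation.

```python
import torch


def gather_generation_rewards(accelerator, local_rewards):
    """
    Collect this process's rewards and return the full population's rewards.

    Assumes each process evaluated the same number of members (population_size
    divisible by the number of processes), so the gathered tensor is well formed.
    """
    local = torch.tensor(local_rewards, dtype=torch.float32, device=accelerator.device)
    all_rewards = accelerator.gather(local)  # concatenated across processes
    return all_rewards.cpu().numpy()


# Inside the distributed loop (sketch):
# local_rewards = [evaluate_member(member_id) for member_id in my_member_ids]
# rewards = gather_generation_rewards(accelerator, local_rewards)
# ...every rank now applies the same seed-reconstructed update to its replica...
```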
### Practical Guidance

#### Hyperparameter Recommendations

| Parameter | Typical Range | Notes |
|-----------|---------------|-------|
| `population_size` | 20–50 | Smaller than RL batch sizes; 30 is the default. Increase for harder tasks. |
| `sigma` (noise std) | 0.01–0.05 | Controls exploration vs. exploitation. Start at 0.017; lower it for final refinement. |
| `learning_rate` | 0.0001–0.01 | Step size for parameter updates. 0.001 is standard; reduce if rewards oscillate. |
| `generations` | 50–500 | Task-dependent; monitor the reward curve to detect a plateau. |
| `seed_base` | any | Ensures reproducibility; increment per run if multiple trials are needed. |

#### When to Use ES Fine-Tuning

- **Reasoning tasks** with sparse, delayed rewards (math, logic, puzzle solving)
- **Heterogeneous base models**: You need a method that works across Qwen, Llama, Mistral, etc.
- **Robustness critical**: Training stability matters more than marginal reward gains
- **Reward specification difficult**: You have outcome labels but no intermediate supervision
- **Small datasets**: ES is sample-efficient (often < 20% of the data RL needs)
- **Long-horizon tasks**: Few intermediate steps; only the final answer is evaluable

#### When NOT to Use ES Fine-Tuning

- **Dense reward signals**: If you have loss gradients or detailed intermediate supervision, gradient-based methods (PPO, DPO) will be faster
- **Dense per-step feedback**: ES searches parameter space with episode-level rewards; when detailed per-step feedback on actions is available, RL exploits it more directly
- **Extreme speed required**: ES needs multiple forward passes per update; if training latency is critical, SFT or other single-pass methods are preferable
- **Highly model-specific optimization**: If you are tuning a single model and have ample compute for gradient tuning, RL may squeeze out extra performance
- **Limited evaluation budget**: Each generation requires `population_size` full model evaluations; if evaluation is expensive (e.g., human-in-the-loop), use smaller populations or RL with importance weighting

#### Common Pitfalls

1. **σ too high or too low**: If the noise is too large, updates become essentially random; if it is too small, the search gets stuck in local optima. Adapt σ per task (start at 0.017, halve it if rewards plateau).
2. **Ignoring reward scale**: Normalizing rewards per generation is critical for stable updates. If rewards range 0–1 on one task and 0–1000 on another, the learning rate would otherwise need retuning; the per-generation z-score normalization handles this.
3. **Small population on hard tasks**: With population_size < 15, the search-direction estimates become noisy. For complex reasoning, use 30+.
4. **Not using greedy decoding**: ES assumes deterministic rewards (same input → same output). Sampling during generation adds noise; use greedy decoding or fix the sampling seed.
5. **Starting from a weak checkpoint**: ES searches locally around the current parameters; if the base model is undertrained, ES may simply reinforce weak behaviors. Start from a strong base model.
6. **Incorrect utility weights**: If you use rank-based utilities, assign them according to each generation's reward ranking; the rank-to-utility mapping is fixed by population size, but the ranking must be recomputed every generation.

### Reference

**Paper**: Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning
**Authors**: Xin Qiu, Yulu Gan, Conor F. Hayes, Qiyao Liang, Elliot Meyerson, Babak Hodjat, Risto Miikkulainen
**ArXiv**: [2509.24372](https://arxiv.org/abs/2509.24372)
**Code**: [GitHub – Cognizant AI Lab](https://github.com/cognizant-ai-lab/es-fine-tuning)
**Cited Baselines**: PPO (Schulman et al., 2017), GRPO (Shao et al., 2024), DPO (Rafailov et al., 2023)