---
name: flex-continuous-agent-evolution
title: "FLEX: Continuous Agent Evolution via Forward Learning from Experience"
version: 0.0.2
engine: skillxiv-v0.0.2-claude-opus-4.6
license: MIT
url: "https://arxiv.org/abs/2511.06449"
keywords: [Reinforcement Learning, In-Context Learning, Experience Libraries, Agent Learning, Gradient-Free Optimization]
description: "Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures, achieving a 23% improvement on mathematical reasoning (AIME25) without gradient-based parameter updates or external training."
---

# Evolve Agents Through Structured Experience Accumulation

Deployed language model agents are typically static: once trained, they do not improve from real-world interactions. FLEX addresses this with gradient-free continuous learning. The agent maintains a structured experience library that records successes, failures, and their contexts. In later interactions it retrieves relevant past experiences, reflects on them, and folds the resulting lessons into its prompt without retraining.

The reported gains are 23% on mathematical reasoning (AIME25), 10% on chemical synthesis, and 14% on protein engineering, all from self-refinement during deployment rather than additional training.

## Core Concept

FLEX treats the improvement of a deployed agent as a problem of structured experience management rather than parameter optimization. The system maintains three components:

1. **Experience Library**: Structured records of past interactions (state, action, outcome, reflection)
2. **Retrieval Mechanism**: Finds relevant precedents for the current problem
3. **Self-Reflection**: The agent analyzes successes and failures and distills the lessons into prompting context

The approach requires no gradient computation, model retraining, or calls to external LLMs during learning; all learning happens through structured reflection at inference time.

## Architecture Overview

- **Experience Capture Module**: Records interactions (problem, solution attempt, outcome, contextual factors)
- **Structured Library**: Organizes experiences by problem domain, difficulty, and technique type
- **Semantic Retrieval**: Finds relevant past experiences using embedding similarity or keyword matching
- **Reflection Engine**: Generates natural language summaries of why solutions succeeded or failed
- **Prompt Augmentation**: Incorporates retrieved experiences into the prompt as in-context examples
- **Performance Tracking**: Measures improvement over time and identifies learning plateaus

## Implementation Steps

**Step 1: Experience Data Structure**

Define a structured format for recording and retrieving agent interactions.

```python
from dataclasses import dataclass
from typing import List, Dict, Any, Optional
import json
from datetime import datetime


@dataclass
class Experience:
    """Single interaction record in the experience library."""
    problem: str                    # Problem description
    solution_attempt: str           # Agent's attempted solution
    ground_truth: str               # Correct answer ('' if unavailable)
    is_correct: bool                # Did the solution succeed?
    domain: str                     # Problem domain (math, coding, etc.)
    difficulty: str                 # Estimated difficulty
    timestamp: str                  # When this occurred (ISO 8601)
    techniques_used: List[str]      # Techniques employed (e.g., 'divide-and-conquer')
    failure_reason: Optional[str]   # Why it failed (None for successes)
    reflection: str                 # Agent's own analysis of the attempt
    metadata: Dict[str, Any]        # Additional context (tokens used, latency, etc.)

    def to_dict(self):
        """Serialize with the dataclass field names as keys, so that
        Experience(**d) in load_from_disk can rebuild the record."""
        return {
            'problem': self.problem,
            'solution_attempt': self.solution_attempt,
            'ground_truth': self.ground_truth,
            'is_correct': self.is_correct,
            'domain': self.domain,
            'difficulty': self.difficulty,
            'timestamp': self.timestamp,
            'techniques_used': self.techniques_used,
            'failure_reason': self.failure_reason,
            'reflection': self.reflection,
            'metadata': self.metadata,
        }


class ExperienceLibrary:
    """Maintains structured experience collection."""

    def __init__(self, storage_path='./experience_library.jsonl'):
        self.storage_path = storage_path
        self.experiences: List[Experience] = []
        self.load_from_disk()

    def add_experience(self, exp: Experience):
        """Record a new experience."""
        self.experiences.append(exp)

        # Append to disk for persistence (one JSON object per line)
        with open(self.storage_path, 'a') as f:
            f.write(json.dumps(exp.to_dict()) + '\n')

    def retrieve_relevant(self, problem: str, domain: str, k=3) -> List[Experience]:
        """
        Find most relevant past experiences for a given problem.

        Args:
            problem: Current problem description
            domain: Problem domain
            k: Number of experiences to retrieve

        Returns:
            relevant_experiences: Top-k similar past experiences
        """
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        # Filter by domain first
        domain_exps = [e for e in self.experiences if e.domain == domain]

        # Too few candidates to rank: return them all
        if len(domain_exps) < k:
            return domain_exps

        # TF-IDF keyword similarity between the current problem and past problems
        # (swap in an embedding model here for semantic retrieval)
        all_problems = [e.problem for e in domain_exps] + [problem]
        vectorizer = TfidfVectorizer(max_features=100)
        tfidf = vectorizer.fit_transform(all_problems)

        # Similarity of the current problem (last row) to all past problems
        similarities = cosine_similarity(tfidf[-1:], tfidf[:-1])[0]

        # Sort by similarity and return top-k, most similar first
        top_indices = similarities.argsort()[-k:][::-1]
        return [domain_exps[i] for i in top_indices]

    def load_from_disk(self):
        """Load experiences from persistent storage."""
        try:
            with open(self.storage_path, 'r') as f:
                for line in f:
                    exp_dict = json.loads(line)
                    self.experiences.append(Experience(**exp_dict))
        except FileNotFoundError:
            pass  # First run: empty library
```

**Step 2: Self-Reflection Engine**

Generate structured reflections on why attempts succeeded or failed.

```python
def generate_reflection(problem: str, solution: str, is_correct: bool,
                        ground_truth: str = None, llm_api=None) -> str:
    """
    Generate agent's reflection on an attempt.

    Args:
        problem: Original problem
        solution: Agent's attempted solution
        is_correct: Whether solution was correct
        ground_truth: Correct solution (if available)
        llm_api: Object with a generate(prompt, max_tokens) method used to
                 write the reflection (typically the agent's own model)

    Returns:
        reflection: Natural language analysis
    """
    if is_correct:
        prompt = f"""Analyze why this solution was correct:

Problem: {problem}
Solution: {solution}

Provide a brief reflection on what techniques made this solution work:"""
    else:
        # Only mention the correct solution when one is actually available
        truth_line = f"Correct solution: {ground_truth}\n" if ground_truth else ""
        prompt = f"""Analyze why this solution failed:

Problem: {problem}
Your solution: {solution}
{truth_line}
Identify the key mistake or misconception:"""

    # Call LLM to generate reflection
    if llm_api:
        return llm_api.generate(prompt, max_tokens=200)

    # Fallback without an LLM: crude keyword heuristics
    if is_correct:
        return "Solution approach was sound"
    if "ValueError" in solution or "TypeError" in solution:
        return "Solution raised a value or type error"
    return "Solution logic was flawed"
```

**Step 3: Experience-Augmented Prompting**

Incorporate retrieved experiences into prompts during inference.

```python
def augment_prompt_with_experiences(
        original_prompt: str,
        relevant_experiences: List[Experience],
        include_failures: bool = True) -> str:
    """
    Create augmented prompt including relevant past experiences.

    Args:
        original_prompt: User's problem description
        relevant_experiences: Retrieved past experiences
        include_failures: Whether to include negative examples

    Returns:
        augmented_prompt: Enhanced prompt with examples
    """
    # Nothing retrieved (e.g. an empty library): leave the prompt untouched
    if not relevant_experiences:
        return original_prompt

    augmented = "You have access to relevant past experiences.\n\n"

    successful_exps = [e for e in relevant_experiences if e.is_correct]
    if successful_exps:
        augmented += "Use insights from successes:\n\n"
    for i, exp in enumerate(successful_exps):
        augmented += f"Example {i+1} - Success:\n"
        augmented += f"Problem: {exp.problem}\n"
        augmented += f"Solution: {exp.solution_attempt}\n"
        augmented += f"Key insight: {exp.reflection}\n\n"

    if include_failures:
        failed_exps = [e for e in relevant_experiences if not e.is_correct]
        if failed_exps:
            augmented += "Learn from past mistakes:\n\n"
        for i, exp in enumerate(failed_exps):
            augmented += f"Past Mistake {i+1}:\n"
            augmented += f"Problem: {exp.problem}\n"
            augmented += f"Failed attempt: {exp.solution_attempt}\n"
            # failure_reason may be None; fall back to the stored reflection
            augmented += f"Why it failed: {exp.failure_reason or exp.reflection}\n\n"

    augmented += f"Now solve this new problem:\n{original_prompt}"
    return augmented
```

**Step 4: Agent Deployment Loop**

Main loop integrating experience capture and retrieval during deployment.

```python
class DeployedAgent:
    """LLM agent that learns from deployment experiences."""

    def __init__(self, base_model, experience_library, domain='general'):
        self.model = base_model          # Must expose generate(prompt, max_tokens)
        self.library = experience_library
        self.domain = domain
        self.success_count = 0           # Validated successes
        self.total_attempts = 0          # Validated attempts only

    def solve_problem(self, problem: str, ground_truth: str = None) -> Dict[str, Any]:
        """
        Solve a problem, recording experience for future learning.

        Args:
            problem: Problem description
            ground_truth: Correct answer (if available for offline validation)

        Returns:
            result: {solution, is_correct, experience_recorded}
        """
        # Step 1: Retrieve relevant past experiences
        relevant_exps = self.library.retrieve_relevant(problem, self.domain, k=3)

        # Step 2: Augment prompt with relevant experiences
        augmented_prompt = augment_prompt_with_experiences(
            problem, relevant_exps, include_failures=True
        )

        # Step 3: Generate solution
        solution = self.model.generate(augmented_prompt, max_tokens=2048)

        # Step 4: Validate (only possible when ground truth is available)
        validated = bool(ground_truth)
        is_correct = False
        if validated:
            is_correct = self._validate_solution(solution, ground_truth)
            self.total_attempts += 1
            if is_correct:
                self.success_count += 1

        # Step 5: Generate reflection with the agent's own model.
        # Unvalidated attempts get a neutral note instead of a failure analysis.
        if validated:
            reflection = generate_reflection(
                problem, solution, is_correct, ground_truth, llm_api=self.model
            )
            failure_reason = (
                None if is_correct else self._analyze_failure(solution, ground_truth)
            )
        else:
            reflection = "Outcome not validated (no ground truth)"
            failure_reason = "Not validated"

        # Step 6: Record experience
        experience = Experience(
            problem=problem,
            solution_attempt=solution,
            ground_truth=ground_truth or '',
            is_correct=is_correct,
            domain=self.domain,
            difficulty=self._estimate_difficulty(problem),
            timestamp=datetime.now().isoformat(),
            techniques_used=self._extract_techniques(solution),
            failure_reason=failure_reason,
            reflection=reflection,
            metadata={
                'model': getattr(self.model, 'name', 'unknown'),
                'num_examples': len(relevant_exps),
                'validated': validated,
                # Guard against division by zero before the first validated attempt
                'accuracy_rate': (
                    self.success_count / self.total_attempts
                    if self.total_attempts else None
                ),
            }
        )

        self.library.add_experience(experience)

        return {
            'solution': solution,
            'is_correct': is_correct,
            'experience_recorded': True
        }

    def _validate_solution(self, solution: str, ground_truth: str) -> bool:
        """Check if solution matches ground truth."""
        # Simple string matching; extend for domain-specific validation
        return solution.strip() == ground_truth.strip()

    def _analyze_failure(self, solution: str, ground_truth: str) -> str:
        """Identify type of failure with simple keyword heuristics."""
        if "ValueError" in solution or "TypeError" in solution:
            return "Value/Type Error"
        elif len(solution.strip()) == 0:
            return "No Output Generated"
        else:
            return "Incorrect Logic"

    def _estimate_difficulty(self, problem: str) -> str:
        """Estimate problem difficulty from word count (a crude proxy)."""
        word_count = len(problem.split())
        if word_count < 50:
            return "easy"
        elif word_count < 200:
            return "medium"
        else:
            return "hard"

    def _extract_techniques(self, solution: str) -> List[str]:
        """Identify reasoning techniques by keyword in the solution text."""
        text = solution.lower()
        techniques = []
        if "divide" in text or "split" in text:
            techniques.append("divide-and-conquer")
        if "recursion" in text:
            techniques.append("recursion")
        if "dynamic" in text:
            techniques.append("dynamic-programming")
        if "greedy" in text:
            techniques.append("greedy")
        return techniques
```

**Step 5: Continuous Monitoring and Adaptation**

Track learning over time and identify when improvements plateau.

```python
import time


def monitor_agent_learning(agent: DeployedAgent, window_size: int = 100):
    """
    Monitor improvement trends in agent performance.

    Args:
        agent: Deployed agent instance
        window_size: Number of recent attempts to analyze

    Yields:
        metrics: Performance statistics
    """
    while True:
        recent_exps = agent.library.experiences[-window_size:]

        if len(recent_exps) > 0:
            success_rate = sum(1 for e in recent_exps if e.is_correct) / len(recent_exps)
            avg_reflection_length = sum(
                len(e.reflection.split()) for e in recent_exps
            ) / len(recent_exps)

            metrics = {
                'success_rate': success_rate,
                'sample_count': agent.total_attempts,
                'improvement': success_rate,  # Subtract a baseline rate if one is available
                'avg_reflection_length': avg_reflection_length,
                'unique_domains': len(set(e.domain for e in recent_exps))
            }

            yield metrics

            # High success rate: the current domain may no longer be challenging
            if success_rate > 0.9:
                print("Agent has reached high performance; consider expanding domain")

        time.sleep(60)  # Monitor every minute
```

## Practical Guidance

**When to Use FLEX:**

- Deployed agents handling diverse, evolving tasks, where continuous learning is essential
- Scenarios where retraining is expensive or time-consuming
- Systems where interpretability matters (experience reflections are human-readable)

**When NOT to Use:**

- Static batch learning (no deployment or evaluation feedback)
- Tasks with no clear success/failure signal (reflection requires outcome validation)
- Privacy-sensitive applications (the experience library stores past interactions)

**Hyperparameters and Configuration:**

- Retrieval k: 3-5 examples per problem (balances context size against diversity)
- Reflection model: Use the same LLM or a smaller model to reduce cost
- Library retention: Keep all experiences initially; prune redundant ones as the library grows
- Domain granularity: Keep separate libraries for distinct problem types (math, code, reasoning)

**Pitfalls to Avoid:**

1. **Stale experiences** - Experiences from early deployment may become outdated; prioritize recent successes
2. **Confirmation bias** - Over-weighting successful experiences teaches the wrong patterns; include failures as well
3. **Library bloat** - Experiences accumulate without bound; implement deduplication or archival strategies
4. **No validation** - Without ground truth, attempts misclassified as successes pollute the library; add validation when possible

---

Reference: https://arxiv.org/abs/2511.06449
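
The library-retention and pitfall notes in Practical Guidance (pruning redundant experiences, preferring recent ones) can be sketched as a small pruning pass. This is an illustrative sketch only, not the method from the FLEX paper: the `Record` type, the `prune_library` function, and the `similarity_threshold` and `max_size` parameters are all hypothetical names introduced here, and near-duplicates are detected with the standard library's `difflib.SequenceMatcher` rather than any technique the source specifies.

```python
from dataclasses import dataclass
from datetime import datetime
from difflib import SequenceMatcher
from typing import List


@dataclass
class Record:
    """Hypothetical minimal stand-in for an experience entry."""
    problem: str
    is_correct: bool
    timestamp: str  # ISO 8601, as produced by datetime.isoformat()


def prune_library(records: List[Record],
                  similarity_threshold: float = 0.9,
                  max_size: int = 1000) -> List[Record]:
    """Drop near-duplicate problems, keeping the newest copy, then cap the size."""
    # Newest first, so the first record seen for a given problem is the one kept
    newest_first = sorted(
        records,
        key=lambda r: datetime.fromisoformat(r.timestamp),
        reverse=True,
    )
    kept: List[Record] = []
    for rec in newest_first:
        # A duplicate has the same outcome and a nearly identical problem text;
        # a success and a failure on the same problem are both retained
        is_duplicate = any(
            rec.is_correct == k.is_correct
            and SequenceMatcher(None, rec.problem, k.problem).ratio()
            >= similarity_threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(rec)
    # Oldest entries fall off the end when the cap is exceeded
    return kept[:max_size]


records = [
    Record("Solve x^2 - 4 = 0", True, "2025-01-01T00:00:00"),
    Record("Solve x^2 - 4 = 0.", True, "2025-03-01T00:00:00"),
    Record("Prove that sqrt(2) is irrational", False, "2025-02-01T00:00:00"),
]
pruned = prune_library(records)
```

Here the two near-identical "Solve" records collapse into the newer one, while the failed proof attempt is kept. The pairwise comparison is quadratic in library size, so a larger library would likely need a cheaper first pass (for example, bucketing by domain) before comparing text.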