---
name: agentdb-reinforcement-learning-training
description: Train self-learning agents with AgentDB's 9 reinforcement learning algorithms (Q-Learning, SARSA, DQN, Actor-Critic, PPO, Decision Transformer, and more), from environment setup through training, validation, and production deployment.
allowed-tools: Read, Write, Edit, Task, TodoWrite, Glob, Grep
---

# AgentDB Reinforcement Learning Training

---

## LIBRARY-FIRST PROTOCOL (MANDATORY)

**Before writing ANY code, you MUST check:**

### Step 1: Library Catalog
- Location: `.claude/library/catalog.json`
- If match >70%: REUSE or ADAPT

### Step 2: Patterns Guide
- Location: `.claude/docs/inventories/LIBRARY-PATTERNS-GUIDE.md`
- If pattern exists: FOLLOW documented approach

### Step 3: Existing Projects
- Location: `D:\Projects\*`
- If found: EXTRACT and adapt

### Decision Matrix

| Match | Action |
|-------|--------|
| Library >90% | REUSE directly |
| Library 70-90% | ADAPT minimally |
| Pattern exists | FOLLOW pattern |
| In project | EXTRACT |
| No match | BUILD (add to library after) |

---

## Overview

Train AI learning plugins with AgentDB's 9 reinforcement learning algorithms, including Decision Transformer, Q-Learning, SARSA, Actor-Critic, PPO, and more. Build self-learning agents, implement reinforcement learning, and optimize agent behavior through experience.

## When to Use This Skill

Use this skill when you need to:
- Train autonomous agents that learn from experience
- Implement reinforcement learning systems
- Optimize agent behavior through trial and error
- Build self-improving AI systems
- Deploy RL agents in production environments
- Benchmark and compare RL algorithms

## Available RL Algorithms

1. **Q-Learning** - Value-based, off-policy
2. **SARSA** - Value-based, on-policy
3. **Deep Q-Network (DQN)** - Deep RL with experience replay
4. **Actor-Critic** - Policy gradient with value baseline
5. **Proximal Policy Optimization (PPO)** - Trust region policy optimization
6. **Decision Transformer** - Offline RL with transformers
7. **Advantage Actor-Critic (A2C)** - Synchronous advantage estimation
8. **Twin Delayed DDPG (TD3)** - Continuous control
9. **Soft Actor-Critic (SAC)** - Maximum entropy RL

## SOP Framework: 5-Phase RL Training Deployment

### Phase 1: Initialize Learning Environment (1-2 hours)

**Objective:** Set up AgentDB learning infrastructure with environment configuration

**Agent:** ml-developer

**Steps:**

1. **Install AgentDB Learning Module**

```bash
npm install agentdb-learning@latest
npm install @agentdb/rl-algorithms @agentdb/environments
```

2. **Initialize learning database**

```typescript
import { AgentDB, LearningPlugin } from 'agentdb-learning';

const learningDB = new AgentDB({
  name: 'rl-training-db',
  dimensions: 512, // State embedding dimension
  learning: {
    enabled: true,
    persistExperience: true,
    replayBufferSize: 100000
  }
});

await learningDB.initialize();

// Create learning plugin
const learningPlugin = new LearningPlugin({
  database: learningDB,
  algorithms: ['q-learning', 'dqn', 'ppo', 'actor-critic'],
  config: {
    batchSize: 64,
    learningRate: 0.001,
    discountFactor: 0.99,
    explorationRate: 1.0,
    explorationDecay: 0.995
  }
});

await learningPlugin.initialize();
```
3. **Define environment**

```typescript
import { Environment } from '@agentdb/environments';

const environment = new Environment({
  name: 'grid-world',
  stateSpace: {
    type: 'continuous',
    shape: [10, 10],
    bounds: [[0, 10], [0, 10]]
  },
  actionSpace: {
    type: 'discrete',
    actions: ['up', 'down', 'left', 'right']
  },
  rewardFunction: (state, action, nextState) => {
    // Distance-to-goal reward
    const goalDistance = Math.sqrt(
      Math.pow(nextState[0] - 9, 2) + Math.pow(nextState[1] - 9, 2)
    );
    return -goalDistance + (goalDistance === 0 ? 100 : 0);
  },
  terminalCondition: (state) => {
    return state[0] === 9 && state[1] === 9; // Reached goal
  }
});

await environment.initialize();
```

4. **Setup monitoring**

```typescript
const monitor = learningPlugin.createMonitor({
  metrics: ['reward', 'loss', 'exploration-rate', 'episode-length'],
  logInterval: 100, // Log every 100 episodes
  saveCheckpoints: true,
  checkpointInterval: 1000
});

monitor.on('episode-complete', (episode) => {
  console.log('Episode:', episode.number, 'Reward:', episode.totalReward);
});
```

**Memory Pattern:**

```typescript
await agentDB.memory.store('agentdb/learning/environment', {
  name: environment.name,
  stateSpace: environment.stateSpace,
  actionSpace: environment.actionSpace,
  initialized: Date.now()
});
```

**Validation:**
- Learning database initialized
- Environment configured and tested
- Monitor capturing metrics
- Configuration stored in memory

### Phase 2: Configure RL Algorithm (1-2 hours)

**Objective:** Select and configure RL algorithm for the learning task

**Agent:** ml-developer

**Steps:**

1. **Select algorithm**

```typescript
// Example: Deep Q-Network (DQN)
const dqnAgent = learningPlugin.createAgent({
  algorithm: 'dqn',
  config: {
    networkArchitecture: {
      layers: [
        { type: 'dense', units: 128, activation: 'relu' },
        { type: 'dense', units: 128, activation: 'relu' },
        { type: 'dense', units: environment.actionSpace.size, activation: 'linear' }
      ]
    },
    learningRate: 0.001,
    batchSize: 64,
    replayBuffer: {
      size: 100000,
      prioritized: true,
      alpha: 0.6,
      beta: 0.4
    },
    targetNetwork: {
      updateFrequency: 1000,
      tauSync: 0.001 // Soft update
    },
    exploration: {
      initial: 1.0,
      final: 0.01,
      decay: 0.995
    },
    training: {
      startAfter: 1000, // Start training after 1000 experiences
      updateFrequency: 4
    }
  }
});

await dqnAgent.initialize();
```

2. **Configure hyperparameters**

```typescript
const hyperparameters = {
  // Learning parameters
  learningRate: 0.001,
  discountFactor: 0.99, // Gamma
  batchSize: 64,

  // Exploration
  epsilonStart: 1.0,
  epsilonEnd: 0.01,
  epsilonDecay: 0.995,

  // Experience replay
  replayBufferSize: 100000,
  minReplaySize: 1000,
  prioritizedReplay: true,

  // Training
  maxEpisodes: 10000,
  maxStepsPerEpisode: 1000,
  targetUpdateFrequency: 1000,

  // Evaluation
  evalFrequency: 100,
  evalEpisodes: 10
};

dqnAgent.setHyperparameters(hyperparameters);
```

3. **Setup experience replay**

```typescript
import { PrioritizedReplayBuffer } from '@agentdb/rl-algorithms';

const replayBuffer = new PrioritizedReplayBuffer({
  capacity: 100000,
  alpha: 0.6, // Prioritization exponent
  beta: 0.4, // Importance sampling
  betaIncrement: 0.001,
  epsilon: 0.01 // Small constant for stability
});

dqnAgent.setReplayBuffer(replayBuffer);
```
4. **Configure training loop**

```typescript
const trainingConfig = {
  episodes: 10000,
  stepsPerEpisode: 1000,
  warmupSteps: 1000,
  trainFrequency: 4,
  targetUpdateFrequency: 1000,
  saveFrequency: 1000,
  evalFrequency: 100,
  earlyStoppingPatience: 500,
  earlyStoppingThreshold: 0.01
};

dqnAgent.setTrainingConfig(trainingConfig);
```

**Memory Pattern:**

```typescript
await agentDB.memory.store('agentdb/learning/algorithm-config', {
  algorithm: 'dqn',
  hyperparameters: hyperparameters,
  trainingConfig: trainingConfig,
  configured: Date.now()
});
```

**Validation:**
- Algorithm selected and configured
- Hyperparameters validated
- Replay buffer initialized
- Training config set

### Phase 3: Train Agents (3-4 hours)

**Objective:** Execute training iterations and optimize agent behavior

**Agent:** safla-neural

**Steps:**

1. **Start training loop**

```typescript
async function trainAgent() {
  console.log('Starting RL training...');

  const trainingStats = {
    episodes: [],
    totalReward: [],
    episodeLength: [],
    loss: [],
    explorationRate: []
  };

  for (let episode = 0; episode < trainingConfig.episodes; episode++) {
    let state = await environment.reset();
    let episodeReward = 0;
    let episodeLength = 0;
    let episodeLoss = 0;

    for (let step = 0; step < trainingConfig.stepsPerEpisode; step++) {
      // Select action
      const action = await dqnAgent.selectAction(state, { explore: true });

      // Execute action
      const { nextState, reward, done } = await environment.step(action);

      // Store experience
      await dqnAgent.storeExperience({ state, action, reward, nextState, done });

      // Train if enough experiences
      if (dqnAgent.canTrain()) {
        const loss = await dqnAgent.train();
        episodeLoss += loss;
      }

      episodeReward += reward;
      episodeLength += 1;
      state = nextState;

      if (done) break;
    }

    // Update target network
    if (episode % trainingConfig.targetUpdateFrequency === 0) {
      await dqnAgent.updateTargetNetwork();
    }

    // Decay exploration
    dqnAgent.decayExploration();

    // Log progress
    trainingStats.episodes.push(episode);
    trainingStats.totalReward.push(episodeReward);
    trainingStats.episodeLength.push(episodeLength);
    trainingStats.loss.push(episodeLoss / episodeLength);
    trainingStats.explorationRate.push(dqnAgent.getExplorationRate());

    if (episode % 100 === 0) {
      console.log(`Episode ${episode}:`, {
        reward: episodeReward.toFixed(2),
        length: episodeLength,
        loss: (episodeLoss / episodeLength).toFixed(4),
        epsilon: dqnAgent.getExplorationRate().toFixed(3)
      });
    }

    // Save checkpoint
    if (episode % trainingConfig.saveFrequency === 0) {
      await dqnAgent.save(`checkpoint-${episode}`);
    }

    // Evaluate (evaluateAgent is defined in Phase 4 and returns summary statistics)
    if (episode % trainingConfig.evalFrequency === 0) {
      const { meanReward } = await evaluateAgent(dqnAgent, environment, hyperparameters.evalEpisodes);
      console.log(`Evaluation at episode ${episode}: ${meanReward.toFixed(2)}`);
    }

    // Early stopping (checkEarlyStopping is sketched after this phase's validation checklist)
    if (checkEarlyStopping(trainingStats, episode)) {
      console.log('Early stopping triggered');
      break;
    }
  }

  return trainingStats;
}

const trainingStats = await trainAgent();
```
2. **Monitor training progress**

```typescript
monitor.on('training-update', (stats) => {
  // Calculate moving averages
  const window = 100;
  const recentRewards = stats.totalReward.slice(-window);
  const avgReward = recentRewards.reduce((a, b) => a + b, 0) / recentRewards.length;

  // Store metrics
  agentDB.memory.store('agentdb/learning/training-progress', {
    episode: stats.episodes[stats.episodes.length - 1],
    avgReward: avgReward,
    explorationRate: stats.explorationRate[stats.explorationRate.length - 1],
    timestamp: Date.now()
  });

  // Plot learning curve (if visualization enabled)
  if (monitor.visualization) {
    monitor.plot('reward-curve', stats.episodes, stats.totalReward);
    monitor.plot('loss-curve', stats.episodes, stats.loss);
  }
});
```

3. **Handle convergence**

```typescript
function checkConvergence(stats, windowSize = 100, threshold = 0.01) {
  if (stats.totalReward.length < windowSize * 2) {
    return false;
  }

  const recent = stats.totalReward.slice(-windowSize);
  const previous = stats.totalReward.slice(-windowSize * 2, -windowSize);

  const recentAvg = recent.reduce((a, b) => a + b, 0) / recent.length;
  const previousAvg = previous.reduce((a, b) => a + b, 0) / previous.length;

  const improvement = (recentAvg - previousAvg) / Math.abs(previousAvg);
  return improvement < threshold;
}
```

4. **Save trained model**

```typescript
await dqnAgent.save('trained-agent-final', {
  includeReplayBuffer: false,
  includeOptimizer: false,
  metadata: {
    trainingStats: trainingStats,
    hyperparameters: hyperparameters,
    finalReward: trainingStats.totalReward[trainingStats.totalReward.length - 1]
  }
});

console.log('Training complete. Model saved.');
```

**Memory Pattern:**

```typescript
await agentDB.memory.store('agentdb/learning/training-results', {
  algorithm: 'dqn',
  episodes: trainingStats.episodes.length,
  finalReward: trainingStats.totalReward[trainingStats.totalReward.length - 1],
  converged: checkConvergence(trainingStats),
  modelPath: 'trained-agent-final',
  timestamp: Date.now()
});
```

**Validation:**
- Training completed or converged
- Reward curve shows improvement
- Model saved successfully
- Training stats stored
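The training loop in step 1 calls a `checkEarlyStopping` helper that this skill does not define. A minimal sketch, assuming patience-based stopping on the 100-episode moving-average reward and reusing the `earlyStoppingPatience` and `earlyStoppingThreshold` values from the training config (the exact logic is an assumption, not an agentdb-learning API):

```typescript
// Hypothetical helper (not part of agentdb-learning): stop when the 100-episode
// moving-average reward has not improved by more than earlyStoppingThreshold
// (relative) for earlyStoppingPatience consecutive episodes.
let bestAvgReward = Number.NEGATIVE_INFINITY;
let episodesSinceImprovement = 0;

function checkEarlyStopping(stats: { totalReward: number[] }, _episode: number): boolean {
  const window = 100;
  if (stats.totalReward.length < window) return false;

  const recent = stats.totalReward.slice(-window);
  const avgReward = recent.reduce((a, b) => a + b, 0) / window;

  // An "improvement" must beat the best average seen so far by the relative threshold
  const improvementNeeded = Math.abs(bestAvgReward) * trainingConfig.earlyStoppingThreshold;
  if (!Number.isFinite(bestAvgReward) || avgReward - bestAvgReward > improvementNeeded) {
    bestAvgReward = avgReward;
    episodesSinceImprovement = 0;
  } else {
    episodesSinceImprovement += 1;
  }

  return episodesSinceImprovement >= trainingConfig.earlyStoppingPatience;
}
```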
### Phase 4: Validate Performance (1-2 hours)

**Objective:** Benchmark trained agent and validate performance

**Agent:** performance-benchmarker

**Steps:**

1. **Load trained agent**

```typescript
const trainedAgent = await learningPlugin.loadAgent('trained-agent-final');
```

2. **Run evaluation episodes**

```typescript
async function evaluateAgent(agent, env, numEpisodes = 100) {
  const results = {
    rewards: [],
    episodeLengths: [],
    successRate: 0
  };

  for (let i = 0; i < numEpisodes; i++) {
    let state = await env.reset();
    let episodeReward = 0;
    let episodeLength = 0;
    let success = false;

    for (let step = 0; step < 1000; step++) {
      const action = await agent.selectAction(state, { explore: false });
      const { nextState, reward, done } = await env.step(action);

      episodeReward += reward;
      episodeLength += 1;
      state = nextState;

      if (done) {
        success = env.isSuccessful(state);
        break;
      }
    }

    results.rewards.push(episodeReward);
    results.episodeLengths.push(episodeLength);
    if (success) results.successRate += 1;
  }

  results.successRate /= numEpisodes;

  return {
    meanReward: results.rewards.reduce((a, b) => a + b, 0) / results.rewards.length,
    stdReward: calculateStd(results.rewards), // helper sketched after this phase
    meanLength: results.episodeLengths.reduce((a, b) => a + b, 0) / results.episodeLengths.length,
    successRate: results.successRate,
    results: results
  };
}

const evalResults = await evaluateAgent(trainedAgent, environment, 100);
console.log('Evaluation results:', evalResults);
```

3. **Compare with baseline**

```typescript
// Random policy baseline
const randomAgent = learningPlugin.createAgent({ algorithm: 'random' });
const randomResults = await evaluateAgent(randomAgent, environment, 100);

// Calculate improvement
const improvement = {
  rewardImprovement: (evalResults.meanReward - randomResults.meanReward) / Math.abs(randomResults.meanReward),
  lengthImprovement: (randomResults.meanLength - evalResults.meanLength) / randomResults.meanLength,
  successImprovement: evalResults.successRate - randomResults.successRate
};

console.log('Improvement over random:', improvement);
```

4. **Run comprehensive benchmarks**

```typescript
const benchmarks = {
  performanceMetrics: {
    meanReward: evalResults.meanReward,
    stdReward: evalResults.stdReward,
    successRate: evalResults.successRate,
    meanEpisodeLength: evalResults.meanLength
  },
  algorithmComparison: {
    dqn: evalResults,
    random: randomResults,
    improvement: improvement
  },
  inferenceTiming: {
    actionSelection: 0,
    totalEpisode: 0
  }
};

// Measure inference speed
const timingTrials = 1000;
const startTime = performance.now();

for (let i = 0; i < timingTrials; i++) {
  const state = await environment.randomState();
  await trainedAgent.selectAction(state, { explore: false });
}

const endTime = performance.now();
benchmarks.inferenceTiming.actionSelection = (endTime - startTime) / timingTrials;

await agentDB.memory.store('agentdb/learning/benchmarks', benchmarks);
```

**Memory Pattern:**

```typescript
await agentDB.memory.store('agentdb/learning/validation', {
  evaluated: true,
  meanReward: evalResults.meanReward,
  successRate: evalResults.successRate,
  improvement: improvement,
  timestamp: Date.now()
});
```

**Validation:**
- Evaluation completed (100 episodes)
- Mean reward exceeds threshold
- Success rate acceptable
- Improvement over baseline demonstrated
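The evaluation function in step 2 uses a `calculateStd` helper that this skill does not define. A minimal sketch (standard deviation of episode rewards; an assumed helper, not an AgentDB API):

```typescript
// Hypothetical helper: standard deviation of a list of episode rewards.
function calculateStd(values: number[]): number {
  if (values.length === 0) return 0;
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance = values.reduce((sum, v) => sum + Math.pow(v - mean, 2), 0) / values.length;
  return Math.sqrt(variance);
}
```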
### Phase 5: Deploy Trained Agents (1-2 hours)

**Objective:** Deploy trained agents to production environment

**Agent:** ml-developer

**Steps:**

1. **Export production model**

```typescript
await trainedAgent.export('production-agent', {
  format: 'onnx', // or 'tensorflowjs', 'pytorch'
  optimize: true,
  quantize: 'int8', // Quantization for faster inference
  includeMetadata: true
});
```

2. **Create inference API**

```typescript
import express from 'express';

const app = express();
app.use(express.json());

// Load production agent
const productionAgent = await learningPlugin.loadAgent('production-agent');

app.post('/api/predict', async (req, res) => {
  try {
    const { state } = req.body;

    const action = await productionAgent.selectAction(state, {
      explore: false,
      returnProbabilities: true
    });

    res.json({
      action: action.action,
      probabilities: action.probabilities,
      confidence: action.confidence
    });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

app.listen(3000, () => {
  console.log('RL agent API running on port 3000');
});
```

3. **Setup monitoring**

```typescript
import { ProductionMonitor } from '@agentdb/monitoring';

const prodMonitor = new ProductionMonitor({
  agent: productionAgent,
  metrics: ['inference-latency', 'action-distribution', 'reward-feedback'],
  alerting: {
    latencyThreshold: 100, // ms
    anomalyDetection: true
  }
});

await prodMonitor.start();
```

4. **Create deployment pipeline**

```typescript
const deploymentPipeline = {
  stages: [
    {
      name: 'validation',
      steps: [
        'Load trained model',
        'Run validation suite',
        'Check performance metrics',
        'Verify inference speed'
      ]
    },
    {
      name: 'export',
      steps: [
        'Export to production format',
        'Optimize model',
        'Quantize weights',
        'Package artifacts'
      ]
    },
    {
      name: 'deployment',
      steps: [
        'Deploy to staging',
        'Run smoke tests',
        'Deploy to production',
        'Monitor performance'
      ]
    }
  ]
};

await agentDB.memory.store('agentdb/learning/deployment-pipeline', deploymentPipeline);
```

**Memory Pattern:**

```typescript
await agentDB.memory.store('agentdb/learning/production', {
  deployed: true,
  modelPath: 'production-agent',
  apiEndpoint: 'http://localhost:3000/api/predict',
  monitoring: true,
  timestamp: Date.now()
});
```

**Validation:**
- Model exported successfully
- API running and responding
- Monitoring active
- Deployment pipeline documented

## Integration Scripts

### Complete Training Script

```bash
#!/bin/bash
# train-rl-agent.sh

set -e

echo "AgentDB RL Training Script"
echo "=========================="

# Phase 1: Initialize
echo "Phase 1: Initializing learning environment..."
npm install agentdb-learning @agentdb/rl-algorithms

# Phase 2: Configure
echo "Phase 2: Configuring algorithm..."
node -e "require('./config-algorithm.js')"

# Phase 3: Train
echo "Phase 3: Training agent..."
node -e "require('./train-agent.js')"

# Phase 4: Validate
echo "Phase 4: Validating performance..."
node -e "require('./evaluate-agent.js')"

# Phase 5: Deploy
echo "Phase 5: Deploying to production..."
node -e "require('./deploy-agent.js')"

echo "Training complete!"
```

### Quick Start Script

```typescript
// quickstart-rl.ts
import { setupRLTraining } from './setup';

async function quickStart() {
  console.log('Starting RL training quick setup...');

  // Setup
  const { learningDB, environment, agent } = await setupRLTraining({
    algorithm: 'dqn',
    environment: 'grid-world',
    episodes: 1000
  });

  // Train
  console.log('Training agent...');
  const stats = await agent.train(environment, {
    episodes: 1000,
    logInterval: 100
  });

  // Evaluate
  console.log('Evaluating agent...');
  const results = await agent.evaluate(environment, { episodes: 100 });
  console.log('Results:', results);

  // Save
  await agent.save('quickstart-agent');
  console.log('Quick start complete!');
}

quickStart().catch(console.error);
```
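The quick-start script imports `setupRLTraining` from a local `./setup` module that this skill does not define. A minimal sketch of what that module might contain, composed from the Phase 1-2 snippets above; the option names, defaults, and the name-only `Environment` constructor are assumptions, not a documented AgentDB API:

```typescript
// setup.ts (hypothetical local module, assembled from the Phase 1-2 examples)
import { AgentDB, LearningPlugin } from 'agentdb-learning';
import { Environment } from '@agentdb/environments';

interface SetupOptions {
  algorithm: string;   // e.g. 'dqn'
  environment: string; // e.g. 'grid-world'
  episodes: number;
}

export async function setupRLTraining(options: SetupOptions) {
  // Learning database (mirrors Phase 1, step 2)
  const learningDB = new AgentDB({
    name: 'rl-training-db',
    dimensions: 512,
    learning: { enabled: true, persistExperience: true, replayBufferSize: 100000 }
  });
  await learningDB.initialize();

  // Learning plugin restricted to the requested algorithm
  const learningPlugin = new LearningPlugin({
    database: learningDB,
    algorithms: [options.algorithm],
    config: { batchSize: 64, learningRate: 0.001, discountFactor: 0.99 }
  });
  await learningPlugin.initialize();

  // Named environment (assumes the package ships a preset under this name)
  const environment = new Environment({ name: options.environment });
  await environment.initialize();

  // Agent for the requested algorithm (mirrors Phase 2, step 1)
  const agent = learningPlugin.createAgent({ algorithm: options.algorithm });
  await agent.initialize();

  return { learningDB, learningPlugin, environment, agent };
}
```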
## Evidence-Based Success Criteria

1. **Training Convergence (Self-Consistency)**
   - Reward curve stabilizes
   - Moving average improvement < 1%
   - Agent achieves consistent performance

2. **Performance Benchmarks (Quantitative)**
   - Mean reward exceeds baseline by 50%
   - Success rate > 80%
   - Inference time < 10ms per action

3. **Algorithm Validation (Chain-of-Verification)**
   - Hyperparameters validated
   - Exploration-exploitation balanced
   - Experience replay functioning

4. **Production Readiness (Multi-Agent Consensus)**
   - Model exported successfully
   - API responds within latency threshold
   - Monitoring active and alerting
   - Deployment pipeline documented

## MCP Requirements

This skill operates using AgentDB's npm package and API only. No additional MCP servers are required.

All AgentDB learning plugin operations are performed through:
- npm CLI: `npx agentdb@latest create-plugin`
- TypeScript/JavaScript API: `import { AgentDB, LearningPlugin } from 'agentdb-learning'`

## Additional Resources

- AgentDB Learning Documentation: https://agentdb.dev/docs/learning
- RL Algorithms Guide: https://agentdb.dev/docs/rl-algorithms
- Training Best Practices: https://agentdb.dev/docs/training
- Production Deployment: https://agentdb.dev/docs/deployment

## Core Principles

AgentDB Reinforcement Learning Training operates on 3 fundamental principles:

### Principle 1: Experience Replay - Learn from Past Mistakes Without Catastrophic Forgetting

Naive online learning suffers from correlation bias (the agent learns from consecutive, similar experiences) and catastrophic forgetting (new experiences overwrite old knowledge). Experience replay stores diverse experiences in a buffer and samples from it to break correlations; prioritized replay additionally oversamples high-error transitions to focus learning on difficult scenarios.

In practice:
- Maintain a replay buffer with 100K+ experiences to ensure diverse sampling across the state space
- Use prioritized replay (alpha = 0.6) to oversample transitions with high TD-error, accelerating convergence 2-3x
- Apply importance sampling correction (beta annealed from 0.4 to 1.0) to prevent bias from non-uniform sampling
- Start training only after 1000+ warm-up experiences to ensure sufficient diversity

### Principle 2: Exploration-Exploitation Balance - Decay Epsilon to Shift from Discovery to Refinement

The exploration-exploitation dilemma is fundamental to RL: explore too much and you waste time on suboptimal actions; exploit too much and you miss better strategies. Epsilon-greedy with decay starts high (explore aggressively to map the state space) and decays toward zero (exploit the learned policy once confident).

In practice:
- Initialize epsilon = 1.0 to ensure comprehensive state-space exploration in early episodes
- Decay epsilon geometrically (multiply by 0.995 each episode) to smoothly transition from exploration to exploitation
- Set epsilon_min = 0.01 to maintain 1% exploration indefinitely, preventing policy stagnation when the environment changes
- Monitor reward variance - high variance indicates insufficient exploration; consider a slower decay
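A minimal standalone sketch of epsilon-greedy selection with geometric decay, using the epsilon values from the Phase 2 hyperparameters (the `numActions` and `greedyAction` pieces are placeholders for illustration, not agentdb-learning internals):

```typescript
// Hypothetical illustration of Principle 2 (not the library's implementation).
let epsilon = 1.0;          // epsilonStart
const epsilonEnd = 0.01;    // never stop exploring entirely
const epsilonDecay = 0.995; // geometric decay applied once per episode

function selectActionEpsilonGreedy(numActions: number, greedyAction: () => number): number {
  // With probability epsilon, take a uniformly random action; otherwise exploit.
  if (Math.random() < epsilon) {
    return Math.floor(Math.random() * numActions);
  }
  return greedyAction();
}

function decayEpsilon(): void {
  // epsilon <- max(epsilonEnd, epsilon * epsilonDecay)
  epsilon = Math.max(epsilonEnd, epsilon * epsilonDecay);
}
```

With a decay of 0.995 per episode, epsilon drops below 0.05 after roughly 600 episodes (0.995^600 ≈ 0.049), which is why the schedule above pairs naturally with training runs of a few thousand episodes.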
### Principle 3: Target Network Stabilization - Decouple Policy from Value Estimation

Q-learning suffers from instability when the target (the expected future reward) shifts while training. Using the same network for both action selection and target computation creates a moving-target problem. Target networks freeze value estimates periodically, stabilizing training and improving convergence reliability.

In practice:
- Update the target network every 1000 training steps (hard update) or use soft updates (tau = 0.001) every step
- Soft updates provide smoother convergence for continuous control tasks (robotics, game AI)
- Hard updates work better for discrete action spaces with sparse rewards
- Monitor TD-error convergence - diverging errors indicate the target network update frequency needs adjustment
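A minimal sketch of the two update schemes over raw weight arrays, to make the hard/soft distinction concrete (illustrative only; `onlineWeights` and `targetWeights` are placeholder arrays, not agentdb-learning internals):

```typescript
// Hypothetical illustration of Principle 3: hard vs. soft target-network updates.

// Hard update: copy the online weights into the target network every N training steps.
function hardUpdate(onlineWeights: number[], targetWeights: number[]): void {
  for (let i = 0; i < targetWeights.length; i++) {
    targetWeights[i] = onlineWeights[i];
  }
}

// Soft (Polyak) update, applied every step:
// target <- tau * online + (1 - tau) * target, with a small tau such as 0.001.
function softUpdate(onlineWeights: number[], targetWeights: number[], tau = 0.001): void {
  for (let i = 0; i < targetWeights.length; i++) {
    targetWeights[i] = tau * onlineWeights[i] + (1 - tau) * targetWeights[i];
  }
}
```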
## Common Anti-Patterns

| Anti-Pattern | Problem | Solution |
|--------------|---------|----------|
| **Reward Hacking - Agent Finds Unintended Policy Shortcuts** | Poorly designed reward functions incentivize agents to exploit loopholes rather than solve the intended task. Classic example: an agent learns to pause the game indefinitely to avoid losing instead of playing well. | Use shaped rewards with multiple components (task completion + efficiency + constraints). Validate the reward function with adversarial testing - manually identify shortcuts and penalize them. Prefer sparse terminal rewards over dense step rewards when the task definition is clear. |
| **Training Convergence Blindness - Run Fixed Episode Count Without Monitoring** | Training for an arbitrary 10K episodes wastes compute if convergence happens at 3K, or never happens at all. Agents either plateau early or train indefinitely without improvement. | Implement early stopping with a patience threshold (stop if no improvement in 500 episodes). Monitor the moving-average reward over a 100-episode window. Track loss curves alongside rewards - a diverging loss indicates hyperparameters need tuning before continuing training. |
| **Hyperparameter Lottery - Use Default Values Without Task-Specific Tuning** | RL algorithms are notoriously sensitive to hyperparameters. The default learning_rate = 0.001 may be 10x too high for high-dimensional state spaces or 10x too low for simple tasks. | Start with baseline hyperparameters from algorithm papers in similar task domains. Run hyperparameter sweeps on key parameters (learning_rate, discount_factor, exploration_rate) using small-scale experiments (1K episodes). Use grid search or Bayesian optimization to find task-specific optimal values before full-scale training. |

## Conclusion

AgentDB Reinforcement Learning Training provides a production-ready framework for training autonomous agents across 9 RL algorithms, from classic Q-Learning to state-of-the-art Decision Transformers. The 5-phase SOP systematically guides you from environment initialization and algorithm configuration through training iterations, performance validation, and production deployment, with comprehensive monitoring and benchmarking at each stage. By integrating experience replay, exploration-exploitation balancing, and target network stabilization, the framework implements proven RL best practices that accelerate convergence and improve final policy quality.

This skill is essential when building self-learning agents for game AI, robotics control, resource optimization, or any domain where optimal behavior must be discovered through trial and error rather than explicitly programmed. The key differentiator is systematic validation: rather than blindly training for arbitrary episode counts, the framework monitors convergence, validates against baselines, and implements early stopping to prevent wasted computation. The deployment pipeline ensures trained policies are properly exported, optimized (quantization, format conversion), and monitored in production with latency and action-distribution tracking.

The choice of RL algorithm matters critically. Q-Learning and SARSA suit discrete action spaces with full observability. DQN scales to high-dimensional state spaces (images, sensor data). Policy gradient methods (PPO, A2C) handle continuous control and partial observability. Decision Transformers enable offline RL from logged data without environment interaction. The framework provides all 9 algorithms behind unified interfaces, allowing rapid experimentation to find the best approach for your task's complexity, state/action space characteristics, and data availability. With proper validation against baselines and comprehensive benchmarking, you can confidently deploy RL agents that genuinely learn and improve rather than memorizing fixed policies.