---
name: reinforcement-learning-guide
description: "Reinforcement learning fundamentals, algorithms, and research"
metadata:
  openclaw:
    emoji: "🤖"
    category: "domains"
    subcategory: "ai-ml"
    keywords: ["reinforcement learning", "machine learning", "deep learning", "neural network"]
    source: "wentor-research-plugins"
---

# Reinforcement Learning Guide

Understand and implement reinforcement learning algorithms from tabular methods through deep RL, including policy gradients, actor-critic, and model-based approaches.

## RL Fundamentals

### The RL Framework

An agent interacts with an environment to maximize cumulative reward:

```
Agent                     Environment
  |                           |
  |--- action a_t ---------->|
  |                           |--- next state s_{t+1}
  |<-- reward r_t, state s_t |--- reward r_{t+1}
  |                           |
```

| Concept | Symbol | Definition |
|---------|--------|-----------|
| State | s | Observation of the environment |
| Action | a | Decision made by the agent |
| Reward | r | Scalar feedback signal |
| Policy | pi(a\|s) | Mapping from states to actions |
| Value function | V(s) | Expected cumulative reward from state s |
| Q-function | Q(s, a) | Expected cumulative reward from (s, a) |
| Discount factor | gamma | Weight of future vs. immediate rewards (0-1) |
| Return | G_t | Sum of discounted future rewards from time t |

### Key Equations

```
# Return (discounted cumulative reward)
G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...

# Bellman equation for V
V(s) = E[r + gamma * V(s') | s]

# Bellman equation for Q
Q(s, a) = E[r + gamma * max_a' Q(s', a') | s, a]

# Policy gradient theorem
gradient J(theta) = E[gradient log pi_theta(a|s) * Q(s, a)]
```

## Algorithm Taxonomy

| Category | Algorithm | Key Idea | On/Off Policy |
|----------|-----------|----------|--------------|
| **Value-based** | Q-Learning | Learn Q(s,a), act greedily | Off-policy |
| | DQN | Q-Learning + neural net + replay buffer | Off-policy |
| | Double DQN | Two networks to reduce overestimation | Off-policy |
| | Dueling DQN | Separate value and advantage streams | Off-policy |
| **Policy gradient** | REINFORCE | Monte Carlo policy gradient | On-policy |
| | PPO | Clipped surrogate objective | On-policy |
| | TRPO | Trust region constraint | On-policy |
| **Actor-Critic** | A2C/A3C | Advantage actor-critic (parallel) | On-policy |
| | SAC | Maximum entropy + off-policy AC | Off-policy |
| | TD3 | Twin delayed DDPG | Off-policy |
| **Model-based** | Dreamer | World model + imagination | On-policy |
| | MBPO | Model-based policy optimization | Off-policy |
| | MuZero | Learned model + planning (MCTS) | Off-policy |

## Implementation: DQN

```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque
import random

class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

    def forward(self, x):
        return self.net(x)

class DQNAgent:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99,
                 epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01,
                 buffer_size=10000, batch_size=64):
        self.action_dim = action_dim
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.batch_size = batch_size

        self.q_network = QNetwork(state_dim, action_dim)
        self.target_network = QNetwork(state_dim, action_dim)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)

        self.replay_buffer = deque(maxlen=buffer_size)

    def select_action(self, state):
        if random.random() < self.epsilon:
            return random.randint(0, self.action_dim - 1)
        with torch.no_grad():
            q_values = self.q_network(torch.FloatTensor(state))
            return q_values.argmax().item()

    def store_transition(self, state, action, reward, next_state, done):
        self.replay_buffer.append((state, action, reward, next_state, done))

    def train_step(self):
        if len(self.replay_buffer) < self.batch_size:
            return 0.0

        batch = random.sample(self.replay_buffer, self.batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)

        states = torch.FloatTensor(np.array(states))
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(np.array(next_states))
        dones = torch.FloatTensor(dones)

        # Current Q values
        q_values = self.q_network(states).gather(1, actions.unsqueeze(1)).squeeze()

        # Target Q values (Double DQN variant)
        with torch.no_grad():
            best_actions = self.q_network(next_states).argmax(1)
            next_q = self.target_network(next_states).gather(1, best_actions.unsqueeze(1)).squeeze()
            targets = rewards + self.gamma * next_q * (1 - dones)

        loss = nn.MSELoss()(q_values, targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
        return loss.item()

    def update_target(self):
        self.target_network.load_state_dict(self.q_network.state_dict())
```

## Implementation: PPO

```python
class PPOAgent:
    def __init__(self, state_dim, action_dim, lr=3e-4, gamma=0.99,
                 lam=0.95, clip_ratio=0.2, epochs=10):
        self.gamma = gamma
        self.lam = lam
        self.clip_ratio = clip_ratio
        self.epochs = epochs

        self.actor = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, action_dim), nn.Softmax(dim=-1)
        )
        self.critic = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1)
        )
        self.optimizer = optim.Adam(
            list(self.actor.parameters()) + list(self.critic.parameters()), lr=lr
        )

    def compute_gae(self, rewards, values, dones):
        """Generalized Advantage Estimation."""
        advantages = []
        gae = 0
        for t in reversed(range(len(rewards))):
            next_value = values[t + 1] if t + 1 < len(values) else 0
            delta = rewards[t] + self.gamma * next_value * (1 - dones[t]) - values[t]
            gae = delta + self.gamma * self.lam * (1 - dones[t]) * gae
            advantages.insert(0, gae)
        return torch.FloatTensor(advantages)

    def update(self, states, actions, old_log_probs, rewards, dones):
        values = self.critic(states).squeeze().detach().numpy()
        advantages = self.compute_gae(rewards, values, dones)
        returns = advantages + torch.FloatTensor(values[:len(advantages)])
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        for _ in range(self.epochs):
            probs = self.actor(states)
            dist = torch.distributions.Categorical(probs)
            new_log_probs = dist.log_prob(actions)
            entropy = dist.entropy().mean()

            ratio = (new_log_probs - old_log_probs).exp()
            clipped = torch.clamp(ratio, 1 - self.clip_ratio, 1 + self.clip_ratio)
            actor_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

            critic_loss = nn.MSELoss()(self.critic(states).squeeze(), returns)

            loss = actor_loss + 0.5 * critic_loss - 0.01 * entropy
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
```

## Research Environments

| Environment | Domain | Complexity | Key Paper |
|-------------|--------|-----------|-----------|
| Gymnasium (ex-Gym) | Classic control, Atari | Low-High | Brockman et al., 2016 |
| MuJoCo | Continuous control, robotics | Medium-High | Todorov et al., 2012 |
| DMControl | Continuous control from pixels | High | Tassa et al., 2018 |
| ProcGen | Procedurally generated games | High (generalization) | Cobbe et al., 2020 |
| Minigrid | Grid-world navigation | Low-Medium | Chevalier-Boisvert et al. |
| Isaac Gym | GPU-accelerated physics sim | High | Makoviychuk et al., 2021 |
| NetHack | Complex roguelike game | Very High | Kuttler et al., 2020 |

## Top Venues

| Venue | Type | Focus |
|-------|------|-------|
| NeurIPS | Conference | Broad ML including RL |
| ICML | Conference | Broad ML including RL |
| ICLR | Conference | Representation learning, deep RL |
| AAAI | Conference | Broad AI |
| CoRL | Conference | Robot learning |
| JMLR | Journal | Broad ML (open access) |
| L4DC | Conference | Learning for dynamics and control |

## Key Research Directions (2024-2025)

1. **RLHF / RLAIF**: RL from human or AI feedback for LLM alignment
2. **Offline RL**: Learning from pre-collected datasets without environment interaction
3. **Foundation models for control**: Using pre-trained LLMs/VLMs as world models or planners
4. **Multi-agent RL**: Cooperative and competitive settings with communication
5. **Safe RL**: Constrained optimization to ensure safety during training and deployment
6. **Sample-efficient RL**: Reducing the gap between model-free and model-based sample complexity