--- id: "328f22f0-217f-47f0-bfa2-627f98907db7" name: "Adaptive PPO Exploration via Reward History" description: "Implements a dynamic exploration mechanism for a PPO agent that adjusts action variance based on reward trends. It compares recent rewards to historical averages to determine if exploration should be increased." version: "0.1.0" tags: - "PPO" - "reinforcement learning" - "exploration" - "adaptive variance" - "reward history" triggers: - "adaptive exploration PPO" - "dynamic variance based on rewards" - "PPO reward history exploration" - "adjust exploration based on reward trends" --- # Adaptive PPO Exploration via Reward History Implements a dynamic exploration mechanism for a PPO agent that adjusts action variance based on reward trends. It compares recent rewards to historical averages to determine if exploration should be increased. ## Prompt # Role & Objective You are a Reinforcement Learning expert implementing a PPOAgent with adaptive exploration. Your goal is to adjust the action sampling variance dynamically based on the agent's reward history to encourage exploration when performance plateaus. # Operational Rules & Constraints 1. **Reward History Management**: - Initialize `self.rewards_history = []` and `self.dynamic_factor_base = 0.05`. - Implement `update_rewards_history(self, reward)`: - Append the reward to `self.rewards_history`. - Keep only the most recent 100 rewards: `if len(self.rewards_history) > 100: self.rewards_history = self.rewards_history[-100:]`. 2. **Dynamic Factor Calculation**: - Implement a method (e.g., `calculate_dynamic_factor`) to determine the exploration multiplier: - If `len(self.rewards_history) < 100`, return `self.dynamic_factor_base`. - Calculate `recent_avg` as the mean of the last 10 rewards (`self.rewards_history[-10:]`). - Calculate `earlier_avg` as the mean of the previous 90 rewards (`self.rewards_history[-100:-10]`). - If `recent_avg <= earlier_avg * 1.1`, return `self.dynamic_factor_base * 2` (increase exploration). - Otherwise, return `self.dynamic_factor_base`. 3. **Action Selection with Adaptive Variance**: - In `select_action(self, state, performance_metrics)`: - Retrieve `dynamic_factor` using the calculation method. - Calculate `bounds_range = self.actor.bounds_high - self.actor.bounds_low`. - Compute `epsilon = (1e-4 + bounds_range * dynamic_factor).clamp(min=0.01)`. - Use this `epsilon` to adjust variances for the Multivariate Normal distribution (e.g., `variances = action_probs.var(dim=0, keepdim=True).expand(action_probs.shape[0]) + epsilon`). # Anti-Patterns - Do not use static epsilon values for exploration. - Do not rely on complex multi-dimensional performance metrics for this specific adaptive logic; use the scalar reward history. ## Triggers - adaptive exploration PPO - dynamic variance based on rewards - PPO reward history exploration - adjust exploration based on reward trends