# Multi-Agent Proximal Policy Optimization (MAPPO)

*MAPPO Simple Spread demo: agents cooperating in the Simple Spread environment (video: `images/simple_spread.mp4`).*

## Overview

**Multi-Agent Proximal Policy Optimization (MAPPO)** is a centralized training with decentralized execution (CTDE) algorithm that extends PPO to multi-agent settings. MAPPO uses a centralized critic during training while keeping each agent's policy decentralized for execution, which makes it highly effective for cooperative multi-agent tasks.

## Algorithm Theory

### Core Concept

MAPPO follows the **Centralized Training, Decentralized Execution (CTDE)** paradigm: agents share information during training but act independently at execution time. This lets agents exploit global information for better coordination while retaining the benefits of decentralized execution.

### Key Components

#### 1. Centralized Training
- All agents share a centralized critic network
- Global state information is available during training
- All agent policies are optimized jointly

#### 2. Decentralized Execution
- Each agent has its own policy network
- Agents act based on local observations only
- No communication is required during execution

#### 3. Proximal Policy Optimization
- Uses PPO's clipped surrogate objective for stable updates
- Clipping acts as a trust-region-style constraint that prevents overly large policy changes
- Entropy regularization encourages exploration

#### 4. Random Network Distillation (RND) Variants
- Provide intrinsic motivation for exploration
- Help agents discover novel strategies
- Improve performance in complex, sparse-reward environments

## Implementation Details

### Network Architecture

#### Centralized Critic

```python
class CentralizedCritic(nn.Module):
    def __init__(self, global_state_dim, num_agents):
        super().__init__()
        # Value head over the full (global) environment state
        self.network = nn.Sequential(
            layer_init(nn.Linear(global_state_dim, 128)),
            nn.Tanh(),
            layer_init(nn.Linear(128, 128)),
            nn.Tanh(),
            layer_init(nn.Linear(128, 1), std=1.0),
        )
```

#### Decentralized Actors

```python
class Actor(nn.Module):
    def __init__(self, observation_dim, action_dim):
        super().__init__()
        # Policy trunk over the agent's local observation
        self.network = nn.Sequential(
            layer_init(nn.Linear(observation_dim, 128)),
            nn.Tanh(),
            layer_init(nn.Linear(128, 128)),
            nn.Tanh(),
        )
        # Logits over the discrete action space
        self.actor = layer_init(nn.Linear(128, action_dim), std=0.01)
```

### Training Process

1. **Environment Interaction**
   - Multiple parallel environments (15 by default)
   - Agents interact using their decentralized policies
   - Global state information is collected for the critic

2. **Experience Collection**
   - Rollout length: 256 steps per environment (longer than IPPO)
   - Store local observations, actions, rewards, and global states
   - Compute advantages using the centralized critic

3. **Policy Updates**
   - PPO epochs: 10 (more than IPPO, for better convergence)
   - Minibatch size: 3840 (15 envs × 256 steps)
   - Learning rate: 2.5e-4 with linear annealing

4. **Optimization**
   - Adam optimizer with gradient clipping (0.5)
   - Orthogonal initialization for stable training
   - Entropy coefficient: 0.02 for enhanced exploration

## Supported Environments

### 1. Simple Spread (Cooperative)
- **Environment**: `simple_spread_v3`
- **Task**: Cooperative navigation where agents must cover landmarks
- **Actions**: Discrete (5 actions per agent)
- **Observations**: Vector observations with agent positions
- **Global State**: Full environment state, including all agent positions

### 2. Cooperative Pong (Butterfly)
- **Environment**: `cooperative_pong_v5`
- **Task**: Cooperative version of Pong where agents work together
- **Actions**: Discrete actions for paddle movement
- **Observations**: Image-based observations
- **Global State**: Full game state, including ball and paddle positions
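Both environments above come from PettingZoo. The snippet below is a minimal sketch of how Simple Spread could be vectorized into the 15 parallel workers used during training, assuming the standard `pettingzoo` and `supersuit` APIs (`pettingzoo_env_to_vec_env_v1`, `concat_vec_envs_v1`); it illustrates the setup and is not copied from the repository's code.

```python
import supersuit as ss
from pettingzoo.mpe import simple_spread_v3


def make_vec_env(num_envs: int = 15, max_cycles: int = 25):
    """Build a vectorized Simple Spread environment for parallel rollouts."""
    env = simple_spread_v3.parallel_env(
        N=3,                       # three cooperating agents
        local_ratio=0.5,           # blend of local and global reward terms
        max_cycles=max_cycles,     # episode length
        continuous_actions=False,  # 5 discrete actions per agent
    )
    # Flatten the multi-agent parallel env into a single vector env,
    # then stack num_envs copies so rollouts run in parallel.
    env = ss.pettingzoo_env_to_vec_env_v1(env)
    return ss.concat_vec_envs_v1(env, num_envs, num_cpus=0, base_class="gymnasium")
```

With these defaults the vectorized environment exposes 45 sub-environments (15 copies × 3 agents), each returning an 18-dimensional local observation per PettingZoo's defaults.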
### 3. RND-Enhanced Environments
- **Purpose**: Improved exploration through intrinsic motivation
- **Implementation**: RND networks provide additional reward signals
- **Benefits**: Better performance in complex, sparse-reward environments

## Usage

### Installation

```bash
pip install torch pettingzoo[mpe,butterfly] supersuit wandb tqdm imageio opencv-python gymnasium
```

### Training Commands

#### Standard MAPPO (Simple Spread)

```bash
python mappo_without_rnd.py --env_id simple_spread_v3 --total_timesteps 20000000
```

#### MAPPO with RND

```bash
python mappo_rnd.py --env_id simple_spread_v3 --total_timesteps 20000000
```

#### MAPPO for Cooperative Pong

```bash
python mappo_rnd_pong.py --env_id cooperative_pong_v5 --total_timesteps 10000000
```

#### MAPPO Training Script

```bash
python train.py --env_id cooperative_pong_v5 --total_timesteps 10000000
```

### Key Hyperparameters

```python
# Training Configuration
lr = 2.5e-4                 # Learning rate
num_envs = 15               # Parallel environments
max_steps = 256             # Rollout length (longer than IPPO)
PPO_EPOCHS = 10             # PPO update epochs (more than IPPO)
clip_coeff = 0.2            # PPO clipping coefficient
ENTROPY_COEFF = 0.02        # Entropy regularization (higher than IPPO)
GAE = 0.95                  # GAE lambda parameter
total_timesteps = 20000000  # Total training steps
```

### Evaluation

```bash
# Evaluate trained model
python mappo_without_rnd.py --eval --checkpoint "checkpoint.pt"

# Interactive play
python play_ippo.py "checkpoint.pt"
```

## Technical Implementation

### File Structure

```
MAPPO/
├── mappo_without_rnd.py    # Standard MAPPO implementation
├── mappo_rnd.py            # MAPPO with RND for exploration
├── mappo_rnd_pong.py       # MAPPO with RND for cooperative Pong
├── train.py                # MAPPO training script
├── images/                 # Training visualizations
│   └── simple_spread.mp4   # Demo video
└── README.md               # This file
```

### Key Classes

#### Config
Centralized configuration class containing all hyperparameters and training settings.

#### CentralizedCritic
Global value function with access to the full environment state.

#### Actor Networks
Decentralized policy networks, one per agent.

#### MAPPO Trainer
Main training loop implementing the MAPPO algorithm with centralized training.

## RND Integration

### Random Network Distillation

RND provides intrinsic motivation by measuring how "surprising" or novel an observation is: a fixed, randomly initialized target network maps observations to features, a predictor network is trained to match those features, and the prediction error (large for rarely seen observations) is used as an additional reward signal.

```python
class RNDNetwork(nn.Module):
    def __init__(self, observation_dim):
        super().__init__()
        self.predictor = nn.Sequential(...)  # Trained to predict the target's features
        self.target = nn.Sequential(...)     # Fixed, randomly initialized target network
        for p in self.target.parameters():   # The target is never updated
            p.requires_grad = False
```

## References

### Papers
- [The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games](https://arxiv.org/abs/2103.01955)
- [Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347)
- [Exploration by Random Network Distillation](https://arxiv.org/abs/1810.12894)
- [Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments](https://arxiv.org/abs/1706.02275)

### Code References
- [CleanRL MAPPO Implementation](https://github.com/vwxyzjn/cleanrl)
- [PettingZoo Multi-Agent Environments](https://pettingzoo.farama.org/)
- [SuperSuit Environment Wrappers](https://github.com/Farama-Foundation/SuperSuit)

---

## Contributing

This implementation is part of a larger MARL research project. Contributions are welcome in the form of:
- Bug reports and fixes
- Performance improvements
- New environment support
- Algorithm extensions

## License

This implementation is open source and available under the MIT License.
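## Appendix: RND Intrinsic Reward Sketch

To complement the RND Integration section above, the snippet below sketches how the prediction error between the frozen target and the trained predictor can be turned into a per-observation exploration bonus. The hidden width, feature size, and 18-dimensional observation are illustrative assumptions rather than the repository's exact settings, and the reward normalization used in the RND paper is omitted for brevity.

```python
import torch
import torch.nn as nn


def rnd_bonus(predictor: nn.Module, target: nn.Module, obs: torch.Tensor) -> torch.Tensor:
    """Mean squared error between predictor and frozen target features, per observation."""
    with torch.no_grad():
        target_feat = target(obs)  # fixed random features
    pred_feat = predictor(obs)     # trainable prediction of those features
    return ((pred_feat - target_feat) ** 2).mean(dim=-1)


# Illustrative sizes (assumptions): 18-dim observations, 64-dim RND features.
obs_dim, feat_dim = 18, 64
target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
for p in target.parameters():
    p.requires_grad = False        # the target network is never trained

obs = torch.randn(32, obs_dim)     # a batch of local observations
bonus = rnd_bonus(predictor, target, obs)
intrinsic_reward = bonus.detach()  # added to the environment reward to encourage exploration
predictor_loss = bonus.mean()      # minimizing this shrinks the bonus for familiar states
predictor_loss.backward()
```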