{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# Reinforcement Learning with Doom - Increasing complexity and monitoring the model\n", "\n", "Leandro Kieliger\n", "contact@lkieliger.ch\n", "\n", "---\n", "## Description\n", "\n", "In this notebook we are going to build upon the setup introduced in the previous part of this series. We will tackle a more difficult scenario, add useful logging to the learning process, and finally start competing against in-game bots in a first attempt at playing a deathmatch game using reinforcement learning.\n", "\n", "The notebook is structured in 3 parts:\n", "\n", "\n", "### [Part 1 - Increasing complexity](#part_1)\n", "* [Defend the center](#defend_the_center)\n", "* [Defend the center (harder)](#defend_the_center_hard)\n", "\n", " \n", "### [Part 2 - Monitoring the model](#part_2)\n", "* [Adding hooks](#monitoring)\n", "* [Monitoring callback](#callback)\n", "* [Adding normalization](#normalization)\n", "* [Comparing norms](#norm_comparison)\n", " \n", " \n", "### [Part 3 - Playing doom deathmatch](#part_3)\n", "* [Scenario presentation](#deathmatch_presentation)\n", "* [New environment](#deathmatch_environment)\n", "* [Discussion](#deathmatch_discussion)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "# [^](#top) Part 1 - Increasing complexity\n", "\n", "\n", "## Preparations" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "\n", "import cv2\n", "import gym\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import torch as th\n", "import typing as t\n", "import vizdoom\n", "from stable_baselines3 import ppo\n", "from stable_baselines3.common.callbacks import EvalCallback\n", "from stable_baselines3.common import evaluation, policies\n", "from torch import nn\n", "\n", "from common import envs, plotting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## New scenario: Defend The Center\n", "\n", "The next
scenario we are going to tackle is called \"Defend the center\". In this scenario, the agent is stuck in the center of a circular arena and will be attacked by monsters spawning at random intervals and locations. Here are the rewards for this scenario:\n", "\n", "* 1 point for each enemy killed.\n", "* -1 point for dying.\n", "\n", "Since the agent's ammunition is limited to 26 bullets, the theoretical maximum reward achievable for this scenario is 25: at most 26 kills, minus the point lost when the agent inevitably dies. The buttons available are `ATTACK`, `TURN_LEFT` and `TURN_RIGHT`. \n", "\n", "We are going to define a little helper function that will streamline the training and evaluation process, as we will need to repeat it several times in this notebook. The function simply creates the environments, instantiates an agent based on the PPO algorithm and starts training (and optionally evaluating) the agent for a given number of steps." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def solve_env(env_args, agent_args, n_envs, timesteps, callbacks, eval_freq=None, init_func=None):\n", " \"\"\"Helper function to streamline the learning and evaluation process.\n", " \n", " Args:\n", " env_args: A dict containing arguments passed to the environment.\n", " agent_args: A dict containing arguments passed to the agent.\n", " n_envs: The number of parallel training envs to instantiate.\n", " timesteps: The number of timesteps for which to train the model.\n", " callbacks: A list of callbacks for the training process.\n", " eval_freq: The frequency (in steps) at which to evaluate the agent.\n", " init_func: A function to be applied on the agent before training.\n", " \"\"\"\n", " # Create environments.\n", " env = envs.create_vec_env(n_envs, **env_args)\n", "\n", " # Build the agent.\n", " agent = ppo.PPO(policies.ActorCriticCnnPolicy, env, tensorboard_log='logs/tensorboard', seed=0, **agent_args)\n", " \n", " # Optional processing on the agent.\n", " if init_func is not None: \n", "
init_func(agent)\n", " \n", " # Optional evaluation callback.\n", " if eval_freq is not None:\n", " eval_env = envs.create_eval_vec_env(**env_args)\n", "\n", " callbacks.append(EvalCallback(\n", " eval_env, \n", " n_eval_episodes=10, \n", " eval_freq=eval_freq, \n", " log_path=f'logs/evaluations/{env_args[\"scenario\"]}',\n", " best_model_save_path=f'logs/models/{env_args[\"scenario\"]}'))\n", "\n", " # Start the training process.\n", " agent.learn(total_timesteps=timesteps, tb_log_name=env_args['scenario'], callback=callbacks)\n", "\n", " # Cleanup.\n", " env.close()\n", " if eval_freq is not None: eval_env.close()\n", " \n", " return agent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we configure the environments as we saw in the [previous notebook](https://github.com/lkiel/rl-doom/blob/develop/standalone_examples/Basic%20Scenario.ipynb). We will be training for only 100k steps, as this is already enough to reach a good score in this scenario. Don't forget to specify the frame-skip parameter: along with the learning rate, it is one of the most efficient ways to speed up the learning process."
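, "\n", "To make the effect of frame skipping concrete, here is a minimal, hypothetical sketch of the action-repeat logic that such a parameter typically implements (the `frame_skip_step` helper and the `skip=4` default are illustrative assumptions, not the actual API of the `envs` module):\n", "\n", "```python\n", "def frame_skip_step(step_fn, action, skip=4):\n", "    \"\"\"Repeat `action` for `skip` ticks, summing the rewards along the way.\"\"\"\n", "    total_reward = 0.0\n", "    for _ in range(skip):\n", "        obs, reward, done, info = step_fn(action)\n", "        total_reward += reward\n", "        if done:\n", "            break  # Stop repeating as soon as the episode ends.\n", "    return obs, total_reward, done, info\n", "```\n", "\n", "Because the agent only picks a new action every `skip` frames, the policy is queried (and trained on) proportionally fewer decision points, which is why this parameter has such a large impact on training speed."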
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Results in a 100x156 image, no pixel lost due to padding with the default CNN architecture\n", "frame_processor = lambda frame: cv2.resize(frame[40:, 4:-4], None, fx=.5, fy=.5, interpolation=cv2.INTER_AREA)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Eval num_timesteps=4096, episode_reward=-1.00 +/- 0.00\n", "Episode length: 72.80 +/- 7.95\n", "New best mean reward!\n", "Eval num_timesteps=8192, episode_reward=-0.50 +/- 0.81\n", "Episode length: 78.10 +/- 15.14\n", "New best mean reward!\n", "Eval num_timesteps=12288, episode_reward=3.40 +/- 1.91\n", "Episode length: 139.10 +/- 43.52\n", "New best mean reward!\n", "Eval num_timesteps=16384, episode_reward=1.90 +/- 0.70\n", "Episode length: 110.20 +/- 6.78\n", "Eval num_timesteps=20480, episode_reward=9.40 +/- 7.14\n", "Episode length: 248.20 +/- 130.42\n", "New best mean reward!\n", "Eval num_timesteps=24576, episode_reward=9.30 +/- 4.88\n", "Episode length: 227.70 +/- 82.49\n", "Eval num_timesteps=28672, episode_reward=17.80 +/- 4.31\n", "Episode length: 400.10 +/- 93.08\n", "New best mean reward!\n", "Eval num_timesteps=32768, episode_reward=18.50 +/- 5.37\n", "Episode length: 404.40 +/- 106.92\n", "New best mean reward!\n", "Eval num_timesteps=36864, episode_reward=16.90 +/- 4.53\n", "Episode length: 370.10 +/- 94.13\n", "Eval num_timesteps=40960, episode_reward=19.50 +/- 2.01\n", "Episode length: 427.40 +/- 28.64\n", "New best mean reward!\n", "Eval num_timesteps=45056, episode_reward=19.50 +/- 1.20\n", "Episode length: 407.80 +/- 23.39\n", "Eval num_timesteps=49152, episode_reward=19.70 +/- 1.79\n", "Episode length: 406.10 +/- 31.34\n", "New best mean reward!\n", "Eval num_timesteps=53248, episode_reward=16.30 +/- 1.27\n", "Episode length: 342.70 +/- 29.71\n", "Eval num_timesteps=57344, episode_reward=18.50 
+/- 2.11\n", "Episode length: 375.00 +/- 40.77\n", "Eval num_timesteps=61440, episode_reward=18.30 +/- 1.79\n", "Episode length: 357.50 +/- 24.90\n", "Eval num_timesteps=65536, episode_reward=19.40 +/- 2.11\n", "Episode length: 363.90 +/- 30.47\n", "Eval num_timesteps=69632, episode_reward=18.50 +/- 1.69\n", "Episode length: 362.10 +/- 37.28\n", "Eval num_timesteps=73728, episode_reward=17.10 +/- 2.21\n", "Episode length: 342.00 +/- 33.39\n", "Eval num_timesteps=77824, episode_reward=19.00 +/- 2.19\n", "Episode length: 346.60 +/- 25.57\n", "Eval num_timesteps=81920, episode_reward=18.90 +/- 2.70\n", "Episode length: 377.80 +/- 33.54\n", "Eval num_timesteps=86016, episode_reward=18.00 +/- 1.73\n", "Episode length: 347.80 +/- 23.25\n", "Eval num_timesteps=90112, episode_reward=18.80 +/- 2.23\n", "Episode length: 367.70 +/- 27.82\n", "Eval num_timesteps=94208, episode_reward=18.80 +/- 2.44\n", "Episode length: 356.70 +/- 38.54\n", "Eval num_timesteps=98304, episode_reward=20.90 +/- 1.64\n", "Episode length: 357.10 +/- 31.33\n", "New best mean reward!\n", "Eval num_timesteps=102400, episode_reward=18.80 +/- 1.89\n", "Episode length: 339.60 +/- 21.56\n", "Eval num_timesteps=106496, episode_reward=19.50 +/- 1.43\n", "Episode length: 342.50 +/- 25.11\n", "Eval num_timesteps=110592, episode_reward=19.40 +/- 2.15\n", "Episode length: 348.50 +/- 28.62\n", "Eval num_timesteps=114688, episode_reward=19.70 +/- 1.19\n", "Episode length: 333.40 +/- 16.79\n", "Eval num_timesteps=118784, episode_reward=21.60 +/- 2.62\n", "Episode length: 355.60 +/- 30.08\n", "New best mean reward!\n", "Eval num_timesteps=122880, episode_reward=20.10 +/- 1.30\n", "Episode length: 339.40 +/- 29.45\n", "Eval num_timesteps=126976, episode_reward=20.50 +/- 1.63\n", "Episode length: 337.50 +/- 19.14\n", "Eval num_timesteps=131072, episode_reward=18.70 +/- 1.27\n", "Episode length: 306.70 +/- 14.32\n", "Eval num_timesteps=135168, episode_reward=20.60 +/- 1.74\n", "Episode length: 329.50 +/- 18.01\n", 
"Eval num_timesteps=139264, episode_reward=18.80 +/- 1.33\n", "Episode length: 308.50 +/- 21.88\n", "Eval num_timesteps=143360, episode_reward=19.20 +/- 2.14\n", "Episode length: 314.00 +/- 22.63\n", "Eval num_timesteps=147456, episode_reward=18.20 +/- 1.17\n", "Episode length: 299.30 +/- 17.75\n", "Eval num_timesteps=151552, episode_reward=19.40 +/- 2.84\n", "Episode length: 336.90 +/- 41.69\n" ] }, { "data": { "text/plain": [ "