{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "bpOEhPfXa9Nz" }, "source": [ "## HW2: Value Based Methods (DQN vs DDQN)" ] }, { "cell_type": "markdown", "metadata": { "id": "zSHRAecMbEKC" }, "source": [ "> - Full Name: **[Full Name]**\n", "> - Student ID: **[Student ID]**" ] }, { "cell_type": "markdown", "metadata": { "id": "NsmqMy9fZSgU" }, "source": [ "## **Overview**\n", "\n", "DQN, introduced in 2013, helped launch modern deep reinforcement learning. \n", "In this notebook, you'll use PyTorch to train a Deep Q-Network (DQN) agent on the [Cart-Pole](https://gymnasium.farama.org/environments/classic_control/cart_pole/) task from [Gymnasium](https://gymnasium.farama.org/). You'll also implement Double DQN (DDQN), a variant designed to reduce Q-value overestimation and make training more stable.\n", "\n", "### In this notebook:\n", "- Explore the Cart-Pole environment and observe an untrained agent.\n", "- Set up a Gymnasium environment.\n", "- Implement and train DQN and DDQN from scratch.\n", "- Compare their performance to understand strengths and weaknesses.\n", "- Render the trained agent’s behavior.\n", "\n", "Before starting, import the necessary packages. Helper functions are provided for visualization and rendering.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "O_4FcKWExplP" }, "source": [ "## **Setup** \n", "\n", "First, install the required packages. If you're using Colab, everything should work smoothly. However, on a local system, you may encounter some dependency installation challenges. 
\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "H6T0r2sdxplQ", "outputId": "567fa4d6-187f-4179-ca79-ee5e4c8b0322" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "
\u001b[?25l\u001b[?25hdone\n", " Created wheel for box2d-py: filename=box2d_py-2.3.5-cp311-cp311-linux_x86_64.whl size=2379446 sha256=01dcaf5b40c9697dc280991e9dcbde5c77b5cf0fa8e1ad4b91f09185b95a1540\n", " Stored in directory: /root/.cache/pip/wheels/ab/f1/0c/d56f4a2bdd12bae0a0693ec33f2f0daadb5eb9753c78fa5308\n", "Successfully built box2d-py\n", "Installing collected packages: box2d-py\n", "Successfully installed box2d-py-2.3.5\n" ] } ], "source": [ "!sudo apt-get update --quiet\n", "!pip install imageio --quiet\n", "!sudo apt-get install -y xvfb ffmpeg --quiet\n", "!pip install swig --quiet\n", "!pip install gymnasium[box2d]" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "rQsT67I7xplS", "outputId": "386d398a-4fa1-4392-c362-d4383db7a673" }, "outputs": [ { "data": { "text/plain": [ "device(type='cuda')" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import gymnasium as gym\n", "import random\n", "import matplotlib\n", "from matplotlib import pyplot as plt\n", "import numpy as np\n", "from collections import namedtuple, deque\n", "\n", "import base64\n", "import json\n", "import imageio\n", "import IPython\n", "\n", "import torch\n", "import torch.nn as nn\n", "import torch.optim as optim\n", "import torch.nn.functional as F\n", "\n", "\n", "# if GPU is to be used\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "device" ] }, { "cell_type": "markdown", "metadata": { "id": "l0NlmaWvxplS" }, "source": [ "## **Helper Functions** \n", "This section contains functions for visualizing your results. 
\n" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "id": "2TSVEQFFxplS" }, "outputs": [], "source": [ "# @title helper functions\n", "\n", "# disable warnings\n", "import logging\n", "logging.getLogger().setLevel(logging.ERROR)\n", "\n", "# set up matplotlib\n", "is_ipython = 'inline' in matplotlib.get_backend()\n", "if is_ipython:\n", "    from IPython import display\n", "plt.ion()\n", "plt.xkcd(scale=1, length=100, randomness=2)\n", "matplotlib.rcParams['figure.figsize'] = (12, 6)\n", "\n", "\n", "def plot_rewards(sum_of_rewards,i,show_result=False):\n", "    plt.figure(1)\n", "    rewards = torch.tensor(sum_of_rewards, dtype=torch.float)\n", "    if show_result:\n", "        plt.title('Result')\n", "    else:\n", "        plt.clf()\n", "        plt.title(f'Training the Agent number {i}')\n", "    plt.xlabel('Episode')\n", "    plt.ylabel('Reward')\n", "    plt.plot(rewards.numpy())\n", "    # Running mean over the first 49 episodes, then a 50-episode moving average\n", "    length = len(rewards)\n", "    init_len = min(49, length)\n", "    init_means = torch.cumsum(rewards[:init_len], dim=0) / torch.arange(1, init_len + 1)\n", "    if length >= 50:\n", "        means = rewards.unfold(0, 50, 1).mean(1).view(-1)\n", "        means = torch.cat((init_means, means))\n", "    else:\n", "        means = init_means\n", "    plt.plot(means.numpy())\n", "\n", "    plt.pause(0.001)\n", "    if is_ipython:\n", "        if not show_result:\n", "            display.display(plt.gcf())\n", "            display.clear_output(wait=True)\n", "        else:\n", "            display.display(plt.gcf())\n", "\n", "def plot_smooth(DDQN_mean_rewards,DDQN_min_rewards,DDQN_max_rewards,DQN_mean_rewards,DQN_min_rewards,DQN_max_rewards):\n", "    plt.figure(figsize=(12,7))\n", "\n", "    # Plot DDQN\n", "    DDQN, = plt.plot(range(len(DDQN_mean_rewards)), DDQN_mean_rewards, color='blue', label='DDQN')\n", "    plt.fill_between(range(len(DDQN_min_rewards)), DDQN_min_rewards, DDQN_max_rewards, color='blue', alpha=0.2)\n", "\n", "    # Plot DQN\n", "    DQN, = plt.plot(range(len(DQN_mean_rewards)), DQN_mean_rewards, color='red', label='DQN')\n", "    
plt.fill_between(range(len(DQN_min_rewards)), DQN_min_rewards, DQN_max_rewards, color='red', alpha=0.2)\n", "\n", "    # Fix legend\n", "    plt.legend(handles=[DDQN, DQN])\n", "    plt.show()\n", "\n", "\n", "\n", "def plot_values(values):\n", "    plt.figure(figsize=(15, 9))\n", "\n", "    # Iterate over each value set\n", "    for i, value in enumerate(values):\n", "        for n, Data in enumerate(value):\n", "            plt.plot(range(len(Data)), Data, label=f\"Values of selected trained Agent Number {i+1}, Evaluation {n+1}\")\n", "\n", "    plt.title('Test Episode Mean Q values')\n", "    plt.xlabel(\"Episodes\")\n", "    plt.ylabel(\"Value\")\n", "    plt.grid(True)\n", "    plt.legend()\n", "    plt.show()\n", "\n", "\n", "\n", "def embed_mp4(filename):\n", "    # Embed the mp4 file in the notebook as a base64 data URI\n", "    video = open(filename,'rb').read()\n", "    b64 = base64.b64encode(video)\n", "    tag = '''\n", "    <video controls>\n", "        <source src=\"data:video/mp4;base64,{0}\" type=\"video/mp4\">\n", "    </video>'''.format(b64.decode())\n", "    return IPython.display.HTML(tag)\n", "\n", "\n", "def create_policy_eval_video(env, agent, filename, num_episodes=1, fps=30):\n", "    filename = filename + \".mp4\"\n", "    with imageio.get_writer(filename, fps=fps) as video:\n", "        for _ in range(num_episodes):\n", "            state, info = env.reset()\n", "            video.append_data(env.render())\n", "            while True:\n", "                state = torch.from_numpy(state).unsqueeze(0).to(device)\n", "                action,_ = agent.act(state, greedy=True)\n", "                state, reward, terminated, truncated, info = env.step(action)\n", "                video.append_data(env.render())\n", "                if terminated or truncated:\n", "                    break\n", "    return embed_mp4(filename)\n", "\n", "\n", "def save_progress(sum_of_rewards, PATH):\n", "    # Convert the list to a JSON string\n", "    json_data = json.dumps(sum_of_rewards)\n", "    # Write the JSON data to a file\n", "    with open(PATH + str('.json'), \"w\") as file:\n", "        file.write(json_data)\n", "\n", "\n", "def load_progress(PATH):\n", "    with open(PATH + str('.json'), \"r\") as file:\n", "        json_data = file.read()\n", "    # Load the JSON data back into a Python list\n", "    return json.loads(json_data)" ] }, { "cell_type": "markdown", 
"metadata": { "id": "6QAc3RPgxplS" }, "source": [ "# **Explore the Environment** \n", "\n", "Let's explore the Gymnasium Cart-Pole environment. \n", "First, we need to create the environment and set `rgb_array` as the render mode for visualization. \n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "kkOb315uxplS", "outputId": "589eff99-8afc-417e-a369-a23ba7d6ff8a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Observations: 4\n", "Actions: 2\n" ] } ], "source": [ "env = gym.make(\"CartPole-v1\", render_mode=\"rgb_array\")\n", "# TODO: Print the observation space and action space\n", "print('Observations:', env.observation_space.shape[0])\n", "print('Actions:', env.action_space.n)" ] }, { "cell_type": "markdown", "metadata": { "id": "ebrWxAFrxplS" }, "source": [ "Complete the following class to create an agent that selects actions randomly from the action space. \n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "MgHtlH8fxplT" }, "outputs": [], "source": [ "class RandomAgent():\n", "\n", "    def __init__(self,env_name,mode='rgb_array'):\n", "        # Use the requested render mode instead of a hard-coded one\n", "        self.env = gym.make(env_name, render_mode=mode)\n", "\n", "\n", "    def act(self,state = None,greedy = None):\n", "        # TODO: Select and return a random action\n", "        action = self.env.action_space.sample()\n", "        return action,0\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "dcGBHs_2xplT" }, "source": [ "**Monitor the random agent's performance**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 502 }, "id": "wgKiOogxxplT", "outputId": "280f468f-f227-4f5f-bb3a-cea4247bbd37" }, "outputs": [ { "data": { "text/html": [ "\n", " " ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_agent = RandomAgent(\"CartPole-v1\")\n", "create_policy_eval_video(random_agent.env, 
random_agent, 'random_policy', num_episodes=5)" ] }, { "cell_type": "markdown", "metadata": { "id": "JRFrfnYpxplT" }, "source": [ "## **Main Components of DDQN and Its Variants** \n", "\n", "### **Deep Q Network (DQN)** \n", "\n", "DQN uses a neural network to estimate $Q(s,a)$ values. Conceptually, the network takes both state and action as input and outputs a single $Q(s,a)$ value. In practice, it takes only the state as input and outputs a vector of Q-values, one per action in the action space, so a single forward pass scores every action. \n", "\n", " \n", "\n", "Now, let's define the Deep Q Network. \n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "I81sC3E5xplT" }, "outputs": [], "source": [ "\n", "class QNetwork(nn.Module):\n", "\n", "    def __init__(self, state_size, action_size):\n", "\n", "        super(QNetwork, self).__init__()\n", "\n", "        self.relu = nn.ReLU()\n", "        self.fc1 = nn.Linear(state_size, 512)\n", "        self.fc2 = nn.Linear(512,125)\n", "        self.fc3 = nn.Linear(125,action_size)\n", "\n", "\n", "    def forward(self, state):\n", "\n", "        x = self.relu(self.fc1(state))\n", "        x = self.relu(self.fc2(x))\n", "        # No activation on the output layer: Q-values must be free to take any real value\n", "        x = self.fc3(x)\n", "\n", "        return x\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "pJ5N9yZjxplT" }, "source": [ "## **Experience Replay Buffer** \n", "\n", "To train the network, we need data. We store **transitions**, tuples of (state, action, reward, next state, termination), in a replay buffer for later sampling. \n", "\n", "- **state (s):** The current situation the agent is experiencing. \n", "- **action (a):** The action taken by the agent. \n", "- **next state (s'):** The new state after taking the action. \n", "- **reward (r):** The feedback received for the action. \n", "- **termination (done):** A boolean indicating if the episode has ended. \n", "\n", "These transitions are stored in the **Experience Replay Buffer**, allowing us to sample and train the Q-network efficiently. 
\n", "\n", "A stack of transitions $(s, a, s', r, done)$ looks like this: \n", "\n", " \n", "\n", "**We sample transitions at random, which breaks the temporal correlations between consecutive steps before training.** \n", "\n", "Now, let's define the Experience Replay Buffer class. \n" ] }, { "cell_type": "markdown", "metadata": { "id": "oqMrze5oxplT" }, "source": [ "## **Experience Replay Buffer** \n", "\n", "- `push`: Stores transitions from the environment. \n", "- `sample`: Draws a random batch of transitions. \n", "- `__len__`: Returns the number of stored transitions. \n", "- `get_size`: Returns the buffer size (same as `__len__`). \n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "3xn-7fyXxplT" }, "outputs": [], "source": [ "class ReplayMemory(object):\n", "\n", "    def __init__(self, capacity, batch_size):\n", "        self.memory = deque([],maxlen = capacity)\n", "        self.batch_size = batch_size\n", "        self.experience = namedtuple(\"Experience\", field_names = (\"state\", \"action\", \"reward\", \"next_state\", \"done\"))\n", "\n", "    def push(self, state, action, reward, next_state, done):\n", "        e = self.experience(state, action, reward, next_state, done)\n", "        self.memory.append(e)\n", "\n", "    def sample(self, batch_size = None):\n", "\n", "        if batch_size is None:\n", "            batch_size = self.batch_size\n", "        experiences = random.sample(self.memory, k = batch_size)\n", "        states = np.vstack([e.state.detach().cpu() for e in experiences if e is not None])\n", "        actions = np.vstack([e.action for e in experiences if e is not None])\n", "        rewards = np.vstack([e.reward for e in experiences if e is not None])\n", "        next_states = np.vstack([e.next_state.detach().cpu() for e in experiences if e is not None])\n", "        dones = np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)\n", "        return states, actions, rewards, next_states, dones\n", "\n", "    def __len__(self):\n", "        return len(self.memory)\n", "\n", "    def get_size(self):\n", "        return self.__len__()" ] }, { "cell_type": 
"markdown", "metadata": { "id": "b0ctqz-zxplT" }, "source": [ "## **DQN Agent** \n", "\n", "DQN, one of the first deep RL algorithms, uses TD learning as in Q-learning and aims to shrink the TD error: \n", "\n", "$r_i + \\gamma \\cdot \\max_{a'} Q_\\theta(s_i',a') - Q_\\theta(s_i,a_i)$ \n", "\n", "The cost function is the square of this error: \n", "\n", "$[r_i + \\gamma \\cdot \\max_{a'} Q_\\theta(s_i',a') - Q_\\theta(s_i,a_i)]^2$ $(1)$ \n", "\n", "Instead of a tabular method, DQNs use deep networks to estimate Q-values. \n", "\n", "In this class, we implement the original DQN with an experience replay buffer. The training process is as follows: \n", "\n", "**Network Updating** \n", "\n", "1. Gather data. Once the buffer reaches a certain size, start training the Q-network. \n", "2. Sample transitions and feed the States (instead of State-action pairs) to the Q-network, then pick out the output for the action taken to get $Q_\\theta(s_i,a_i)$. \n", "3. Feed Next_States to the Q-network to estimate $\\max_{a'} Q_\\theta(s_i',a')$. \n", "4. Use the batch average of equation $(1)$ as the loss and update the network via backpropagation. \n" ] }, { "cell_type": "markdown", "metadata": { "id": "BVK_MrnFxplT" }, "source": [ "## **$\\epsilon-\\text{greedy policy}$** \n", "\n", "Exploration is crucial in RL algorithms. In DQN, exploration is achieved using the $\\epsilon-\\text{greedy policy}$.\n", "\n", "$$\n", "\\pi(a \\mid s) =\n", "\\begin{cases}\n", "\\arg\\max\\limits_{a} Q(s, a), & \\text{with probability } 1 - \\epsilon, \\\\\n", "\\text{random action}, & \\text{with probability } \\epsilon.\n", "\\end{cases}\n", "$$\n", "\n", "\n", "With probability $\\epsilon$, a random action is chosen for exploration, and with probability $1 - \\epsilon$, the best action is selected for exploitation. \n", "\n", "$\\epsilon$ starts high and gradually decreases over time. 
The decay follows this formula:\n", "\n", "$$\\varepsilon = \\varepsilon_{end} + \\left(\\varepsilon_{start} - \\varepsilon_{end}\\right)\\exp\\left(-\\frac{\\text{steps done}}{\\text{decay rate}}\\right)$$ \n", "\n", "**The $\\epsilon-\\text{greedy policy}$ is used in both DQN and DDQN.** \n", "\n", "## **Original DQN Pseudocode** \n", "\n", " \n", "\n", "### Properties of the Algorithm:\n", "\n", "- Uses a single network to estimate both $Q(s,a)$ and $Q(s',a')$, which can lead to instability. This is the problem DDQN aims to solve. **It's up to you to figure out why this is a problem.** \n", "- Utilizes the $\\epsilon-\\text{greedy}$ policy for exploration.\n", "\n", "**Hint:** \n", "- A single network leads to unstable learning, hence the introduction of a target network in modified versions. \n", "- The $\\epsilon-\\text{greedy}$ policy can sometimes hinder the learning process, despite addressing the exploration problem. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2VJnYgyMxplT" }, "outputs": [], "source": [ "class DQNAgent(object):\n", "\n", "    def __init__(self, q_network, memory, optimizer, criterion, params):\n", "        # TODO: Create policy net based on the q-net\n", "        self.policy_net = q_network.to(device)\n", "\n", "        # TODO: Setup the agent's memory\n", "        self.reply_buffer = memory(params['BUFFER_SIZE'],params['BATCH_SIZE'])\n", "        # criterion, optimizer and params\n", "        self.criterion = criterion()\n", "        self.optimizer = optimizer(self.policy_net.parameters(), lr=params['LR'],amsgrad=True)\n", "\n", "        self.gamma = params['GAMMA']\n", "        self.eps = {'START': params['EPS_START'], 'END': params['EPS_END'], 'DECAY': params['EPS_DECAY']}\n", "        self.steps_done = 0\n", "        self.Loss = []\n", "\n", "\n", "    def step(self, state, action, reward, next_state, done):\n", "        # TODO: Save the experience in the memory of the agent\n", "\n", "        self.reply_buffer.push(state, action, reward, next_state, done)\n", "        # TODO: Increment the steps 
counter\n", "        self.steps_done +=1\n", "\n", "\n", "        if self.reply_buffer.get_size() > self.reply_buffer.batch_size:\n", "            # TODO: Sample a batch from memory and learn from it\n", "            states, actions, rewards, next_states,dones= self.reply_buffer.sample()\n", "            self.learn(states, actions, rewards, next_states,dones)\n", "\n", "\n", "    def act(self, state, greedy=False):\n", "        self.eps_threshold = self.eps['END'] + (self.eps['START'] - self.eps['END']) * np.exp(- self.steps_done / self.eps['DECAY'])\n", "\n", "        if greedy or random.random() > self.eps_threshold:\n", "            with torch.no_grad(): # TODO: Select greedy action\n", "                Q_values = self.policy_net(state)\n", "                action = torch.argmax(Q_values).item()\n", "                max_Q = torch.max(Q_values).item()\n", "        else: # TODO: Select random action\n", "            action = random.choices(range(2))[0]\n", "            max_Q = 0\n", "        return action,max_Q\n", "\n", "    def learn(self, states, actions, rewards, next_states, dones):\n", "\n", "\n", "\n", "        states = torch.from_numpy(states).float().to(device)\n", "        actions = torch.from_numpy(actions).long().to(device)\n", "        rewards = torch.from_numpy(rewards).float().to(device)\n", "        next_states = torch.from_numpy(next_states).float().to(device)\n", "        dones = torch.from_numpy(dones).float().to(device)\n", "        # TODO: Compute the predicted Q-values using the policy network\n", "        state_Q = self.policy_net(states)\n", "        state_Q = state_Q.gather(dim = 1, index = actions)\n", "\n", "\n", "\n", "        with torch.no_grad():\n", "            next_state_Q = self.policy_net(next_states).detach().max(1)[0].unsqueeze(1)\n", "            bellman_backup = rewards + self.gamma*next_state_Q*(1-dones)\n", "\n", "        # TODO: Compute the loss and do backpropagation\n", "        loss = self.criterion(state_Q,bellman_backup)\n", "        self.optimizer.zero_grad()\n", "        loss.backward()\n", "        self.optimizer.step()\n", "\n", "\n", "\n", "\n", "    def save(self, PATH):\n", "        torch.save(self.policy_net, PATH + '_policy.pt')\n", "\n", "\n", "    def load(self, PATH):\n", "        self.policy_net = 
torch.load(PATH + '_policy.pt', weights_only=False)  # the file holds a full pickled module, not a state_dict\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Pep4oRAjxplU" }, "source": [ "# **Training the DQN Agents** \n", "\n", "Now that you've implemented DQN and explored the Gymnasium CartPole environment, it's time to train an actual agent. \n", "\n", "**First, we train the DQN agent.** \n" ] }, { "cell_type": "markdown", "metadata": { "id": "YbYIPpZaxplU" }, "source": [ "## **Setting Up Essentials for DQN** \n", "\n", "- **CartPole Environment Setup** \n", "- **Hyperparameter Initialization** \n", "- **Creating DQN Agents with Different Random Seeds** \n", "  - Why? Training multiple agents with different seeds helps assess DQN's consistency and robustness. \n", "  - We use **5 random seeds** for better evaluation. \n", "- **Defining Optimizer and Loss Function** \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bNCRCzX0xplU" }, "outputs": [], "source": [ "def create_dqn_agent(seed, QNetwork, ReplayMemory, optimizer, criterion, params):\n", "    # TODO: set the different seeds\n", "    torch.manual_seed(seed)\n", "    random.seed(seed)\n", "    np.random.seed(seed)\n", "\n", "    # Instantiate the agent\n", "    return DQNAgent(QNetwork(n_observations, n_actions), ReplayMemory, optimizer, criterion, params)\n", "\n", "\n", "\n", "\n", "params = {\n", "\n", "    'BUFFER_SIZE': int(100000), # size of the replay buffer\n", "    'BATCH_SIZE':128, # number of experiences sampled from memory\n", "    'GAMMA': 0.99 , # discount factor\n", "    'EPS_START':0.99 , # starting value of epsilon\n", "    'EPS_END': 0.01 , # final value of epsilon\n", "    'EPS_DECAY': 10000 , # rate of exponential decay of epsilon \n", "    'LR': 0.0001 # learning rate of the optimizer\n", "}\n", "\n", "# Setting the seeds\n", "seeds = [1, 10, 15, 43, 63]\n", "\n", "\n", "env = gym.make(\"CartPole-v1\")\n", "state, _ = env.reset()\n", "# Get number of actions from gym action space\n", "n_actions = env.action_space.n\n", "# Get the number of state observations\n", 
"n_observations = env.observation_space.shape[0]\n", "# TODO: Choose optimizer and loss function\n", "optimizer = optim.Adam\n", "criterion = nn.MSELoss\n", "\n", "\n", "# Create multiple agents with different seeds\n", "DQN_agents = []\n", "dqn_sum_of_rewards = []\n", "\n", "for seed in seeds:\n", "    # TODO: create the Agent instance\n", "    dqn_Agent = create_dqn_agent(seed, QNetwork, ReplayMemory, optimizer, criterion, params)\n", "    # TODO: Append the Agent to the agents list\n", "    DQN_agents.append(dqn_Agent)\n", "    dqn_sum_of_rewards.append([])" ] }, { "cell_type": "markdown", "metadata": { "id": "QXFBHgbhxplU" }, "source": [ "## **Training Loop for the DQN Agent** \n", "\n", "With the hyperparameters set, we can now train the agent. \n", "\n", "The code is designed for convenient retraining: simply re-run the cell to continue training if the agent hasn’t achieved satisfactory results. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "id": "3_1rzGEXxplU", "outputId": "c5aacfa6-c1e6-4f6d-9096-01376eb16ea4" }, "outputs": [ { "data": { "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# TODO: Set the number of training episodes\n", "num_episodes = 400\n", "\n", "\n", "for i,dqn_Agent in enumerate(DQN_agents):\n", "\n", "    for e in range(1, num_episodes + 1):\n", "        # TODO: reset the environment\n", "        state,_ = env.reset()\n", "        state = torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0)\n", "        episode_reward, done = 0, False\n", "        # TODO: in the loop take actions,go to next step and get the rewards\n", "        while True:\n", "\n", "            action,_ = dqn_Agent.act(state)\n", "            next_state, reward, done,truncated ,_,= env.step(action)\n", "            next_state = torch.tensor(next_state, dtype=torch.float32, device=device).unsqueeze(0)\n", "            # TODO: Save the Data to the Agent memory and train the Agent\n", "            dqn_Agent.step(state, action, reward, next_state, done)\n", "\n", "            episode_reward += reward\n", "            # TODO: break the loop if the episode is terminated or truncated\n", "            if done or truncated:\n", "                break\n", "            state = next_state\n", "\n", "        dqn_sum_of_rewards[i].append(episode_reward)\n", "        # Save model every 50 episodes and plot the returns (change the rate if needed) , you can turn the plot off.\n", "        if e % 50 == 0:\n", "\n", "            plot_rewards(dqn_sum_of_rewards[i],i)\n", "            path = f'DQN_{i}_Network' + (str(len(dqn_sum_of_rewards[i]))).zfill(4)\n", "            dqn_Agent.save(path)\n", "            save_progress(dqn_sum_of_rewards[i], path)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Hq7ueyTGxplU" }, "source": [ "## **Why DDQN?** \n", "\n", "DQN and its variants use a replay buffer and Q-network, but DQN struggles with learning stability. **Can you identify why?** \n", "To address this, we introduce DDQN, a modified version with improved stability that is capable of handling more complex environments. 
\n", "\n", "---\n", "\n", "## **DDQN Agent** \n", "\n", "### **Deep Q-Networks in DDQN** \n", "\n", "DDQN uses two networks: \n", "\n", "- **Online Network**: Estimates $Q_\\theta(s_i,a_i)$, the value of a state-action pair. Its parameters are denoted by $\\theta$.\n", "- **Target Network**: Estimates $Q_{\\theta'}(s_i',a_i')$, the value of the next state-action pair. Its parameters are denoted by $\\theta'$.\n", "\n", "Both networks have identical architectures but serve different roles. Using the target network to evaluate the action chosen by the online network helps stabilize training by reducing overestimation bias. \n", "\n", "### **Network Updating** \n", "\n", "In DQN, a single network estimates both $Q(s_i,a_i)$ and $Q(s_i',a_i')$, which can lead to instability. DDQN introduces a key change: \n", "\n", "- The **online network** selects the best action: \n", " $$ \\arg\\max_a Q_\\theta(s'_i,a) $$\n", "- The **target network** estimates its value: \n", " $$ Q_{\\theta'}(s'_i, \\arg\\max_a Q_\\theta(s'_i,a)) $$ \n", "\n", "This modifies the loss function to: \n", "\n", "$$ [r_i + \\gamma \\cdot Q_{\\theta'}(s'_i, \\arg\\max_a Q_\\theta(s'_i,a)) - Q_\\theta(s_i,a_i)]^2 $$ \n", "\n", "This version of DDQN also uses **soft replacement** of the target network, a further modification that is not part of the original DDQN.\n", "* `soft_update`: Every `UPDATE_RATE` steps, the target network is moved toward the online network by a soft update controlled by the previously defined hyperparameter $\\tau$. The target is updated according to: $$\\theta' \\leftarrow \\tau \\theta + (1 - \\tau) \\theta'$$\n", "\n", "## DDQN Pseudocode\n", "\n", " \n", "\n", "\n", "Now, let's define the **DDQN class**. 
\n" ] }, { "cell_type": "markdown", "metadata": { "id": "MFqyOXMIxplU" }, "source": [ "## **DDQN Agent Class**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vtv2Nf4HxplU" }, "outputs": [], "source": [ "import copy\n", "\n", "class DDQNAgent(object):\n", "\n", " def __init__(self, q_network, memory, optimizer, criterion, params):\n", " # TODO: Create policy and target nets based on the q-net\n", " # The target net must be a separate copy; assigning the same module to both names would make it an alias of the policy net\n", " self.policy_net = q_network.to(device)\n", " self.target_net = copy.deepcopy(q_network).to(device)\n", " self.Loss = []\n", " # TODO: Setup the agent's memory\n", " self.reply_buffer = memory(params['BUFFER_SIZE'],params['BATCH_SIZE'])\n", " # criterion, optimizer and params\n", " self.criterion = criterion()\n", " self.optimizer = optimizer(self.policy_net.parameters(), lr=params['LR'],amsgrad=True)\n", " self.tau = params['TAU']\n", " self.gamma = params['GAMMA']\n", " self.update_rate = params['UPDATE_RATE']\n", " self.eps = {'START': params['EPS_START'], 'END': params['EPS_END'], 'DECAY': params['EPS_DECAY']}\n", " self.steps_done = 0\n", "\n", "\n", " # TODO: Set the target network parameters equal to policy network\n", " self.soft_update(tau = 1)\n", "\n", " def step(self, state, action, reward, next_state, done):\n", " # TODO: Increment the steps_done 1 step each time\n", " self.steps_done += 1\n", " # TODO: Save the experience in the memory of the agent\n", " self.reply_buffer.push(state, action, reward, next_state, done)\n", " if self.reply_buffer.get_size() > self.reply_buffer.batch_size:\n", " # TODO: Sample a batch from memory and learn from it\n", " states, actions, rewards, next_states,dones= self.reply_buffer.sample()\n", " self.learn(states, actions, rewards, next_states,dones)\n", "\n", "\n", " def act(self, state, greedy=False,eps_threshold = None):\n", " self.eps_threshold = self.eps['END'] + (self.eps['START'] - self.eps['END']) * np.exp(- self.steps_done / self.eps['DECAY'])\n", "\n", " if greedy or random.random() > self.eps_threshold:\n", " with 
torch.no_grad(): # TODO: Select greedy action\n", " # Act with the online (policy) network; the target network is only used to build the learning target\n", " Q_values = self.policy_net(state)\n", " action = torch.argmax(Q_values).item()\n", " max_Q = torch.max(Q_values).item()\n", " else: # TODO: Select random action\n", " action = random.choices(range(2))[0]\n", " max_Q = 0\n", " return action,max_Q\n", "\n", " def learn(self, states, actions, rewards, next_states, dones):\n", "\n", "\n", "\n", " states = torch.from_numpy(states).float().to(device)\n", " actions = torch.from_numpy(actions).long().to(device)\n", " rewards = torch.from_numpy(rewards).float().to(device)\n", " next_states = torch.from_numpy(next_states).float().to(device)\n", " dones = torch.from_numpy(dones).float().to(device)\n", "\n", " # TODO: Compute the predicted Q-values using the policy network\n", " state_Q = self.policy_net(states)\n", " # reshape(-1, 1) keeps the batch dimension without relying on the global params dict\n", " state_Q = state_Q.gather(dim = 1, index = actions).reshape(-1, 1)\n", "\n", "\n", "\n", " with torch.no_grad():\n", " # TODO: Select the best next action using the policy network (online network)\n", " best_next_action = torch.argmax(self.policy_net(next_states),dim = 1).unsqueeze(1)\n", " # TODO: Get the Q-values of the (next state,best next action) using the target network and selected best next action\n", " next_state_Q = self.target_net(next_states).gather(1, best_next_action)\n", " # TODO: Compute the Bellman backup target: r + γ * Q(next state,best next action) * (1 - done)\n", " bellman_backup = rewards + self.gamma*next_state_Q*(1-dones)\n", " # TODO: Compute the loss and do backpropagation\n", " loss = self.criterion(state_Q,bellman_backup)\n", " self.optimizer.zero_grad()\n", " loss.backward()\n", " self.optimizer.step()\n", " # TODO: do soft replacement if steps_done % update rate = 0\n", " if self.steps_done % self.update_rate == 0:\n", " self.soft_update(tau = self.tau)\n", "\n", "\n", " def soft_update(self,tau):\n", " # TODO: Soft update of all weights in the target network\n", "\n", " for 
target_param, pi_param in zip(self.target_net.parameters(), self.policy_net.parameters()):\n", " target_param.data.copy_(tau * pi_param.data + (1 - tau) * target_param.data)\n", "\n", "\n", " def save(self, PATH):\n", " torch.save(self.policy_net, PATH + '_policy.pt')\n", " torch.save(self.target_net, PATH + '_target.pt')\n", "\n", " def load(self, PATH):\n", " self.policy_net = torch.load(PATH + '_policy.pt')\n", " self.target_net = torch.load(PATH + '_target.pt')" ] }, { "cell_type": "markdown", "metadata": { "id": "Cd7tRn5vEtxr" }, "source": [ "## **Setting Up Essentials for DDQN** \n", "\n", "- **CartPole Environment Setup** \n", "- **Hyperparameter Initialization** \n", "- **Creating DDQN Agents with Different Random Seeds** \n", " - Why? Training multiple agents with different seeds helps assess DDQN's consistency and robustness. \n", " - We use **5 random seeds** for better evaluation. \n", "- **Defining Optimizer and Loss Function** \n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "LdkgUQl2RlBv" }, "outputs": [], "source": [ "def create_ddqn_agent(seed, QNetwork, ReplayMemory, optimizer, criterion, params):\n", " # TODO: set the different seeds\n", " torch.manual_seed(seed)\n", " random.seed(seed)\n", " np.random.seed(seed)\n", " # TODO: Instantiate the agent\n", " return DDQNAgent(QNetwork(n_observations, n_actions), ReplayMemory, optimizer, criterion, params)\n", "\n", "params = {\n", " 'UPDATE_RATE': 10 , # how often to update the network\n", " 'BUFFER_SIZE': int(100000), # size of the replay buffer\n", " 'BATCH_SIZE':128, # number of experiences sampled from memory\n", " 'GAMMA': 0.99 , # discount factor\n", " 'EPS_START':0.99 , # starting value of epsilon\n", " 'EPS_END': 0.01 , # final value of epsilon\n", " 'EPS_DECAY': 10000 , # rate of exponential decay of epsilon 35000\n", " 'TAU':0.009, # update rate of the target network 0.008\n", " 'LR': 0.0001 # learning rate of the optimizer\n", "}\n", "\n", "# setting different seed 
values\n", "seeds = [1, 10, 15, 43, 63]\n", "\n", "# TODO: Setup the Environment\n", "env = gym.make(\"CartPole-v1\")\n", "state, _ = env.reset()\n", "# Get number of actions from gym action space\n", "n_actions = env.action_space.n\n", "# Get the number of state observations\n", "n_observations = env.observation_space.shape[0]\n", "# TODO: Choose optimizer and loss function\n", "optimizer = optim.Adam\n", "criterion = nn.MSELoss\n", "\n", "# TODO: Create multiple agents with different seeds\n", "agents = []\n", "sum_of_rewards = []\n", "\n", "for seed in seeds:\n", " # TODO: create the Agent instance\n", " Agent = create_ddqn_agent(seed, QNetwork, ReplayMemory, optimizer, criterion, params)\n", " # TODO: Append the Agent to the agents list\n", " agents.append(Agent)\n", " sum_of_rewards.append([])" ] }, { "cell_type": "markdown", "metadata": { "id": "FKgBONebcSHp" }, "source": [ "## **Training Loop for DDQN Agent** \n", "\n", "With the hyperparameters set, we can now train the agent. The following code allows for easy re-training—if the agent's performance isn't satisfactory, simply re-run the segment to continue training for more episodes. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "id": "hbyia1ACd6Ju", "outputId": "75a11a2e-9d31-4f7f-dd19-204c6f5b3483" }, "outputs": [ { "data": { "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# TODO: Set the number of training episodes\n", "num_episodes = 400\n", "\n", "\n", "for i,DDQN_agent in enumerate(agents):\n", "\n", " for e in range(1, num_episodes + 1):\n", " # TODO: reset the environment\n", " state,_ = env.reset()\n", " state = torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0)\n", " episode_reward, done = 0, False\n", " # TODO: in the loop take actions,go to next step and get the rewards\n", " while True:\n", "\n", " action,_ = DDQN_agent.act(state)\n", " next_state, reward, done,truncated ,_,= env.step(action)\n", " next_state = torch.tensor(next_state, dtype=torch.float32, device=device).unsqueeze(0)\n", " # TODO: Save the Data to the Agent memory and train the Agent\n", " DDQN_agent.step(state, action, reward, next_state, done)\n", " episode_reward += reward\n", " # TODO: break the loop if the episode is terminated or truncated\n", " if done or truncated:\n", " break\n", " state = next_state\n", "\n", " sum_of_rewards[i].append(episode_reward)\n", " # Save model every 50 episodes and plot the returns (change the rate if needed) , you can turn the plot off.\n", " if e % 50 == 0:\n", "\n", " plot_rewards(sum_of_rewards[i],i)\n", " path = f'DDQN_{i}_Network' + (str(len(sum_of_rewards[i]))).zfill(4)\n", " DDQN_agent.save(path)\n", " save_progress(sum_of_rewards[i], path)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "8-pmbcW1n-l9" }, "source": [ "## **Computing the Moving Average of Results (5 points)** \n", "\n", "We compute a moving average to smooth the results for both DDQN and DQN across all seeds. This includes the average of the minimum, maximum, and actual returns across all seeds and episodes. 
\n" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "id": "bzNZAA2o3Xyu" }, "outputs": [], "source": [ "def moving_average(data, window_size=25):\n", " return np.convolve(data, np.ones(window_size)/window_size, mode='valid')\n", "\n", "# Compute moving averages for DDQN\n", "DDQN_mean_rewards = moving_average(np.array(sum_of_rewards).mean(axis=0))\n", "DDQN_min_rewards = moving_average(np.array(sum_of_rewards).min(axis=0))\n", "DDQN_max_rewards = moving_average(np.array(sum_of_rewards).max(axis=0))\n", "\n", "# Compute moving averages for DQN\n", "DQN_mean_rewards = moving_average(np.array(dqn_sum_of_rewards).mean(axis=0))\n", "DQN_min_rewards = moving_average(np.array(dqn_sum_of_rewards).min(axis=0))\n", "DQN_max_rewards = moving_average(np.array(dqn_sum_of_rewards).max(axis=0))" ] }, { "cell_type": "markdown", "metadata": { "id": "CDY4CVrjochx" }, "source": [ "## **Visualizing the Outputs (5 points)** \n", "\n", "Plot the smoothed results with: \n", "- Lower bound: the moving average of the minimum return across seeds \n", "- Upper bound: the moving average of the maximum return across seeds \n", "- Center line: the moving average of the mean return across seeds \n", "\n", "For smoother plots, increase `window_size` in the `moving_average` function.\n" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 610 }, "id": "2P-JEtDL33xM", "outputId": "7d5101ac-5d09-4cc0-85dd-9571fdc3196d" }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_smooth(DDQN_mean_rewards,DDQN_min_rewards,DDQN_max_rewards,DQN_mean_rewards,DQN_min_rewards,DQN_max_rewards)" ] }, { "cell_type": "markdown", "metadata": { "id": "t3EWv6crnryw" }, "source": [ "## **Takeaway Questions**\n", "- Which algorithm, DQN or DDQN, exhibits more stable learning?\n", " - Consider mean returns and the tightness of the upper and lower bounds in the training plot.\n", "- Which algorithm struggles less during learning?" ] }, { "cell_type": "markdown", "metadata": { "id": "PdK1XgRiIXBP" }, "source": [ "## **Model Evaluation** \n", "\n", "To evaluate the model, we measure the **average reward** and its **standard deviation** using the following function. \n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "id": "h-ToISzSmBxl" }, "outputs": [], "source": [ "def evaluate_policy(env, agent, num_episodes=3):\n", " # TODO: Initialize sum of rewards\n", " total_reward = np.zeros((num_episodes,1))\n", " Episode_values = []\n", "\n", " for episode in range(num_episodes):\n", "\n", " # TODO: Initialize environment\n", " state,_ = env.reset()\n", " state = torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0)\n", " done,truncated = False,False\n", " values =[]\n", " mean_value =[]\n", " # TODO: In the loop use greedy policy to evaluate your Agent\n", " while True:\n", "\n", " action,Q = agent.act(state,True)\n", " next_state,reward,done,truncated,_ = env.step(action)\n", " total_reward[episode]+=reward\n", " # appending Q values for later use\n", " ###################################################\n", " values.append(Q)\n", " mean_value.append(np.array(values).mean())\n", " ###################################################\n", " if done or truncated:\n", " break\n", " next_state = torch.tensor(next_state, dtype=torch.float32, device=device)\n", " state = next_state\n", "\n", " Episode_values.append(mean_value)\n", "\n", " # TODO: Return mean and 
std of rewards\n", " mean = sum(total_reward)/num_episodes\n", " # Standard deviation is the square root of the mean squared deviation\n", " std = np.sqrt(sum((total_reward - mean)**2)/num_episodes)\n", "\n", "\n", " return mean,std,Episode_values\n" ] }, { "cell_type": "markdown", "metadata": { "id": "pOzO_aW_Yimj" }, "source": [ "**Run both trained policies for 3 episodes, observe the mean and std of returns, and plot the mean Q-values per episode. For each agent that passes the evaluation bar (mean return of at least 450), plot its mean Q-values.**\n", "\n", "## **Plot DDQN values**" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 989 }, "id": "1AgNnxeBl_V1", "outputId": "26de7fef-3db2-481d-d52c-7071c0e1a826" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "evaluating the 0th agent mean_reward = [316.33333333] +/- [366.88888889]\n", "\n", "evaluating the 1th agent mean_reward = [9.33333333] +/- [0.22222222]\n", "\n", "evaluating the 2th agent mean_reward = [477.33333333] +/- [1027.55555556]\n", "\n", "evaluating the 3th agent mean_reward = [113.] +/- [60.66666667]\n", "\n", "evaluating the 4th agent mean_reward = [370.33333333] +/- [11534.88888889]\n", "\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "DDQN_values = []\n", "# TODO: Run each trained policy for 3 episodes and observe the mean and std of the received returns and the mean Q-values over each episode\n", "for i,DDQN_agent in enumerate(agents):\n", " # TODO: Evaluate the trained agents\n", " mean_reward, std_reward,mean_values = evaluate_policy(env, DDQN_agent,num_episodes = 3)\n", " print(f\"evaluating the {i}th agent mean_reward = {mean_reward} +/- {std_reward}\\n\")\n", " if mean_reward >= 450:\n", " DDQN_values.append(mean_values)\n", "\n", "if len(DDQN_values) != 0:\n", " # TODO: Plot the mean values\n", " plot_values(DDQN_values)\n", "else:\n", " print('[Info] ... the Agent Did not pass the minimum requirement Please train it more.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Plot DQN values**" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ztVRLsjgJJYK", "outputId": "4c220a1c-65b4-4acb-9d08-940d2805ccfb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "evaluating the 0th agent mean_reward = [153.33333333] +/- [16.88888889]\n", "\n", "evaluating the 1th agent mean_reward = [9.] +/- [0.66666667]\n", "\n", "evaluating the 2th agent mean_reward = [286.33333333] +/- [7574.22222222]\n", "\n", "evaluating the 3th agent mean_reward = [374.] +/- [1418.66666667]\n", "\n", "evaluating the 4th agent mean_reward = [384.66666667] +/- [3146.88888889]\n", "\n", "[Info] ... 
the Agent Did not pass the minimum requirement Please train it more.\n" ] } ], "source": [ "DQN_values = []\n", "# TODO: Run each trained policy for 3 episodes and observe the mean and std of the received returns and the mean Q-values over each episode\n", "for i,DQN_agent in enumerate(DQN_agents):\n", " # TODO: Evaluate the trained agents\n", " mean_reward, std_reward,mean_values = evaluate_policy(env, DQN_agent,num_episodes = 3)\n", " print(f\"evaluating the {i}th agent mean_reward = {mean_reward} +/- {std_reward}\\n\")\n", " if mean_reward >= 450:\n", " DQN_values.append(mean_values)\n", "\n", "\n", "if len(DQN_values) != 0:\n", " # TODO: Plot the mean values\n", " plot_values(DQN_values)\n", "else:\n", " print('[Info] ... the Agent Did not pass the minimum requirement Please train it more.')" ] }, { "cell_type": "markdown", "metadata": { "id": "pyohSdzqEC3Z" }, "source": [ "## **Watch the best Agent's performance**\n", "**Select one of the best agents from the evaluation step and render its performance.**\n", "\n", "**Rendering DDQN Agent**" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 519 }, "id": "QGQ3aE8fDLjb", "outputId": "310a9ac1-94ad-4350-dc56-4dc54528c3f7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[info] ... rendering the DDQN Agent performance\n" ] }, { "data": { "text/html": [ "\n", " " ], "text/plain": [ "" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "env = gym.make(\"CartPole-v1\", render_mode='rgb_array')\n", "print('[info] ... 
rendering the DDQN Agent performance')\n", "# TODO: from the previous evaluation Select the best Agent to render the performance\n", "create_policy_eval_video(env, agents[2], 'greedy_policy', 3)" ] }, { "cell_type": "markdown", "metadata": { "id": "ZSPQW_n9Nix_" }, "source": [ "**Rendering DQN Agent**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RJdb9KpadSRp" }, "outputs": [], "source": [ "env = gym.make(\"CartPole-v1\", render_mode='rgb_array')\n", "print('[info] ... rendering the DQN Agent performance')\n", "# TODO: from the previous evaluation Select the best Agent to render the performance\n", "create_policy_eval_video(env, DQN_agents[...], 'greedy_policy', 3)" ] }, { "cell_type": "markdown", "metadata": { "id": "dDKTwX3explX" }, "source": [ "## **Takeaway Questions**\n", "- Which agent is better trained, judging by the evaluation results and the rendered performance?\n", "- One of DDQN's goals is to prevent overestimation of Q-values. Did you observe this effect after plotting the mean Q-values?" ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "kaggle": { "accelerator": "nvidiaTeslaT4", "dataSources": [], "dockerImageVersionId": 30665, "isGpuEnabled": true, "isInternetEnabled": true, "language": "python", "sourceType": "notebook" }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" } }, "nbformat": 4, "nbformat_minor": 0 }