{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "e539f7fea2ba568c7a9a8292b0602a0a", "grade": false, "grade_id": "cell-4292e6ff11f3c291", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "# Assignment 2 - Q-Learning and Expected Sarsa" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "084c7b68a27987da29071541fb20358b", "grade": false, "grade_id": "cell-f4e1bfc6ad38ce3d", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "Welcome to Course 2 Programming Assignment 2. In this notebook, you will:\n", "\n", "- Implement Q-Learning with $\\epsilon$-greedy action selection\n", "- Implement Expected Sarsa with $\\epsilon$-greedy action selection\n", "- Investigate how these two algorithms behave on Cliff World (described on page 132 of the textbook)\n", "\n", "We will provide you with the environment and infrastructure to run an experiment (called the experiment program in RL-Glue). This notebook will provide all the code you need to run your experiment and visualise learning performance.\n", "\n", "This assignment will be graded automatically by comparing the behavior of your agent to our implementations of Expected Sarsa and Q-learning. The random seed will be set to avoid different behavior due to randomness. **You should not call any random functions in this notebook.** It will affect the agent's random state and change the results." 
] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "3a6df636f47ebdf7f0707d7b2651a2c6", "grade": false, "grade_id": "cell-2a8ddbbf0ef25d07", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "## Packages" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "d74b7bc264a49057450f81177d1afbdb", "grade": false, "grade_id": "cell-69f08c6441da699c", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "You will need the following libraries for this assignment. We are using:\n", "1. numpy: the fundamental package for scientific computing with Python.\n", "2. scipy: a Python library for scientific and technical computing.\n", "3. matplotlib: library for plotting graphs in Python.\n", "4. RL-Glue: library for reinforcement learning experiments.\n", "\n", "**Please do not import other libraries** — this will break the autograder." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "from scipy.stats import sem\n", "import matplotlib.pyplot as plt\n", "from rl_glue import RLGlue\n", "import agent\n", "import cliffworld_env\n", "from tqdm import tqdm\n", "import pickle" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "781be58c941d2ddc62052efda26ebd05", "grade": false, "grade_id": "cell-92144e79fff2c0ea", "locked": true, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "plt.rcParams.update({'font.size': 15})\n", "plt.rcParams.update({'figure.figsize': [10,5]})" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "f6c9d5996579dbe1b3ac25058a574409", "grade": false, "grade_id": "cell-148cfbbe73465cef", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "## Section 1: Q-Learning" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "4869e937cb5c63d7046a204ebe15914c", "grade": false, "grade_id": "cell-0c942413e94d98db", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "In this section you will implement and test a Q-Learning agent with $\\epsilon$-greedy action selection (Section 6.5 in the textbook). 
" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "d6eff9064c79025d80bff9970686a5d3", "grade": false, "grade_id": "cell-11cf7ceec7f5b9fe", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "### Implementation" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "d8a38e971b034abfdfc90ca66f3936b0", "grade": false, "grade_id": "cell-3417aeb44526bda3", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "Your job is to implement the updates in the methods agent_step and agent_end. We provide detailed comments in each method describing what your code should do." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "b523008e6f0bde39944117023b591333", "grade": false, "grade_id": "cell-e77107160ebd3c72", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "# [Graded]\n", "# Q-Learning agent here\n", "class QLearningAgent(agent.BaseAgent):\n", " def agent_init(self, agent_init_info):\n", " \"\"\"Setup for the agent called when the experiment first starts.\n", " \n", " Args:\n", " agent_init_info (dict), the parameters used to initialize the agent. 
The dictionary contains:\n", " {\n", " num_states (int): The number of states,\n", " num_actions (int): The number of actions,\n", " epsilon (float): The epsilon parameter for exploration,\n", " step_size (float): The step-size,\n", " discount (float): The discount factor,\n", " seed (int): The seed for the agent's random number generator,\n", " }\n", " \n", " \"\"\"\n", " # Store the parameters provided in agent_init_info.\n", " self.num_actions = agent_init_info[\"num_actions\"]\n", " self.num_states = agent_init_info[\"num_states\"]\n", " self.epsilon = agent_init_info[\"epsilon\"]\n", " self.step_size = agent_init_info[\"step_size\"]\n", " self.discount = agent_init_info[\"discount\"]\n", " self.rand_generator = np.random.RandomState(agent_init_info[\"seed\"])\n", " \n", " # Create an array for action-value estimates and initialize it to zero.\n", " self.q = np.zeros((self.num_states, self.num_actions)) # The array of action-value estimates.\n", "\n", " \n", " def agent_start(self, state):\n", " \"\"\"The first method called when the episode starts, called after\n", " the environment starts.\n", " Args:\n", " state (int): the state from the\n", " environment's env_start function.\n", " Returns:\n", " action (int): the first action the agent takes.\n", " \"\"\"\n", " \n", " # Choose action using epsilon-greedy.\n", " current_q = self.q[state, :]\n", " if self.rand_generator.rand() < self.epsilon:\n", " action = self.rand_generator.randint(self.num_actions)\n", " else:\n", " action = self.argmax(current_q)\n", " self.prev_state = state\n", " self.prev_action = action\n", " return action\n", " \n", " def agent_step(self, reward, state):\n", " \"\"\"A step taken by the agent.\n", " Args:\n", " reward (float): the reward received for taking the last action\n", " state (int): the state from the\n", " environment's step based on where the agent ended up after the\n", " last step.\n", " Returns:\n", " action (int): the action the agent is taking.\n", " \"\"\"\n", " \n", " # Choose action using epsilon-greedy.\n", " current_q = 
self.q[state, :]\n", " if self.rand_generator.rand() < self.epsilon:\n", " action = self.rand_generator.randint(self.num_actions)\n", " else:\n", " action = self.argmax(current_q)\n", " \n", " # Perform an update (1 line)\n", " ### START CODE HERE ###\n", " self.q[self.prev_state, self.prev_action] += self.step_size * (reward + self.discount * np.max(current_q) - self.q[self.prev_state, self.prev_action])\n", " ### END CODE HERE ###\n", " \n", " self.prev_state = state\n", " self.prev_action = action\n", " return action\n", " \n", " def agent_end(self, reward):\n", " \"\"\"Run when the agent terminates.\n", " Args:\n", " reward (float): the reward the agent received for entering the\n", " terminal state.\n", " \"\"\"\n", " # Perform the last update in the episode (1 line)\n", " ### START CODE HERE ###\n", " self.q[self.prev_state, self.prev_action] += self.step_size * (reward - self.q[self.prev_state, self.prev_action])\n", " ### END CODE HERE ###\n", " \n", " def argmax(self, q_values):\n", " \"\"\"argmax with random tie-breaking\n", " Args:\n", " q_values (Numpy array): the array of action-values\n", " Returns:\n", " action (int): an action with the highest value\n", " \"\"\"\n", " top = float(\"-inf\")\n", " ties = []\n", "\n", " for i in range(len(q_values)):\n", " if q_values[i] > top:\n", " top = q_values[i]\n", " ties = []\n", "\n", " if q_values[i] == top:\n", " ties.append(i)\n", "\n", " return self.rand_generator.choice(ties)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "9361d06fd03ef5169c039e916de4ec26", "grade": false, "grade_id": "cell-5bb232d570f6ba80", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "### Test" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "301cb73e95ae17680f0d24e10c7513d6", "grade": false, "grade_id": "cell-d2621de8f8b5e4ba", 
"locked": true, "schema_version": 3, "solution": false } }, "source": [ "Run the cells below to test the implemented methods. The output of each cell should match the expected output.\n", "\n", "Note that passing this test does not guarantee correct behavior on the Cliff World." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "e31522059faa25ed475e25a6fbbc420c", "grade": false, "grade_id": "cell-1c160d79c07cac0b", "locked": true, "schema_version": 3, "solution": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Action Value Estimates: \n", " [[0. 0. 0. 0.]\n", " [0. 0. 0. 0.]\n", " [0. 0. 0. 0.]]\n", "Action: 1\n" ] } ], "source": [ "# Do not modify this cell!\n", "\n", "## Test Code for agent_start() ##\n", "\n", "agent_info = {\"num_actions\": 4, \"num_states\": 3, \"epsilon\": 0.1, \"step_size\": 0.1, \"discount\": 1.0, \"seed\": 0}\n", "current_agent = QLearningAgent()\n", "current_agent.agent_init(agent_info)\n", "action = current_agent.agent_start(0)\n", "print(\"Action Value Estimates: \\n\", current_agent.q)\n", "print(\"Action:\", action)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "fbe3f4201266f67423b1ece02dbc0333", "grade": false, "grade_id": "cell-f1a6a8b66b6598e6", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Expected Output:**\n", "\n", "```\n", "Action Value Estimates: \n", " [[0. 0. 0. 0.]\n", " [0. 0. 0. 0.]\n", " [0. 0. 0. 
0.]]\n", "Action: 1\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "b5d0abaed2b270d5a21f9503d8470e68", "grade": false, "grade_id": "cell-b63b908156924031", "locked": true, "schema_version": 3, "solution": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Action Value Estimates: \n", " [[0. 0.2 0. 0. ]\n", " [0. 0. 0. 0.02]\n", " [0. 0. 0. 0. ]]\n", "Actions: [1, 3, 1]\n" ] } ], "source": [ "# Do not modify this cell!\n", "\n", "## Test Code for agent_step() ##\n", "\n", "actions = []\n", "agent_info = {\"num_actions\": 4, \"num_states\": 3, \"epsilon\": 0.1, \"step_size\": 0.1, \"discount\": 1.0, \"seed\": 0}\n", "current_agent = QLearningAgent()\n", "current_agent.agent_init(agent_info)\n", "actions.append(current_agent.agent_start(0))\n", "actions.append(current_agent.agent_step(2, 1))\n", "actions.append(current_agent.agent_step(0, 0))\n", "print(\"Action Value Estimates: \\n\", current_agent.q)\n", "print(\"Actions:\", actions)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "ed1a688d14e6eb3961b32a8dbdbbb858", "grade": false, "grade_id": "cell-3b916a9081886d4d", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Expected Output:**\n", "\n", "```\n", "Action Value Estimates: \n", " [[ 0. 0.2 0. 0. ]\n", " [ 0. 0. 0. 0.02]\n", " [ 0. 0. 0. 0. ]]\n", "Actions: [1, 3, 1]\n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "49dd68d058ac35cf96e3682e71080b1f", "grade": false, "grade_id": "cell-8fe80d6a4a6555a5", "locked": true, "schema_version": 3, "solution": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Action Value Estimates: \n", " [[0. 0.2 0. 0. ]\n", " [0. 0. 0. 
0.1]\n", " [0. 0. 0. 0. ]]\n", "Actions: [1, 3]\n" ] } ], "source": [ "# Do not modify this cell!\n", "\n", "## Test Code for agent_end() ##\n", "\n", "actions = []\n", "agent_info = {\"num_actions\": 4, \"num_states\": 3, \"epsilon\": 0.1, \"step_size\": 0.1, \"discount\": 1.0, \"seed\": 0}\n", "current_agent = QLearningAgent()\n", "current_agent.agent_init(agent_info)\n", "actions.append(current_agent.agent_start(0))\n", "actions.append(current_agent.agent_step(2, 1))\n", "current_agent.agent_end(1)\n", "print(\"Action Value Estimates: \\n\", current_agent.q)\n", "print(\"Actions:\", actions)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "d34093b01b729874834af87668416b5f", "grade": false, "grade_id": "cell-8eddb10c5e7c1791", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Expected Output:**\n", "\n", "```\n", "Action Value Estimates: \n", " [[0. 0.2 0. 0. ]\n", " [0. 0. 0. 0.1]\n", " [0. 0. 0. 0. ]]\n", "Actions: [1, 3]\n", "```" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "9a549cc5d3d6a35b2578be87a3ea288a", "grade": false, "grade_id": "cell-3ab82a89ea44f09e", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "## Section 2: Expected Sarsa" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "e16e2e0918866de0908360b07d53b814", "grade": false, "grade_id": "cell-12980d9f811d7bb6", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "In this section you will implement an Expected Sarsa agent with $\\epsilon$-greedy action selection (Section 6.6 in the textbook). 
" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "f635cf2541375086474f964e9ebe31d8", "grade": false, "grade_id": "cell-09c8eef6bd8e9472", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "### Implementation" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "401762021600e7176bb065754532c57b", "grade": false, "grade_id": "cell-27a67597b07f3d03", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "Your job is to implement the updates in the methods agent_step and agent_end. We provide detailed comments in each method describing what your code should do." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "db1e0c043dcc4292dc81eb19e3e0debd", "grade": false, "grade_id": "cell-8d20990dcf9eeb6c", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "# [Graded]\n", "# Expected Sarsa agent here\n", "class ExpectedSarsaAgent(agent.BaseAgent):\n", " def agent_init(self, agent_init_info):\n", " \"\"\"Setup for the agent called when the experiment first starts.\n", " \n", " Args:\n", " agent_init_info (dict), the parameters used to initialize the agent. 
The dictionary contains:\n", " {\n", " num_states (int): The number of states,\n", " num_actions (int): The number of actions,\n", " epsilon (float): The epsilon parameter for exploration,\n", " step_size (float): The step-size,\n", " discount (float): The discount factor,\n", " seed (int): The seed for the agent's random number generator,\n", " }\n", " \n", " \"\"\"\n", " # Store the parameters provided in agent_init_info.\n", " self.num_actions = agent_init_info[\"num_actions\"]\n", " self.num_states = agent_init_info[\"num_states\"]\n", " self.epsilon = agent_init_info[\"epsilon\"]\n", " self.step_size = agent_init_info[\"step_size\"]\n", " self.discount = agent_init_info[\"discount\"]\n", " self.rand_generator = np.random.RandomState(agent_init_info[\"seed\"])\n", " \n", " # Create an array for action-value estimates and initialize it to zero.\n", " self.q = np.zeros((self.num_states, self.num_actions)) # The array of action-value estimates.\n", "\n", " \n", " def agent_start(self, state):\n", " \"\"\"The first method called when the episode starts, called after\n", " the environment starts.\n", " Args:\n", " state (int): the state from the\n", " environment's env_start function.\n", " Returns:\n", " action (int): the first action the agent takes.\n", " \"\"\"\n", " \n", " # Choose action using epsilon-greedy.\n", " current_q = self.q[state, :]\n", " if self.rand_generator.rand() < self.epsilon:\n", " action = self.rand_generator.randint(self.num_actions)\n", " else:\n", " action = self.argmax(current_q)\n", " self.prev_state = state\n", " self.prev_action = action\n", " return action\n", " \n", " def agent_step(self, reward, state):\n", " \"\"\"A step taken by the agent.\n", " Args:\n", " reward (float): the reward received for taking the last action\n", " state (int): the state from the\n", " environment's step based on where the agent ended up after the\n", " last step.\n", " Returns:\n", " action (int): the action the agent is taking.\n", " \"\"\"\n", " \n", " # Choose action using epsilon-greedy.\n", " current_q = 
self.q[state,:]\n", " if self.rand_generator.rand() < self.epsilon:\n", " action = self.rand_generator.randint(self.num_actions)\n", " else:\n", " action = self.argmax(current_q)\n", " \n", " # Perform an update (~5 lines)\n", " ### START CODE HERE ###\n", " q_max = np.max(current_q)\n", " pi = np.ones(self.num_actions) * (self.epsilon / self.num_actions)\n", " pi += (current_q == q_max) * ((1 - self.epsilon) / np.sum(current_q == q_max))\n", " expectation = np.sum(current_q * pi)\n", " self.q[self.prev_state, self.prev_action] += self.step_size * (reward + self.discount * expectation - self.q[self.prev_state, self.prev_action])\n", " ### END CODE HERE ###\n", " \n", " self.prev_state = state\n", " self.prev_action = action\n", " return action\n", " \n", " def agent_end(self, reward):\n", " \"\"\"Run when the agent terminates.\n", " Args:\n", " reward (float): the reward the agent received for entering the\n", " terminal state.\n", " \"\"\"\n", " # Perform the last update in the episode (1 line)\n", " ### START CODE HERE ###\n", " self.q[self.prev_state, self.prev_action] += self.step_size * (reward - self.q[self.prev_state, self.prev_action])\n", " ### END CODE HERE ###\n", " \n", " def argmax(self, q_values):\n", " \"\"\"argmax with random tie-breaking\n", " Args:\n", " q_values (Numpy array): the array of action-values\n", " Returns:\n", " action (int): an action with the highest value\n", " \"\"\"\n", " top = float(\"-inf\")\n", " ties = []\n", "\n", " for i in range(len(q_values)):\n", " if q_values[i] > top:\n", " top = q_values[i]\n", " ties = []\n", "\n", " if q_values[i] == top:\n", " ties.append(i)\n", "\n", " return self.rand_generator.choice(ties)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "f358f7e2676a77b8dd13a09fad9261a2", "grade": false, "grade_id": "cell-bd6580041d80533a", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "### Test" ] 
}, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "562af8b2c4449bec9534666c9747e461", "grade": false, "grade_id": "cell-7574736a2553024d", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "Run the cells below to test the implemented methods. The output of each cell should match the expected output.\n", "\n", "Note that passing this test does not guarantee correct behavior on the Cliff World." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "62db384f5fa66caae6a68a840cb56797", "grade": false, "grade_id": "cell-7d4f037d4106e8e2", "locked": true, "schema_version": 3, "solution": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Action Value Estimates: \n", " [[0. 0. 0. 0.]\n", " [0. 0. 0. 0.]\n", " [0. 0. 0. 0.]]\n", "Action: 1\n" ] } ], "source": [ "# Do not modify this cell!\n", "\n", "## Test Code for agent_start() ##\n", "\n", "agent_info = {\"num_actions\": 4, \"num_states\": 3, \"epsilon\": 0.1, \"step_size\": 0.1, \"discount\": 1.0, \"seed\": 0}\n", "current_agent = ExpectedSarsaAgent()\n", "current_agent.agent_init(agent_info)\n", "action = current_agent.agent_start(0)\n", "print(\"Action Value Estimates: \\n\", current_agent.q)\n", "print(\"Action:\", action)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "2f5cc33e33a94e5123e0311be2208c2a", "grade": false, "grade_id": "cell-4d1ae44ff39f2ef6", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Expected Output:**\n", "\n", "```\n", "Action Value Estimates: \n", " [[0. 0. 0. 0.]\n", " [0. 0. 0. 0.]\n", " [0. 0. 0. 
0.]]\n", "Action: 1\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "5b40fa207655b4dd1028786e8d553a70", "grade": false, "grade_id": "cell-e77508d1e061c326", "locked": true, "schema_version": 3, "solution": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Action Value Estimates: \n", " [[0. 0.2 0. 0. ]\n", " [0. 0. 0. 0.0185]\n", " [0. 0. 0. 0. ]]\n", "Actions: [1, 3, 1]\n" ] } ], "source": [ "# Do not modify this cell!\n", "\n", "## Test Code for agent_step() ##\n", "\n", "actions = []\n", "agent_info = {\"num_actions\": 4, \"num_states\": 3, \"epsilon\": 0.1, \"step_size\": 0.1, \"discount\": 1.0, \"seed\": 0}\n", "current_agent = ExpectedSarsaAgent()\n", "current_agent.agent_init(agent_info)\n", "actions.append(current_agent.agent_start(0))\n", "actions.append(current_agent.agent_step(2, 1))\n", "actions.append(current_agent.agent_step(0, 0))\n", "print(\"Action Value Estimates: \\n\", current_agent.q)\n", "print(\"Actions:\", actions)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "e92c78b348a88e1db2e988fd442a1ae5", "grade": false, "grade_id": "cell-11bdb20cca21c6d6", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Expected Output:**\n", "\n", "```\n", "Action Value Estimates: \n", " [[0. 0.2 0. 0. ]\n", " [0. 0. 0. 0.0185]\n", " [0. 0. 0. 0. ]]\n", "Actions: [1, 3, 1]\n", "```" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "4f52f3065b81d15c96f297117c7b6d81", "grade": false, "grade_id": "cell-1866144548cd9c28", "locked": true, "schema_version": 3, "solution": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Action Value Estimates: \n", " [[0. 0.2 0. 0. ]\n", " [0. 0. 0. 
0.1]\n", " [0. 0. 0. 0. ]]\n", "Actions: [1, 3]\n" ] } ], "source": [ "# Do not modify this cell!\n", "\n", "## Test Code for agent_end() ##\n", "\n", "actions = []\n", "agent_info = {\"num_actions\": 4, \"num_states\": 3, \"epsilon\": 0.1, \"step_size\": 0.1, \"discount\": 1.0, \"seed\": 0}\n", "current_agent = ExpectedSarsaAgent()\n", "current_agent.agent_init(agent_info)\n", "actions.append(current_agent.agent_start(0))\n", "actions.append(current_agent.agent_step(2, 1))\n", "current_agent.agent_end(1)\n", "print(\"Action Value Estimates: \\n\", current_agent.q)\n", "print(\"Actions:\", actions)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "e9a2554acf9aa8d280d1175c3f23554b", "grade": false, "grade_id": "cell-9edd1b6d5a51c18a", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "**Expected Output:**\n", "\n", "```\n", "Action Value Estimates: \n", " [[0. 0.2 0. 0. ]\n", " [0. 0. 0. 0.1]\n", " [0. 0. 0. 0. ]]\n", "Actions: [1, 3]\n", "```" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "95e2ea24f0de8c0a847e3f9b1719e8f1", "grade": false, "grade_id": "cell-2692792f654c792f", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "## Section 3: Solving the Cliff World" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "5f6c1e54b358fabad02c9002f23a1087", "grade": false, "grade_id": "cell-6e7fbbaa12d4bf31", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "We described the Cliff World environment in the video \"Expected Sarsa in the Cliff World\" in Lesson 3. This is an undiscounted episodic task and thus we set $\\gamma$=1. The agent starts in the bottom left corner of the gridworld below and takes actions that move it in the four directions. 
Actions that would move the agent off the cliff incur a reward of -100 and send the agent back to the start state. The reward for all other transitions is -1. An episode terminates when the agent reaches the bottom right corner. " ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "bac5c2eaf9d52fa5d29242db0de448f4", "grade": false, "grade_id": "cell-6aaddf82523ef2a5", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "[Figure: the Cliff World gridworld]\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "4215fbaa30c33d57f4351e501f0a6422", "grade": false, "grade_id": "cell-e55d077b9f8b6133", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "Using the experiment program in the cell below, we now compare the agents on the Cliff World environment and plot the sum of rewards during each episode for the two agents.\n", "\n", "The result of this cell will be graded. If you make any changes to your algorithms, you have to run this cell again before submitting the assignment." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "343a62fbee9e83abdb3d4bd9a25c6283", "grade": false, "grade_id": "cell-6d11bb590ebfb0b2", "locked": true, "schema_version": 3, "solution": false } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 100/100 [00:25<00:00, 3.91it/s]\n", "100%|██████████| 100/100 [00:56<00:00, 1.77it/s]\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Do not modify this cell!\n", "\n", "agents = {\n", " \"Q-learning\": QLearningAgent,\n", " \"Expected Sarsa\": ExpectedSarsaAgent\n", "}\n", "env = cliffworld_env.Environment\n", "all_reward_sums = {} # Contains sum of rewards during episode\n", "all_state_visits = {} # Contains state visit counts during the last 10 episodes\n", "agent_info = {\"num_actions\": 4, \"num_states\": 48, \"epsilon\": 0.1, \"step_size\": 0.5, \"discount\": 1.0}\n", "env_info = {}\n", "num_runs = 100 # The number of runs\n", "num_episodes = 500 # The number of episodes in each run\n", "\n", "for algorithm in [\"Q-learning\", \"Expected Sarsa\"]:\n", " all_reward_sums[algorithm] = []\n", " all_state_visits[algorithm] = []\n", " for run in tqdm(range(num_runs)):\n", " agent_info[\"seed\"] = run\n", " rl_glue = RLGlue(env, agents[algorithm])\n", " rl_glue.rl_init(agent_info, env_info)\n", "\n", " reward_sums = []\n", " state_visits = np.zeros(48)\n", "# last_episode_total_reward = 0\n", " for episode in range(num_episodes):\n", " if episode < num_episodes - 10:\n", " # Runs an episode\n", " rl_glue.rl_episode(0) \n", " else: \n", " # Runs an episode while keeping track of visited states\n", " state, action = rl_glue.rl_start()\n", " state_visits[state] += 1\n", " is_terminal = False\n", " while not is_terminal:\n", " reward, state, action, is_terminal = rl_glue.rl_step()\n", " state_visits[state] += 1\n", " \n", " reward_sums.append(rl_glue.rl_return())\n", "# last_episode_total_reward = rl_glue.rl_return()\n", " \n", " all_reward_sums[algorithm].append(reward_sums)\n", " all_state_visits[algorithm].append(state_visits)\n", "\n", "# save results\n", "import os\n", "import shutil\n", "os.makedirs('results', exist_ok=True)\n", "np.save('results/q_learning.npy', all_reward_sums['Q-learning'])\n", "np.save('results/expected_sarsa.npy', all_reward_sums['Expected Sarsa'])\n", 
"shutil.make_archive('results', 'zip', '.', 'results')\n", "\n", " \n", "for algorithm in [\"Q-learning\", \"Expected Sarsa\"]:\n", " plt.plot(np.mean(all_reward_sums[algorithm], axis=0), label=algorithm)\n", "plt.xlabel(\"Episodes\")\n", "plt.ylabel(\"Sum of\\n rewards\\n during\\n episode\",rotation=0, labelpad=40)\n", "plt.xlim(0,500)\n", "plt.ylim(-100,0)\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "1cbb34897b56a32ea1e378b95caa0842", "grade": false, "grade_id": "cell-c3967df7d24c7d02", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "To see why these two agents behave differently, let's inspect the states they visit most. Run the cell below to generate plots showing the number of timesteps that the agents spent in each state over the last 10 episodes." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "a5d9243d4e90f82665bc9ca467e065ef", "grade": false, "grade_id": "cell-37a2b6675676da6f", "locked": true, "schema_version": 3, "solution": false } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Do not modify this cell!\n", "\n", "for algorithm, position in [(\"Q-learning\", 211), (\"Expected Sarsa\", 212)]:\n", " plt.subplot(position)\n", " average_state_visits = np.array(all_state_visits[algorithm]).mean(axis=0)\n", " grid_state_visits = average_state_visits.reshape((4,12))\n", " grid_state_visits[0,1:-1] = np.nan\n", " plt.pcolormesh(grid_state_visits, edgecolors='gray', linewidth=2)\n", " plt.title(algorithm)\n", " plt.axis('off')\n", " cm = plt.get_cmap()\n", " cm.set_bad('gray')\n", "\n", " plt.subplots_adjust(bottom=0.0, right=0.7, top=1.0)\n", " cax = plt.axes([0.85, 0.0, 0.075, 1.])\n", "cbar = plt.colorbar(cax=cax)\n", "cbar.ax.set_ylabel(\"Visits during\\n the last 10\\n episodes\", rotation=0, labelpad=70)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "e20aaec2eb1806cda6de9f75002264d5", "grade": false, "grade_id": "cell-c7575e40e56f751c", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "The Q-learning agent learns the optimal policy, one that moves along the cliff and reaches the goal in as few steps as possible. However, because the agent uses $\\epsilon$-greedy exploration rather than following the optimal policy, it occasionally falls off the cliff. The Expected Sarsa agent takes exploration into account and follows a safer path. Note that this differs from the textbook, which shows Sarsa learning an even safer path.\n", "\n", "\n", "Previously we used a fixed step-size of 0.5 for the agents. What happens with other step-sizes? Does this difference in performance persist?\n", "\n", "In the next experiment we will try 10 different step-sizes from 0.1 to 1.0 and compare the sum of rewards per episode averaged over the first 100 episodes (similar to the interim performance curves in Figure 6.3 of the textbook). 
Shaded regions show standard errors.\n", "\n", "This cell takes around 10 minutes to run. The result of this cell will be graded. If you make any changes to your algorithms, you have to run this cell again before submitting the assignment." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "96725dad62b0596792b4d5694f64637e", "grade": false, "grade_id": "cell-f079ef9418195c22", "locked": true, "schema_version": 3, "solution": false } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 100/100 [00:21<00:00, 4.65it/s]\n", "100%|██████████| 100/100 [00:15<00:00, 6.60it/s]\n", "100%|██████████| 100/100 [00:11<00:00, 8.92it/s]\n", "100%|██████████| 100/100 [00:09<00:00, 10.57it/s]\n", "100%|██████████| 100/100 [00:09<00:00, 10.92it/s]\n", "100%|██████████| 100/100 [00:08<00:00, 11.17it/s]\n", "100%|██████████| 100/100 [00:08<00:00, 11.79it/s]\n", "100%|██████████| 100/100 [00:07<00:00, 13.26it/s]\n", "100%|██████████| 100/100 [00:06<00:00, 14.45it/s]\n", "100%|██████████| 100/100 [00:07<00:00, 12.89it/s]\n", "100%|██████████| 100/100 [00:48<00:00, 2.07it/s]\n", "100%|██████████| 100/100 [00:33<00:00, 2.99it/s]\n", "100%|██████████| 100/100 [00:27<00:00, 3.62it/s]\n", "100%|██████████| 100/100 [00:22<00:00, 4.35it/s]\n", "100%|██████████| 100/100 [00:20<00:00, 4.92it/s]\n", "100%|██████████| 100/100 [00:17<00:00, 5.60it/s]\n", "100%|██████████| 100/100 [00:18<00:00, 5.55it/s]\n", "100%|██████████| 100/100 [00:17<00:00, 5.67it/s]\n", "100%|██████████| 100/100 [00:17<00:00, 5.80it/s]\n", "100%|██████████| 100/100 [00:14<00:00, 6.73it/s]\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Do not modify this cell!\n", "\n", "agents = {\n", " \"Q-learning\": QLearningAgent,\n", " \"Expected Sarsa\": ExpectedSarsaAgent\n", "}\n", "env = cliffworld_env.Environment\n", "all_reward_sums = {}\n", "step_sizes = np.linspace(0.1,1.0,10)\n", "agent_info = {\"num_actions\": 4, \"num_states\": 48, \"epsilon\": 0.1, \"discount\": 1.0}\n", "env_info = {}\n", "num_runs = 100\n", "num_episodes = 100\n", "all_reward_sums = {}\n", "\n", "for algorithm in [\"Q-learning\", \"Expected Sarsa\"]:\n", " for step_size in step_sizes:\n", " all_reward_sums[(algorithm, step_size)] = []\n", " agent_info[\"step_size\"] = step_size\n", " for run in tqdm(range(num_runs)):\n", " agent_info[\"seed\"] = run\n", " rl_glue = RLGlue(env, agents[algorithm])\n", " rl_glue.rl_init(agent_info, env_info)\n", "\n", " return_sum = 0\n", " for episode in range(num_episodes):\n", " rl_glue.rl_episode(0)\n", " return_sum += rl_glue.rl_return()\n", " all_reward_sums[(algorithm, step_size)].append(return_sum/num_episodes)\n", " \n", "\n", "for algorithm in [\"Q-learning\", \"Expected Sarsa\"]:\n", " algorithm_means = np.array([np.mean(all_reward_sums[(algorithm, step_size)]) for step_size in step_sizes])\n", " algorithm_stds = np.array([sem(all_reward_sums[(algorithm, step_size)]) for step_size in step_sizes])\n", " plt.plot(step_sizes, algorithm_means, marker='o', linestyle='solid', label=algorithm)\n", " plt.fill_between(step_sizes, algorithm_means + algorithm_stds, algorithm_means - algorithm_stds, alpha=0.2)\n", "\n", "plt.legend()\n", "plt.xlabel(\"Step-size\")\n", "plt.ylabel(\"Sum of\\n rewards\\n per episode\",rotation=0, labelpad=50)\n", "plt.xticks(step_sizes)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "6113751690c166257cd1ace47ef977b1", "grade": false, "grade_id": 
"cell-e2c9c37b494e40f1", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "## Wrapping up" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "893577356341c384f4e2457631037f81", "grade": false, "grade_id": "cell-10150ffd5c7c91f8", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "Expected Sarsa shows an advantage over Q-learning in this problem across a wide range of step-sizes.\n", "\n", "Congratulations! Now you have:\n", "\n", "- implemented Q-Learning with $\\epsilon$-greedy action selection\n", "- implemented Expected Sarsa with $\\epsilon$-greedy action selection\n", "- investigated the behavior of these two algorithms on Cliff World\n", "\n", "To submit your solution, you will need to submit the `results.zip` file generated by the experiments. Here are the steps:\n", "\n", "- Go to the `file` menu at the top of the screen\n", "- Select `open`\n", "- Click the selection square next to `results.zip`\n", "- Select `Download` from the top menu\n", "- Upload that file to the grader in the next part of this module\n" ] } ], "metadata": { "coursera": { "course_slug": "sample-based-learning-methods", "launcher_item_id": "biN1L" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }