{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "705c9778a2b942ee040d62a35e17f305", "grade": false, "grade_id": "cell-2adc36b256efc420", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "# Assignment 1 - TD with State Aggregation\n", "\n", "Welcome to your Course 3 Programming Assignment 1. In this assignment, you will implement **semi-gradient TD(0) with State Aggregation** in an environment with a large state space. This assignment will focus on the **policy evaluation task** (prediction problem) where the goal is to accurately estimate state values under a given (fixed) policy.\n", "\n", "\n", "**In this assignment, you will:**\n", "1. Implement semi-gradient TD(0) with function approximation (state aggregation).\n", "2. Understand how to use supervised learning approaches to approximate value functions.\n", "3. Compare the impact of different resolutions of state aggregation, and see first hand how function approximation can speed up learning through generalization.\n", "\n", "**Note: You can create new cells for debugging purposes but please do not duplicate any Read-only cells. This may break the grader.**" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "2fdda174e78661ef365a8959fdef9ddf", "grade": false, "grade_id": "cell-99df6e3a990f9278", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## 500-State RandomWalk Environment\n", "\n", "In this assignment, we will implement and use a smaller 500 state version of the problem we covered in lecture (see \"State Aggregation with Monte Carlo”, and Example 9.1 in the [textbook](http://www.incompleteideas.net/book/RLbook2018.pdf)). 
The diagram below illustrates the problem.\n", "\n", "![](data/randomwalk_diagram.png)\n", "\n", "There are 500 states numbered from 1 to 500, left to right, and all episodes begin with the agent located at the center, in state 250. For simplicity, we will consider state 0 and state 501 as the left and right terminal states respectively. \n", "\n", "The episode terminates when the agent reaches the terminal state (state 0) on the left, or the terminal state (state 501) on the right. Termination on the left (state 0) gives the agent a reward of -1, and termination on the right (state 501) gives the agent a reward of +1.\n", "\n", "The agent can take one of two actions: go left or go right. If the agent chooses the left action, then it transitions uniformly at random into one of the 100 neighboring states to its left. If the agent chooses the right action, then it transitions uniformly at random into one of the 100 neighboring states to its right. \n", "\n", "States near the edge may have fewer than 100 neighboring states on that side. In this case, all transitions that would have taken the agent past the edge result in termination. If the agent takes the left action from state 50, then it has a 0.51 chance of terminating on the left, since 51 of the 100 possible moves reach state 0 or go past it. If it takes the right action from state 499, then it has a 0.99 chance of terminating on the right.\n", "\n", "\n", "### Your Goal\n", "\n", "For this assignment, we will consider the problem of **policy evaluation**: estimating the state-value function for a fixed policy. You will evaluate a uniform random policy in the 500-State Random Walk environment. This policy takes the right action with 0.5 probability and the left with 0.5 probability, regardless of which state it is in. \n", "\n", "This environment has a relatively large number of states. Generalization can significantly speed up learning, as we will show in this assignment. Often in realistic environments, states are high-dimensional and continuous. 
For these problems, function approximation is not just useful, it is also necessary." ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "e986d8adb2b9aeb776c0c5c5e32e3511", "grade": false, "grade_id": "cell-72dc8196386b12dd", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## Packages\n", "\n", "You will use the following packages in this assignment.\n", "\n", "- [numpy](www.numpy.org) : Fundamental package for scientific computing with Python.\n", "- [matplotlib](http://matplotlib.org) : Library for plotting graphs in Python.\n", "- [RL-Glue](http://www.jmlr.org/papers/v10/tanner09a.html) : Library for reinforcement learning experiments.\n", "- [jdc](https://alexhagen.github.io/jdc/) : Jupyter magic that allows defining classes over multiple jupyter notebook cells.\n", "- [tqdm](https://tqdm.github.io/) : A package to display progress bar when running experiments\n", "- plot_script : custom script to plot results\n", "\n", "**Please do not import other libraries** — this will break the autograder.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "d8c0405bac84c96f532b9865ea4906de", "grade": false, "grade_id": "cell-df277e2f962adb8c", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "# Do not modify this cell!\n", "\n", "# Import necessary libraries\n", "# DO NOT IMPORT OTHER LIBRARIES - This will break the autograder.\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "import os\n", "import jdc\n", "from tqdm import tqdm\n", "\n", "from rl_glue import RLGlue\n", "from environment import BaseEnvironment\n", "from agent import BaseAgent\n", "import plot_script" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "31c30526f523218ed622d6c3ec68fff9", "grade": false, 
"grade_id": "cell-ab47eee3b7f7d678", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## Section 1: Create the 500-State RandomWalk Environment\n", "\n", "In this section we have provided you with the implementation of the 500-State RandomWalk Environment. It is useful to know how the environment is implemented. We will also use this environment in the next programming assignment. \n", "\n", "Once the agent chooses which direction to move, the environment determines how far the agent is moved in that direction. Assume the agent passes either 0 (indicating left) or 1 (indicating right) to the environment.\n", "\n", "Methods needed to implement the environment are: `env_init`, `env_start`, and `env_step`.\n", "\n", "- `env_init`: This method sets up the environment at the very beginning of the experiment. Relevant parameters are passed through `env_info` dictionary.\n", "- `env_start`: This is the first method called when the experiment starts, returning the start state.\n", "- `env_step`: This method takes in action and returns reward, next_state, and is_terminal." 
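, "\n", "\n", "As a quick sanity check on the edge behaviour, worked out from the provided `env_step` code in the next cell (the symbol $d$ for the sampled step size is introduced here only for illustration): the left action moves the agent to $\\max(0, s - d)$ and the right action to $\\min(501, s + d)$, with $d$ drawn uniformly from $\\{1, \\dots, 100\\}$. Under that reading, the termination probabilities from a state $s \\in \\{1, \\dots, 500\\}$ are\n", "\n", "$$P(\\text{terminate} \\mid s, \\text{left}) = \\frac{\\max(0,\\, 101 - s)}{100}, \\qquad P(\\text{terminate} \\mid s, \\text{right}) = \\frac{\\max(0,\\, s - 400)}{100}.$$\n", "\n", "For example, $s = 50$ gives $51/100$ for the left action, and $s = 499$ gives $99/100$ for the right action."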
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "22821f10a71299ed2296dbaf70a1da70", "grade": false, "grade_id": "cell-cbf892eccaaeae92", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "# Do not modify this cell!\n", "\n", "class RandomWalkEnvironment(BaseEnvironment):\n", "    def env_init(self, env_info={}):\n", "        \"\"\"\n", "        Setup for the environment called when the experiment first starts.\n", "\n", "        Set parameters needed to setup the 500-state random walk environment.\n", "\n", "        Assume env_info dict contains:\n", "        {\n", "            num_states: 500 [int],\n", "            start_state: 250 [int],\n", "            left_terminal_state: 0 [int],\n", "            right_terminal_state: 501 [int],\n", "            seed: int\n", "        }\n", "        \"\"\"\n", "\n", "        # set random seed for each run\n", "        self.rand_generator = np.random.RandomState(env_info.get(\"seed\"))\n", "\n", "        # set each class attribute\n", "        self.num_states = env_info[\"num_states\"]\n", "        self.start_state = env_info[\"start_state\"]\n", "        self.left_terminal_state = env_info[\"left_terminal_state\"]\n", "        self.right_terminal_state = env_info[\"right_terminal_state\"]\n", "\n", "    def env_start(self):\n", "        \"\"\"\n", "        The first method called when the experiment starts, called before the\n", "        agent starts.\n", "\n", "        Returns:\n", "            The first state from the environment.\n", "        \"\"\"\n", "\n", "        # set self.reward_state_term tuple\n", "        reward = 0.0\n", "        state = self.start_state\n", "        is_terminal = False\n", "\n", "        self.reward_state_term = (reward, state, is_terminal)\n", "\n", "        # return first state from the environment\n", "        return self.reward_state_term[1]\n", "\n", "    def env_step(self, action):\n", "        \"\"\"A step taken by the environment.\n", "\n", "        Args:\n", "            action: The action taken by the agent\n", "\n", "        Returns:\n", "            (float, state, Boolean): a tuple of the reward, state,\n", "                and boolean indicating if it's terminal.\n", "        \"\"\"\n", "\n", "        last_state = self.reward_state_term[1]\n", "\n", "        # set reward, current_state, and is_terminal\n", "        #\n", "        # action: specifies direction of movement - 0 (indicating left) or 1 (indicating right) [int]\n", "        # current state: next state after taking action from the last state [int]\n", "        # reward: -1 if terminated left, 1 if terminated right, 0 otherwise [float]\n", "        # is_terminal: indicates whether the episode terminated [boolean]\n", "        #\n", "        # Given action (direction of movement), determine how much to move in that direction from last_state\n", "        # All transitions beyond the terminal state are absorbed into the terminal state.\n", "\n", "        if action == 0: # left\n", "            current_state = max(self.left_terminal_state, last_state + self.rand_generator.choice(range(-100,0)))\n", "        elif action == 1: # right\n", "            current_state = min(self.right_terminal_state, last_state + self.rand_generator.choice(range(1,101)))\n", "        else:\n", "            raise ValueError(\"Wrong action value\")\n", "\n", "        # terminate left\n", "        if current_state == self.left_terminal_state:\n", "            reward = -1.0\n", "            is_terminal = True\n", "\n", "        # terminate right\n", "        elif current_state == self.right_terminal_state:\n", "            reward = 1.0\n", "            is_terminal = True\n", "\n", "        else:\n", "            reward = 0.0\n", "            is_terminal = False\n", "\n", "        self.reward_state_term = (reward, current_state, is_terminal)\n", "\n", "        return self.reward_state_term\n", " " ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "ffd4081f20b77a7e7650dbb4c099aa2e", "grade": false, "grade_id": "cell-78613720dae0e08a", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## Section 2: Create Semi-gradient TD(0) Agent with State Aggregation\n", "\n", "Now let's create the Agent that interacts with the Environment.\n", "\n", "You will create an Agent that learns with semi-gradient TD(0) with state aggregation.\n", "For state 
aggregation, if the resolution (`num_groups`) is 10, then the 500 states are partitioned into 10 groups of 50 states each (i.e., states 1-50 are one group, states 51-100 are another, and so on).\n", "\n", "Hence, 50 states would share the same feature and value estimate, and there would be 10 distinct features. The feature vector for each state is a one-hot vector of length 10, with a single one indicating the group for that state." ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "28c483b8d39c1d96546e28bfacc064cb", "grade": false, "grade_id": "cell-3676d253ce82f3e3", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## Section 2-1: Implement Useful Functions\n", "\n", "Before we implement the agent, we need to define a couple of useful helper functions.\n", "\n", "**Please note that all random method calls must go through the random number generator. Also, do not use random method calls unless specified. In the agent, only `agent_policy` requires random method calls.**" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "0892ebbf5ce230de86cdba18f79c3e3b", "grade": false, "grade_id": "cell-fd6ef7407bc3283d", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## Section 2-1a: Selecting actions\n", "\n", "In this part we have implemented `agent_policy()` for you.\n", "\n", "This method is used in `agent_start()` and `agent_step()` to select an action.\n", "In general, an agent's policy depends on the state, but in this environment the agent chooses randomly to move either left or right with equal probability.\n", "\n", "The agent returns 0 for left, and 1 for right." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "0529387589e86ec010d047c15e97ed91", "grade": false, "grade_id": "cell-9daa349ce740c93d", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "# Do not modify this cell!\n", "\n", "def agent_policy(rand_generator, state):\n", " \"\"\"\n", " Given random number generator and state, returns an action according to the agent's policy.\n", " \n", " Args:\n", " rand_generator: Random number generator\n", "\n", " Returns:\n", " chosen action [int]\n", " \"\"\"\n", " \n", " # set chosen_action as 0 or 1 with equal probability\n", " # state is unnecessary for this agent policy\n", " chosen_action = rand_generator.choice([0,1])\n", " \n", " return chosen_action" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "db39efbba49a89a8d148b5db81be22b5", "grade": false, "grade_id": "cell-d3817bfa37301c97", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "Run the following code to verify `agent_policy`. Expected output should match since we are controlling the random seed. Verify that actions 0 and 1 are chosen equally randomly." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "bc7511639f9b00a54ed38babbff85a47", "grade": false, "grade_id": "graded_agent_policy", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "action_array: [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]\n" ] } ], "source": [ "# Do not modify this cell!\n", "\n", "## Test Code for agent_policy() ##\n", "\n", "test_rand_generator = np.random.RandomState(99) \n", "state = 250\n", "\n", "action_array = []\n", "for i in range(10):\n", " action_array.append(agent_policy(test_rand_generator, 250))\n", " \n", "print('action_array: {}'.format(action_array))" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "e7b7eaca654565f2085b16db33f0715b", "grade": false, "grade_id": "cell-488d18208bcb70d1", "locked": true, "schema_version": 1, "solution": false } }, "source": [ " **Expected output**:\n", " \n", "action_array: [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "7ad3a617b7596178c46a14c3beb1a7a0", "grade": false, "grade_id": "cell-4c0ff691fe474743", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## Section 2-1b: Processing State Features with State Aggregation\n", "\n", "In this part you will implement `get_state_feature()`\n", "\n", "This method takes in a state and returns the aggregated feature (one-hot-vector) of that state.\n", "The feature vector size is determined by `num_groups`. Use `state` and `num_states_in_group` to determine which element in the feature vector is active.\n", "\n", "`get_state_feature()` is necessary whenever the agent receives a state and needs to convert it to a feature for learning. 
The features will thus be used in `agent_step()` and `agent_end()` when the agent updates its state values." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "deletable": false, "nbgrader": { "checksum": "0b7852b7ce9605d27552783172041141", "grade": false, "grade_id": "cell-53bb0b4c53614fed", "locked": false, "schema_version": 1, "solution": true } }, "outputs": [], "source": [ "# [Graded]\n", "\n", "def get_state_feature(num_states_in_group, num_groups, state):\n", " \"\"\"\n", " Given state, return the feature of that state\n", " \n", " Args:\n", " num_states_in_group [int]\n", " num_groups [int] \n", " state [int] : 1~500\n", "\n", " Returns:\n", " one_hot_vector [numpy array]\n", " \"\"\"\n", " \n", " ### Generate state feature (2~4 lines)\n", " # Create one_hot_vector with size of the num_groups, according to state\n", " # For simplicity, assume num_states is always perfectly divisible by num_groups\n", " # Note that states start from index 1, not 0!\n", " \n", " # Example:\n", " # If num_states = 100, num_states_in_group = 20, num_groups = 5,\n", " # one_hot_vector would be of size 5.\n", " # For states 1~20, one_hot_vector would be: [1, 0, 0, 0, 0]\n", " # \n", " # one_hot_vector = ?\n", " \n", " ### START CODE HERE ###\n", " one_hot_vector = np.zeros(num_groups)\n", " one_hot_vector[(state-1) // num_states_in_group] = 1\n", " ### END CODE HERE ###\n", " \n", " return one_hot_vector\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "d6cc269247cf5a0e13a4e7f054b1665e", "grade": false, "grade_id": "cell-e4f70e5687bffe68", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "Run the following code to verify your `get_state_feature()` function." 
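, "\n", "\n", "As a worked check, assuming the groups are contiguous and equally sized as described in the function's comments (the symbols $g$ and $n$ are introduced here only for illustration): a state $s$, numbered from 1, belongs to the group with zero-based index\n", "\n", "$$g(s) = \\left\\lfloor \\frac{s - 1}{n} \\right\\rfloor,$$\n", "\n", "where $n$ is `num_states_in_group`. With the test values below ($n = 2$), states 1, 3, 6, 7 and 10 map to indices 0, 1, 2, 3 and 4."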
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "961f1e5704613318508bfb978adba9ff", "grade": true, "grade_id": "graded_get_state_feature", "locked": true, "points": 15, "schema_version": 1, "solution": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1st group: [1. 0. 0. 0. 0.]\n", "2nd group: [0. 1. 0. 0. 0.]\n", "3rd group: [0. 0. 1. 0. 0.]\n", "4th group: [0. 0. 0. 1. 0.]\n", "5th group: [0. 0. 0. 0. 1.]\n" ] } ], "source": [ "# Do not modify this cell!\n", "\n", "## Test Code for get_state_feature() ##\n", "\n", "# Given that num_states = 10 and num_groups = 5, test get_state_feature()\n", "# There are states 1~10, and the state feature vector would be of size 5.\n", "# Only one element would be active for any state feature vector.\n", "\n", "# get_state_feature() should support various values of num_states, num_groups, not just this example\n", "# For simplicity, assume num_states will always be perfectly divisible by num_groups\n", "num_states = 10\n", "num_groups = 5\n", "num_states_in_group = int(num_states / num_groups)\n", "\n", "# Test 1st group, state = 1\n", "state = 1\n", "print(\"1st group: {}\".format(get_state_feature(num_states_in_group, num_groups, state)))\n", "\n", "# Test 2nd group, state = 3\n", "state = 3\n", "print(\"2nd group: {}\".format(get_state_feature(num_states_in_group, num_groups, state)))\n", "\n", "# Test 3rd group, state = 6\n", "state = 6\n", "print(\"3rd group: {}\".format(get_state_feature(num_states_in_group, num_groups, state)))\n", "\n", "# Test 4th group, state = 7\n", "state = 7\n", "print(\"4th group: {}\".format(get_state_feature(num_states_in_group, num_groups, state)))\n", "\n", "# Test 5th group, state = 10\n", "state = 10\n", "print(\"5th group: {}\".format(get_state_feature(num_states_in_group, num_groups, state)))\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, 
"nbgrader": { "checksum": "c3639dc51824bb421855833469675fc5", "grade": false, "grade_id": "cell-a55d91373189d887", "locked": true, "schema_version": 1, "solution": false } }, "source": [ " **Expected output**:\n", " \n", " 1st group: [1. 0. 0. 0. 0.]\n", " 2nd group: [0. 1. 0. 0. 0.]\n", " 3rd group: [0. 0. 1. 0. 0.]\n", " 4th group: [0. 0. 0. 1. 0.]\n", " 5th group: [0. 0. 0. 0. 1.]" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "bd0606054e551739f18f601b6d4840be", "grade": false, "grade_id": "cell-eed6babe9b563391", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## Section 2-2: Implement Agent Methods\n", "\n", "Now that we have implemented all the helper functions, let's create an agent. In this part, you will implement `agent_init()`, `agent_start()`, `agent_step()` and `agent_end()`. You will have to use the `agent_policy()` function that we implemented above. We will implement `agent_message()` later, when returning the learned state-values.\n", "\n", "To save computation time, we precompute features for all states beforehand in `agent_init()`. The pre-computed features are saved in the `self.all_state_features` numpy array. Hence, you do not need to call `get_state_feature()` every time in `agent_step()` and `agent_end()`.\n", "\n", "The shape of the `self.all_state_features` numpy array is `(num_states, feature_size)`, holding the features of States 1 through 500. Note that index 0 stores the features for State 1 (there are no features for State 0). 
Use `self.all_state_features` to access each feature vector for a state.\n", "\n", "When saving state values in the agent, recall how the state values are represented with linear function approximation.\n", "\n", "**State Value Representation**: $\\hat{v}(s,\\mathbf{w}) = \\mathbf{w}^\\top \\mathbf{x}(s)$ where $\\mathbf{w}$ is the weight vector and $\\mathbf{x}(s)$ is the feature vector of state $s$.\n", "\n", "\n", "When performing TD(0) updates with Linear Function Approximation, recall how we perform semi-gradient TD(0) updates using supervised learning.\n", "\n", "**semi-gradient TD(0) Weight Update Rule**: $\\mathbf{w}_{t+1} = \\mathbf{w}_{t} + \\alpha [R_{t+1} + \\gamma \\hat{v}(S_{t+1},\\mathbf{w}_{t}) - \\hat{v}(S_t,\\mathbf{w}_{t})] \\nabla \\hat{v}(S_t,\\mathbf{w}_{t})$" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "deletable": false, "nbgrader": { "checksum": "639e32d76f77e0d254d8b731ada078ad", "grade": false, "grade_id": "cell-cfe76b4f7bad2b67", "locked": false, "schema_version": 1, "solution": true } }, "outputs": [], "source": [ "# [Graded]\n", "\n", "# Create TDAgent\n", "class TDAgent(BaseAgent):\n", "    def __init__(self):\n", "        self.num_states = None\n", "        self.num_groups = None\n", "        self.step_size = None\n", "        self.discount_factor = None\n", "\n", "    def agent_init(self, agent_info={}):\n", "        \"\"\"Setup for the agent called when the experiment first starts.\n", "\n", "        Set parameters needed to setup the semi-gradient TD(0) state aggregation agent.\n", "\n", "        Assume agent_info dict contains:\n", "        {\n", "            num_states: 500 [int],\n", "            num_groups: int,\n", "            step_size: float,\n", "            discount_factor: float,\n", "            seed: int\n", "        }\n", "        \"\"\"\n", "\n", "        # set random seed for each run\n", "        self.rand_generator = np.random.RandomState(agent_info.get(\"seed\"))\n", "\n", "        # set class attributes\n", "        self.num_states = agent_info.get(\"num_states\")\n", "        self.num_groups = agent_info.get(\"num_groups\")\n", "        self.step_size = agent_info.get(\"step_size\")\n", "        self.discount_factor = agent_info.get(\"discount_factor\")\n", "\n", "        # pre-compute all observable features\n", "        num_states_in_group = int(self.num_states / self.num_groups)\n", "        self.all_state_features = np.array([get_state_feature(num_states_in_group, self.num_groups, state) for state in range(1, self.num_states + 1)])\n", "\n", "        ### initialize weights correctly (1 line)\n", "        # initialize all weights to zero using numpy array with correct size\n", "        # self.weights = ?\n", "\n", "        ### START CODE HERE ###\n", "        self.weights = np.zeros(self.num_groups)\n", "        ### END CODE HERE ###\n", "\n", "        self.last_state = None\n", "        self.last_action = None\n", "\n", "    def agent_start(self, state):\n", "        \"\"\"The first method called when the experiment starts, called after\n", "        the environment starts.\n", "        Args:\n", "            state [int]: the state from the\n", "                environment's env_start function.\n", "        Returns:\n", "            self.last_action [int] : The first action the agent takes.\n", "        \"\"\"\n", "\n", "        ### select action given state (using agent_policy), and save current state and action (2~3 lines)\n", "        # Use self.rand_generator for agent_policy\n", "        #\n", "        # self.last_state = ?\n", "        # self.last_action = ?\n", "\n", "        ### START CODE HERE ###\n", "        action = agent_policy(self.rand_generator, state)\n", "        self.last_state = state\n", "        self.last_action = action\n", "        ### END CODE HERE ###\n", "\n", "        return self.last_action\n", "\n", "    def agent_step(self, reward, state):\n", "        \"\"\"A step taken by the agent.\n", "        Args:\n", "            reward [float]: the reward received for taking the last action taken\n", "            state [int]: the state from the environment's step, where the agent ended up after the last step\n", "        Returns:\n", "            self.last_action [int] : The action the agent is taking.\n", "        \"\"\"\n", "\n", "        # get relevant feature\n", "        current_state_feature = self.all_state_features[state-1]\n", "        last_state_feature = self.all_state_features[self.last_state-1]\n", "\n", "        ### update weights and select action (3~5 lines)\n", "        # (Hint: np.dot method is useful!)\n", "        #\n", "        # Update weights:\n", "        # use self.weights, current_state_feature, and last_state_feature\n", "        #\n", "        # Select action:\n", "        # use self.rand_generator for agent_policy\n", "        #\n", "        # Current state and selected action should be saved to self.last_state and self.last_action at the end\n", "        #\n", "        # self.weights = ?\n", "        # self.last_state = ?\n", "        # self.last_action = ?\n", "\n", "        ### START CODE HERE ###\n", "        self.weights += self.step_size * (reward + self.discount_factor * np.dot(self.weights, current_state_feature) - np.dot(self.weights, last_state_feature)) * last_state_feature\n", "        action = agent_policy(self.rand_generator, state)\n", "        self.last_state = state\n", "        self.last_action = action\n", "        ### END CODE HERE ###\n", "        return self.last_action\n", "\n", "    def agent_end(self, reward):\n", "        \"\"\"Run when the agent terminates.\n", "        Args:\n", "            reward (float): the reward the agent received for entering the\n", "                terminal state.\n", "        \"\"\"\n", "\n", "        # get relevant feature\n", "        last_state_feature = self.all_state_features[self.last_state-1]\n", "\n", "        ### update weights (1~2 lines)\n", "        # Update weights using self.weights and last_state_feature\n", "        # (Hint: np.dot method is useful!)\n", "        #\n", "        # Note that here you don't need to choose action since the agent has reached a terminal state\n", "        # Therefore you should not update self.last_state and self.last_action\n", "        #\n", "        # self.weights = ?\n", "\n", "        ### START CODE HERE ###\n", "        self.weights += self.step_size * (reward - np.dot(self.weights, last_state_feature)) * last_state_feature\n", "        ### END CODE HERE ###\n", "        return\n", "\n", "    def agent_message(self, message):\n", "        # We will implement this method later\n", "        raise NotImplementedError\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": 
false, "nbgrader": { "checksum": "8b53f563e46fb2e3949010f3c16f5f9a", "grade": false, "grade_id": "cell-a92a727706966b2b", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "\n", "Run the following code to verify `agent_init()`" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "79882d4de94a6db403f376d274e6cb34", "grade": true, "grade_id": "graded_agent_init", "locked": true, "points": 5, "schema_version": 1, "solution": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "num_states: 500\n", "num_groups: 10\n", "step_size: 0.1\n", "discount_factor: 1.0\n", "weights shape: (10,)\n", "weights init. value: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n" ] } ], "source": [ "# Do not modify this cell!\n", "\n", "## Test Code for agent_init() ## \n", "\n", "agent_info = {\"num_states\": 500,\n", " \"num_groups\": 10,\n", " \"step_size\": 0.1,\n", " \"discount_factor\": 1.0,\n", " \"seed\": 1}\n", "\n", "test_agent = TDAgent()\n", "test_agent.agent_init(agent_info)\n", "\n", "# check attributes\n", "print(\"num_states: {}\".format(test_agent.num_states))\n", "print(\"num_groups: {}\".format(test_agent.num_groups))\n", "print(\"step_size: {}\".format(test_agent.step_size))\n", "print(\"discount_factor: {}\".format(test_agent.discount_factor))\n", "\n", "print(\"weights shape: {}\".format(test_agent.weights.shape))\n", "print(\"weights init. value: {}\".format(test_agent.weights))\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "eaa694784b499c6edf93d7c845805928", "grade": false, "grade_id": "cell-09d1b865ec761116", "locked": true, "schema_version": 1, "solution": false } }, "source": [ " **Expected output**:\n", " \n", " num_states: 500\n", " num_groups: 10\n", " step_size: 0.1\n", " discount_factor: 1.0\n", " weights shape: (10,)\n", " weights init. value: [0. 0. 0. 0. 0. 0. 0. 0. 0. 
0.]\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "7bf26e7339307df3557652e06cfbbd54", "grade": false, "grade_id": "cell-c47a537224d052ad", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "Run the following code to verify `agent_start()`.\n", "Although there is randomness due to `rand_generator.choice()` in `agent_policy()`, we control the seed so your output should match the expected output. \n", "\n", "Make sure `rand_generator.choice()` is called only once per `agent_policy()` call." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "fbf48a7b1eeb420f1650730d050f3c8a", "grade": true, "grade_id": "graded_agent_start", "locked": true, "points": 10, "schema_version": 1, "solution": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Agent state: 250\n", "Agent selected action: 1\n" ] } ], "source": [ "# Do not modify this cell!\n", "## Test Code for agent_start() and agent_policy() ## \n", "\n", "agent_info = {\"num_states\": 500,\n", " \"num_groups\": 10,\n", " \"step_size\": 0.1,\n", " \"discount_factor\": 1.0,\n", " \"seed\": 1\n", " }\n", "\n", "# Suppose state = 250\n", "state = 250\n", "\n", "test_agent = TDAgent()\n", "test_agent.agent_init(agent_info)\n", "test_agent.agent_start(state)\n", "\n", "print(\"Agent state: {}\".format(test_agent.last_state))\n", "print(\"Agent selected action: {}\".format(test_agent.last_action))\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "b8da6624acf4294ac9b0e9844a976551", "grade": false, "grade_id": "cell-4bb285c764d8ad67", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "**Expected output**:\n", "\n", " Agent state: 250\n", " Agent selected action: 1" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, 
"nbgrader": { "checksum": "a78f84de7558de794442da4c496e581e", "grade": false, "grade_id": "cell-a3d392998465216c", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "Run the following code to verify `agent_step()`\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "187c92e0f22eb19c833f17276623dfca", "grade": true, "grade_id": "graded_agent_step", "locked": true, "points": 15, "schema_version": 1, "solution": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Initial weights: [-1.5 0.5 1. -0.5 1.5 -0.5 1.5 0. -0.5 -1. ]\n", "Updated weights: [-0.26 0.5 1. -0.5 1.5 -0.5 1.5 0. -0.5 -1. ]\n", "weight update is correct!\n", "\n", "last state: 120\n", "last action: 1\n" ] } ], "source": [ "# Do not modify this cell!\n", "\n", "## Test Code for agent_step() ## \n", "# Make sure agent_init() and agent_start() are working correctly first.\n", "# agent_step() should work correctly for other arbitrary state transitions in addition to this test case.\n", "agent_info = {\"num_states\": 500,\n", " \"num_groups\": 10,\n", " \"step_size\": 0.1,\n", " \"discount_factor\": 0.9,\n", " \"seed\": 1}\n", "\n", "test_agent = TDAgent()\n", "test_agent.agent_init(agent_info)\n", "\n", "# Initializing the weights to arbitrary values to verify the correctness of weight update\n", "test_agent.weights = np.array([-1.5, 0.5, 1., -0.5, 1.5, -0.5, 1.5, 0.0, -0.5, -1.0])\n", "print(\"Initial weights: {}\".format(test_agent.weights))\n", "\n", "# Assume the agent started at State 50\n", "start_state = 50\n", "action = test_agent.agent_start(start_state)\n", "\n", "# Assume the reward was 10.0 and the next state observed was State 120\n", "reward = 10.0\n", "next_state = 120\n", "test_agent.agent_step(reward, next_state)\n", "print(\"Updated weights: {}\".format(test_agent.weights))\n", "\n", "if np.allclose(test_agent.weights, np.array([-0.26, 0.5, 1., -0.5, 1.5, 
-0.5, 1.5, 0., -0.5, -1.])):\n", " print(\"weight update is correct!\\n\")\n", "else:\n", " print(\"weight update is incorrect.\\n\")\n", "\n", "print(\"last state: {}\".format(test_agent.last_state))\n", "print(\"last action: {}\".format(test_agent.last_action))\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "28de0ce9c2293d85cfa167c3e376aca6", "grade": false, "grade_id": "cell-feab2079de2e1fc0", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "**Expected output**: (Note only the 1st element was changed)\n", " \n", " Initial weights: [-1.5 0.5 1. -0.5 1.5 -0.5 1.5 0. -0.5 -1. ]\n", " Updated weights: [-0.26 0.5 1. -0.5 1.5 -0.5 1.5 0. -0.5 -1. ]\n", " last state: 120\n", " last action: 1" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "74f0d3cc7f10e820a03933ff8a9c8f57", "grade": false, "grade_id": "cell-b1a7b031081d1821", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "Run the following code to verify `agent_end()`" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "e932f659435c5fb1885fd60dc183f614", "grade": true, "grade_id": "graded_agent_end", "locked": true, "points": 10, "schema_version": 1, "solution": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Initial weights: [-1.5 0.5 1. -0.5 1.5 -0.5 1.5 0. -0.5 -1. ]\n", "Updated weights: [-0.35 0.5 1. -0.5 1.5 -0.5 1.5 0. -0.5 -1. 
]\n", "weight update is correct!\n", "\n" ] } ], "source": [ "# Do not modify this cell!\n", "\n", "## Test Code for agent_end() ## \n", "# Make sure agent_init() and agent_start() are working correctly first.\n", "\n", "agent_info = {\"num_states\": 500,\n", " \"num_groups\": 10,\n", " \"step_size\": 0.1,\n", " \"discount_factor\": 0.9,\n", " \"seed\": 1}\n", "\n", "test_agent = TDAgent()\n", "test_agent.agent_init(agent_info)\n", "\n", "# Initializing the weights to arbitrary values to verify the correctness of weight update\n", "test_agent.weights = np.array([-1.5, 0.5, 1., -0.5, 1.5, -0.5, 1.5, 0.0, -0.5, -1.0])\n", "print(\"Initial weights: {}\".format(test_agent.weights))\n", "\n", "# Assume the agent started at State 50\n", "start_state = 50\n", "test_agent.agent_start(start_state)\n", "\n", "# Assume the reward was 10.0 and reached the terminal state\n", "test_agent.agent_end(10.0)\n", "print(\"Updated weights: {}\".format(test_agent.weights))\n", "\n", "if np.allclose(test_agent.weights, np.array([-0.35, 0.5, 1., -0.5, 1.5, -0.5, 1.5, 0., -0.5, -1.])):\n", " print(\"weight update is correct!\\n\")\n", "else:\n", " print(\"weight update is incorrect.\\n\")\n", " " ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "833af2e84df2cdad6d43bacb144e7a81", "grade": false, "grade_id": "cell-f8457a84eed9709d", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "**Expected output**: (Note only the 1st element was changed, and the result is different from `agent_step()` )\n", " \n", " Initial weights: [-1.5 0.5 1. -0.5 1.5 -0.5 1.5 0. -0.5 -1. ]\n", " Updated weights: [-0.35 0.5 1. -0.5 1.5 -0.5 1.5 0. -0.5 -1. 
]" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "7bf058b9bfcc1f79a75944e0aaf333c6", "grade": false, "grade_id": "cell-cd580cba3ee6c3a1", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## Section 2-3: Returning Learned State Values\n", "\n", "You are almost done! Now let's implement a code block in `agent_message()` that returns the learned state values.\n", "\n", "The method `agent_message()` will return the learned `state_value` array when `message == 'get state value'`.\n", "\n", "**Hint**: Think about how state values are represented with linear function approximation. The `state_value` array will be a 1D array with length equal to the number of states." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "deletable": false, "nbgrader": { "checksum": "3919378741a1bd275799f88b40744fc1", "grade": false, "grade_id": "cell-b469919b09cd9284", "locked": false, "schema_version": 1, "solution": true } }, "outputs": [], "source": [ "%%add_to TDAgent\n", "# [Graded]\n", "\n", "def agent_message(self, message):\n", "    if message == 'get state value':\n", "\n", "        ### return state_value (1~2 lines)\n", "        # Use self.all_state_features and self.weights to return the vector of all state values\n", "        # Hint: Use np.dot()\n", "        #\n", "        # state_value = ?\n", "\n", "        ### START CODE HERE ###\n", "        state_value = np.dot(self.weights, self.all_state_features.T)\n", "        ### END CODE HERE ###\n", "\n", "        return state_value" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "54f25dd3167af63a5c815dda36482c22", "grade": false, "grade_id": "cell-33209f575321ccb5", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "Run the following code to verify `agent_message()`" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": 
"6df9d79197c4fcc7699195570cb9c9cb", "grade": true, "grade_id": "graded_get_state_val", "locked": true, "points": 10, "schema_version": 1, "solution": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "State value shape: (20,)\n", "Initial State value for all states: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n" ] } ], "source": [ "# Do not modify this cell!\n", "\n", "## Test Code for agent_get_state_val() ##\n", "\n", "agent_info = {\"num_states\": 20,\n", " \"num_groups\": 5,\n", " \"step_size\": 0.1,\n", " \"discount_factor\": 1.0}\n", "\n", "test_agent = TDAgent()\n", "test_agent.agent_init(agent_info)\n", "test_state_val = test_agent.agent_message('get state value')\n", "\n", "print(\"State value shape: {}\".format(test_state_val.shape))\n", "print(\"Initial State value for all states: {}\".format(test_state_val))\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "0e09f2c295f33bf6daa1e13bfbb5f4b5", "grade": false, "grade_id": "cell-8b87229733a8fd76", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "**Expected Output**:\n", "\n", " State value shape: (20,)\n", " Initial State value for all states: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "fd97d78c67f1398fe62084fc178a968c", "grade": false, "grade_id": "cell-4a2937aee7e48fe0", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## Section 3: Run Experiment\n", "\n", "Now that we've implemented all the components of environment and agent, let's run an experiment! We will plot two things: (1) the learned state value function and compare it against the true state values, and (2) a learning curve depicting the error in the learned value estimates over episodes. 
For the learning curve, what should we plot to see if the agent is learning well?" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "dc87729f4043121c9a25e501f1929e70", "grade": false, "grade_id": "cell-9081e37ad214f0b6", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## Section 3-1: Prediction Objective (Root Mean Squared Value Error) \n", "\n", "Recall that the Prediction Objective in function approximation is the Mean Squared Value Error $\\overline{VE}(\\mathbf{w}) \\doteq \\sum\\limits_{s \\in \\mathcal{S}}\\mu(s)[v_\\pi(s)-\\hat{v}(s,\\mathbf{w})]^2$.\n", "\n", "We will use the square root of this measure, the root $\\overline{VE}$, to give a rough measure of how much the learned values differ from the true values.\n", "\n", "`calc_RMSVE()` computes the Root Mean Squared Value Error given the learned state value $\\hat{v}(s, \\mathbf{w})$.\n", "We provide you with the true state value $v_\\pi(s)$ and the state distribution $\\mu(s)$.\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "7868b93e326ebd58cb7f365882009909", "grade": false, "grade_id": "cell-72fdaf375f3e3d99", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "# Do not modify this cell!\n", "\n", "# Here we provide you with the true state value and state distribution\n", "true_state_val = np.load('data/true_V.npy') \n", "state_distribution = np.load('data/state_distribution.npy')\n", "\n", "def calc_RMSVE(learned_state_val):\n", "    assert(len(true_state_val) == len(learned_state_val) == len(state_distribution))\n", "    MSVE = np.sum(np.multiply(state_distribution, np.square(true_state_val - learned_state_val)))\n", "    RMSVE = np.sqrt(MSVE)\n", "    return RMSVE" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "4155d4fd0c4f33610bf7d66dc5f01777", "grade": false, 
"grade_id": "cell-bea80af13342f057", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## Section 3-2a: Run Experiment with 10-State Aggregation\n", "\n", "We have provided you the experiment/plot code in the cell below." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "9ae12f60fde0b13b2f50ca1a121c1781", "grade": false, "grade_id": "cell-42b7e0b38d1ead4c", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "# Do not modify this cell!\n", "\n", "# Define function to run experiment\n", "def run_experiment(environment, agent, environment_parameters, agent_parameters, experiment_parameters):\n", "\n", "    rl_glue = RLGlue(environment, agent)\n", "\n", "    # Sweep Agent parameters\n", "    for num_agg_states in agent_parameters[\"num_groups\"]:\n", "        for step_size in agent_parameters[\"step_size\"]:\n", "\n", "            # save rmsve at the end of each evaluation episode\n", "            # size: num_episode / episode_eval_frequency + 1 (includes evaluation at the beginning of training)\n", "            agent_rmsve = np.zeros(int(experiment_parameters[\"num_episodes\"]/experiment_parameters[\"episode_eval_frequency\"]) + 1)\n", "\n", "            # save learned state value at the end of each run\n", "            agent_state_val = np.zeros(environment_parameters[\"num_states\"])\n", "\n", "            env_info = {\"num_states\": environment_parameters[\"num_states\"],\n", "                        \"start_state\": environment_parameters[\"start_state\"],\n", "                        \"left_terminal_state\": environment_parameters[\"left_terminal_state\"],\n", "                        \"right_terminal_state\": environment_parameters[\"right_terminal_state\"]}\n", "\n", "            agent_info = {\"num_states\": environment_parameters[\"num_states\"],\n", "                          \"num_groups\": num_agg_states,\n", "                          \"step_size\": step_size,\n", "                          \"discount_factor\": environment_parameters[\"discount_factor\"]}\n", "\n", "            print('Setting - num. agg.
states: {}, step_size: {}'.format(num_agg_states, step_size))\n", "            os.system('sleep 0.2')\n", "\n", "            # one agent setting\n", "            for run in tqdm(range(1, experiment_parameters[\"num_runs\"]+1)):\n", "                env_info[\"seed\"] = run\n", "                agent_info[\"seed\"] = run\n", "                rl_glue.rl_init(agent_info, env_info)\n", "\n", "                # Compute initial RMSVE before training\n", "                current_V = rl_glue.rl_agent_message(\"get state value\")\n", "                agent_rmsve[0] += calc_RMSVE(current_V)\n", "\n", "                for episode in range(1, experiment_parameters[\"num_episodes\"]+1):\n", "                    # run episode\n", "                    rl_glue.rl_episode(0) # no step limit\n", "\n", "                    if episode % experiment_parameters[\"episode_eval_frequency\"] == 0:\n", "                        current_V = rl_glue.rl_agent_message(\"get state value\")\n", "                        agent_rmsve[int(episode/experiment_parameters[\"episode_eval_frequency\"])] += calc_RMSVE(current_V)\n", "\n", "                # store only one run of state value\n", "                if run == 50:\n", "                    agent_state_val = rl_glue.rl_agent_message(\"get state value\")\n", "\n", "            # rmsve averaged over runs\n", "            agent_rmsve /= experiment_parameters[\"num_runs\"]\n", "\n", "            save_name = \"{}_agg_states_{}_step_size_{}\".format('TD_agent', num_agg_states, step_size).replace('.','')\n", "\n", "            if not os.path.exists('results'):\n", "                os.makedirs('results')\n", "\n", "            # save avg. state value\n", "            np.save(\"results/V_{}\".format(save_name), agent_state_val)\n", "\n", "            # save avg.
rmsve\n", "            np.save(\"results/RMSVE_{}\".format(save_name), agent_rmsve)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "c26b5d7551c7a1b3daa3ab0af916933f", "grade": false, "grade_id": "cell-46962f34e1051db3", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "\n", "We will first test our implementation using state aggregation with a resolution of 10, with three different step sizes: {0.01, 0.05, 0.1}.\n", "\n", "Note that running the experiment cell below will take **_approximately 5 min_**.\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "83dc1423ac345560f1868b5eae8c57f6", "grade": false, "grade_id": "cell-e9bf5a92d552cda5", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Setting - num. agg. states: 10, step_size: 0.01\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 50/50 [01:03<00:00, 1.27s/it]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Setting - num. agg. states: 10, step_size: 0.05\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 50/50 [01:03<00:00, 1.26s/it]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Setting - num. agg. states: 10, step_size: 0.1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 50/50 [01:03<00:00, 1.27s/it]\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Do not modify this cell!\n", "\n", "#### Run Experiment\n", "\n", "# Experiment parameters\n", "experiment_parameters = {\n", "    \"num_runs\" : 50,\n", "    \"num_episodes\" : 2000,\n", "    \"episode_eval_frequency\" : 10 # evaluate every 10 episodes\n", "}\n", "\n", "# Environment parameters\n", "environment_parameters = {\n", "    \"num_states\" : 500, \n", "    \"start_state\" : 250,\n", "    \"left_terminal_state\" : 0,\n", "    \"right_terminal_state\" : 501, \n", "    \"discount_factor\" : 1.0\n", "}\n", "\n", "# Agent parameters\n", "# Each element is an array because we will be later sweeping over multiple values\n", "agent_parameters = {\n", "    \"num_groups\": [10],\n", "    \"step_size\": [0.01, 0.05, 0.1]\n", "}\n", "\n", "current_env = RandomWalkEnvironment\n", "current_agent = TDAgent\n", "\n", "run_experiment(current_env, current_agent, environment_parameters, agent_parameters, experiment_parameters)\n", "plot_script.plot_result(agent_parameters, 'results')" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "278568c5227efec4e4e4c1c2a6e31918", "grade": false, "grade_id": "cell-cf9f9b84e4498115", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "Is the learned state value plot with step-size=0.01 similar to Figure 9.2 (p.208) in Sutton and Barto?\n", "\n", "(Note that our environment has fewer states, 500, and that we ran 2000 episodes and averaged the performance over 50 runs.)\n", "\n", "Look at the plot of the learning curve. Does RMSVE decrease over time?\n", "\n", "Would it be possible to reduce RMSVE to 0?\n", "\n", "You should see the RMSVE decrease over time, but the error seems to plateau. It is impossible to reduce RMSVE to 0, because of function approximation (and we do not decay the step-size parameter to zero).
With function approximation, the agent has limited resources and has to trade off the accuracy of one state against that of another." ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "2d162faa4b5808751fcb4433bbd81b7c", "grade": false, "grade_id": "cell-7cfde5a470e987d7", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "Run the following code to verify your experimental result." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "ecf0cf04d03e75d9707fc51839a05f03", "grade": true, "grade_id": "graded_exp_result", "locked": true, "points": 35, "schema_version": 1, "solution": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Your experiment results are correct!\n" ] } ], "source": [ "# Do not modify this cell!\n", "\n", "## Test Code for experimental result##\n", "\n", "agent_parameters = {\n", "    \"num_groups\": [10],\n", "    \"step_size\": [0.01, 0.05, 0.1]\n", "}\n", "\n", "all_correct = True\n", "for num_agg_states in agent_parameters[\"num_groups\"]:\n", "    for step_size in agent_parameters[\"step_size\"]:\n", "        filename = 'RMSVE_TD_agent_agg_states_{}_step_size_{}'.format(num_agg_states, step_size).replace('.','')\n", "        agent_RMSVE = np.load('results/{}.npy'.format(filename))\n", "        correct_RMSVE = np.load('correct_npy/{}.npy'.format(filename))\n", "\n", "        if not np.allclose(agent_RMSVE, correct_RMSVE):\n", "            all_correct=False\n", "\n", "if all_correct:\n", "    print(\"Your experiment results are correct!\")\n", "else:\n", "    print(\"Your experiment results do not match ours.
Please check if you have implemented all methods correctly.\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "703b4f0570c04845318e54f0e03ec360", "grade": false, "grade_id": "cell-dc298f2f5dfb981a", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## Section 3-2b: Run Experiment with Different State Aggregation Resolution and Step-Size\n", "\n", "In this section, we will run some more experiments to see how different parameter settings affect the results!\n", "\n", "In particular, we will test several values of `num_groups` and `step_size`. Parameter sweeps, although necessary, can take a lot of time. Now that you have verified your experiment result, we show you the results of the parameter sweeps that you would see if you ran the sweeps yourself.\n", "\n", "We tested several different values of `num_groups`: {10, 100, 500}, and `step_size`: {0.01, 0.05, 0.1}. As before, we performed 2000 episodes per run, and averaged the results over 50 runs for each setting.\n", "\n", "Run the cell below to display the sweep results.\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "2df7cb76187f3f0b0994a5490cdb6112", "grade": false, "grade_id": "cell-63cf84b307913593", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "#### Display result with parameter sweeps\n", "\n", "# Make sure to verify your experiment result with the test cell above.\n", "# Otherwise the sweep results will not be displayed.\n", "\n", "# Experiment parameters\n", "experiment_parameters = {\n", " \"num_runs\" : 50,\n", " \"num_episodes\" : 2000,\n", " \"episode_eval_frequency\" : 10 # evaluate every 10 episodes\n", "}\n", "\n", "# Environment parameters\n", "environment_parameters = {\n", " \"num_states\" : 500,\n", " \"start_state\" : 250,\n", " \"left_terminal_state\" : 0,\n", " \"right_terminal_state\" : 501,\n", " \"discount_factor\" : 1.0\n", "}\n", "\n", "# Agent parameters\n", "# Each element is an array because we will be sweeping over multiple values\n", "agent_parameters = {\n", " \"num_groups\": [10, 100, 500],\n", " \"step_size\": [0.01, 0.05, 0.1]\n", "}\n", "\n", "if all_correct:\n", " plot_script.plot_result(agent_parameters, 'correct_npy')\n", "else:\n", " raise ValueError(\"Make sure your experiment result is correct! Otherwise the sweep results will not be displayed.\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "49c0fb855e6595468cbc05f352e38988", "grade": false, "grade_id": "cell-e9c6a124eb3c37e6", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## Wrapping up\n", "\n", "Let’s think about the results of our parameter study.\n", "\n", "### State Aggregation\n", "\n", "- Which state aggregation resolution do you think is the best after running 2000 episodes? Which state aggregation resolution do you think would be the best if we could train for only 200 episodes? What if we could train for a million episodes?\n", "\n", "- Should we use tabular representation (state aggregation of resolution 500) whenever possible? 
Why might we want to use function approximation?\n", "\n", "\n", "From the plots, using 100 state aggregation with step-size 0.05 reaches the best performance: the lowest RMSVE after 2000 episodes. If the agent can only be trained for 200 episodes, then 10 state aggregation with step-size 0.05 reaches the lowest error. Increasing the resolution of state aggregation makes the function approximation closer to a tabular representation, which would be able to learn exactly correct state values for all states. But learning will be slower. \n", "\n", "\n", "### Step-Size\n", "\n", "- How did different step-sizes affect learning?\n", "\n", "The best step-size is different for different state aggregation resolutions. A larger step-size allows the agent to learn faster, but might not perform as well asymptotically. A smaller step-size causes it to learn more slowly, but may perform well asymptotically." ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "ef771d597be2c0842b166322410cb7fa", "grade": false, "grade_id": "cell-496cb0059a0b96d1", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "### **Congratulations!** You have successfully implemented Course 3 Programming Assignment 1.\n", "\n", "You have implemented **semi-gradient TD(0) with State Aggregation** in a 500-state Random Walk. We used an environment with a large but discrete state space, where it was possible to compute the true state values. This allowed us to compare the values learned by your agent to the true state values. The same state aggregation function approximation can also be applied to continuous state space environments, where comparison to the true values is not usually possible.\n", "\n", "\n", "You also successfully applied supervised learning approaches to approximate value functions with semi-gradient TD(0). \n", "\n", "Finally, we plotted the learned state values and compared with true state values. 
We also compared learning curves of different state aggregation resolutions and learning rates. \n", "\n", "From the results, you can see why it is often desirable to use function approximation, even when tabular learning is possible. Asymptotically, an agent with tabular representation would be able to learn the true state value function, but it would learn much more slowly compared to an agent with function approximation. On the other hand, we also want to ensure we do not reduce discrimination too far (a coarse state aggregation resolution), because it will hurt the asymptotic performance.\n" ] } ], "metadata": { "coursera": { "course_slug": "prediction-control-function-approximation", "graded_item_id": "CSdxx", "launcher_item_id": "XJyLp" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }