{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "$$\n", "\\newcommand{\\mat}[1]{\\boldsymbol {#1}}\n", "\\newcommand{\\mattr}[1]{\\boldsymbol {#1}^\\top}\n", "\\newcommand{\\matinv}[1]{\\boldsymbol {#1}^{-1}}\n", "\\newcommand{\\vec}[1]{\\boldsymbol {#1}}\n", "\\newcommand{\\vectr}[1]{\\boldsymbol {#1}^\\top}\n", "\\newcommand{\\rvar}[1]{\\mathrm {#1}}\n", "\\newcommand{\\rvec}[1]{\\boldsymbol{\\mathrm{#1}}}\n", "\\newcommand{\\diag}{\\mathop{\\mathrm {diag}}}\n", "\\newcommand{\\set}[1]{\\mathbb {#1}}\n", "\\newcommand{\\cset}[1]{\\mathcal{#1}}\n", "\\newcommand{\\norm}[1]{\\left\\lVert#1\\right\\rVert}\n", "\\newcommand{\\pderiv}[2]{\\frac{\\partial #1}{\\partial #2}}\n", "\\newcommand{\\bb}[1]{\\boldsymbol{#1}}\n", "\\newcommand{\\E}[2][]{\\mathbb{E}_{#1}\\left[#2\\right]}\n", "\\newcommand{\\given}[]{~\\middle\\vert~}\n", "$$\n", "\n", "# CS236605: Deep Learning\n", "# Tutorial 7: Deep Reinforcement Learning" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Introduction\n", "\n", "In this tutorial, we will cover:\n", "\n", "- The reinforcement learning setting\n", "- OpenAI gym\n", "- Deep $q$-learning" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "running on: cpu\n" ] } ], "source": [ "# Setup\n", "%matplotlib inline\n", "import os\n", "import sys\n", "import math\n", "import time\n", "import torch\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "plt.rcParams['font.size'] = 20\n", "data_dir = os.path.expanduser('~/.pytorch-datasets')\n", "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", "print(f'running on: {device}')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Theory Reminders: The RL setting" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Reinforcement learning is a general framework of a learning setting which includes:\n", "- An agent: something which interacts, or performs actions on an **environment** on our behalf,\n", " according to some deterministic or stochcastic policy.\n", "- Actions: Things that the agent can do.\n", "- An environment: Everything outside of the agent's control. The agent can (partially) observe it's state,\n", " and it periodically gets rewards from the environment.\n", "- Observations: Things about the state of the environment which the agent can periodically observe.\n", "- Reward: A scalar value which the agent receives from the environment after (some) actions." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Example**: Playing video games\n", "- Agent: A human or AI player that can perform actions in the game. \n", "- Actions: what can be done by the player (e.g. move, fire, use, etc.).\n", "- Environment: The internal state of the game, and anything that influences it, e.g. other players. Potentially everything in the universe.\n", "- Observations: What the player sees, e.g. game screen, score.\n", "- Reward: Total accumulated score." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "So far we have seen two types of learning paradigms:\n", "- Supervised, in which we learn a mapping based on labelled samples;\n", "- Unsupervised, in which we learn the latent structure of our data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Reinforcement learning is a different paradigm which doesn't cleanly map into either supervised or unsupervised.\n", "\n", "- On one hand, there are no predefined labels.\n", "- However, instead we have a **reward system**, which guides the learning process through.\n", "- By observing rewards (which can be positive, negative or neutral) we expect our agent to learn what actions (and which states) lead to positive rewards.\n", "- In essence, we **create our own labels** based on the experiences of the agent." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The RL setting presents some unique challenges." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Non i.i.d observations\n", " - since they depend on the agent\n", " - we might only observe only non-useful information\n", "- Exploration vs. exploitation trade-off\n", " - discovering new strategies may be at the cost of short term rewards loss\n", "- Delayed reward: can even be one single reward at the end\n", " - need to discover causal relations between actions and rewards despite the delays" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Markov processes" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A **Markov process** (MP; aka Markov chain), is a system with a finite number of **states**, and time-invariante **transition probabilities** between them.\n", "- **Markov property**: transition probabilities to next state depend only on current state.\n", "- At each time step $t$, the next state $S_{t+1}$ is sampled based on the current state $S_{t}$.\n", "- Fully described by states $\\cset{S}=\\{s_i\\}_{i=1}^{N}$ and transition matrix\n", " $$P_{i,j}=P(s_i,s_j)=\\Pr(S_{t+1}=s_j|S_{t}=s_i).$$\n", "- Some states, $\\cset{S}_T\\subset\\cset{S}$, may be terminal, i.e.\n", " $\\forall s\\in\\cset{S}_T, s'\\in\\cset{S}:~P(s,s')=0$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A **Markov reward process** (MRP) is an MP where in addition we have,\n", "- Immediate reward for transition from state $s_i$ to state $s_j$: $R_{i,j} = R(s_i,s_j)$.\n", "- Discount factor $\\gamma\\in[0,1]$ for future rewards." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Total discounted reward (aka gain) from time $t$:\n", "$$\n", "G_t = R_{t+1}+\\gamma R_{t+2} + \\dots = \\sum_{k=0}^{\\infty} \\gamma^k R_{t+1+k}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The **value** of a state $s$ it's it's expected future return:\n", "$$\n", "v(s) = \\E{G_t|S_t = s}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A **Markov desicion process** (MDP) is an MRP where in addition,\n", "- We have a finite set of **actions** that can be performed by our agent at each state: $\\cset{A}=\\{a_k\\}_{k=1}^{K}$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The transition probabilities are now action-dependent:\n", " $$P_{i,j,k} = P_{a_k}(s_i,s_j) = \\Pr(S_{t+1}=s_j|S_t=s_i,A_t=a_k).$$\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The immediate reward is now also action-dependent. We will also ignore the dependence on the next state:\n", "$$\n", "R_{t+1} = R(S_t,A_t).\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We define the **policy** of an agent as the conditional distribution,\n", "$$\n", "\\pi(a|s) = \\Pr(A_t=a\\vert S_t=s).\n", "$$\n", "This defines the actions the agent is likely to take at state $s$. Assumed to be time invariant." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The **state-value function** of an MDP is now policy-dependent:\n", "\n", "\\begin{align}\n", "v_{\\pi}(s) = \\E{G_t|S_t = s,\\pi} \n", "= \\E{\\sum_{t=0}^{\\infty} \\gamma^t R_{t+1}\\given S_0=s, \\pi} \n", "= \\E{R_1 +\\gamma v_\\pi(S_1) \\given S_0=s, \\pi}.\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Notice:\n", "1. The state value is the *expected immediate return* plus the *expected discounted value of the next state*.\n", "2. The expectation is over the selected action (under the policy distribution) and the next state due to this action." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Writing the expectation explicitly for the state-value function, we get:\n", "\n", "$$\n", "v_{\\pi}(s) =\n", "\\sum_{a\\in\\cset{A}}\\pi(a|s)R(s,a) +\n", "\\gamma \\sum_{a\\in\\cset{A}} \\sum_{s'\\in\\cset{S}} \\pi(a|s) P_{a}(s,s') v_{\\pi}(s').\n", "$$\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Example MDP with computed state values (not optimal)\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Value of the right study state:\n", "$$\n", "0.5\\cdot(1+0.2\\cdot -1.3 + 0.4 \\cdot 2.7 + 0.4\\cdot 7.4) + 0.5\\cdot (10+0) = 7.4\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We also define an **action-value function** which is the expected return of a an agent starting at state $s$ and performing action $a$:\n", "\n", "\\begin{align}\n", "q_{\\pi}(s,a) &= \\E{G_t|S_t = s,A_t=a,\\pi} \\\\\n", "&= \\E{\\sum_{t=0}^{\\infty} \\gamma^t R_{t+1}\\given S_0=s, A_0=a, \\pi} \\\\\n", "&= \\E{R_1 + \\gamma q_{\\pi}(S_1,A_1) \\given S_0=s, A_0=a, \\pi}.\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Similarly to before, we can write the expectation explicitly for the action-value function:\n", "\n", "$$\n", "q_{\\pi}(s,a) =\n", "R(s,a) +\n", "\\gamma \\sum_{s'\\in\\cset{S}} P_{a}(s,s') \\sum_{a'\\in\\cset{A}} \\pi(a'|s') q_{\\pi}(s',a').\n", "$$\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Notice that if instead of taking the expectation over actions we take the action with the maximal value,\n", "we'll get a better action-value for our current state.\n", "\n", "Therefore, any **optimal** action-value function $q^\\ast$ must satisfy\n", "\n", "$$\n", "q^\\ast(s,a) =\n", "R(s,a) +\n", "\\gamma \\sum_{s'\\in\\cset{S}} P_{a}(s,s') \\max_{a'\\in\\cset{A}} q^\\ast(s',a'),\n", "$$\n", "which is known as the **Bellman optimiality equation**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- For any MDP, there is always at least one **deterministic optimal policy**.\n", "- If we somehow know the optimal action value function, $q^\\ast(s,a)$, we can get an optimal policy:\n", "\n", "$$\n", "\\pi^\\ast(a|s) =\n", "\\begin{cases}\n", "1, & a = \\arg\\max_{a'\\in\\cset{A}} q^\\ast(s,a') \\\\\n", "0, & \\text{else}\n", "\\end{cases}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- In the RL setting, we generally assume an MDP-based environment, however we **do not** assume that $P$ and $R$ are known.\n", "- Therefore, the challenge is to learn both the action value function (or the policy directly)\n", " while simultaneously also learning the underlying environment dynamics (\"rules of the game\")." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Experiences and Episodes" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "These are two important RL terms which are commonly used in the context of\n", "gathering data from and training RL models on this data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- An **experience** is a tuple $(S_t,A_t,R_{t+1},S_{t+1})$. It represents what happened to an agent due to his action at time $t$.\n", "\n", "- An **episode** is a sequence of experiences\n", " $$\n", " \\left\\{ (S_0,A_0,R_1), (S_1,A_1,R_2), \\dots \\right\\}\n", " $$\n", " which represents one entire \"game\" for the agent.\n", " \n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## OpenAI Gym" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\n", "\n", "From the official [site](https://gym.openai.com):\n", "\n", " Gym is a toolkit for developing and comparing reinforcement\n", " learning algorithms. \n", " It supports teaching agents everything from walking to playing games\n", " like Pong or Pinball." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In RL terms, `gym` will provide us an **environment** which comes with states,\n", "possible actions and rewards.\n", "We will implement our **agent** to work with these environments." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We'll see a quick example and then explain what's going on and how to use `gym`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let's play the classic Atari game **Space Invaders**, using a randomly-playing agent.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Played 874 steps. Total reward: 235.0\n" ] } ], "source": [ "import gym\n", "from gym.wrappers import Monitor\n", "\n", "# Create a new SpaceInvaders environment\n", "# Wrap it in a Monitor so that we record video\n", "with Monitor(gym.make('SpaceInvaders-v0'), \"out\", force=True) as env:\n", " \n", " # Reset the env to start a new episode\n", " env.reset()\n", " episode_done = False\n", " total_reward = 0\n", " total_steps = 0\n", " \n", " # This is our agent code. It will just play randomly.\n", " # As long as the episode is not done (not Game Over), we:\n", " while not episode_done:\n", " # 1. Choose a random valid action to do\n", " action = env.action_space.sample()\n", " \n", " # 2. Do the random action and get feedback from the environment\n", " obs, reward, episode_done, extra_info = env.step(action)\n", " \n", " total_reward += reward\n", " total_steps += 1\n", "\n", "print(f'Played {total_steps} steps. Total reward: {total_reward}')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "How do we see what happened?\n", "\n", "We have a video recording of the last episode generated by our `Monitor`." ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import IPython.display\n", "\n", "# A helper function that shows the last video from a Monitor env wrapper\n", "def show_last_video(monitor_env, width='auto', height='auto'):\n", " video_path = monitor_env.videos[-1][0]\n", " video_path = os.path.relpath(video_path, start=os.path.curdir)\n", " \n", " raw_html = f'