{ "cells": [ { "cell_type": "markdown", "id": "3e646804", "metadata": {}, "source": [ "# Cross-Entropy Methods (CEM) on MountainCarContinuous-v0\n", "\n", "> In this post, we will work through a hands-on lab on the cross-entropy method (CEM) using the OpenAI Gym MountainCarContinuous-v0 environment. This is a coding exercise from the Udacity Deep Reinforcement Learning Nanodegree.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Reinforcement_Learning, PyTorch, Udacity]\n", "- image: images/MountainCarContinuous-v0.gif" ] }, { "cell_type": "markdown", "id": "b6e1d4f5", "metadata": {}, "source": [ "## Cross-Entropy Methods (CEM)\n", "---\n", "In this notebook, you will implement CEM on OpenAI Gym's MountainCarContinuous-v0 environment. In summary, the **cross-entropy method** is a form of black-box optimization: at each iteration it samples a small population of policies near the current one and uses a small fraction of the best-performing ones to compute a new estimate.\n", "\n", "### Import the Necessary Packages" ] }, { "cell_type": "code", "execution_count": 1, "id": "063a557c", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\kcsgo\\anaconda3\\lib\\site-packages\\numpy\\_distributor_init.py:32: UserWarning: loaded more than 1 DLL from .libs:\n", "C:\\Users\\kcsgo\\anaconda3\\lib\\site-packages\\numpy\\.libs\\libopenblas.PYQHXLVVQ7VESDPUVUADXEVJOBGHJPAY.gfortran-win_amd64.dll\n", "C:\\Users\\kcsgo\\anaconda3\\lib\\site-packages\\numpy\\.libs\\libopenblas.WCDJNK7YVMPZQ2ME2ZZHJJRJ3JIKNDB7.gfortran-win_amd64.dll\n", " stacklevel=1)\n" ] } ], "source": [ "import gym\n", "import math\n", "import numpy as np\n", "from collections import deque\n", "import matplotlib.pyplot as plt\n", "\n", "import torch\n", "import torch.nn as nn\n", "import torch.nn.functional as F\n", "from torch.autograd import Variable\n", "\n", "import base64, io\n", "\n", "# For 
visualization\n", "from gym.wrappers.monitoring import video_recorder\n", "from IPython.display import HTML\n", "from IPython import display \n", "import glob" ] }, { "cell_type": "code", "execution_count": 2, "id": "de727785", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "device(type='cuda', index=0)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")\n", "device" ] }, { "cell_type": "markdown", "id": "67faf368", "metadata": {}, "source": [ "### Instantiate the Environment and Agent\n", "\n", "The MountainCar environment comes in two versions: discrete and continuous. In this notebook, we use the continuous version, in which the action is a real-valued force in [-1, 1], so the car can be pushed left or right with fine-grained control." ] }, { "cell_type": "code", "execution_count": 3, "id": "c6c02db1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "observation space: Box(-1.2000000476837158, 0.6000000238418579, (2,), float32)\n", "action space: Box(-1.0, 1.0, (1,), float32)\n", " - low: [-1.]\n", " - high: [1.]\n" ] } ], "source": [ "env = gym.make('MountainCarContinuous-v0')\n", "env.seed(101)\n", "np.random.seed(101)\n", "\n", "print('observation space:', env.observation_space)\n", "print('action space:', env.action_space)\n", "print(' - low:', env.action_space.low)\n", "print(' - high:', env.action_space.high)" ] }, { "cell_type": "markdown", "id": "d6ebff61", "metadata": {}, "source": [ "### Define Agent " ] }, { "cell_type": "code", "execution_count": 4, "id": "8a769b19", "metadata": {}, "outputs": [], "source": [ "class Agent(nn.Module):\n", "    def __init__(self, env, h_size=16):\n", "        super(Agent, self).__init__()\n", "        self.env = env\n", "        # state, hidden layer, action sizes\n", "        self.s_size = env.observation_space.shape[0]\n", "        self.h_size = h_size\n", "        self.a_size = env.action_space.shape[0]\n", "        # define layers (we used 2 layers)\n", "        
self.fc1 = nn.Linear(self.s_size, self.h_size)\n", "        self.fc2 = nn.Linear(self.h_size, self.a_size)\n", "\n", "    def set_weights(self, weights):\n", "        s_size = self.s_size\n", "        h_size = self.h_size\n", "        a_size = self.a_size\n", "        # separate the weights for each layer\n", "        fc1_end = (s_size * h_size) + h_size\n", "        fc1_W = torch.from_numpy(weights[:s_size*h_size].reshape(s_size, h_size))\n", "        fc1_b = torch.from_numpy(weights[s_size*h_size:fc1_end])\n", "        fc2_W = torch.from_numpy(weights[fc1_end:fc1_end+(h_size*a_size)].reshape(h_size, a_size))\n", "        fc2_b = torch.from_numpy(weights[fc1_end+(h_size*a_size):])\n", "        # set the weights for each layer\n", "        self.fc1.weight.data.copy_(fc1_W.view_as(self.fc1.weight.data))\n", "        self.fc1.bias.data.copy_(fc1_b.view_as(self.fc1.bias.data))\n", "        self.fc2.weight.data.copy_(fc2_W.view_as(self.fc2.weight.data))\n", "        self.fc2.bias.data.copy_(fc2_b.view_as(self.fc2.bias.data))\n", "\n", "    def get_weights_dim(self):\n", "        return (self.s_size + 1) * self.h_size + (self.h_size + 1) * self.a_size\n", "\n", "    def forward(self, x):\n", "        x = F.relu(self.fc1(x))\n", "        x = torch.tanh(self.fc2(x))\n", "        return x.cpu().data\n", "\n", "    def act(self, state):\n", "        state = torch.from_numpy(state).float().to(device)\n", "        with torch.no_grad():\n", "            action = self.forward(state)\n", "        return action\n", "\n", "    def evaluate(self, weights, gamma=1.0, max_t=5000):\n", "        self.set_weights(weights)\n", "        episode_return = 0.0\n", "        state = self.env.reset()\n", "        for t in range(max_t):\n", "            state = torch.from_numpy(state).float().to(device)\n", "            action = self.forward(state)\n", "            state, reward, done, _ = self.env.step(action)\n", "            episode_return += reward * math.pow(gamma, t)\n", "            if done:\n", "                break\n", "        return episode_return" ] }, { "cell_type": "markdown", "id": "ea18646f", "metadata": {}, "source": [ "### Cross Entropy Method" ] }, { "cell_type": "code", "execution_count": 5, "id": "e0c930e7", "metadata": {}, "outputs": [], "source": 
[ "def cem(agent, n_iterations=500, max_t=1000, gamma=1.0, print_every=10, pop_size=50, elite_frac=0.2, sigma=0.5):\n", "    \"\"\"PyTorch implementation of the cross-entropy method.\n", "\n", "    Params\n", "    ======\n", "        agent (object): agent instance\n", "        n_iterations (int): maximum number of training iterations\n", "        max_t (int): maximum number of timesteps per episode\n", "        gamma (float): discount rate\n", "        print_every (int): how often to print average score (over the last 100 iterations)\n", "        pop_size (int): size of population at each iteration\n", "        elite_frac (float): fraction of top performers to use in update\n", "        sigma (float): standard deviation of additive noise\n", "    \"\"\"\n", "    n_elite=int(pop_size*elite_frac)\n", "\n", "    scores_deque = deque(maxlen=100)\n", "    scores = []\n", "    # Initialize the weights with random noise\n", "    best_weight = sigma * np.random.randn(agent.get_weights_dim())\n", "\n", "    for i_iteration in range(1, n_iterations+1):\n", "        # Define the candidates and get the reward of each candidate\n", "        weights_pop = [best_weight + (sigma * np.random.randn(agent.get_weights_dim())) for i in range(pop_size)]\n", "        rewards = np.array([agent.evaluate(weights, gamma, max_t) for weights in weights_pop])\n", "\n", "        # Select best candidates from collected rewards\n", "        elite_idxs = rewards.argsort()[-n_elite:]\n", "        elite_weights = [weights_pop[i] for i in elite_idxs]\n", "        best_weight = np.array(elite_weights).mean(axis=0)\n", "\n", "        reward = agent.evaluate(best_weight, gamma=1.0)\n", "        scores_deque.append(reward)\n", "        scores.append(reward)\n", "\n", "        torch.save(agent.state_dict(), 'checkpoint.pth')\n", "\n", "        if i_iteration % print_every == 0:\n", "            print('Episode {}\\tAverage Score: {:.2f}'.format(i_iteration, np.mean(scores_deque)))\n", "\n", "        if np.mean(scores_deque)>=90.0:\n", "            # i_iteration-100 is the iteration just before the 100-iteration averaging window began\n", "            print('\\nEnvironment solved in {:d} iterations!\\tAverage Score: {:.2f}'.format(i_iteration-100, np.mean(scores_deque)))\n", "            break\n", "    return scores" ] }, 
{ "cell_type": "markdown", "id": "3735df07", "metadata": {}, "source": [ "### Run" ] }, { "cell_type": "code", "execution_count": 6, "id": "4c662c01", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Episode 10\tAverage Score: -1.44\n", "Episode 20\tAverage Score: -3.98\n", "Episode 30\tAverage Score: -4.18\n", "Episode 40\tAverage Score: 2.57\n", "Episode 50\tAverage Score: 18.74\n", "Episode 60\tAverage Score: 29.35\n", "Episode 70\tAverage Score: 38.69\n", "Episode 80\tAverage Score: 45.65\n", "Episode 90\tAverage Score: 47.98\n", "Episode 100\tAverage Score: 52.56\n", "Episode 110\tAverage Score: 62.09\n", "Episode 120\tAverage Score: 72.28\n", "Episode 130\tAverage Score: 82.21\n", "Episode 140\tAverage Score: 89.48\n", "\n", "Environment solved in 47 iterations!\tAverage Score: 90.83\n" ] } ], "source": [ "agent = Agent(env).to(device)\n", "scores = cem(agent)" ] }, { "cell_type": "markdown", "id": "51dea6d9", "metadata": {}, "source": [ "### Plot the learning progress" ] }, { "cell_type": "code", "execution_count": 7, "id": "abff3eba", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# plot the scores\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "plt.plot(np.arange(1, len(scores)+1), scores)\n", "plt.ylabel('Score')\n", "plt.xlabel('Episode #')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "75403f40", "metadata": {}, "source": [ "### Animate it with Video" ] }, { "cell_type": "code", "execution_count": 13, "id": "7fae8a12", "metadata": {}, "outputs": [], "source": [ "def show_video(env_name):\n", " mp4list = glob.glob('video/*.mp4')\n", " if len(mp4list) > 0:\n", " mp4 = 'video/{}.mp4'.format(env_name)\n", " video = io.open(mp4, 'r+b').read()\n", " encoded = base64.b64encode(video)\n", " display.display(HTML(data=''''''.format(encoded.decode('ascii'))))\n", " else:\n", " print(\"Could not find video\")\n", " \n", "def show_video_of_model(agent, env_name):\n", " env = gym.make(env_name)\n", " vid = video_recorder.VideoRecorder(env, path=\"video/{}.mp4\".format(env_name))\n", " agent.load_state_dict(torch.load('checkpoint.pth'))\n", " state = env.reset()\n", " done = False\n", " while not done:\n", " vid.capture_frame()\n", " \n", " action = agent.act(state)\n", " next_state, reward, done, _ = env.step(action)\n", " state = next_state\n", " if done:\n", " break \n", " env.close()" ] }, { "cell_type": "code", "execution_count": 17, "id": "c8a347e5", "metadata": {}, "outputs": [], "source": [ "agent = Agent(env).to(device)\n", "show_video_of_model(agent, 'MountainCarContinuous-v0')" ] }, { "cell_type": "code", "execution_count": 18, "id": "7fd10379", "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_video('MountainCarContinuous-v0')" ] }, { "cell_type": "markdown", "id": "ab1adc92-90a4-47a1-9ab8-65d5e947bec7", "metadata": {}, "source": [ "> Note: While I tried to execute the model with VideoRecorder in Linux, it doesn't show 
correctly. On Windows, however, it works! Perhaps this is a bug in VideoRecorder." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 5 }