{ "cells": [ { "cell_type": "markdown", "id": "c3e89aca-6c52-4b20-b1a3-9279ef0fd99b", "metadata": {}, "source": [ "# Deep Q-Network (DQN) on LunarLander-v2\n", "\n", "> In this post, we will work through a hands-on lab that implements a simple Deep Q-Network (DQN) on OpenAI Gym's LunarLander-v2 environment. This is a coding exercise from the Udacity Deep Reinforcement Learning Nanodegree.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Reinforcement_Learning, PyTorch, Udacity]\n", "- image: images/LunarLander-v2.gif" ] }, { "cell_type": "markdown", "id": "fd834f1b-51c4-4910-a4d6-3b212e1a2a5a", "metadata": {}, "source": [ "## Deep Q-Network (DQN)\n", "---\n", "In this notebook, you will implement a DQN agent with OpenAI Gym's LunarLander-v2 environment.\n", "\n", "### Import the Necessary Packages" ] }, { "cell_type": "code", "execution_count": 1, "id": "22518486-3c2f-47fe-92b3-1502875eacfe", "metadata": {}, "outputs": [], "source": [ "import gym\n", "import random\n", "import torch\n", "import torch.nn as nn\n", "import torch.nn.functional as F\n", "import torch.optim as optim\n", "import matplotlib.pyplot as plt\n", "import base64, io\n", "\n", "import numpy as np\n", "from collections import deque, namedtuple\n", "\n", "# For visualization\n", "from gym.wrappers.monitoring import video_recorder\n", "from IPython.display import HTML\n", "from IPython import display \n", "import glob" ] }, { "cell_type": "markdown", "id": "f75f934c-6921-43aa-8389-6df4b993eca4", "metadata": {}, "source": [ "### Instantiate the Environment and Agent\n", "\n", "Initialize the environment."
] }, { "cell_type": "code", "execution_count": 2, "id": "594828b6-da33-481d-ab42-041e8c17ffea", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "State shape: (8,)\n", "Number of actions: 4\n" ] } ], "source": [ "env = gym.make('LunarLander-v2')\n", "env.seed(0)\n", "print('State shape: ', env.observation_space.shape)\n", "print('Number of actions: ', env.action_space.n)" ] }, { "cell_type": "markdown", "id": "03735bdc-c07e-4c87-b208-cce894bb8e43", "metadata": {}, "source": [ "### Define the Neural Network Architecture\n", "\n", "Since `LunarLander-v2` is a relatively simple environment, we don't need a complicated architecture. We just need a non-linear function approximator that maps a state to one estimated value per action." ] }, { "cell_type": "code", "execution_count": 3, "id": "ae834607-433e-4ed5-8b23-8de7b53230a8", "metadata": {}, "outputs": [], "source": [ "class QNetwork(nn.Module):\n", "    \"\"\"Q-network: maps a state to one estimated value per action.\"\"\"\n", "\n", "    def __init__(self, state_size, action_size, seed):\n", "        \"\"\"Initialize parameters and build model.\n", "        Params\n", "        ======\n", "            state_size (int): Dimension of each state\n", "            action_size (int): Number of discrete actions\n", "            seed (int): Random seed\n", "        \"\"\"\n", "        super(QNetwork, self).__init__()\n", "        self.seed = torch.manual_seed(seed)\n", "        self.fc1 = nn.Linear(state_size, 64)\n", "        self.fc2 = nn.Linear(64, 64)\n", "        self.fc3 = nn.Linear(64, action_size)\n", "\n", "    def forward(self, state):\n", "        \"\"\"Build a network that maps state -> action values.\"\"\"\n", "        x = self.fc1(state)\n", "        x = F.relu(x)\n", "        x = self.fc2(x)\n", "        x = F.relu(x)\n", "        return self.fc3(x)" ] }, { "cell_type": "markdown", "id": "b0873298-cab0-4dcb-9c84-4782dc914dd7", "metadata": {}, "source": [ "### Define Hyperparameters" ] }, { "cell_type": "code", "execution_count": 4, "id": "7010c525-29d8-445c-8769-6cfb7d00948b", "metadata": {}, "outputs": [], "source": [ "BUFFER_SIZE = int(1e5) # replay buffer size\n", "BATCH_SIZE = 
64 # minibatch size\n", "GAMMA = 0.99 # discount factor\n", "TAU = 1e-3 # for soft update of target parameters\n", "LR = 5e-4 # learning rate \n", "UPDATE_EVERY = 4 # how often to update the network" ] }, { "cell_type": "code", "execution_count": 5, "id": "334bb7d8-7d62-4cfb-96f7-f8809ba8e089", "metadata": {}, "outputs": [], "source": [ "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")" ] }, { "cell_type": "markdown", "id": "1d861efe-200c-4690-9698-722abbf0b77c", "metadata": {}, "source": [ "### Define Agent" ] }, { "cell_type": "code", "execution_count": 6, "id": "0530f456-2bfd-4061-ad62-f14846a9a284", "metadata": {}, "outputs": [], "source": [ "class Agent():\n", "    \"\"\"Interacts with and learns from the environment.\"\"\"\n", "\n", "    def __init__(self, state_size, action_size, seed):\n", "        \"\"\"Initialize an Agent object.\n", "\n", "        Params\n", "        ======\n", "            state_size (int): dimension of each state\n", "            action_size (int): dimension of each action\n", "            seed (int): random seed\n", "        \"\"\"\n", "        self.state_size = state_size\n", "        self.action_size = action_size\n", "        self.seed = random.seed(seed)\n", "\n", "        # Q-Network\n", "        self.qnetwork_local = QNetwork(state_size, action_size, seed).to(device)\n", "        self.qnetwork_target = QNetwork(state_size, action_size, seed).to(device)\n", "        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=LR)\n", "\n", "        # Replay memory\n", "        self.memory = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE, seed)\n", "        # Initialize time step (for updating every UPDATE_EVERY steps)\n", "        self.t_step = 0\n", "\n", "    def step(self, state, action, reward, next_state, done):\n", "        # Save experience in replay memory\n", "        self.memory.add(state, action, reward, next_state, done)\n", "\n", "        # Learn every UPDATE_EVERY time steps.\n", "        self.t_step = (self.t_step + 1) % UPDATE_EVERY\n", "        if self.t_step == 0:\n", "            # If enough samples are available in memory, get random subset and learn\n", "            if 
len(self.memory) > BATCH_SIZE:\n", "                experiences = self.memory.sample()\n", "                self.learn(experiences, GAMMA)\n", "\n", "    def act(self, state, eps=0.):\n", "        \"\"\"Returns actions for given state as per current policy.\n", "\n", "        Params\n", "        ======\n", "            state (array_like): current state\n", "            eps (float): epsilon, for epsilon-greedy action selection\n", "        \"\"\"\n", "        state = torch.from_numpy(state).float().unsqueeze(0).to(device)\n", "        self.qnetwork_local.eval()\n", "        with torch.no_grad():\n", "            action_values = self.qnetwork_local(state)\n", "        self.qnetwork_local.train()\n", "\n", "        # Epsilon-greedy action selection\n", "        if random.random() > eps:\n", "            return np.argmax(action_values.cpu().data.numpy())\n", "        else:\n", "            return random.choice(np.arange(self.action_size))\n", "\n", "    def learn(self, experiences, gamma):\n", "        \"\"\"Update value parameters using given batch of experience tuples.\n", "\n", "        Params\n", "        ======\n", "            experiences (Tuple[torch.Variable]): tuple of (s, a, r, s', done) tuples \n", "            gamma (float): discount factor\n", "        \"\"\"\n", "        # Obtain random minibatch of tuples from D\n", "        states, actions, rewards, next_states, dones = experiences\n", "\n", "        ## Compute and minimize the loss\n", "        ### Extract next maximum estimated value from target network\n", "        q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)\n", "        ### Calculate target value from bellman equation\n", "        q_targets = rewards + gamma * q_targets_next * (1 - dones)\n", "        ### Calculate expected value from local network\n", "        q_expected = self.qnetwork_local(states).gather(1, actions)\n", "\n", "        ### Loss calculation (we used Mean squared error)\n", "        loss = F.mse_loss(q_expected, q_targets)\n", "        self.optimizer.zero_grad()\n", "        loss.backward()\n", "        self.optimizer.step()\n", "\n", "        # ------------------- update target network ------------------- #\n", "        self.soft_update(self.qnetwork_local, self.qnetwork_target, TAU)\n", "\n", "    def 
soft_update(self, local_model, target_model, tau):\n", "        \"\"\"Soft update model parameters.\n", "        θ_target = τ*θ_local + (1 - τ)*θ_target\n", "\n", "        Params\n", "        ======\n", "            local_model (PyTorch model): weights will be copied from\n", "            target_model (PyTorch model): weights will be copied to\n", "            tau (float): interpolation parameter \n", "        \"\"\"\n", "        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):\n", "            target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)" ] }, { "cell_type": "markdown", "id": "4eb5db66-ecb3-4bf4-aee1-3e18003c17b0", "metadata": {}, "source": [ "### Define Replay Buffer" ] }, { "cell_type": "code", "execution_count": 7, "id": "f74bb08e-0b95-42db-9fc8-d609514d55af", "metadata": {}, "outputs": [], "source": [ "class ReplayBuffer:\n", "    \"\"\"Fixed-size buffer to store experience tuples.\"\"\"\n", "\n", "    def __init__(self, action_size, buffer_size, batch_size, seed):\n", "        \"\"\"Initialize a ReplayBuffer object.\n", "\n", "        Params\n", "        ======\n", "            action_size (int): dimension of each action\n", "            buffer_size (int): maximum size of buffer\n", "            batch_size (int): size of each training batch\n", "            seed (int): random seed\n", "        \"\"\"\n", "        self.action_size = action_size\n", "        self.memory = deque(maxlen=buffer_size)\n", "        self.batch_size = batch_size\n", "        self.experience = namedtuple(\"Experience\", field_names=[\"state\", \"action\", \"reward\", \"next_state\", \"done\"])\n", "        self.seed = random.seed(seed)\n", "\n", "    def add(self, state, action, reward, next_state, done):\n", "        \"\"\"Add a new experience to memory.\"\"\"\n", "        e = self.experience(state, action, reward, next_state, done)\n", "        self.memory.append(e)\n", "\n", "    def sample(self):\n", "        \"\"\"Randomly sample a batch of experiences from memory.\"\"\"\n", "        experiences = random.sample(self.memory, k=self.batch_size)\n", "\n", "        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not 
None])).float().to(device)\n", "        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(device)\n", "        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)\n", "        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)\n", "        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)\n", "\n", "        return (states, actions, rewards, next_states, dones)\n", "\n", "    def __len__(self):\n", "        \"\"\"Return the current size of internal memory.\"\"\"\n", "        return len(self.memory)" ] }, { "cell_type": "markdown", "id": "6bd61b8d-b63e-444f-9cb7-6295df46995d", "metadata": {}, "source": [ "### Training Process" ] }, { "cell_type": "code", "execution_count": 8, "id": "c907ab57-ed48-4824-b27c-8c7d707d6919", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Episode 100\tAverage Score: -203.18\n", "Episode 200\tAverage Score: -121.23\n", "Episode 300\tAverage Score: -55.39\n", "Episode 400\tAverage Score: -12.05\n", "Episode 500\tAverage Score: 41.70\n", "Episode 600\tAverage Score: 84.14\n", "Episode 700\tAverage Score: 142.01\n", "Episode 800\tAverage Score: 173.82\n", "Episode 900\tAverage Score: 164.89\n", "Episode 1000\tAverage Score: 153.46\n", "Episode 1100\tAverage Score: 184.06\n", "Episode 1200\tAverage Score: 172.37\n", "Episode 1278\tAverage Score: 200.02\n", "Environment solved in 1178 episodes!\tAverage Score: 200.02\n" ] } ], "source": [ "def dqn(n_episodes=2000, max_t=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.995):\n", "    \"\"\"Deep Q-Learning.\n", "\n", "    Params\n", "    ======\n", "        n_episodes (int): maximum number of training episodes\n", "        max_t (int): maximum number of timesteps per episode\n", "        eps_start (float): starting value of epsilon, for epsilon-greedy action selection\n", "        eps_end (float): minimum value 
of epsilon\n", "        eps_decay (float): multiplicative factor (per episode) for decreasing epsilon\n", "    \"\"\"\n", "    scores = [] # list containing scores from each episode\n", "    scores_window = deque(maxlen=100) # last 100 scores\n", "    eps = eps_start # initialize epsilon\n", "    for i_episode in range(1, n_episodes+1):\n", "        state = env.reset()\n", "        score = 0\n", "        for t in range(max_t):\n", "            action = agent.act(state, eps)\n", "            next_state, reward, done, _ = env.step(action)\n", "            agent.step(state, action, reward, next_state, done)\n", "            state = next_state\n", "            score += reward\n", "            if done:\n", "                break\n", "        scores_window.append(score) # save most recent score\n", "        scores.append(score) # save most recent score\n", "        eps = max(eps_end, eps_decay*eps) # decrease epsilon\n", "        print('\\rEpisode {}\\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)), end=\"\")\n", "        if i_episode % 100 == 0:\n", "            print('\\rEpisode {}\\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))\n", "        if np.mean(scores_window)>=200.0:\n", "            print('\\nEnvironment solved in {:d} episodes!\\tAverage Score: {:.2f}'.format(i_episode-100, np.mean(scores_window)))\n", "            torch.save(agent.qnetwork_local.state_dict(), 'checkpoint.pth')\n", "            break\n", "    return scores\n", "\n", "agent = Agent(state_size=8, action_size=4, seed=0)\n", "scores = dqn()" ] }, { "cell_type": "markdown", "id": "1ba6726a-977a-4345-8897-021302bfc262", "metadata": {}, "source": [ "### Plot the learning progress" ] }, { "cell_type": "code", "execution_count": 9, "id": "d2d491c9-a5dc-4c32-a95d-796f85c60c83", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] },
"metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# plot the scores\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "plt.plot(np.arange(len(scores)), scores)\n", "plt.ylabel('Score')\n", "plt.xlabel('Episode #')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "ebb450a9-c530-4f10-b629-53a2d69860c2", "metadata": {}, "source": [ "### Animate it with Video" ] }, { "cell_type": "code", "execution_count": 13, "id": "c9806a31-f777-468e-8987-0139708ef532", "metadata": {}, "outputs": [], "source": [ "def show_video(env_name):\n", "    mp4list = glob.glob('video/*.mp4')\n", "    if len(mp4list) > 0:\n", "        mp4 = 'video/{}.mp4'.format(env_name)\n", "        video = io.open(mp4, 'r+b').read()\n", "        encoded = base64.b64encode(video)\n", "        # Embed the MP4 inline as a base64 data URI (minimal <video> markup, assumed here)\n", "        display.display(HTML(data='''<video alt=\"test\" autoplay loop controls style=\"height: 400px;\"><source src=\"data:video/mp4;base64,{0}\" type=\"video/mp4\" /></video>'''.format(encoded.decode('ascii'))))\n", "    else:\n", "        print(\"Could not find video\")\n", "\n", "def show_video_of_model(agent, env_name):\n", "    env = gym.make(env_name)\n", "    vid = video_recorder.VideoRecorder(env, path=\"video/{}.mp4\".format(env_name))\n", "    agent.qnetwork_local.load_state_dict(torch.load('checkpoint.pth'))\n", "    state = env.reset()\n", "    done = False\n", "    while not done:\n", "        frame = env.render(mode='rgb_array')\n", "        vid.capture_frame()\n", "\n", "        action = agent.act(state)\n", "\n", "        state, reward, done, _ = env.step(action)\n", "    vid.close()  # finalize the video file before closing the environment\n", "    env.close()" ] }, { "cell_type": "code", "execution_count": 17, "id": "4beb1d45-ab9a-4fd7-a769-017bbf1672e1", "metadata": {}, "outputs": [], "source": [ "agent = Agent(state_size=8, action_size=4, seed=0)\n", "show_video_of_model(agent, 'LunarLander-v2')" ] }, { "cell_type": "code", "execution_count": 18, "id": "f54ee0a0-265b-4161-bbb6-25e28543aa2e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_video('LunarLander-v2')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": 
"python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 5 }