{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Building an Agent to Play Atari games using Deep Q Network" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "First we import all the necessary libraries \n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import gym\n", "import tensorflow as tf\n", "from tensorflow.contrib.layers import flatten, conv2d, fully_connected\n", "from collections import deque, Counter\n", "import random\n", "from datetime import datetime" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Now we define a function called preprocess_observation for preprocessing our input game screen. We reduce the image size\n", "and convert the image into greyscale." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "color = np.array([210, 164, 74]).mean()\n", "\n", "def preprocess_observation(obs):\n", "\n", " # Crop and resize the image\n", " img = obs[1:176:2, ::2]\n", "\n", " # Convert the image to greyscale\n", " img = img.mean(axis=2)\n", "\n", " # Improve image contrast\n", " img[img==color] = 0\n", "\n", " # Next we normalize the image from -1 to +1\n", " img = (img - 128) / 128 - 1\n", "\n", " return img.reshape(88,80,1)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Let us initialize our gym environment" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[2018-06-11 16:20:23,979] Making new env: MsPacman-v0\n" ] } ], "source": [ "env = gym.make(\"MsPacman-v0\")\n", "n_outputs = env.action_space.n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Okay, Now we define a function called q_network for building our Q network. We input the game state\n", "to the Q network and get the Q values for all the actions in that state.

\n", "We build Q network with three convolutional layers with same padding followed by a fully connected layer. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "tf.reset_default_graph()\n", "\n", "def q_network(X, name_scope):\n", " \n", " # Initialize layers\n", " initializer = tf.contrib.layers.variance_scaling_initializer()\n", "\n", " with tf.variable_scope(name_scope) as scope: \n", "\n", " # initialize the convolutional layers\n", " layer_1 = conv2d(X, num_outputs=32, kernel_size=(8,8), stride=4, padding='SAME', weights_initializer=initializer) \n", " tf.summary.histogram('layer_1',layer_1)\n", " \n", " layer_2 = conv2d(layer_1, num_outputs=64, kernel_size=(4,4), stride=2, padding='SAME', weights_initializer=initializer)\n", " tf.summary.histogram('layer_2',layer_2)\n", " \n", " layer_3 = conv2d(layer_2, num_outputs=64, kernel_size=(3,3), stride=1, padding='SAME', weights_initializer=initializer)\n", " tf.summary.histogram('layer_3',layer_3)\n", " \n", " # Flatten the result of layer_3 before feeding to the fully connected layer\n", " flat = flatten(layer_3)\n", "\n", " fc = fully_connected(flat, num_outputs=128, weights_initializer=initializer)\n", " tf.summary.histogram('fc',fc)\n", " \n", " output = fully_connected(fc, num_outputs=n_outputs, activation_fn=None, weights_initializer=initializer)\n", " tf.summary.histogram('output',output)\n", " \n", "\n", " # Vars will store the parameters of the network such as weights\n", " vars = {v.name[len(scope.name):]: v for v in tf.get_collection(key=tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope.name)} \n", " return vars, output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we define a function called epsilon_greedy for performing epsilon greedy policy. In epsilon greedy policy we either select the best action with probability 1 - epsilon or a random action with\n", "probability epsilon.\n", "\n", "We use decaying epsilon greedy policy where value of epsilon will be decaying over time as we don't want to explore\n", "forever. So over time our policy will be exploiting only good actions." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "epsilon = 0.5\n", "eps_min = 0.05\n", "eps_max = 1.0\n", "eps_decay_steps = 500000\n", "\n", "def epsilon_greedy(action, step):\n", " p = np.random.random(1).squeeze()\n", " epsilon = max(eps_min, eps_max - (eps_max-eps_min) * step/eps_decay_steps)\n", " if np.random.rand() < epsilon:\n", " return np.random.randint(n_outputs)\n", " else:\n", " return action" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we initialize our experience replay buffer of length 20000 which holds the experience.\n", "\n", "We store all the agent's experience i.e (state, action, rewards) in the experience replay buffer\n", "and we sample from this minibatch of experience for training the network." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "buffer_len = 20000\n", "exp_buffer = deque(maxlen=buffer_len)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we define a function called sample_memories for sampling experiences from the memory. Batch size is the number of experience sampled\n", "from the memory.\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def sample_memories(batch_size):\n", " perm_batch = np.random.permutation(len(exp_buffer))[:batch_size]\n", " mem = np.array(exp_buffer)[perm_batch]\n", " return mem[:,0], mem[:,1], mem[:,2], mem[:,3], mem[:,4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we define our network hyperparameters," ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "num_episodes = 800\n", "batch_size = 48\n", "input_shape = (None, 88, 80, 1)\n", "learning_rate = 0.001\n", "X_shape = (None, 88, 80, 1)\n", "discount_factor = 0.97\n", "\n", "global_step = 0\n", "copy_steps = 100\n", "steps_train = 4\n", "start_steps = 2000" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "logdir = 'logs'\n", "tf.reset_default_graph()\n", "\n", "# Now we define the placeholder for our input i.e game state\n", "X = tf.placeholder(tf.float32, shape=X_shape)\n", "\n", "# we define a boolean called in_training_model to toggle the training\n", "in_training_mode = tf.placeholder(tf.bool)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Now let us build our primary and target Q network" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# we build our Q network, which takes the input X and generates Q values for all the actions in the state\n", "mainQ, mainQ_outputs = q_network(X, 'mainQ')\n", "\n", "# similarly we build our target Q network\n", "targetQ, targetQ_outputs = q_network(X, 'targetQ')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# define the placeholder for our action values\n", "X_action = tf.placeholder(tf.int32, shape=(None,))\n", "Q_action = tf.reduce_sum(targetQ_outputs * tf.one_hot(X_action, n_outputs), axis=-1, keep_dims=True)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Copy the primary Q network parameters to the target Q network" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "copy_op = [tf.assign(main_name, targetQ[var_name]) for var_name, main_name in mainQ.items()]\n", "copy_target_to_main = tf.group(*copy_op)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute and optimize loss using gradient descent optimizer" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# define a placeholder for our output i.e action\n", "y = tf.placeholder(tf.float32, shape=(None,1))\n", "\n", "# now we calculate the loss which is the difference between actual value and predicted value\n", "loss = tf.reduce_mean(tf.square(y - Q_action))\n", "\n", "# we use adam optimizer for minimizing the loss\n", "optimizer = tf.train.AdamOptimizer(learning_rate)\n", "training_op = optimizer.minimize(loss)\n", "\n", "init = tf.global_variables_initializer()\n", "\n", "loss_summary = tf.summary.scalar('LOSS', loss)\n", "merge_summary = tf.summary.merge_all()\n", "file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Now we start the tensorflow session and run the model," ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 675 Reward 200.0\n", "Epoch 551 Reward 160.0\n", "Epoch 729 Reward 260.0\n" ] } ], "source": [ "with tf.Session() as sess:\n", " init.run()\n", " \n", " # for each episode\n", " for i in range(num_episodes):\n", " done = False\n", " obs = env.reset()\n", " epoch = 0\n", " episodic_reward = 0\n", " actions_counter = Counter() \n", " episodic_loss = []\n", "\n", " # while the state is not the terminal state\n", " while not done:\n", "\n", " #env.render()\n", " \n", " # get the preprocessed game screen\n", " obs = preprocess_observation(obs)\n", "\n", " # feed the game screen and get the Q values for each action\n", " actions = mainQ_outputs.eval(feed_dict={X:[obs], in_training_mode:False})\n", "\n", " # get the action\n", " action = np.argmax(actions, axis=-1)\n", " actions_counter[str(action)] += 1 \n", "\n", " # select the action using epsilon greedy policy\n", " action = epsilon_greedy(action, global_step)\n", " \n", " # now perform the action and move to the next state, next_obs, receive reward\n", " next_obs, reward, done, _ = env.step(action)\n", "\n", " # Store this transistion as an experience in the replay buffer\n", " exp_buffer.append([obs, action, preprocess_observation(next_obs), reward, done])\n", " \n", " # After certain steps, we train our Q network with samples from the experience replay buffer\n", " if global_step % steps_train == 0 and global_step > start_steps:\n", " \n", " # sample experience\n", " o_obs, o_act, o_next_obs, o_rew, o_done = sample_memories(batch_size)\n", "\n", " # states\n", " o_obs = [x for x in o_obs]\n", "\n", " # next states\n", " o_next_obs = [x for x in o_next_obs]\n", "\n", " # next actions\n", " next_act = mainQ_outputs.eval(feed_dict={X:o_next_obs, in_training_mode:False})\n", "\n", "\n", " # reward\n", " y_batch = o_rew + discount_factor * np.max(next_act, axis=-1) * (1-o_done) \n", "\n", " # merge all summaries and write to the file\n", " mrg_summary = merge_summary.eval(feed_dict={X:o_obs, y:np.expand_dims(y_batch, axis=-1), X_action:o_act, in_training_mode:False})\n", " file_writer.add_summary(mrg_summary, global_step)\n", "\n", " # now we train the network and calculate loss\n", " train_loss, _ = sess.run([loss, training_op], feed_dict={X:o_obs, y:np.expand_dims(y_batch, axis=-1), X_action:o_act, in_training_mode:True})\n", " episodic_loss.append(train_loss)\n", " \n", " # after some interval we copy our main Q network weights to target Q network\n", " if (global_step+1) % copy_steps == 0 and global_step > start_steps:\n", " copy_target_to_main.run()\n", " \n", " obs = next_obs\n", " epoch += 1\n", " global_step += 1\n", " episodic_reward += reward\n", " \n", " print('Epoch', epoch, 'Reward', episodic_reward,)\n", " " ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:universe]", "language": "python", "name": "conda-env-universe-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" } }, "nbformat": 4, "nbformat_minor": 2 }