{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Solving the Taxi Problem using Q Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Goal:\n", "\n", "Say our agent is the driving the taxi. There are totally four locations and the agent has to\n", "pick up a passenger at one location and drop at the another. The agent will receive +20\n", "points as a reward for successful drop off and -1 point for every time step it takes. The agent\n", "will also lose -10 points for illegal pickups and drops. So the goal of our agent is to learn to\n", "pick up and drop passengers at the correct location in a short time without boarding any illegal\n", "passengers." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ " First, we import all necessary libraries and simulate the environment" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[2018-05-23 12:23:58,368] Making new env: Taxi-v1\n" ] } ], "source": [ "import random\n", "import gym\n", "env = gym.make('Taxi-v1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " The environment is shown below, where the letters (R, G, Y, B) represents the different\n", "locations and a tiny yellow colored rectangle is the taxi driving by our agent." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------+\n", "|\u001b[34;1mR\u001b[0m: | : :G|\n", "| : : : : |\n", "| : : : : |\n", "| |\u001b[43m \u001b[0m: | : |\n", "|Y| : |\u001b[35mB\u001b[0m: |\n", "+---------+\n", "\n" ] } ], "source": [ "env.render()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Now, we initialize, Q table as a dictionary which stores state-action pair specifying value of performing an action a in\n", " state s." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "q = {}\n", "for s in range(env.observation_space.n):\n", " for a in range(env.action_space.n):\n", " q[(s,a)] = 0.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "We define a function called update_q_table which will update the Q values according to our Q learning update rule. \n", "\n", "If you look at the below function, we take the value which has maximum value for a state-action pair and store it in a variable called qa, then we update the Q value of the preivous state by our update rule." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def update_q_table(prev_state, action, reward, nextstate, alpha, gamma):\n", " \n", " qa = max([q[(nextstate, a)] for a in range(env.action_space.n)])\n", " q[(prev_state,action)] += alpha * (reward + gamma * qa - q[(prev_state,action)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", "Then, we define a function for performing epsilon-greedy policy. In epsilon-greedy policy, either we select best action with probability 1-epsilon or we explore new action with probability epsilon. 
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def epsilon_greedy_policy(state, epsilon):\n", " if random.uniform(0,1) < epsilon:\n", " return env.action_space.sample()\n", " else:\n", " return max(list(range(env.action_space.n)), key = lambda x: q[(state,x)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Now we initialize necessary variables\n", "\n", "alpha - TD learning rate\n", "\n", "gamma - discount factor
\n", "epsilon - epsilon value in epsilon greedy policy" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "alpha = 0.4\n", "gamma = 0.999\n", "epsilon = 0.017" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, Let us perform Q Learning!!!!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [], "source": [ "for i in range(8000):\n", " r = 0\n", " \n", " prev_state = env.reset()\n", " \n", " while True:\n", " \n", " \n", " env.render()\n", " \n", " # In each state, we select the action by epsilon-greedy policy\n", " action = epsilon_greedy_policy(prev_state, epsilon)\n", " \n", " # then we perform the action and move to the next state, and receive the reward\n", " nextstate, reward, done, _ = env.step(action)\n", " \n", " # Next we update the Q value using our update_q_table function\n", " # which updates the Q value by Q learning update rule\n", " \n", " update_q_table(prev_state, action, reward, nextstate, alpha, gamma)\n", " \n", " # Finally we update the previous state as next state\n", " prev_state = nextstate\n", "\n", " # Store all the rewards obtained\n", " r += reward\n", "\n", " #we will break the loop, if we are at the terminal state of the episode\n", " if done:\n", " break\n", "\n", " print(\"total reward: \", r)\n", "\n", "env.close()" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:universe]", "language": "python", "name": "conda-env-universe-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" } }, "nbformat": 4, "nbformat_minor": 2 }