{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Solving the Taxi Problem using Q Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Goal:\n", "\n", "Say our agent is the driving the taxi. There are totally four locations and the agent has to\n", "pick up a passenger at one location and drop at the another. The agent will receive +20\n", "points as a reward for successful drop off and -1 point for every time step it takes. The agent\n", "will also lose -10 points for illegal pickups and drops. So the goal of our agent is to learn to\n", "pick up and drop passengers at the correct location in a short time without boarding any illegal\n", "passengers." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ " First, we import all necessary libraries and simulate the environment" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[2018-05-23 12:23:58,368] Making new env: Taxi-v1\n" ] } ], "source": [ "import random\n", "import gym\n", "env = gym.make('Taxi-v1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " The environment is shown below, where the letters (R, G, Y, B) represents the different\n", "locations and a tiny yellow colored rectangle is the taxi driving by our agent." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------+\n", "|\u001b[34;1mR\u001b[0m: | : :G|\n", "| : : : : |\n", "| : : : : |\n", "| |\u001b[43m \u001b[0m: | : |\n", "|Y| : |\u001b[35mB\u001b[0m: |\n", "+---------+\n", "\n" ] } ], "source": [ "env.render()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Now, we initialize, Q table as a dictionary which stores state-action pair specifying value of performing an action a in\n", " state s." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "q = {}\n", "for s in range(env.observation_space.n):\n", " for a in range(env.action_space.n):\n", " q[(s,a)] = 0.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "We define a function called update_q_table which will update the Q values according to our Q learning update rule. \n", "\n", "If you look at the below function, we take the value which has maximum value for a state-action pair and store it in a variable called qa, then we update the Q value of the preivous state by our update rule." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def update_q_table(prev_state, action, reward, nextstate, alpha, gamma):\n", " \n", " qa = max([q[(nextstate, a)] for a in range(env.action_space.n)])\n", " q[(prev_state,action)] += alpha * (reward + gamma * qa - q[(prev_state,action)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", "Then, we define a function for performing epsilon-greedy policy. In epsilon-greedy policy, either we select best action with probability 1-epsilon or we explore new action with probability epsilon. 
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def epsilon_greedy_policy(state, epsilon):\n", " if random.uniform(0,1) < epsilon:\n", " return env.action_space.sample()\n", " else:\n", " return max(list(range(env.action_space.n)), key = lambda x: q[(state,x)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Now we initialize necessary variables\n", "\n", "alpha - TD learning rate\n", "\n", "gamma - discount factor
\n", "epsilon - epsilon value in epsilon greedy policy" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "alpha = 0.4\n", "gamma = 0.999\n", "epsilon = 0.017" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, Let us perform Q Learning!!!!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [], "source": [ "for i in range(8000):\n", " r = 0\n", " \n", " prev_state = env.reset()\n", " \n", " while True:\n", " \n", " \n", " env.render()\n", " \n", " # In each state, we select the action by epsilon-greedy policy\n", " action = epsilon_greedy_policy(prev_state, epsilon)\n", " \n", " # then we perform the action and move to the next state, and receive the reward\n", " nextstate, reward, done, _ = env.step(action)\n", " \n", " # Next we update the Q value using our update_q_table function\n", " # which updates the Q value by Q learning update rule\n", " \n", " update_q_table(prev_state, action, reward, nextstate, alpha, gamma)\n", " \n", " # Finally we update the previous state as next state\n", " prev_state = nextstate\n", "\n", " # Store all the rewards obtained\n", " r += reward\n", "\n", " #we will break the loop, if we are at the terminal state of the episode\n", " if done:\n", " break\n", "\n", " print(\"total reward: \", r)\n", "\n", "env.close()" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:universe]", "language": "python", "name": "conda-env-universe-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" } }, "nbformat": 4, "nbformat_minor": 2 }