{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Solving the Taxi Problem Using SARSA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Goal:\n",
    "\n",
    "Say our agent is the driving the taxi. There are totally four locations and the agent has to\n",
    "pick up a passenger at one location and drop at the another. The agent will receive +20\n",
    "points as a reward for successful drop off and -1 point for every time step it takes. The agent\n",
    "will also lose -10 points for illegal pickups and drops. So the goal of our agent is to learn to\n",
    "pick up and drop passengers at the correct location in a short time without boarding any illegal\n",
    "passengers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we import all necessary libraries and initialize the environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[2018-06-01 12:23:17,082] Making new env: Taxi-v1\n"
     ]
    }
   ],
   "source": [
    "import random\n",
    "import gym\n",
    "env = gym.make('Taxi-v1')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The environment is shown below, where the letters (R, G, Y, B) represents the different\n",
    "locations and a tiny yellow colored rectangle is the taxi driving by our agent."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------+\n",
      "|\u001b[35m\u001b[34;1mR\u001b[0m\u001b[0m: | : :G|\n",
      "| : : : : |\n",
      "| : : : : |\n",
      "| | : | : |\n",
      "|Y| : |B:\u001b[43m \u001b[0m|\n",
      "+---------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "env.render()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "Now, we initialize, Q table has a dictionary which stores state-action pair specifying value of performing an action in\n",
    "a state s."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Q = {}\n",
    "for s in range(env.observation_space.n):\n",
    "    for a in range(env.action_space.n):\n",
    "        Q[(s,a)] = 0.0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Then, we define a function for performing epsilon-greedy policy. In epsilon-greedy policy, either we select best action with probability 1-epsilon or we explore new action with probability epsilon"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def epsilon_greedy(state, epsilon):\n",
    "    if random.uniform(0,1) < epsilon:\n",
    "        return env.action_space.sample()\n",
    "    else:\n",
    "        return max(list(range(env.action_space.n)), key = lambda x: Q[(state,x)])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Now we initialize necessary variables\n",
    "\n",
    "alpha - TD learning rate\n",
    "\n",
    "gamma - discount factor <br>\n",
    "epsilon - epsilon value in epsilon greedy policy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "alpha = 0.85\n",
    "gamma = 0.90\n",
    "epsilon = 0.8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we perform SARSA!!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i in range(4000):\n",
    "    \n",
    "    # we store cumulative reward of each episodes in r\n",
    "    r = 0\n",
    "    \n",
    "    # initialize the state,\n",
    "    state = env.reset()\n",
    "    \n",
    "    # select the action using epsilon-greedy policy\n",
    "    action = epsilon_greedy(state,epsilon)\n",
    "    \n",
    "    while True:\n",
    "       \n",
    "        # env.render()\n",
    "        \n",
    "        # then we perform the action and move to the next state, and receive the reward\n",
    "        nextstate, reward, done, _ = env.step(action)\n",
    "        \n",
    "        # again, we select the next action using epsilon greedy policy\n",
    "        nextaction = epsilon_greedy(nextstate,epsilon) \n",
    "    \n",
    "        # we calculate the Q value of previous state using our update rule\n",
    "        Q[(state,action)] += alpha * (reward + gamma * Q[(nextstate,nextaction)]-Q[(state,action)])\n",
    "\n",
    "        # finally we update our state and action with next action and next state\n",
    "        action = nextaction\n",
    "        state = nextstate\n",
    "        \n",
    "        # store the rewards\n",
    "        r += reward\n",
    "        \n",
    "        # we will break the loop, if we are at the terminal state of the episode\n",
    "        if done:\n",
    "            break\n",
    "            \n",
    "    print(\"total reward: \", r)\n",
    "\n",
    "env.close()\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:universe]",
   "language": "python",
   "name": "conda-env-universe-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}