{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Solving the Taxi Problem Using SARSA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Goal:\n", "\n", "Say our agent is driving the taxi. There are four locations in total, and the agent has to\n", "pick up a passenger at one location and drop them off at another. The agent receives +20\n", "points as a reward for a successful drop-off and -1 point for every time step it takes. The agent\n", "also loses 10 points for each illegal pickup or drop-off. So the goal of our agent is to learn to\n", "pick up and drop off passengers at the correct locations in as few steps as possible, without attempting any illegal\n", "pickups or drop-offs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we import the necessary libraries and initialize the environment." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[2018-06-01 12:23:17,082] Making new env: Taxi-v1\n" ] } ], "source": [ "import random\n", "import gym\n", "env = gym.make('Taxi-v1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The environment is shown below, where the letters (R, G, Y, B) represent the different\n", "locations and the small yellow rectangle is the taxi driven by our agent." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------+\n", "|\u001b[35m\u001b[34;1mR\u001b[0m\u001b[0m: | : :G|\n", "| : : : : |\n", "| : : : : |\n", "| | : | : |\n", "|Y| : |B:\u001b[43m \u001b[0m|\n", "+---------+\n", "\n" ] } ], "source": [ "env.render()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Now we initialize the Q table as a dictionary that maps each state-action pair (s, a) to the value of performing action a in\n", "state s."
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "Q = {}\n", "for s in range(env.observation_space.n):\n", "    for a in range(env.action_space.n):\n", "        Q[(s,a)] = 0.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Then, we define a function that implements the epsilon-greedy policy. Under this policy, we select the best-known action with probability 1-epsilon and explore a random action with probability epsilon." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def epsilon_greedy(state, epsilon):\n", "    if random.uniform(0,1) < epsilon:\n", "        # explore: sample a random action\n", "        return env.action_space.sample()\n", "    else:\n", "        # exploit: pick the action with the highest Q value in this state\n", "        return max(range(env.action_space.n), key = lambda x: Q[(state,x)])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Now we initialize the necessary variables:\n", "\n", "alpha - TD learning rate\n", "\n", "gamma - discount factor
\n", "epsilon - exploration probability used by the epsilon-greedy policy" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "alpha = 0.85\n", "gamma = 0.90\n", "epsilon = 0.8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we perform SARSA!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i in range(4000):\n", "\n", "    # r accumulates the total reward of the current episode\n", "    r = 0\n", "\n", "    # reset the environment to get the initial state\n", "    state = env.reset()\n", "\n", "    # select the first action using the epsilon-greedy policy\n", "    action = epsilon_greedy(state,epsilon)\n", "\n", "    while True:\n", "\n", "        # env.render()\n", "\n", "        # perform the action, move to the next state, and receive the reward\n", "        nextstate, reward, done, _ = env.step(action)\n", "\n", "        # select the next action, again using the epsilon-greedy policy\n", "        nextaction = epsilon_greedy(nextstate,epsilon)\n", "\n", "        # update the Q value of the current state-action pair using the SARSA update rule\n", "        Q[(state,action)] += alpha * (reward + gamma * Q[(nextstate,nextaction)]-Q[(state,action)])\n", "\n", "        # finally, the next state and next action become the current ones\n", "        action = nextaction\n", "        state = nextstate\n", "\n", "        # add the reward to the episode total\n", "        r += reward\n", "\n", "        # break the loop once we reach the terminal state of the episode\n", "        if done:\n", "            break\n", "\n", "    print(\"total reward: \", r)\n", "\n", "env.close()\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:universe]", "language": "python", "name": "conda-env-universe-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" } }, "nbformat": 4, "nbformat_minor": 2 }