{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Multi Armed Bandit Problem"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Multi-armed bandit (MAB) problem is one of the classical problems in reinforcement learning. A multi-armed bandit is actually a slot machine, a gambling game played in a casino where you pull the arm(lever) and get a payout(reward) based on some randomly generated probability distribution.  A single slot machine is called one-armed bandit and when there are multiple slot machines it is called as multi-armed bandits or k-armed bandits. Multi-armed bandits are shown below,"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](images/B09792_06_01.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As each slot machine gives us the reward from its own probability distribution, our goal is to find out which slot machine will give us the maximum cumulative reward over a sequence of time.So at each time step t, the agent performs an action i.e pulls an arm from the slot machine and receives a reward  and goal of our agent is to maximize the cumulative reward. \n",
    "\n",
    "We define the value of an arm Q(a) as average rewards received by pulling the arm, "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$Q(a) = \\frac{Sum \\,of \\,rewards \\,\\,received \\,from \\,the \\,arm}{Total\\, number \\,of \\,times \\,the \\,arm \\,was \\,pulled} $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So the optimal arm is the one which gives us maximum cumulative reward i.e "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$Q(a^*)= Max \\; Q(a) $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The goal of our agent is to find the optimal arm and also to minimize the regret which can be defined as the cost of knowing which of the k arms is optimal. Now how do we find the best arm? Whether we should explore all the arms or choose the arm which already gives us a maximum cumulative reward? Here comes exploration-exploitation dilemma. Now we will see how to solve this dilemma using various exploration strategies as follows,\n",
    "\n",
    "1. Epsilon-greedy policy\n",
    "2. Softmax exploration\n",
    "3. Upper Confidence bound algorithm\n",
    "4. Thomson sampling technique"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First let us import the libraries,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[2018-05-23 17:43:16,420] Making new env: BanditTenArmedGaussian-v0\n"
     ]
    }
   ],
   "source": [
    "import gym_bandits\n",
    "import gym\n",
    "import numpy as np\n",
    "import math\n",
    "import random\n",
    "env = gym.make(\"BanditTenArmedGaussian-v0\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Epsilon-Greedy Policy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def epsilon_greedy(epsilon):\n",
    "    \n",
    "    rand = np.random.random()  \n",
    "    if rand < epsilon:\n",
    "        action =  env.action_space.sample()\n",
    "    else:\n",
    "        action = np.argmax(Q)\n",
    "    \n",
    "    return action"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    " Now, let us initialize all the necessary variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# number of rounds (iterations)\n",
    "num_rounds = 20000\n",
    "\n",
    "# Count of number of times an arm was pulled\n",
    "count = np.zeros(10)\n",
    "\n",
    "# Sum of rewards of each arm\n",
    "sum_rewards = np.zeros(10)\n",
    "\n",
    "# Q value which is the average reward\n",
    "Q = np.zeros(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " Start pulling the arm!!!!!!!!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The optimal arm is 1\n"
     ]
    }
   ],
   "source": [
    "for i in range(num_rounds):\n",
    "    \n",
    "    # Select the arm using epsilon greedy \n",
    "    arm = epsilon_greedy(0.5)\n",
    "    \n",
    "    # Get the reward\n",
    "    observation, reward, done, info = env.step(arm) \n",
    "    \n",
    "    # update the count of that arm\n",
    "    count[arm] += 1\n",
    "    \n",
    "    # Sum the rewards obtained from the arm\n",
    "    sum_rewards[arm]+=reward\n",
    "    \n",
    "    # calculate Q value which is the average rewards of the arm\n",
    "    Q[arm] = sum_rewards[arm]/count[arm]\n",
    "\n",
    "print( 'The optimal arm is {}'.format(np.argmax(Q)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Softmax Exploration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def softmax(tau):\n",
    "    \n",
    "    total = sum([math.exp(val/tau) for val in Q])    \n",
    "    probs = [math.exp(val/tau)/total for val in Q]\n",
    "    \n",
    "    threshold = random.random()\n",
    "    cumulative_prob = 0.0\n",
    "    for i in range(len(probs)):\n",
    "        cumulative_prob += probs[i]\n",
    "        if (cumulative_prob > threshold):\n",
    "            return i\n",
    "    return np.argmax(probs) \n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# number of rounds (iterations)\n",
    "num_rounds = 20000\n",
    "\n",
    "# Count of number of times an arm was pulled\n",
    "count = np.zeros(10)\n",
    "\n",
    "# Sum of rewards of each arm\n",
    "sum_rewards = np.zeros(10)\n",
    "\n",
    "# Q value which is the average reward\n",
    "Q = np.zeros(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The optimal arm is 1\n"
     ]
    }
   ],
   "source": [
    "for i in range(num_rounds):\n",
    "    \n",
    "    # Select the arm using softmax\n",
    "    arm = softmax(0.5)\n",
    "    \n",
    "    # Get the reward\n",
    "    observation, reward, done, info = env.step(arm) \n",
    "    \n",
    "    # update the count of that arm\n",
    "    count[arm] += 1\n",
    "    \n",
    "    # Sum the rewards obtained from the arm\n",
    "    sum_rewards[arm]+=reward\n",
    "    \n",
    "    # calculate Q value which is the average rewards of the arm\n",
    "    Q[arm] = sum_rewards[arm]/count[arm]\n",
    "    \n",
    "print( 'The optimal arm is {}'.format(np.argmax(Q)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Upper Confidence Bound"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def UCB(iters):\n",
    "    \n",
    "    ucb = np.zeros(10)\n",
    "    \n",
    "    #explore all the arms\n",
    "    if iters < 10:\n",
    "        return i\n",
    "    \n",
    "    else:\n",
    "        for arm in range(10):\n",
    "            \n",
    "            # calculate upper bound\n",
    "            upper_bound = math.sqrt((2*math.log(sum(count))) / count[arm])\n",
    "            \n",
    "            # add upper bound to the Q valyue\n",
    "            ucb[arm] = Q[arm] + upper_bound\n",
    "            \n",
    "        # return the arm which has maximum value\n",
    "        return (np.argmax(ucb))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# number of rounds (iterations)\n",
    "num_rounds = 20000\n",
    "\n",
    "# Count of number of times an arm was pulled\n",
    "count = np.zeros(10)\n",
    "\n",
    "# Sum of rewards of each arm\n",
    "sum_rewards = np.zeros(10)\n",
    "\n",
    "# Q value which is the average reward\n",
    "Q = np.zeros(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The optimal arm is 1\n"
     ]
    }
   ],
   "source": [
    "for i in range(num_rounds):\n",
    "    \n",
    "    # Select the arm using UCB\n",
    "    arm = UCB(i)\n",
    "    \n",
    "    # Get the reward\n",
    "    observation, reward, done, info = env.step(arm) \n",
    "    \n",
    "    # update the count of that arm\n",
    "    count[arm] += 1\n",
    "    \n",
    "    # Sum the rewards obtained from the arm\n",
    "    sum_rewards[arm]+=reward\n",
    "    \n",
    "    # calculate Q value which is the average rewards of the arm\n",
    "    Q[arm] = sum_rewards[arm]/count[arm]\n",
    "    \n",
    "print( 'The optimal arm is {}'.format(np.argmax(Q)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Thompson Sampling "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def thompson_sampling(alpha,beta):\n",
    "    \n",
    "    samples = [np.random.beta(alpha[i]+1,beta[i]+1) for i in range(10)]\n",
    "\n",
    "    return np.argmax(samples)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# number of rounds (iterations)\n",
    "num_rounds = 20000\n",
    "\n",
    "# Count of number of times an arm was pulled\n",
    "count = np.zeros(10)\n",
    "\n",
    "# Sum of rewards of each arm\n",
    "sum_rewards = np.zeros(10)\n",
    "\n",
    "# Q value which is the average reward\n",
    "Q = np.zeros(10)\n",
    "\n",
    "# initialize alpha and beta values\n",
    "alpha = np.ones(10)\n",
    "beta = np.ones(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The optimal arm is 1\n"
     ]
    }
   ],
   "source": [
    "for i in range(num_rounds):\n",
    "    \n",
    "    # Select the arm using thompson sampling\n",
    "    arm = thompson_sampling(alpha,beta)\n",
    "    \n",
    "    # Get the reward\n",
    "    observation, reward, done, info = env.step(arm) \n",
    "    \n",
    "    # update the count of that arm\n",
    "    count[arm] += 1\n",
    "    \n",
    "    # Sum the rewards obtained from the arm\n",
    "    sum_rewards[arm]+=reward\n",
    "    \n",
    "    # calculate Q value which is the average rewards of the arm\n",
    "    Q[arm] = sum_rewards[arm]/count[arm]\n",
    "\n",
    "    # If it is a positive reward increment alpha\n",
    "    if reward >0:\n",
    "        alpha[arm] += 1\n",
    "        \n",
    "    # If it is a negative reward increment beta\n",
    "    else:\n",
    "        beta[arm] += 1\n",
    "    \n",
    "print( 'The optimal arm is {}'.format(np.argmax(Q)))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:universe]",
   "language": "python",
   "name": "conda-env-universe-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}