{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Multi Armed Bandit Problem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Multi-armed bandit (MAB) problem is one of the classical problems in reinforcement learning. A multi-armed bandit is actually a slot machine, a gambling game played in a casino where you pull the arm(lever) and get a payout(reward) based on some randomly generated probability distribution. A single slot machine is called one-armed bandit and when there are multiple slot machines it is called as multi-armed bandits or k-armed bandits. Multi-armed bandits are shown below," ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![title](images/B09792_06_01.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As each slot machine gives us the reward from its own probability distribution, our goal is to find out which slot machine will give us the maximum cumulative reward over a sequence of time.So at each time step t, the agent performs an action i.e pulls an arm from the slot machine and receives a reward and goal of our agent is to maximize the cumulative reward. \n", "\n", "We define the value of an arm Q(a) as average rewards received by pulling the arm, " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$Q(a) = \\frac{Sum \\,of \\,rewards \\,\\,received \\,from \\,the \\,arm}{Total\\, number \\,of \\,times \\,the \\,arm \\,was \\,pulled} $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the optimal arm is the one which gives us maximum cumulative reward i.e " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$Q(a^*)= Max \\; Q(a) $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of our agent is to find the optimal arm and also to minimize the regret which can be defined as the cost of knowing which of the k arms is optimal. Now how do we find the best arm? Whether we should explore all the arms or choose the arm which already gives us a maximum cumulative reward? Here comes exploration-exploitation dilemma. Now we will see how to solve this dilemma using various exploration strategies as follows,\n", "\n", "1. Epsilon-greedy policy\n", "2. Softmax exploration\n", "3. Upper Confidence bound algorithm\n", "4. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "First, let us import the libraries:" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[2018-05-23 17:43:16,420] Making new env: BanditTenArmedGaussian-v0\n" ] } ], "source": [ "import gym_bandits\n", "import gym\n", "import numpy as np\n", "import math\n", "import random\n", "env = gym.make(\"BanditTenArmedGaussian-v0\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Epsilon-Greedy Policy" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def epsilon_greedy(epsilon):\n", "\n", "    rand = np.random.random()\n", "    if rand < epsilon:\n", "        # explore: pick a random arm\n", "        action = env.action_space.sample()\n", "    else:\n", "        # exploit: pick the arm with the highest Q value\n", "        action = np.argmax(Q)\n", "\n", "    return action" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Now, let us initialize all the necessary variables:" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# number of rounds (iterations)\n", "num_rounds = 20000\n", "\n", "# count of the number of times each arm was pulled\n", "count = np.zeros(10)\n", "\n", "# sum of rewards of each arm\n", "sum_rewards = np.zeros(10)\n", "\n", "# Q value, which is the average reward of each arm\n", "Q = np.zeros(10)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now, let us start pulling the arms!" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The optimal arm is 1\n" ] } ], "source": [ "for i in range(num_rounds):\n", "\n", "    # select the arm using epsilon-greedy\n", "    arm = epsilon_greedy(0.5)\n", "\n", "    # get the reward\n", "    observation, reward, done, info = env.step(arm)\n", "\n", "    # update the count of that arm\n", "    count[arm] += 1\n", "\n", "    # sum the rewards obtained from the arm\n", "    sum_rewards[arm] += reward\n", "\n", "    # calculate the Q value, which is the average reward of the arm\n", "    Q[arm] = sum_rewards[arm]/count[arm]\n", "\n", "print('The optimal arm is {}'.format(np.argmax(Q)))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Softmax Exploration" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def softmax(tau):\n", "\n", "    # convert the Q values into selection probabilities (Boltzmann distribution)\n", "    total = sum([math.exp(val/tau) for val in Q])\n", "    probs = [math.exp(val/tau)/total for val in Q]\n", "\n", "    # sample an arm according to these probabilities\n", "    threshold = random.random()\n", "    cumulative_prob = 0.0\n", "    for i in range(len(probs)):\n", "        cumulative_prob += probs[i]\n", "        if (cumulative_prob > threshold):\n", "            return i\n", "    return np.argmax(probs)" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# number of rounds (iterations)\n", "num_rounds = 20000\n", "\n", "# count of the number of times each arm was pulled\n", "count = np.zeros(10)\n", "\n", "# sum of rewards of each arm\n", "sum_rewards = np.zeros(10)\n", "\n", "# Q value, which is the average reward of each arm\n", "Q = np.zeros(10)" ] },
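{ "cell_type": "markdown", "metadata": {}, "source": [ "Before running the softmax exploration loop below, here is a quick illustration of how the temperature tau controls exploration. The Q values in this snippet are made-up numbers (not the ones we just initialized): a high tau makes the arm probabilities nearly uniform (more exploration), while a low tau concentrates the probability on the arm with the highest Q value (more exploitation)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# made-up Q values, purely to illustrate the effect of the temperature tau\n", "demo_Q = [1.0, 2.0, 1.5]\n", "\n", "for tau in [0.1, 1.0, 10.0]:\n", "    total = sum(math.exp(val/tau) for val in demo_Q)\n", "    probs = [math.exp(val/tau)/total for val in demo_Q]\n", "    print('tau = {:5.1f} -> {}'.format(tau, [round(p, 3) for p in probs]))" ] },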
\n", " # update the count of that arm\n", " count[arm] += 1\n", " \n", " # Sum the rewards obtained from the arm\n", " sum_rewards[arm]+=reward\n", " \n", " # calculate Q value which is the average rewards of the arm\n", " Q[arm] = sum_rewards[arm]/count[arm]\n", " \n", "print( 'The optimal arm is {}'.format(np.argmax(Q)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Upper Confidence Bound" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def UCB(iters):\n", " \n", " ucb = np.zeros(10)\n", " \n", " #explore all the arms\n", " if iters < 10:\n", " return i\n", " \n", " else:\n", " for arm in range(10):\n", " \n", " # calculate upper bound\n", " upper_bound = math.sqrt((2*math.log(sum(count))) / count[arm])\n", " \n", " # add upper bound to the Q valyue\n", " ucb[arm] = Q[arm] + upper_bound\n", " \n", " # return the arm which has maximum value\n", " return (np.argmax(ucb))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# number of rounds (iterations)\n", "num_rounds = 20000\n", "\n", "# Count of number of times an arm was pulled\n", "count = np.zeros(10)\n", "\n", "# Sum of rewards of each arm\n", "sum_rewards = np.zeros(10)\n", "\n", "# Q value which is the average reward\n", "Q = np.zeros(10)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The optimal arm is 1\n" ] } ], "source": [ "for i in range(num_rounds):\n", " \n", " # Select the arm using UCB\n", " arm = UCB(i)\n", " \n", " # Get the reward\n", " observation, reward, done, info = env.step(arm) \n", " \n", " # update the count of that arm\n", " count[arm] += 1\n", " \n", " # Sum the rewards obtained from the arm\n", " sum_rewards[arm]+=reward\n", " \n", " # calculate Q value which is the average rewards of the arm\n", " Q[arm] = sum_rewards[arm]/count[arm]\n", " \n", "print( 'The optimal arm is {}'.format(np.argmax(Q)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Thompson Sampling " ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def thompson_sampling(alpha,beta):\n", " \n", " samples = [np.random.beta(alpha[i]+1,beta[i]+1) for i in range(10)]\n", "\n", " return np.argmax(samples)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# number of rounds (iterations)\n", "num_rounds = 20000\n", "\n", "# Count of number of times an arm was pulled\n", "count = np.zeros(10)\n", "\n", "# Sum of rewards of each arm\n", "sum_rewards = np.zeros(10)\n", "\n", "# Q value which is the average reward\n", "Q = np.zeros(10)\n", "\n", "# initialize alpha and beta values\n", "alpha = np.ones(10)\n", "beta = np.ones(10)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The optimal arm is 1\n" ] } ], "source": [ "for i in range(num_rounds):\n", " \n", " # Select the arm using thompson sampling\n", " arm = thompson_sampling(alpha,beta)\n", " \n", " # Get the reward\n", " observation, reward, done, info = env.step(arm) \n", " \n", " # update the count of that arm\n", " count[arm] += 1\n", " \n", " # Sum the rewards obtained from the arm\n", " sum_rewards[arm]+=reward\n", " \n", " # calculate Q value which is the average rewards of the arm\n", " Q[arm] = sum_rewards[arm]/count[arm]\n", 
"\n", " # If it is a positive reward increment alpha\n", " if reward >0:\n", " alpha[arm] += 1\n", " \n", " # If it is a negative reward increment beta\n", " else:\n", " beta[arm] += 1\n", " \n", "print( 'The optimal arm is {}'.format(np.argmax(Q)))" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:universe]", "language": "python", "name": "conda-env-universe-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" } }, "nbformat": 4, "nbformat_minor": 2 }