{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bandits - part 2\n",
    "\n",
    "In the exercise, we will investigate in more details the properties of the bandit algorithms implemented last time and investigate reinforcement comparison.\n",
    "\n",
    "**Q:** Start by copying all class definitions of the last exercise (Bandit, Greedy, $\\epsilon$-Greedy, softmax) and re-run the experiments with correct values for the parameters in a single cell. We will ignore exploration scheduling (although we should not)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reward distribution\n",
    "\n",
    "We are now going to vary the reward distributions and investigate whether the experimental results we had previously when the true Q-values are in $\\mathcal{N}(0, 1)$ and the rewards have a variance of 1 still hold."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Q:** Let's now change the distribution of true Q-values from $\\mathcal{N}(0, 1)$ to $\\mathcal{N}(10, 10)$ when creating the bandits and re-run the algorithms. What happens and why? Modify the values of `epsilon` and `tau` to try to get a better behavior."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Optimistic initialization\n",
    "\n",
    "The initial estimates of 0 are now very **pessimistic** compared to the average reward you can get (between 10 and 20). This was not the case in the original setup.\n",
    "\n",
    "**Q:** Modify the classes so that they accept a parameter `Q_init` allowing to intialize the estimates `Q_t` to something different from 0. Change the initial value of the estimates to 10 for each algorithm. What happens? Conclude on the importance of reward scaling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now use **optimistic initialization**, i.e. initialize the estimates to a much higher value than what is realistic.\n",
    "\n",
    "**Q:** Implement optimistic initialization by initializing the estimates of all three algorithms to 25. What happens?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reinforcement comparison\n",
    "\n",
    "The problem with the previous **value-based** methods is that the Q-value estimates depend on the absolute magnitude of the rewards (by definition). The hyperparameters of the learning algorithms (learning rate, exploration, initial values) will therefore be very different depending on the scaling of the rewards (between 0 and 1, between -100 and 100, etc).\n",
    "\n",
    "A way to get rid of this dependency is to introduce **preferences** $p_t(a)$ for each action instead of the estimated Q-values. Preferences should follow the Q-values: an action with a high Q-value should have a high Q-value and vice versa, but we do not care about its exact scaling.\n",
    "\n",
    "In **reinforcement comparison**, we introduce a baseline $\\tilde{r}_t$ which is the average received reward **regardless the action**, i.e. there is a single value for the whole problem. This average reward is simply updated after each action with a moving average of the received rewards:\n",
    "\n",
    "$$\\tilde{r}_{t+1} = \\tilde{r}_{t} + \\alpha \\, (r_t - \\tilde{r}_{t})$$\n",
    "\n",
    "The average reward is used to update the preference for the action that was just executed:\n",
    "\n",
    "$$p_{t+1}(a_t) = p_{t}(a_t) + \\beta \\, (r_t - \\tilde{r}_{t})$$\n",
    "\n",
    "If the action lead to more reward than usual, its preference should be increased (good surprise). If the action lead to less reward than usual, its preference should be decreased (bad surprise).\n",
    "\n",
    "Action selection is simply a softmax over the preferences, without the temperature parameter (as we do not care about the scaling):\n",
    "\n",
    "$$\n",
    "    \\pi (a) = \\frac{\\exp p_t(a)}{ \\sum_b \\exp p_t(b)}\n",
    "$$ \n",
    "\n",
    "**Q:** Implement reinforcement comparison (with $\\alpha=\\beta=0.1$) and compare it to the other methods on the default settings. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Q:** Compare all methods with optimistic initialization. The true Q-values come from $\\mathcal{N}(10, 10)$, the estimated Q-values are initialized to 20 for greedy, $\\epsilon$-greedy and softmax, and the average reward is initialized to 20 for RC (the preferences are initialized at 0).  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bonus questions\n",
    "\n",
    "At this point, you should have understood the main concepts of bandit algorithm. If you have time and interest, you can also implement Gradient Bandit (reinforcement comparison where all actions are updated, not just the one which was executed) and UCB (upper-bound confidence). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}