{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Sampling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this first exercise, we will investigate how to evaluate the Q-value of each action available in a 5-armed bandit. It is mostly to give you intuition about the limits of sampling and the central limit theorem.\n", "\n", "Let's start with importing numpy and matplotlib:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sampling a n-armed bandit\n", "\n", "Let's now create the n-armed bandit. The only thing we need to do is to randomly choose 5 true Q-values $Q^*(a)$.\n", "\n", "![](../img/bandit-example.png)\n", "\n", "To be generic, let's define nb_actions=5 and create an array corresponding to the index of each action (0, 1, 2, 3, 4) for plotting purpose." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nb_actions = 5\n", "actions = np.arange(nb_actions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Create a numpy array Q_star with nb_actions values, normally distributed with a mean of 0 and standard deviation of 1 (as in the lecture). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Plot the Q-values. Identify the optimal action $a^*$.\n", "\n", "*Tip:* you could plot the array Q_star with plt.plot, but that would be ugly. Check the documentation of the plt.bar method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great, now let's start evaluating these Q-values with random sampling.\n", "\n", "**Q:** Define an action sampling method get_reward taking as arguments:\n", "* The array Q_star.\n", "* The index a of the action you want to sample (between 0 and 4).\n", "* An optional variance argument var, which should have the value 1.0 by default.\n", " \n", "It should return a single value, sampled from the normal distribution with mean Q_star[a] and variance var." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** For each possible action a, take nb_samples=10 out of the reward distribution and store them in a numpy array. Compute the mean of the samples for each action separately in a new array Q_t. Make a bar plot of these estimated Q-values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Make a bar plot of the difference between the true values Q_star and the estimates Q_t. Conclude. Re-run the sampling cell with different numbers of samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** To better understand the influence of the number of samples on the accuracy of the sample average, create a for loop over the preceding code, with a number of samples increasing from 1 to 100. For each value, compute the **mean square error** (mse) between the estimates Q_t and the true values Q^*.\n", "\n", "The mean square error is simply defined over the N = nb_actions actions as:\n", "\n", "$$\\epsilon = \\frac{1}{N} \\, \\sum_{a=0}^{N-1} (Q_t(a) - Q^*(a))^2$$\n", "\n", "At the end of the loop, plot the evolution of the mean square error with the number of samples. You can append each value of the mse in an empty list and then plot it with plt.plot, for example. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The plot should give you an indication of how many samples you at least need to correctly estimate each action (30 or so). But according to the central limit theorem (CLT), the variance of the sample average also varies with the variance of the distribution itself.\n", "\n", "> The distribution of sample averages is normally distributed with mean $\\mu$ and variance $\\frac{\\sigma^2}{N}$.\n", "\n", "$$S_N \\sim \\mathcal{N}(\\mu, \\frac{\\sigma}{\\sqrt{N}})$$\n", "\n", "**Q:** Vary the variance of the reward distribution (as an argument to get_reward) and re-run the previous experiment. Do not hesitate to take more samples. Conclude." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bandit environment\n", "\n", "In order to prepare the next exercise, let's now implement the n-armed bandit in a Python class. As reminded in the tutorial on Python, a class is defined using this structure:\n", "\n", "python\n", "class MyClass:\n", " \"\"\"\n", " Documentation of the class.\n", " \"\"\"\n", " def __init__(self, param1, param2):\n", " \"\"\"\n", " Constructor of the class.\n", " \n", " :param param1: first parameter.\n", " :param param2: second parameter.\n", " \"\"\"\n", " self.param1 = param1\n", " self.param2 = param2\n", " \n", " def method(self, another_param):\n", " \"\"\"\n", " Method to do something.\n", " \n", " :param another_param: another parameter.\n", " \"\"\"\n", " return (another_param + self.param1)/self.param2\n", "\n", "\n", "You can then create an object of the type MyClass:\n", "\n", "python\n", "my_object = MyClass(param1= 1.0, param2=2.0)\n", "\n", "\n", "and call any method of the class on the object:\n", "\n", "python\n", "result = my_object.method(3.0)\n", "\n", "\n", "**Q:** Create a Bandit class taking as arguments:\n", "\n", "* nb_actions: number of arms.\n", "* mean: mean of the normal distribution for $Q^*$.\n", "* std_Q: standard deviation of the normal distribution for $Q^*$.\n", "* std_r: standard deviation of the normal distribution for the sampled rewards.\n", "\n", "The constructor should initialize a Q_star array accordingly and store it as an attribute. It should also store the optimal action.\n", "\n", "Add a method step(action) that samples a reward for a particular action and returns it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Create a 5-armed bandits and sample each action multiple times. Compare the mean reward to the ground truth as before." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.12 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "vscode": { "interpreter": { "hash": "3d24234067c217f49dc985cbc60012ce72928059d528f330ba9cb23ce737906d" } } }, "nbformat": 4, "nbformat_minor": 4 }