# Sampling

In this first exercise, we will investigate how to evaluate the Q-value of each action available in a 5-armed bandit. The main goal is to build intuition about the limits of sampling and about the central limit theorem.

Let's start with importing numpy and matplotlib:

In [None]:
import numpy as np
import matplotlib.pyplot as plt 

## Sampling an n-armed bandit

Let's now create the n-armed bandit. The only thing we need to do is to randomly choose 5 true Q-values $Q^*(a)$.

![](../img/bandit-example.png)

To be generic, let's define `nb_actions=5` and create an array containing the index of each action (0, 1, 2, 3, 4) for plotting purposes.

In [None]:
nb_actions = 5
actions = np.arange(nb_actions)

**Q:** Create a numpy array `Q_star` with `nb_actions` values, normally distributed with a mean of 0 and standard deviation of 1 (as in the lecture). 

**Q:** Plot the Q-values. Identify the optimal action $a^*$.

*Tip:* you could plot the array `Q_star` with `plt.plot`, but that would be ugly. Check the documentation of the `plt.bar` method.
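One possible way to approach the two questions above is sketched here. The seeded generator `rng` and the name `a_star` are assumptions of this sketch, not part of the exercise statement.

```python
import numpy as np

# Assumption: a seeded generator, only so that the sketch is reproducible.
rng = np.random.default_rng(42)

nb_actions = 5
actions = np.arange(nb_actions)

# True Q-values: one draw per action from a normal distribution
# with mean 0 and standard deviation 1.
Q_star = rng.normal(loc=0.0, scale=1.0, size=nb_actions)

# The optimal action is the index of the largest true Q-value.
a_star = int(np.argmax(Q_star))

print("Q_star:", Q_star)
print("Optimal action:", a_star)
```

Calling `plt.bar(actions, Q_star)` afterwards draws one bar per action, which makes the optimal action easy to spot.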

Great, now let's start evaluating these Q-values with random sampling.

**Q:** Define an action sampling method `get_reward` taking as arguments:
* The array `Q_star`.
* The index `a` of the action you want to sample (between 0 and 4).
* An optional variance argument `var`, which should have the value 1.0 by default.
 
It should return a single value, sampled from the normal distribution with mean `Q_star[a]` and variance `var`.
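A minimal sketch of such a function is given below. Note that NumPy's normal sampler expects a standard deviation as its `scale` argument, so the variance has to be converted with a square root. The module-level generator `rng` and the values in `Q_star` are illustrative assumptions.

```python
import numpy as np

# Assumption: a module-level seeded generator shared by all calls.
rng = np.random.default_rng(0)

def get_reward(Q_star, a, var=1.0):
    """Sample a single reward for action `a` from N(Q_star[a], var)."""
    # `scale` is a standard deviation, hence the square root of the variance.
    return rng.normal(loc=Q_star[a], scale=np.sqrt(var))

# Illustrative true Q-values, chosen by hand for this sketch only.
Q_star = np.array([0.5, -1.0, 2.0, 0.0, 1.5])

r = get_reward(Q_star, 2)
print("One reward sampled for action 2:", r)
```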

**Q:** For each possible action `a`, draw `nb_samples=10` rewards from the reward distribution and store them in a numpy array. Compute the mean of the samples for each action separately in a new array `Q_t`. Make a bar plot of these estimated Q-values.
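The sampling step could look like the sketch below. It is a vectorized variant that draws all rewards in one call instead of looping over `get_reward`; the array name `rewards` and the seed are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)  # assumption: seeded for reproducibility

nb_actions = 5
nb_samples = 10
var = 1.0

# True Q-values, drawn as in the first question.
Q_star = rng.normal(0.0, 1.0, size=nb_actions)

# One row per action, one column per sample: row a is drawn from
# N(Q_star[a], var) thanks to broadcasting of the column vector of means.
rewards = rng.normal(
    loc=Q_star[:, None], scale=np.sqrt(var), size=(nb_actions, nb_samples)
)

# Sample average of each row gives the estimated Q-value of each action.
Q_t = rewards.mean(axis=1)

print("True values:", Q_star)
print("Estimates:  ", Q_t)
```

A loop calling `get_reward` `nb_samples` times per action gives statistically equivalent estimates; the vectorized form is only shorter.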

**Q:** Make a bar plot of the difference between the true values `Q_star` and the estimates `Q_t`. What do you conclude? Re-run the sampling cell with different numbers of samples.

**Q:** To better understand the influence of the number of samples on the accuracy of the sample average, create a `for` loop over the preceding code, with a number of samples increasing from 1 to 100. For each value, compute the **mean square error** (mse) between the estimates `Q_t` and the true values `Q_star`.

The mean square error is simply defined over the `N = nb_actions` actions as:

$$\epsilon = \frac{1}{N} \, \sum_{a=0}^{N-1} (Q_t(a) - Q^*(a))^2$$

At the end of the loop, plot the evolution of the mean square error with the number of samples. For example, you can append each value of the mse to an initially empty list and then plot that list with `plt.plot`.
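The loop described above might be sketched as follows. The list name `errors`, the seed and the vectorized sampling are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)  # assumption: seeded for reproducibility

nb_actions = 5
var = 1.0
Q_star = rng.normal(0.0, 1.0, size=nb_actions)

max_samples = 100
errors = []  # one mean square error per number of samples

for nb_samples in range(1, max_samples + 1):
    # Draw nb_samples rewards per action and average them.
    rewards = rng.normal(
        Q_star[:, None], np.sqrt(var), size=(nb_actions, nb_samples)
    )
    Q_t = rewards.mean(axis=1)

    # Mean square error over the actions.
    mse = np.mean((Q_t - Q_star) ** 2)
    errors.append(mse)

print("MSE with 1 sample:   ", errors[0])
print("MSE with 100 samples:", errors[-1])
```

Plotting `errors` against `range(1, max_samples + 1)` with `plt.plot` shows a noisy curve that decreases quickly for small numbers of samples and flattens afterwards.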

The plot should give you an indication of the minimum number of samples needed to estimate each action correctly (30 or so). But according to the central limit theorem (CLT), the variance of the sample average also depends on the variance of the distribution itself.

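> The distribution of sample averages is normally distributed with mean $\mu$ and variance $\frac{\sigma^2}{N}$.

$$S_N \sim \mathcal{N}(\mu, \frac{\sigma}{\sqrt{N}})$$

Here the second parameter of $\mathcal{N}$ is the standard deviation $\frac{\sigma}{\sqrt{N}}$, whose square is the variance $\frac{\sigma^2}{N}$, and $N$ denotes the number of samples, not the number of actions as in the mean square error above.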
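The scaling of the variance with the number of samples can be checked empirically. The sketch below repeats the averaging experiment many times and compares the variance of the resulting sample averages with $\frac{\sigma^2}{N}$; the values of `mu`, `sigma` and `nb_repeats` are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)  # assumption: seeded for reproducibility

mu, sigma = 0.5, 2.0   # illustrative mean and standard deviation
nb_repeats = 10_000    # number of independent sample averages per setting

empirical = {}
for N in (1, 10, 100):
    # Each row is one experiment with N rewards; average along each row.
    sample_means = rng.normal(mu, sigma, size=(nb_repeats, N)).mean(axis=1)
    empirical[N] = sample_means.var()
    print(f"N={N:3d}  empirical variance={empirical[N]:.4f}  "
          f"sigma^2/N={sigma**2 / N:.4f}")
```

The two columns should agree up to sampling noise, which is why a reward distribution with a larger variance needs proportionally more samples to reach the same accuracy.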

**Q:** Vary the variance of the reward distribution (as an argument to `get_reward`) and re-run the previous experiment. Feel free to take more samples. What do you conclude?

## Bandit environment

To prepare for the next exercise, let's now implement the n-armed bandit as a Python class. As a reminder from the Python tutorial, a class is defined using this structure:

```python
class MyClass:
    """
    Documentation of the class.
    """
    def __init__(self, param1, param2):
        """
        Constructor of the class.

        :param param1: first parameter.
        :param param2: second parameter.
        """
        self.param1 = param1
        self.param2 = param2

    def method(self, another_param):
        """
        Method to do something.

        :param another_param: another parameter.
        """
        return (another_param + self.param1) / self.param2
```

You can then create an object of the type `MyClass`:

```python
my_object = MyClass(param1=1.0, param2=2.0)
```

and call any method of the class on the object:

```python
result = my_object.method(3.0)
```

**Q:** Create a `Bandit` class taking as arguments:

* `nb_actions`: number of arms.
* `mean`: mean of the normal distribution for $Q^*$.
* `std_Q`: standard deviation of the normal distribution for $Q^*$.
* `std_r`: standard deviation of the normal distribution for the sampled rewards.

The constructor should initialize a `Q_star` array accordingly and store it as an attribute. It should also store the optimal action.

Add a method `step(action)` that samples a reward for a particular action and returns it.
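A possible skeleton for this class is sketched below, followed by a short comparison with the ground truth. The default values, the extra `seed` argument and the attribute names `rng` and `a_star` are assumptions of this sketch; the exercise only fixes the four constructor arguments, the `Q_star` attribute and the `step` method.

```python
import numpy as np

class Bandit:
    """n-armed bandit with normally distributed rewards."""

    def __init__(self, nb_actions, mean=0.0, std_Q=1.0, std_r=1.0, seed=None):
        # `seed` and the default values are additions of this sketch.
        self.nb_actions = nb_actions
        self.std_r = std_r
        self.rng = np.random.default_rng(seed)

        # True Q-values and the index of the best one.
        self.Q_star = self.rng.normal(mean, std_Q, size=nb_actions)
        self.a_star = int(np.argmax(self.Q_star))

    def step(self, action):
        """Sample one reward for `action` from N(Q_star[action], std_r**2)."""
        return float(self.rng.normal(self.Q_star[action], self.std_r))


bandit = Bandit(nb_actions=5, seed=0)

# Sample every action many times and average the rewards.
nb_samples = 1000
Q_t = np.array([
    np.mean([bandit.step(a) for _ in range(nb_samples)])
    for a in range(bandit.nb_actions)
])

print("True values:", bandit.Q_star)
print("Estimates:  ", Q_t)
print("Optimal action:", bandit.a_star)
```

Unlike `get_reward`, this sketch is parameterized by the standard deviation `std_r` rather than by a variance, so no square root is needed when sampling.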

**Q:** Create a 5-armed bandit and sample each action multiple times. Compare the mean reward to the ground truth as before.