# _TriScale_ - Experiment Sizing

> This notebook is intended for **live tutorial** sessions about _TriScale._  
Here is the [self-study version](tutorial_exp-sizing.ipynb).

To get started, we need to import a few Python modules. All the _TriScale_-specific functions are part of one module called `triscale`.

In [None]:
import os
from pathlib import Path

import pandas as pd
import numpy as np
import plotly.graph_objects as go

import triscale

## Basics

_TriScale_'s `experiment_sizing()` function computes the minimal number of samples required to estimate a given percentile with a given confidence level.

In [None]:
percentile = 50 # the median
confidence = 95 # the confidence level, in %

triscale.experiment_sizing(
    percentile, 
    confidence,
    verbose=True); 

We can change the values in the cell above to see how the number of samples evolves with a larger confidence level or more extreme percentiles. 

Note that the problem is symmetric: it takes the same number of samples to compute a lower bound for the $p$-th percentile as to compute an upper bound for the $(100-p)$-th percentile.

In [None]:
percentile = 20 
confidence = 95 # the confidence level, in %

# Compare the sizing for a percentile and its mirror around the median
n_low = triscale.experiment_sizing(percentile, confidence)
n_high = triscale.experiment_sizing(100 - percentile, confidence)
if n_low == n_high:
    print("It takes the same number of samples to estimate "
          "the {}-th and {}-th percentiles.".format(percentile, 100 - percentile))


In [None]:
# Sets of percentiles and confidence levels to try
percentiles = [0.1, 1, 5, 10, 25, 50, 75, 90, 95, 99, 99.9]
confidences = [75, 90, 95, 99, 99.9, 99.99]

# Computing the minimum number of runs for each (perc., conf.) pair
min_number_samples = []
for c in confidences:
    tmp = []
    for p in percentiles:
        N = triscale.experiment_sizing(p,c)
        tmp.append(N[0])
    min_number_samples.append(tmp)
    
# Put the results in a DataFrame for convenient display
df = pd.DataFrame(columns=percentiles, data=min_number_samples)
df['Confidence level'] = confidences
df.set_index('Confidence level', inplace=True)

display(df)

Let's visualize the same data with a heatmap...

In [None]:
colorbar=dict(
    title='Minimal N', 
    tickvals = [0, 1, 2, 3, 3.699, 4],
    ticktext = ['1', '10', '100', '1000', '5000','10000']
)

fig = go.Figure(data=go.Heatmap(
    z = np.log10(df),
    y = df.index,
    x = df.columns,                
    colorbar = colorbar,
    hovertemplate='N: 10^%{z:.2f}<br>percentile: %{x}<br>confidence: %{y}',
    )
)

fig.update_layout(
    title_text='Minimal number of samples',
    xaxis=dict(title='Percentile'),
    yaxis=dict(title='Confidence level')
)

fig.show()

> **Takeaway.** The required number of samples increases exponentially when the percentile to estimate becomes more extreme. The increase induced by the confidence level is not as dramatic.
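
To see where these numbers come from, consider the default case in which no sample is excluded. The following is a minimal sketch, not _TriScale_'s own code: it assumes the bound is the smallest of $N$ independent samples, which fails to be a lower bound for the $p$-th percentile only if all $N$ samples land above it, an event of probability $(1 - p/100)^N$. Requiring this probability to be at most $1 - C/100$ gives a closed form for $N$. The function name `min_samples_closed_form` is hypothetical.

```python
import math

def min_samples_closed_form(percentile, confidence):
    """Hypothetical helper: smallest N such that the most extreme of N
    i.i.d. samples bounds the given percentile with the given confidence.
    Both arguments are in percent, strictly between 0 and 100."""
    # By symmetry, upper percentiles are handled like their lower mirror.
    p = min(percentile, 100 - percentile) / 100
    c = confidence / 100
    # Solve (1 - p)^N <= 1 - c for the smallest integer N.
    return math.ceil(math.log(1 - c) / math.log(1 - p))

# Median with 95% confidence: 0.5^N <= 0.05 first holds at N = 5.
print(min_samples_closed_form(50, 95))  # → 5
```

In this sketch, the confidence level only enters through $\log(1 - C/100)$, which is why raising it is comparatively cheap.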

## Excluding outliers

By default, `experiment_sizing()` returns the minimal number of samples such that the __smallest sample__ is the percentile estimate (or the largest sample, if the percentile is $> 50$).

If the experiment is subject to outliers, or more generally to obtain tighter bounds, one may want to collect more samples. But how many? You can use the `robustness` argument to find out:

In [None]:
percentile = 10
confidence = 99
triscale.experiment_sizing(
    percentile, 
    confidence,
    robustness=3,
    verbose=True); 

> **Note.** `robustness` is the number of extreme samples (potential outliers) that can be excluded from the confidence interval.
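
One way to make this precise, sketched here under assumptions rather than taken from _TriScale_'s source: with `robustness` $= r$, the bound is the $(r+1)$-th smallest sample, which is a valid lower bound for the $p$-th percentile whenever at least $r+1$ of the $N$ samples fall below that percentile. The number of such samples follows a binomial distribution, so we look for the smallest $N$ with $P(\mathrm{Binomial}(N, p/100) \le r) \le 1 - C/100$. The helper names below are hypothetical, and percentiles above 50 are assumed to mirror the lower ones.

```python
from math import comb

def binom_cdf(r, n, p):
    """P(X <= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r + 1))

def min_samples_with_robustness(percentile, confidence, robustness=0):
    """Hypothetical helper: smallest N such that, with the given confidence,
    more than `robustness` samples lie beyond the percentile."""
    p = min(percentile, 100 - percentile) / 100
    alpha = 1 - confidence / 100
    n = robustness + 1  # at least robustness + 1 samples are needed
    # The CDF decreases as n grows, so the first n that passes is minimal.
    while binom_cdf(robustness, n, p) > alpha:
        n += 1
    return n

# Median, 95% confidence, no excluded sample: 0.5^N <= 0.05 first holds at N = 5.
print(min_samples_with_robustness(50, 95))  # → 5
```

Raising `robustness` only changes which order statistic is used: the $r$ most extreme samples are skipped and the next one serves as the bound.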

Again, we can plot the minimal value of $N$ as `robustness` increases. For example, for a few percentiles, with 95% confidence level:

In [None]:
robustness_values = np.arange(1,10,dtype=int)
confidence = 95
N_50 = [triscale.experiment_sizing(50, confidence, robustness=int(x))[0] for x in robustness_values]
N_75 = [triscale.experiment_sizing(75, confidence, robustness=int(x))[0] for x in robustness_values]
N_90 = [triscale.experiment_sizing(90, confidence, robustness=int(x))[0] for x in robustness_values]

fig = go.Figure()
fig.add_trace(go.Scatter(x=robustness_values, y=N_90, name='90th'))
fig.add_trace(go.Scatter(x=robustness_values, y=N_75, name='75th'))
fig.add_trace(go.Scatter(x=robustness_values, y=N_50, name='median'))

fig.update_layout(
    title_text='Minimal number of samples for 95% confidence level',
    xaxis=dict(title='robustness'),
    yaxis=dict(title='Minimal N')
)

fig.show()

> **Takeaway.** The increase in number of samples required with respect to the `robustness` parameter is essentially linear. 

## Your turn: time to practice

Based on the explanations above, use _TriScale_'s `experiment_sizing` function to answer the following questions:
- What is the minimal number of runs required to estimate the
    - **90th** percentile with **90%** confidence?
    - **90th** percentile with **95%** confidence?
    - **95th** percentile with **90%** confidence?
- Based on the answers to the previous questions, is it harder (i.e., does it require more runs) to increase the confidence level, or to estimate a more extreme percentile? 

_Optional question (harder):_ 
- For $N = 50$ samples, how many outliers can be excluded when computing a lower bound with a 95% confidence level for the 25th percentile? 

In [None]:
########## YOUR CODE HERE ###########
# ...
#####################################

### Solution

<details>
  <summary><br/>Click here to show the solution.</summary>
  
```python
>>> print(triscale.experiment_sizing(90,90)[0])
22
>>> print(triscale.experiment_sizing(90,95)[0])
29
>>> print(triscale.experiment_sizing(95,90)[0])
45
```
We observe that it "costs" many more runs to estimate a more extreme percentile (95th instead of 90th) than to increase the confidence level (90% to 95%). This observation holds in general: the number of runs required increases exponentially as the percentile gets more extreme (close to $0$ or to $100$).
    
For the last question, we must play with the `robustness` parameter. We can write a simple loop that increases its value until the number of runs required exceeds 50.
    
```python
>>> r = 0
>>> while triscale.experiment_sizing(25, 95, robustness=r)[0] <= 50:
...     r += 1
...
>>> print(r - 1)
7
```        
We can exclude the 7 "worst" samples from the confidence interval. Hence, with $N=50$ samples, the best lower bound for the 25th percentile with 95% confidence is $x_8$, the 8th smallest sample (numbering the sorted samples from $x_1$).
</details>

---
Next step: [Data Analysis](live_data-analysis.ipynb)  
[Back to repo](.)