{ "cells": [ { "cell_type": "markdown", "id": "major-expansion", "metadata": {}, "source": [ "# _TriScale_ - Experiment Sizing\n", "\n", "> This notebook is intended for **live tutorial** sessions about _TriScale._ \n", "Here is the [self-study version](tutorial_exp-sizing.ipynb)." ] }, { "cell_type": "markdown", "id": "reflected-glass", "metadata": {}, "source": [ "To get started, we need to import a few Python modules. All the _TriScale_-specific functions are part of one module called `triscale`." ] }, { "cell_type": "code", "execution_count": null, "id": "primary-secret", "metadata": {}, "outputs": [], "source": [ "import os\n", "from pathlib import Path\n", "\n", "import pandas as pd\n", "import numpy as np\n", "import plotly.graph_objects as go\n", "\n", "import triscale" ] }, { "cell_type": "markdown", "id": "specific-french", "metadata": {}, "source": [ "## Basics\n", "\n", "_TriScale_'s `experiment_sizing()` function computes the minimal number of samples required to estimate any percentile with any confidence level." ] }, { "cell_type": "code", "execution_count": null, "id": "atmospheric-hammer", "metadata": {}, "outputs": [], "source": [ "percentile = 50 # the median\n", "confidence = 95 # the confidence level, in %\n", "\n", "triscale.experiment_sizing(\n", "    percentile,\n", "    confidence,\n", "    verbose=True);" ] }, { "cell_type": "markdown", "id": "floating-portrait", "metadata": {}, "source": [ "We can change the values in the cell above to see how the number of samples evolves with a larger confidence level or more extreme percentiles." ] }, { "cell_type": "markdown", "id": "starting-member", "metadata": {}, "source": [ "Note that the problem is symmetric: it takes the same number of samples to compute a lower bound for the $p$-th percentile as to compute an upper bound for the $(100-p)$-th percentile."
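] }, { "cell_type": "markdown", "id": "sizing-binomial-sketch", "metadata": {}, "source": [ "This symmetry follows from the underlying binomial argument: writing the percentile as a fraction $p$, a one-sided bound taken from the smallest (or largest) of $N$ samples holds with confidence $C$ as soon as the probability that all $N$ samples miss the bound, $(1-p)^N$ (or $p^N$), is at most $1-C$. The cell below sketches this computation; it is *not* _TriScale_'s implementation, and `min_samples` is a hypothetical helper introduced only for illustration." ] }, { "cell_type": "code", "execution_count": null, "id": "sizing-binomial-sketch-code", "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "def min_samples(percentile, confidence):\n", "    # For percentiles <= 50, size a lower bound (the smallest sample);\n", "    # for percentiles > 50, an upper bound (the largest sample).\n", "    # q is the probability that one sample misses the bound.\n", "    q = percentile / 100 if percentile > 50 else 1 - percentile / 100\n", "    c = confidence / 100\n", "    # Smallest N such that q**N <= 1 - c\n", "    return math.ceil(math.log(1 - c) / math.log(q))\n", "\n", "print(min_samples(50, 95))  # 5: five samples suffice for the median at 95%\n", "\n", "# The symmetry: same N for the 20th and 80th percentiles\n", "print(min_samples(20, 95) == min_samples(80, 95))  # True"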
] }, { "cell_type": "code", "execution_count": null, "id": "conservative-spank", "metadata": {}, "outputs": [], "source": [ "percentile = 20\n", "confidence = 95 # the confidence level, in %\n", "\n", "if (triscale.experiment_sizing(percentile, confidence) ==\n", "    triscale.experiment_sizing(100-percentile, confidence)):\n", "    print(\"It takes the same number of samples to estimate \\\n", "the {}-th and the {}-th percentiles.\".format(percentile, 100-percentile))\n" ] }, { "cell_type": "code", "execution_count": null, "id": "significant-joseph", "metadata": {}, "outputs": [], "source": [ "# Sets of percentiles and confidence levels to try\n", "percentiles = [0.1, 1, 5, 10, 25, 50, 75, 90, 95, 99, 99.9]\n", "confidences = [75, 90, 95, 99, 99.9, 99.99]\n", "\n", "# Compute the minimum number of runs for each (perc., conf.) pair\n", "min_number_samples = []\n", "for c in confidences:\n", "    tmp = []\n", "    for p in percentiles:\n", "        N = triscale.experiment_sizing(p, c)\n", "        tmp.append(N[0])\n", "    min_number_samples.append(tmp)\n", "\n", "# Put the results in a DataFrame for convenient display\n", "df = pd.DataFrame(columns=percentiles, data=min_number_samples)\n", "df['Confidence level'] = confidences\n", "df.set_index('Confidence level', inplace=True)\n", "\n", "display(df)" ] }, { "cell_type": "markdown", "id": "5080d7fd", "metadata": {}, "source": [ "Let's visualize the same data with a heatmap..." ] }, { "cell_type": "code", "execution_count": null, "id": "c3dc6fc7", "metadata": {}, "outputs": [], "source": [ "colorbar = dict(\n", "    title='Minimal N',\n", "    tickvals=[0, 1, 2, 3, 3.699, 4],\n", "    ticktext=['1', '10', '100', '1000', '5000', '10000']\n", ")\n", "\n", "# The z values are log10(N), hence the hover text shows N as 10^z\n", "fig = go.Figure(data=go.Heatmap(\n", "    z=np.log10(df),\n", "    y=df.index,\n", "    x=df.columns,\n", "    colorbar=colorbar,\n", "    hovertemplate='N: 10^%{z}<br>percentile: %{x}<br>confidence: %{y}',\n", "    )\n", ")\n", "\n", "fig.update_layout(\n", "    title_text='Minimal number of samples',\n", "    xaxis=dict(title='Percentile'),\n", "    yaxis=dict(title='Confidence level')\n", ")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "id": "eeea6b5c", "metadata": {}, "source": [ "> **Takeaway.** The required number of samples increases exponentially as the percentile to estimate becomes more extreme. The increase induced by the confidence level is not as dramatic." ] }, { "cell_type": "markdown", "id": "burning-controversy", "metadata": {}, "source": [ "## Excluding outliers\n", "\n", "By default, `experiment_sizing()` returns the minimal number of samples such that the __smallest one__ is the percentile estimate (or the __largest one__, if the percentile is $> 50$).\n", "\n", "If the experiment is subject to outliers, or more generally to obtain tighter bounds, one may want to collect more samples. But how many? You can use the `robustness` argument to find out:" ] }, { "cell_type": "code", "execution_count": null, "id": "immediate-testament", "metadata": {}, "outputs": [], "source": [ "percentile = 10\n", "confidence = 99\n", "triscale.experiment_sizing(\n", "    percentile,\n", "    confidence,\n", "    robustness=3,\n", "    verbose=True);" ] }, { "cell_type": "markdown", "id": "6dfd718c", "metadata": {}, "source": [ "> **Note.** In other words, `robustness` is the number of outliers that can be excluded from the confidence interval." ] }, { "cell_type": "markdown", "id": "2a142934", "metadata": {}, "source": [ "Again, we can plot the minimal value of $N$ as `robustness` increases. For example, for a few percentiles, with a 95% confidence level:" ] }, { "cell_type": "code", "execution_count": null, "id": "beef9688", "metadata": {}, "outputs": [], "source": [ "robustness_values = np.arange(1, 10, dtype=int)\n", "confidence = 95\n", "N_50 = [triscale.experiment_sizing(50, confidence, robustness=int(x))[0] for x in robustness_values]\n", "N_75 = [triscale.experiment_sizing(75, confidence, robustness=int(x))[0] for x in robustness_values]\n", "N_90 = [triscale.experiment_sizing(90, confidence, robustness=int(x))[0] for x in robustness_values]\n", "\n", "fig = go.Figure()\n", "fig.add_trace(go.Scatter(x=robustness_values, y=N_90, name='90th'))\n", "fig.add_trace(go.Scatter(x=robustness_values, y=N_75, name='75th'))\n", "fig.add_trace(go.Scatter(x=robustness_values, y=N_50, name='median'))\n", "\n", "fig.update_layout(\n", "    title_text='Minimal number of samples for 95% confidence level',\n", "    xaxis=dict(title='robustness'),\n", "    yaxis=dict(title='Minimal N')\n", ")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "id": "566bb82e", "metadata": {}, "source": [ "> **Takeaway.** The required number of samples grows essentially linearly with the `robustness` parameter." ] }, { "cell_type": "markdown", "id": "determined-welsh", "metadata": {}, "source": [ "## Your turn: time to practice\n", "\n", "Based on the explanations above, use _TriScale_'s `experiment_sizing` function to answer the following questions:\n", "- What is the minimal number of runs required to estimate the\n", "    - **90th** percentile with **90%** confidence?\n", "    - **90th** percentile with **95%** confidence?\n", "    - **95th** percentile with **90%** confidence?\n", "- Based on the answers to the previous questions, is it harder (i.e., does it require more runs) to increase the confidence level, or to estimate a more extreme percentile?\n", "\n", "_Optional question (harder):_ \n", "- For $N = 50$ samples, how many outliers can be excluded when computing a lower bound with a 95% confidence level for the 25th percentile? " ] }, { "cell_type": "code", "execution_count": null, "id": "right-alarm", "metadata": {}, "outputs": [], "source": [ "########## YOUR CODE HERE ###########\n", "# ...\n", "#####################################" ] }, { "cell_type": "markdown", "id": "antique-bishop", "metadata": {}, "source": [ "### Solution\n", "\n", "<details>\n", "<summary>Click here to show the solution.</summary>\n", "\n", "```python\n", ">>> print(triscale.experiment_sizing(90,90)[0])\n", "22\n", ">>> print(triscale.experiment_sizing(90,95)[0])\n", "29\n", ">>> print(triscale.experiment_sizing(95,90)[0])\n", "45\n", "```\n", "We observe that it \"costs\" many more runs to estimate a more extreme percentile (95th instead of 90th) than to increase the confidence level (90% to 95%). This observation holds true in general: the number of runs required increases exponentially as the percentile gets more extreme (close to $0$ or to $100$).\n", "\n", "For the last question, we must play with the `robustness` parameter. We can write a simple loop that increases its value until the required number of runs exceeds $N = 50$.\n", "\n", "```python\n", ">>> r = 0\n", ">>> while (triscale.experiment_sizing(25,95,r)[0] <= 50):\n", "...     r += 1\n", ">>> print(r-1)\n", "7\n", "```\n", "We can exclude the 7 \"worst\" samples from the confidence interval. Hence, with $N=50$ samples, the best lower bound for the 25th percentile with 95% confidence is $x_8$ (assuming the first sample is $x_1$).\n", "</details>" ] }, { "cell_type": "markdown", "id": "alpine-faith", "metadata": {}, "source": [ "---\n", "Next step: [Data Analysis](live_data-analysis.ipynb) \n", "[Back to repo](.)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.1 64-bit ('triscale': conda)", "language": "python", "name": "python391jvsc74a57bd0684f90775fb1f43db0d8eed0780ba829f42a97c2f4b9a1bd592c18e47e5c272e" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 5 }