# TriScale - Data Analysis

> This notebook is intended for **self-study** of _TriScale._  
Here is the [version for live sessions](live_data-analysis.ipynb).

This notebook contains tutorial materials for _TriScale_. 

More specifically, this notebook presents _TriScale_'s data analysis functions,  
leading to the computation of variability scores, which serve to quantify replicability.

> If you don't know about Jupyter Notebooks and how to interact with them,  
fear not! We compiled everything that you need to know here: [Notebook Basics](tutorial_notebook-basics.ipynb) :-) 


For more details about _TriScale,_ you may refer to [the paper](https://doi.org/10.5281/zenodo.3464273).

---
- [Data analysis](#Data-analysis)
    - [Runs and Metrics](#Runs-and-Metrics)
    - [Series and KPIs](#Series-and-KPIs)
    - [Sequels and Variability Scores](#Sequels-and-Variability-Scores)
- [Your turn: time to practice](#Your-turn:-time-to-practice)

---

To get started, we need to import a few Python modules.  
All the _TriScale_-specific functions are part of one module called `triscale`.

In [None]:
import os
from pathlib import Path

import pandas as pd
import numpy as np

import triscale

Alright, we are ready to analyse some data!

## Data analysis

In this notebook, we consider that experiments have been designed, that the  
corresponding data has been collected, and we focus on the data analysis.

_TriScale_'s methodology is structured around three time scales:
- **Runs** which lead to the computation of performance **metrics**;
- **Series** which lead to the computation of **key performance indicators (KPIs)**;
- **Sequels** which lead to the computation of **variability scores**.

_TriScale_'s API provides one function per time scale, which we will look at in the next sections.

### Runs and Metrics

In _TriScale_, metrics measure a performance dimension during one run. The computation of metrics  
is implemented in the `analysis_metric()` function, which takes two compulsory arguments:
- the raw data;
- the metric definition.

The raw data can be passed as a file path (i.e., a string) or as a Pandas DataFrame. 
- If a string is passed, the function tries to read the file as a CSV file (comma-separated)  
where `x` data is expected in the first column and `y` data in the second column. 
- If a pandas DataFrame is passed, `data` must contain columns named `x` and `y`.
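
As a minimal sketch of the DataFrame option, the cell below builds a DataFrame whose columns are named `x` and `y`, as required above. The CSV content and the original column names (`timestamp`, `delay_ms`) are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical raw data: a timestamp column and a delay column
csv_text = "timestamp,delay_ms\n0.0,12.1\n0.5,11.8\n1.0,12.4\n"
raw = pd.read_csv(io.StringIO(csv_text))

# Rename the columns to the names expected for DataFrame input
df_xy = raw.rename(columns={'timestamp': 'x', 'delay_ms': 'y'})[['x', 'y']]
print(df_xy)
```

A DataFrame prepared this way can then be passed as the raw data argument in place of the file path.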

The metric definition is provided as a dictionary, with only the `measure` key being compulsory.  
This key defines "what is the computation to be performed" on the data; in other words, what the "metric" is.  
Currently supported measures are:
- Any percentile ($0<P<100$);
- `mean`;
- `minimum`;
- `maximum`.
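
To make these measures concrete, here is a rough NumPy stand-in for what each one computes on the `y` values. It is only a sketch: the helper name `sketch_measure` is hypothetical, and the percentile estimator (NumPy's default linear interpolation) is an assumption that may differ from the one _TriScale_ uses.

```python
import numpy as np

def sketch_measure(y, measure):
    """Illustrative stand-in for the supported measures (not TriScale code)."""
    y = np.asarray(y, dtype=float)
    if measure == 'mean':
        return float(np.mean(y))
    if measure == 'minimum':
        return float(np.min(y))
    if measure == 'maximum':
        return float(np.max(y))
    # Any other valid measure is a number read as a percentile, 0 < P < 100
    if isinstance(measure, (int, float)) and 0 < measure < 100:
        # Assumption: NumPy's default estimator (linear interpolation)
        return float(np.percentile(y, measure))
    raise ValueError('Unsupported measure: %r' % (measure,))

delays = [12.1, 11.8, 12.4, 30.0, 12.0]  # made-up values, in ms
print(sketch_measure(delays, 50))         # median
print(sketch_measure(delays, 'maximum'))
```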

The `analysis_metric()` function returns 3 outputs:
- the result of the convergence test (not discussed in this notebook);
- the value of the metric's measure;
- a plot of the raw data.

The following cell illustrates the basic usage of the `analysis_metric()` function.

In [None]:
# Input data file path
data = Path('ExampleData/raw_data.csv') # one-way delay of a full-throttled flow using TCP BBR

# Definition of a TriScale metric
metric = {  
    'measure': 50,   # Integer: interpreted as a percentile
    'unit'   : 'ms', # For display only
         }

has_converged, metric_measure, plot = triscale.analysis_metric( 
    str(data),
    metric)

print('Run metric: %0.2f %s' % (metric_measure, metric['unit']))

Passing the optional argument `plot=True` automatically displays the plot of the raw data. 

In [None]:
has_converged, metric_measure, plot = triscale.analysis_metric( 
    str(data),
    metric,
    plot=True,
)

> **Note.** As presented here, the `analysis_metric()` function is not very interesting:  
it "only" returns some percentile of an array... The function is more useful when the metric  
attempts to estimate the _long-term performance_ of the system; that is, the value one  
would obtain should the run last longer or more data points be collected.  
When this is the case, _TriScale_ performs a convergence test on the data, which can be  
triggered in the `analysis_metric()` function by passing the optional `convergence` parameter.  
The study of convergence goes beyond the scope of this tutorial; refer to the [paper](https://doi.org/10.5281/zenodo.3464273) for more details. 

### Series and KPIs

In _TriScale_, key performance indicators (KPIs) measure performance dimensions across a series of runs.  
Performing multiple runs allows one to mitigate the inherent variability of the experimental conditions.  
KPIs capture this variability by estimating percentiles of the (unknown) metric distributions.   

Concretely, **a _TriScale_ KPI is a one-sided confidence interval of a percentile;**  
e.g., a lower bound for the 25th percentile of a metric, estimated with a 95% confidence level.  
The computation of KPIs is implemented in the `analysis_kpi()` function, which takes  
two compulsory arguments:
- the metric data;
- the KPI definition.

The metric data can be passed as a list or a NumPy array.  The KPI definition is provided  
as a dictionary with three compulsory keys: 
- `percentile` ($0<P<100$); 
- `confidence` ($0<C<100$); 
- `bounds`. 

The KPI `bounds` are the expected extremal values for the metric. This information is used  
during the independence test (see below). If the metric bounds are unknown, simply pass  
the minimum and maximum metric values as bounds.
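
To illustrate how `percentile` and `confidence` combine into a one-sided bound, here is a minimal sketch of the textbook order-statistics approach for iid samples. It is not taken from _TriScale_'s code: the helper name `upper_bound_index` is hypothetical, and the sketch assumes a continuous metric distribution and an upper bound.

```python
from math import comb

def upper_bound_index(n, percentile, confidence):
    """Smallest 1-based rank k such that the k-th smallest of n iid samples
    is >= the given percentile with probability >= confidence (both in %).
    Returns None when n is too small for such a bound to exist.
    """
    p, c = percentile / 100, confidence / 100
    cumulative = 0.0
    for k in range(1, n + 1):
        # Add P(exactly k-1 samples fall below the percentile), so that
        # `cumulative` is P(at most k-1 samples fall below the percentile)
        cumulative += comb(n, k - 1) * p ** (k - 1) * (1 - p) ** (n - k + 1)
        if cumulative >= c:
            return k
    return None

# Made-up metric values; the bound is one of the sorted samples
samples = [1.2, 0.8, 1.9, 1.1, 1.4, 0.9, 1.6, 1.3, 1.0, 1.7, 1.5]
k = upper_bound_index(len(samples), percentile=75, confidence=95)
print(k, sorted(samples)[k - 1])
```

With 11 samples, only the largest sample qualifies as a 95% upper bound on the 75th percentile; with 10 samples or fewer, the helper returns `None`.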

The `analysis_kpi()` function returns the output of two computations:
1. It performs an empirical independence test (see below) and returns `True` (test is passed)  
or `False` (it is not);
2. It computes the KPI and returns its value.  


> **Why the independence test?** The metric data must be iid for the KPI to be a valid  
estimate of the underlying metric distribution. Note however that, in general,  
independence is a property of the data collection process, not of the data.  
Unfortunately though, in many practical cases in networking, independence cannot be  
guaranteed; for example, because there are some correlations in the interference  
conditions between successive experiments.  
In such a case, one can perform an _empirical_ test for independence; essentially,  
this test assesses whether the level of correlation in the data appears sufficiently  
low such that the data can be reasonably assumed to be iid.
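
For intuition only, the sketch below shows a generic empirical check of this kind: it computes sample autocorrelation coefficients and compares them with the usual approximate 95% band of ±1.96/√N. This is an assumption-laden illustration, not _TriScale_'s test (see the paper for that); the function names and the `max_lag` parameter are hypothetical.

```python
from math import sqrt

def autocorrelation(values, lag):
    """Sample autocorrelation at the given lag (values must not all be equal)."""
    n = len(values)
    mean = sum(values) / n
    variance_sum = sum((v - mean) ** 2 for v in values)
    covariance_sum = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return covariance_sum / variance_sum

def looks_uncorrelated(values, max_lag=5):
    """True if all coefficients for lags 1..max_lag stay within +/-1.96/sqrt(N)."""
    threshold = 1.96 / sqrt(len(values))
    return all(
        abs(autocorrelation(values, lag)) <= threshold
        for lag in range(1, max_lag + 1)
    )

# A steadily increasing series is strongly correlated from run to run
trend = list(range(20))
print(autocorrelation(trend, 1))
print(looks_uncorrelated(trend))  # False
```

Requiring every lag to stay within the band is deliberately strict; even truly iid data exceeds a 95% band now and then.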

The following cell illustrates the basic usage of the `analysis_kpi()` function.

In [None]:
# Input data file path
data = Path('ExampleData/metric_data.csv') # Failure recovery time, in seconds

# Read data in a Pandas DataFrame
df = pd.read_csv(data, header=0, names=['metric'])

# KPI definition
KPI = {
    'percentile': 75,
    'confidence': 95,
    'bounds': [0,10],
    'unit': 's'
}

# Computes the KPI
indep_test_passed, KPI_value = triscale.analysis_kpi(
    df.metric.values,
    KPI,
)

# Output
if indep_test_passed:
    print('The metric data appears iid.')
    print('KPI value: %0.2f %s' % (KPI_value, KPI['unit']))
else:
    print('The metric data does not appear iid.')

Since the metric data appears to be iid, we can interpret the KPI value as follows:  
> __With a confidence level of 95%, the 75th percentile on the metric is smaller than or equal to 1.92s.__   
In other words, with a probability of 95%, the performance metric is smaller than or equal to 1.92s  
in at least three quarters of the runs. 

Even when the independence test fails, the KPI value is computed and returned. However,  
the user must then __be aware that the resulting KPI is not a trustworthy estimate__  of the   
corresponding percentile (i.e., it is not a valid confidence interval).

> For more details about the implementation of the empirical independence test, refer to the [paper](https://doi.org/10.5281/zenodo.3464273). 

Moreover, if there are not enough data points to compute the desired KPI, the function returns `np.nan` (not-a-number).
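
How many data points are "enough" follows from the same order-statistics argument: the most extreme sample is a valid one-sided bound only if the series is long enough. The sketch below computes that minimum; the helper name `min_samples` is hypothetical, and it assumes an upper bound for percentiles of 50 and above and a lower bound below 50.

```python
from math import ceil, log

def min_samples(percentile, confidence):
    """Minimum number of iid samples for a one-sided confidence bound
    on the given percentile to exist (both arguments in %).
    """
    p, c = percentile / 100, confidence / 100
    # The bound exists once 1 - tail**N >= c, with tail = max(p, 1 - p)
    tail = max(p, 1 - p)
    return ceil(log(1 - c) / log(tail))

print(min_samples(75, 95))  # 11
print(min_samples(25, 75))  # 5
```

More extreme percentiles and higher confidence levels both increase the number of runs needed per series.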

Optionally, the `analysis_kpi()` function produces and displays 3 plots: 
- the metric data series (`series`)
- the autocorrelation plot (`autocorr`)
- the metric data and the corresponding KPI value (`horizontal`)

This is illustrated below.

In [None]:
# Computes the KPI and plot
indep_test_passed, KPI_value = triscale.analysis_kpi(
    df.metric.values,
    KPI,
    to_plot=['series','autocorr','horizontal']
)

### Sequels and Variability Scores

Sequels are repetitions of series of runs. _TriScale_'s variability scores measure the variations   
of KPI values across sequels. Hence, sequels aim to detect long-term variations of KPIs and,  
ultimately, to quantify the replicability of an experiment. 

Concretely, a __variability score is composed of two one-sided confidence intervals (CIs) for a symmetric pair of percentiles;__  
e.g., a 75% confidence interval for the 25-75th percentiles. The underlying computations are  
the same as for the [KPI values](#Series-and-KPIs). The computation of variability scores is implemented in the  
`analysis_variability()` function, which takes two compulsory arguments:
- the KPI data;
- the variability score definition.

The KPI data can be passed as a list or a NumPy array.  The variability score definition is provided  
as a dictionary with three compulsory keys: 
- `percentile` ($0<P<100$); 
- `confidence` ($0<C<100$); 
- `bounds`. 

The `bounds` are the expected extremal values for the KPI. As for the [`analysis_kpi()` function](#Series-and-KPIs),  
the bounds are used during the independence test. If the KPI bounds are unknown, simply pass  
the minimum and maximum KPI values as bounds.
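
As a rough sketch of the idea behind the score, the cell below takes a one-sided lower bound on the lower percentile and a one-sided upper bound on the symmetric upper percentile, then returns their difference. It uses the textbook order-statistics argument and is not _TriScale_'s implementation; the helper name `variability_sketch` and the KPI values are made up.

```python
from math import comb

def variability_sketch(kpi_values, percentile, confidence):
    """Upper bound of the upper percentile minus lower bound of the lower
    percentile, each one-sided at `confidence` (in %). None if too few values.
    """
    n = len(kpi_values)
    p_low = min(percentile, 100 - percentile) / 100
    c = confidence / 100
    ordered = sorted(kpi_values)

    # Find the largest rank k with P(at least k samples <= lower percentile) >= c
    k = None
    tail = 1.0
    for rank in range(1, n + 1):
        # Subtract P(exactly rank-1 samples below) to get P(at least `rank`)
        tail -= comb(n, rank - 1) * p_low ** (rank - 1) * (1 - p_low) ** (n - rank + 1)
        if tail < c:
            break
        k = rank
    if k is None:
        return None

    lower = ordered[k - 1]  # k-th smallest value
    upper = ordered[n - k]  # k-th largest value, by symmetry
    return upper - lower

kpis = [10.2, 9.8, 10.0, 10.5, 9.9]  # made-up KPI values
print(variability_sketch(kpis, percentile=25, confidence=75))
```

With only five KPI values and a 75% confidence level, the two bounds are simply the minimum and the maximum of the values.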

The `analysis_variability()` function returns the output of two computations:
1. It performs an empirical independence test and returns `True` (test is passed) or `False` (it is not);
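2. It computes the variability score and returns its value, along with an upper bound, a lower bound,  
and a relative score (see the five values unpacked in the example below).  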

Moreover, if there are not enough data points to compute the desired variability score,  the function  
returns `np.nan` (not-a-number).
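
Since a score can be `nan` or come from data that failed the independence test, it helps to handle these cases explicitly when reporting. The helper below is a small sketch of one way to do so; its name and messages are hypothetical.

```python
from math import isnan

def describe_score(indep_test_passed, value, unit):
    """Turn an analysis outcome into a short, explicit message."""
    if isnan(value):
        return 'Not enough data points to compute the score.'
    if not indep_test_passed:
        return 'Score: %0.2f %s (data does not appear iid; not trustworthy)' % (value, unit)
    return 'Score: %0.2f %s' % (value, unit)

print(describe_score(True, 0.4, 's'))
print(describe_score(False, 0.4, 's'))
print(describe_score(True, float('nan'), 's'))
```

`math.isnan()` also accepts `np.nan`, so the helper works on the values returned by the analysis functions.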

The same plotting options as for `analysis_kpi()` are available, as illustrated below.

In [None]:
# Input data file path
data = 'ExampleData/kpi_data.csv' # failure recovery time, in seconds

# Read data in a Pandas DataFrame
df = pd.read_csv(data, header=0, names=['kpi'])

# Score definition
score = {
    'percentile': 25, # the 25th-75th  percentiles range
    'confidence': 95,
    'bounds': [0,10],
    'unit': 's'
}

# Computes the variability score
(indep_test_passed, 
 upper_bound, 
 lower_bound, 
 var_score, 
 rel_score) = triscale.analysis_variability(
    df.kpi.values,
    score,
    to_plot=['series','horizontal']
)

# Output
if indep_test_passed:
    print('The KPI data appears iid.')
    print('Variability score: %0.2f %s' % (var_score, score['unit']))
else:
    print('The KPI data does not appear iid.')

Since the KPI data appears to be iid, we can interpret the variability score as follows:  
> __With a confidence level of 95%, the inter-quartile (25th-75th perc) range on the KPIs is smaller than or equal to 0.4s.__   
In other words, with a probability of 95%, across all series, the middle 50% of KPI values differ by 0.4s or less.


## Your turn: time to practice

We have collected data for a comparative evaluation of congestion-control schemes using the  
[Pantheon platform.](https://pantheon.stanford.edu/) More details about the experiment setup can be found in the [TriScale paper](https://doi.org/10.5281/zenodo.3464273).

We performed five series of ten runs each, for 16 different schemes. For the purpose of this tutorial,  
we provide a dataset containing pre-computed metric values for each run. It contains two metrics:
- the mean **throughput;** 
- the 95th percentile of the **one-way delay.**

Your task consists in analysing this dataset using _TriScale._ The goal is to compute and compare the  
variability scores of different congestion-control schemes. We will focus on the throughput metric only  
(an arbitrary choice).

Let us first load and visualise the dataset.

In [None]:
# Load and display the entire dataset
df = pd.read_csv(Path('ExampleData/metrics_wo_convergence.csv'))
display(df)

We can easily extract the list of `schemes` used; the `dates` identifying each series can be  
obtained the same way with `df.datetime.unique()`.

In [None]:
# Extract the list of congestion control schemes
schemes = df.cc.unique()
print(schemes)

The following cell contains a simple function to filter this dataset and extract metric values  
per scheme and per series.

In [None]:
def get_metrics(df, scheme, metric):
    '''Parse the dataset to extract the series of metric values
    for one scheme and all series of runs.
    '''
    # Initialize output
    metric_data = []
    
    # List of dates (identifies the series)
    dates = df.datetime.unique()
    
    # For each series
    for date in dates:
        
        # Set up the data filter (named `mask` to avoid shadowing the built-in `filter`)
        mask = (
            (df.cc == scheme) &
            (df.datetime == date) 
        )

        # Filter
        df_filtered = df.where(mask).dropna()
        
        # Store metrics values for that series
        metric_data.append(list(df_filtered[(metric+'_value')].values))        
    
    # Return the desired metric data
    return metric_data

We will use this function to easily extract all metrics values for one scheme and one metric  
(e.g., the `throughput` of `bbr`).
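
As an aside, the same per-series extraction can be sketched with `groupby`. The column names (`cc`, `datetime`, `<metric>_value`) come from the function above; the toy dataset and the name `get_metrics_groupby` are hypothetical. Unlike `get_metrics()`, this variant does not drop rows containing missing values and omits series in which the scheme has no runs.

```python
import pandas as pd

# Toy dataset mimicking the layout used by get_metrics()
toy = pd.DataFrame({
    'cc': ['bbr', 'bbr', 'bbr', 'bbr', 'cubic', 'cubic'],
    'datetime': ['d1', 'd1', 'd2', 'd2', 'd1', 'd2'],
    'throughput_value': [105.1, 104.9, 105.3, 104.7, 98.2, 97.5],
})

def get_metrics_groupby(df, scheme, metric):
    """Return one list of metric values per series for the given scheme."""
    column = metric + '_value'
    selected = df[df.cc == scheme]
    # sort=False keeps the series in order of first appearance
    return [
        group[column].tolist()
        for _, group in selected.groupby('datetime', sort=False)
    ]

print(get_metrics_groupby(toy, 'bbr', 'throughput'))
```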

The definitions of the KPI and variability score to use for our analysis are provided below. 

In [None]:
# KPI
KPI  = {'percentile': 25,
         'confidence': 75,
         'name': 'KPI Throughput',
         'unit': 'Mbit/s',
         'bounds':[0,120],    # expected value range
        }

# Variability score
score  = {'percentile': 75,
         'confidence': 75,
         'name': 'Throughput',
         'unit': 'Mbit/s',
         'bounds':[0,120],    # expected value range
        }


> **Note.**  We aim to estimate the 25th percentile for the `throughput`, where higher is better.  
Thus, the KPI provides the performance expected in at least 75% of the runs.

You are now all set to analyse this dataset! 

In the following cell, we provide skeleton code which you should complete and execute in order to  
answer the following questions:

- What is the variability score of the `throughput` metric of the `bbr` congestion-control scheme?

Modify the definition of the variability score to estimate the median (`'percentile': 50`) instead of  
the 25-75th percentile range. 

- What is the value of the variability score now? Does this make sense to you?

_Optional (and harder) questions:_ 

- Compute the scores for all the schemes. Do they vary a lot? 
- Do the variability scores seem "big" with respect to the range of KPI values? 
- Would you say that these experiments appear to be replicable?

In [None]:
# Extract and display the metrics values for the 5 series of 10 runs 
scheme = 'bbr'
metric = 'throughput' # valid options are 'throughput' and 'delay'
metric_data = get_metrics(df, scheme, metric)

# Initialize an empty list to collect the KPI values
KPI_values = [] 

## Step 1. Compute the KPIs
for series_data in metric_data:
    
    ########## YOUR CODE HERE ###########
    # - compute the KPI value 
    # - if the independence test is passed, 
    # store the value in the KPI list
    #####################################

    
# Print the (valid) KPI values
s = '%i valid KPIs obtained\n> ' % len(KPI_values)
for k in KPI_values:
    s += '%0.2f ' % k
s += '\n  in %s\n' % KPI['unit']
print(s)
    
    
## Step 2. Compute the variability score

########## YOUR CODE HERE ###########
# - compute the variability score 
# - print the result depending on the outcome
#   - if there are not enough KPIs, print `nan`
#   - if the independence test fails, print the score value *(-1)
#   - else, print the score value
#####################################


#### Solutions

<details>
  <summary><br/>Click here to show the solutions</summary>
  
```python
    
########## YOUR CODE HERE ###########   
# - compute the KPI value 
# - if the independence test is passed, 
# store the value in the KPI list

indep_test_passed, KPI_value = triscale.analysis_kpi(
    series_data,
    KPI)
if indep_test_passed:
    KPI_values.append(KPI_value)
    
#
#####################################
    
########## YOUR CODE HERE ###########
# - compute the variability score 
# - print the result depending on the outcome
#   - if there are not enough KPIs, print `nan`
#   - if the independence test fails, print the score value *(-1)
#   - else, print the score value

(indep_test_passed, 
 upper_bound, 
 lower_bound, 
 var_score, 
 rel_score) = triscale.analysis_variability(
    KPI_values,
    score
)

if not indep_test_passed: 
    var_score *= -1

print('Variability score: %0.2f %s' % (var_score, score['unit']))
    
#
#####################################
    
```
These code blocks lead to the following output:
    
```
5 valid KPIs obtained
> 105.07 105.04 104.68 104.92 105.08 
  in Mbit/s

Variability score: 0.41 Mbit/s
```

</details>

---
Next step: [Seasonal Components](tutorial_seasonal-comp.ipynb)  
[Back to repo](.)