{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## TriScale\n", "\n", "# Scalability Study\n", "\n", "This notebook presents a scalability analysis of TriScale. \n", "We focus on the execution time to perform the computations; that is, the production of other outputs than the numerical results (i.e., textual logs, plots, etc.) is excuded from this evaluation. \n", "The results presented below have been obtained on a simple laptop.\n", "\n", "The evaluation results show that **TriScale data analysis scales very well with the input sizes**; the data analysis is practically negligeable compared to the data collection time. In particular, we show that:\n", "* For simple metrics such as percentiles, the excution of `analysis_metric()` generally \n", " * takes less than 50ms for up to 10'000 data points,\n", " * takes about **1s for up to one million data points**,\n", " * scales linearly.\n", "* Computing KPIs (`analysis_kpi()`) and variability scores (`analysis_variability()`) generally \n", " * takes less than 10ms for up to 100 data points, \n", " * takes less than **100ms for up to 1000 data points** \n", " * scales linearly.\n", "* The computation of confidence intervals using Thompson's method is very efficient (as demonstrated by the scaling of `analysis_kpi()` and `analysis_variability()`). \n", "Furthermore, computing the minimal number of runs/series required (implemented by `experiment_sizing()`)is generally independent on the percentile and confidence level and completes within **less than 3us** 95% of the time.\n", "\n", "## Menu\n", "\n", "- [List of Imports](#List-of-Imports)\n", "* [analysis_metric()](#analysis_metric)\n", "* [analysis_kpi()](#analysis_kpi)\n", "* [analysis_variability()](#analysis_variability)\n", "* [experiment_sizing()](#experiment_sizing)\n", "\n", "## List of Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "from pathlib import Path\n", "import random\n", "import timeit\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import plotly.express as px\n", "import plotly.io as pio\n", "pio.renderers.default = \"notebook\"\n", "\n", "import triscale" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `analysis_metric`\n", "\n", "The following cell generates random data and save it to csv files. This can be used to reproduced in a controlled way the timing evaluation of the `analysis_metric()` function when taking a file as input. \n", "\n", "This evaluation `analysis_metric()` uses DataFrames as inputs. One can verify that using DataFrames or csv files as inputs has barely any impact on the execution time: either way, the input data is loaded into a DataFrame at the start of `analysis_metric()`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] } ], "source": [ "# Generetion of synthetic data\n", "\n", "## Create a random DataFrame\n", "rand_x = [random.random()*100 for i in range(1000000)]\n", "rand_y = [random.random()*100 for i in range(1000000)]\n", "rand_df = pd.DataFrame(columns=['x','y'])\n", "rand_df['x'] = np.sort(rand_x)\n", "rand_df['y'] = rand_y\n", "\n", "## Save chuncks of it as csv files\n", "file_path = Path('ExampleTraces/Scalability')\n", "if not os.path.exists(file_path):\n", " os.makedirs(file_path)\n", " \n", "sample_sizes = [20,50,100,150,200,300,400,500,1000,5000,10000,200000,400000,600000,800000,1000000]\n", "for sample_size in sample_sizes:\n", " file_name = 'synthetic_%i_samples.csv' % sample_size\n", " df_chunk = rand_df[:sample_size]\n", " df_chunk.to_csv(str(file_path/file_name), index=False) \n", "print('Done.') " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", " \n", " \n", "
\n", " \n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "verbose=False\n", "\n", "columns=['Sample size [#]', 'Execution time [ms]']\n", "df = pd.DataFrame(columns=columns)\n", "for sample_size in sample_sizes: \n", " setup ='''\n", "import random\n", "import numpy as np\n", "import pandas as pd\n", "import triscale\n", "\n", "convergence = {'expected': True}\n", "metric = {'measure':int(random.random()*50)+1}\n", "rand_x = [random.random()*100 for i in range(1000000)]\n", "rand_y = [random.random()*100 for i in range(1000000)]\n", "rand_df = pd.DataFrame(columns=['x','y'])\n", "rand_df['x'] = np.sort(rand_x)\n", "rand_df['y'] = rand_y\n", "df = rand_df[:%i]\n", "''' % sample_size\n", " \n", " if verbose:\n", " print('Sample size\\t', sample_size)\n", " nb_tries = 10\n", " time_vec = (timeit.Timer('triscale.analysis_metric(df, metric, convergence=convergence)', \n", " setup=setup).repeat(10, nb_tries))\n", " time_vec = np.array(time_vec)\n", " time_vec = (time_vec/nb_tries)*1000 # Convert to ms\n", " sample_size_vec = np.ones(nb_tries)*sample_size\n", " df_new = pd.DataFrame(sample_size_vec, columns=[columns[0]])\n", " df_new[columns[1]] = np.array(time_vec)\n", " df = pd.concat([df, df_new], sort=False)\n", " if verbose:\n", " print('Done.')\n", " print('------------------')\n", " \n", "fig = px.scatter(df, x =columns[0], y=columns[1])\n", "fig.show()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", "
\n", " \n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Zooming on for small sample sizes...\n", "fig = px.scatter(df, x =columns[0], y=columns[1],range_x=[0,1010], range_y=[0,30])\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data shows two modes in the execution time of the `analysis_metric()` function: a step increase, followed by a slow linear increase. This can be easily explained.\n", "\n", "The computationally expensive part of `analysis_metric()` is the convergence test, which includes a Theil-Sen regression [[ref]](https://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator). This regressor works by computing the slopes between *all pairs of points* and returns the median slope value; thus, the regressor scales with O(n²). \n", "\n", "However, TriScale does not perform the regression on the input data directly. Instead, TriScale divides the input data in chunks.\n", "For each chunk, the metric is computed to eventually create a new data series with metric values. The purpose of the convergence test is to verify that these metric values have converged; thus TriScale executes the Theil-Sen regressor on this new data series.\n", "\n", "The Theil-Sen regressor does not require many samples for producing a reliable result; a few tens of data points are often considered sufficient. Thus, we can cap the size of metric data series\n", "(TriScale caps it to 100 values), which bounds the execution time of the Theil-Sen regressor. Ultimately, this allows the `analysis_metric()` to scale very well with the sample size. \n", "\n", "The linear increase for large number of raw samples is due to the computation of the metric on increasingly large chuncks. The more complex the metric is, the longer the execution time.\n", "In this evaluation, a percentile is used as metric, which is computed efficiently by `numpy`.\n", "\n", "> * `analysis_metric()` is fast (less than 20ms) for small numbers of samples (less than 200)\n", "* For larger numbers of samples, `analysis_metric()` **scales with the same complexity as the metric computation**. \n", "In this example, the metric is a simple percentile, which appears to scale linearly when computed with `numpy`.\n", "Overall, in our experiments (running on a standard PC) `analysis_metric()` execution takes\n", " * **less than 20ms** for up to **1,000** data points,\n", " * **less than 50ms** for up to **10,000** data points,\n", " * **about 1s** for up to **one million** data points.\n", "> \n", "Thus, the data collection time depends on the networking experiment, but it is unlikely that any experiment would produce much more than a million of (useful) data points per second. \n", "Thus, we conclude that the computation time of the `analysis_metric()` is **negligible for networking experiments**.\n", "\n", "\n", "> _Note_ \n", "`analysis_metric()` takes either a DataFrame or a csv file as input. This has barely any impact on the execution time: either way, the input data is loaded into a DataFrame at the start of `analysis_metric()`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[ Back to top ]](#Scalability-Study)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `analysis_kpi`" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", " \n", " \n", "
\n", " \n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "verbose = False\n", "\n", "columns=['Sample size [#]', 'Execution time [ms]']\n", "df = pd.DataFrame(columns=columns)\n", "for sample_size in [10,20,30,40,50,100,200,300,400,600,800,1000]:\n", " setup = '''\n", "import random\n", "import triscale\n", "\n", "sample_length = %i\n", "percentile = int(random.random()*50)+1\n", "confidence = int(random.random()*25)+75\n", "KPI = { 'percentile': percentile, \n", " 'confidence': confidence, \n", " 'bounds' : [0,0.001],\n", " 'bound' : 'lower'}\n", "rand_data = [random.random() for i in range(sample_length)]\n", "''' % sample_size\n", " if verbose:\n", " print('Sample size\\t', sample_size)\n", " nb_tries = 10\n", " time_vec = (timeit.Timer('triscale.analysis_kpi(rand_data, KPI)', setup=setup).repeat(10, nb_tries))\n", " time_vec = np.array(time_vec)\n", " time_vec = (time_vec/nb_tries)*1000 # Convert to ms\n", " sample_size_vec = np.ones(nb_tries)*sample_size\n", " df_new = pd.DataFrame(sample_size_vec, columns=[columns[0]])\n", " df_new[columns[1]] = np.array(time_vec)\n", " df = pd.concat([df, df_new], sort=False)\n", " if verbose: \n", " print('Done.')\n", " print('------------------')\n", " \n", "fig = px.scatter(df, x =columns[0], y=columns[1])\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data shows a clear linear correlation between the sample size and the execution time of the `analysis_kpi()` function, which is not surprising: most computations are related to the determination of the confidence interval using Thompson's method, which is an iterative process through the ordered data samples (see `ThompsonCI()` docstring for details).\n", "\n", "> `analysis_kpi()` **scales linearly** with the number of samples. \n", "In our experiments (running on a standard PC), computation for a \n", "* sample size of 100 takes less than 10 ms, \n", "* sample size of 1000, takes less that 100ms. \n", "> \n", "Thus, we conclude that the computation time of the `analysis_kpi()` is **negligible for networking experiments**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[ Back to top ]](#Scalability-Study)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `analysis_variability`" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", "
\n", " \n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "verbose = False\n", "\n", "columns=['Sample size [#]', 'Execution time [ms]']\n", "df = pd.DataFrame(columns=columns)\n", "for sample_size in [10,20,30,40,50,100,200,300,400,600,800,1000]:\n", " setup = '''\n", "import random\n", "import triscale\n", "\n", "sample_length = %i\n", "percentile = int(random.random()*50)+1\n", "confidence = int(random.random()*25)+75\n", "score = { 'percentile': percentile, \n", " 'confidence': confidence}\n", "rand_data = [random.random() for i in range(sample_length)]\n", "''' % sample_size\n", " if verbose: \n", " print('Sample size\\t', sample_size)\n", " nb_tries = 10\n", " time_vec = (timeit.Timer('triscale.analysis_variability(rand_data, score)', setup=setup).repeat(10, nb_tries))\n", " time_vec = np.array(time_vec)\n", " time_vec = (time_vec/nb_tries)*1000 # Convert to ms\n", " sample_size_vec = np.ones(nb_tries)*sample_size\n", " df_new = pd.DataFrame(sample_size_vec, columns=[columns[0]])\n", " df_new[columns[1]] = np.array(time_vec)\n", " df = pd.concat([df, df_new], sort=False)\n", " if verbose: \n", " print('Done.')\n", " print('------------------')\n", " \n", "fig = px.scatter(df, x =columns[0], y=columns[1])\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data shows a clear linear correlation between the sample size and the execution time of the `analysis_variability()` function, which is not surprising: most computations are related to the determination of the confidence interval using Thompson's method, which is an iterative process through the ordered data samples (see `ThompsonCI()` docstring for details).\n", "\n", "Unsurprisingly, the data is very similar as for `analysis_kpi()`. `analysis_kpi()` and `analysis_variability()` do the same computations. They only differ in the generation of outputs (logs and plots). Since the outputs are not considered in this scalability evaluation, we obtain very similar results for both functions.\n", "\n", "> `analysis_variability()` **scales linearly** with the number of samples. \n", "In our experiments (running on a standard PC), computation for a \n", "* sample size of 100 takes less than 10 ms, \n", "* sample size of 1000 takes less than 100ms. \n", "> \n", "Thus, we conclude that the computation time of the `analysis_variability()` is **negligible for networking experiments**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[ Back to top ]](#Scalability-Study)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `experiment_sizing`" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", "
\n", " \n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "setup = '''\n", "import random \n", "import triscale\n", "\n", "percentile = int(random.random()*50)+1\n", "confidence = int(random.random()*25)+75\n", "'''\n", "nb_tries = 10\n", "time_vec = (timeit.Timer('triscale.experiment_sizing(percentile,confidence)', setup=setup).repeat(25000, nb_tries))\n", "time_vec = np.array(time_vec)\n", "time_vec /= nb_tries\n", "data = pd.DataFrame(time_vec, columns=['exec_time'])\n", "fig = px.scatter(data, y=\"exec_time\")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use TriScale to evaluate the expected execution time of the `experiment_sizing()` function. We choose a large percentile (95th) and a standard confidence level (95%). \n", "\n", "However, the scatter plot of the timing measurement indicates that the data is correlated (which is not suprising: the experiment is disturbed by other processes running on the PC)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(False, 2.87230359390378e-06)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "KPI = {'percentile': 95,\n", " 'confidence': 95,\n", " 'bounds' : [0,0.001],\n", " 'name': 'Execution time of TriScale experiment_sizing() function',\n", " 'unit': 's'}\n", "triscale.analysis_kpi(time_vec,\n", " KPI,)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indeed, TriScale indicates that data do not appear to be i.i.d., thus the analysis module cannot confidently compute a KPI.\n", "\n", "However, since the sample size is large, we can try to randomly subsample to dilute the correlation between the data points." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Subsampling factor\t 2\n", "Appear i.i.d.?\t\t False\n", "------------------\n", "Subsampling factor\t 4\n", "Appear i.i.d.?\t\t False\n", "------------------\n", "Subsampling factor\t 8\n", "Appear i.i.d.?\t\t False\n", "------------------\n", "Subsampling factor\t 16\n", "Appear i.i.d.?\t\t False\n", "------------------\n", "Subsampling factor\t 64\n", "Appear i.i.d.?\t\t False\n", "------------------\n", "Subsampling factor\t 128\n", "Appear i.i.d.?\t\t False\n", "------------------\n", "Subsampling factor\t 256\n", "Appear i.i.d.?\t\t False\n", "------------------\n", "Subsampling factor\t 512\n", "Appear i.i.d.?\t\t False\n", "------------------\n" ] } ], "source": [ "for k in [2,4,8,16,64,128,256,512]:\n", " sample_size = int(len(time_vec)/k)\n", " ix = random.sample(range(len(time_vec)), sample_size)\n", " ix = np.sort(ix)\n", " time_vec_sub = time_vec[ix]\n", " iid_test, KPI_value = triscale.analysis_kpi(time_vec_sub,\n", " KPI)\n", " print('Subsampling factor\\t ', k)\n", " print('Appear i.i.d.?\\t\\t ', iid_test)\n", " if iid_test:\n", " print('KPI value \\t\\t ', KPI_value)\n", " print('------------------')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With a subsampling factor of 256, we sometimes obtain sufficiently uncorrelated data to compute a KPI. This is not a determistic process though, since the subsampling is random. Even with large subsampling, there is still too much correlation in the remaining data ton confidently compute a KPI.\n", "\n", "Thus, we resume with a simpler _descriptive only_ metric: the 95th percentile of our data smaples." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.820801455527544e-06" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.percentile(time_vec, 95 , interpolation='nearest')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> The 95th percentile of execution time for the TriScale `experiment_sizing()` function is about \n", "**2.73us**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[ Back to top ]](#Scalability-Study)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }