{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TriScale\n", "\n", "## Scalability Study\n", "\n", "This notebook presents a scalability analysis of TriScale. \n", "We focus on the execution time to perform the computations; that is, the production of outputs other than the numerical results (e.g., textual logs, plots) is excluded from this evaluation. \n", "The results presented below have been obtained on a standard laptop.\n", "\n", "The evaluation results show that **TriScale data analysis scales very well with the input sizes**; the data analysis time is practically negligible compared to the data collection time. In particular, we show that:\n", "* For simple metrics such as percentiles, the execution of `analysis_metric()` generally \n", " * takes less than 50ms for up to 10'000 data points,\n", " * takes about **1s for up to one million data points**,\n", " * scales linearly.\n", "* Computing KPIs (`analysis_kpi()`) and variability scores (`analysis_variability()`) generally \n", " * takes less than 10ms for up to 100 data points, \n", " * takes less than **100ms for up to 1000 data points**, \n", " * scales linearly.\n", "* The computation of confidence intervals using Thompson's method is very efficient (as demonstrated by the scaling of `analysis_kpi()` and `analysis_variability()`). 
\n", "Furthermore, computing the minimal number of runs/series required (implemented by `experiment_sizing()`) is generally independent of the percentile and confidence level, and completes within **less than 3µs** 95% of the time.\n", "\n", "## Menu\n", "\n", "* [List of Imports](#List-of-Imports)\n", "* [analysis_metric()](#analysis_metric)\n", "* [analysis_kpi()](#analysis_kpi)\n", "* [analysis_variability()](#analysis_variability)\n", "* [experiment_sizing()](#experiment_sizing)\n", "\n", "## List of Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "from pathlib import Path\n", "import random\n", "import timeit\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import plotly.express as px\n", "import plotly.io as pio\n", "pio.renderers.default = \"notebook\"\n", "\n", "import triscale" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `analysis_metric`\n", "\n", "The following cell generates random data and saves it to CSV files. This can be used to reproduce, in a controlled way, the timing evaluation of the `analysis_metric()` function when taking a file as input. \n", "\n", "This evaluation of `analysis_metric()` uses DataFrames as inputs. One can verify that using DataFrames or CSV files as inputs has barely any impact on the execution time: either way, the input data is loaded into a DataFrame at the start of `analysis_metric()`." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] } ], "source": [ "# Generation of synthetic data\n", "\n", "## Create a random DataFrame\n", "rand_x = [random.random()*100 for _ in range(1000000)]\n", "rand_y = [random.random()*100 for _ in range(1000000)]\n", "rand_df = pd.DataFrame(columns=['x','y'])\n", "rand_df['x'] = np.sort(rand_x)\n", "rand_df['y'] = rand_y\n", "\n", "## Save chunks of it as CSV files\n", "file_path = Path('ExampleTraces/Scalability')\n", "file_path.mkdir(parents=True, exist_ok=True)\n", " \n", "sample_sizes = [20,50,100,150,200,300,400,500,1000,5000,10000,200000,400000,600000,800000,1000000]\n", "for sample_size in sample_sizes:\n", " file_name = 'synthetic_%i_samples.csv' % sample_size\n", " df_chunk = rand_df[:sample_size]\n", " df_chunk.to_csv(str(file_path/file_name), index=False) \n", "print('Done.') " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "