{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*\n",
">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=KS_profiling)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=KS_profiling) to leverage the power of whylogs and WhyLabs together!*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Benchmark - KS Test"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/benchmarks/KS_Profiling.ipynb)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"> This notebook is a complement to the blog post [_Understanding Kolmogorov-Smirnov (KS) Tests for Data Drift on Profiled Data_](https://medium.com/p/5c8317796f78).\n",
">\n",
">Please refer to the blog post for additional context."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We applied whylogs' implementation of the KS test, which operates on data profiles, and compared its results with those of the same test applied to the complete dataset. The results allow us to discuss the limitations of data profiling for KS drift detection, as well as the pros and cons of the KS algorithm itself in different scenarios."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Experiment Design"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we need the data. For this experiment, we will take two samples of equal size from each of the following distributions:\n",
"\n",
"- Normal: Broad class of data. Unskewed and peaked around the center\n",
"\n",
"- Pareto: Skewed data with long tail/outliers\n",
"\n",
"- Uniform: Evenly sampled across its domain\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Drift Injection\n",
"\n",
"Next, we’ll inject drift into one sample (which we’ll call the target distribution) to compare it to the reference, unaltered, distribution.\n",
"\n",
"We will inject drift artificially by shifting the data by a fixed offset. The offset is controlled by a drift magnitude parameter, expressed as a ratio of the distribution’s interquartile range.\n",
"\n",
"The idea is to have 4 different scenarios:\n",
"\n",
"- No drift\n",
"\n",
"- Small drift\n",
"\n",
"- Medium drift\n",
"\n",
"- Large drift\n",
"\n",
"The ideal process of detecting/alerting for drifts can be very subjective, depending on the desired sensitivity for your particular application. In this case, we are assuming that the small-drift scenario is small enough for it to be safely ignored. We are also expecting that the medium and large drift scenarios should result in a drift alert since both would be cases for further inspection. \n",
"\n"
]
},
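{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the shift described above, the sketch below injects drift into a standard normal sample with plain NumPy. The helper name `inject_drift`, the seed, and the sample size are assumptions made for this example only, and for simplicity it shifts the reference sample itself rather than a second, independent sample.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"\n",
"def inject_drift(sample: np.ndarray, drift_magnitude: float) -> np.ndarray:\n",
"    \"\"\"Shift every value by drift_magnitude times the sample's interquartile range.\"\"\"\n",
"    q1, q3 = np.percentile(sample, [25, 75])\n",
"    return sample + (q3 - q1) * drift_magnitude\n",
"\n",
"\n",
"rng = np.random.default_rng(0)  # arbitrary seed, for illustration only\n",
"reference = rng.standard_normal(10_000)\n",
"\n",
"# One magnitude per scenario: no, small, medium, and large drift\n",
"for magnitude in [0, 0.05, 0.4, 0.75]:\n",
"    target = inject_drift(reference, magnitude)\n",
"    # The difference in means equals the injected offset\n",
"    print(magnitude, round(target.mean() - reference.mean(), 3))\n",
"```\n",
"\n",
"Because the offset is expressed relative to the interquartile range, the same magnitude yields a comparable amount of drift regardless of the distribution's scale."
]
},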
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Applying the KS test\n",
"\n",
"As the ground truth, we will use scipy’s implementation of the two-sample KS test, with the complete data from both samples. We will then compare those results with the profiled version of the test. To do so, we’ll use whylogs’ implementation of the same test, which uses only the statistical profile of each sample.\n",
"\n",
"The distribution metrics contained in the profiles are obtained from a process called data sketching, which adds some degree of randomness to the result, so the KS test result can differ each time a profile is generated. To account for this, we’ll profile the data 10 times for every scenario and compare the ground truth to statistics such as the mean, maximum, and minimum of those runs.\n"
]
},
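{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the ground truth more concrete, the sketch below computes the two-sample KS statistic by hand, as the largest gap between the two empirical CDFs, and prints it next to the value returned by `scipy.stats.ks_2samp`. The helper name `ks_statistic`, the seed, the sample sizes, and the shift are assumptions made for this example; it does not show whylogs' profiled version of the test.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"\n",
"def ks_statistic(a: np.ndarray, b: np.ndarray) -> float:\n",
"    \"\"\"Largest absolute difference between the empirical CDFs of a and b.\"\"\"\n",
"    points = np.concatenate([a, b])\n",
"    cdf_a = np.searchsorted(np.sort(a), points, side=\"right\") / len(a)\n",
"    cdf_b = np.searchsorted(np.sort(b), points, side=\"right\") / len(b)\n",
"    return float(np.max(np.abs(cdf_a - cdf_b)))\n",
"\n",
"\n",
"rng = np.random.default_rng(0)  # arbitrary seed, for illustration only\n",
"reference = rng.standard_normal(5_000)\n",
"target = rng.standard_normal(5_000) + 0.5  # arbitrary shift\n",
"\n",
"result = stats.ks_2samp(reference, target)\n",
"# The two statistics should agree up to floating-point error\n",
"print(ks_statistic(reference, target), result.statistic, result.pvalue)\n",
"```\n",
"\n",
"A small p-value indicates that the two samples are unlikely to come from the same distribution, which is the signal used here to flag drift."
]
},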
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Experiment Variables\n",
"\n",
"Our main goal is to answer:\n",
"\n",
"“How does whylogs’ KS implementation compare to scipy’s implementation?”\n",
"\n",
"However, the answer depends on several variables. We will run three separate experiments to better understand the effect of each one: data volume, number of buckets, and profile size. The first relates to the number of samples being tested, whereas the last two relate to whylogs’ internal, tunable parameters.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installing the required packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Note: you may need to restart the kernel to use updated packages.\n",
"%pip install 'matplotlib' 'numpy' 'seaborn==0.12.1'\n",
"%pip install 'scipy==1.7.3' 'whylogs[viz]' 'typing_extensions'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parameters and Functions - Experiments and Plots"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section, we'll define all of the parameters and functions required to run the experiments and plot the results."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# For experiment #2, we run the test with different numbers of buckets for the KS test calculation\n",
"QUANTILE_LIST = [\n",
"    list(np.linspace(0,1,5)),\n",
"    list(np.linspace(0,1,10)),\n",
"    list(np.linspace(0,1,50)),\n",
"    list(np.linspace(0,1,100)),\n",
"]\n",
"\n",
"# no drift, small drift, medium drift and large drift\n",
"drift_magnitudes = [0,0.05,0.4,0.75]\n",
"\n",
"sample_sizes = [500,1000,5000,10000,50000]\n",
"sample_sizes_labels = ['500','1k', '5k','10k','50k']\n",
"random_seed = 22"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from numpy.random import Generator, PCG64\n",
"import pandas as pd\n",
"from scipy.stats import iqr\n",
"\n",
"def generate_data(\n",
"    distribution: str = \"normal\", generator=None, drift_magnitude: float = 0, size: int = 100000\n",
") -> pd.Series:\n",
"    \"\"\"Generates a pandas series with samples drawn from a distribution (normal, pareto or uniform). The internal parameters\n",
"    for each distribution are fixed. You can specify the number of samples you want and also if a drift of a certain magnitude is\n",
"    to be injected. The drift magnitude is a ratio of the distribution's interquartile range.\n",
"\n",
"    \"\"\"\n",
"    if generator is None:\n",
"        generator = Generator(PCG64(12345))\n",
"\n",
"    if distribution == \"normal\":\n",
"        sample = generator.standard_normal(size)\n",
"    elif distribution == \"pareto\":\n",
"        a,m = 7.,2.\n",
"        sample = (generator.pareto(a, size) + 1) * m\n",
"    elif distribution == \"uniform\":\n",
"        sample = generator.uniform(-5,5,size)\n",
"    else:\n",
"        raise ValueError(\"Distribution not found.\")\n",
"    # shift the whole sample by a fraction of its interquartile range\n",
"    offset = (iqr(sample)) * drift_magnitude\n",
"    drifted_sample = sample + offset\n",
"    return pd.Series(drifted_sample)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from numpy.random import Generator, PCG64\n",
"import numpy as np\n",
"import pandas as pd\n",
"import whylogs as why\n",
"from whylogs.viz.utils.drift_calculations import calculate_drift_values, _compute_ks_test_p_value\n",
"from scipy import stats\n",
"\n",
"def run_ks_experiment(distribution=\"normal\",drift_magnitude=0, quantile_list = QUANTILE_LIST):\n",
"    \"\"\"Runs KS experiment for given distribution type and drift magnitude. The experiments are run\n",
"    for multiple sample sizes and number of buckets, 10 times for each combination of parameters.\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    distribution : str, optional\n",
"        distribution type. \"normal\",\"pareto\", or \"uniform\", by default \"normal\"\n",
"    drift_magnitude : float, optional\n",
"        drift magnitude, by default 0 (no drift)\n",
"    quantile_list : list, optional\n",
"        lists of quantiles (buckets) to use for the KS calculation, by default QUANTILE_LIST\n",
"\n",
"    Returns\n",
"    -------\n",
"    dict\n",
"        Dictionary with statistics such as pvalues for whylogs and scipy's implementation, mean absolute errors,\n",
"        and error ranges (minimum and maximum) for whylogs KS test.\n",
"    \"\"\"\n",
"    # size_list = [1000, 5000, 10000, 50000, 100000]\n",
"    size_list = sample_sizes\n",
"    experiment_results = {}\n",
"    for quant_i,QUANTILES in enumerate(quantile_list):\n",
"        bars1 = []\n",
"        heights1 = []\n",
"        pv_means = []\n",
"        pv_ranges = []\n",
"        pv_truths = []\n",
"        for sample_size in size_list:\n",
"            rng = Generator(PCG64(random_seed))\n",
"            ref, target = pd.DataFrame(), pd.DataFrame()\n",
"            ref[\"col\"] = generate_data(distribution=distribution, generator=rng, drift_magnitude=0, size=sample_size)\n",
"            target[\"col\"] = generate_data(distribution=distribution, generator=rng, drift_magnitude=drift_magnitude, size=sample_size)\n",
"            # ground truth: scipy's KS test on the complete data\n",
"            scipy_res = stats.ks_2samp(ref[\"col\"], target[\"col\"], mode='asymp')\n",
"            scipy_pvalue = scipy_res.pvalue\n",
"            errors = []\n",
"            pvalues = []\n",
"            # profiling is not deterministic, so repeat it 10 times\n",
"            for i in range(10):\n",
"                ref_profile = why.log(ref).profile()\n",
"                target_profile = why.log(target).profile()\n",
"\n",
"                ref_dist = ref_profile._columns[\"col\"]._metrics[\"distribution\"].kll.value\n",
"                target_dist = target_profile._columns[\"col\"]._metrics[\"distribution\"].kll.value\n",
"\n",
"                res = _compute_ks_test_p_value(reference_distribution=ref_dist, target_distribution=target_dist, quantiles=QUANTILES)\n",
"                pv = res[\"p_value\"]\n",
"                error = abs(res[\"p_value\"] - scipy_pvalue)\n",
"                errors.append(error)\n",
"                pvalues.append(pv)\n",
"            mean = sum(errors) / len(errors)\n",
"            mean_pv = sum(pvalues) / len(pvalues)\n",
"\n",
"            range_pv = [abs(mean_pv-min(pvalues)),abs(mean_pv-max(pvalues))]\n",
"\n",
"            error = [abs(mean-min(errors)),abs(mean-max(errors))]\n",
"\n",
"            pv_truth = scipy_pvalue\n",
"            pv_truths.append(pv_truth)\n",
"            bars1.append(mean)\n",
"            heights1.append(error)\n",
"\n",
"            pv_means.append(mean_pv)\n",
"            pv_ranges.append(range_pv)\n",
"\n",
"        y_err = [[x[0] for x in heights1],[x[1] for x in heights1]]\n",
"        pv_rg = [[x[0] for x in pv_ranges],[x[1] for x in pv_ranges]]\n",
"\n",
"        experiment_results[quant_i] = {}\n",
"        experiment_results[quant_i]['bar'] = bars1\n",
"        experiment_results[quant_i]['yerr'] = y_err\n",
"        experiment_results[quant_i]['pv_means'] = pv_means\n",
"        experiment_results[quant_i]['pv_ranges'] = pv_rg\n",
"        experiment_results[quant_i]['pv_truths'] = pv_truths\n",
"        experiment_results[quant_i]['label'] = \"buckets={}\".format(len(QUANTILES))\n",
"        experiment_results[quant_i]['distribution'] = distribution\n",
"    return experiment_results\n",
"\n",
"def run_experiment_on_params(distribution=\"normal\",magnitudes = drift_magnitudes):\n",
"    \"\"\"\n",
"    Runs the KS experiment for a list of different drift magnitudes\n",
"    \"\"\"\n",
"    exps_results = {}\n",
"\n",
"    for drift_magnitude in magnitudes:\n",
"        exps_results[drift_magnitude] = run_ks_experiment(distribution=distribution, drift_magnitude=drift_magnitude)\n",
"\n",
"    return exps_results"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from numpy.random import Generator, PCG64\n",
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib\n",
"\n",
"def plot_drift(distribution=\"normal\", magnitudes = [0,0.05,0.4,0.75], sample_size = 100000):\n",
"    \"\"\"Plots 4 histograms in a 2x2 matrix for different drift magnitudes, for the given distribution type\n",
"    and number of samples.\n",
"    \"\"\"\n",
"\n",
"    rng = Generator(PCG64(random_seed))\n",
"    fig, axs = plt.subplots(2,2)\n",
"    fig.tight_layout(pad=1.5)\n",
"    for ix,drift_magnitude in enumerate(magnitudes):\n",
"        x,y = int(ix%2),int(ix/2)\n",
"\n",
"        df = pd.DataFrame()\n",
"\n",
"        df[\"ref\"] = generate_data(distribution=distribution, generator=rng, drift_magnitude=0, size=sample_size)\n",
"        df[\"target\"] = generate_data(distribution=distribution, generator=rng, drift_magnitude=drift_magnitude, size=sample_size)\n",
"\n",
"        # set a grey background (use sns.set_theme() if seaborn version 0.11.0 or above)\n",
"        sns.set(style=\"darkgrid\")\n",
"\n",
"        sns.histplot(ax=axs[x][y],data=df, x=\"ref\", color=\"skyblue\", label=\"ref\", kde=False)\n",
"        sns.histplot(ax=axs[x][y],data=df, x=\"target\", color=\"red\", label=\"target\", kde=False)\n",
"        if ix==2:\n",
"            axs[x][y].legend(loc=0, prop={'size': 12})\n",
"        axs[x][y].set_xlabel('Drift Size:{}'.format(drift_magnitude))\n",
"\n",
"    fig.text(.5, -0.05, \"Artificial drift injection for varying drift magnitudes for {} distribution. Number of samples: {}\".format(distribution,sample_size), ha='center',fontsize=10)\n",
"    plt.show()\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def plot_pvalues(quant_bars_per_drift):\n",
"    \"\"\"Plots the pvalues for both whylogs and scipy's implementation for\n",
"    different drift magnitudes. Expects 4 different drift magnitudes to\n",
"    plot a 2x2 matrix.\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    quant_bars_per_drift : dict\n",
"        Statistics previously collected by KS experiment.\n",
"    \"\"\"\n",
"    matplotlib.rc_file_defaults()\n",
"    fig, axs = plt.subplots(2,2)\n",
"    fig.tight_layout(pad=1.0)\n",
"\n",
"    for i,key in enumerate(quant_bars_per_drift):\n",
"        quant_bars = quant_bars_per_drift[key]\n",
"        # whylogs by default uses 100 buckets, so let's choose that\n",
"        quant_index = 3\n",
"        x,y = int(i%2),int(i/2)\n",
"        pv_means = quant_bars[quant_index]['pv_means']\n",
"        pv_range = quant_bars[quant_index]['pv_ranges']\n",
"        pv_truth = quant_bars[quant_index]['pv_truths']\n",
"\n",
"        r1 = np.arange(len(pv_means))\n",
"        axs[x][y].set_xticks([r for r in range(len(quant_bars[quant_index]['bar']))], sample_sizes_labels)\n",
"        axs[x][y].errorbar(r1, pv_means, yerr=pv_range, capsize=7, label=\"whylogs\")\n",
"        axs[x][y].set_ylabel('pvalue')\n",
"        axs[x][y].plot(r1,pv_truth, label=\"scipy\")\n",
"        if i==2:\n",
"            axs[x][y].legend(loc=1, prop={'size': 12})\n",
"        axs[x][y].set_xlabel('Drift Size:{}'.format(key))\n",
"        axs[x][y].set_ylim(bottom=0)\n",
"\n",
"    fig.text(.5, -0.05, \"KS pvalue comparison between whylogs and scipy. K=1024, 100 buckets, {} distribution\".format(quant_bars[quant_index]['distribution']), ha='center',fontsize=10)\n",
"    plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"\n",
"def plot_buckets_experiment(quant_bars_per_drift):\n",
"    \"\"\"Plots the mean absolute errors for different number of buckets for\n",
"    different drift magnitudes. Expects 4 different drift magnitudes to\n",
"    plot a 2x2 matrix. The errors are calculated based on the difference\n",
"    between whylogs' and scipy's KS implementation.\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    quant_bars_per_drift : dict\n",
"        Statistics previously collected by KS experiment.\n",
"    \"\"\"\n",
"    matplotlib.rc_file_defaults()\n",
"    fig, axs = plt.subplots(2,2)\n",
"    fig.tight_layout(pad=1.0)\n",
"    barWidth = 0.17\n",
"    # this plots 4 subplots in a 2x2 matrix\n",
"    assert len(quant_bars_per_drift)==4\n",
"    for i,drift_magnitude in enumerate(quant_bars_per_drift):\n",
"        quant_bars = quant_bars_per_drift[drift_magnitude]\n",
"        x,y = int(i%2),int(i/2)\n",
"\n",
"        # one group of bars per bucket configuration\n",
"        for ix,key in enumerate(quant_bars):\n",
"            r1 = np.arange(len(quant_bars[key]['bar']))\n",
"            r2 = [x+ix*barWidth for x in r1]\n",
"            axs[x][y].bar(r2, quant_bars[key]['bar'], width = barWidth, edgecolor = 'black', yerr=quant_bars[key]['yerr'], label=quant_bars[key]['label'])\n",
"\n",
"        # general layout\n",
"        axs[x][y].tick_params(axis='both', which='major', labelsize=6)\n",
"        axs[x][y].set_xticks([r+1.5*barWidth for r in range(len(quant_bars[key]['bar']))], sample_sizes_labels)\n",
"        axs[x][y].set_ylabel('error')\n",
"        if i==2:\n",
"            axs[x][y].legend(loc=1, prop={'size': 6})\n",
"        axs[x][y].set_xlabel('Drift Size:{}'.format(drift_magnitude))\n",
"        axs[x][y].set_ylim(bottom=0)\n",
"    fig.text(.5, -0.05, \"KS pvalue mean abs. error according to sample size for {} distribution and varying drift magnitudes. K=1024\".format(quant_bars[key]['distribution']), ha='center',fontsize=10)\n",
"\n",
"    plt.show()\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from whylogs.core.resolvers import Resolver\n",
"from whylogs.core.datatypes import DataType\n",
"from typing import Dict, List\n",
"from typing_extensions import TypedDict\n",
"from whylogs.core.metrics import StandardMetric\n",
"from whylogs.core.metrics.metrics import Metric\n",
"import whylogs as why\n",
"from whylogs.core import DatasetSchema\n",
"from whylogs.core import MetricConfig\n",
"import io\n",
"import pandas as pd\n",
"\n",
"\n",
"class DataFrameSize(TypedDict):\n",
"    sample_frac: float\n",
"    number_samples: int\n",
"    number_bytes: int\n",
"\n",
"\n",
"class MyCustomResolver(Resolver):\n",
"    \"\"\"Resolver that assigns DistributionMetric to every column (which is ok because we only have one numerical column).\"\"\"\n",
"\n",
"    def resolve(self, name: str, why_type: DataType, column_schema) -> Dict[str, Metric]:\n",
"        metrics: List[StandardMetric] = [StandardMetric.distribution]\n",
"\n",
"        result: Dict[str, Metric] = {}\n",
"        for m in metrics:\n",
"            result[m.name] = m.zero(column_schema.cfg)\n",
"        return result\n",
"\n",
"def get_parquet_size(df: pd.DataFrame, frac) -> DataFrameSize:\n",
"    \"\"\"Get the size in bytes of a sampled pandas DF serialized\n",
"    in parquet format. This is only used when comparing\n",
"    profiling vs. sampling results.\n",
"\n",
"    Returns\n",
"    -------\n",
"    DataFrameSize\n",
"        sampling fraction, number of samples and number of bytes of the serialized dataframe.\n",
"    \"\"\"\n",
"    with io.BytesIO() as buffer:\n",
"        df_sampled = df.sample(frac=frac)\n",
"        df_sampled.to_parquet(buffer)\n",
"        number_bytes = buffer.tell()\n",
"        sample_frac = frac\n",
"        number_samples = len(df_sampled)\n",
"    return DataFrameSize(sample_frac=sample_frac,number_bytes=number_bytes,number_samples=number_samples)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from numpy.random import Generator, PCG64\n",
"import pandas as pd\n",
"\n",
"def get_kll_size_map(kll_list:list):\n",
"    \"\"\"\n",
"    This is just to find the proper number of samples and size in bytes to match sizes between profiles and sampled dataframes.\n",
"    Distribution type does not affect significantly the output.\n",
"    \"\"\"\n",
"    rng = Generator(PCG64(42))\n",
"    distribution=\"normal\"\n",
"    sample_size = 100000\n",
"    drift_magnitude = 0\n",
"    ref, target = pd.DataFrame(), pd.DataFrame()\n",
"    ref[\"col\"] = generate_data(distribution=distribution, generator=rng, drift_magnitude=0, size=sample_size)\n",
"    kllSizeMap = {}\n",
"    for kll_val in kll_list:\n",
"        ref_view = why.log(ref, schema=DatasetSchema(default_configs=MetricConfig(kll_k_large=kll_val),resolvers=MyCustomResolver())).profile().view()\n",
"        ref_size = len(ref_view.serialize())\n",
"        closest_sample = DataFrameSize()\n",
"        distance = 10000000\n",
"        # find the sampling fraction whose parquet size is closest to the profile size\n",
"        for frac in list(np.linspace(0.0001,0.2,200)):\n",
"            sample_size = get_parquet_size(ref,frac=frac)\n",
"            if abs(ref_size-sample_size[\"number_bytes\"]) < distance:\n",
"                closest_sample = sample_size\n",
"                distance = abs(ref_size-sample_size[\"number_bytes\"])\n",
"        kllSizeMap[kll_val] = closest_sample\n",
"    return kllSizeMap\n",
"\n",
"kll_list = [256,512,1024,2048,4096]\n",
"kllSizeMap = get_kll_size_map(kll_list=kll_list)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"from numpy.random import Generator, PCG64\n",
"import pandas as pd\n",
"from scipy import stats\n",
"import whylogs as why\n",
"from whylogs.core import DatasetSchema, MetricConfig\n",
"from whylogs.viz.utils.drift_calculations import calculate_drift_values\n",
"\n",
"def run_kll_experiment(distribution=\"normal\", magnitudes = [0,0.01,0.05,0.1], kllSizeMap:dict = {}):\n",
"    \"\"\"Runs experiment on different KLL_K parameters.\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    distribution : str, optional\n",
"        distribution type. normal, pareto or uniform, by default \"normal\"\n",
"    magnitudes : list, optional\n",
"        drift magnitudes, by default [0,0.01,0.05,0.1]\n",
"    kllSizeMap : dict, optional\n",
"        Relation between profile and sampled df for a given KLL_K, by default {}\n",
"\n",
"    Returns\n",
"    -------\n",
"    dict\n",
"        Statistics such as errors and size in KB for different KLL_K parameters for profile case,\n",
"        and for dfs sampled on different ratios for sampling case.\n",
"    \"\"\"\n",
"    kll_list = [int(key) for key in kllSizeMap]\n",
"    sample_size = 100000\n",
"    kll_bars = {}\n",
"    for drift_magnitude in magnitudes:\n",
"        rng = Generator(PCG64(random_seed))\n",
"        kll_bars[drift_magnitude] = {}\n",
"        ref, target = pd.DataFrame(), pd.DataFrame()\n",
"        ref[\"col\"] = generate_data(distribution=distribution, generator=rng, drift_magnitude=0, size=sample_size)\n",
"        target[\"col\"] = generate_data(distribution=distribution, generator=rng, drift_magnitude=drift_magnitude, size=sample_size)\n",
"\n",
"        for kll_val in kll_list:\n",
"            kll_bars[drift_magnitude][kll_val] = {}\n",
"            # ground truth: scipy's KS test on the complete data\n",
"            scipy_res = stats.ks_2samp(ref[\"col\"], target[\"col\"], mode='asymp')\n",
"            scipy_pvalue = scipy_res.pvalue\n",
"\n",
"            profiled_pvalues = []\n",
"            sampled_pvalues = []\n",
"            for i in range(10):\n",
"                ref_profile = why.log(ref, schema=DatasetSchema(default_configs=MetricConfig(kll_k_large=kll_val),resolvers=MyCustomResolver())).profile()\n",
"                ref_view = ref_profile.view()\n",
"\n",
"                target_profile = why.log(target, schema=DatasetSchema(default_configs=MetricConfig(kll_k_large=kll_val),resolvers=MyCustomResolver())).profile()\n",
"                target_view = target_profile.view()\n",
"\n",
"                profiled_pvalue = calculate_drift_values(target_view=target_view, reference_view=ref_view)['col']['p_value']\n",
"\n",
"                ref_sampled = ref.sample(frac=kllSizeMap[kll_val]['sample_frac'])\n",
"                target_sampled = target.sample(frac=kllSizeMap[kll_val]['sample_frac'])\n",
"\n",
"                sampled_pvalue = stats.ks_2samp(ref_sampled[\"col\"], target_sampled[\"col\"],mode='asymp').pvalue\n",
"\n",
"                profiled_pvalues.append(profiled_pvalue)\n",
"                sampled_pvalues.append(sampled_pvalue)\n",
"            size_bytes = kllSizeMap[kll_val]['number_bytes']\n",
"            size_kb = int(size_bytes/1000)\n",
"            profile_errors = [abs(pv-scipy_pvalue) for pv in profiled_pvalues]\n",
"            sample_errors = [abs(pv-scipy_pvalue) for pv in sampled_pvalues]\n",
"\n",
"            profile_mean = sum(profile_errors)/len(profile_errors)\n",
"            sample_mean = sum(sample_errors)/len(sample_errors)\n",
"\n",
"            range_profile_errors = [abs(profile_mean-min(profile_errors)),abs(profile_mean-max(profile_errors))]\n",
"            range_sample_errors = [abs(sample_mean-min(sample_errors)),abs(sample_mean-max(sample_errors))]\n",
"\n",
"            kll_bars[drift_magnitude][kll_val]['size_kb'] = size_kb\n",
"            kll_bars[drift_magnitude][kll_val]['profile_bars'] = profile_mean\n",
"            kll_bars[drift_magnitude][kll_val]['sample_bars'] = sample_mean\n",
"            kll_bars[drift_magnitude][kll_val]['profile_yerr'] = range_profile_errors\n",
"            kll_bars[drift_magnitude][kll_val]['sample_yerr'] = range_sample_errors\n",
"            kll_bars[drift_magnitude][kll_val]['distribution'] = distribution\n",
"    return kll_bars"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"\n",
"def plot_kll_experiment(bars_per_kll_per_drift, include_sampling=True):\n",
"    \"\"\"Plots kll experiments for different magnitudes (expects 4).\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    bars_per_kll_per_drift : dict\n",
"        KLL Experiment results.\n",
"    include_sampling : bool, optional\n",
"        If false, plots only errors for the profiling case, by default True\n",
"    \"\"\"\n",
"    matplotlib.rc_file_defaults()\n",
"    fig, axs = plt.subplots(2,2)\n",
"    fig.tight_layout(pad=1.0)\n",
"    barWidth = 0.17\n",
"    for i,drift_magnitude in enumerate(bars_per_kll_per_drift):\n",
"        x,y = int(i%2),int(i/2)\n",
"        bars_per_kll = bars_per_kll_per_drift[drift_magnitude]\n",
"        profile_bars = [bars_per_kll[key]['profile_bars'] for key in bars_per_kll]\n",
"        sample_bars = [bars_per_kll[key]['sample_bars'] for key in bars_per_kll]\n",
"\n",
"        profile_error = [[bars_per_kll[key]['profile_yerr'][0] for key in bars_per_kll],[bars_per_kll[key]['profile_yerr'][1] for key in bars_per_kll]]\n",
"        size_kb = [\"{} KB\".format(bars_per_kll[key]['size_kb']) for key in bars_per_kll]\n",
"\n",
"        r1 = np.arange(len(profile_bars))\n",
"        r2 = [x + barWidth for x in r1]\n",
"\n",
"        if include_sampling:\n",
"            sample_error = [[bars_per_kll[key]['sample_yerr'][0] for key in bars_per_kll],[bars_per_kll[key]['sample_yerr'][1] for key in bars_per_kll]]\n",
"            axs[x][y].bar(r2, sample_bars, color='tab:blue', yerr = sample_error, width=barWidth,label='Sampled')\n",
"            axs[x][y].set_xticks([r+1*barWidth for r in range(len(profile_bars))], size_kb)\n",
"        else:\n",
"            axs[x][y].set_xticks([r for r in range(len(profile_bars))], size_kb)\n",
"        axs[x][y].tick_params(axis='both', which='major', labelsize=6)\n",
"        axs[x][y].bar(r1, profile_bars, color='tab:orange', yerr = profile_error, width=barWidth,label='Profiled')\n",
"        axs[x][y].set_xlabel('Drift Size:{}'.format(drift_magnitude))\n",
"        axs[x][y].set_ylabel('error')\n",
"        axs[x][y].set_ylim(bottom=0)\n",
"\n",
"        if i==2:\n",
"            axs[x][y].legend(loc=1)\n",
"    first_key = list(bars_per_kll.keys())[0]\n",
"    if include_sampling:\n",
"        fig.text(.5, -0.05, \"KS pvalue mean abs. error comparison between profiling and sampling. {} distribution. Sample size = 100 000.\".format(bars_per_kll[first_key]['distribution']), ha='center',fontsize=10)\n",
"    else:\n",
"        fig.text(.5, -0.05, \"KS pvalue mean abs. error for profiling. {} distribution. Sample size = 100 000.\".format(bars_per_kll[first_key]['distribution']), ha='center',fontsize=10)\n",
"\n",
"    plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Normal Distribution"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Drift Injection"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"