{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Aggregating Percentiles\n", "\n", "There's a lot of articles explaning how percentiles can't actually be aggregated without loss of accuracy, so I won't rehash that here. \n", "\n", "But, there is a good question around what do when you have no choice but to aggregate them. In that case, what type of aggregation is best? \n", "\n", "To begin, let's provide some basic functionality that will help us explore various scenarios." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# for templating and display\n", "from IPython.display import display\n", "from tabulate import tabulate\n", "\n", "from math import ceil\n", "\n", "def percentile(percentage, data_points):\n", " \"\"\" return the point that is at the desired percentage \"\"\"\n", " data_points = sorted(data_points)\n", " point_at_percentage = ceil(len(data_points) * percentage) - 1\n", " return data_points[point_at_percentage]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can test our percentile with:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9867958186960467" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import random\n", "\n", "# we seed with a constant, to reproduce results. but this notebook should work\n", "# with any value or random generator\n", "random.seed(0)\n", "SAMPLE_SIZE = 1000\n", "\n", "random_points = [random.random() for _ in range(SAMPLE_SIZE)]\n", "\n", "# should end up being around 0.99, since the points are random across 0 - 100\n", "percentile(0.99, random_points)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we've verified this function works. Now comes the fun part. Let's provide a way to randomly distribute these values across multiple bins." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([[20.78373073698856],\n", " [2.0644462457050095],\n", " [1.6599971633412063],\n", " [10.182311868152327],\n", " [39.20323829703609],\n", " [16.221512415891407],\n", " [32.322723393505704],\n", " [3.400298651748122],\n", " [3.7147164248552023],\n", " [7.201239971993294]],\n", " [20.78373073698856,\n", " 2.0644462457050095,\n", " 1.6599971633412063,\n", " 10.182311868152327,\n", " 39.20323829703609,\n", " 16.221512415891407,\n", " 32.322723393505704,\n", " 3.400298651748122,\n", " 3.7147164248552023,\n", " 7.201239971993294])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def exponential_random():\n", " \"\"\" \n", " return a number on the exponential scale,\n", " to emulate web service traffic.\n", " \n", " this will produce a sharp exponent from\n", " 1 - 100\n", " \"\"\"\n", " return 10 ** (random.random() * 2)\n", "\n", "def create_bins(count, points):\n", " \"\"\" return bins, with random points per bin \"\"\"\n", " all_points = []\n", " bins = []\n", " for _ in range(count):\n", " b = []\n", " for _ in range(points):\n", " p = exponential_random()\n", " all_points.append(p)\n", " b.append(p)\n", " bins.append(b)\n", " return bins, all_points\n", "\n", "create_bins(10, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we need a way to test these aggregations under various conditions. 
\n", "There's a few real-world scenarios that our functions should handle:\n", "\n", "### Low Data Per Bin\n", "\n", "It may happen that your percentile bins are not that accurate themselves, due to a low volume of data. For example, if each bin only recieves 10 data points, a p99 calculation cannot be accurate, as the last data point represents the worst 10 percent of the points.\n", "\n", "This scenario occurs when tracking latencies for web services. If your service is spread across multiple hosts and percentiles are calculated on a per-host basis, and you have a low-volume endpoint, often the number of requests in a short time frame (one second, even one minute) will not capture enough data.\n", "\n", "### Single Bin\n", "\n", "A single bin effectively reduces the scenario to the accuracy of the percentile itself. We won't test this scenario since no aggregation is really performed in this case.\n", "\n", "So let's write some test code that modulates:\n", "\n", "* the percentile we're trying to aggregate\n", "* the number of bins we're aggregating\n", "* the points per bin" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def test_percentile(aggregation_func, percentage, bin_count, point_count):\n", " \"\"\" test our aggregation functionality, under different circumstances \"\"\"\n", " bins, points = create_bins(bin_count, point_count)\n", " percentiles = [\n", " percentile(percentage, bin_points) for bin_points in bins\n", " ]\n", " actual_value = percentile(percentage, points)\n", " return aggregation_func(percentage, percentiles), actual_value\n", "\n", "\n", "\n", " \n", "def benchmark_accuracy(strategies, percentage, bin_count, point_count, trials=100000):\n", " \"\"\" \n", " Try each strategy times, and take the average of the accuracy \n", " with the expected value. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "So let's write some test code that modulates:\n", "\n", "* the percentile we're trying to aggregate\n", "* the number of bins we're aggregating\n", "* the points per bin" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def test_percentile(aggregation_func, percentage, bin_count, point_count):\n", "    \"\"\" test our aggregation functionality, under different circumstances \"\"\"\n", "    bins, points = create_bins(bin_count, point_count)\n", "    percentiles = [\n", "        percentile(percentage, bin_points) for bin_points in bins\n", "    ]\n", "    actual_value = percentile(percentage, points)\n", "    return aggregation_func(percentage, percentiles), actual_value\n", "\n", "\n", "def benchmark_accuracy(strategies, percentage, bin_count, point_count, trials=100000):\n", "    \"\"\"\n", "    Try each strategy `trials` times, and take the average of the absolute\n", "    deviation from the expected value. Return the winner, followed by the\n", "    average deviation for each strategy.\n", "    \"\"\"\n", "    result_by_strategy = {}\n", "    for strategy in strategies:\n", "        total_deviation = 0\n", "        for _ in range(trials):\n", "            result, expected = test_percentile(strategy, percentage, bin_count, point_count)\n", "            total_deviation += abs(result - expected)\n", "        result_by_strategy[strategy.__name__] = total_deviation / trials\n", "    # note: dicts iterate in insertion order (an implementation detail in\n", "    # python 3.6, part of the language specification from 3.7 on)\n", "    winner = None\n", "    min_deviation = float(\"inf\")\n", "    for name, result in result_by_strategy.items():\n", "        if result < min_deviation:\n", "            min_deviation = result\n", "            winner = name\n", "    return [winner] + list(result_by_strategy.values())" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "BIN_COUNTS = [10, 100]\n", "POINT_COUNTS = [1, 10, 100]\n", "PERCENTAGES = [0.5, 0.95, 0.99]\n", "\n", "\n", "def benchmark_all():\n", "    data = [[\"bin_count\", \"points_per_bin\", \"percentage\", \"winner\"] + [s.__name__ for s in STRATEGIES]]\n", "    for bin_count in BIN_COUNTS:\n", "        for point_count in POINT_COUNTS:\n", "            for percentage in PERCENTAGES:\n", "                print((bin_count, point_count, percentage))\n", "                result = benchmark_accuracy(\n", "                    STRATEGIES, percentage, bin_count, point_count\n", "                )\n", "                data.append([\n", "                    bin_count,\n", "                    point_count,\n", "                    percentage,\n", "                ] + result)\n", "    print(tabulate(data, headers=\"firstrow\"))\n", "\n", "\n", "def benchmark_strategy(aggregation_func):\n", "    \"\"\" benchmark a single strategy against various scenarios \"\"\"\n", "    data = [[\"function\", \"bin_count\", \"points_per_bin\", \"percentage\", \"result / expected\"]]\n", "    for bin_count in BIN_COUNTS:\n", "        for point_count in POINT_COUNTS:\n", "            for percentage in PERCENTAGES:\n", "                result = test_percentile(aggregation_func, percentage,\n", "                                         bin_count, point_count)\n", "                data.append([\n", "                    aggregation_func.__name__,\n", "                    bin_count,\n", "                    point_count,\n", "                    percentage,\n", "                    result\n", "                ])\n", "    print(tabulate(data, headers=\"firstrow\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we outline the strategies:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def median(percentage, percentiles):\n", "    \"\"\" return the median point \"\"\"\n", "    percentiles = sorted(percentiles)\n", "    return percentiles[len(percentiles) // 2]\n", "\n", "def average(_, percentiles):\n", "    return sum(percentiles) / len(percentiles)\n", "\n", "def max_(_, percentiles):\n", "    return max(percentiles)\n", "\n", "def min_(_, percentiles):\n", "    return min(percentiles)\n", "\n", "STRATEGIES = [median, average, percentile, max_, min_]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Strategy Prediction\n", "\n", "So which one will be the best? To help evaluate that, I think there are two limits to consider.\n", "\n", "At low data volumes, the problem reduces to each bin containing only one data point, or none. In this case, the percentile strategy is the most accurate, as median and average would be taking the median and average of the bins, which effectively takes the median and average of the traffic itself.\n", "\n", "At a high number of data points per bin, the bins themselves are a more accurate representation of the percentiles they are capturing. As the number of data points per bin approaches infinity, each bin's percentile converges on the true percentile, so strategies that combine the per-bin values directly (the average or the median) should become increasingly accurate."
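, "\n", "\n", "To make the low-data limit concrete, here is a minimal sketch reusing `create_bins` and `percentile` from above (not part of the benchmark itself): with a single point per bin, each bin's percentile is just that point, so taking the percentile of the per-bin percentiles reproduces the overall percentile exactly.\n", "\n", "```python\n", "# one point per bin: each per-bin percentile is the bin's only point,\n", "# so the percentile-of-percentiles equals the percentile over all points\n", "bins, points = create_bins(100, 1)\n", "per_bin_p99 = [percentile(0.99, b) for b in bins]\n", "assert percentile(0.99, per_bin_p99) == percentile(0.99, points)\n", "```"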
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(10, 1, 0.5)\n", "(10, 1, 0.95)\n", "(10, 1, 0.99)\n", "(10, 10, 0.5)\n", "(10, 10, 0.95)\n", "(10, 10, 0.99)\n", "(10, 100, 0.5)\n", "(10, 100, 0.95)\n", "(10, 100, 0.99)\n", "(100, 1, 0.5)\n", "(100, 1, 0.95)\n", "(100, 1, 0.99)\n", "(100, 10, 0.5)\n", "(100, 10, 0.95)\n", "(100, 10, 0.99)\n", "(100, 100, 0.5)\n", "(100, 100, 0.95)\n", "(100, 100, 0.99)\n", " bin_count points_per_bin percentage winner median average percentile max_ min_\n", "----------- ---------------- ------------ ---------- -------- ---------- ------------ -------- --------\n", " 10 1 0.5 percentile 16.4145 11.4482 0 59.8732 8.46032\n", " 10 1 0.95 percentile 48.2928 48.4826 0 0 68.2175\n", " 10 1 0.99 percentile 48.4232 48.4967 0 0 68.3683\n", " 10 10 0.5 average 5.05107 1.20435 2.43379 14.1471 6.85546\n", " 10 10 0.95 average 16.4724 7.05708 19.1464 19.1643 42.1451\n", " 10 10 0.99 percentile 22.3421 21.5031 4.17321 4.19587 57.2037\n", " 10 100 0.5 average 1.73056 0.226398 0.556829 3.99453 3.05509\n", " 10 100 0.95 average 6.24118 2.65317 8.51557 8.49106 15.766\n", " 10 100 0.99 percentile 4.57526 3.61064 2.80232 2.80631 13.732\n", " 100 1 0.5 percentile 17.478 11.4759 0 85.6289 8.98846\n", " 100 1 0.95 percentile 56.1186 54.9946 0 19.1492 75.4618\n", " 100 1 0.99 percentile 70.006 69.988 0 4.17005 90.4284\n", " 100 10 0.5 average 5.28758 0.408567 2.03797 30.4184 8.09651\n", " 100 10 0.95 average 17.8629 9.18991 18.0849 20.4303 62.4184\n", " 100 10 0.99 percentile 25.7039 25.1357 4.00724 4.45979 78.4014\n", " 100 100 0.5 average 1.81351 0.0785362 0.260815 7.30585 4.42301\n", " 100 100 0.95 average 6.61081 2.90541 8.79043 13.2074 25.0559\n", " 100 100 0.99 percentile 4.91507 3.99404 3.61271 3.93704 23.0564\n" ] } ], "source": [ "# this will take a while. This is slow in Native Python,\n", "# at some point it would be valuable to switch to Numpy,\n", "# and use the functions inside of it.\n", "benchmark_all()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }