{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Aggregating Percentiles\n",
    "\n",
    "There's a lot of articles explaning how percentiles can't actually be aggregated without loss of accuracy, so I won't rehash that here. \n",
    "\n",
    "But, there is a good question around what do when you have no choice but to aggregate them. In that case, what type of aggregation is best? \n",
    "\n",
    "To begin, let's provide some basic functionality that will help us explore various scenarios."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# for templating and display\n",
    "from IPython.display import display\n",
    "from tabulate import tabulate\n",
    "\n",
    "from math import ceil\n",
    "\n",
    "def percentile(percentage, data_points):\n",
    "    \"\"\" return the point that is at the desired percentage \"\"\"\n",
    "    data_points = sorted(data_points)\n",
    "    point_at_percentage = ceil(len(data_points) * percentage) - 1\n",
    "    return data_points[point_at_percentage]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we can test our percentile with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9867958186960467"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import random\n",
    "\n",
    "# we seed with a constant, to reproduce results. but this notebook should work\n",
    "# with any value or random generator\n",
    "random.seed(0)\n",
    "SAMPLE_SIZE = 1000\n",
    "\n",
    "random_points = [random.random() for _ in range(SAMPLE_SIZE)]\n",
    "\n",
    "# should end up being around 0.99, since the points are random across 0 - 100\n",
    "percentile(0.99, random_points)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So we've verified this function works. Now comes the fun part. Let's provide a way to randomly distribute these values across multiple bins."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "([[20.78373073698856],\n",
       "  [2.0644462457050095],\n",
       "  [1.6599971633412063],\n",
       "  [10.182311868152327],\n",
       "  [39.20323829703609],\n",
       "  [16.221512415891407],\n",
       "  [32.322723393505704],\n",
       "  [3.400298651748122],\n",
       "  [3.7147164248552023],\n",
       "  [7.201239971993294]],\n",
       " [20.78373073698856,\n",
       "  2.0644462457050095,\n",
       "  1.6599971633412063,\n",
       "  10.182311868152327,\n",
       "  39.20323829703609,\n",
       "  16.221512415891407,\n",
       "  32.322723393505704,\n",
       "  3.400298651748122,\n",
       "  3.7147164248552023,\n",
       "  7.201239971993294])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def exponential_random():\n",
    "    \"\"\" \n",
    "    return a number on the exponential scale,\n",
    "    to emulate web service traffic.\n",
    "    \n",
    "    this will produce a sharp exponent from\n",
    "    1 - 100\n",
    "    \"\"\"\n",
    "    return 10 ** (random.random() * 2)\n",
    "\n",
    "def create_bins(count, points):\n",
    "    \"\"\" return <count> bins, with <points> random points per bin \"\"\"\n",
    "    all_points = []\n",
    "    bins = []\n",
    "    for _ in range(count):\n",
    "        b = []\n",
    "        for _ in range(points):\n",
    "            p = exponential_random()\n",
    "            all_points.append(p)\n",
    "            b.append(p)\n",
    "        bins.append(b)\n",
    "    return bins, all_points\n",
    "\n",
    "create_bins(10, 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we need a way to test these aggregations under various conditions. \n",
    "There's a few real-world scenarios that our functions should handle:\n",
    "\n",
    "### Low Data Per Bin\n",
    "\n",
    "It may happen that your percentile bins are not that accurate themselves, due to a low volume of data. For example, if each bin only recieves 10 data points, a p99 calculation cannot be accurate, as the last data point represents the worst 10 percent of the points.\n",
    "\n",
    "This scenario occurs when tracking latencies for web services. If your service is spread across multiple hosts and percentiles are calculated on a per-host basis, and you have a low-volume endpoint, often the number of requests in a short time frame (one second, even one minute) will not capture enough data.\n",
    "\n",
    "### Single Bin\n",
    "\n",
    "A single bin effectively reduces the scenario to the accuracy of the percentile itself. We won't test this scenario since no aggregation is really performed in this case.\n",
    "\n",
    "So let's write some test code that modulates:\n",
    "\n",
    "* the percentile we're trying to aggregate\n",
    "* the number of bins we're aggregating\n",
    "* the points per bin"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def test_percentile(aggregation_func, percentage, bin_count, point_count):\n",
    "    \"\"\" test our aggregation functionality, under different circumstances \"\"\"\n",
    "    bins, points = create_bins(bin_count, point_count)\n",
    "    percentiles = [\n",
    "            percentile(percentage, bin_points) for bin_points in bins\n",
    "    ]\n",
    "    actual_value = percentile(percentage, points)\n",
    "    return aggregation_func(percentage, percentiles), actual_value\n",
    "\n",
    "\n",
    "\n",
    "    \n",
    "def benchmark_accuracy(strategies, percentage, bin_count, point_count, trials=100000):\n",
    "    \"\"\" \n",
    "    Try each strategy <trials> times, and take the average of the accuracy \n",
    "    with the expected value. return the winner, with the accurracy per each.\n",
    "    \"\"\"\n",
    "    result_by_strategy = {}\n",
    "    for strategy in strategies:\n",
    "        total_deviation = 0\n",
    "        for _ in range(trials):\n",
    "            result, expected = test_percentile(strategy, percentage, bin_count, point_count)\n",
    "            total_deviation += abs(result - expected)\n",
    "        result_by_strategy[strategy.__name__] = total_deviation / trials\n",
    "    # note in python3.6 iteration via insertion order is now \n",
    "    # a language specification\n",
    "    winner = None\n",
    "    min_deviation = trials * percentage\n",
    "    for name, result in result_by_strategy.items(): \n",
    "        if result < min_deviation:\n",
    "            min_deviation = result\n",
    "            winner = name\n",
    "    return [winner] + list(result_by_strategy.values())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "BIN_COUNTS = [10, 100]\n",
    "POINT_COUNTS = [1, 10, 100]\n",
    "PERCENTAGES = [0.5, 0.95, 0.99]\n",
    "# PERCENTAGES = [0.5, 0.95, 0.99]\n",
    "# BIN_COUNTS = [1]\n",
    "# POINT_COUNTS = [1]\n",
    "\n",
    "\n",
    "def benchmark_all():\n",
    "    data = [[\"bin_count\", \"points_per_bin\", \"percentage\", \"winner\"] + [s.__name__ for s in STRATEGIES]]\n",
    "    for bin_count in BIN_COUNTS:\n",
    "        for point_count in POINT_COUNTS:\n",
    "            for percentage in PERCENTAGES:\n",
    "                print((bin_count, point_count, percentage))\n",
    "                result = benchmark_accuracy(\n",
    "                    STRATEGIES, percentage, bin_count, point_count\n",
    "                )\n",
    "                data.append([\n",
    "                    bin_count,\n",
    "                    point_count,\n",
    "                    percentage,\n",
    "                ] + result)\n",
    "    print(tabulate(data, headers=\"firstrow\"))\n",
    "  \n",
    "\n",
    "def benchmark_strategy(aggregation_func):\n",
    "    \"\"\" benchmark a single strategy against various scenarios \"\"\"\n",
    "    data = [[\"function\", \"bin\", \"points_per_bin\", \"percentage / expected\", \"result\"]]\n",
    "    for bin_count in BIN_COUNTS:\n",
    "        for point_count in POINT_COUNTS:\n",
    "            for percentage in PERCENTAGES:\n",
    "                result = test_percentile(aggregation_func, percentage,\n",
    "                                         bin_count, point_count)\n",
    "                data.append([\n",
    "                    aggregation_func.__name__, \n",
    "                    percentage,\n",
    "                    bin_count,\n",
    "                    point_count,\n",
    "                    percentage,\n",
    "                    result\n",
    "                ])\n",
    "    print(tabulate(data, headers=\"firstrow\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we outline the strategies:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def median(percentage, percentiles):\n",
    "    \"\"\" return the median point \"\"\"\n",
    "    sorted(percentiles)\n",
    "    return percentiles[len(percentiles) // 2]\n",
    "\n",
    "def average(_, percentiles):\n",
    "    return sum(percentiles) / len(percentiles)\n",
    "\n",
    "def max_(_, percentile):\n",
    "    return max(percentile)\n",
    "\n",
    "def min_(_, percentile):\n",
    "    return min(percentile)\n",
    "\n",
    "STRATEGIES = [median, average, percentile, max_, min_]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Strategy Prediction\n",
    "\n",
    "So which one will be the best? To help evaluate that, I think there's two limits to consider.\n",
    "\n",
    "At low data points, the problem reduces down to each bin containing only one, or no, data points. In this case, percentiles are the most accurate, as median and average would be taking the median and average of the bins, which would effectively take the median and average of the traffic.\n",
    "\n",
    "At a high number of data points per bin, the bins themselves are a more accurate representation of the percentiles they are capturing. As the number of data points reach infinity.."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(10, 1, 0.5)\n",
      "(10, 1, 0.95)\n",
      "(10, 1, 0.99)\n",
      "(10, 10, 0.5)\n",
      "(10, 10, 0.95)\n",
      "(10, 10, 0.99)\n",
      "(10, 100, 0.5)\n",
      "(10, 100, 0.95)\n",
      "(10, 100, 0.99)\n",
      "(100, 1, 0.5)\n",
      "(100, 1, 0.95)\n",
      "(100, 1, 0.99)\n",
      "(100, 10, 0.5)\n",
      "(100, 10, 0.95)\n",
      "(100, 10, 0.99)\n",
      "(100, 100, 0.5)\n",
      "(100, 100, 0.95)\n",
      "(100, 100, 0.99)\n",
      "  bin_count    points_per_bin    percentage  winner        median     average    percentile      max_      min_\n",
      "-----------  ----------------  ------------  ----------  --------  ----------  ------------  --------  --------\n",
      "         10                 1          0.5   percentile  16.4145   11.4482         0         59.8732    8.46032\n",
      "         10                 1          0.95  percentile  48.2928   48.4826         0          0        68.2175\n",
      "         10                 1          0.99  percentile  48.4232   48.4967         0          0        68.3683\n",
      "         10                10          0.5   average      5.05107   1.20435        2.43379   14.1471    6.85546\n",
      "         10                10          0.95  average     16.4724    7.05708       19.1464    19.1643   42.1451\n",
      "         10                10          0.99  percentile  22.3421   21.5031         4.17321    4.19587  57.2037\n",
      "         10               100          0.5   average      1.73056   0.226398       0.556829   3.99453   3.05509\n",
      "         10               100          0.95  average      6.24118   2.65317        8.51557    8.49106  15.766\n",
      "         10               100          0.99  percentile   4.57526   3.61064        2.80232    2.80631  13.732\n",
      "        100                 1          0.5   percentile  17.478    11.4759         0         85.6289    8.98846\n",
      "        100                 1          0.95  percentile  56.1186   54.9946         0         19.1492   75.4618\n",
      "        100                 1          0.99  percentile  70.006    69.988          0          4.17005  90.4284\n",
      "        100                10          0.5   average      5.28758   0.408567       2.03797   30.4184    8.09651\n",
      "        100                10          0.95  average     17.8629    9.18991       18.0849    20.4303   62.4184\n",
      "        100                10          0.99  percentile  25.7039   25.1357         4.00724    4.45979  78.4014\n",
      "        100               100          0.5   average      1.81351   0.0785362      0.260815   7.30585   4.42301\n",
      "        100               100          0.95  average      6.61081   2.90541        8.79043   13.2074   25.0559\n",
      "        100               100          0.99  percentile   4.91507   3.99404        3.61271    3.93704  23.0564\n"
     ]
    }
   ],
   "source": [
    "# this will take a while. This is slow in Native Python,\n",
    "# at some point it would be valuable to switch to Numpy,\n",
    "# and use the functions inside of it.\n",
    "benchmark_all()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}