{ "cells": [ { "cell_type": "markdown", "source": [ "# \"CDFs with Seaborn\"\n", "> Plotting the cumulative distribution of latency measurements\n", "\n", "- toc: true\n", "- badges: true\n", "- comments: false\n", "- categories: [jupyter, cdf, seaborn]" ], "metadata": {} }, { "cell_type": "code", "execution_count": 1, "source": [ "#hide\n", "from collections import Counter\n", "import pandas as pd\n", "import seaborn as sns\n", "import random as r\n", "\n", "r.seed(42)\n", "sns.set()" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 2, "source": [ "#hide\n", "# generate the dataset\n", "data = []\n", "\n", "for path in ['a', 'b', 'c']:\n", "    for timestamp in range(1, 10001):\n", "        latency = -1\n", "        if (path == 'a'):\n", "            latency = r.normalvariate(30, 3)\n", "        elif (path == 'b'):\n", "            latency = r.normalvariate(40, 10)\n", "        else:\n", "            # c is bimodal: 50/50 chance of a latency around 40 or around 60\n", "            if (r.choice([True, False])):\n", "                latency = r.normalvariate(40, 1)\n", "            else:\n", "                latency = r.normalvariate(60, 1)\n", "\n", "        data.append({\n", "            'timestamp': timestamp,\n", "            'path': path,\n", "            'latency': latency\n", "        })\n", "\n", "df = pd.DataFrame(data)\n", "\n", "df[df.path == 'a'].describe(), df[df.path == 'b'].describe(), df[df.path == 'c'].describe()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "( timestamp latency\n", " count 10000.00000 10000.000000\n", " mean 5000.50000 29.992668\n", " std 2886.89568 3.045471\n", " min 1.00000 18.794713\n", " 25% 2500.75000 27.926180\n", " 50% 5000.50000 29.967227\n", " 75% 7500.25000 32.047904\n", " max 10000.00000 41.444034,\n", " timestamp latency\n", " count 10000.00000 10000.000000\n", " mean 5000.50000 40.047845\n", " std 2886.89568 9.940275\n", " min 1.00000 1.230942\n", " 25% 2500.75000 33.460950\n", " 50% 5000.50000 40.048686\n", " 75% 7500.25000 46.730936\n", " max 10000.00000 71.753491,\n", " timestamp latency\n", " count 10000.00000 10000.000000\n", " mean 5000.50000 
49.982874\n", " std 2886.89568 10.060132\n", " min 1.00000 36.721305\n", " 25% 2500.75000 39.991170\n", " 50% 5000.50000 42.958066\n", " 75% 7500.25000 59.988394\n", " max 10000.00000 63.566645)" ] }, "metadata": {}, "execution_count": 2 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "# Introduction\n", "\n", "During my PhD, I often had to create CDF (cumulative distribution function) plots.\n", "For example, I use CDF plots in my paper *Managing Latency and Excess Data Dissemination in Fog-Based Publish/Subscribe Systems* ([DOI](https://doi.org/10.1109/ICFC49376.2020.00010)/[Website](https://moewex.github.io/academic/publication/2020-broadcastgroups/)) to report latency measurements that were collected by multiple end devices for different data distribution strategies.\n", "\n", "In this blog post, I will showcase why CDFs are a particularly good fit for such a use case and how easy it is to generate them with [seaborn](https://seaborn.pydata.org)." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "# Exploring the Sample Data\n", "\n", "For this blog post, I created an artificial sample dataset with latency measurements for three communication paths." ], "metadata": {} }, { "cell_type": "code", "execution_count": 3, "source": [ "df.head()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n",
"|   | timestamp | path | latency |\n",
"|---|---|---|---|\n",
"| 0 | 1 | a | 30.735979 |\n",
"| 1 | 2 | a | 28.509467 |\n",
"| 2 | 3 | a | 33.764358 |\n",
"| 3 | 4 | a | 29.585823 |\n",
"| 4 | 5 | a | 27.072539 |\n",
"
\n",
"|   | path | latency | count | cumsum | cumulative_distribution |\n",
"|---|---|---|---|---|---|\n",
"| 0 | a | 18.794713 | 1 | 1 | 0.0001 |\n",
"| 6663 | a | 31.286361 | 1 | 6664 | 0.6664 |\n",
"| 6664 | a | 31.286763 | 1 | 6665 | 0.6665 |\n",
"| 6665 | a | 31.286795 | 1 | 6666 | 0.6666 |\n",
"| 6666 | a | 31.286961 | 1 | 6667 | 0.6667 |\n",
"| ... | ... | ... | ... | ... | ... |\n",
"| 23333 | c | 40.414325 | 1 | 3334 | 0.3334 |\n",
"| 23334 | c | 40.414616 | 1 | 3335 | 0.3335 |\n",
"| 23335 | c | 40.414784 | 1 | 3336 | 0.3336 |\n",
"| 23328 | c | 40.411314 | 1 | 3329 | 0.3329 |\n",
"| 29999 | c | 63.566645 | 1 | 10000 | 1.0000 |\n",
"
30000 rows × 5 columns
\n", "