{ "cells": [ { "cell_type": "markdown", "metadata": { "scrolled": true }, "source": [ "# Analyzing Prometheus Alerts in Ceph\n", "\n", "For a better understanding of the structure of prometheus data types have a look at [Prometheus Metric Types](https://prometheus.io/docs/concepts/metric_types/), especially the [difference between Summaries and Histograms](https://prometheus.io/docs/practices/histograms/)\n", "\n", "The measurements are stored in an Ceph. Let's examine what we have stored." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import statistics libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import json\n", "import numpy as np\n", "import seaborn as sns\n", "import sys\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "import pyspark\n", "import json\n", "from pyspark.sql import SparkSession\n", "\n", "from datetime import datetime\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set Spark Configuration" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Set the Spark configuration\n", "#This will point to a local Spark instance running in stand-alone mode on the notebook\n", "conf = pyspark.SparkConf().setAppName('Analyzing Prometheus Alerts in Ceph').setMaster('local[*]')\n", "sc = pyspark.SparkContext.getOrCreate(conf) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Access Ceph Object Storage over S3A" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Set the S3 configurations to access Ceph Object Storage\n", "sc._jsc.hadoopConfiguration().set(\"fs.s3a.access.key\", 'S3user1') \n", "sc._jsc.hadoopConfiguration().set(\"fs.s3a.secret.key\", 'S3user1key') \n", "sc._jsc.hadoopConfiguration().set(\"fs.s3a.endpoint\", 'http://10.0.1.111') " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set SQL Context and Read Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Get the SQL context\n", "sqlContext = pyspark.SQLContext(sc)\n", "\n", "#Read the Prometheus JSON BZip data\n", "jsonFile = sqlContext.read.option(\"multiline\", True).option(\"mode\", \"PERMISSIVE\").json(\"s3a://METRICS/kubelet_docker_operations_latency_microseconds/\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### IMPORTANT: If you run the above step with incorrect Ceph parameters, you must reset the Kernel to see changes.\n", "This can be done by going to Kernel in the menu and selecting 'Restart'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prometheus alerts\n", "\n", "```\n", "alert: DockerLatencyHigh\n", "message: Docker latency is high\n", "description: Docker latency is {{ $value }} seconds for 90% of kubelet operations\n", "expr: round(max(kubelet_docker_operations_latency_microseconds{quantile=\"0.9\"}) BY (hostname) / 1e+06, 0.1) > 10\n", "``` \n", "\n", "
\n", "```\n", "alert: KubernetesAPIErrorsHigh\n", "message: Kubernetes API server errors high\n", "description: Kubernetes API server errors (response code 5xx) are {{ $value }}% of total requests\n", "expr: rate(apiserver_request_count{code=~\"^(?:5..)$\"}[5m]) / rate(apiserver_request_count[5m]) * 100 > 5\n", "```\n", "\n", "
\n", "```\n", "alert: KubernetesAPIClusterLatencyHigh\n", "message: Kubernetes API server cluster latency high\n", "description: 'Kubernetes API server request latency is {{ $value }} seconds for\n", " 90% of cluster requests. NOTE: long-standing requests (e.g. watch, watchlist,\n", " list, proxy, connect) have been removed from alert query.'\n", "expr: round(apiserver_request_latencies_summary{quantile=\"0.9\",scope=\"cluster\",subresource!=\"log\",verb!~\"^(?:WATCH|WATCHLIST|LIST|PROXY|CONNECT)$\"}\n", " / 1e+06, 0.1) > 1\n", "```\n", "\n", "
\n", "```\n", "alert: KubernetesAPIGetLatencyHigh\n", "message: Kubernetes API server GET latency high\n", "description: Kubernetes API server request latency is {{ $value }} seconds for 99%\n", " of GET requests.\n", "expr: round(apiserver_request_latencies_summary{quantile=\"0.99\",subresource!=\"log\",verb=\"GET\"}\n", " / 1e+06, 0.1) > 1\n", "```\n", "\n", "
\n", "\n", "```\n", "alert: KubernetesAPIPOSTLatencyHigh\n", "message: Kubernetes API server POST|PUT|PATCH|DELETE latency high\n", "description: Kubernetes API server request latency is {{ $value }} seconds for 99%\n", " of POST|PUT|PATCH|DELETE requests.\n", "expr: round(apiserver_request_latencies_summary{quantile=\"0.99\",subresource!=\"log\",verb=~\"^(?:POST|PUT|PATCH)$\"}\n", " / 1e+06, 0.1) > 2\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Display the schema of the files" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Display schema:')\n", "jsonFile.printSchema()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Query the JSON data using filters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Register the created SchemaRDD as a temporary table.\n", "jsonFile.registerTempTable(\"kubelet_docker_operations_latency_microseconds\")\n", "\n", "#Filter the results into a data frame\n", "data = sqlContext.sql(\"SELECT values, metric.operation_type FROM kubelet_docker_operations_latency_microseconds WHERE metric.quantile='0.9' AND metric.hostname='free-stg-master-03fb6'\")\n", "\n", "data.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_pd = data.toPandas()\n", "\n", "sc.stop()\n", "\n", "OP_TYPE = 'list_images'\n", "\n", "df2 = pd.DataFrame(columns = ['utc_timestamp','value', 'operation_type'])\n", "#df2 ='\n", "for op in set(data_pd['operation_type']):\n", " dict_raw = data_pd[data_pd['operation_type'] == op]['values']\n", " list_raw = []\n", " for key in dict_raw.keys():\n", " list_raw.extend(dict_raw[key])\n", " temp_frame = pd.DataFrame(list_raw, columns = ['utc_timestamp','value'])\n", " temp_frame['operation_type'] = op\n", " \n", " df2 = df2.append(temp_frame)\n", "\n", "\n", "df2 = df2[df2['value'] != 'NaN']\n", "\n", "df2['value'] = df2['value'].apply(lambda a: int(a))\n", "\n", "df2['timestamp'] = df2['utc_timestamp'].apply(lambda a : datetime.fromtimestamp(int(a)))\n", "\n", "df2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Objective: verify the above alerts" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "#### Store time stamp with data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "df2.reset_index(inplace =True)\n", "\n", "del df2['index']\n", "\n", "df2['operation_type'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Segregate the values by operation type in separate variables as Series" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_filtered_op_frame(op_type):\n", " temp = df2[df2.operation_type == op_type]\n", " temp = temp.sort_values(by='timestamp')\n", " return temp\n", "\n", "operation_type_value = {}\n", "for temp in list(df2.operation_type.unique()):\n", " operation_type_value[temp] = get_filtered_op_frame(temp)['value']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Descriptive Stats\n", "It refers to the portion of statistics dedicated to summarizing a total population" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Mean \n", "Arithmetic average of a range of values or quantities, computed by dividing the total of all values by the number of values.\n", "![title](../img/mean.png)" ] }, { "cell_type": 
"code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for temp in operation_type_value.keys():\n", " print(\"Mean of: \",temp, \" - \", np.mean(operation_type_value[temp]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Variance\n", "In the same way that the mean is used to describe the central tendency, variance is intended to describe the spread.\n", "The xi – μ is called the “deviation from the mean”, making the variance the squared deviation multiplied by 1 over the number of samples. This is why the square root of the variance, σ, is called the standard deviation.\n", "![title](../img/variance.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for temp in operation_type_value.keys():\n", " print(\"Variance of: \",temp, \" - \", np.var(operation_type_value[temp]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Standard Deviation\n", "Standard deviation (SD, also represented by the Greek letter sigma σ or the Latin letter s) is a measure that is used to quantify the amount of variation or dispersion of a set of data values.[1] A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for temp in operation_type_value.keys():\n", " print(\"Standard Deviation of: \",temp, \" - \", np.std(operation_type_value[temp]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Median\n", "\n", "Denotes value or quantity lying at the midpoint of a frequency distribution of observed values or quantities, such that there is an equal probability of falling above or below it. Simply put, it is the *middle* value in the list of numbers.\n", "The median is a better choice when the indicator can be affected by some outliers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for temp in operation_type_value.keys():\n", " print(\"Median of: \",temp, \" - \", np.median(operation_type_value[temp]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Histogram \n", "The most common representation of a distribution is a histogram, which is a graph that shows the frequency or probability of each value. Plots will be generated by operation type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use Seaborn module for this. __Kernel Density Estimation__ * will be added for smoothing.\n", "* In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample.\n", "* The kernel density estimate may be less familiar, but it can be a useful tool for plotting the shape of a distribution. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Histogram \n", "The most common representation of a distribution is a histogram, which is a graph that shows the frequency or probability of each value. Plots will be generated by operation type." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We will use the Seaborn module for this. A __kernel density estimate (KDE)__ will be added for smoothing.\n", "* In statistics, kernel density estimation is a non-parametric way to estimate the probability density function of a random variable. It is a fundamental data smoothing technique where inferences about the population are made based on a finite data sample.\n", "* The kernel density estimate may be less familiar, but it can be a useful tool for plotting the shape of a distribution. Like the histogram, the KDE plot encodes the density of observations on one axis with height along the other axis:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "sns.set(color_codes=True)\n", "\n", "for temp in operation_type_value.keys():\n", "    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,12))\n", "    sns.distplot(get_filtered_op_frame(temp)['value'], kde=True, ax=ax[0], axlabel=temp)\n", "    sns.distplot(np.log(get_filtered_op_frame(temp)['value']), kde=True, ax=ax[1], axlabel=\"Log transformed \" + temp)\n", "    fig.show()\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Understanding\n", "They all look log-normal, because the values are always greater than 0" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2.columns" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Box-Whisker\n", "Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. __Outliers__ may be plotted as individual points." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "A log transformation is required because, for different operations, the values are on very different scales" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# df_whisker references the same DataFrame as df2, so the new column is added to df2 as well\n", "df_whisker = df2\n", "df_whisker['log_transformed_value'] = np.log(df2['value'])" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_whisker.head()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(20,15))\n", "ax = sns.boxplot(x=\"operation_type\", y=\"log_transformed_value\", hue=\"operation_type\", data=df_whisker) # RUN PLOT\n", "plt.show()\n", "\n", "plt.clf()\n", "plt.close()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Finding a trend in the time series, if there is any\n", "A trend means that, over time, the values show an increasing or decreasing pattern. In this example we see a slow and steady increase followed by a sharp drop." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "for temp in operation_type_value.keys():\n", "    temp_frame = get_filtered_op_frame(temp)\n", "    temp_frame = temp_frame.set_index(temp_frame.timestamp)\n", "    temp_frame = temp_frame[['log_transformed_value']]\n", "    temp_frame.plot(figsize=(15,12), title=temp)\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }