# Analyzing Prometheus Alerts in Ceph

For a better understanding of the Prometheus data types, have a look at [Prometheus Metric Types](https://prometheus.io/docs/concepts/metric_types/), especially the [difference between Summaries and Histograms](https://prometheus.io/docs/practices/histograms/).

The measurements are stored in Ceph. Let's examine what we have stored.

### Import statistics libraries

In [None]:
import pandas as pd
import json
import numpy as np
import seaborn as sns
import sys
import matplotlib.pyplot as plt
%matplotlib inline

import pyspark
from pyspark.sql import SparkSession

from datetime import datetime

import warnings
warnings.filterwarnings('ignore')

### Set Spark Configuration

In [None]:
#Set the Spark configuration
#This will point to a local Spark instance running in stand-alone mode on the notebook
conf = pyspark.SparkConf().setAppName('Analyzing Prometheus Alerts in Ceph').setMaster('local[*]')
sc = pyspark.SparkContext.getOrCreate(conf) 

### Access Ceph Object Storage over S3A

In [None]:
#Set the S3 configurations to access Ceph Object Storage
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", 'S3user1') 
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", 'S3user1key') 
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", 'http://10.0.1.111') 

### Set SQL Context and Read Dataset

In [None]:
#Get the SQL context
sqlContext = pyspark.SQLContext(sc)

#Read the Prometheus JSON BZip data
jsonFile = sqlContext.read.option("multiline", True).option("mode", "PERMISSIVE").json("s3a://METRICS/kubelet_docker_operations_latency_microseconds/")

#### IMPORTANT: If you run the above step with incorrect Ceph parameters, you must restart the kernel before corrected settings take effect.
To do so, open the Kernel menu and select 'Restart'.

## Prometheus alerts

```
alert: DockerLatencyHigh
message: Docker latency is high
description: Docker latency is {{ $value }} seconds for 90% of kubelet operations
expr: round(max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) BY (hostname) / 1e+06, 0.1) > 10
```    

<hr>

```
alert: KubernetesAPIErrorsHigh
message: Kubernetes API server errors high
description: Kubernetes API server errors (response code 5xx) are {{ $value }}% of total requests
expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m]) * 100 > 5
```

<hr>

```
alert: KubernetesAPIClusterLatencyHigh
message: Kubernetes API server cluster latency high
description: 'Kubernetes API server request latency is {{ $value }} seconds for
    90% of cluster requests. NOTE: long-standing requests (e.g. watch, watchlist,
    list, proxy, connect) have been removed from alert query.'
expr: round(apiserver_request_latencies_summary{quantile="0.9",scope="cluster",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|LIST|PROXY|CONNECT)$"}
  / 1e+06, 0.1) > 1
```

<hr>

```
alert: KubernetesAPIGetLatencyHigh
message: Kubernetes API server GET latency high
description: Kubernetes API server request latency is {{ $value }} seconds for 99%
    of GET requests.
expr: round(apiserver_request_latencies_summary{quantile="0.99",subresource!="log",verb="GET"}
  / 1e+06, 0.1) > 1
```

<hr>

```
alert: KubernetesAPIPOSTLatencyHigh
message: Kubernetes API server POST|PUT|PATCH|DELETE latency high
description: Kubernetes API server request latency is {{ $value }} seconds for 99%
    of POST|PUT|PATCH|DELETE requests.
expr: round(apiserver_request_latencies_summary{quantile="0.99",subresource!="log",verb=~"^(?:POST|PUT|PATCH)$"}
  / 1e+06, 0.1) > 2
```
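
To make the first alert expression concrete, here is a minimal sketch of how `DockerLatencyHigh` could be evaluated on a pandas frame of samples. The helper names (`prom_round`, `docker_latency_high`), the column names, and the example data are hypothetical and only for illustration. The sketch assumes PromQL's documented `round(v, to_nearest)` behaviour: round to the nearest multiple of `to_nearest`, with ties rounding up.

```python
import numpy as np
import pandas as pd

def prom_round(values, to_nearest=1.0):
    """Round to the nearest multiple of to_nearest, ties rounding up."""
    return np.floor(np.asarray(values, dtype=float) / to_nearest + 0.5) * to_nearest

def docker_latency_high(samples, threshold_seconds=10.0):
    """Return the hosts whose worst 0.9-quantile latency exceeds the threshold.

    samples: DataFrame with 'hostname' and 'value' (microseconds) columns,
    assumed to be already filtered to quantile="0.9".
    """
    # max(...) BY (hostname)
    worst = samples.groupby('hostname')['value'].max()
    # / 1e+06 converts microseconds to seconds, then round to 0.1 s
    seconds = pd.Series(prom_round(worst / 1e6, 0.1), index=worst.index)
    # > 10 keeps only the series that would fire the alert
    return seconds[seconds > threshold_seconds]

# Hypothetical samples: host-a peaks at 12.34 s, host-b at 4 s
example = pd.DataFrame({
    'hostname': ['host-a', 'host-a', 'host-b'],
    'value': [2_500_000, 12_340_000, 4_000_000],
})
firing = docker_latency_high(example)
print(firing)
```

Only `host-a` is returned here, because its rounded worst-case latency is above the 10 second threshold.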


### Display the schema of the files

In [None]:
print('Display schema:')
jsonFile.printSchema()

### Query the JSON data using filters

In [None]:
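#Register the DataFrame as a temporary view so it can be queried with SQL
#(createOrReplaceTempView replaces the deprecated registerTempTable since Spark 2.0)
jsonFile.createOrReplaceTempView("kubelet_docker_operations_latency_microseconds")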

#Filter the results into a data frame
data = sqlContext.sql("SELECT values, metric.operation_type FROM kubelet_docker_operations_latency_microseconds WHERE metric.quantile='0.9' AND metric.hostname='free-stg-master-03fb6'")

data.show()

In [None]:
data_pd = data.toPandas()

sc.stop()

OP_TYPE = 'list_images'

frames = []
for op in set(data_pd['operation_type']):
    #Each entry of 'values' is a list of [timestamp, value] pairs; flatten them per operation type
    dict_raw = data_pd[data_pd['operation_type'] == op]['values']
    list_raw = []
    for key in dict_raw.keys():
        list_raw.extend(dict_raw[key])
    temp_frame = pd.DataFrame(list_raw, columns = ['utc_timestamp','value'])
    temp_frame['operation_type'] = op
    frames.append(temp_frame)

#DataFrame.append was removed in pandas 2.0, so concatenate all frames once
df2 = pd.concat(frames)


df2 = df2[df2['value'] != 'NaN']

df2['value'] = df2['value'].apply(lambda a: int(a))

df2['timestamp'] = df2['utc_timestamp'].apply(lambda a : datetime.fromtimestamp(int(a)))

df2.head()
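
The cleaning steps in the cell above can also be sketched with vectorized pandas calls. This is an alternative under stated assumptions, not the notebook's method: the frame `raw` and its contents are made up, values are parsed as floats rather than integers, and timestamps are kept explicitly in UTC (whereas `datetime.fromtimestamp` converts to the local time zone).

```python
import pandas as pd

# Hypothetical flattened samples; Prometheus encodes values as strings,
# including the literal "NaN" for missing quantiles
raw = pd.DataFrame({
    'utc_timestamp': [1520000000, 1520000060, 1520000120],
    'value': ['1500', 'NaN', '2300'],
    'operation_type': ['list_images'] * 3,
})

clean = raw.copy()
# Convert strings to numbers; unparseable entries become NaN
clean['value'] = pd.to_numeric(clean['value'], errors='coerce')
# Drop the rows that carried no measurement
clean = clean.dropna(subset=['value'])
# Interpret the epoch seconds as timezone-aware UTC timestamps
clean['timestamp'] = pd.to_datetime(clean['utc_timestamp'], unit='s', utc=True)

print(clean)
```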

### Objective: verify the above alerts

#### Store time stamp with data

In [None]:
df2.reset_index(inplace =True)

del df2['index']

df2['operation_type'].unique()

#### Segregate the values by operation type in separate variables as Series

In [None]:
def get_filtered_op_frame(op_type):
    temp = df2[df2.operation_type == op_type]
    temp = temp.sort_values(by='timestamp')
    return temp

operation_type_value = {}
for temp in list(df2.operation_type.unique()):
    operation_type_value[temp] = get_filtered_op_frame(temp)['value']

### Descriptive Stats
Descriptive statistics is the part of statistics dedicated to summarizing a data set, here the latency values observed for each operation type.

#### Mean 
Arithmetic average of a range of values or quantities, computed by dividing the total of all values by the number of values.
![title](../img/mean.png)

In [None]:
for temp in operation_type_value.keys():
    print("Mean of: ",temp, " - ", np.mean(operation_type_value[temp]))

#### Variance
In the same way that the mean is used to describe the central tendency, variance is intended to describe the spread.
The term xi − μ is called the “deviation from the mean”, so the variance is the mean of the squared deviations: the sum of the squared deviations multiplied by 1 over the number of samples. Its square root, σ, is called the standard deviation and is expressed in the same units as the data.
![title](../img/variance.png)
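
As a small worked example of the definition, the sketch below computes the variance by hand and compares it with NumPy. The numbers are made up. Note that `np.var` and `np.std` divide by the number of samples by default (`ddof=0`); passing `ddof=1` gives the sample variance instead.

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()
# Mean of the squared deviations from the mean
population_variance = np.mean((values - mean) ** 2)
# Standard deviation is the square root of the variance
standard_deviation = np.sqrt(population_variance)
# Sample variance divides by (n - 1) instead of n
sample_variance = np.var(values, ddof=1)

print(mean, population_variance, standard_deviation, sample_variance)
```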

In [None]:
for temp in operation_type_value.keys():
    print("Variance of: ",temp, " - ", np.var(operation_type_value[temp]))

#### Standard Deviation
Standard deviation (SD, also represented by the Greek letter sigma σ or the Latin letter s) is a measure used to quantify the amount of variation or dispersion of a set of data values. A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.


In [None]:
for temp in operation_type_value.keys():
    print("Standard Deviation of: ",temp, " - ", np.std(operation_type_value[temp]))

#### Median

The median is the value lying at the midpoint of a frequency distribution of observed values, such that an observation is equally likely to fall above or below it. Simply put, it is the *middle* value in the sorted list of numbers.
The median is a better choice than the mean when the indicator can be affected by outliers.
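
A minimal illustration of that robustness, using made-up latency values with one extreme outlier:

```python
import numpy as np

# Four typical latencies and one extreme outlier (hypothetical values)
latencies = np.array([100, 110, 120, 130, 10000])

mean_latency = np.mean(latencies)      # pulled far upwards by the outlier
median_latency = np.median(latencies)  # stays at the middle value

print(mean_latency, median_latency)
```

The mean lands far above every typical value, while the median remains at the middle observation.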

In [None]:
for temp in operation_type_value.keys():
    print("Median of: ",temp, " - ", np.median(operation_type_value[temp]))

### Histogram 
The most common representation of a distribution is a histogram, a graph that shows the frequency or probability of each value. One plot is generated per operation type.

We will use the Seaborn module for this, with a __Kernel Density Estimation__ (KDE) curve added for smoothing.
* In statistics, kernel density estimation is a non-parametric way to estimate the probability density function of a random variable. It is a fundamental data smoothing problem in which inferences about the population are made from a finite data sample.
* The kernel density estimate may be less familiar than the histogram, but it is a useful tool for plotting the shape of a distribution. Like the histogram, a KDE plot encodes the density of observations on one axis with height along the other axis.
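
To show what a KDE computes, here is a minimal NumPy sketch of a one-dimensional Gaussian kernel density estimate. The function name `gaussian_kde_1d`, the sample values, and the fixed bandwidth are assumptions for illustration; Seaborn chooses its bandwidth automatically and its implementation differs in detail.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at every grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance between every grid point and every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

samples = np.array([1.0, 1.5, 2.0, 2.2, 5.0])
grid = np.linspace(-5.0, 11.0, 1601)
density = gaussian_kde_1d(samples, grid, bandwidth=0.5)

# A density should integrate to roughly 1 over a wide enough grid
area = float(np.sum(density) * (grid[1] - grid[0]))
peak = float(grid[np.argmax(density)])
print(area, peak)
```

A smaller bandwidth follows individual samples more closely, while a larger one produces a smoother curve.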

In [None]:
sns.set(color_codes = True)

for temp in operation_type_value.keys():
    values = get_filtered_op_frame(temp)['value']
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,12))
    #distplot is deprecated in recent seaborn releases; histplot (seaborn >= 0.11) draws a histogram with a KDE overlay
    sns.histplot(values, kde=True, stat='density', ax=ax[0])
    ax[0].set_xlabel(temp)
    sns.histplot(np.log(values), kde=True, stat='density', ax=ax[1])
    ax[1].set_xlabel("Log transformed "+ temp)
    plt.show()


#### Understanding
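The distributions all appear to be approximately log-normal. Latency values are always greater than 0 and right-skewed, which is consistent with a log-normal shape, although positivity alone does not prove it.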
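
One rough way to check such a claim numerically is to compare the skewness before and after the log transform: for log-normal data, the raw values are strongly right-skewed while their logarithms are close to symmetric. The sketch below uses synthetic data and a hand-written `skewness` helper (both assumptions for illustration), not the notebook's measurements.

```python
import numpy as np

def skewness(x):
    """Third standardized moment: about 0 for symmetric data, positive for a long right tail."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

# Synthetic log-normal "latencies" with a fixed seed for repeatability
rng = np.random.default_rng(0)
latencies = rng.lognormal(mean=8.0, sigma=1.0, size=10_000)

raw_skew = skewness(latencies)
log_skew = skewness(np.log(latencies))
print(raw_skew, log_skew)
```

Applied to the real values, a log-skewness far from 0 would be a hint that the log-normal description is only a loose approximation.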

In [None]:
df2.columns

#### Box-Whisker
A box plot draws a box from the lower to the upper quartile of the data, with a line at the median. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. __Outliers__ may be plotted as individual points.

A log transformation is required because the values for different operations are on very different scales.
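
The quantities behind a box plot can be sketched directly with NumPy. The example assumes the common 1.5 × IQR rule for flagging outliers (Seaborn's default `whis=1.5`) and uses made-up values.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# The box spans the lower and upper quartiles, with the median inside
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr, outliers)
```

The whiskers themselves end at the most extreme data points that still lie within the two fences.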

In [None]:
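#df_whisker is an alias of df2 (not a copy), so the new column is also added to df2;
#the trend plots further down rely on df2 having 'log_transformed_value'
df_whisker =  df2
df_whisker['log_transformed_value'] = np.log(df2['value'])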

In [None]:
df_whisker.head()

In [None]:
plt.figure(figsize=(20,15))
ax = sns.boxplot(x="operation_type", y="log_transformed_value", hue="operation_type", data=df_whisker)  # RUN PLOT   
plt.show()

plt.clf()
plt.close()

### Finding a trend in the time series, if there is any
A trend means that the values follow an increasing or decreasing pattern over time. In this example we see a trend of a slow and steady increase followed by a sharp drop.
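
As a rough illustration of making such a pattern easier to see, the sketch below smooths a synthetic sawtooth series with a rolling mean and locates the sharpest drop with a first difference. The data, the 5-minute spacing, and the window size of 3 are assumptions chosen for the example.

```python
import pandas as pd

# Synthetic series: slow, steady increase followed by a sharp drop, twice
timestamps = pd.date_range('2018-01-01', periods=12, freq='5min')
series = pd.Series([1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
                   index=timestamps, dtype=float)

# Rolling mean over 3 samples smooths out short-term noise
smoothed = series.rolling(window=3).mean()

# The most negative step marks the sharp drop
sharpest_drop_at = series.diff().idxmin()

print(smoothed)
print(sharpest_drop_at)
```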

In [None]:
operation_type_value.keys()

for temp in operation_type_value.keys():
    #fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,12))
    temp_frame = get_filtered_op_frame(temp)
    temp_frame = temp_frame.set_index(temp_frame.timestamp)
    temp_frame = temp_frame[['log_transformed_value']]
    temp_frame.plot(figsize=(15,12),title=temp)
