{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Anomaly Detection Using Gaussian Distribution\n", "\n", "_Source: 🤖[Homemade Machine Learning](https://github.com/trekhleb/homemade-machine-learning) repository_\n", "\n", "> ☝Before moving on with this demo you might want to take a look at:\n", "> - 📗[Math behind the Anomaly Detection](https://github.com/trekhleb/homemade-machine-learning/tree/master/homemade/anomaly_detection)\n", "> - ⚙️[Gaussian Anomaly Detection Source Code](https://github.com/trekhleb/homemade-machine-learning/blob/master/homemade/anomaly_detection/gaussian_anomaly_detection.py)\n", "\n", "**Anomaly detection** (also **outlier detection**) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.\n", "\n", "The **normal** (or **Gaussian**) distribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate.\n", "\n", "> **Demo Project:** In this demo we will build a model that will find anomalies in server operational parameters such as `Latency` and `Throughput`." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# To make debugging of the anomaly_detection module easier we enable autoreloading of imported modules.\n", "# With this enabled, any change to the anomaly_detection library code is picked up here without restarting the kernel.\n", "%load_ext autoreload\n", "%autoreload 2\n", "\n", "# Add project root folder to module loading paths.\n", "import sys\n", "sys.path.append('../..')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import Dependencies\n", "\n", "- [pandas](https://pandas.pydata.org/) - library that we will use for loading and displaying the data in a table\n", "- [numpy](http://www.numpy.org/) - library that we will use for linear algebra operations\n", "- [matplotlib](https://matplotlib.org/) - library that we will use for plotting the data\n", "- [anomaly_detection](https://github.com/trekhleb/homemade-machine-learning/blob/master/homemade/anomaly_detection/gaussian_anomaly_detection.py) - custom implementation of anomaly detection using the Gaussian distribution" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Import 3rd party dependencies.\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "# Import custom Gaussian anomaly detection implementation.\n", "from homemade.anomaly_detection import GaussianAnomalyDetection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load the Data\n", "\n", "In this demo we will use a dataset of server operational parameters such as `Latency` and `Throughput` and try to find anomalies in it." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | Latency (ms) | Throughput (mb/s) | Anomaly |\n",
"|---|---|---|---|\n",
"| 0 | 13.046815 | 14.741152 | 0 |\n",
"| 1 | 13.408520 | 13.763270 | 0 |\n",
"| 2 | 14.195915 | 15.853181 | 0 |\n",
"| 3 | 14.914701 | 16.174260 | 0 |\n",
"| 4 | 13.576700 | 14.042849 | 0 |\n",
"| 5 | 13.922403 | 13.406469 | 0 |\n",
"| 6 | 12.822132 | 14.223188 | 0 |\n",
"| 7 | 15.676366 | 15.891691 | 0 |\n",
"| 8 | 16.162875 | 16.202998 | 0 |\n",
"| 9 | 12.666451 | 14.899084 | 1 |\n", "