{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## LOF algorithm for outlier detection\n", "
\n", "Author: Daniil Chepenko" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Theory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Outlier detection (also known as anomaly detection) is the process of finding data\n", "objects with behaviors that are very different from expectation. Such objects are called\n", "outliers or anomalies. \n", "\n", "The most interesting objects are those, that deviates significantly from the normal object. Outliers are not being generated by the same mechanism as rest of the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "Income and wealth inequality in various countries. South Korea, China, Japan, Denmark, Zimbabwe and Nambia cases requires further studies.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Outlier detection is important in many applications, such as:\n", "- Fraud in financial data\n", "- Intrusions in communication networks\n", "- Fake news and misinformation\n", "- Healthcare analysis\n", "- Industry damage detection\n", "- Security and surveillance\n", "- etc\n", "\n", "Outlier detection and clustering analysis are two highly related tasks. Clustering finds\n", "the majority patterns in a data set and organizes the data accordingly, whereas outlier\n", "detection tries to capture those exceptional cases that deviate substantially from the\n", "majority patterns.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outliers and noisy data\n", "Outliers differes from the noisy data. \n", "\n", "Moreover it should be removed while applying outlier detection. Noise may distort the normal objects and blur the distinction between normal objects and outliers. It may help hide outliers and reduce the effectiveness of outlier detection. For example, if a user consider of buying more expensive lunch that he used to buy usually, this behavior should be treated as a “noise transactions” like “random errors” or “variance”.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Types of outliers\n", "In general, outliers can be classified into three categories, namely global outliers, contextual\n", "(or conditional) outliers, and collective outliers.\n", "\n", "- Global outlier - Object significantly deviates from the rest of the data set\n", "- Contextual outlier - Object deviates significantly based on a selected context. For example, 28C is an outlier for a Moscow winter, but not an outlier in another context, 28C is not an outlier for a Moscow summer.\n", "- Collective outlier - A subset of data objects collectively deviate significantly from the whole data set, even if the individual data objects may not be outliers. For example, a collective outlier occurs when a number of computers keep sending denial-of-service packages to each other.\n", "\n", "Usually, a data set may contains different types of outliers and at the same time may belong to more than one type of outlier.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outlier Detection Methods\n", "There are many outlier detection methods in the literature and in practice. Firstly the outlier detection methods differs according to whether the sample of data for analysis is given with domain expert–provided labels that can be used to build an outlier detection model. Secondly, methods can be divided into groups according to their assumptions regarding normal objects versus outliers.\n", "\n", "If expert-labeled examples of normal and/or outlier objects can be obtained, they can be\n", "used to build outlier detection models. The methods used can be divided into supervised methods, semi-supervised methods, and unsupervised methods.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Supervised method\n", "Modeling outlier detection as a classification problem. Samples examined by domain experts used for training & testing.\n", "\n", "Challenges:\n", "- Classes are unbalanced. That is, the population of outliers is typically much smaller than that of normal objects. Methods for handling unbalanced classes can be used.\n", "- Catch as many outliers as possible, i.e., recall is more important than accuracy (i.e., not mislabeling normal objects as outliers)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Unsupervised method\n", "In some application scenarios, objects labeled as “normal” or “outlier” are not available.\n", "Thus, an unsupervised learning method has to be used.\n", "Unsupervised outlier detection methods make an implicit assumption: The normal\n", "objects are somewhat “clustered.” In other words, an unsupervised outlier detection\n", "method expects that normal objects follow a pattern far more frequently than outliers.\n", "\n", "Challenges:\n", "Unsupervised methods can’t detect collective outlier effectively. This assumption may not be true all the time.\n", "- Normal objects may not share any strong patterns, but the collective outliers may share high similarity in a small area\n", "\n", "For example in some intrusion or virus detection, normal activities are diverse\n", "- Unsupervised methods may have a high false positive rate but still miss many real outliers.\n", "- Supervised methods can be more effective, e.g., identify attacking some key resources\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Semi-Supervised Methods\n", "\n", "In many applications, although obtaining some labeled examples is feasible, the number\n", "of such labeled examples is often small. \n", "If some labeled normal objects are available:\n", "- Use the labeled examples and the proximate unlabeled objects to train a model for normal objects\n", "- Those not fitting the model of normal objects are detected as outliers\n", "\n", "If only some labeled outliers are available, a small number of labeled outliers may not cover the possible outliers well.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Statistical methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Statistical methods (also known as model-based methods) assume that the normal data follow some statistical model (a stochastic model). The idea is to learn a generative model fitting the given data set, and then identify the objects in low probability regions of the model as outliers.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Parametric methods\n", "A parametric method assumes that the normal data objects are generated by a parametric\n", "distribution with parameter $\\theta$. The probability density function of the parametric\n", "distribution $f (x,2)$ gives the probability that object x is generated by the distribution.\n", "The smaller this value, the more likely x is an outlier\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "avg_temp = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the maximum likelihood method to estimate the parameters μ and \". That is, we maximize the log-likelihood function\n", "\n", "$$lnL(\\mu,\\sigma^2) = \\sum_{i=1}^{n}lnf[x_i|(\\mu,\\sigma^2)] = -\\frac{n}{2}ln(2\\pi) -\\frac{n}{2}ln(2\\sigma) - \\frac{1}{2\\sigma^2} \\sum_{i=1}^{n}(x_i-\\mu)^2$$\n", "\n", "Taking derivatives with respect to μ and \"2 and solving the resulting system of first-order conditions leads to the following maximum likelihood estimates:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mean = avg_temp.mean()\n", "std = avg_temp.std()\n", "sigma = (avg_temp.min() - avg_temp.max()) / std\n", "print mean, std, sigma" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most deviating value, 24.0\"C, is 4.61\"C away from the estimated mean. We know that the $\\mu±3$ region contains 99.7% data under the assumption of normal distribution\n", "\n", "$\\sigma < -3 $. Then 24 is outlier \n", "\n", "The main drawback for parametric model is that in many cases, data distribution may not be known" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "plt.boxplot(avg_temp)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For multivariative data other methods are used. For example, we can compute Mahalaobis distance or use $\\chi^2$ statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Nonparametric methods\n", "In nonparametric methods for outlier detection, the model of “normal data” is learned\n", "from the input data, rather than assuming one a priori. Nonparametric methods often\n", "make fewer assumptions about the data, and thus can be applicable in more scenarios. \n", "As an example we can use histograms\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Proximity-Based Approaches\n", "\n", "Given a set of objects in feature space, a distance measure can be used to quantify the\n", "similarity between objects. Intuitively, objects that are far from others can be regarded\n", "as outliers. Proximity-based approaches assume that the proximity of an outlier object\n", "to its nearest neighbors significantly deviates from the proximity of the object to most\n", "of the other objects in the data set.\n", "\n", "#### Distance-based \n", "Distance-based outlier detection method consults the neighborhood of an object, which is defined by a given radius. An object is then considered\n", "an outlier if its neighborhood does not have enough other points.\n", "\n", "Formally, let $r (r > 0)$ be a distance threshold and $\\pi (0< \\pi < 1)$ be a fraction\n", " threshold. An object, o, is a $DB(r,\\pi)$-outlier if $$\\frac{||{o'|dist(o,o')\\leq r||}}{||D||} \\leq \\pi $$\n", "where $dist(o,o')$ is a distance measure\n", "\n", "Algorithms for mining distance-based outliers:\n", "- Index-based algorithm\n", "- Nested-loop algorithm\n", "- Cell-based algorithm\n", "\n", "#### Density-based\n", "Density-based outlier detection method investigates the density of an object and that of its neighbors.\n", "Here, an object is identified as an outlier if its density is relatively much lower than that\n", "of its neighbors.\n", "\n", "Many real-world data sets demonstrate a more complex structure, where objects\n", "may be considered outliers with respect to their local neighborhoods, rather than with\n", "respect to the global data distribution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "Is o1 and o2 be considered as an outliers?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Distanse-based methods are not able to detect $o_1$ and $o_2$ as an outlier. However, when they are considered\n", "locally with respect to cluster $C_1$ because $o_1$ and $o_2$ deviate significantly from the objects\n", "in $C_1$. Moreover, $o_1$ and $o_2$ are also far from the objects in $C_2$.\n", "\n", "The critical\n", "idea here is that we need to compare the density around an object with the density\n", "around its local neighbors. The basic assumption of density-based outlier detection\n", "methods is that the density around a nonoutlier object is similar to the density around\n", "its neighbors, while the density around an outlier object is significantly different from\n", "the density around its neighbors.\n", "\n", "$dist_k(o)$ -is a distance between object $o$ and k-nearest neighbors. The k-distance neighborhood of $o$ contains all objects of which the distance to $o$ is not greater that $dist_k(o)$ , the $k$-distance of $o$\n", "\n", "$$N_k(o)= [o'|o' \\in D, dist(o,o') \\leq dist_k(o)]$$\n", "\n", "We can use the average distance from the objects in $N_k(o)$ to o as the measure of the local density of $o$. If $o$ has very close neighbors $o'$ such that $dist(o, o')$ is very small, the statistical fluctuations of the\n", "distance measure can be undesirably high. To overcome this problem, we can switch to the following reachability distance measure by adding a smoothing effect.\n", "\n", "$$reachdist_k(o,o') = max[dist_k(o), dist(o,o')]$$\n", "\n", "$k$ is a user-specified parameter that controls the smoothing effect. Essentially, $k$\n", "specifies the minimum neighborhood to be examined to determine the local density\n", "of an object.Reachability distance is not symmetric.\n", "\n", "Local reachability density of an object $o$ is $$lrd_k(o) = \\frac{||N_k(o)||}{\\sum_{o' \\in N_k(o)}reachdist_k(o,o')}$$\n", "\n", "We calculate the local reachability density for an object\n", "and compare it with that of its neighbors to quantify the degree to which the object is\n", "considered an outlier.\n", "\n", "$$LOF_k(o) = \\frac{\\sum_{o'\\in N_k(o)} \\frac{lrd_k(o')}{lrd_k(o)} }{||N_k(o)||} = \\sum_{o' \\in N_k(o)}lrd_k(o')* \\sum_{o' \\in N_k(o)}reachdist_k(o',o)$$\n", "\n", "Local outlier factor is the average of the ratio of the local reachability\n", "density of o and those of $o’$ s $k$-nearest neighbors. The lower the local reachability density\n", "of $o$ and the higher the local\n", "reachability densities of the $k$-nearest neighbors of o, the higher the LOF value is. This\n", "exactly captures a local outlier of which the local density is relatively low compared to\n", "the local densities of its k-nearest neighbors.\n", "\n", "The meaning of LOF is easy to understand.\n", "
\n", "\n", "A property of LOF($o$)\n", "
\n", "\n", "The minimum reachability distance from o to its k-nearest neighbors\n", "$$direct_{min}(o) = min[reachdist_k(o',o)|o' \\in N_k(o)]$$\n", "\n", "While the maximum can be defined as follows:\n", "$$direct_{max}(o) = max[reachdist_k(o',o)|o' \\in N_k(o)]$$\n", "\n", "We also consider the neighbors of o' k-nearest neighbors\n", "$$indirect_{min}(o) = min[reachdistk(o'',o')|o' \\in N_k(o) \\text{ and } o'' \\in N_k(o')]$$\n", "\n", "and \n", "$$indirect_{max}(o) = max[reachdistk(o'',o')|o' \\in N_k(o) \\text{ and } o'' \\in N_k(o')]$$\n", "\n", "Then LOF(o) is bounded as\n", "\n", "$$\\frac{direct_{min}(o)}{indirect_{max}(o)} \\leq LOF(o) \\leq \\frac{direct_{max}(o)}{indirect_{min}(o)}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Outlier detection with LOF" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the clickstream event frequency pattern in data.csv, we will apply LOF algorithm to\n", "calculate LOF for each point with the following initial settings:\n", "1. k = 2 and use Manhattan distance. \n", "2. k = 3 and use Euclidean distance.\n", "\n", "\n", "As the result, we will find 5 outliers and their $LOF_k(o)$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "from scipy.spatial.distance import pdist, squareform" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_input = pd.read_csv(\"../../data/clikstream_data.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_input.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Reachdist function\n", "def reachdist(distance_df, observation, index):\n", " return distance_df[observation][index]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# LOF algorithm implementation from scratch\n", "def LOF_algorithm(data_input, distance_metric=\"cityblock\", p=5):\n", " distances = pdist(data_input.values, metric=distance_metric)\n", " dist_matrix = squareform(distances)\n", " distance_df = pd.DataFrame(dist_matrix)\n", "\n", " k = 2 if distance_metric == \"cityblock\" else 3\n", " observations = distance_df.columns\n", " lrd_dict = {}\n", " n_dist_index = {}\n", " reach_array_dict = {}\n", "\n", " for observation in observations:\n", " dist = distance_df[observation].nsmallest(k + 1).iloc[k]\n", " indexes = distance_df[distance_df[observation] <= dist].drop(observation).index\n", " n_dist_index[observation] = indexes\n", "\n", " reach_dist_array = []\n", " for index in indexes:\n", " # make a function reachdist(observation, index)\n", " dist_between_observation_and_index = reachdist(\n", " distance_df, observation, index\n", " )\n", " dist_index = distance_df[index].nsmallest(k + 1).iloc[k]\n", " reach_dist = max(dist_index, dist_between_observation_and_index)\n", " reach_dist_array.append(reach_dist)\n", " lrd_observation = len(indexes) / sum(reach_dist_array)\n", " reach_array_dict[observation] = reach_dist_array\n", " lrd_dict[observation] = lrd_observation\n", "\n", " # Calculate LOF\n", " LOF_dict = {}\n", " for observation in observations:\n", " lrd_array = []\n", " for index in n_dist_index[observation]:\n", " lrd_array.append(lrd_dict[index])\n", " LOF = (\n", " sum(lrd_array)\n", " * sum(reach_array_dict[observation])\n", " / np.square(len(n_dist_index[observation]))\n", " )\n", " LOF_dict[observation] = LOF\n", "\n", " return sorted(LOF_dict.items(), key=lambda x: x[1], reverse=True)[:p]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "LOF_algorithm(data_input, p=5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "LOF_algorithm(data_input, p=5, distance_metric=\"euclidean\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }