{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Anomaly detection\n", "\n", "Anomaly detection is a machine learning task that consists in identifying observations that deviate markedly from the rest of the data, so-called outliers.\n", "\n", "“An outlier is an observation in a data set which appears to be inconsistent with the remainder of that set of data.”\n", "— Johnson, 1992\n", "\n", "“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”\n", "— Hawkins, 1980\n", "\n", "### Types of anomaly detection setups\n", "\n", "- Supervised AD\n", "  - Labels available for both normal data and anomalies\n", "  - Similar to rare-class mining / imbalanced classification\n", "- Semi-supervised AD (Novelty Detection)\n", "  - Only normal data available for training\n", "  - The algorithm learns on normal data only\n", "- Unsupervised AD (Outlier Detection)\n", "  - No labels; training set = normal + abnormal data\n", "  - Assumption: anomalies are very rare" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "import numpy as np\n", "import matplotlib\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Let's first get familiar with different unsupervised anomaly detection approaches and algorithms. In order to visualise the output of the different algorithms, we consider a toy data set consisting of a two-dimensional Gaussian mixture."
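] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The unsupervised setup above can be sketched with a small example (an illustrative aside, not part of the original notebook: the `IsolationForest` estimator and all parameter values here are assumptions chosen for the sketch). A detector is fitted on a contaminated sample and returns +1 for inliers and -1 for outliers:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.ensemble import IsolationForest\n", "\n", "rng = np.random.RandomState(42)\n", "X_normal = rng.randn(300, 2)                         # bulk of 'normal' points\n", "X_anom = np.array([[8., 8.], [-8., 8.], [8., -8.]])  # planted anomalies\n", "X_all = np.vstack([X_normal, X_anom])\n", "\n", "# contamination is the assumed fraction of outliers in the data\n", "iso = IsolationForest(contamination=0.01, random_state=42).fit(X_all)\n", "labels = iso.predict(X_all)  # +1 = inlier, -1 = outlier\n", "labels[-3:]  # the planted points should be flagged as -1"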
] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Generating the data set" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.datasets import make_blobs\n", "\n", "X, y = make_blobs(n_features=2, centers=3, n_samples=500,\n", "                  random_state=42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "X.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "plt.figure()\n", "plt.scatter(X[:, 0], X[:, 1])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Anomaly detection with density estimation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.neighbors import KernelDensity\n", "\n", "# Estimate the density with a Gaussian kernel density estimator\n", "kde = KernelDensity(kernel='gaussian')\n", "kde = kde.fit(X)\n", "kde" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "kde_X = kde.score_samples(X)\n", "print(kde_X.shape)  # log-density (log-likelihood) of each sample: the smaller it is, the rarer the sample" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from scipy.stats.mstats import mquantiles\n", "alpha_set = 0.95\n", "tau_kde = mquantiles(kde_X, 1. 
- alpha_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "n_samples, n_features = X.shape\n", "X_range = np.zeros((n_features, 2))\n", "X_range[:, 0] = np.min(X, axis=0) - 1.\n", "X_range[:, 1] = np.max(X, axis=0) + 1.\n", "\n", "h = 0.1  # step size of the mesh\n", "x_min, x_max = X_range[0]\n", "y_min, y_max = X_range[1]\n", "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n", "                     np.arange(y_min, y_max, h))\n", "\n", "grid = np.c_[xx.ravel(), yy.ravel()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "scrolled": true }, "outputs": [], "source": [ "Z_kde = kde.score_samples(grid)\n", "Z_kde = Z_kde.reshape(xx.shape)\n", "\n", "plt.figure()\n", "c_0 = plt.contour(xx, yy, Z_kde, levels=tau_kde, colors='red', linewidths=3)\n", "plt.clabel(c_0, inline=1, fontsize=15, fmt={tau_kde[0]: str(alpha_set)})\n", "plt.scatter(X[:, 0], X[:, 1])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Now with the One-Class SVM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem with density-based estimation is that it tends to become inefficient as the dimensionality of the data increases. This is the so-called curse of dimensionality, which affects density estimation algorithms in particular. The One-Class SVM algorithm can be used in such cases."
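] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a rough sketch of this point (an illustrative aside, not part of the original notebook: the dimensionality, `nu`, and `gamma` values below are assumptions), a One-Class SVM can be fitted directly on higher-dimensional data, where evaluating a density estimate on a grid would be impractical:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.svm import OneClassSVM\n", "\n", "rng = np.random.RandomState(0)\n", "X_hd = rng.randn(500, 50)  # 50-dimensional Gaussian data\n", "\n", "ocsvm_hd = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05).fit(X_hd)\n", "frac_out = np.mean(ocsvm_hd.predict(X_hd) == -1)\n", "frac_out  # fraction of flagged training points, roughly bounded above by nu"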
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.svm import OneClassSVM" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "nu = 0.05  # upper bound on the fraction of training outliers (and lower bound on the fraction of support vectors)\n", "ocsvm = OneClassSVM(kernel='rbf', gamma=0.05, nu=nu)\n", "ocsvm.fit(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "X_outliers = X[ocsvm.predict(X) == -1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "scrolled": false }, "outputs": [], "source": [ "Z_ocsvm = ocsvm.decision_function(grid)\n", "Z_ocsvm = Z_ocsvm.reshape(xx.shape)\n", "\n", "plt.figure()\n", "c_0 = plt.contour(xx, yy, Z_ocsvm, levels=[0], colors='red', linewidths=3)\n", "plt.clabel(c_0, inline=1, fontsize=15, fmt={0: str(alpha_set)})\n", "plt.scatter(X[:, 0], X[:, 1])\n", "plt.scatter(X_outliers[:, 0], X_outliers[:, 1], color='red')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Support vectors - Outliers\n", "\n", "The so-called support vectors of the One-Class SVM are the training samples that lie on or outside the decision boundary; the predicted outliers are among them." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "X_SV = X[ocsvm.support_]\n", "n_SV = len(X_SV)\n", "n_outliers = len(X_outliers)\n", "\n", "print('{0:.2f} <= {1:.2f} <= {2:.2f}?'.format(1. / n_samples * n_outliers, nu, 1. / n_samples * n_SV))" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Only the support vectors are involved in the decision function of the One-Class SVM.\n", "\n", "1. Plot the level sets of the One-Class SVM decision function, as we did for the estimated density.\n", "2. Emphasize the support vectors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "scrolled": true }, "outputs": [], "source": [ "plt.figure()\n", "plt.contourf(xx, yy, Z_ocsvm, 10, cmap=plt.cm.Blues_r)\n", "plt.scatter(X[:, 0], X[:, 1], s=1.)\n", "plt.scatter(X_SV[:, 0], X_SV[:, 1], color='orange')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "