{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Anomaly detection\n", "\n", "Anomaly detection is a machine learning task that consists of spotting so-called outliers.\n", "\n", "“An outlier is an observation in a data set which appears to be inconsistent with the remainder of that set of data.”\n", "Johnson 1992\n", "\n", "“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”\n", "Hawkins 1980\n", "\n", "### Types of anomaly detection setups\n", "\n", "- Supervised AD\n", "    - Labels available for both normal data and anomalies\n", "    - Similar to rare class mining / imbalanced classification\n", "- Semi-supervised AD (Novelty Detection)\n", "    - Only normal data available to train\n", "    - The algorithm learns on normal data only\n", "- Unsupervised AD (Outlier Detection)\n", "    - No labels, training set = normal + abnormal data\n", "    - Assumption: anomalies are very rare" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "import numpy as np\n", "import matplotlib\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Let's first get familiar with different unsupervised anomaly detection approaches and algorithms. To visualise the output of the different algorithms, we consider a toy data set consisting of a two-dimensional Gaussian mixture."
] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Generating the data set" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.datasets import make_blobs\n", "\n", "X, y = make_blobs(n_features=2, centers=3, n_samples=500,\n", "                  random_state=42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "X.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "plt.figure()\n", "plt.scatter(X[:, 0], X[:, 1])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Anomaly detection with density estimation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.neighbors import KernelDensity\n", "\n", "# Estimate density with a Gaussian kernel density estimator\n", "kde = KernelDensity(kernel='gaussian')\n", "kde = kde.fit(X)\n", "kde" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "kde_X = kde.score_samples(X)\n", "print(kde_X.shape)  # log-density of each sample: the lower the value, the rarer the sample" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from scipy.stats.mstats import mquantiles\n", "alpha_set = 0.95\n", "tau_kde = mquantiles(kde_X, 1.
- alpha_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "n_samples, n_features = X.shape\n", "X_range = np.zeros((n_features, 2))\n", "X_range[:, 0] = np.min(X, axis=0) - 1.\n", "X_range[:, 1] = np.max(X, axis=0) + 1.\n", "\n", "h = 0.1 # step size of the mesh\n", "x_min, x_max = X_range[0]\n", "y_min, y_max = X_range[1]\n", "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n", "                     np.arange(y_min, y_max, h))\n", "\n", "grid = np.c_[xx.ravel(), yy.ravel()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "scrolled": true }, "outputs": [], "source": [ "Z_kde = kde.score_samples(grid)\n", "Z_kde = Z_kde.reshape(xx.shape)\n", "\n", "plt.figure()\n", "c_0 = plt.contour(xx, yy, Z_kde, levels=tau_kde, colors='red', linewidths=3)\n", "plt.clabel(c_0, inline=1, fontsize=15, fmt={tau_kde[0]: str(alpha_set)})\n", "plt.scatter(X[:, 0], X[:, 1])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Anomaly detection with One-Class SVM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem with density-based estimation is that it tends to become inefficient as the dimensionality of the data increases. This is the so-called curse of dimensionality, which particularly affects density estimation algorithms. The one-class SVM algorithm can be used in such cases."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.svm import OneClassSVM" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "nu = 0.05 # theory says it should be an upper bound on the fraction of outliers\n", "ocsvm = OneClassSVM(kernel='rbf', gamma=0.05, nu=nu)\n", "ocsvm.fit(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "X_outliers = X[ocsvm.predict(X) == -1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "scrolled": false }, "outputs": [], "source": [ "Z_ocsvm = ocsvm.decision_function(grid)\n", "Z_ocsvm = Z_ocsvm.reshape(xx.shape)\n", "\n", "plt.figure()\n", "c_0 = plt.contour(xx, yy, Z_ocsvm, levels=[0], colors='red', linewidths=3)\n", "plt.clabel(c_0, inline=1, fontsize=15, fmt={0: str(alpha_set)})\n", "plt.scatter(X[:, 0], X[:, 1])\n", "plt.scatter(X_outliers[:, 0], X_outliers[:, 1], color='red')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Support vectors - Outliers\n", "\n", "The outliers are among the so-called support vectors of the one-class SVM: `nu` is an upper bound on the fraction of outliers and a lower bound on the fraction of support vectors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "X_SV = X[ocsvm.support_]\n", "n_SV = len(X_SV)\n", "n_outliers = len(X_outliers)\n", "\n", "print('{0:.2f} <= {1:.2f} <= {2:.2f}?'.format(1./n_samples*n_outliers, nu, 1./n_samples*n_SV))" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Only the support vectors are involved in the decision function of the One-Class SVM.\n", "\n", "1.
Plot the level sets of the One-Class SVM decision function as we did for the estimated density.\n", "2. Emphasize the support vectors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "scrolled": true }, "outputs": [], "source": [ "plt.figure()\n", "plt.contourf(xx, yy, Z_ocsvm, 10, cmap=plt.cm.Blues_r)\n", "plt.scatter(X[:, 0], X[:, 1], s=1.)\n", "plt.scatter(X_SV[:, 0], X_SV[:, 1], color='orange')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE:\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# %load solutions/22_A-anomaly_ocsvm_gamma.py" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Isolation Forest\n", "\n", "Isolation Forest is an anomaly detection algorithm based on trees. The algorithm builds a number of random trees and the rationale is that if a sample is isolated it should alone in a leaf after very few random splits. Isolation Forest builds a score of abnormality based the depth of the tree at which samples end up." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.ensemble import IsolationForest" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "iforest = IsolationForest(n_estimators=300, contamination=0.10)\n", "iforest = iforest.fit(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "scrolled": true }, "outputs": [], "source": [ "Z_iforest = iforest.decision_function(grid)\n", "Z_iforest = Z_iforest.reshape(xx.shape)\n", "\n", "plt.figure()\n", "c_0 = plt.contour(xx, yy, Z_iforest,\n", " levels=[iforest.threshold_],\n", " colors='red', linewidths=3)\n", "plt.clabel(c_0, inline=1, fontsize=15,\n", " fmt={iforest.threshold_: str(alpha_set)})\n", "plt.scatter(X[:, 0], X[:, 1], s=1.)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE:\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# %load solutions/22_B-anomaly_iforest_n_trees.py" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Illustration on Digits data set\n", "\n", "\n", "We will now apply the IsolationForest algorithm to spot digits written in an unconventional way." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.datasets import load_digits\n", "digits = load_digits()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The digits data set consists in images (8 x 8) of digits." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "images = digits.images\n", "labels = digits.target\n", "images.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "i = 102\n", "\n", "plt.figure(figsize=(2, 2))\n", "plt.title('{0}'.format(labels[i]))\n", "plt.axis('off')\n", "plt.imshow(images[i], cmap=plt.cm.gray_r, interpolation='nearest')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "To use the images as a training set we need to flatten the images." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "n_samples = len(digits.images)\n", "data = digits.images.reshape((n_samples, -1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "data.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "X = data\n", "y = digits.target" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "X.shape" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Let's focus on digit 5." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "X_5 = X[y == 5]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "X_5.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "fig, axes = plt.subplots(1, 5, figsize=(10, 4))\n", "for ax, x in zip(axes, X_5[:5]):\n", "    img = x.reshape(8, 8)\n", "    ax.imshow(img, cmap=plt.cm.gray_r, interpolation='nearest')\n", "    ax.axis('off')" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "1. Let's use IsolationForest to find the top 5% most abnormal images.\n", "2. Let's plot them!"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.ensemble import IsolationForest\n", "iforest = IsolationForest(contamination=0.05)\n", "iforest = iforest.fit(X_5)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Compute the level of \"abnormality\" with `iforest.decision_function`. The lower, the more abnormal." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "iforest_X = iforest.decision_function(X_5)\n", "plt.hist(iforest_X);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's plot the strongest inliers" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "X_strong_inliers = X_5[np.argsort(iforest_X)[-10:]]\n", "\n", "fig, axes = plt.subplots(2, 5, figsize=(10, 5))\n", "\n", "for i, ax in zip(range(len(X_strong_inliers)), axes.ravel()):\n", "    ax.imshow(X_strong_inliers[i].reshape((8, 8)),\n", "              cmap=plt.cm.gray_r, interpolation='nearest')\n", "    ax.axis('off')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's plot the strongest outliers" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "fig, axes = plt.subplots(2, 5, figsize=(10, 5))\n", "\n", "X_outliers = X_5[iforest.predict(X_5) == -1]\n", "\n", "for i, ax in zip(range(len(X_outliers)), axes.ravel()):\n", "    ax.imshow(X_outliers[i].reshape((8, 8)),\n", "              cmap=plt.cm.gray_r, interpolation='nearest')\n", "    ax.axis('off')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE:\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# %load solutions/22_C-anomaly_digits.py" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }