{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Anomaly Detection using Deep Learning\n", "\n", "## Demo\n", "\n", "We'll train a neural network called an Autoencoder on a credit card fraud dataset from [Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud) without labels. It will learn how to reconstruct the input data after time. Then, we'll use its learned representations to turn to perform binary classification.\n", "\n", "## What's an autoencoder?\n", "\n", "![alt text](https://vanishingcodes.files.wordpress.com/2017/06/stackedae.png \"Logo Title Text 1\")\n", "\n", "Auto-encoder is a type of neural network that approximates the function: \n", "\n", "f(x) = x. \n", "\n", "Basically, given an input x, network will learn to output f(x) that is as close as to x. \n", "\n", "The error between output and x is commonly measured using root mean square error (RMSE)\n", "\n", "mean((f(x) – x) ^ 2) \n", "\n", "which is the loss function we try to minimise in our network.\n", "\n", "Autoencoders follows a typical feed-forward neural networks architecture except that the output layer has exactly same number of neurons as input layer. And it uses the input data itself as its target. Therefore it works in a way of unsupervised learning – learn without predicting an actual label\n", "\n", "The ‘encoder’ – job is to ’embed’ the input data into a lower dimensional array. \n", "The ‘decoder’ - job is to try to decode the embedding array into the original one.\n", "\n", "We can have either one hidden layer, or in the case below, have multiple layers depending on the complexity of our features.\n", "\n", "## What are other Anomaly Detection methods?\n", "\n", "PCA - Principal Component analysis. This is actually a dimensionality reduction technique but can be used to detect anomalies since when visualized, the learned 'principal components' will show outliers.\n", "\n", "![alt text](https://upload.wikimedia.org/wikipedia/commons/6/69/Principal_Component_Analysis_of_the_Italian_population.png \"Logo Title Text 1\")\n", "\n", "K means - This is simple arithmetic, a very popular algorithm. Gives great results for small datasets, utility declines as your data grows >1000 data points.\n", "\n", "![alt text](https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/ClusterAnalysis_Mouse.svg/450px-ClusterAnalysis_Mouse.svg.png \"Logo Title Text 1\")\n", "\n", "![alt text](https://blog.keras.io/img/ae/autoencoder_schema.jpg \"Logo Title Text 1\")\n", "\n", "\n", "![alt text](https://cdn-images-1.medium.com/max/1600/1*8ixTe1VHLsmKB3AquWdxpQ.png \"Logo Title Text 1\")\n", "\n", "\n", "## What are the steps\n", "\n", "1. First part of the forward pass - encode the data points into an encoded representation\n", "2. Second part of the forward pass - decode the data points \n", "3. Measure the Root mean squared error. Minimize it using backpropagation. \n", "4. We'll define a threshold value. If the RMSE is above that threshold value then, it's considered an anomaly. We can Rank the RSME values and the top .1 can be considered anomalies.\n", "\n", "The reason the top X datapoints with the highest RMSE error are considered anomalies is because the RMSE will measure the difference between the learned patterns and the anomaly value. Since the majority of the datapoints share similar patterns, the anomalies will stick out. They'll have the highest difference. 
\n", "\n", "So we're basically creating a generalized latent representation of all the datapoints (both fraud and valid) and since the valid datapoints far outweight the fraud datapoints, this learned representation will be one for valid datapoints. Then when the difference between the learning and the fraud datapoint is measured, it will be very high. Anomaly spotted! \n", "\n", "\n", "## How can this be applied at CERN?\n", "\n", "Why use deep learning?\n", "-Particle data is very high dimensional\n", "-It works well on big datasets and big computing power\n", "-No need to perform feature engineering, deep nets will learn the most relevant features.\n", "\n", "-Classification for known particles\n", "-Generative models for simulations (Generative adversarial networks, variational autoencoders)\n", "-Clustering (What we're talking about)\n", "\n", "![alt text](https://home.cern/sites/home.web.cern.ch/files/image/featured/2014/01/higgs-simulation-3.jpg \"Logo Title Text 1\")\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Load libraries " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd #data manipulation\n", "import numpy as np #matrix math\n", "import tensorflow as tf #machine learning\n", "import os #saving files\n", "from datetime import datetime #logging\n", "from sklearn.metrics import roc_auc_score as auc #measuring accuracy\n", "import seaborn as sns #plotting" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import matplotlib.pyplot as plt #plotting\n", "import matplotlib.gridspec as gridspec\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Exploration" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data_dir = \"C:\\\\Users\\\\weimin\\\\Desktop\\\\Fraud\"\n", "df = pd.read_csv(os.path.join(data_dir, 'creditcard.csv'))" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(284807, 31)" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total time spanning: 2.0 days\n", "0.173 % of all transactions are fraud. \n" ] } ], "source": [ "print(\"Total time spanning: {:.1f} days\".format(df['Time'].max() / (3600 * 24.0)))\n", "print(\"{:.3f} % of all transactions are fraud. \".format(np.sum(df['Class']) / df.shape[0] * 100))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", " | Time | \n", "V1 | \n", "V2 | \n", "V3 | \n", "V4 | \n", "V5 | \n", "V6 | \n", "V7 | \n", "V8 | \n", "V9 | \n", "... | \n", "V21 | \n", "V22 | \n", "V23 | \n", "V24 | \n", "V25 | \n", "V26 | \n", "V27 | \n", "V28 | \n", "Amount | \n", "Class | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "0.0 | \n", "-1.359807 | \n", "-0.072781 | \n", "2.536347 | \n", "1.378155 | \n", "-0.338321 | \n", "0.462388 | \n", "0.239599 | \n", "0.098698 | \n", "0.363787 | \n", "... | \n", "-0.018307 | \n", "0.277838 | \n", "-0.110474 | \n", "0.066928 | \n", "0.128539 | \n", "-0.189115 | \n", "0.133558 | \n", "-0.021053 | \n", "149.62 | \n", "0 | \n", "
1 | \n", "0.0 | \n", "1.191857 | \n", "0.266151 | \n", "0.166480 | \n", "0.448154 | \n", "0.060018 | \n", "-0.082361 | \n", "-0.078803 | \n", "0.085102 | \n", "-0.255425 | \n", "... | \n", "-0.225775 | \n", "-0.638672 | \n", "0.101288 | \n", "-0.339846 | \n", "0.167170 | \n", "0.125895 | \n", "-0.008983 | \n", "0.014724 | \n", "2.69 | \n", "0 | \n", "
2 | \n", "1.0 | \n", "-1.358354 | \n", "-1.340163 | \n", "1.773209 | \n", "0.379780 | \n", "-0.503198 | \n", "1.800499 | \n", "0.791461 | \n", "0.247676 | \n", "-1.514654 | \n", "... | \n", "0.247998 | \n", "0.771679 | \n", "0.909412 | \n", "-0.689281 | \n", "-0.327642 | \n", "-0.139097 | \n", "-0.055353 | \n", "-0.059752 | \n", "378.66 | \n", "0 | \n", "
3 | \n", "1.0 | \n", "-0.966272 | \n", "-0.185226 | \n", "1.792993 | \n", "-0.863291 | \n", "-0.010309 | \n", "1.247203 | \n", "0.237609 | \n", "0.377436 | \n", "-1.387024 | \n", "... | \n", "-0.108300 | \n", "0.005274 | \n", "-0.190321 | \n", "-1.175575 | \n", "0.647376 | \n", "-0.221929 | \n", "0.062723 | \n", "0.061458 | \n", "123.50 | \n", "0 | \n", "
4 | \n", "2.0 | \n", "-1.158233 | \n", "0.877737 | \n", "1.548718 | \n", "0.403034 | \n", "-0.407193 | \n", "0.095921 | \n", "0.592941 | \n", "-0.270533 | \n", "0.817739 | \n", "... | \n", "-0.009431 | \n", "0.798278 | \n", "-0.137458 | \n", "0.141267 | \n", "-0.206010 | \n", "0.502292 | \n", "0.219422 | \n", "0.215153 | \n", "69.99 | \n", "0 | \n", "
5 rows × 31 columns
\n", "