{
 "metadata": {
  "name": "",
  "signature": "sha256:c042e7400f7b7152f868146d99f995cf5b9bc9d1c28fcf88c1de47242be5da95"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "MLlib: Decision Trees  "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "[Introduction to Spark with Python, by Jose A. Dianes](https://github.com/jadianes/spark-py-notebooks)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this notebook we will use Spark's machine learning library [MLlib](https://spark.apache.org/docs/latest/mllib-guide.html) to build a **Decision Tree** classifier for network attack detection. We will use the complete [KDD Cup 1999](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) datasets in order to test Spark capabilities with large datasets. "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Decision trees are a popular machine learning tool in part because they are easy to interpret, handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions. In this notebook, we will first train a classification tree including every single predictor. Then we will use our results to perform model selection. Once we find out the most important ones (the main splits in the tree) we will build a minimal tree using just three of them (the first two levels of the tree in order to compare performance and accuracy.   "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "At the time of processing this notebook, our Spark cluster contains:  \n",
      "\n",
      "- Eight nodes, with one of them acting as master and the rest as workers.  \n",
      "- Each node contains 8Gb of RAM, with 6Gb being used for each node.  \n",
      "- Each node has a 2.4Ghz Intel dual core processor.  \n",
      "- Running Apache Spark 1.3.1.  "
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Getting the data and creating the RDD"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As we said, this time we will use the complete dataset provided for the [KDD Cup 1999](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html), containing nearly half million network interactions. The file is provided as a Gzip file that we will download locally.  "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import urllib"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "f = urllib.urlretrieve (\"http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data.gz\", \"kddcup.data.gz\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "data_file = \"./kddcup.data.gz\"\n",
      "raw_data = sc.textFile(data_file)\n",
      "\n",
      "print \"Train data size is {}\".format(raw_data.count())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Train data size is 4898431\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The [KDD Cup 1999](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) also provide test data that we will load in a separate RDD.  "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ft = urllib.urlretrieve(\"http://kdd.ics.uci.edu/databases/kddcup99/corrected.gz\", \"corrected.gz\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "test_data_file = \"./corrected.gz\"\n",
      "test_raw_data = sc.textFile(test_data_file)\n",
      "\n",
      "print \"Test data size is {}\".format(test_raw_data.count())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Test data size is 311029\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Detecting network attacks using Decision Trees"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this section we will train a *classification tree* that, as we did with *logistic regression*, will predict if a network interaction is either `normal` or `attack`.  "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Training a classification tree using [MLlib](https://spark.apache.org/docs/latest/mllib-decision-tree.html) requires some parameters:  \n",
      "- Training data  \n",
      "- Num classes  \n",
      "- Categorical features info: a map from column to categorical variables arity. This is optional, although it should increase model accuracy. However it requires that we know the levels in our categorical variables in advance. second we need to parse our data to convert labels to integer values within the arity range.  \n",
      "- Impurity metric  \n",
      "- Tree maximum depth  \n",
      "- And tree maximum number of bins  \n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the next section we will see how to obtain all the labels within a dataset and convert them to numerical factors.  "
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Preparing the data"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As we said, in order to benefits from trees ability to seamlessly with categorical variables, we need to convert them to numerical factors. But first we need to obtain all the possible levels. We will use *set* transformations on a csv parsed RDD.  "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from pyspark.mllib.regression import LabeledPoint\n",
      "from numpy import array\n",
      "\n",
      "csv_data = raw_data.map(lambda x: x.split(\",\"))\n",
      "test_csv_data = test_raw_data.map(lambda x: x.split(\",\"))\n",
      "\n",
      "protocols = csv_data.map(lambda x: x[1]).distinct().collect()\n",
      "services = csv_data.map(lambda x: x[2]).distinct().collect()\n",
      "flags = csv_data.map(lambda x: x[3]).distinct().collect()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "And now we can use this Python lists in our `create_labeled_point` function. If a factor level is not in the training data, we assign an especial level. Remember that we cannot use testing data for training our model, not even the factor levels. The testing data represents the unknown to us in a real case.     "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def create_labeled_point(line_split):\n",
      "    # leave_out = [41]\n",
      "    clean_line_split = line_split[0:41]\n",
      "    \n",
      "    # convert protocol to numeric categorical variable\n",
      "    try: \n",
      "        clean_line_split[1] = protocols.index(clean_line_split[1])\n",
      "    except:\n",
      "        clean_line_split[1] = len(protocols)\n",
      "        \n",
      "    # convert service to numeric categorical variable\n",
      "    try:\n",
      "        clean_line_split[2] = services.index(clean_line_split[2])\n",
      "    except:\n",
      "        clean_line_split[2] = len(services)\n",
      "    \n",
      "    # convert flag to numeric categorical variable\n",
      "    try:\n",
      "        clean_line_split[3] = flags.index(clean_line_split[3])\n",
      "    except:\n",
      "        clean_line_split[3] = len(flags)\n",
      "    \n",
      "    # convert label to binary label\n",
      "    attack = 1.0\n",
      "    if line_split[41]=='normal.':\n",
      "        attack = 0.0\n",
      "        \n",
      "    return LabeledPoint(attack, array([float(x) for x in clean_line_split]))\n",
      "\n",
      "training_data = csv_data.map(create_labeled_point)\n",
      "test_data = test_csv_data.map(create_labeled_point)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Training a classifier"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We are now ready to train our classification tree. We will keep the `maxDepth` value small. This will lead to smaller accuracy, but we will obtain less splits so later on we can better interpret the tree. In a production system we will try to increase this value in order to find a better accuracy.    "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from pyspark.mllib.tree import DecisionTree, DecisionTreeModel\n",
      "from time import time\n",
      "\n",
      "# Build the model\n",
      "t0 = time()\n",
      "tree_model = DecisionTree.trainClassifier(training_data, numClasses=2, \n",
      "                                          categoricalFeaturesInfo={1: len(protocols), 2: len(services), 3: len(flags)},\n",
      "                                          impurity='gini', maxDepth=4, maxBins=100)\n",
      "tt = time() - t0\n",
      "\n",
      "print \"Classifier trained in {} seconds\".format(round(tt,3))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Classifier trained in 439.971 seconds\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Evaluating the model"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In order to measure the classification error on our test data, we use `map` on the `test_data` RDD and the model to predict each test point class. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "predictions = tree_model.predict(test_data.map(lambda p: p.features))\n",
      "labels_and_preds = test_data.map(lambda p: p.label).zip(predictions)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Classification results are returned in pars, with the actual test label and the predicted one. This is used to calculate the classification error by using `filter` and `count` as follows."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "t0 = time()\n",
      "test_accuracy = labels_and_preds.filter(lambda (v, p): v == p).count() / float(test_data.count())\n",
      "tt = time() - t0\n",
      "\n",
      "print \"Prediction made in {} seconds. Test accuracy is {}\".format(round(tt,3), round(test_accuracy,4))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Prediction made in 38.651 seconds. Test accuracy is 0.9155\n"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "*NOTE: the zip transformation doesn't work properly with pySpark 1.2.1. It does in 1.3*"
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Interpreting the model"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Understanding our tree splits is a great exercise in order to explain our classification labels in terms of predictors and the values they take. Using the `toDebugString` method in our three model we can obtain a lot of information regarding splits, nodes, etc.   "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print \"Learned classification tree model:\"\n",
      "print tree_model.toDebugString()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Learned classification tree model:\n",
        "DecisionTreeModel classifier of depth 4 with 27 nodes\n",
        "  If (feature 22 <= 89.0)\n",
        "   If (feature 3 in {2.0,3.0,4.0,7.0,9.0,10.0})\n",
        "    If (feature 36 <= 0.43)\n",
        "     If (feature 28 <= 0.19)\n",
        "      Predict: 1.0\n",
        "     Else (feature 28 > 0.19)\n",
        "      Predict: 0.0\n",
        "    Else (feature 36 > 0.43)\n",
        "     If (feature 2 in {0.0,3.0,15.0,26.0,27.0,36.0,42.0,58.0,67.0})\n",
        "      Predict: 0.0\n",
        "     Else (feature 2 not in {0.0,3.0,15.0,26.0,27.0,36.0,42.0,58.0,67.0})\n",
        "      Predict: 1.0\n",
        "   Else (feature 3 not in {2.0,3.0,4.0,7.0,9.0,10.0})\n",
        "    If (feature 2 in {50.0,51.0})\n",
        "     Predict: 0.0\n",
        "    Else (feature 2 not in {50.0,51.0})\n",
        "     If (feature 32 <= 168.0)\n",
        "      Predict: 1.0\n",
        "     Else (feature 32 > 168.0)\n",
        "      Predict: 0.0\n",
        "  Else (feature 22 > 89.0)\n",
        "   If (feature 5 <= 0.0)\n",
        "    If (feature 11 <= 0.0)\n",
        "     If (feature 31 <= 253.0)\n",
        "      Predict: 1.0\n",
        "     Else (feature 31 > 253.0)\n",
        "      Predict: 1.0\n",
        "    Else (feature 11 > 0.0)\n",
        "     If (feature 2 in {12.0})\n",
        "      Predict: 0.0\n",
        "     Else (feature 2 not in {12.0})\n",
        "      Predict: 1.0\n",
        "   Else (feature 5 > 0.0)\n",
        "    If (feature 29 <= 0.08)\n",
        "     If (feature 2 in {3.0,4.0,26.0,36.0,42.0,58.0,68.0})\n",
        "      Predict: 0.0\n",
        "     Else (feature 2 not in {3.0,4.0,26.0,36.0,42.0,58.0,68.0})\n",
        "      Predict: 1.0\n",
        "    Else (feature 29 > 0.08)\n",
        "     Predict: 1.0\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For example, a network interaction with the following features (see description [here](http://kdd.ics.uci.edu/databases/kddcup99/task.html)) will be classified as an attack by our model:  \n",
      "- `count`, the number of connections to the same host as the current connection in the past two seconds, being greater than 32. \n",
      "- `dst_bytes`, the number of data bytes from destination to source, is 0.  \n",
      "- `service` is neither level 0 nor 52.  \n",
      "- `logged_in` is false.  \n",
      "From our services list we know that:  "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print \"Service 0 is {}\".format(services[0])\n",
      "print \"Service 52 is {}\".format(services[52])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Service 0 is urp_i\n",
        "Service 52 is tftp_u\n"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So we can characterise network interactions with more than 32 connections to the same server in the last 2 seconds, transferring zero bytes from destination to source, where service is neither *urp_i* nor *tftp_u*, and not logged in, as network attacks. A similar approach can be used for each tree terminal node.     "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can see that `count` is the first node split in the tree. Remember that each partition is chosen greedily by selecting the best split from a set of possible splits, in order to maximize the information gain at a tree node (see more [here](https://spark.apache.org/docs/latest/mllib-decision-tree.html#basic-algorithm)). At a second level we find variables `flag` (normal or error status of the connection) and `dst_bytes` (the number of data bytes from destination to source) and so on.    "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This explaining capability of a classification (or regression) tree is one of its main benefits. Understaining data is a key factor to build better models."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Building a minimal model using the three main splits"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So now that we know the main features predicting a network attack, thanks to our classification tree splits, let's use them to build a minimal classification tree with just the main three variables: `count`, `dst_bytes`, and `flag`.  "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We need to define the appropriate function to create labeled points.  "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def create_labeled_point_minimal(line_split):\n",
      "    # leave_out = [41]\n",
      "    clean_line_split = line_split[3:4] + line_split[5:6] + line_split[22:23]\n",
      "    \n",
      "    # convert flag to numeric categorical variable\n",
      "    try:\n",
      "        clean_line_split[0] = flags.index(clean_line_split[0])\n",
      "    except:\n",
      "        clean_line_split[0] = len(flags)\n",
      "    \n",
      "    # convert label to binary label\n",
      "    attack = 1.0\n",
      "    if line_split[41]=='normal.':\n",
      "        attack = 0.0\n",
      "        \n",
      "    return LabeledPoint(attack, array([float(x) for x in clean_line_split]))\n",
      "\n",
      "training_data_minimal = csv_data.map(create_labeled_point_minimal)\n",
      "test_data_minimal = test_csv_data.map(create_labeled_point_minimal)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "That we use to train the model.  "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Build the model\n",
      "t0 = time()\n",
      "tree_model_minimal = DecisionTree.trainClassifier(training_data_minimal, numClasses=2, \n",
      "                                          categoricalFeaturesInfo={0: len(flags)},\n",
      "                                          impurity='gini', maxDepth=3, maxBins=32)\n",
      "tt = time() - t0\n",
      "\n",
      "print \"Classifier trained in {} seconds\".format(round(tt,3))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Classifier trained in 226.519 seconds\n"
       ]
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we can predict on the testing data and calculate accuracy.  "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "predictions_minimal = tree_model_minimal.predict(test_data_minimal.map(lambda p: p.features))\n",
      "labels_and_preds_minimal = test_data_minimal.map(lambda p: p.label).zip(predictions_minimal)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 17
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "t0 = time()\n",
      "test_accuracy = labels_and_preds_minimal.filter(lambda (v, p): v == p).count() / float(test_data_minimal.count())\n",
      "tt = time() - t0\n",
      "\n",
      "print \"Prediction made in {} seconds. Test accuracy is {}\".format(round(tt,3), round(test_accuracy,4))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Prediction made in 23.202 seconds. Test accuracy is 0.909\n"
       ]
      }
     ],
     "prompt_number": 18
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So we have trained a classification tree with just the three most important predictors, in half of the time, and with a not so bad accuracy. In fact, a classification tree is a very good model selection tool!    "
     ]
    }
   ],
   "metadata": {}
  }
 ]
}