{ "metadata": { "name": "", "signature": "sha256:0e119d09c9a6aa7c07a2d8793325942daa8c14bd2cbca47bc247be4d926c625b" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "MLlib: Basic Statistics and Exploratory Data Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Introduction to Spark with Python, by Jose A. Dianes](https://github.com/jadianes/spark-py-notebooks)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far we have used different map and aggregation functions, on simple and key/value pair RDD's, in order to get simple statistics that help us understand our datasets. In this notebook we will introduce Spark's machine learning library [MLlib](https://spark.apache.org/docs/latest/mllib-guide.html) through its basic statistics functionality in order to better understand our dataset. We will use the reduced 10-percent [KDD Cup 1999](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) datasets through the notebook. " ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Getting the data and creating the RDD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we did in our first notebook, we will use the reduced dataset (10 percent) provided for the [KDD Cup 1999](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html), containing nearly half million network interactions. The file is provided as a Gzip file that we will download locally. 
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import urllib\n", "f = urllib.urlretrieve (\"http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz\", \"kddcup.data_10_percent.gz\")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "data_file = \"./kddcup.data_10_percent.gz\"\n", "raw_data = sc.textFile(data_file)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Local vectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A [local vector](https://spark.apache.org/docs/latest/mllib-data-types.html#local-vector) is often used as a base type for RDDs in Spark MLlib. A local vector has integer-typed and 0-based indices and double-typed values, stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For dense vectors, MLlib uses either Python *lists* or the *NumPy* `array` type. The later is recommended, so you can simply pass NumPy arrays around. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For sparse vectors, users can construct a `SparseVector` object from MLlib or pass *SciPy* `scipy.sparse` column vectors if SciPy is available in their environment. The easiest way to create sparse vectors is to use the factory methods implemented in `Vectors`. " ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "An RDD of dense vectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's represent each network interaction in our dataset as a dense vector. For that we will use the *NumPy* `array` type. 
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "\n", "def parse_interaction(line):\n", " line_split = line.split(\",\")\n", " # keep just numeric and logical values\n", " symbolic_indexes = [1,2,3,41]\n", " clean_line_split = [item for i,item in enumerate(line_split) if i not in symbolic_indexes]\n", " return np.array([float(x) for x in clean_line_split])\n", "\n", "vector_data = raw_data.map(parse_interaction)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Summary statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Spark's MLlib provides column summary statistics for `RDD[Vector]` through the function [`colStats`](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics.colStats) available in [`Statistics`](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics). The method returns an instance of [`MultivariateStatisticalSummary`](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.stat.MultivariateStatisticalSummary), which contains the column-wise *max*, *min*, *mean*, *variance*, and *number of nonzeros*, as well as the *total count*. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pyspark.mllib.stat import Statistics \n", "from math import sqrt \n", "\n", "# Compute column summary statistics.\n", "summary = Statistics.colStats(vector_data)\n", "\n", "print \"Duration Statistics:\"\n", "print \" Mean: {}\".format(round(summary.mean()[0],3))\n", "print \" St. 
deviation: {}\".format(round(sqrt(summary.variance()[0]),3))\n", "print \" Max value: {}\".format(round(summary.max()[0],3))\n", "print \" Min value: {}\".format(round(summary.min()[0],3))\n", "print \" Total value count: {}\".format(summary.count())\n", "print \" Number of non-zero values: {}\".format(summary.numNonzeros()[0])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Duration Statistics:\n", " Mean: 47.979\n", " St. deviation: 707.746\n", " Max value: 58329.0\n", " Min value: 0.0\n", " Total value count: 494021\n", " Number of non-zero values: 12350.0\n" ] } ], "prompt_number": 4 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Summary statistics by label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The interesting part of summary statistics, in our case, comes from being able to obtain them by the type of network attack or 'label' in our dataset. By doing so we can better characterise the dependent variable of our dataset in terms of the ranges of values of the independent variables. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To do so, we can filter an RDD that contains labels as keys and vectors as values. For that we just need to adapt our `parse_interaction` function to return a tuple with both elements. 
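The idea can be illustrated on an in-memory list before applying it to the RDD. This is only a sketch in plain Python, assuming three hypothetical (label, vector) pairs with made-up values; the list comprehension plays the role of `filter` followed by `values()`.

```python
# Hypothetical (label, vector) pairs standing in for parsed interactions.
pairs = [
    ('normal.', [10.0, 200.0]),
    ('smurf.', [0.0, 1032.0]),
    ('normal.', [30.0, 400.0]),
]

# Keep only the vectors whose key matches the label of interest.
normal_vectors = [vector for label, vector in pairs if label == 'normal.']

# Column-wise mean over the remaining vectors.
column_means = [sum(column) / len(column) for column in zip(*normal_vectors)]
print(column_means)
```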
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def parse_interaction_with_key(line):\n", " line_split = line.split(\",\")\n", " # keep just numeric and logical values\n", " symbolic_indexes = [1,2,3,41]\n", " clean_line_split = [item for i,item in enumerate(line_split) if i not in symbolic_indexes]\n", " return (line_split[41], np.array([float(x) for x in clean_line_split]))\n", "\n", "label_vector_data = raw_data.map(parse_interaction_with_key)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step is not very sophisticated. We use `filter` on the RDD to leave out other labels but the one we want to gather statistics from. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "normal_label_data = label_vector_data.filter(lambda x: x[0]==\"normal.\")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use the new RDD to call `colStats` on the values. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "normal_summary = Statistics.colStats(normal_label_data.values())" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "And collect the results as we did before. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "print \"Duration Statistics for label: {}\".format(\"normal\")\n", "print \" Mean: {}\".format(normal_summary.mean()[0],3)\n", "print \" St. 
deviation: {}\".format(round(sqrt(normal_summary.variance()[0]),3))\n", "print \" Max value: {}\".format(round(normal_summary.max()[0],3))\n", "print \" Min value: {}\".format(round(normal_summary.min()[0],3))\n", "print \" Total value count: {}\".format(normal_summary.count())\n", "print \" Number of non-zero values: {}\".format(normal_summary.numNonzeros()[0])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Duration Statistics for label: normal\n", " Mean: 216.657322313\n", " St. deviation: 1359.213\n", " Max value: 58329.0\n", " Min value: 0.0\n", " Total value count: 97278\n", " Number of non-zero values: 11690.0\n" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of working with key/value pairs, we could have just filtered the split raw data using the label in column 41, and then parsed the results as we did before. This would work as well. However, having our data organised as key/value pairs opens the door to better manipulations. Since `values()` is a transformation on an RDD, and not an action, we don't perform any computation until we call `colStats` anyway. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But let's wrap this within a function so we can reuse it with any label." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def summary_by_label(raw_data, label):\n", " label_vector_data = raw_data.map(parse_interaction_with_key).filter(lambda x: x[0]==label)\n", " return Statistics.colStats(label_vector_data.values())" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's give it a try with the \"normal.\" label again. 
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "normal_sum = summary_by_label(raw_data, \"normal.\")\n", "\n", "print \"Duration Statistics for label: {}\".format(\"normal\")\n", "print \" Mean: {}\".format(normal_sum.mean()[0],3)\n", "print \" St. deviation: {}\".format(round(sqrt(normal_sum.variance()[0]),3))\n", "print \" Max value: {}\".format(round(normal_sum.max()[0],3))\n", "print \" Min value: {}\".format(round(normal_sum.min()[0],3))\n", "print \" Total value count: {}\".format(normal_sum.count())\n", "print \" Number of non-zero values: {}\".format(normal_sum.numNonzeros()[0])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Duration Statistics for label: normal\n", " Mean: 216.657322313\n", " St. deviation: 1359.213\n", " Max value: 58329.0\n", " Min value: 0.0\n", " Total value count: 97278\n", " Number of non-zero values: 11690.0\n" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try now with some network attack. We have all of them listed [here](http://kdd.ics.uci.edu/databases/kddcup99/training_attack_types). " ] }, { "cell_type": "code", "collapsed": false, "input": [ "guess_passwd_summary = summary_by_label(raw_data, \"guess_passwd.\")\n", "\n", "print \"Duration Statistics for label: {}\".format(\"guess_password\")\n", "print \" Mean: {}\".format(guess_passwd_summary.mean()[0],3)\n", "print \" St. 
deviation: {}\".format(round(sqrt(guess_passwd_summary.variance()[0]),3))\n", "print \" Max value: {}\".format(round(guess_passwd_summary.max()[0],3))\n", "print \" Min value: {}\".format(round(guess_passwd_summary.min()[0],3))\n", "print \" Total value count: {}\".format(guess_passwd_summary.count())\n", "print \" Number of non-zero values: {}\".format(guess_passwd_summary.numNonzeros()[0])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Duration Statistics for label: guess_password\n", " Mean: 2.71698113208\n", " St. deviation: 11.88\n", " Max value: 60.0\n", " Min value: 0.0\n", " Total value count: 53\n", " Number of non-zero values: 4.0\n" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that this type of attack is shorter in duration than a normal interaction. We could build a table with duration statistics for each type of interaction in our dataset. First we need to get a list of labels as described in the first line [here](http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names). " ] }, { "cell_type": "code", "collapsed": false, "input": [ "label_list = [\"back.\",\"buffer_overflow.\",\"ftp_write.\",\"guess_passwd.\",\n", " \"imap.\",\"ipsweep.\",\"land.\",\"loadmodule.\",\"multihop.\",\n", " \"neptune.\",\"nmap.\",\"normal.\",\"perl.\",\"phf.\",\"pod.\",\"portsweep.\",\n", " \"rootkit.\",\"satan.\",\"smurf.\",\"spy.\",\"teardrop.\",\"warezclient.\",\n", " \"warezmaster.\"]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we get a list of statistics for each label. 
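The label list above could also be derived from the first line of the names file instead of being typed by hand. A minimal sketch, assuming that line is a comma-separated list of label names terminated by a single period (the string below is a shortened, illustrative stand-in, not the full file content):

```python
# Shortened stand-in for the first line of the names file (assumption).
names_line = 'back,buffer_overflow,ftp_write,normal.'

# Drop the trailing period, split on commas, and append the period
# that every label carries in the data file.
labels = [name.strip() + '.' for name in names_line.rstrip('.').split(',')]
print(labels)
```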
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "stats_by_label = [(label, summary_by_label(raw_data, label)) for label in label_list]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we get the *duration* column, first in our dataset (i.e. index 0). " ] }, { "cell_type": "code", "collapsed": false, "input": [ "duration_by_label = [ \n", " (stat[0], np.array([float(stat[1].mean()[0]), float(sqrt(stat[1].variance()[0])), float(stat[1].min()[0]), float(stat[1].max()[0]), int(stat[1].count())])) \n", " for stat in stats_by_label]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "That we can put into a Pandas data frame. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "pd.set_option('display.max_columns', 50)\n", "\n", "stats_by_label_df = pd.DataFrame.from_items(duration_by_label, columns=[\"Mean\", \"Std Dev\", \"Min\", \"Max\", \"Count\"], orient='index')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "And print it." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print \"Duration statistics, by label\"\n", "stats_by_label_df" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Duration statistics, by label\n" ] }, { "html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr><th></th><th>Mean</th><th>Std Dev</th><th>Min</th><th>Max</th><th>Count</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>back.</th><td>0.128915</td><td>1.110062</td><td>0</td><td>14</td><td>2203</td></tr>\n",
"<tr><th>buffer_overflow.</th><td>91.700000</td><td>97.514685</td><td>0</td><td>321</td><td>30</td></tr>\n",
"<tr><th>ftp_write.</th><td>32.375000</td><td>47.449033</td><td>0</td><td>134</td><td>8</td></tr>\n",
"<tr><th>guess_passwd.</th><td>2.716981</td><td>11.879811</td><td>0</td><td>60</td><td>53</td></tr>\n",
"<tr><th>imap.</th><td>6.000000</td><td>14.174240</td><td>0</td><td>41</td><td>12</td></tr>\n",
"<tr><th>ipsweep.</th><td>0.034483</td><td>0.438439</td><td>0</td><td>7</td><td>1247</td></tr>\n",
"<tr><th>land.</th><td>0.000000</td><td>0.000000</td><td>0</td><td>0</td><td>21</td></tr>\n",
"<tr><th>loadmodule.</th><td>36.222222</td><td>41.408869</td><td>0</td><td>103</td><td>9</td></tr>\n",
"<tr><th>multihop.</th><td>184.000000</td><td>253.851006</td><td>0</td><td>718</td><td>7</td></tr>\n",
"<tr><th>neptune.</th><td>0.000000</td><td>0.000000</td><td>0</td><td>0</td><td>107201</td></tr>\n",
"<tr><th>nmap.</th><td>0.000000</td><td>0.000000</td><td>0</td><td>0</td><td>231</td></tr>\n",
"<tr><th>normal.</th><td>216.657322</td><td>1359.213469</td><td>0</td><td>58329</td><td>97278</td></tr>\n",
"<tr><th>perl.</th><td>41.333333</td><td>14.843629</td><td>25</td><td>54</td><td>3</td></tr>\n",
"<tr><th>phf.</th><td>4.500000</td><td>5.744563</td><td>0</td><td>12</td><td>4</td></tr>\n",
"<tr><th>pod.</th><td>0.000000</td><td>0.000000</td><td>0</td><td>0</td><td>264</td></tr>\n",
"<tr><th>portsweep.</th><td>1915.299038</td><td>7285.125159</td><td>0</td><td>42448</td><td>1040</td></tr>\n",
"<tr><th>rootkit.</th><td>100.800000</td><td>216.185003</td><td>0</td><td>708</td><td>10</td></tr>\n",
"<tr><th>satan.</th><td>0.040277</td><td>0.522433</td><td>0</td><td>11</td><td>1589</td></tr>\n",
"<tr><th>smurf.</th><td>0.000000</td><td>0.000000</td><td>0</td><td>0</td><td>280790</td></tr>\n",
"<tr><th>spy.</th><td>318.000000</td><td>26.870058</td><td>299</td><td>337</td><td>2</td></tr>\n",
"<tr><th>teardrop.</th><td>0.000000</td><td>0.000000</td><td>0</td><td>0</td><td>979</td></tr>\n",
"<tr><th>warezclient.</th><td>615.257843</td><td>2207.694966</td><td>0</td><td>15168</td><td>1020</td></tr>\n",
"<tr><th>warezmaster.</th><td>15.050000</td><td>33.385271</td><td>0</td><td>156</td><td>20</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ " Mean Std Dev Min Max Count\n", "back. 0.128915 1.110062 0 14 2203\n", "buffer_overflow. 91.700000 97.514685 0 321 30\n", "ftp_write. 32.375000 47.449033 0 134 8\n", "guess_passwd. 2.716981 11.879811 0 60 53\n", "imap. 6.000000 14.174240 0 41 12\n", "ipsweep. 0.034483 0.438439 0 7 1247\n", "land. 0.000000 0.000000 0 0 21\n", "loadmodule. 36.222222 41.408869 0 103 9\n", "multihop. 184.000000 253.851006 0 718 7\n", "neptune. 0.000000 0.000000 0 0 107201\n", "nmap. 0.000000 0.000000 0 0 231\n", "normal. 216.657322 1359.213469 0 58329 97278\n", "perl. 41.333333 14.843629 25 54 3\n", "phf. 4.500000 5.744563 0 12 4\n", "pod. 0.000000 0.000000 0 0 264\n", "portsweep. 1915.299038 7285.125159 0 42448 1040\n", "rootkit. 100.800000 216.185003 0 708 10\n", "satan. 0.040277 0.522433 0 11 1589\n", "smurf. 0.000000 0.000000 0 0 280790\n", "spy. 318.000000 26.870058 299 337 2\n", "teardrop. 0.000000 0.000000 0 0 979\n", "warezclient. 615.257843 2207.694966 0 15168 1020\n", "warezmaster. 15.050000 33.385271 0 156 20" ] } ], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to reuse this code and get a dataframe from any variable in our dataset we will define a function. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "def get_variable_stats_df(stats_by_label, column_i):\n", " column_stats_by_label = [\n", " (stat[0], np.array([float(stat[1].mean()[column_i]), float(sqrt(stat[1].variance()[column_i])), float(stat[1].min()[column_i]), float(stat[1].max()[column_i]), int(stat[1].count())])) \n", " for stat in stats_by_label\n", " ]\n", " return pd.DataFrame.from_items(column_stats_by_label, columns=[\"Mean\", \"Std Dev\", \"Min\", \"Max\", \"Count\"], orient='index')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try for *duration* again. 
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "get_variable_stats_df(stats_by_label,0)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr><th></th><th>Mean</th><th>Std Dev</th><th>Min</th><th>Max</th><th>Count</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>back.</th><td>0.128915</td><td>1.110062</td><td>0</td><td>14</td><td>2203</td></tr>\n",
"<tr><th>buffer_overflow.</th><td>91.700000</td><td>97.514685</td><td>0</td><td>321</td><td>30</td></tr>\n",
"<tr><th>ftp_write.</th><td>32.375000</td><td>47.449033</td><td>0</td><td>134</td><td>8</td></tr>\n",
"<tr><th>guess_passwd.</th><td>2.716981</td><td>11.879811</td><td>0</td><td>60</td><td>53</td></tr>\n",
"<tr><th>imap.</th><td>6.000000</td><td>14.174240</td><td>0</td><td>41</td><td>12</td></tr>\n",
"<tr><th>ipsweep.</th><td>0.034483</td><td>0.438439</td><td>0</td><td>7</td><td>1247</td></tr>\n",
"<tr><th>land.</th><td>0.000000</td><td>0.000000</td><td>0</td><td>0</td><td>21</td></tr>\n",
"<tr><th>loadmodule.</th><td>36.222222</td><td>41.408869</td><td>0</td><td>103</td><td>9</td></tr>\n",
"<tr><th>multihop.</th><td>184.000000</td><td>253.851006</td><td>0</td><td>718</td><td>7</td></tr>\n",
"<tr><th>neptune.</th><td>0.000000</td><td>0.000000</td><td>0</td><td>0</td><td>107201</td></tr>\n",
"<tr><th>nmap.</th><td>0.000000</td><td>0.000000</td><td>0</td><td>0</td><td>231</td></tr>\n",
"<tr><th>normal.</th><td>216.657322</td><td>1359.213469</td><td>0</td><td>58329</td><td>97278</td></tr>\n",
"<tr><th>perl.</th><td>41.333333</td><td>14.843629</td><td>25</td><td>54</td><td>3</td></tr>\n",
"<tr><th>phf.</th><td>4.500000</td><td>5.744563</td><td>0</td><td>12</td><td>4</td></tr>\n",
"<tr><th>pod.</th><td>0.000000</td><td>0.000000</td><td>0</td><td>0</td><td>264</td></tr>\n",
"<tr><th>portsweep.</th><td>1915.299038</td><td>7285.125159</td><td>0</td><td>42448</td><td>1040</td></tr>\n",
"<tr><th>rootkit.</th><td>100.800000</td><td>216.185003</td><td>0</td><td>708</td><td>10</td></tr>\n",
"<tr><th>satan.</th><td>0.040277</td><td>0.522433</td><td>0</td><td>11</td><td>1589</td></tr>\n",
"<tr><th>smurf.</th><td>0.000000</td><td>0.000000</td><td>0</td><td>0</td><td>280790</td></tr>\n",
"<tr><th>spy.</th><td>318.000000</td><td>26.870058</td><td>299</td><td>337</td><td>2</td></tr>\n",
"<tr><th>teardrop.</th><td>0.000000</td><td>0.000000</td><td>0</td><td>0</td><td>979</td></tr>\n",
"<tr><th>warezclient.</th><td>615.257843</td><td>2207.694966</td><td>0</td><td>15168</td><td>1020</td></tr>\n",
"<tr><th>warezmaster.</th><td>15.050000</td><td>33.385271</td><td>0</td><td>156</td><td>20</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ " Mean Std Dev Min Max Count\n", "back. 0.128915 1.110062 0 14 2203\n", "buffer_overflow. 91.700000 97.514685 0 321 30\n", "ftp_write. 32.375000 47.449033 0 134 8\n", "guess_passwd. 2.716981 11.879811 0 60 53\n", "imap. 6.000000 14.174240 0 41 12\n", "ipsweep. 0.034483 0.438439 0 7 1247\n", "land. 0.000000 0.000000 0 0 21\n", "loadmodule. 36.222222 41.408869 0 103 9\n", "multihop. 184.000000 253.851006 0 718 7\n", "neptune. 0.000000 0.000000 0 0 107201\n", "nmap. 0.000000 0.000000 0 0 231\n", "normal. 216.657322 1359.213469 0 58329 97278\n", "perl. 41.333333 14.843629 25 54 3\n", "phf. 4.500000 5.744563 0 12 4\n", "pod. 0.000000 0.000000 0 0 264\n", "portsweep. 1915.299038 7285.125159 0 42448 1040\n", "rootkit. 100.800000 216.185003 0 708 10\n", "satan. 0.040277 0.522433 0 11 1589\n", "smurf. 0.000000 0.000000 0 0 280790\n", "spy. 318.000000 26.870058 299 337 2\n", "teardrop. 0.000000 0.000000 0 0 979\n", "warezclient. 615.257843 2207.694966 0 15168 1020\n", "warezmaster. 15.050000 33.385271 0 156 20" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now for the next numeric column in the dataset, *src_bytes*. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "print \"src_bytes statistics, by label\"\n", "get_variable_stats_df(stats_by_label,1)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "src_bytes statistics, by label\n" ] }, { "html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr><th></th><th>Mean</th><th>Std Dev</th><th>Min</th><th>Max</th><th>Count</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>back.</th><td>54156.355878</td><td>3159.360232</td><td>13140</td><td>54540</td><td>2203</td></tr>\n",
"<tr><th>buffer_overflow.</th><td>1400.433333</td><td>1337.132616</td><td>0</td><td>6274</td><td>30</td></tr>\n",
"<tr><th>ftp_write.</th><td>220.750000</td><td>267.747616</td><td>0</td><td>676</td><td>8</td></tr>\n",
"<tr><th>guess_passwd.</th><td>125.339623</td><td>3.037860</td><td>104</td><td>126</td><td>53</td></tr>\n",
"<tr><th>imap.</th><td>347.583333</td><td>629.926036</td><td>0</td><td>1492</td><td>12</td></tr>\n",
"<tr><th>ipsweep.</th><td>10.083400</td><td>5.231658</td><td>0</td><td>18</td><td>1247</td></tr>\n",
"<tr><th>land.</th><td>0.000000</td><td>0.000000</td><td>0</td><td>0</td><td>21</td></tr>\n",
"<tr><th>loadmodule.</th><td>151.888889</td><td>127.745298</td><td>0</td><td>302</td><td>9</td></tr>\n",
"<tr><th>multihop.</th><td>435.142857</td><td>540.960389</td><td>0</td><td>1412</td><td>7</td></tr>\n",
"<tr><th>neptune.</th><td>0.000000</td><td>0.000000</td><td>0</td><td>0</td><td>107201</td></tr>\n",
"<tr><th>nmap.</th><td>24.116883</td><td>59.419871</td><td>0</td><td>207</td><td>231</td></tr>\n",
"<tr><th>normal.</th><td>1157.047524</td><td>34226.124718</td><td>0</td><td>2194619</td><td>97278</td></tr>\n",
"<tr><th>perl.</th><td>265.666667</td><td>4.932883</td><td>260</td><td>269</td><td>3</td></tr>\n",
"<tr><th>phf.</th><td>51.000000</td><td>0.000000</td><td>51</td><td>51</td><td>4</td></tr>\n",
"<tr><th>pod.</th><td>1462.651515</td><td>125.098044</td><td>564</td><td>1480</td><td>264</td></tr>\n",
"<tr><th>portsweep.</th><td>666707.436538</td><td>21500665.866700</td><td>0</td><td>693375640</td><td>1040</td></tr>\n",
"<tr><th>rootkit.</th><td>294.700000</td><td>538.578180</td><td>0</td><td>1727</td><td>10</td></tr>\n",
"<tr><th>satan.</th><td>1.337319</td><td>42.946200</td><td>0</td><td>1710</td><td>1589</td></tr>\n",
"<tr><th>smurf.</th><td>935.772300</td><td>200.022386</td><td>520</td><td>1032</td><td>280790</td></tr>\n",
"<tr><th>spy.</th><td>174.500000</td><td>88.388348</td><td>112</td><td>237</td><td>2</td></tr>\n",
"<tr><th>teardrop.</th><td>28.000000</td><td>0.000000</td><td>28</td><td>28</td><td>979</td></tr>\n",
"<tr><th>warezclient.</th><td>300219.562745</td><td>1200905.243130</td><td>30</td><td>5135678</td><td>1020</td></tr>\n",
"<tr><th>warezmaster.</th><td>49.300000</td><td>212.155132</td><td>0</td><td>950</td><td>20</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ " Mean Std Dev Min Max Count\n", "back. 54156.355878 3159.360232 13140 54540 2203\n", "buffer_overflow. 1400.433333 1337.132616 0 6274 30\n", "ftp_write. 220.750000 267.747616 0 676 8\n", "guess_passwd. 125.339623 3.037860 104 126 53\n", "imap. 347.583333 629.926036 0 1492 12\n", "ipsweep. 10.083400 5.231658 0 18 1247\n", "land. 0.000000 0.000000 0 0 21\n", "loadmodule. 151.888889 127.745298 0 302 9\n", "multihop. 435.142857 540.960389 0 1412 7\n", "neptune. 0.000000 0.000000 0 0 107201\n", "nmap. 24.116883 59.419871 0 207 231\n", "normal. 1157.047524 34226.124718 0 2194619 97278\n", "perl. 265.666667 4.932883 260 269 3\n", "phf. 51.000000 0.000000 51 51 4\n", "pod. 1462.651515 125.098044 564 1480 264\n", "portsweep. 666707.436538 21500665.866700 0 693375640 1040\n", "rootkit. 294.700000 538.578180 0 1727 10\n", "satan. 1.337319 42.946200 0 1710 1589\n", "smurf. 935.772300 200.022386 520 1032 280790\n", "spy. 174.500000 88.388348 112 237 2\n", "teardrop. 28.000000 0.000000 28 28 979\n", "warezclient. 300219.562745 1200905.243130 30 5135678 1020\n", "warezmaster. 49.300000 212.155132 0 950 20" ] } ], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "And so on. By reusing the `summary_by_label` and `get_variable_stats_df` functions we can perform some exploratory data analysis in large datasets with Spark. " ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Correlations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Spark's MLlib supports [Pearson\u2019s](http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient) and [Spearman\u2019s](http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) to calculate pairwise correlation methods among many series. Both of them are provided by the `corr` method in the `Statistics` package. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have two options as input. Either two `RDD[Double]`s or an `RDD[Vector]`. In the first case the output will be a `Double` value, while in the second a whole correlation Matrix. Due to the nature of our data, we will obtain the second. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pyspark.mllib.stat import Statistics \n", "correlation_matrix = Statistics.corr(vector_data, method=\"spearman\")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have the correlations ready, we can start inspecting their values. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "pd.set_option('display.max_columns', 50)\n", "\n", "col_names = [\"duration\",\"src_bytes\",\"dst_bytes\",\"land\",\"wrong_fragment\",\n", " \"urgent\",\"hot\",\"num_failed_logins\",\"logged_in\",\"num_compromised\",\n", " \"root_shell\",\"su_attempted\",\"num_root\",\"num_file_creations\",\n", " \"num_shells\",\"num_access_files\",\"num_outbound_cmds\",\n", " \"is_hot_login\",\"is_guest_login\",\"count\",\"srv_count\",\"serror_rate\",\n", " \"srv_serror_rate\",\"rerror_rate\",\"srv_rerror_rate\",\"same_srv_rate\",\n", " \"diff_srv_rate\",\"srv_diff_host_rate\",\"dst_host_count\",\"dst_host_srv_count\",\n", " \"dst_host_same_srv_rate\",\"dst_host_diff_srv_rate\",\"dst_host_same_src_port_rate\",\n", " \"dst_host_srv_diff_host_rate\",\"dst_host_serror_rate\",\"dst_host_srv_serror_rate\",\n", " \"dst_host_rerror_rate\",\"dst_host_srv_rerror_rate\"]\n", "\n", "corr_df = pd.DataFrame(correlation_matrix, index=col_names, columns=col_names)\n", "\n", "corr_df" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
duration src_bytes dst_bytes land wrong_fragment urgent hot num_failed_logins logged_in num_compromised root_shell su_attempted num_root num_file_creations num_shells num_access_files num_outbound_cmds is_hot_login is_guest_login count srv_count serror_rate srv_serror_rate rerror_rate srv_rerror_rate same_srv_rate diff_srv_rate srv_diff_host_rate dst_host_count dst_host_srv_count dst_host_same_srv_rate dst_host_diff_srv_rate dst_host_same_src_port_rate dst_host_srv_diff_host_rate dst_host_serror_rate dst_host_srv_serror_rate dst_host_rerror_rate dst_host_srv_rerror_rate
duration 1.000000 0.014196 0.299189 -0.001068 -0.008025 0.017883 0.108639 0.014363 0.159564 0.010687 0.040425 0.026015 0.013401 0.061099 0.008632 0.019407 -0.000019 -0.000010 0.205606 -0.259032 -0.250139 -0.074211 -0.073663 -0.025936 -0.026420 0.062291 -0.050875 0.123621 -0.161107 -0.217167 -0.211979 0.231644 -0.065202 0.100692 -0.056753 -0.057298 -0.007759 -0.013891
src_bytes 0.014196 1.000000 -0.167931 -0.009404 -0.019358 0.000094 0.113920 -0.008396 -0.089702 0.118562 0.003067 0.002282 -0.002050 0.027710 0.014403 -0.001497 0.000010 0.000019 0.027511 0.666230 0.722609 -0.657460 -0.652391 -0.342180 -0.332977 0.744046 -0.739988 -0.104042 0.130377 0.741979 0.729151 -0.712965 0.815039 -0.140231 -0.645920 -0.641792 -0.297338 -0.300581
dst_bytes 0.299189 -0.167931 1.000000 -0.003040 -0.022659 0.007234 0.193156 0.021952 0.882185 0.169772 0.026054 0.012192 -0.003884 0.034154 -0.000054 0.065776 -0.000031 0.000041 0.085947 -0.639157 -0.497683 -0.205848 -0.198715 -0.100958 -0.081307 0.229677 -0.222572 0.521003 -0.611972 0.024124 0.055033 -0.035073 -0.396195 0.578557 -0.167047 -0.158378 -0.003042 0.001621
land -0.001068 -0.009404 -0.003040 1.000000 -0.000333 -0.000065 -0.000539 -0.000076 -0.002785 -0.000447 -0.000093 -0.000049 -0.000230 -0.000150 -0.000076 -0.000211 -0.002881 0.002089 -0.000250 -0.010939 -0.010128 0.014160 0.014342 -0.000451 -0.001690 0.002153 -0.001846 0.020678 -0.019923 -0.012341 0.002576 -0.001803 0.004265 0.016171 0.013566 0.012265 0.000389 -0.001816
wrong_fragment -0.008025 -0.019358 -0.022659 -0.000333 1.000000 -0.000150 -0.004042 -0.000568 -0.020911 -0.003370 -0.000528 -0.000248 -0.001727 -0.001160 -0.000507 -0.001519 -0.000147 0.000441 -0.001869 -0.057711 -0.029117 -0.008849 -0.023382 0.000430 -0.012676 0.010218 -0.009386 0.012117 -0.029149 -0.058225 -0.049560 0.055542 -0.015449 0.007306 0.010387 -0.024117 0.046656 -0.013666
urgent 0.017883 0.000094 0.007234 -0.000065 -0.000150 1.000000 0.008594 0.063009 0.006821 0.031765 0.067437 0.000020 0.061994 0.061383 -0.000066 0.023380 0.012879 0.005162 -0.000100 -0.004778 -0.004799 -0.001338 -0.001327 -0.000705 -0.000726 0.001521 -0.001522 -0.000788 -0.005894 -0.005698 -0.004078 0.005208 -0.001939 -0.000976 -0.001381 -0.001370 -0.000786 -0.000782
hot 0.108639 0.113920 0.193156 -0.000539 -0.004042 0.008594 1.000000 0.112560 0.189126 0.811529 0.101983 -0.000400 0.003096 0.028694 0.009146 0.004224 -0.000393 -0.000248 0.463706 -0.120847 -0.114735 -0.035487 -0.034934 0.013468 0.052003 0.041342 -0.040555 0.032141 -0.074178 -0.017960 0.018783 -0.017198 -0.086998 -0.014141 -0.004706 -0.010721 0.199019 0.189142
num_failed_logins 0.014363 -0.008396 0.021952 -0.000076 -0.000568 0.063009 0.112560 1.000000 -0.002190 0.004619 0.016895 0.072748 0.010060 0.015211 -0.000093 0.005581 0.003431 -0.001560 -0.000428 -0.018024 -0.018027 -0.003674 -0.004027 0.035324 0.034876 0.005716 -0.005538 -0.003096 -0.028369 -0.015092 0.003004 -0.002960 -0.006617 -0.002588 0.014713 0.014914 0.032395 0.032151
logged_in 0.159564 -0.089702 0.882185 -0.002785 -0.020911 0.006821 0.189126 -0.002190 1.000000 0.161190 0.025293 0.011813 0.082533 0.055530 0.024354 0.072698 0.000079 0.000127 0.089318 -0.578287 -0.438947 -0.187114 -0.180122 -0.091962 -0.072287 0.216969 -0.214019 0.503807 -0.682721 0.080352 0.114526 -0.093565 -0.359506 0.659078 -0.143283 -0.132474 0.007236 0.012979
num_compromised 0.010687 0.118562 0.169772 -0.000447 -0.003370 0.031765 0.811529 0.004619 0.161190 1.000000 0.085558 0.048985 0.028557 0.031223 0.011256 0.006977 0.001048 -0.000438 -0.002504 -0.097212 -0.091154 -0.030516 -0.030264 0.008573 0.054006 0.035253 -0.034953 0.036497 -0.041615 0.003465 0.038980 -0.039091 -0.078843 -0.020979 -0.005019 -0.004504 0.214115 0.217858
root_shell 0.040425 0.003067 0.026054 -0.000093 -0.000528 0.067437 0.101983 0.016895 0.025293 0.085558 1.000000 0.233486 0.094512 0.140650 0.132056 0.069353 0.011462 -0.006602 -0.000405 -0.016409 -0.015174 -0.004952 -0.004923 -0.001104 -0.001143 0.004946 -0.004553 0.002286 -0.021367 -0.011906 0.000515 -0.000916 -0.004617 0.008631 -0.003498 -0.003032 0.002763 0.002151
su_attempted 0.026015 0.002282 0.012192 -0.000049 -0.000248 0.000020 -0.000400 0.072748 0.011813 0.048985 0.233486 1.000000 0.119326 0.053110 0.040487 0.081272 -0.018896 0.012927 -0.000219 -0.008279 -0.008225 -0.002318 -0.002295 -0.001227 -0.001253 0.002634 -0.002649 0.000348 -0.006697 -0.006288 -0.005738 0.006687 -0.005020 0.001052 0.001974 0.002893 0.003173 0.001731
num_root 0.013401 -0.002050 -0.003884 -0.000230 -0.001727 0.061994 0.003096 0.010060 0.082533 0.028557 0.094512 0.119326 1.000000 0.047521 0.034405 0.014513 0.001524 -0.002585 -0.001281 -0.054721 -0.053530 -0.016031 -0.015936 -0.008610 -0.008708 0.013881 -0.011337 0.006316 -0.078717 -0.038689 -0.038935 0.047414 -0.015968 0.061030 -0.008457 -0.007096 -0.000421 -0.005012
num_file_creations 0.061099 0.027710 0.034154 -0.000150 -0.001160 0.061383 0.028694 0.015211 0.055530 0.031223 0.140650 0.053110 0.047521 1.000000 0.068660 0.031042 -0.004081 -0.001664 0.013242 -0.036467 -0.034598 -0.009703 -0.010390 -0.005069 -0.004775 0.009784 -0.008711 0.014412 -0.049529 -0.026890 -0.021731 0.027092 -0.015018 0.030590 -0.002257 -0.004295 0.000626 -0.001096
num_shells 0.008632 0.014403 -0.000054 -0.000076 -0.000507 -0.000066 0.009146 -0.000093 0.024354 0.011256 0.132056 0.040487 0.034405 0.068660 1.000000 0.019438 -0.002592 -0.006631 -0.000405 -0.013938 -0.011784 -0.004343 -0.004740 -0.002541 -0.002572 0.004282 -0.003743 0.001096 -0.021200 -0.012017 -0.009962 0.010761 -0.003521 0.015882 -0.001588 -0.002357 -0.000617 -0.002020
num_access_files 0.019407 -0.001497 0.065776 -0.000211 -0.001519 0.023380 0.004224 0.005581 0.072698 0.006977 0.069353 0.081272 0.014513 0.031042 0.019438 1.000000 -0.001597 -0.002850 0.002466 -0.045282 -0.040497 -0.013945 -0.013572 -0.007581 0.001874 0.015499 -0.015112 0.024266 -0.023865 -0.023657 -0.021358 0.026703 -0.033288 0.011765 -0.011197 -0.011487 -0.004743 -0.004552
num_outbound_cmds -0.000019 0.000010 -0.000031 -0.002881 -0.000147 0.012879 -0.000393 0.003431 0.000079 0.001048 0.011462 -0.018896 0.001524 -0.004081 -0.002592 -0.001597 1.000000 0.822890 0.000924 -0.000076 0.000100 0.000167 0.000209 0.000536 0.000346 0.000208 0.000328 -0.000141 -0.000424 -0.000280 -0.000503 -0.000181 -0.000455 0.000288 -0.000011 -0.000372 -0.000823 -0.001038
is_hot_login -0.000010 0.000019 0.000041 0.002089 0.000441 0.005162 -0.000248 -0.001560 0.000127 -0.000438 -0.006602 0.012927 -0.002585 -0.001664 -0.006631 -0.002850 0.822890 1.000000 0.001512 0.000036 0.000064 0.000102 -0.000302 -0.000550 0.000457 -0.000159 -0.000235 -0.000360 -0.000106 0.000206 0.000229 -0.000004 0.000283 0.000538 -0.000076 -0.000007 -0.000435 -0.000529
is_guest_login 0.205606 0.027511 0.085947 -0.000250 -0.001869 -0.000100 0.463706 -0.000428 0.089318 -0.002504 -0.000405 -0.000219 -0.001281 0.013242 -0.000405 0.002466 0.000924 0.001512 1.000000 -0.062340 -0.062713 -0.017343 -0.017240 -0.008867 -0.009193 0.018042 -0.017000 -0.008878 -0.055453 -0.044366 -0.041749 0.044640 -0.038092 -0.012578 -0.001066 -0.016885 0.025282 -0.004292
count -0.259032 0.666230 -0.639157 -0.010939 -0.057711 -0.004778 -0.120847 -0.018024 -0.578287 -0.097212 -0.016409 -0.008279 -0.054721 -0.036467 -0.013938 -0.045282 -0.000076 0.000036 -0.062340 1.000000 0.950587 -0.303538 -0.308923 -0.213824 -0.221352 0.346718 -0.361737 -0.384010 0.547443 0.586979 0.539698 -0.546869 0.776906 -0.496554 -0.331571 -0.335290 -0.261194 -0.256176
srv_count -0.250139 0.722609 -0.497683 -0.010128 -0.029117 -0.004799 -0.114735 -0.018027 -0.438947 -0.091154 -0.015174 -0.008225 -0.053530 -0.034598 -0.011784 -0.040497 0.000100 0.000064 -0.062713 0.950587 1.000000 -0.428185 -0.421424 -0.281468 -0.284034 0.517227 -0.511998 -0.239057 0.442611 0.720746 0.681955 -0.673916 0.812280 -0.391712 -0.449096 -0.442823 -0.313442 -0.308132
serror_rate -0.074211 -0.657460 -0.205848 0.014160 -0.008849 -0.001338 -0.035487 -0.003674 -0.187114 -0.030516 -0.004952 -0.002318 -0.016031 -0.009703 -0.004343 -0.013945 0.000167 0.000102 -0.017343 -0.303538 -0.428185 1.000000 0.990888 -0.091157 -0.095285 -0.851915 0.828012 -0.121489 0.165350 -0.724317 -0.745745 0.719708 -0.650336 -0.153568 0.973947 0.965663 -0.103198 -0.105434
srv_serror_rate -0.073663 -0.652391 -0.198715 0.014342 -0.023382 -0.001327 -0.034934 -0.004027 -0.180122 -0.030264 -0.004923 -0.002295 -0.015936 -0.010390 -0.004740 -0.013572 0.000209 -0.000302 -0.017240 -0.308923 -0.421424 0.990888 1.000000 -0.110664 -0.115286 -0.839315 0.815305 -0.112222 0.160322 -0.713313 -0.734334 0.707753 -0.646256 -0.148072 0.967214 0.970617 -0.122630 -0.124656
rerror_rate -0.025936 -0.342180 -0.100958 -0.000451 0.000430 -0.000705 0.013468 0.035324 -0.091962 0.008573 -0.001104 -0.001227 -0.008610 -0.005069 -0.002541 -0.007581 0.000536 -0.000550 -0.008867 -0.213824 -0.281468 -0.091157 -0.110664 1.000000 0.978813 -0.327986 0.345571 -0.017902 -0.067857 -0.330391 -0.303126 0.308722 -0.278465 0.073061 -0.094076 -0.110646 0.910225 0.911622
srv_rerror_rate -0.026420 -0.332977 -0.081307 -0.001690 -0.012676 -0.000726 0.052003 0.034876 -0.072287 0.054006 -0.001143 -0.001253 -0.008708 -0.004775 -0.002572 0.001874 0.000346 0.000457 -0.009193 -0.221352 -0.284034 -0.095285 -0.115286 0.978813 1.000000 -0.316568 0.333439 0.011285 -0.072595 -0.323032 -0.294328 0.300186 -0.282239 0.075178 -0.096146 -0.114341 0.904591 0.914904
same_srv_rate 0.062291 0.744046 0.229677 0.002153 0.010218 0.001521 0.041342 0.005716 0.216969 0.035253 0.004946 0.002634 0.013881 0.009784 0.004282 0.015499 0.000208 -0.000159 0.018042 0.346718 0.517227 -0.851915 -0.839315 -0.327986 -0.316568 1.000000 -0.982109 0.140660 -0.190121 0.848754 0.873551 -0.844537 0.732841 0.179040 -0.830067 -0.819335 -0.282487 -0.282913
diff_srv_rate -0.050875 -0.739988 -0.222572 -0.001846 -0.009386 -0.001522 -0.040555 -0.005538 -0.214019 -0.034953 -0.004553 -0.002649 -0.011337 -0.008711 -0.003743 -0.015112 0.000328 -0.000235 -0.017000 -0.361737 -0.511998 0.828012 0.815305 0.345571 0.333439 -0.982109 1.000000 -0.138293 0.185942 -0.844028 -0.868580 0.850911 -0.727031 -0.176930 0.807205 0.795844 0.299041 0.298904
srv_diff_host_rate 0.123621 -0.104042 0.521003 0.020678 0.012117 -0.000788 0.032141 -0.003096 0.503807 0.036497 0.002286 0.000348 0.006316 0.014412 0.001096 0.024266 -0.000141 -0.000360 -0.008878 -0.384010 -0.239057 -0.121489 -0.112222 -0.017902 0.011285 0.140660 -0.138293 1.000000 -0.445051 0.035010 0.068648 -0.050472 -0.222707 0.433173 -0.097973 -0.092661 0.022585 0.024722
dst_host_count -0.161107 0.130377 -0.611972 -0.019923 -0.029149 -0.005894 -0.074178 -0.028369 -0.682721 -0.041615 -0.021367 -0.006697 -0.078717 -0.049529 -0.021200 -0.023865 -0.000424 -0.000106 -0.055453 0.547443 0.442611 0.165350 0.160322 -0.067857 -0.072595 -0.190121 0.185942 -0.445051 1.000000 0.022731 -0.070448 0.044338 0.189876 -0.918894 0.123881 0.113845 -0.125142 -0.125273
dst_host_srv_count -0.217167 0.741979 0.024124 -0.012341 -0.058225 -0.005698 -0.017960 -0.015092 0.080352 0.003465 -0.011906 -0.006288 -0.038689 -0.026890 -0.012017 -0.023657 -0.000280 0.000206 -0.044366 0.586979 0.720746 -0.724317 -0.713313 -0.330391 -0.323032 0.848754 -0.844028 0.035010 0.022731 1.000000 0.970072 -0.955178 0.769481 0.043668 -0.722607 -0.708392 -0.312040 -0.300787
dst_host_same_srv_rate -0.211979 0.729151 0.055033 0.002576 -0.049560 -0.004078 0.018783 0.003004 0.114526 0.038980 0.000515 -0.005738 -0.038935 -0.021731 -0.009962 -0.021358 -0.000503 0.000229 -0.041749 0.539698 0.681955 -0.745745 -0.734334 -0.303126 -0.294328 0.873551 -0.868580 0.068648 -0.070448 0.970072 1.000000 -0.980245 0.771158 0.107926 -0.742045 -0.725272 -0.278068 -0.264383
dst_host_diff_srv_rate 0.231644 -0.712965 -0.035073 -0.001803 0.055542 0.005208 -0.017198 -0.002960 -0.093565 -0.039091 -0.000916 0.006687 0.047414 0.027092 0.010761 0.026703 -0.000181 -0.000004 0.044640 -0.546869 -0.673916 0.719708 0.707753 0.308722 0.300186 -0.844537 0.850911 -0.050472 0.044338 -0.955178 -0.980245 1.000000 -0.766402 -0.088665 0.719275 0.701149 0.287476 0.271067
dst_host_same_src_port_rate -0.065202 0.815039 -0.396195 0.004265 -0.015449 -0.001939 -0.086998 -0.006617 -0.359506 -0.078843 -0.004617 -0.005020 -0.015968 -0.015018 -0.003521 -0.033288 -0.000455 0.000283 -0.038092 0.776906 0.812280 -0.650336 -0.646256 -0.278465 -0.282239 0.732841 -0.727031 -0.222707 0.189876 0.769481 0.771158 -0.766402 1.000000 -0.175310 -0.658737 -0.652636 -0.299273 -0.297100
dst_host_srv_diff_host_rate 0.100692 -0.140231 0.578557 0.016171 0.007306 -0.000976 -0.014141 -0.002588 0.659078 -0.020979 0.008631 0.001052 0.061030 0.030590 0.015882 0.011765 0.000288 0.000538 -0.012578 -0.496554 -0.391712 -0.153568 -0.148072 0.073061 0.075178 0.179040 -0.176930 0.433173 -0.918894 0.043668 0.107926 -0.088665 -0.175310 1.000000 -0.118697 -0.103715 0.114971 0.120767
dst_host_serror_rate -0.056753 -0.645920 -0.167047 0.013566 0.010387 -0.001381 -0.004706 0.014713 -0.143283 -0.005019 -0.003498 0.001974 -0.008457 -0.002257 -0.001588 -0.011197 -0.000011 -0.000076 -0.001066 -0.331571 -0.449096 0.973947 0.967214 -0.094076 -0.096146 -0.830067 0.807205 -0.097973 0.123881 -0.722607 -0.742045 0.719275 -0.658737 -0.118697 1.000000 0.968015 -0.087531 -0.096899
dst_host_srv_serror_rate -0.057298 -0.641792 -0.158378 0.012265 -0.024117 -0.001370 -0.010721 0.014914 -0.132474 -0.004504 -0.003032 0.002893 -0.007096 -0.004295 -0.002357 -0.011487 -0.000372 -0.000007 -0.016885 -0.335290 -0.442823 0.965663 0.970617 -0.110646 -0.114341 -0.819335 0.795844 -0.092661 0.113845 -0.708392 -0.725272 0.701149 -0.652636 -0.103715 0.968015 1.000000 -0.111578 -0.110532
dst_host_rerror_rate -0.007759 -0.297338 -0.003042 0.000389 0.046656 -0.000786 0.199019 0.032395 0.007236 0.214115 0.002763 0.003173 -0.000421 0.000626 -0.000617 -0.004743 -0.000823 -0.000435 0.025282 -0.261194 -0.313442 -0.103198 -0.122630 0.910225 0.904591 -0.282487 0.299041 0.022585 -0.125142 -0.312040 -0.278068 0.287476 -0.299273 0.114971 -0.087531 -0.111578 1.000000 0.950964
dst_host_srv_rerror_rate -0.013891 -0.300581 0.001621 -0.001816 -0.013666 -0.000782 0.189142 0.032151 0.012979 0.217858 0.002151 0.001731 -0.005012 -0.001096 -0.002020 -0.004552 -0.001038 -0.000529 -0.004292 -0.256176 -0.308132 -0.105434 -0.124656 0.911622 0.914904 -0.282913 0.298904 0.024722 -0.125273 -0.300787 -0.264383 0.271067 -0.297100 0.120767 -0.096899 -0.110532 0.950964 1.000000
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ " duration src_bytes dst_bytes land \\\n", "duration 1.000000 0.014196 0.299189 -0.001068 \n", "src_bytes 0.014196 1.000000 -0.167931 -0.009404 \n", "dst_bytes 0.299189 -0.167931 1.000000 -0.003040 \n", "land -0.001068 -0.009404 -0.003040 1.000000 \n", "wrong_fragment -0.008025 -0.019358 -0.022659 -0.000333 \n", "urgent 0.017883 0.000094 0.007234 -0.000065 \n", "hot 0.108639 0.113920 0.193156 -0.000539 \n", "num_failed_logins 0.014363 -0.008396 0.021952 -0.000076 \n", "logged_in 0.159564 -0.089702 0.882185 -0.002785 \n", "num_compromised 0.010687 0.118562 0.169772 -0.000447 \n", "root_shell 0.040425 0.003067 0.026054 -0.000093 \n", "su_attempted 0.026015 0.002282 0.012192 -0.000049 \n", "num_root 0.013401 -0.002050 -0.003884 -0.000230 \n", "num_file_creations 0.061099 0.027710 0.034154 -0.000150 \n", "num_shells 0.008632 0.014403 -0.000054 -0.000076 \n", "num_access_files 0.019407 -0.001497 0.065776 -0.000211 \n", "num_outbound_cmds -0.000019 0.000010 -0.000031 -0.002881 \n", "is_hot_login -0.000010 0.000019 0.000041 0.002089 \n", "is_guest_login 0.205606 0.027511 0.085947 -0.000250 \n", "count -0.259032 0.666230 -0.639157 -0.010939 \n", "srv_count -0.250139 0.722609 -0.497683 -0.010128 \n", "serror_rate -0.074211 -0.657460 -0.205848 0.014160 \n", "srv_serror_rate -0.073663 -0.652391 -0.198715 0.014342 \n", "rerror_rate -0.025936 -0.342180 -0.100958 -0.000451 \n", "srv_rerror_rate -0.026420 -0.332977 -0.081307 -0.001690 \n", "same_srv_rate 0.062291 0.744046 0.229677 0.002153 \n", "diff_srv_rate -0.050875 -0.739988 -0.222572 -0.001846 \n", "srv_diff_host_rate 0.123621 -0.104042 0.521003 0.020678 \n", "dst_host_count -0.161107 0.130377 -0.611972 -0.019923 \n", "dst_host_srv_count -0.217167 0.741979 0.024124 -0.012341 \n", "dst_host_same_srv_rate -0.211979 0.729151 0.055033 0.002576 \n", "dst_host_diff_srv_rate 0.231644 -0.712965 -0.035073 -0.001803 \n", "dst_host_same_src_port_rate -0.065202 
0.815039 -0.396195 0.004265 \n", "dst_host_srv_diff_host_rate 0.100692 -0.140231 0.578557 0.016171 \n", "dst_host_serror_rate -0.056753 -0.645920 -0.167047 0.013566 \n", "dst_host_srv_serror_rate -0.057298 -0.641792 -0.158378 0.012265 \n", "dst_host_rerror_rate -0.007759 -0.297338 -0.003042 0.000389 \n", "dst_host_srv_rerror_rate -0.013891 -0.300581 0.001621 -0.001816 \n", "\n", " wrong_fragment urgent hot \\\n", "duration -0.008025 0.017883 0.108639 \n", "src_bytes -0.019358 0.000094 0.113920 \n", "dst_bytes -0.022659 0.007234 0.193156 \n", "land -0.000333 -0.000065 -0.000539 \n", "wrong_fragment 1.000000 -0.000150 -0.004042 \n", "urgent -0.000150 1.000000 0.008594 \n", "hot -0.004042 0.008594 1.000000 \n", "num_failed_logins -0.000568 0.063009 0.112560 \n", "logged_in -0.020911 0.006821 0.189126 \n", "num_compromised -0.003370 0.031765 0.811529 \n", "root_shell -0.000528 0.067437 0.101983 \n", "su_attempted -0.000248 0.000020 -0.000400 \n", "num_root -0.001727 0.061994 0.003096 \n", "num_file_creations -0.001160 0.061383 0.028694 \n", "num_shells -0.000507 -0.000066 0.009146 \n", "num_access_files -0.001519 0.023380 0.004224 \n", "num_outbound_cmds -0.000147 0.012879 -0.000393 \n", "is_hot_login 0.000441 0.005162 -0.000248 \n", "is_guest_login -0.001869 -0.000100 0.463706 \n", "count -0.057711 -0.004778 -0.120847 \n", "srv_count -0.029117 -0.004799 -0.114735 \n", "serror_rate -0.008849 -0.001338 -0.035487 \n", "srv_serror_rate -0.023382 -0.001327 -0.034934 \n", "rerror_rate 0.000430 -0.000705 0.013468 \n", "srv_rerror_rate -0.012676 -0.000726 0.052003 \n", "same_srv_rate 0.010218 0.001521 0.041342 \n", "diff_srv_rate -0.009386 -0.001522 -0.040555 \n", "srv_diff_host_rate 0.012117 -0.000788 0.032141 \n", "dst_host_count -0.029149 -0.005894 -0.074178 \n", "dst_host_srv_count -0.058225 -0.005698 -0.017960 \n", "dst_host_same_srv_rate -0.049560 -0.004078 0.018783 \n", "dst_host_diff_srv_rate 0.055542 0.005208 -0.017198 \n", "dst_host_same_src_port_rate -0.015449 
-0.001939 -0.086998 \n", "dst_host_srv_diff_host_rate 0.007306 -0.000976 -0.014141 \n", "dst_host_serror_rate 0.010387 -0.001381 -0.004706 \n", "dst_host_srv_serror_rate -0.024117 -0.001370 -0.010721 \n", "dst_host_rerror_rate 0.046656 -0.000786 0.199019 \n", "dst_host_srv_rerror_rate -0.013666 -0.000782 0.189142 \n", "\n", " num_failed_logins logged_in num_compromised \\\n", "duration 0.014363 0.159564 0.010687 \n", "src_bytes -0.008396 -0.089702 0.118562 \n", "dst_bytes 0.021952 0.882185 0.169772 \n", "land -0.000076 -0.002785 -0.000447 \n", "wrong_fragment -0.000568 -0.020911 -0.003370 \n", "urgent 0.063009 0.006821 0.031765 \n", "hot 0.112560 0.189126 0.811529 \n", "num_failed_logins 1.000000 -0.002190 0.004619 \n", "logged_in -0.002190 1.000000 0.161190 \n", "num_compromised 0.004619 0.161190 1.000000 \n", "root_shell 0.016895 0.025293 0.085558 \n", "su_attempted 0.072748 0.011813 0.048985 \n", "num_root 0.010060 0.082533 0.028557 \n", "num_file_creations 0.015211 0.055530 0.031223 \n", "num_shells -0.000093 0.024354 0.011256 \n", "num_access_files 0.005581 0.072698 0.006977 \n", "num_outbound_cmds 0.003431 0.000079 0.001048 \n", "is_hot_login -0.001560 0.000127 -0.000438 \n", "is_guest_login -0.000428 0.089318 -0.002504 \n", "count -0.018024 -0.578287 -0.097212 \n", "srv_count -0.018027 -0.438947 -0.091154 \n", "serror_rate -0.003674 -0.187114 -0.030516 \n", "srv_serror_rate -0.004027 -0.180122 -0.030264 \n", "rerror_rate 0.035324 -0.091962 0.008573 \n", "srv_rerror_rate 0.034876 -0.072287 0.054006 \n", "same_srv_rate 0.005716 0.216969 0.035253 \n", "diff_srv_rate -0.005538 -0.214019 -0.034953 \n", "srv_diff_host_rate -0.003096 0.503807 0.036497 \n", "dst_host_count -0.028369 -0.682721 -0.041615 \n", "dst_host_srv_count -0.015092 0.080352 0.003465 \n", "dst_host_same_srv_rate 0.003004 0.114526 0.038980 \n", "dst_host_diff_srv_rate -0.002960 -0.093565 -0.039091 \n", "dst_host_same_src_port_rate -0.006617 -0.359506 -0.078843 \n", "dst_host_srv_diff_host_rate 
-0.002588 0.659078 -0.020979 \n", "dst_host_serror_rate 0.014713 -0.143283 -0.005019 \n", "dst_host_srv_serror_rate 0.014914 -0.132474 -0.004504 \n", "dst_host_rerror_rate 0.032395 0.007236 0.214115 \n", "dst_host_srv_rerror_rate 0.032151 0.012979 0.217858 \n", "\n", " root_shell su_attempted num_root \\\n", "duration 0.040425 0.026015 0.013401 \n", "src_bytes 0.003067 0.002282 -0.002050 \n", "dst_bytes 0.026054 0.012192 -0.003884 \n", "land -0.000093 -0.000049 -0.000230 \n", "wrong_fragment -0.000528 -0.000248 -0.001727 \n", "urgent 0.067437 0.000020 0.061994 \n", "hot 0.101983 -0.000400 0.003096 \n", "num_failed_logins 0.016895 0.072748 0.010060 \n", "logged_in 0.025293 0.011813 0.082533 \n", "num_compromised 0.085558 0.048985 0.028557 \n", "root_shell 1.000000 0.233486 0.094512 \n", "su_attempted 0.233486 1.000000 0.119326 \n", "num_root 0.094512 0.119326 1.000000 \n", "num_file_creations 0.140650 0.053110 0.047521 \n", "num_shells 0.132056 0.040487 0.034405 \n", "num_access_files 0.069353 0.081272 0.014513 \n", "num_outbound_cmds 0.011462 -0.018896 0.001524 \n", "is_hot_login -0.006602 0.012927 -0.002585 \n", "is_guest_login -0.000405 -0.000219 -0.001281 \n", "count -0.016409 -0.008279 -0.054721 \n", "srv_count -0.015174 -0.008225 -0.053530 \n", "serror_rate -0.004952 -0.002318 -0.016031 \n", "srv_serror_rate -0.004923 -0.002295 -0.015936 \n", "rerror_rate -0.001104 -0.001227 -0.008610 \n", "srv_rerror_rate -0.001143 -0.001253 -0.008708 \n", "same_srv_rate 0.004946 0.002634 0.013881 \n", "diff_srv_rate -0.004553 -0.002649 -0.011337 \n", "srv_diff_host_rate 0.002286 0.000348 0.006316 \n", "dst_host_count -0.021367 -0.006697 -0.078717 \n", "dst_host_srv_count -0.011906 -0.006288 -0.038689 \n", "dst_host_same_srv_rate 0.000515 -0.005738 -0.038935 \n", "dst_host_diff_srv_rate -0.000916 0.006687 0.047414 \n", "dst_host_same_src_port_rate -0.004617 -0.005020 -0.015968 \n", "dst_host_srv_diff_host_rate 0.008631 0.001052 0.061030 \n", "dst_host_serror_rate -0.003498 
0.001974 -0.008457 \n", "dst_host_srv_serror_rate -0.003032 0.002893 -0.007096 \n", "dst_host_rerror_rate 0.002763 0.003173 -0.000421 \n", "dst_host_srv_rerror_rate 0.002151 0.001731 -0.005012 \n", "\n", " num_file_creations num_shells num_access_files \\\n", "duration 0.061099 0.008632 0.019407 \n", "src_bytes 0.027710 0.014403 -0.001497 \n", "dst_bytes 0.034154 -0.000054 0.065776 \n", "land -0.000150 -0.000076 -0.000211 \n", "wrong_fragment -0.001160 -0.000507 -0.001519 \n", "urgent 0.061383 -0.000066 0.023380 \n", "hot 0.028694 0.009146 0.004224 \n", "num_failed_logins 0.015211 -0.000093 0.005581 \n", "logged_in 0.055530 0.024354 0.072698 \n", "num_compromised 0.031223 0.011256 0.006977 \n", "root_shell 0.140650 0.132056 0.069353 \n", "su_attempted 0.053110 0.040487 0.081272 \n", "num_root 0.047521 0.034405 0.014513 \n", "num_file_creations 1.000000 0.068660 0.031042 \n", "num_shells 0.068660 1.000000 0.019438 \n", "num_access_files 0.031042 0.019438 1.000000 \n", "num_outbound_cmds -0.004081 -0.002592 -0.001597 \n", "is_hot_login -0.001664 -0.006631 -0.002850 \n", "is_guest_login 0.013242 -0.000405 0.002466 \n", "count -0.036467 -0.013938 -0.045282 \n", "srv_count -0.034598 -0.011784 -0.040497 \n", "serror_rate -0.009703 -0.004343 -0.013945 \n", "srv_serror_rate -0.010390 -0.004740 -0.013572 \n", "rerror_rate -0.005069 -0.002541 -0.007581 \n", "srv_rerror_rate -0.004775 -0.002572 0.001874 \n", "same_srv_rate 0.009784 0.004282 0.015499 \n", "diff_srv_rate -0.008711 -0.003743 -0.015112 \n", "srv_diff_host_rate 0.014412 0.001096 0.024266 \n", "dst_host_count -0.049529 -0.021200 -0.023865 \n", "dst_host_srv_count -0.026890 -0.012017 -0.023657 \n", "dst_host_same_srv_rate -0.021731 -0.009962 -0.021358 \n", "dst_host_diff_srv_rate 0.027092 0.010761 0.026703 \n", "dst_host_same_src_port_rate -0.015018 -0.003521 -0.033288 \n", "dst_host_srv_diff_host_rate 0.030590 0.015882 0.011765 \n", "dst_host_serror_rate -0.002257 -0.001588 -0.011197 \n", "dst_host_srv_serror_rate 
-0.004295 -0.002357 -0.011487 \n", "dst_host_rerror_rate 0.000626 -0.000617 -0.004743 \n", "dst_host_srv_rerror_rate -0.001096 -0.002020 -0.004552 \n", "\n", " num_outbound_cmds is_hot_login is_guest_login \\\n", "duration -0.000019 -0.000010 0.205606 \n", "src_bytes 0.000010 0.000019 0.027511 \n", "dst_bytes -0.000031 0.000041 0.085947 \n", "land -0.002881 0.002089 -0.000250 \n", "wrong_fragment -0.000147 0.000441 -0.001869 \n", "urgent 0.012879 0.005162 -0.000100 \n", "hot -0.000393 -0.000248 0.463706 \n", "num_failed_logins 0.003431 -0.001560 -0.000428 \n", "logged_in 0.000079 0.000127 0.089318 \n", "num_compromised 0.001048 -0.000438 -0.002504 \n", "root_shell 0.011462 -0.006602 -0.000405 \n", "su_attempted -0.018896 0.012927 -0.000219 \n", "num_root 0.001524 -0.002585 -0.001281 \n", "num_file_creations -0.004081 -0.001664 0.013242 \n", "num_shells -0.002592 -0.006631 -0.000405 \n", "num_access_files -0.001597 -0.002850 0.002466 \n", "num_outbound_cmds 1.000000 0.822890 0.000924 \n", "is_hot_login 0.822890 1.000000 0.001512 \n", "is_guest_login 0.000924 0.001512 1.000000 \n", "count -0.000076 0.000036 -0.062340 \n", "srv_count 0.000100 0.000064 -0.062713 \n", "serror_rate 0.000167 0.000102 -0.017343 \n", "srv_serror_rate 0.000209 -0.000302 -0.017240 \n", "rerror_rate 0.000536 -0.000550 -0.008867 \n", "srv_rerror_rate 0.000346 0.000457 -0.009193 \n", "same_srv_rate 0.000208 -0.000159 0.018042 \n", "diff_srv_rate 0.000328 -0.000235 -0.017000 \n", "srv_diff_host_rate -0.000141 -0.000360 -0.008878 \n", "dst_host_count -0.000424 -0.000106 -0.055453 \n", "dst_host_srv_count -0.000280 0.000206 -0.044366 \n", "dst_host_same_srv_rate -0.000503 0.000229 -0.041749 \n", "dst_host_diff_srv_rate -0.000181 -0.000004 0.044640 \n", "dst_host_same_src_port_rate -0.000455 0.000283 -0.038092 \n", "dst_host_srv_diff_host_rate 0.000288 0.000538 -0.012578 \n", "dst_host_serror_rate -0.000011 -0.000076 -0.001066 \n", "dst_host_srv_serror_rate -0.000372 -0.000007 -0.016885 \n", 
"dst_host_rerror_rate -0.000823 -0.000435 0.025282 \n", "dst_host_srv_rerror_rate -0.001038 -0.000529 -0.004292 \n", "\n", " count srv_count serror_rate \\\n", "duration -0.259032 -0.250139 -0.074211 \n", "src_bytes 0.666230 0.722609 -0.657460 \n", "dst_bytes -0.639157 -0.497683 -0.205848 \n", "land -0.010939 -0.010128 0.014160 \n", "wrong_fragment -0.057711 -0.029117 -0.008849 \n", "urgent -0.004778 -0.004799 -0.001338 \n", "hot -0.120847 -0.114735 -0.035487 \n", "num_failed_logins -0.018024 -0.018027 -0.003674 \n", "logged_in -0.578287 -0.438947 -0.187114 \n", "num_compromised -0.097212 -0.091154 -0.030516 \n", "root_shell -0.016409 -0.015174 -0.004952 \n", "su_attempted -0.008279 -0.008225 -0.002318 \n", "num_root -0.054721 -0.053530 -0.016031 \n", "num_file_creations -0.036467 -0.034598 -0.009703 \n", "num_shells -0.013938 -0.011784 -0.004343 \n", "num_access_files -0.045282 -0.040497 -0.013945 \n", "num_outbound_cmds -0.000076 0.000100 0.000167 \n", "is_hot_login 0.000036 0.000064 0.000102 \n", "is_guest_login -0.062340 -0.062713 -0.017343 \n", "count 1.000000 0.950587 -0.303538 \n", "srv_count 0.950587 1.000000 -0.428185 \n", "serror_rate -0.303538 -0.428185 1.000000 \n", "srv_serror_rate -0.308923 -0.421424 0.990888 \n", "rerror_rate -0.213824 -0.281468 -0.091157 \n", "srv_rerror_rate -0.221352 -0.284034 -0.095285 \n", "same_srv_rate 0.346718 0.517227 -0.851915 \n", "diff_srv_rate -0.361737 -0.511998 0.828012 \n", "srv_diff_host_rate -0.384010 -0.239057 -0.121489 \n", "dst_host_count 0.547443 0.442611 0.165350 \n", "dst_host_srv_count 0.586979 0.720746 -0.724317 \n", "dst_host_same_srv_rate 0.539698 0.681955 -0.745745 \n", "dst_host_diff_srv_rate -0.546869 -0.673916 0.719708 \n", "dst_host_same_src_port_rate 0.776906 0.812280 -0.650336 \n", "dst_host_srv_diff_host_rate -0.496554 -0.391712 -0.153568 \n", "dst_host_serror_rate -0.331571 -0.449096 0.973947 \n", "dst_host_srv_serror_rate -0.335290 -0.442823 0.965663 \n", "dst_host_rerror_rate -0.261194 -0.313442 
-0.103198 \n", "dst_host_srv_rerror_rate -0.256176 -0.308132 -0.105434 \n", "\n", " srv_serror_rate rerror_rate srv_rerror_rate \\\n", "duration -0.073663 -0.025936 -0.026420 \n", "src_bytes -0.652391 -0.342180 -0.332977 \n", "dst_bytes -0.198715 -0.100958 -0.081307 \n", "land 0.014342 -0.000451 -0.001690 \n", "wrong_fragment -0.023382 0.000430 -0.012676 \n", "urgent -0.001327 -0.000705 -0.000726 \n", "hot -0.034934 0.013468 0.052003 \n", "num_failed_logins -0.004027 0.035324 0.034876 \n", "logged_in -0.180122 -0.091962 -0.072287 \n", "num_compromised -0.030264 0.008573 0.054006 \n", "root_shell -0.004923 -0.001104 -0.001143 \n", "su_attempted -0.002295 -0.001227 -0.001253 \n", "num_root -0.015936 -0.008610 -0.008708 \n", "num_file_creations -0.010390 -0.005069 -0.004775 \n", "num_shells -0.004740 -0.002541 -0.002572 \n", "num_access_files -0.013572 -0.007581 0.001874 \n", "num_outbound_cmds 0.000209 0.000536 0.000346 \n", "is_hot_login -0.000302 -0.000550 0.000457 \n", "is_guest_login -0.017240 -0.008867 -0.009193 \n", "count -0.308923 -0.213824 -0.221352 \n", "srv_count -0.421424 -0.281468 -0.284034 \n", "serror_rate 0.990888 -0.091157 -0.095285 \n", "srv_serror_rate 1.000000 -0.110664 -0.115286 \n", "rerror_rate -0.110664 1.000000 0.978813 \n", "srv_rerror_rate -0.115286 0.978813 1.000000 \n", "same_srv_rate -0.839315 -0.327986 -0.316568 \n", "diff_srv_rate 0.815305 0.345571 0.333439 \n", "srv_diff_host_rate -0.112222 -0.017902 0.011285 \n", "dst_host_count 0.160322 -0.067857 -0.072595 \n", "dst_host_srv_count -0.713313 -0.330391 -0.323032 \n", "dst_host_same_srv_rate -0.734334 -0.303126 -0.294328 \n", "dst_host_diff_srv_rate 0.707753 0.308722 0.300186 \n", "dst_host_same_src_port_rate -0.646256 -0.278465 -0.282239 \n", "dst_host_srv_diff_host_rate -0.148072 0.073061 0.075178 \n", "dst_host_serror_rate 0.967214 -0.094076 -0.096146 \n", "dst_host_srv_serror_rate 0.970617 -0.110646 -0.114341 \n", "dst_host_rerror_rate -0.122630 0.910225 0.904591 \n", 
"dst_host_srv_rerror_rate -0.124656 0.911622 0.914904 \n", "\n", " same_srv_rate diff_srv_rate srv_diff_host_rate \\\n", "duration 0.062291 -0.050875 0.123621 \n", "src_bytes 0.744046 -0.739988 -0.104042 \n", "dst_bytes 0.229677 -0.222572 0.521003 \n", "land 0.002153 -0.001846 0.020678 \n", "wrong_fragment 0.010218 -0.009386 0.012117 \n", "urgent 0.001521 -0.001522 -0.000788 \n", "hot 0.041342 -0.040555 0.032141 \n", "num_failed_logins 0.005716 -0.005538 -0.003096 \n", "logged_in 0.216969 -0.214019 0.503807 \n", "num_compromised 0.035253 -0.034953 0.036497 \n", "root_shell 0.004946 -0.004553 0.002286 \n", "su_attempted 0.002634 -0.002649 0.000348 \n", "num_root 0.013881 -0.011337 0.006316 \n", "num_file_creations 0.009784 -0.008711 0.014412 \n", "num_shells 0.004282 -0.003743 0.001096 \n", "num_access_files 0.015499 -0.015112 0.024266 \n", "num_outbound_cmds 0.000208 0.000328 -0.000141 \n", "is_hot_login -0.000159 -0.000235 -0.000360 \n", "is_guest_login 0.018042 -0.017000 -0.008878 \n", "count 0.346718 -0.361737 -0.384010 \n", "srv_count 0.517227 -0.511998 -0.239057 \n", "serror_rate -0.851915 0.828012 -0.121489 \n", "srv_serror_rate -0.839315 0.815305 -0.112222 \n", "rerror_rate -0.327986 0.345571 -0.017902 \n", "srv_rerror_rate -0.316568 0.333439 0.011285 \n", "same_srv_rate 1.000000 -0.982109 0.140660 \n", "diff_srv_rate -0.982109 1.000000 -0.138293 \n", "srv_diff_host_rate 0.140660 -0.138293 1.000000 \n", "dst_host_count -0.190121 0.185942 -0.445051 \n", "dst_host_srv_count 0.848754 -0.844028 0.035010 \n", "dst_host_same_srv_rate 0.873551 -0.868580 0.068648 \n", "dst_host_diff_srv_rate -0.844537 0.850911 -0.050472 \n", "dst_host_same_src_port_rate 0.732841 -0.727031 -0.222707 \n", "dst_host_srv_diff_host_rate 0.179040 -0.176930 0.433173 \n", "dst_host_serror_rate -0.830067 0.807205 -0.097973 \n", "dst_host_srv_serror_rate -0.819335 0.795844 -0.092661 \n", "dst_host_rerror_rate -0.282487 0.299041 0.022585 \n", "dst_host_srv_rerror_rate -0.282913 0.298904 
0.024722 \n", "\n", " dst_host_count dst_host_srv_count \\\n", "duration -0.161107 -0.217167 \n", "src_bytes 0.130377 0.741979 \n", "dst_bytes -0.611972 0.024124 \n", "land -0.019923 -0.012341 \n", "wrong_fragment -0.029149 -0.058225 \n", "urgent -0.005894 -0.005698 \n", "hot -0.074178 -0.017960 \n", "num_failed_logins -0.028369 -0.015092 \n", "logged_in -0.682721 0.080352 \n", "num_compromised -0.041615 0.003465 \n", "root_shell -0.021367 -0.011906 \n", "su_attempted -0.006697 -0.006288 \n", "num_root -0.078717 -0.038689 \n", "num_file_creations -0.049529 -0.026890 \n", "num_shells -0.021200 -0.012017 \n", "num_access_files -0.023865 -0.023657 \n", "num_outbound_cmds -0.000424 -0.000280 \n", "is_hot_login -0.000106 0.000206 \n", "is_guest_login -0.055453 -0.044366 \n", "count 0.547443 0.586979 \n", "srv_count 0.442611 0.720746 \n", "serror_rate 0.165350 -0.724317 \n", "srv_serror_rate 0.160322 -0.713313 \n", "rerror_rate -0.067857 -0.330391 \n", "srv_rerror_rate -0.072595 -0.323032 \n", "same_srv_rate -0.190121 0.848754 \n", "diff_srv_rate 0.185942 -0.844028 \n", "srv_diff_host_rate -0.445051 0.035010 \n", "dst_host_count 1.000000 0.022731 \n", "dst_host_srv_count 0.022731 1.000000 \n", "dst_host_same_srv_rate -0.070448 0.970072 \n", "dst_host_diff_srv_rate 0.044338 -0.955178 \n", "dst_host_same_src_port_rate 0.189876 0.769481 \n", "dst_host_srv_diff_host_rate -0.918894 0.043668 \n", "dst_host_serror_rate 0.123881 -0.722607 \n", "dst_host_srv_serror_rate 0.113845 -0.708392 \n", "dst_host_rerror_rate -0.125142 -0.312040 \n", "dst_host_srv_rerror_rate -0.125273 -0.300787 \n", "\n", " dst_host_same_srv_rate dst_host_diff_srv_rate \\\n", "duration -0.211979 0.231644 \n", "src_bytes 0.729151 -0.712965 \n", "dst_bytes 0.055033 -0.035073 \n", "land 0.002576 -0.001803 \n", "wrong_fragment -0.049560 0.055542 \n", "urgent -0.004078 0.005208 \n", "hot 0.018783 -0.017198 \n", "num_failed_logins 0.003004 -0.002960 \n", "logged_in 0.114526 -0.093565 \n", "num_compromised 
0.038980 -0.039091 \n", "root_shell 0.000515 -0.000916 \n", "su_attempted -0.005738 0.006687 \n", "num_root -0.038935 0.047414 \n", "num_file_creations -0.021731 0.027092 \n", "num_shells -0.009962 0.010761 \n", "num_access_files -0.021358 0.026703 \n", "num_outbound_cmds -0.000503 -0.000181 \n", "is_hot_login 0.000229 -0.000004 \n", "is_guest_login -0.041749 0.044640 \n", "count 0.539698 -0.546869 \n", "srv_count 0.681955 -0.673916 \n", "serror_rate -0.745745 0.719708 \n", "srv_serror_rate -0.734334 0.707753 \n", "rerror_rate -0.303126 0.308722 \n", "srv_rerror_rate -0.294328 0.300186 \n", "same_srv_rate 0.873551 -0.844537 \n", "diff_srv_rate -0.868580 0.850911 \n", "srv_diff_host_rate 0.068648 -0.050472 \n", "dst_host_count -0.070448 0.044338 \n", "dst_host_srv_count 0.970072 -0.955178 \n", "dst_host_same_srv_rate 1.000000 -0.980245 \n", "dst_host_diff_srv_rate -0.980245 1.000000 \n", "dst_host_same_src_port_rate 0.771158 -0.766402 \n", "dst_host_srv_diff_host_rate 0.107926 -0.088665 \n", "dst_host_serror_rate -0.742045 0.719275 \n", "dst_host_srv_serror_rate -0.725272 0.701149 \n", "dst_host_rerror_rate -0.278068 0.287476 \n", "dst_host_srv_rerror_rate -0.264383 0.271067 \n", "\n", " dst_host_same_src_port_rate \\\n", "duration -0.065202 \n", "src_bytes 0.815039 \n", "dst_bytes -0.396195 \n", "land 0.004265 \n", "wrong_fragment -0.015449 \n", "urgent -0.001939 \n", "hot -0.086998 \n", "num_failed_logins -0.006617 \n", "logged_in -0.359506 \n", "num_compromised -0.078843 \n", "root_shell -0.004617 \n", "su_attempted -0.005020 \n", "num_root -0.015968 \n", "num_file_creations -0.015018 \n", "num_shells -0.003521 \n", "num_access_files -0.033288 \n", "num_outbound_cmds -0.000455 \n", "is_hot_login 0.000283 \n", "is_guest_login -0.038092 \n", "count 0.776906 \n", "srv_count 0.812280 \n", "serror_rate -0.650336 \n", "srv_serror_rate -0.646256 \n", "rerror_rate -0.278465 \n", "srv_rerror_rate -0.282239 \n", "same_srv_rate 0.732841 \n", "diff_srv_rate -0.727031 \n", 
"srv_diff_host_rate -0.222707 \n", "dst_host_count 0.189876 \n", "dst_host_srv_count 0.769481 \n", "dst_host_same_srv_rate 0.771158 \n", "dst_host_diff_srv_rate -0.766402 \n", "dst_host_same_src_port_rate 1.000000 \n", "dst_host_srv_diff_host_rate -0.175310 \n", "dst_host_serror_rate -0.658737 \n", "dst_host_srv_serror_rate -0.652636 \n", "dst_host_rerror_rate -0.299273 \n", "dst_host_srv_rerror_rate -0.297100 \n", "\n", " dst_host_srv_diff_host_rate \\\n", "duration 0.100692 \n", "src_bytes -0.140231 \n", "dst_bytes 0.578557 \n", "land 0.016171 \n", "wrong_fragment 0.007306 \n", "urgent -0.000976 \n", "hot -0.014141 \n", "num_failed_logins -0.002588 \n", "logged_in 0.659078 \n", "num_compromised -0.020979 \n", "root_shell 0.008631 \n", "su_attempted 0.001052 \n", "num_root 0.061030 \n", "num_file_creations 0.030590 \n", "num_shells 0.015882 \n", "num_access_files 0.011765 \n", "num_outbound_cmds 0.000288 \n", "is_hot_login 0.000538 \n", "is_guest_login -0.012578 \n", "count -0.496554 \n", "srv_count -0.391712 \n", "serror_rate -0.153568 \n", "srv_serror_rate -0.148072 \n", "rerror_rate 0.073061 \n", "srv_rerror_rate 0.075178 \n", "same_srv_rate 0.179040 \n", "diff_srv_rate -0.176930 \n", "srv_diff_host_rate 0.433173 \n", "dst_host_count -0.918894 \n", "dst_host_srv_count 0.043668 \n", "dst_host_same_srv_rate 0.107926 \n", "dst_host_diff_srv_rate -0.088665 \n", "dst_host_same_src_port_rate -0.175310 \n", "dst_host_srv_diff_host_rate 1.000000 \n", "dst_host_serror_rate -0.118697 \n", "dst_host_srv_serror_rate -0.103715 \n", "dst_host_rerror_rate 0.114971 \n", "dst_host_srv_rerror_rate 0.120767 \n", "\n", " dst_host_serror_rate dst_host_srv_serror_rate \\\n", "duration -0.056753 -0.057298 \n", "src_bytes -0.645920 -0.641792 \n", "dst_bytes -0.167047 -0.158378 \n", "land 0.013566 0.012265 \n", "wrong_fragment 0.010387 -0.024117 \n", "urgent -0.001381 -0.001370 \n", "hot -0.004706 -0.010721 \n", "num_failed_logins 0.014713 0.014914 \n", "logged_in -0.143283 -0.132474 
\n", "num_compromised -0.005019 -0.004504 \n", "root_shell -0.003498 -0.003032 \n", "su_attempted 0.001974 0.002893 \n", "num_root -0.008457 -0.007096 \n", "num_file_creations -0.002257 -0.004295 \n", "num_shells -0.001588 -0.002357 \n", "num_access_files -0.011197 -0.011487 \n", "num_outbound_cmds -0.000011 -0.000372 \n", "is_hot_login -0.000076 -0.000007 \n", "is_guest_login -0.001066 -0.016885 \n", "count -0.331571 -0.335290 \n", "srv_count -0.449096 -0.442823 \n", "serror_rate 0.973947 0.965663 \n", "srv_serror_rate 0.967214 0.970617 \n", "rerror_rate -0.094076 -0.110646 \n", "srv_rerror_rate -0.096146 -0.114341 \n", "same_srv_rate -0.830067 -0.819335 \n", "diff_srv_rate 0.807205 0.795844 \n", "srv_diff_host_rate -0.097973 -0.092661 \n", "dst_host_count 0.123881 0.113845 \n", "dst_host_srv_count -0.722607 -0.708392 \n", "dst_host_same_srv_rate -0.742045 -0.725272 \n", "dst_host_diff_srv_rate 0.719275 0.701149 \n", "dst_host_same_src_port_rate -0.658737 -0.652636 \n", "dst_host_srv_diff_host_rate -0.118697 -0.103715 \n", "dst_host_serror_rate 1.000000 0.968015 \n", "dst_host_srv_serror_rate 0.968015 1.000000 \n", "dst_host_rerror_rate -0.087531 -0.111578 \n", "dst_host_srv_rerror_rate -0.096899 -0.110532 \n", "\n", " dst_host_rerror_rate dst_host_srv_rerror_rate \n", "duration -0.007759 -0.013891 \n", "src_bytes -0.297338 -0.300581 \n", "dst_bytes -0.003042 0.001621 \n", "land 0.000389 -0.001816 \n", "wrong_fragment 0.046656 -0.013666 \n", "urgent -0.000786 -0.000782 \n", "hot 0.199019 0.189142 \n", "num_failed_logins 0.032395 0.032151 \n", "logged_in 0.007236 0.012979 \n", "num_compromised 0.214115 0.217858 \n", "root_shell 0.002763 0.002151 \n", "su_attempted 0.003173 0.001731 \n", "num_root -0.000421 -0.005012 \n", "num_file_creations 0.000626 -0.001096 \n", "num_shells -0.000617 -0.002020 \n", "num_access_files -0.004743 -0.004552 \n", "num_outbound_cmds -0.000823 -0.001038 \n", "is_hot_login -0.000435 -0.000529 \n", "is_guest_login 0.025282 -0.004292 \n", 
"count -0.261194 -0.256176 \n", "srv_count -0.313442 -0.308132 \n", "serror_rate -0.103198 -0.105434 \n", "srv_serror_rate -0.122630 -0.124656 \n", "rerror_rate 0.910225 0.911622 \n", "srv_rerror_rate 0.904591 0.914904 \n", "same_srv_rate -0.282487 -0.282913 \n", "diff_srv_rate 0.299041 0.298904 \n", "srv_diff_host_rate 0.022585 0.024722 \n", "dst_host_count -0.125142 -0.125273 \n", "dst_host_srv_count -0.312040 -0.300787 \n", "dst_host_same_srv_rate -0.278068 -0.264383 \n", "dst_host_diff_srv_rate 0.287476 0.271067 \n", "dst_host_same_src_port_rate -0.299273 -0.297100 \n", "dst_host_srv_diff_host_rate 0.114971 0.120767 \n", "dst_host_serror_rate -0.087531 -0.096899 \n", "dst_host_srv_serror_rate -0.111578 -0.110532 \n", "dst_host_rerror_rate 1.000000 0.950964 \n", "dst_host_srv_rerror_rate 0.950964 1.000000 " ] } ], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have used a *Pandas* `DataFrame` here to render the correlation matrix in a more comprehensive way. Now we want those variables that are highly correlated. For that we do a bit of dataframe manipulation. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "# get a boolean dataframe where true means that a pair of variables is highly correlated\n", "highly_correlated_df = (abs(corr_df) > .8) & (corr_df < 1.0)\n", "# get the names of the variables so we can use them to slice the dataframe\n", "correlated_vars_index = (highly_correlated_df==True).any()\n", "correlated_var_names = correlated_vars_index[correlated_vars_index==True].index\n", "# slice it\n", "highly_correlated_df.loc[correlated_var_names,correlated_var_names]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
 src_bytes dst_bytes hot logged_in num_compromised num_outbound_cmds is_hot_login count srv_count serror_rate srv_serror_rate rerror_rate srv_rerror_rate same_srv_rate diff_srv_rate dst_host_count dst_host_srv_count dst_host_same_srv_rate dst_host_diff_srv_rate dst_host_same_src_port_rate dst_host_srv_diff_host_rate dst_host_serror_rate dst_host_srv_serror_rate dst_host_rerror_rate dst_host_srv_rerror_rate
src_bytes False False False False False False False False False False False False False False False False False False False True False False False False False
dst_bytes False False False True False False False False False False False False False False False False False False False False False False False False False
hot False False False False True False False False False False False False False False False False False False False False False False False False False
logged_in False True False False False False False False False False False False False False False False False False False False False False False False False
num_compromised False False True False False False False False False False False False False False False False False False False False False False False False False
num_outbound_cmds False False False False False False True False False False False False False False False False False False False False False False False False False
is_hot_login False False False False False True False False False False False False False False False False False False False False False False False False False
count False False False False False False False False True False False False False False False False False False False False False False False False False
srv_count False False False False False False False True False False False False False False False False False False False True False False False False False
serror_rate False False False False False False False False False False True False False True True False False False False False False True True False False
srv_serror_rate False False False False False False False False False True False False False True True False False False False False False True True False False
rerror_rate False False False False False False False False False False False False True False False False False False False False False False False True True
srv_rerror_rate False False False False False False False False False False False True False False False False False False False False False False False True True
same_srv_rate False False False False False False False False False True True False False False True False True True True False False True True False False
diff_srv_rate False False False False False False False False False True True False False True False False True True True False False True False False False
dst_host_count False False False False False False False False False False False False False False False False False False False False True False False False False
dst_host_srv_count False False False False False False False False False False False False False True True False False True True False False False False False False
dst_host_same_srv_rate False False False False False False False False False False False False False True True False True False True False False False False False False
dst_host_diff_srv_rate False False False False False False False False False False False False False True True False True True False False False False False False False
dst_host_same_src_port_rate True False False False False False False False True False False False False False False False False False False False False False False False False
dst_host_srv_diff_host_rate False False False False False False False False False False False False False False False True False False False False False False False False False
dst_host_serror_rate False False False False False False False False False True True False False True True False False False False False False False True False False
dst_host_srv_serror_rate False False False False False False False False False True True False False True False False False False False False False True False False False
dst_host_rerror_rate False False False False False False False False False False False True True False False False False False False False False False False False True
dst_host_srv_rerror_rate False False False False False False False False False False False True True False False False False False False False False False False True False
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 22, "text": [ " src_bytes dst_bytes hot logged_in \\\n", "src_bytes False False False False \n", "dst_bytes False False False True \n", "hot False False False False \n", "logged_in False True False False \n", "num_compromised False False True False \n", "num_outbound_cmds False False False False \n", "is_hot_login False False False False \n", "count False False False False \n", "srv_count False False False False \n", "serror_rate False False False False \n", "srv_serror_rate False False False False \n", "rerror_rate False False False False \n", "srv_rerror_rate False False False False \n", "same_srv_rate False False False False \n", "diff_srv_rate False False False False \n", "dst_host_count False False False False \n", "dst_host_srv_count False False False False \n", "dst_host_same_srv_rate False False False False \n", "dst_host_diff_srv_rate False False False False \n", "dst_host_same_src_port_rate True False False False \n", "dst_host_srv_diff_host_rate False False False False \n", "dst_host_serror_rate False False False False \n", "dst_host_srv_serror_rate False False False False \n", "dst_host_rerror_rate False False False False \n", "dst_host_srv_rerror_rate False False False False \n", "\n", " num_compromised num_outbound_cmds is_hot_login \\\n", "src_bytes False False False \n", "dst_bytes False False False \n", "hot True False False \n", "logged_in False False False \n", "num_compromised False False False \n", "num_outbound_cmds False False True \n", "is_hot_login False True False \n", "count False False False \n", "srv_count False False False \n", "serror_rate False False False \n", "srv_serror_rate False False False \n", "rerror_rate False False False \n", "srv_rerror_rate False False False \n", "same_srv_rate False False False \n", "diff_srv_rate False False False \n", "dst_host_count False False False \n", "dst_host_srv_count False False False \n", "dst_host_same_srv_rate False False False 
\n", "dst_host_diff_srv_rate False False False \n", "dst_host_same_src_port_rate False False False \n", "dst_host_srv_diff_host_rate False False False \n", "dst_host_serror_rate False False False \n", "dst_host_srv_serror_rate False False False \n", "dst_host_rerror_rate False False False \n", "dst_host_srv_rerror_rate False False False \n", "\n", " count srv_count serror_rate srv_serror_rate \\\n", "src_bytes False False False False \n", "dst_bytes False False False False \n", "hot False False False False \n", "logged_in False False False False \n", "num_compromised False False False False \n", "num_outbound_cmds False False False False \n", "is_hot_login False False False False \n", "count False True False False \n", "srv_count True False False False \n", "serror_rate False False False True \n", "srv_serror_rate False False True False \n", "rerror_rate False False False False \n", "srv_rerror_rate False False False False \n", "same_srv_rate False False True True \n", "diff_srv_rate False False True True \n", "dst_host_count False False False False \n", "dst_host_srv_count False False False False \n", "dst_host_same_srv_rate False False False False \n", "dst_host_diff_srv_rate False False False False \n", "dst_host_same_src_port_rate False True False False \n", "dst_host_srv_diff_host_rate False False False False \n", "dst_host_serror_rate False False True True \n", "dst_host_srv_serror_rate False False True True \n", "dst_host_rerror_rate False False False False \n", "dst_host_srv_rerror_rate False False False False \n", "\n", " rerror_rate srv_rerror_rate same_srv_rate \\\n", "src_bytes False False False \n", "dst_bytes False False False \n", "hot False False False \n", "logged_in False False False \n", "num_compromised False False False \n", "num_outbound_cmds False False False \n", "is_hot_login False False False \n", "count False False False \n", "srv_count False False False \n", "serror_rate False False True \n", "srv_serror_rate False False True \n", 
"rerror_rate False True False \n", "srv_rerror_rate True False False \n", "same_srv_rate False False False \n", "diff_srv_rate False False True \n", "dst_host_count False False False \n", "dst_host_srv_count False False True \n", "dst_host_same_srv_rate False False True \n", "dst_host_diff_srv_rate False False True \n", "dst_host_same_src_port_rate False False False \n", "dst_host_srv_diff_host_rate False False False \n", "dst_host_serror_rate False False True \n", "dst_host_srv_serror_rate False False True \n", "dst_host_rerror_rate True True False \n", "dst_host_srv_rerror_rate True True False \n", "\n", " diff_srv_rate dst_host_count dst_host_srv_count \\\n", "src_bytes False False False \n", "dst_bytes False False False \n", "hot False False False \n", "logged_in False False False \n", "num_compromised False False False \n", "num_outbound_cmds False False False \n", "is_hot_login False False False \n", "count False False False \n", "srv_count False False False \n", "serror_rate True False False \n", "srv_serror_rate True False False \n", "rerror_rate False False False \n", "srv_rerror_rate False False False \n", "same_srv_rate True False True \n", "diff_srv_rate False False True \n", "dst_host_count False False False \n", "dst_host_srv_count True False False \n", "dst_host_same_srv_rate True False True \n", "dst_host_diff_srv_rate True False True \n", "dst_host_same_src_port_rate False False False \n", "dst_host_srv_diff_host_rate False True False \n", "dst_host_serror_rate True False False \n", "dst_host_srv_serror_rate False False False \n", "dst_host_rerror_rate False False False \n", "dst_host_srv_rerror_rate False False False \n", "\n", " dst_host_same_srv_rate dst_host_diff_srv_rate \\\n", "src_bytes False False \n", "dst_bytes False False \n", "hot False False \n", "logged_in False False \n", "num_compromised False False \n", "num_outbound_cmds False False \n", "is_hot_login False False \n", "count False False \n", "srv_count False False \n", 
"serror_rate False False \n", "srv_serror_rate False False \n", "rerror_rate False False \n", "srv_rerror_rate False False \n", "same_srv_rate True True \n", "diff_srv_rate True True \n", "dst_host_count False False \n", "dst_host_srv_count True True \n", "dst_host_same_srv_rate False True \n", "dst_host_diff_srv_rate True False \n", "dst_host_same_src_port_rate False False \n", "dst_host_srv_diff_host_rate False False \n", "dst_host_serror_rate False False \n", "dst_host_srv_serror_rate False False \n", "dst_host_rerror_rate False False \n", "dst_host_srv_rerror_rate False False \n", "\n", " dst_host_same_src_port_rate \\\n", "src_bytes True \n", "dst_bytes False \n", "hot False \n", "logged_in False \n", "num_compromised False \n", "num_outbound_cmds False \n", "is_hot_login False \n", "count False \n", "srv_count True \n", "serror_rate False \n", "srv_serror_rate False \n", "rerror_rate False \n", "srv_rerror_rate False \n", "same_srv_rate False \n", "diff_srv_rate False \n", "dst_host_count False \n", "dst_host_srv_count False \n", "dst_host_same_srv_rate False \n", "dst_host_diff_srv_rate False \n", "dst_host_same_src_port_rate False \n", "dst_host_srv_diff_host_rate False \n", "dst_host_serror_rate False \n", "dst_host_srv_serror_rate False \n", "dst_host_rerror_rate False \n", "dst_host_srv_rerror_rate False \n", "\n", " dst_host_srv_diff_host_rate dst_host_serror_rate \\\n", "src_bytes False False \n", "dst_bytes False False \n", "hot False False \n", "logged_in False False \n", "num_compromised False False \n", "num_outbound_cmds False False \n", "is_hot_login False False \n", "count False False \n", "srv_count False False \n", "serror_rate False True \n", "srv_serror_rate False True \n", "rerror_rate False False \n", "srv_rerror_rate False False \n", "same_srv_rate False True \n", "diff_srv_rate False True \n", "dst_host_count True False \n", "dst_host_srv_count False False \n", "dst_host_same_srv_rate False False \n", "dst_host_diff_srv_rate False False 
\n", "dst_host_same_src_port_rate False False \n", "dst_host_srv_diff_host_rate False False \n", "dst_host_serror_rate False False \n", "dst_host_srv_serror_rate False True \n", "dst_host_rerror_rate False False \n", "dst_host_srv_rerror_rate False False \n", "\n", " dst_host_srv_serror_rate dst_host_rerror_rate \\\n", "src_bytes False False \n", "dst_bytes False False \n", "hot False False \n", "logged_in False False \n", "num_compromised False False \n", "num_outbound_cmds False False \n", "is_hot_login False False \n", "count False False \n", "srv_count False False \n", "serror_rate True False \n", "srv_serror_rate True False \n", "rerror_rate False True \n", "srv_rerror_rate False True \n", "same_srv_rate True False \n", "diff_srv_rate False False \n", "dst_host_count False False \n", "dst_host_srv_count False False \n", "dst_host_same_srv_rate False False \n", "dst_host_diff_srv_rate False False \n", "dst_host_same_src_port_rate False False \n", "dst_host_srv_diff_host_rate False False \n", "dst_host_serror_rate True False \n", "dst_host_srv_serror_rate False False \n", "dst_host_rerror_rate False False \n", "dst_host_srv_rerror_rate False True \n", "\n", " dst_host_srv_rerror_rate \n", "src_bytes False \n", "dst_bytes False \n", "hot False \n", "logged_in False \n", "num_compromised False \n", "num_outbound_cmds False \n", "is_hot_login False \n", "count False \n", "srv_count False \n", "serror_rate False \n", "srv_serror_rate False \n", "rerror_rate True \n", "srv_rerror_rate True \n", "same_srv_rate False \n", "diff_srv_rate False \n", "dst_host_count False \n", "dst_host_srv_count False \n", "dst_host_same_srv_rate False \n", "dst_host_diff_srv_rate False \n", "dst_host_same_src_port_rate False \n", "dst_host_srv_diff_host_rate False \n", "dst_host_serror_rate False \n", "dst_host_srv_serror_rate False \n", "dst_host_rerror_rate True \n", "dst_host_srv_rerror_rate False " ] } ], "prompt_number": 22 }, { "cell_type": "heading", "level": 3, "metadata": {}, 
"source": [ "Conclusions and possible model selection hints" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The previous dataframe showed us which variables are highly correlated. We have kept just those variables with at least one strong correlation. We can use as we please, but a good way could be to do some model selection. That is, if we have a group of variables that are highly correlated, we can keep just one of them to represent the group under the assumption that they convey similar information as predictors. Reducing the number of variables will not improve our model accuracy, but it will make it easier to understand and also more efficient to compute. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, from the description of the [KDD Cup 99 task](http://kdd.ics.uci.edu/databases/kddcup99/task.html) we know that the variable `dst_host_same_src_port_rate` references the percentage of the last 100 connections to the same port, for the same destination host. In our correlation matrix (and auxiliar dataframes) we find that this one is highly and positively correlated to `src_bytes` and `srv_count`. The former is the number of bytes sent form source to destination. The later is the number of connections to the same service as the current connection in the past 2 seconds. We might decide not to include `dst_host_same_src_port_rate` in our model if we include the other two, as a way to reduce the number of variables and later one better interpret our models. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Later on, in those notebooks dedicated to build predictive models, we will make use of this information to build more interpretable models. " ] } ], "metadata": {} } ] }