{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 05\n", "\n", "## Logistic regression exercise to detect network intrusions\n", "\n",
 "Software to detect network intrusions protects a computer network from unauthorized users, possibly including insiders. The intrusion-detector learning task is to build a predictive model (i.e., a classifier) capable of distinguishing between bad connections, called intrusions or attacks, and good, normal connections.\n", "\n",
 "The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided. The 1999 KDD intrusion detection contest used a version of this dataset.\n", "\n",
 "Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks.\n", "\n",
 "The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records. [description](http://kdd.ics.uci.edu/databases/kddcup99/task.html)\n", "\n",
 "A connection is a sequence of TCP packets starting and ending at some well-defined times, between which data flows to and from a source IP address to a target IP address under some well-defined protocol. Each connection is labeled as either normal or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.\n", "\n",
 "Attacks fall into four main categories:\n", "\n",
 "* DoS: denial of service, e.g., SYN flood;\n",
 "* R2L: unauthorized access from a remote machine, e.g., password guessing;\n",
 "* U2R: unauthorized access to local superuser (root) privileges, e.g., various buffer-overflow attacks;\n",
 "* probing: surveillance and other probing, e.g., port scanning.\n", "\n",
 "It is important to note that the test data is not drawn from the same probability distribution as the training data, and it includes specific attack types not present in the training data. This makes the task more realistic. Some intrusion experts believe that most novel attacks are variants of known attacks and that the \"signature\" of known attacks can be sufficient to catch novel variants. The datasets contain a total of 24 training attack types, with an additional 14 types appearing in the test data only.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the data into pandas." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " duration protocol_type service flag src_bytes dst_bytes land \\\n", "0 0 tcp ftp_data SF 491 0 0 \n", "1 0 udp other SF 146 0 0 \n", "2 0 tcp private S0 0 0 0 \n", "3 0 tcp http SF 232 8153 0 \n", "4 0 tcp http SF 199 420 0 \n", "\n", " wrong_fragment urgent hot num_failed_logins logged_in num_compromised \\\n", "0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 \n", "3 0 0 0 0 1 0 \n", "4 0 0 0 0 1 0 \n", "\n", " root_shell su_attempted num_root num_file_creations num_shells \\\n", "0 0 0 0 0 0 \n", "1 0 0 0 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 0 0 0 \n", "\n", " num_access_files num_outbound_cmds is_host_login is_guest_login count \\\n", "0 0 0 0 0 2 \n", "1 0 0 0 0 13 \n", "2 0 0 0 0 123 \n", "3 0 0 0 0 5 \n", "4 0 0 0 0 30 \n", "\n", " srv_count serror_rate srv_serror_rate rerror_rate srv_rerror_rate \\\n", "0 2 0.0 0.0 0.0 0.0 \n", "1 1 0.0 0.0 0.0 0.0 \n", "2 6 1.0 1.0 0.0 0.0 \n", "3 5 0.2 0.2 0.0 0.0 \n", "4 32 0.0 0.0 0.0 0.0 \n", "\n", " same_srv_rate diff_srv_rate srv_diff_host_rate dst_host_count \\\n", "0 1.00 0.00 0.00 150 \n", "1 0.08 0.15 0.00 255 \n", "2 0.05 0.07 0.00 255 \n", "3 1.00 0.00 0.00 30 \n", "4 1.00 0.00 0.09 255 \n", "\n", " dst_host_srv_count dst_host_same_srv_rate dst_host_diff_srv_rate \\\n", "0 25 0.17 0.03 \n", "1 1 0.00 0.60 \n", "2 26 0.10 0.05 \n", "3 255 1.00 0.00 \n", "4 255 1.00 0.00 \n", "\n", " dst_host_same_src_port_rate dst_host_srv_diff_host_rate \\\n", "0 0.17 0.00 \n", "1 0.88 0.00 \n", "2 0.00 0.00 \n", "3 0.03 0.04 \n", "4 0.00 0.00 \n", "\n", " dst_host_serror_rate dst_host_srv_serror_rate dst_host_rerror_rate \\\n", "0 0.00 0.00 0.05 \n", "1 0.00 0.00 0.00 \n", "2 1.00 1.00 0.00 \n", "3 0.03 0.01 0.00 \n", "4 0.00 0.00 0.00 \n", "\n", " dst_host_srv_rerror_rate class \n", "0 0.00 normal \n", "1 0.00 normal \n", "2 0.00 anomaly \n", "3 0.01 normal \n", "4 0.00 normal " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas 
as pd\n", "pd.set_option('display.max_columns', 500)\n", "import zipfile\n", "\n",
 "with zipfile.ZipFile('../datasets/UNB_ISCX_NSL_KDD.csv.zip', 'r') as z:\n",
 "    f = z.open('UNB_ISCX_NSL_KDD.csv')\n",
 "    data = pd.read_csv(f)\n",
 "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create `X` and `y`.\n", "\n", "Use only **same_srv_rate** and **dst_host_srv_count**." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "y = (data['class'] == 'anomaly').astype(int)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0    77054\n", "1    71463\n", "Name: class, dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y.value_counts()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X = data[['same_srv_rate', 'dst_host_srv_count']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 5.1\n", "\n", "Split the data into training and testing sets." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 5.2\n", "\n", "Fit a logistic regression model and examine the coefficients.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 5.3\n", "\n", "Make predictions on the testing set and calculate the accuracy." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 5.4\n", "\n", "Compute the confusion matrix of the predictions.\n", "\n", "What is the percentage of detected anomalies?" ] }, { "cell_type": "code",
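"execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For reference, below is one possible sketch for Exercises 5.1-5.4, not the only solution. It assumes scikit-learn is available and uses its standard `train_test_split`, `LogisticRegression`, `accuracy_score`, and `confusion_matrix` helpers; try the exercises in the empty cells above before looking." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [
 "# One possible sketch, not the only solution (assumes scikit-learn)\n",
 "from sklearn.model_selection import train_test_split\n",
 "from sklearn.linear_model import LogisticRegression\n",
 "from sklearn.metrics import accuracy_score, confusion_matrix\n",
 "\n",
 "# 5.1: hold out 30% of the connections for testing\n",
 "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)\n",
 "\n",
 "# 5.2: fit the model and examine the coefficients\n",
 "logreg = LogisticRegression()\n",
 "logreg.fit(X_train, y_train)\n",
 "print(logreg.coef_, logreg.intercept_)\n",
 "\n",
 "# 5.3: accuracy on the testing set\n",
 "y_pred = logreg.predict(X_test)\n",
 "print(accuracy_score(y_test, y_pred))\n",
 "\n",
 "# 5.4: confusion matrix; rows are true labels, so the percentage of\n",
 "# detected anomalies is TP / (TP + FN), i.e. recall of class 1\n",
 "cm = confusion_matrix(y_test, y_pred)\n",
 "print(cm)\n",
 "print(cm[1, 1] / (cm[1, 0] + cm[1, 1]))" ] }, { "cell_type": "code",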
"execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 5.5\n", "\n", "Increase sensitivity by lowering the probability threshold for predicting an anomalous connection.\n", "\n", "Create a new classifier by changing the probability threshold to 0.3.\n", "\n", "What is the new confusion matrix?\n", "\n", "What is the new percentage of detected anomalies?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }