{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "
\n", "\n", "# Risky Domains\n", "As one of our first notebooks we're going to keep it fairly simple and just focus on TLDs. We'll revisit **Risky Domains** with another notebook where use all the parts of the domain and we'll cover more advanced modeling and machine leanrning techniques.\n", "\n", "This notebook explores the modeling of risky domains where the usage of the term **risky** has the dual characteristics of being both 'not common' and 'associated with bad'.\n", "\n", "\n", "Domain blacklists are great but they only go so far. In this notebook we explore and analyze domain blacklists. The approach will be to pin down indicators or patterns and then use those to flag domains. We're trying to differentiate the common vs. uncommon or more specificially the common vs. blacklist. Our intention is to **cast a wider net** than the blacklist. In general we trying to achieve the following benefits:\n", "- We don't have to exactly match the blacklist (which is probably already out of date).\n", "- We might identify common patterns that capture a family or larger set of malicious domains.\n", "\n", "In this notebook we're going to use data from MalwareDomains, Malwarebytes and CyberCrime Tracker. We're going to analyze those domains with a statistical technique called G-Test. We'll use the statistical results to evaluate and score new domains streaming in from Zeek IDS.\n", "\n", "Data Used\n", "- Malware Domain Blocklist: http://www.malwaredomains.com\n", "- Malwarebytes(hpHosts EMD): https://hosts-file.net/emd.txt\n", "- CyberCrime Tracker: http://cybercrime-tracker.net/\n", "\n", "\n", "- Cisco Umbrella: http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip\n", "- Alexa: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip\n", "\n", "Software\n", "- zat: https://github.com/SuperCowPowers/zat\n", "- Pandas: https://github.com/pandas-dev/pandas\n", "- TLDExtract: https://github.com/john-kurkowski/tldextract\n", "\n", "Techniques\n", "- G-Test: https://en.wikipedia.org/wiki/G-test\n", "\n", "\n", "Shout Outs:\n", "- Netresec (Alexa vs. 
"- Netresec (Alexa vs. Umbrella Blog): http://netres.ec/?b=1743FAE\n", "- Netresec (Threat Hunting Rinse-Repeat): http://netres.ec/?b=1582D1D\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "zat: 0.1.5\n", "Pandas: 0.19.2\n", "Numpy: 1.12.1\n", "Scikit Learn Version: 0.18.1\n" ] } ], "source": [ "import os\n", "import zat\n", "from zat.utils import file_utils\n", "print('zat: {:s}'.format(zat.__version__))\n", "import pandas as pd\n", "print('Pandas: {:s}'.format(pd.__version__))\n", "import numpy as np\n", "print('Numpy: {:s}'.format(np.__version__))\n", "from sklearn.externals import joblib\n", "import sklearn.ensemble\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "print('Scikit Learn Version:', sklearn.__version__)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Grab all the datasets\n", "notebook_path = %pwd\n", "data_path = os.path.join(notebook_path, 'data')\n", "block_file = os.path.join(data_path, 'mal_dom_block.txt')\n", "cyber_file = os.path.join(data_path, 'cybercrime.txt')\n", "emd_file = os.path.join(data_path, 'emd.txt')\n", "alexa_file = os.path.join(data_path, 'alexa_1m.csv')\n", "umbrella_file = os.path.join(data_path, 'umbrella_1m.csv')\n", "with open(block_file) as bfp:\n", "    block_domains = [row.strip() for row in bfp.readlines()]\n", "with open(cyber_file) as bfp:\n", "    cyber_domains = [row.strip() for row in bfp.readlines()]\n", "with open(emd_file) as bfp:\n", "    emd_domains = [row.split('\\t')[1].strip() for row in bfp.readlines() if '#' not in row]\n", "with open(alexa_file) as afp:\n", "    alexa_domains = [row.split(',')[1].strip() for row in afp.readlines()]\n", "with open(umbrella_file) as afp:\n", "    umbrella_domains = [row.split(',')[1].strip() for row in afp.readlines()]" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "
\n", "## Always look at the data\n", "When you pull in data, always make sure to visually inspect it before going any further. In my experience about 75% of the time you aren't getting what you think on the first try." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1000000\n" ] }, { "data": { "text/plain": [ "['google.com',\n", " 'www.google.com',\n", " 'facebook.com',\n", " 'microsoft.com',\n", " 'doubleclick.net',\n", " 'g.doubleclick.net',\n", " 'clients4.google.com',\n", " 'googleads.g.doubleclick.net',\n", " 'google-analytics.com',\n", " 'apple.com']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Look at the Cisco Umbrella domains\n", "print(len(umbrella_domains))\n", "umbrella_domains[:10]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1000000\n" ] }, { "data": { "text/plain": [ "['google.com',\n", " 'youtube.com',\n", " 'facebook.com',\n", " 'baidu.com',\n", " 'wikipedia.org',\n", " 'yahoo.com',\n", " 'reddit.com',\n", " 'google.co.in',\n", " 'qq.com',\n", " 'twitter.com']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Look at the Alexa domains\n", "print(len(alexa_domains))\n", "alexa_domains[:10]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Malware Domain Blocklist: 18383\n", "['amazon.co.uk.security-check.ga', 'autosegurancabrasil.com', 'christianmensfellowshipsoftball.org', 'dadossolicitado-antendimento.sad879.mobi', 'hitnrun.com.my']\n", "\n", "Malwarebytes(hpHosts EMD): 10111\n", "['fpbqrouphaiti.com/sales!11-04/admin.php', 'jensonsintrenational.com/class/fat/cp.php?m=login', 'cboy.sytes.net/mypage/admin.php', '46.183.223.114/igere/3/admin.php', 'frankweb.club/temple/admin.php']\n", "\n", "CyberCrime Tracker: 156698\n", "['-sso.anbtr.com', '0.gvt0.com', '0.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfre18704554415.error2212.in', '000-101.org', '00005ik.rcomhost.com', '0000663c.tslocosumo.us', '0000a-fast-proxy.de', '0000pv6.rxportalhosting.com', '000my001.eu', '000my002.eu']\n" ] } ], "source": [ "# Look at all the known bad domains\n", "print('Malware Domain Blocklist: {:d}'.format(len(block_domains)))\n", "print(block_domains[:5])\n", "print('\\nMalwarebytes(hpHosts EMD): {:d}'.format(len(cyber_domains)))\n", "print(cyber_domains[:5])\n", "print('\\nCyberCrime Tracker: {:d}'.format(len(emd_domains)))\n", "print(emd_domains[:10])" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "
\n", "## We see we need to do some cleanup/normalization\n", "Data cleanup or data normalization is always part of doing data analysis and we often spend quite a bit of time on it. Thanksfully in this case the **tldextract** Python module does the hard work for us. For the purposes of this notebook we'll be using the following terminology for the parts of the fully qualified domain name (**subdomain.domain.tld**). So for example:\n", "- www.google.com: **www**=subdomain, **google**=domain, **com**=tld" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "import tldextract\n", "def clean_domains(domain_list):\n", " for domain in domain_list:\n", " ext = tldextract.extract(domain)\n", " if ext.suffix: # If we don't have suffix either IP address or 'local/home/lan/etc'\n", " yield ext.subdomain, ext.domain, ext.suffix" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/plain": [ "[('amazon.co.uk', 'security-check', 'ga'),\n", " ('', 'autosegurancabrasil', 'com'),\n", " ('', 'christianmensfellowshipsoftball', 'org'),\n", " ('dadossolicitado-antendimento', 'sad879', 'mobi'),\n", " ('', 'hitnrun', 'com.my')]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Clean up and combine bad domains\n", "bad_domains = [domain for domain in clean_domains(block_domains)]\n", "bad_domains += [domain for domain in clean_domains(cyber_domains)]\n", "bad_domains += [domain for domain in clean_domains(emd_domains)]\n", "bad_domains[:5]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Before duplication removal: 183405\n", "After duplication removal: 179029\n" ] } ], "source": [ "# Remove ALL domain.tld duplicates\n", "# Note: This will be a lot as these lists will often have many subdomain s\n", "print('Before duplication removal: {:d}'.format(len(bad_domains)))\n", "bad_domains = list(set(bad_domains))\n", "print('After duplication removal: {:d}'.format(len(bad_domains)))" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "
\n", "## Alexa or Cisco Umbrella?\n", "The Alexa dataset does not contain subdomain information so although all these sites are extremely popular:\n", "- **www.google.com, accounts.google.com, apis.google.com, play.google.com, mtalk.google.com, mail.google.com**\n", "\n", "All of these will simply get rolled up into 'google.com' in the Alexa set. Since we're interested in the subdomains (in a later notebook) we're going to use the Umbrella dataset.\n", "\n", "**NOTE:** The benefit Alexa has is that it covers more domains, so one could argue that our statistics below are 'wrong' because Umbrella doesn't cover ALL one million domains. We recognize and understand this. The stats below are for **common** vs. known bad and Umbrella is certainly covering the common domains (even if it's not one million total)." ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Process the common domains (common not 'good')\n", "Notice here that we're using the term **common** instead of **good**. It's well known that Alexa/Umbrella lists contain some malicious/hacked domains.\n", "- See Netresec blog http://netres.ec/?b=1743FAE for more info" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/plain": [ "[('', 'google', 'com'),\n", " ('www', 'google', 'com'),\n", " ('', 'facebook', 'com'),\n", " ('', 'microsoft', 'com'),\n", " ('', 'doubleclick', 'net')]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Clean up and append common domains\n", "common_domains = [domain for domain in clean_domains(umbrella_domains)]\n", "common_domains[:5]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Before duplication removal: 994946\n", "After duplication removal: 994946\n" ] } ], "source": [ "# Remove any duplicates\n", "print('Before duplication removal: {:d}'.format(len(common_domains)))\n", "common_domains = list(set(common_domains))\n", "print('After duplication removal: {:d}'.format(len(common_domains)))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of overlaps: 1468\n" ] }, { "data": { "text/plain": [ "[('', 'ddth', 'com'),\n", " ('', 'cjb', 'net'),\n", " ('', 'ludashi', 'com'),\n", " ('xpi', 'searchtabnew', 'com'),\n", " ('dnspod-free', 'mydnspod', 'net'),\n", " ('', 'rol', 'ru'),\n", " ('dl', 'pconline', 'com.cn'),\n", " ('', 'webshieldonline', 'com'),\n", " ('rep', 'ytdownloader', 'com'),\n", " ('start', 'funmoods', 'com')]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Speaking of common instead of 'good' lets look at some\n", "# of the domains that intersect the blacklists\n", "bad_common = set(common_domains).intersection(bad_domains)\n", "print('Number of overlaps: {:d}'.format(len(bad_common)))\n", "list(bad_common)[:10]" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Removing blacklisted domains from common list\n", "So now we're going to remove any blacklisted domains from the common list (Cisco Umbrella list)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "deletable": true, 
"editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original common: 994946\n", "Blacklisted domains removed: 993478\n" ] } ], "source": [ "print('Original common: {:d}'.format(len(common_domains)))\n", "common_domains = list(set(common_domains).difference(bad_domains))\n", "print('Blacklisted domains removed: {:d}'.format(len(common_domains)))" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Create Pandas DataFrames\n", "- **DataFrames** are used in both *R* and *Python*. Pandas has an excellent implementation that really helps when doing any kind of processing, statistics or machine learning work." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Create dataframes\n", "df_bad = pd.DataFrame.from_records(bad_domains, columns=['subdomain', 'domain', 'tld'])\n", "df_bad['label'] = 'bad'\n", "df_common = pd.DataFrame.from_records(common_domains, columns=['subdomain', 'domain', 'tld'])\n", "df_common['label'] = 'common'" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Bad Domains: 179029\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subdomaindomaintldlabel
0wwwqdlhprdtwhvgxuzklovisrdbkhptpfarrbcmtrxbzlvhyg...combad
1advancecomputersonlinebad
2perseeponacombad
3wwwdownloadfriendinfobad
42o9jkm6yfjcentadecombad
\n", "
" ], "text/plain": [ " subdomain domain tld label\n", "0 www qdlhprdtwhvgxuzklovisrdbkhptpfarrbcmtrxbzlvhyg... com bad\n", "1 advancecomputers online bad\n", "2 perseepona com bad\n", "3 www downloadfriend info bad\n", "4 2o9jkm6yfj centade com bad" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print('Bad Domains: {:d}'.format(len(df_bad)))\n", "df_bad.head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Common Domains: 993478\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subdomaindomaintldlabel
0msgappcomcommon
1onlinecitibankco.incommon
2emhapfokdlyxtfgucmjmcxcommon
3assetaffectvcomcommon
4gamerswithjobscomcommon
\n", "
" ], "text/plain": [ " subdomain domain tld label\n", "0 msgapp com common\n", "1 online citibank co.in common\n", "2 emhapfokdlyxtfgucmjm cx common\n", "3 asset affectv com common\n", "4 gamerswithjobs com common" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print('Common Domains: {:d}'.format(len(df_common)))\n", "df_common.head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subdomaindomaintldlabel
0msgappcomcommon
1onlinecitibankco.incommon
2emhapfokdlyxtfgucmjmcxcommon
3assetaffectvcomcommon
4gamerswithjobscomcommon
\n", "
" ], "text/plain": [ " subdomain domain tld label\n", "0 msgapp com common\n", "1 online citibank co.in common\n", "2 emhapfokdlyxtfgucmjm cx common\n", "3 asset affectv com common\n", "4 gamerswithjobs com common" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now that the records have labels on them (bad/common) we can combine them into one DataFrame\n", "df_all = df_common.append(df_bad, ignore_index=True)\n", "df_all.head()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Just the TLDs (for now)\n", "We're going to do some statistics on the TLDs to see how they're distributed between the bad and common domains. The zat python package provides a nice set of functionality for statistics on Pandas DataFrames (https://github.com/SuperCowPowers/zat)." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Run a bunch of statistics from the zat python package\n", "import zat.dataframe_stats as df_stats" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Contingency Table\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelbadcommonAll
tld
ab.ca0.024.024.0
abbott0.02.02.0
abruzzo.it0.01.01.0
ac3.0440.0443.0
ac.ae0.012.012.0
\n", "
" ], "text/plain": [ "label bad common All\n", "tld \n", "ab.ca 0.0 24.0 24.0\n", "abbott 0.0 2.0 2.0\n", "abruzzo.it 0.0 1.0 1.0\n", "ac 3.0 440.0 443.0\n", "ac.ae 0.0 12.0 12.0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print out the contingency_table\n", "print('\\nContingency Table')\n", "cont_table = df_stats.contingency_table(df_all, 'tld', 'label')\n", "cont_table.head()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Expected Counts Table\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelbadcommonAll
tld
ab.ca3.66453820.33546224.0
abbott0.3053781.6946222.0
abruzzo.it0.1526890.8473111.0
ac67.641257375.358743443.0
ac.ae1.83226910.16773112.0
\n", "
" ], "text/plain": [ "label bad common All\n", "tld \n", "ab.ca 3.664538 20.335462 24.0\n", "abbott 0.305378 1.694622 2.0\n", "abruzzo.it 0.152689 0.847311 1.0\n", "ac 67.641257 375.358743 443.0\n", "ac.ae 1.832269 10.167731 12.0" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print out the expected_counts\n", "print('\\nExpected Counts Table')\n", "expect_counts = df_stats.expected_counts(df_all, 'tld', 'label')\n", "expect_counts.head()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "G-Test Scores\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelbadcommon
tld
ab.ca-77
abbott00
abruzzo.it00
ac-121121
ac.ae-33
\n", "
" ], "text/plain": [ "label bad common\n", "tld \n", "ab.ca -7 7\n", "abbott 0 0\n", "abruzzo.it 0 0\n", "ac -121 121\n", "ac.ae -3 3" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print out the g_test scores\n", "print('\\nG-Test Scores')\n", "g_scores = df_stats.g_test_scores(df_all, 'tld', 'label')\n", "g_scores.head()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Sort the GTest Scores\n", "For a formal interpretation of these scores please see (https://en.wikipedia.org/wiki/G-test). Informally, the higher the score the more that item **stands out** from a probability perspective from what the expected counts would be given the *null hypothesis* that the TLDs should occur equally likely in both classes.\n", "\n", "**Example:**\n", "\n", "The **tk** TLD occured about **~6200 times** across both datasets. So because we have about 5x more common domains than bad domains then if all else is equal we should see it about **~950 times** in the bad set and **~5250 times** in the common set. The actual observation is that we see it **6071 times** in the bad set and **only 94 times** in the common set. So seeing a **tk** domain pass through your IDS would definitely be a good thing to put on your **'short list'**.\n", "\n", "**See Expected Counts and Actual Counts Below**" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelbadcommon
tld
info35476-35476
tk21877-21877
xyz20444-20444
online6701-6701
club3252-3252
ru3124-3124
website1923-1923
in1069-1069
ws752-752
top693-693
site614-614
work576-576
biz556-556
name516-516
tech478-478
\n", "
" ], "text/plain": [ "label bad common\n", "tld \n", "info 35476 -35476\n", "tk 21877 -21877\n", "xyz 20444 -20444\n", "online 6701 -6701\n", "club 3252 -3252\n", "ru 3124 -3124\n", "website 1923 -1923\n", "in 1069 -1069\n", "ws 752 -752\n", "top 693 -693\n", "site 614 -614\n", "work 576 -576\n", "biz 556 -556\n", "name 516 -516\n", "tech 478 -478" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sort the GTest scores \n", "g_scores.sort_values('bad', ascending=False).head(15)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Expected Counts:\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelbadcommonAll
tld
club310.5695621723.4304382034.0
info2732.52354515163.47645517896.0
online392.2582132176.7417872569.0
tk941.3280995223.6719016165.0
xyz1216.0157306747.9842707964.0
\n", "
" ], "text/plain": [ "label bad common All\n", "tld \n", "club 310.569562 1723.430438 2034.0\n", "info 2732.523545 15163.476455 17896.0\n", "online 392.258213 2176.741787 2569.0\n", "tk 941.328099 5223.671901 6165.0\n", "xyz 1216.015730 6747.984270 7964.0" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Lets look at some of the TLDs Expected Counts vs. Actual Counts\n", "interesting_tlds = ['info', 'tk', 'xyz', 'online', 'club']\n", "print('Expected Counts:') \n", "expect_counts[expect_counts.index.isin(interesting_tlds)]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Actual Counts:\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelbadcommonAll
tld
club1459.0575.02034.0
info14053.03843.017896.0
online2259.0310.02569.0
tk6071.094.06165.0
xyz6958.01006.07964.0
\n", "
" ], "text/plain": [ "label bad common All\n", "tld \n", "club 1459.0 575.0 2034.0\n", "info 14053.0 3843.0 17896.0\n", "online 2259.0 310.0 2569.0\n", "tk 6071.0 94.0 6165.0\n", "xyz 6958.0 1006.0 7964.0" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print('Actual Counts:')\n", "cont_table[cont_table.index.isin(interesting_tlds)]" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "deletable": true, "editable": true }, "source": [ "## Phase1 Complete\n", "\n", "We can see from the sorted GTest Score table that domains like **info, tk, xyz, club, ...** occur much more often in the blacklists than they do in the common lists (Umbrella/Alexa). So even with this small insight we could set up a Zeek Script (or a **zat Python script**) to mark domains with those TLDs as **risky**.\n", "\n", "In **Phase 2** of this notebook we'll dive into the domains and subdomains. Using NGram extraction and our G-Test statistics on **all** the extracted features to do feature selection for a **sparse data machine learning model**. We'll leverage the fantastic set of models available in the Python **scikit-learn** module and we'll show how to use zat to deploy that model so that new domains coming from Zeek can be evaluated and scored in realtime." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "deletable": true, "editable": true }, "source": [ "# Deployment with zat\n", "Now that we know which TLDs are 'risky' we can take action with zat. See the example [risky_domains.py](https://zat-tools.readthedocs.io/en/latest/examples.html#risky-domains) that uses these results to flag the realtime DNS logs coming from Zeek and makes a Virus Total query on any flagged domains. If the Virus Total query returns positives then we report the observation.\n", "\n", "Although this sounds simplistic it's actually quite effective. The number of VT queries we make is extremely small compared to the total volume of DNS queries and given the statistical results the probably of a 'hit' is reasonably high and of course we're casting a wider net then the original blacklist.\n", "\n", "## Try it Out\n", "If you liked this notebook please visit the [zat](https://github.com/SuperCowPowers/zat) project for more notebooks and examples. You can run all the examples with a simple $pip install zat (and a running Zeek instance of course)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }