{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "
\n", "\n", "# Risky Domains\n", "As one of our first notebooks we're going to keep it fairly simple and just focus on TLDs. We'll revisit **Risky Domains** with another notebook where use all the parts of the domain and we'll cover more advanced modeling and machine leanrning techniques.\n", "\n", "This notebook explores the modeling of risky domains where the usage of the term **risky** has the dual characteristics of being both 'not common' and 'associated with bad'.\n", "\n", "\n", "Domain blacklists are great but they only go so far. In this notebook we explore and analyze domain blacklists. The approach will be to pin down indicators or patterns and then use those to flag domains. We're trying to differentiate the common vs. uncommon or more specificially the common vs. blacklist. Our intention is to **cast a wider net** than the blacklist. In general we trying to achieve the following benefits:\n", "- We don't have to exactly match the blacklist (which is probably already out of date).\n", "- We might identify common patterns that capture a family or larger set of malicious domains.\n", "\n", "In this notebook we're going to use data from MalwareDomains, Malwarebytes and CyberCrime Tracker. We're going to analyze those domains with a statistical technique called G-Test. We'll use the statistical results to evaluate and score new domains streaming in from Zeek IDS.\n", "\n", "Data Used\n", "- Malware Domain Blocklist: http://www.malwaredomains.com\n", "- Malwarebytes(hpHosts EMD): https://hosts-file.net/emd.txt\n", "- CyberCrime Tracker: http://cybercrime-tracker.net/\n", "\n", "\n", "- Cisco Umbrella: http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip\n", "- Alexa: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip\n", "\n", "Software\n", "- zat: https://github.com/SuperCowPowers/zat\n", "- Pandas: https://github.com/pandas-dev/pandas\n", "- TLDExtract: https://github.com/john-kurkowski/tldextract\n", "\n", "Techniques\n", "- G-Test: https://en.wikipedia.org/wiki/G-test\n", "\n", "\n", "Shout Outs:\n", "- Netresec (Alexa vs. 
"- Netresec (Alexa vs. Umbrella Blog): http://netres.ec/?b=1743FAE\n", "- Netresec (Threat Hunting Rinse-Repeat): http://netres.ec/?b=1582D1D\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "zat: 0.1.5\n", "Pandas: 0.19.2\n", "Numpy: 1.12.1\n", "Scikit Learn Version: 0.18.1\n" ] } ], "source": [ "import os\n", "import zat\n", "from zat.utils import file_utils\n", "print('zat: {:s}'.format(zat.__version__))\n", "import pandas as pd\n", "print('Pandas: {:s}'.format(pd.__version__))\n", "import numpy as np\n", "print('Numpy: {:s}'.format(np.__version__))\n", "from sklearn.externals import joblib\n", "import sklearn.ensemble\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "print('Scikit Learn Version:', sklearn.__version__)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Grab all the datasets\n", "notebook_path = %pwd\n", "data_path = os.path.join(notebook_path, 'data')\n", "block_file = os.path.join(data_path, 'mal_dom_block.txt')\n", "cyber_file = os.path.join(data_path, 'cybercrime.txt')\n", "emd_file = os.path.join(data_path, 'emd.txt')\n", "alexa_file = os.path.join(data_path, 'alexa_1m.csv')\n", "umbrella_file = os.path.join(data_path, 'umbrella_1m.csv')\n", "with open(block_file) as bfp:\n", "    block_domains = [row.strip() for row in bfp.readlines()]\n", "with open(cyber_file) as bfp:\n", "    cyber_domains = [row.strip() for row in bfp.readlines()]\n", "with open(emd_file) as bfp:\n", "    emd_domains = [row.split('\\t')[1].strip() for row in bfp.readlines() if '#' not in row]\n", "with open(alexa_file) as afp:\n", "    alexa_domains = [row.split(',')[1].strip() for row in afp.readlines()]\n", "with open(umbrella_file) as afp:\n", "    umbrella_domains = [row.split(',')[1].strip() for row in afp.readlines()]" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "
\n", "## Always look at the data\n", "When you pull in data, always make sure to visually inspect it before going any further. In my experience about 75% of the time you aren't getting what you think on the first try." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1000000\n" ] }, { "data": { "text/plain": [ "['google.com',\n", " 'www.google.com',\n", " 'facebook.com',\n", " 'microsoft.com',\n", " 'doubleclick.net',\n", " 'g.doubleclick.net',\n", " 'clients4.google.com',\n", " 'googleads.g.doubleclick.net',\n", " 'google-analytics.com',\n", " 'apple.com']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Look at the Cisco Umbrella domains\n", "print(len(umbrella_domains))\n", "umbrella_domains[:10]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1000000\n" ] }, { "data": { "text/plain": [ "['google.com',\n", " 'youtube.com',\n", " 'facebook.com',\n", " 'baidu.com',\n", " 'wikipedia.org',\n", " 'yahoo.com',\n", " 'reddit.com',\n", " 'google.co.in',\n", " 'qq.com',\n", " 'twitter.com']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Look at the Alexa domains\n", "print(len(alexa_domains))\n", "alexa_domains[:10]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Malware Domain Blocklist: 18383\n", "['amazon.co.uk.security-check.ga', 'autosegurancabrasil.com', 'christianmensfellowshipsoftball.org', 'dadossolicitado-antendimento.sad879.mobi', 'hitnrun.com.my']\n", "\n", "Malwarebytes(hpHosts EMD): 10111\n", "['fpbqrouphaiti.com/sales!11-04/admin.php', 'jensonsintrenational.com/class/fat/cp.php?m=login', 'cboy.sytes.net/mypage/admin.php', '46.183.223.114/igere/3/admin.php', 'frankweb.club/temple/admin.php']\n", "\n", "CyberCrime Tracker: 156698\n", "['-sso.anbtr.com', '0.gvt0.com', '0.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfree18704554415.yourspchasbeenblockedcalltollfre18704554415.error2212.in', '000-101.org', '00005ik.rcomhost.com', '0000663c.tslocosumo.us', '0000a-fast-proxy.de', '0000pv6.rxportalhosting.com', '000my001.eu', '000my002.eu']\n" ] } ], "source": [ "# Look at all the known bad domains\n", "print('Malware Domain Blocklist: {:d}'.format(len(block_domains)))\n", "print(block_domains[:5])\n", "print('\\nMalwarebytes(hpHosts EMD): {:d}'.format(len(cyber_domains)))\n", "print(cyber_domains[:5])\n", "print('\\nCyberCrime Tracker: {:d}'.format(len(emd_domains)))\n", "print(emd_domains[:10])" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "
\n", "## We see we need to do some cleanup/normalization\n", "Data cleanup or data normalization is always part of doing data analysis and we often spend quite a bit of time on it. Thanksfully in this case the **tldextract** Python module does the hard work for us. For the purposes of this notebook we'll be using the following terminology for the parts of the fully qualified domain name (**subdomain.domain.tld**). So for example:\n", "- www.google.com: **www**=subdomain, **google**=domain, **com**=tld" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "import tldextract\n", "def clean_domains(domain_list):\n", " for domain in domain_list:\n", " ext = tldextract.extract(domain)\n", " if ext.suffix: # If we don't have suffix either IP address or 'local/home/lan/etc'\n", " yield ext.subdomain, ext.domain, ext.suffix" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/plain": [ "[('amazon.co.uk', 'security-check', 'ga'),\n", " ('', 'autosegurancabrasil', 'com'),\n", " ('', 'christianmensfellowshipsoftball', 'org'),\n", " ('dadossolicitado-antendimento', 'sad879', 'mobi'),\n", " ('', 'hitnrun', 'com.my')]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Clean up and combine bad domains\n", "bad_domains = [domain for domain in clean_domains(block_domains)]\n", "bad_domains += [domain for domain in clean_domains(cyber_domains)]\n", "bad_domains += [domain for domain in clean_domains(emd_domains)]\n", "bad_domains[:5]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Before duplication removal: 183405\n", "After duplication removal: 179029\n" ] } ], "source": [ "# Remove ALL domain.tld duplicates\n", "# Note: This will be a lot as these lists will often have many subdomain s\n", "print('Before duplication removal: {:d}'.format(len(bad_domains)))\n", "bad_domains = list(set(bad_domains))\n", "print('After duplication removal: {:d}'.format(len(bad_domains)))" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "
\n", "## Alexa or Cisco Umbrella?\n", "The Alexa dataset does not contain subdomain information so although all these sites are extremely popular:\n", "- **www.google.com, accounts.google.com, apis.google.com, play.google.com, mtalk.google.com, mail.google.com**\n", "\n", "All of these will simply get rolled up into 'google.com' in the Alexa set. Since we're interested in the subdomains (in a later notebook) we're going to use the Umbrella dataset.\n", "\n", "**NOTE:** The benefit Alexa has is that it covers more domains, so one could argue that our statistics below are 'wrong' because Umbrella doesn't cover ALL one million domains. We recognize and understand this. The stats below are for **common** vs. known bad and Umbrella is certainly covering the common domains (even if it's not one million total)." ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Process the common domains (common not 'good')\n", "Notice here that we're using the term **common** instead of **good**. It's well known that Alexa/Umbrella lists contain some malicious/hacked domains.\n", "- See Netresec blog http://netres.ec/?b=1743FAE for more info" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/plain": [ "[('', 'google', 'com'),\n", " ('www', 'google', 'com'),\n", " ('', 'facebook', 'com'),\n", " ('', 'microsoft', 'com'),\n", " ('', 'doubleclick', 'net')]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Clean up and append common domains\n", "common_domains = [domain for domain in clean_domains(umbrella_domains)]\n", "common_domains[:5]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Before duplication removal: 994946\n", "After duplication removal: 994946\n" ] } ], "source": [ "# Remove any duplicates\n", "print('Before duplication removal: {:d}'.format(len(common_domains)))\n", "common_domains = list(set(common_domains))\n", "print('After duplication removal: {:d}'.format(len(common_domains)))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of overlaps: 1468\n" ] }, { "data": { "text/plain": [ "[('', 'ddth', 'com'),\n", " ('', 'cjb', 'net'),\n", " ('', 'ludashi', 'com'),\n", " ('xpi', 'searchtabnew', 'com'),\n", " ('dnspod-free', 'mydnspod', 'net'),\n", " ('', 'rol', 'ru'),\n", " ('dl', 'pconline', 'com.cn'),\n", " ('', 'webshieldonline', 'com'),\n", " ('rep', 'ytdownloader', 'com'),\n", " ('start', 'funmoods', 'com')]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Speaking of common instead of 'good' lets look at some\n", "# of the domains that intersect the blacklists\n", "bad_common = set(common_domains).intersection(bad_domains)\n", "print('Number of overlaps: {:d}'.format(len(bad_common)))\n", "list(bad_common)[:10]" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Removing blacklisted domains from common list\n", "So now we're going to remove any blacklisted domains from the common list (Cisco Umbrella list)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "deletable": true, 
"editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original common: 994946\n", "Blacklisted domains removed: 993478\n" ] } ], "source": [ "print('Original common: {:d}'.format(len(common_domains)))\n", "common_domains = list(set(common_domains).difference(bad_domains))\n", "print('Blacklisted domains removed: {:d}'.format(len(common_domains)))" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Create Pandas DataFrames\n", "- **DataFrames** are used in both *R* and *Python*. Pandas has an excellent implementation that really helps when doing any kind of processing, statistics or machine learning work." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Create dataframes\n", "df_bad = pd.DataFrame.from_records(bad_domains, columns=['subdomain', 'domain', 'tld'])\n", "df_bad['label'] = 'bad'\n", "df_common = pd.DataFrame.from_records(common_domains, columns=['subdomain', 'domain', 'tld'])\n", "df_common['label'] = 'common'" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Bad Domains: 179029\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subdomaindomaintldlabel
0wwwqdlhprdtwhvgxuzklovisrdbkhptpfarrbcmtrxbzlvhyg...combad
1advancecomputersonlinebad
2perseeponacombad
3wwwdownloadfriendinfobad
42o9jkm6yfjcentadecombad
\n", "
" ], "text/plain": [ " subdomain domain tld label\n", "0 www qdlhprdtwhvgxuzklovisrdbkhptpfarrbcmtrxbzlvhyg... com bad\n", "1 advancecomputers online bad\n", "2 perseepona com bad\n", "3 www downloadfriend info bad\n", "4 2o9jkm6yfj centade com bad" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print('Bad Domains: {:d}'.format(len(df_bad)))\n", "df_bad.head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Common Domains: 993478\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subdomaindomaintldlabel
0msgappcomcommon
1onlinecitibankco.incommon
2emhapfokdlyxtfgucmjmcxcommon
3assetaffectvcomcommon
4gamerswithjobscomcommon
\n", "
" ], "text/plain": [ " subdomain domain tld label\n", "0 msgapp com common\n", "1 online citibank co.in common\n", "2 emhapfokdlyxtfgucmjm cx common\n", "3 asset affectv com common\n", "4 gamerswithjobs com common" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print('Common Domains: {:d}'.format(len(df_common)))\n", "df_common.head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subdomaindomaintldlabel
0msgappcomcommon
1onlinecitibankco.incommon
2emhapfokdlyxtfgucmjmcxcommon
3assetaffectvcomcommon
4gamerswithjobscomcommon
\n", "
" ], "text/plain": [ " subdomain domain tld label\n", "0 msgapp com common\n", "1 online citibank co.in common\n", "2 emhapfokdlyxtfgucmjm cx common\n", "3 asset affectv com common\n", "4 gamerswithjobs com common" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now that the records have labels on them (bad/common) we can combine them into one DataFrame\n", "df_all = df_common.append(df_bad, ignore_index=True)\n", "df_all.head()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Just the TLDs (for now)\n", "We're going to do some statistics on the TLDs to see how they're distributed between the bad and common domains. The zat python package provides a nice set of functionality for statistics on Pandas DataFrames (https://github.com/SuperCowPowers/zat)." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Run a bunch of statistics from the zat python package\n", "import zat.dataframe_stats as df_stats" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Contingency Table\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelbadcommonAll
tld
ab.ca0.024.024.0
abbott0.02.02.0
abruzzo.it0.01.01.0
ac3.0440.0443.0
ac.ae0.012.012.0
\n", "
" ], "text/plain": [ "label bad common All\n", "tld \n", "ab.ca 0.0 24.0 24.0\n", "abbott 0.0 2.0 2.0\n", "abruzzo.it 0.0 1.0 1.0\n", "ac 3.0 440.0 443.0\n", "ac.ae 0.0 12.0 12.0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print out the contingency_table\n", "print('\\nContingency Table')\n", "cont_table = df_stats.contingency_table(df_all, 'tld', 'label')\n", "cont_table.head()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Expected Counts Table\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelbadcommonAll
tld
ab.ca3.66453820.33546224.0
abbott0.3053781.6946222.0
abruzzo.it0.1526890.8473111.0
ac67.641257375.358743443.0
ac.ae1.83226910.16773112.0
\n", "
" ], "text/plain": [ "label bad common All\n", "tld \n", "ab.ca 3.664538 20.335462 24.0\n", "abbott 0.305378 1.694622 2.0\n", "abruzzo.it 0.152689 0.847311 1.0\n", "ac 67.641257 375.358743 443.0\n", "ac.ae 1.832269 10.167731 12.0" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print out the expected_counts\n", "print('\\nExpected Counts Table')\n", "expect_counts = df_stats.expected_counts(df_all, 'tld', 'label')\n", "expect_counts.head()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "G-Test Scores\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelbadcommon
tld
ab.ca-77
abbott00
abruzzo.it00
ac-121121
ac.ae-33
\n", "
" ], "text/plain": [ "label bad common\n", "tld \n", "ab.ca -7 7\n", "abbott 0 0\n", "abruzzo.it 0 0\n", "ac -121 121\n", "ac.ae -3 3" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print out the g_test scores\n", "print('\\nG-Test Scores')\n", "g_scores = df_stats.g_test_scores(df_all, 'tld', 'label')\n", "g_scores.head()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Sort the GTest Scores\n", "For a formal interpretation of these scores please see (https://en.wikipedia.org/wiki/G-test). Informally, the higher the score the more that item **stands out** from a probability perspective from what the expected counts would be given the *null hypothesis* that the TLDs should occur equally likely in both classes.\n", "\n", "**Example:**\n", "\n", "The **tk** TLD occured about **~6200 times** across both datasets. So because we have about 5x more common domains than bad domains then if all else is equal we should see it about **~950 times** in the bad set and **~5250 times** in the common set. The actual observation is that we see it **6071 times** in the bad set and **only 94 times** in the common set. So seeing a **tk** domain pass through your IDS would definitely be a good thing to put on your **'short list'**.\n", "\n", "**See Expected Counts and Actual Counts Below**" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelbadcommon
tld
info35476-35476
tk21877-21877
xyz20444-20444
online6701-6701
club3252-3252
ru3124-3124
website1923-1923
in1069-1069
ws752-752
top693-693
site614-614
work576-576
biz556-556
name516-516
tech478-478
\n", "
" ], "text/plain": [ "label bad common\n", "tld \n", "info 35476 -35476\n", "tk 21877 -21877\n", "xyz 20444 -20444\n", "online 6701 -6701\n", "club 3252 -3252\n", "ru 3124 -3124\n", "website 1923 -1923\n", "in 1069 -1069\n", "ws 752 -752\n", "top 693 -693\n", "site 614 -614\n", "work 576 -576\n", "biz 556 -556\n", "name 516 -516\n", "tech 478 -478" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sort the GTest scores \n", "g_scores.sort_values('bad', ascending=False).head(15)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Expected Counts:\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelbadcommonAll
tld
club310.5695621723.4304382034.0
info2732.52354515163.47645517896.0
online392.2582132176.7417872569.0
tk941.3280995223.6719016165.0
xyz1216.0157306747.9842707964.0
\n", "
" ], "text/plain": [ "label bad common All\n", "tld \n", "club 310.569562 1723.430438 2034.0\n", "info 2732.523545 15163.476455 17896.0\n", "online 392.258213 2176.741787 2569.0\n", "tk 941.328099 5223.671901 6165.0\n", "xyz 1216.015730 6747.984270 7964.0" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Lets look at some of the TLDs Expected Counts vs. Actual Counts\n", "interesting_tlds = ['info', 'tk', 'xyz', 'online', 'club']\n", "print('Expected Counts:') \n", "expect_counts[expect_counts.index.isin(interesting_tlds)]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Actual Counts:\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelbadcommonAll
tld
club1459.0575.02034.0
info14053.03843.017896.0
online2259.0310.02569.0
tk6071.094.06165.0
xyz6958.01006.07964.0
\n", "
" ], "text/plain": [ "label bad common All\n", "tld \n", "club 1459.0 575.0 2034.0\n", "info 14053.0 3843.0 17896.0\n", "online 2259.0 310.0 2569.0\n", "tk 6071.0 94.0 6165.0\n", "xyz 6958.0 1006.0 7964.0" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print('Actual Counts:')\n", "cont_table[cont_table.index.isin(interesting_tlds)]" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "deletable": true, "editable": true }, "source": [ "## Phase1 Complete\n", "\n", "We can see from the sorted GTest Score table that domains like **info, tk, xyz, club, ...** occur much more often in the blacklists than they do in the common lists (Umbrella/Alexa). So even with this small insight we could set up a Zeek Script (or a **zat Python script**) to mark domains with those TLDs as **risky**.\n", "\n", "In **Phase 2** of this notebook we'll dive into the domains and subdomains. Using NGram extraction and our G-Test statistics on **all** the extracted features to do feature selection for a **sparse data machine learning model**. We'll leverage the fantastic set of models available in the Python **scikit-learn** module and we'll show how to use zat to deploy that model so that new domains coming from Zeek can be evaluated and scored in realtime." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "deletable": true, "editable": true }, "source": [ "# Deployment with zat\n", "Now that we know which TLDs are 'risky' we can take action with zat. See the example [risky_domains.py](https://zat-tools.readthedocs.io/en/latest/examples.html#risky-domains) that uses these results to flag the realtime DNS logs coming from Zeek and makes a Virus Total query on any flagged domains. If the Virus Total query returns positives then we report the observation.\n", "\n", "Although this sounds simplistic it's actually quite effective. The number of VT queries we make is extremely small compared to the total volume of DNS queries and given the statistical results the probably of a 'hit' is reasonably high and of course we're casting a wider net then the original blacklist.\n", "\n", "## Try it Out\n", "If you liked this notebook please visit the [zat](https://github.com/SuperCowPowers/zat) project for more notebooks and examples. You can run all the examples with a simple $pip install zat (and a running Zeek instance of course)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }