{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Exploration of a publicly available dataset.\n", "\n", "\n", "\n", "Data processing, cleaning and normalization is often 95% of the battle. Never underestimate this part of the process, if you're not careful about it your derrière will be sore later. Another good reason to spend a bit of time on understanding your data is that you may realize that the data isn't going to be useful for your task at hand. Quick pruning of fruitless branches is good.\n", "\n", "#### Data as an analogy: Data is almost always a big pile of shit, the only real question is, \"Is there a Pony inside?\" and that's what data exploration and understanding is about. ####\n", "\n", "For this exploration we're going to pull some data from the Malware Domain List website [http://www.malwaredomainlist.com](http://www.malwaredomainlist.com). We'd like to thank them for providing a great resourse and making their data available to the public. In general data is messy so even though we're going to be nit-picking quite a bit, we recognized that many datasets will have similar issues which is why we feel like this is a good 'real world' example of data.\n", "\n", "* Full database: [ http://www.malwaredomainlist.com/mdlcsv.php](http://www.malwaredomainlist.com/mdlcsv.php)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'0.13.1'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This exercise is mostly for us to understand what kind of data we have and then\n", "# run some simple stats on the fields/values in the data. Pandas will be great for that\n", "import pandas as pd\n", "pd.__version__" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Set default figure sizes\n", "pylab.rcParams['figure.figsize'] = (14.0, 5.0)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# This data url can be a web location http://foo.bar.com/mydata.csv or it can be a\n", "# a path to your disk where the data resides /full/path/to/data/mydata.csv\n", "# Note: Be a good web citizen, download the data once and then specify a path to your local file :)\n", "# For instance: > wget http://www.malwaredomainlist.com/mdlcsv.php -O mdl_data.csv\n", "# data_url = 'http://www.malwaredomainlist.com/mdlcsv.php'\n", "data_url = 'data/mdl_data.csv'" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Note: when the data was pulled it didn't have column names, so poking around\n", "# on the website we found the column headers referenced so we're explicitly\n", "# specifying them to the CSV reader:\n", "# date,domain,ip,reverse,description,registrant,asn,inactive,country\n", "dataframe = pd.read_csv(data_url, names=['date','domain','ip','reverse','description',\n", " 'registrant','asn','inactive','country'], header=None, error_bad_lines=False, low_memory=False)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | date | \n", "domain | \n", "ip | \n", "reverse | \n", "description | \n", "registrant | \n", "asn | \n", "inactive | \n", "country | \n", "
---|---|---|---|---|---|---|---|---|---|
0 | \n", "2009/01/01_10:00 | \n", "- | \n", "82.165.74.94 | \n", "kundenserver.de | \n", "several compromised domains redirect to Mebroot | \n", "- | \n", "8560 | \n", "1 | \n", "1 | \n", "
1 | \n", "2009/01/01_10:00 | \n", "saudieng.net | \n", "63.245.196.12 | \n", "- | \n", "compromised site/redirects to Mebroot | \n", "privacy protected | \n", "- | \n", "1 | \n", "1 | \n", "
2 | \n", "2009/01/01_10:00 | \n", "thick-click.com | \n", "74.52.59.66 | \n", "gator126.hostgator.com. | \n", "compromised site/redirects to Mebroot | \n", "Alvin slap30_1@juno.com | \n", "21844 | \n", "1 | \n", "1 | \n", "
3 | \n", "2009/01/01_10:00 | \n", "webfo.biz/fxid1.txt | \n", "69.89.27.211 | \n", "box211.bluehost.com | \n", "RFI | \n", "Anthony Stebbing stebbing@omen-designs.com | \n", "11798 | \n", "1 | \n", "1 | \n", "
4 | \n", "2009/01/01_10:00 | \n", "www.xtipp.hu/index.php_ | \n", "195.70.48.68 | \n", "s5.mediacenter.hu | \n", "compromised site/redirects to Mebroot | \n", "Wolf Center Kft | \n", "8358 | \n", "1 | \n", "1 | \n", "
5 rows × 9 columns
\n", "\n", " | date | \n", "domain | \n", "ip | \n", "reverse | \n", "description | \n", "registrant | \n", "asn | \n", "inactive | \n", "country | \n", "
---|---|---|---|---|---|---|---|---|---|
88009 | \n", "2014/05/09_15:10 | \n", "assurances-immobilier.com/111 | \n", "213.186.33.4 | \n", "cluster003.ovh.net. | \n", "Trojan.Zeus.GameOver | \n", "Valette Frederic / yany.valette@wanadoo.fr | \n", "16276 | \n", "0 | \n", "0 | \n", "
88010 | \n", "2014/05/02_06:37 | \n", "www.adesse-anwaltskanzlei.de/kunde729/ | \n", "81.169.222.233 | \n", "h2288326.stratoserver.net. | \n", "exploit kit | \n", "hostmaster@one.com | \n", "6724 | \n", "0 | \n", "0 | \n", "
88011 | \n", "2014/05/02_06:42 | \n", "thibautoskian.com/redesigns/quartet | \n", "82.165.211.214 | \n", "kundenserver.de. | \n", "Zbot trojan | \n", "Registrar Abuse Contact abuse@1and1.com | \n", "8560 | \n", "0 | \n", "0 | \n", "
88012 | \n", "2014/05/01_00:23 | \n", "antiviru.ru/download?midlet_id=3253&midlet_nam... | \n", "91.202.63.75 | \n", "- | \n", "Android.Trojan | \n", "- | \n", "44571 | \n", "0 | \n", "0 | \n", "
88013 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
5 rows × 9 columns
\n", "\n", " | date | \n", "domain | \n", "ip | \n", "reverse | \n", "description | \n", "registrant | \n", "asn | \n", "inactive | \n", "country | \n", "
---|---|---|---|---|---|---|---|---|---|
2 | \n", "2009/01/01_10:00 | \n", "thick-click.com | \n", "74.52.59.66 | \n", "gator126.hostgator.com. | \n", "compromised site/redirects to mebroot | \n", "alvin slap30_1@juno.com | \n", "21844 | \n", "1 | \n", "1 | \n", "
3 | \n", "2009/01/01_10:00 | \n", "webfo.biz/fxid1.txt | \n", "69.89.27.211 | \n", "box211.bluehost.com | \n", "rfi | \n", "anthony stebbing stebbing@omen-designs.com | \n", "11798 | \n", "1 | \n", "1 | \n", "
4 | \n", "2009/01/01_10:00 | \n", "www.xtipp.hu/index.php_ | \n", "195.70.48.68 | \n", "s5.mediacenter.hu | \n", "compromised site/redirects to mebroot | \n", "wolf center kft | \n", "8358 | \n", "1 | \n", "1 | \n", "
5 | \n", "2009/01/02_00:00 | \n", "epeiy.com/wssl713fro.exe | \n", "68.180.151.74 | \n", "p2p.geo.vip.sp1.yahoo.com | \n", "backdoor.win32.keystart.m | \n", "epeiy@yahoo.com | \n", "36752 | \n", "1 | \n", "1 | \n", "
6 | \n", "2009/01/03_00:00 | \n", "har5launo.com/cgi-bin/index.cgi?dx | \n", "74.213.167.191 | \n", "74-213-167-191.ultrahosting.com | \n", "exploits/mebroot | \n", "prokofyev yaroslav weksya@gmail.ru | \n", "23136 | \n", "1 | \n", "1 | \n", "
5 rows × 9 columns
\n", "\n", " \"Ahhh, what an awful dream. Ones and zeroes everywhere... and I thought I saw a two [shudder].\" \n", " -- Bender\n", " \"It was just a dream, Bender. There's no such thing as two\". \n", " -- Fry\n", "" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 1., 1.],\n", " [ 1., 1.]])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Using numpy.corrcoef to compute the correlation coefficient matrix\n", "np.corrcoef(dataframe[\"inactive\"], dataframe[\"country\"])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | inactive | \n", "country | \n", "
---|---|---|
inactive | \n", "1 | \n", "1 | \n", "
country | \n", "1 | \n", "1 | \n", "
2 rows × 2 columns
\n", "