{ "metadata": { "name": "", "signature": "sha256:0ae8f2b10d9ac7e310131af2ebf507397abfa8c8949d986b8fd40f7fc174f57e" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tricks collected from\n", "- [workshop](http://nbviewer.ipython.org/github/nealcaren/workshop_2014/tree/master/notebooks/)\n", "- [DGA (Domain Generation Algorithm) detection](http://nbviewer.ipython.org/github/ClickSecurity/data_hacking/blob/master/dga_detection/DGA_Domain_Detection.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Libraries for text and scraping\n", "1. requests (web scraping)\n", "2. mechanize (web scraping)\n", "3. BeautifulSoup (HTML parsing and cleaning)\n", "4. database (database interface for lazy people)\n", "5. chardet (character encoding detector)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Discard non-ASCII characters from your text" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import chardet # alternatively, run !file on a POSIX system\n", "s = \"hello world \u01e5ood day. Let`\u00c7hange the world!\"\n", "print chardet.detect(s)\n", "\n", "s ## show s as raw bytes" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "{'confidence': 0.7996636550693685, 'encoding': 'ISO-8859-2'}\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 1, "text": [ "'hello world \\xc7\\xa5ood day. Let`\\xc3\\x87hange the world!'" ] } ], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "## decode the bytes using the encoding suggested by chardet\n", "## then encode as ascii with errors='ignore' to drop every non-ascii character\n", "clean_s = s.decode(\"ISO-8859-2\").encode(\"ascii\", \"ignore\")\n", "clean_s" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "'hello world ood day. 
Let`hange the world!'" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Meaning of '?' in regular expressions\n", "- placed after a quantifier such as `*` or `+`, it switches the default greedy match to non-greedy (lazy) mode\n", "- intuitively, '?' means: keep matching only until the FIRST ..." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import re\n", "s = 'google, Yahoo!'\n", "print re.findall(r'google, \n", "
\n", " | rank | uri | domain | type
---|---|---|---|---
0 | 1 | facebook.com | facebook | legit
1 | 2 | google.com | google | legit
2 | 3 | youtube.com | youtube | legit
3 | 4 | yahoo.com | yahoo | legit
4 | 5 | baidu.com | baidu | legit
5 rows \u00d7 4 columns
\n", "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ " rank uri domain type\n", "0 1 facebook.com facebook legit\n", "1 2 google.com google legit\n", "2 3 youtube.com youtube legit\n", "3 4 yahoo.com yahoo legit\n", "4 5 baidu.com baidu legit\n", "\n", "[5 rows x 4 columns]" ] } ], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "dga_df = pd.read_csv('data/dga_domains.txt', names = ['raw_domain'], header = None, encoding='utf-8')\n", "dga_df['domain'] = map(lambda uri: uri.lower().split(\".\")[0].strip(), dga_df.raw_domain)\n", "dga_df['type'] = 'dga'\n", "dga_df = dga_df.dropna().drop_duplicates()\n", "\n", "print dga_df.shape" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "(2669, 3)\n" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "dga_df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", " | raw_domain | \n", "domain | \n", "type | \n", "
---|---|---|---
0 | 04055051be412eea5a61b7da8438be3d.info | 04055051be412eea5a61b7da8438be3d | dga
1 | 1cb8a5f36f.info | 1cb8a5f36f | dga
2 | 30acd347397c34fc273e996b22951002.org | 30acd347397c34fc273e996b22951002 | dga
3 | 336c986a284e2b3bc0f69f949cb437cb.info | 336c986a284e2b3bc0f69f949cb437cb | dga
4 | 336c986a284e2b3bc0f69f949cb437cb.org | 336c986a284e2b3bc0f69f949cb437cb | dga
5 rows \u00d7 3 columns
\n", "\n", " | domain | type
---|---|---
0 | facebook | legit
1 | google | legit
2 | youtube | legit
3 | yahoo | legit
4 | baidu | legit
5 | wikipedia | legit
6 | amazon | legit
7 | live | legit
8 | | legit
9 | taobao | legit
102495 | xcfwwghb | dga
102496 | xcgqdfyrkgihlrmfmfib | dga
102497 | xclqwzcfcx | dga
102498 | xcpfxzuf | dga
102499 | xcvxhxze | dga
102500 | xdbrbsbm | dga
102501 | xdfjryydcfwvkvui | dga
102502 | xdjlvcgw | dga
102503 | xdrmjeu | dga
102504 | xflrjyyjswoatsoq | dga
20 rows \u00d7 2 columns
\n", "