{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "\n",
      "<div style=\"float: right; margin: 0px 0px 0px 0px\"><img src=\"files/images/workbench.jpg\" width=\"300px\"></div>\n",
      "## PE File Similarity Graph using Workbench:\n",
      "\n",
      "Here we're using the term Similarity Graph to mean a graph where the nodes are entities (PE Files in this case) and the edges are relationships between the nodes based on similar attributes. See [Semantic Network](http://en.wikipedia.org/wiki/Semantic_network) for more information.\n",
      "\n",
      "Workbench can be setup to utilize several indexers:\n",
      "\n",
      "- Straight up Indexing with [ElasticSearch](http://www.elasticsearch.org) \n",
      "- Super awesome [Neo4j](http://www.neo4j.org) as both an indexer and graph database.\n",
      "\n",
      "Neo4j also incorporates Lucene based indexing so not only can you capture a rich set of relationships between your data entities but searches and queries are super quick.\n",
      "\n",
      "<div style=\"float: right; margin: 0px 0px 0px 30px\"><img src=\"files/images/pe_graph.png\" width=\"400px\"></div>\n",
      "## Lets start up the workbench server...\n",
      "Run the workbench server (from somewhere, for the demo we're just going to start a local one)\n",
      "<pre>\n",
      "$ workbench_server\n",
      "</pre>\n",
      "\n",
      "#### Okay so when the server starts up, it autoloads any worker plugins in the server/worker directory and dynamically monitors the directory, if a new python file shows up, it's validated as a properly formed plugin and if it passes is added to the list of workers."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Lets start to interact with workbench, please note there is NO specific client to workbench,\n",
      "# Just use the ZeroRPC Python, Node.js, or CLI interfaces.\n",
      "import zerorpc\n",
      "c = zerorpc.Client()\n",
      "c.connect(\"tcp://127.0.0.1:4242\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 2,
       "text": [
        "[None]"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Load in 100 PE Files\n",
      "def workbench_load(file_list):\n",
      "    md5_list = []\n",
      "    for filename in file_list:\n",
      "        with open(filename,'rb') as f:\n",
      "            md5_list.append(c.store_sample(f.read(), filename, 'exe'))\n",
      "    print 'Files loaded: %d' % len(md5_list)\n",
      "    return md5_list\n",
      "\n",
      "import os\n",
      "file_list = [os.path.join('../data/pe/bad', child) for child in os.listdir('../data/pe/bad')]\n",
      "md5s_bad = workbench_load(file_list)\n",
      "file_list = [os.path.join('../data/pe/good', child) for child in os.listdir('../data/pe/good')]\n",
      "md5s_good = workbench_load(file_list)\n",
      "md5_list = md5s_bad + md5s_good\n",
      "md5_list[:5]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Files loaded: 50\n",
        "Files loaded: 50"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 3,
       "text": [
        "['033d91aae8ad29ed9fbb858179271232',\n",
        " '0cb9aa6fb9c4aa3afad7a303e21ac0f3',\n",
        " '0e882ec9b485979ea84c7843d41ba36f',\n",
        " '0e8b030fb6ae48ffd29e520fc16b5641',\n",
        " '0eb9e990c521b30428a379700ec5ab3e']"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<div style=\"float: left; margin: 0px 30px 0px 0px\"><img src=\"files/images/precise.jpg\" width=\"200px\"></div>\n",
      "### Notice the level on control on our batch operations\n",
      "#### Your database may have tons of files of different types. We literally control execution on the per sample level with md5 lists. Alternatively we can specify specific types or simply make a query to the database get exactly what we want and build our own md5 list.\n",
      "\n",
      "#### Also notice that we can specify ^exactly^ what data we want down to arbitrary depth.. here we want just the imported_symbols from the sparse features from the pe_features worker."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Compute pe_features on all files of type pe, just pull back the sparse features\n",
      "imports = c.batch_work_request('pe_features', {'md5_list': md5_list, 'subkeys':['md5','sparse_features.imported_symbols']})\n",
      "imports"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 4,
       "text": [
        "<generator object iterator at 0x106074cd0>"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<div style=\"float: left; margin: 0px 0px 0px 30px\"><img src=\"files/images/head_explode.jpg\" width=\"350px\"></div>\n",
      "## Holy s#@&! The server batch request returned a generator?\n",
      "#### Yes generators are awesome but getting one from a server request! Are u serious?!  Yes, dead serious.. like chopping off your head and kicking your body into a shallow grave and putting your head on a stick... serious.\n",
      "\n",
      "#### For more on client/server generators and client-contructed/server-executed generator pipelines see our super spiffy [Generator Pipelines](http://nbviewer.ipython.org/url/raw.github.com/SuperCowPowers/workbench/master/workbench/notebooks/Generator_Pipelines.ipynb) notebook.\n",
      "\n",
      "#### Now that we have the a server generator from workbench we can push it into a Pandas Dataframe without a copy, fast and memory efficient..."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<div style=\"float: right; padding: 0px 0px 0px 30px;\"><img src=\"files/images/transformers.png\" width=\"200px\"></div>\n",
      "### Data Transformation: One line of code to transform workbench output into a Pandas Dataframe!\n",
      "** Putting your data into a Pandas Dataframe opens up a new world and enables tons of functionality for data, temporal and statistical analysis. **\n",
      "<pre>df = pd.DataFrame(imports)</pre>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sending generator output into a Pandas Dataframe constructor\n",
      "import pandas as pd\n",
      "df_imports = pd.DataFrame(imports)\n",
      "df_imports.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>imported_symbols</th>\n",
        "      <th>md5</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> [kernel32.dll:name=getenvironmentvariablew, ke...</td>\n",
        "      <td> 033d91aae8ad29ed9fbb858179271232</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> [mfc42.dll:ordinal=2514, mfc42.dll:ordinal=516...</td>\n",
        "      <td> 0cb9aa6fb9c4aa3afad7a303e21ac0f3</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> [msvbvm60.dll:ordinal=588 bound=1923206285, ms...</td>\n",
        "      <td> 0e882ec9b485979ea84c7843d41ba36f</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> [wsock32.dll:name=wsastartup, wsock32.dll:name...</td>\n",
        "      <td> 0e8b030fb6ae48ffd29e520fc16b5641</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> [user32.dll:name=getwindow bound=2110180096, u...</td>\n",
        "      <td> 0eb9e990c521b30428a379700ec5ab3e</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>5 rows \u00d7 2 columns</p>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 5,
       "text": [
        "                                    imported_symbols  \\\n",
        "0  [kernel32.dll:name=getenvironmentvariablew, ke...   \n",
        "1  [mfc42.dll:ordinal=2514, mfc42.dll:ordinal=516...   \n",
        "2  [msvbvm60.dll:ordinal=588 bound=1923206285, ms...   \n",
        "3  [wsock32.dll:name=wsastartup, wsock32.dll:name...   \n",
        "4  [user32.dll:name=getwindow bound=2110180096, u...   \n",
        "\n",
        "                                md5  \n",
        "0  033d91aae8ad29ed9fbb858179271232  \n",
        "1  0cb9aa6fb9c4aa3afad7a303e21ac0f3  \n",
        "2  0e882ec9b485979ea84c7843d41ba36f  \n",
        "3  0e8b030fb6ae48ffd29e520fc16b5641  \n",
        "4  0eb9e990c521b30428a379700ec5ab3e  \n",
        "\n",
        "[5 rows x 2 columns]"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<div style=\"float: left; margin: 0px 0px 0px 0px\"><img src=\"files/images/confused.jpg\" width=\"350px\"></div>\n",
      "\n",
      "## So I'm confused what just happened? \n",
      "\n",
      "- We sent 100 PE files to the workbench server\n",
      "- Workbench stores them in a database (MongoDB now, maybe Vertica later)\n",
      "- We executed a precise batch operation where we asked for a specific worker\n",
      "- We pulled back a tiny piece of the output that we specifically wanted\n",
      "- We got a client/server generator, we populated a Pandas dataframe (see Transformer)\n",
      "\n",
      "<div style=\"float: right; margin: -70px 0px 0px 30px\"><img src=\"files/images/mongo_data.png\" width=\"800px\"></div>\n",
      "<br>\n",
      "** Image below shows the Workbench database, each worker stores data in a separate collection. The data is transparent, organized and accessible **\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Okay so we have lots of PE File attributes that we might want to look at lets do a bunch\n",
      "# Note: We're invoking a couple of new workers: strings and pe_peid\n",
      "\n",
      "# Compute pe_features on all files of type pe, just pull back the sparse features\n",
      "df_warnings = pd.DataFrame(c.batch_work_request('pe_features', {'type_tag': 'exe', 'subkeys':['md5','sparse_features.pe_warning_strings']}))\n",
      "df_warnings.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>md5</th>\n",
        "      <th>pe_warning_strings</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> 090a189f4eeb3c0b76e97acdb1a71c92</td>\n",
        "      <td>                                                []</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> 093dee8d97fd9d35884ed52179b3d142</td>\n",
        "      <td> [Suspicious flags set for section 5. Both IMAG...</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> 0dd74786d22edff0ce5b8e1b1e398618</td>\n",
        "      <td>                                                []</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> 10328f92e7ec8735ea7846bf2c8254c2</td>\n",
        "      <td>                                                []</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> 12fd4ef8f2cbbf98e0a5ced88258ddf3</td>\n",
        "      <td>                                                []</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>5 rows \u00d7 2 columns</p>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 6,
       "text": [
        "                                md5  \\\n",
        "0  090a189f4eeb3c0b76e97acdb1a71c92   \n",
        "1  093dee8d97fd9d35884ed52179b3d142   \n",
        "2  0dd74786d22edff0ce5b8e1b1e398618   \n",
        "3  10328f92e7ec8735ea7846bf2c8254c2   \n",
        "4  12fd4ef8f2cbbf98e0a5ced88258ddf3   \n",
        "\n",
        "                                  pe_warning_strings  \n",
        "0                                                 []  \n",
        "1  [Suspicious flags set for section 5. Both IMAG...  \n",
        "2                                                 []  \n",
        "3                                                 []  \n",
        "4                                                 []  \n",
        "\n",
        "[5 rows x 2 columns]"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Compute strings on all files of type pe, just pull back the string_list\n",
      "df_strings = pd.DataFrame(c.batch_work_request('strings', {'type_tag': 'exe', 'subkeys':['md5','string_list']}))\n",
      "df_strings.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>md5</th>\n",
        "      <th>string_list</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> 090a189f4eeb3c0b76e97acdb1a71c92</td>\n",
        "      <td> [!This program cannot be run in DOS mode., Ric...</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> 093dee8d97fd9d35884ed52179b3d142</td>\n",
        "      <td> [!This program cannot be run in DOS mode., r\\R...</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> 0dd74786d22edff0ce5b8e1b1e398618</td>\n",
        "      <td> [!This program cannot be run in DOS mode., qHy...</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> 10328f92e7ec8735ea7846bf2c8254c2</td>\n",
        "      <td> [!This program cannot be run in DOS mode., .te...</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> 12fd4ef8f2cbbf98e0a5ced88258ddf3</td>\n",
        "      <td> [!This program cannot be run in DOS mode., qHy...</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>5 rows \u00d7 2 columns</p>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 7,
       "text": [
        "                                md5  \\\n",
        "0  090a189f4eeb3c0b76e97acdb1a71c92   \n",
        "1  093dee8d97fd9d35884ed52179b3d142   \n",
        "2  0dd74786d22edff0ce5b8e1b1e398618   \n",
        "3  10328f92e7ec8735ea7846bf2c8254c2   \n",
        "4  12fd4ef8f2cbbf98e0a5ced88258ddf3   \n",
        "\n",
        "                                         string_list  \n",
        "0  [!This program cannot be run in DOS mode., Ric...  \n",
        "1  [!This program cannot be run in DOS mode., r\\R...  \n",
        "2  [!This program cannot be run in DOS mode., qHy...  \n",
        "3  [!This program cannot be run in DOS mode., .te...  \n",
        "4  [!This program cannot be run in DOS mode., qHy...  \n",
        "\n",
        "[5 rows x 2 columns]"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Compute pe_peid on all files of type pe, just pull back the match_list\n",
      "df_peids = pd.DataFrame(c.batch_work_request('pe_peid', {'type_tag': 'exe', 'subkeys':['md5','match_list']}))\n",
      "df_peids.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>match_list</th>\n",
        "      <th>md5</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td>                [Microsoft Visual C++ 8]</td>\n",
        "      <td> 090a189f4eeb3c0b76e97acdb1a71c92</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td>                                      []</td>\n",
        "      <td> 093dee8d97fd9d35884ed52179b3d142</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td>                [Microsoft Visual C++ 8]</td>\n",
        "      <td> 0dd74786d22edff0ce5b8e1b1e398618</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> [Microsoft Visual C# v7.0 / Basic .NET]</td>\n",
        "      <td> 10328f92e7ec8735ea7846bf2c8254c2</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td>                [Microsoft Visual C++ 8]</td>\n",
        "      <td> 12fd4ef8f2cbbf98e0a5ced88258ddf3</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>5 rows \u00d7 2 columns</p>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 8,
       "text": [
        "                                match_list                               md5\n",
        "0                 [Microsoft Visual C++ 8]  090a189f4eeb3c0b76e97acdb1a71c92\n",
        "1                                       []  093dee8d97fd9d35884ed52179b3d142\n",
        "2                 [Microsoft Visual C++ 8]  0dd74786d22edff0ce5b8e1b1e398618\n",
        "3  [Microsoft Visual C# v7.0 / Basic .NET]  10328f92e7ec8735ea7846bf2c8254c2\n",
        "4                 [Microsoft Visual C++ 8]  12fd4ef8f2cbbf98e0a5ced88258ddf3\n",
        "\n",
        "[5 rows x 2 columns]"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<div style=\"float: right; margin: 0px 0px 0px 30px\"><img src=\"files/images/similarity.jpg\" width=\"350px\"></div>\n",
      "# Computing Similarity between PE Files\n",
      "There are a myriad of approaches for computing similarity between PE Files, we're arbitrarily choosing two:\n",
      "\n",
      "- SSDeep: computes context triggered piecewise hashes (CTPH) which can match inputs that have homologies.\n",
      "    - [SSDeep Sourceforge](http://ssdeep.sourceforge.net/)\n",
      "\n",
      "- Jaccard Index: a set based distance metric (overlap in element sets)  \n",
      "    - [Jaccard Index](http://en.wikipedia.org/wiki/Jaccard_index)\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# For the first approach workbench already has a worker that does SSDeep Sims\n",
      "ssdeep = pd.DataFrame(c.batch_work_request('pe_deep_sim', {'type_tag': 'exe'}))\n",
      "ssdeep.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>md5</th>\n",
        "      <th>sim_list</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> 090a189f4eeb3c0b76e97acdb1a71c92</td>\n",
        "      <td>                                                []</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> 093dee8d97fd9d35884ed52179b3d142</td>\n",
        "      <td>                                                []</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> 0dd74786d22edff0ce5b8e1b1e398618</td>\n",
        "      <td> [{u'sim': 65, u'md5': u'e0b173f23d873286169995...</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> 10328f92e7ec8735ea7846bf2c8254c2</td>\n",
        "      <td>                                                []</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> 12fd4ef8f2cbbf98e0a5ced88258ddf3</td>\n",
        "      <td>                                                []</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>5 rows \u00d7 2 columns</p>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 9,
       "text": [
        "                                md5  \\\n",
        "0  090a189f4eeb3c0b76e97acdb1a71c92   \n",
        "1  093dee8d97fd9d35884ed52179b3d142   \n",
        "2  0dd74786d22edff0ce5b8e1b1e398618   \n",
        "3  10328f92e7ec8735ea7846bf2c8254c2   \n",
        "4  12fd4ef8f2cbbf98e0a5ced88258ddf3   \n",
        "\n",
        "                                            sim_list  \n",
        "0                                                 []  \n",
        "1                                                 []  \n",
        "2  [{u'sim': 65, u'md5': u'e0b173f23d873286169995...  \n",
        "3                                                 []  \n",
        "4                                                 []  \n",
        "\n",
        "[5 rows x 2 columns]"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# For the second approach we need to do a bit more work\n",
      "# Here we setup a convenience function that takes a sparse feature list\n",
      "# and computes pair wise similarities between each item in the list\n",
      "def jaccard_sims(feature_df, name, thres):\n",
      "    md5s = feature_df['md5'].tolist()\n",
      "    features = feature_df[name].tolist()\n",
      "    sim_info_list = []\n",
      "    for md5_source, features_source in zip(md5s, features):\n",
      "        for md5_target, features_target in zip(md5s, features):\n",
      "            if md5_source == md5_target: continue\n",
      "            sim = jaccard_sim(features_source, features_target)\n",
      "            if sim > thres:\n",
      "                sim_info_list.append({'source':md5_source, 'target':md5_target, 'sim':sim})\n",
      "    return sim_info_list\n",
      "\n",
      "def jaccard_sim(features1, features2):\n",
      "    ''' Compute similarity between two sets using Jaccard similarity '''\n",
      "    set1 = set(features1)\n",
      "    set2 = set(features2)\n",
      "    try:\n",
      "        return len(set1.intersection(set2))/float(max(len(set1),len(set2)))\n",
      "    except ZeroDivisionError:\n",
      "        return 0"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<div style=\"float: right; margin: 0px 0px 0px 30px\"><img src=\"files/images/pe_graph.png\" width=\"500px\"></div>\n",
      "# Building our similarity graph\n",
      "Here we're using the term Similarity Graph to mean a graph where the nodes are entities (PE Files in this case) and the edges are relationships between the nodes based on similar attributes. See [Semantic Network](http://en.wikipedia.org/wiki/Semantic_network) for more information.\n",
      "\n",
      "Here we're using the super awesome [Neo4j](http://www.neo4j.org) as both an indexer and graph database.\n",
      "\n",
      "Neo4j also incorporates Lucene based indexing so not only can you capture a rich set of relationships between your data entities but searches and queries are super quick.\n",
      "\n",
      "Note: All images were captured by simply going to http://localhost:7474/browser/ (Neo4j Browser) and making some queries."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# First just add all the nodes\n",
      "for md5 in md5s_bad:\n",
      "    c.add_node(md5, md5[:6], ['exe','bad'])\n",
      "for md5 in md5s_good:\n",
      "    c.add_node(md5, md5[:6], ['exe','good'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Store the ssdeep sims as relationships\n",
      "for i, row in ssdeep.iterrows():\n",
      "    for sim_info in row['sim_list']:\n",
      "        c.add_rel(row['md5'], sim_info['md5'], 'ssdeep')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## SSDeep similarity graph\n",
      "<div style=\"margin: 0px 0px 0px 80px\"><img src=\"files/images/ssdeep.png\" width=\"800px\"></div>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Compute the Jaccard Index between imported systems and store as relationships\n",
      "sims = jaccard_sims(df_imports, 'imported_symbols', .8)\n",
      "for sim_info in sims:\n",
      "    c.add_rel(sim_info['source'], sim_info['target'], 'imports')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Import Symbols (80% or greater) similarity graph\n",
      "<div style=\"margin: 0px 0px 0px 80px\"><img src=\"files/images/imports.png\" width=\"800px\"></div>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Compute the Jaccard Index between warnings and store as relationships\n",
      "sims = jaccard_sims(df_warnings, 'pe_warning_strings', .5)\n",
      "for sim_info in sims:\n",
      "    c.add_rel(sim_info['source'], sim_info['target'], 'warnings')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 15
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Compute the Jaccard Index between strings and store as relationships\n",
      "sims = jaccard_sims(df_strings, 'string_list', .7)\n",
      "for sim_info in sims:\n",
      "    c.add_rel(sim_info['source'], sim_info['target'], 'strings')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 16
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Removing PE IDs that show Microsoft Visual C\n",
      "criterion = df_peids['match_list'].map(lambda x:any('Microsoft Visual C' not in y for y in x))\n",
      "df_peids = df_peids[criterion]\n",
      "\n",
      "# Compute the Jaccard Index between peids and store as relationships\n",
      "sims = jaccard_sims(df_peids, 'match_list', .5)\n",
      "for sim_info in sims:\n",
      "    c.add_rel(sim_info['source'], sim_info['target'], 'peids')\n",
      "print df_peids['match_list']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "7                       [Installer VISE Custom]\n",
        "13                  [Safeguard 1.03 -> Simonzh]\n",
        "26                   [Borland Delphi 3.0 (???)]\n",
        "27                   [Borland Delphi 3.0 (???)]\n",
        "36                   [Borland Delphi 3.0 (???)]\n",
        "52         [Microsoft Visual Basic v5.0 - v6.0]\n",
        "58                             [Armadillo v4.x]\n",
        "65                    [UPX v1.25 (Delphi) Stub]\n",
        "66                   [Borland Delphi 3.0 (???)]\n",
        "68                   [Borland Delphi 3.0 (???)]\n",
        "71                [Microsoft Visual Basic v5.0]\n",
        "73                  [Safeguard 1.03 -> Simonzh]\n",
        "75                   [Borland Delphi 3.0 (???)]\n",
        "76             [UPX -> www.upx.sourceforge.net]\n",
        "79       [BobSoft Mini Delphi -> BoB / BobSoft]\n",
        "81    [UPX v0.71 - v0.72, tElock v0.7x - v0.84]\n",
        "82                         [Borland Delphi 4.0]\n",
        "84                [Pack Master v1.0, PEX v0.99]\n",
        "85                    [UPX v1.25 (Delphi) Stub]\n",
        "94                                 [Dev-C++ v5]\n",
        "97                              [ASPack v1.06b]\n",
        "99                      [Upack v0.399 -> Dwing]\n",
        "Name: match_list, dtype: object\n"
       ]
      }
     ],
     "prompt_number": 33
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## All relationship similarity graph\n",
      "<div style=\"margin: 0px 0px 0px 80px\"><img src=\"files/images/graph_all.png\" width=\"800px\"></div>\n",
      "\n",
      "### Things of interest\n",
      "#### Good-Bad pair: Both Safeguard 1.03 -> Simonzh packer (peid relationship)\n",
      "#### Good-Bad cluster: All PEs had the same pe_file warning: \n",
      "\n",
      "- Section N: Both IMAGE_SCN_MEM_WRITE and IMAGE_SCN_MEM_EXECUTE are set\n",
      "- Note: None of them got were flagged as packed by PEID\n",
      "\n",
      "#### Bigger Bad cluster: All had imports that looked a lot like this...\n",
      "\n",
      "- kernel32.dll:name=virtualalloc\n",
      "- gdi32.dll:name=getobjecta\n",
      "- kernel32.dll:name=virtualprotect\n",
      "- kernel32.dll:name=virtualfree\n",
      "- msvcrt.dll:name=rand\n",
      "- kernel32.dll:name=getprocaddress\n",
      "- kernel32.dll:name=loadlibrarya\n",
      "- user32.dll:name=getfocus\n",
      "- ole32.dll:name=olerun"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Now run some graph queries against Neo4j\n",
      "from py2neo import neo4j\n",
      "graph_db = neo4j.GraphDatabaseService()\n",
      "query = neo4j.CypherQuery(graph_db, \"match (n:bad)-[r]-(m:good) return n.md5, labels(n),type(r), m.md5, labels(m)\")\n",
      "for record in query.stream():\n",
      "    v = record.values\n",
      "    print '%s(%s) ---%s--> %s(%s)' %(v[0],v[1],v[2],v[3],v[4])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2d09ca902990545fec9ac190b0338b50([u'bad', u'exe']) ---peids--> 41636f77ad6d9a396ea34e4786b96f2b([u'exe', u'good'])\n",
        "2d09ca902990545fec9ac190b0338b50([u'bad', u'exe']) ---peids--> 41636f77ad6d9a396ea34e4786b96f2b([u'exe', u'good'])\n",
        "1cea13cf888cd8ce4f869029f1dbb601([u'bad', u'exe']) ---warnings--> 52744454c74fac9fcc8f5efb6418c9b4([u'exe', u'good'])\n",
        "d94da41e7e809f7366971b3b50f8ef68([u'bad', u'exe']) ---warnings--> 52744454c74fac9fcc8f5efb6418c9b4([u'exe', u'good'])\n",
        "1cea13cf888cd8ce4f869029f1dbb601([u'bad', u'exe']) ---warnings--> 52744454c74fac9fcc8f5efb6418c9b4([u'exe', u'good'])\n",
        "d94da41e7e809f7366971b3b50f8ef68([u'bad', u'exe']) ---warnings--> 52744454c74fac9fcc8f5efb6418c9b4([u'exe', u'good'])\n",
        "1cea13cf888cd8ce4f869029f1dbb601([u'bad', u'exe']) ---warnings--> 5caea70e05a942e9ee9e02d178a28b1a([u'exe', u'good'])\n",
        "d94da41e7e809f7366971b3b50f8ef68([u'bad', u'exe']) ---warnings--> 5caea70e05a942e9ee9e02d178a28b1a([u'exe', u'good'])\n",
        "1cea13cf888cd8ce4f869029f1dbb601([u'bad', u'exe']) ---warnings--> 5caea70e05a942e9ee9e02d178a28b1a([u'exe', u'good'])\n",
        "d94da41e7e809f7366971b3b50f8ef68([u'bad', u'exe']) ---warnings--> 5caea70e05a942e9ee9e02d178a28b1a([u'exe', u'good'])\n",
        "1cea13cf888cd8ce4f869029f1dbb601([u'bad', u'exe']) ---warnings--> 73b459178a48657d5e92c41ec1fdd716([u'exe', u'good'])\n",
        "d94da41e7e809f7366971b3b50f8ef68([u'bad', u'exe']) ---warnings--> 73b459178a48657d5e92c41ec1fdd716([u'exe', u'good'])\n",
        "1cea13cf888cd8ce4f869029f1dbb601([u'bad', u'exe']) ---warnings--> 73b459178a48657d5e92c41ec1fdd716([u'exe', u'good'])\n",
        "d94da41e7e809f7366971b3b50f8ef68([u'bad', u'exe']) ---warnings--> 73b459178a48657d5e92c41ec1fdd716([u'exe', u'good'])\n",
        "2d094b6c69020091b68d1bcf5d11fa4b([u'bad', u'exe']) ---peids--> 73b459178a48657d5e92c41ec1fdd716([u'exe', u'good'])\n",
        "2d09546831b17d2cc0583362b6d312ae([u'bad', u'exe']) ---peids--> 73b459178a48657d5e92c41ec1fdd716([u'exe', u'good'])\n",
        "2d09cc92bbe29d96bb3a91b350d1725f([u'bad', u'exe']) ---peids--> 73b459178a48657d5e92c41ec1fdd716([u'exe', u'good'])\n",
        "2d094b6c69020091b68d1bcf5d11fa4b([u'bad', u'exe']) ---peids--> 73b459178a48657d5e92c41ec1fdd716([u'exe', u'good'])\n",
        "2d09546831b17d2cc0583362b6d312ae([u'bad', u'exe']) ---peids--> 73b459178a48657d5e92c41ec1fdd716([u'exe', u'good'])\n",
        "2d09cc92bbe29d96bb3a91b350d1725f([u'bad', u'exe']) ---peids--> 73b459178a48657d5e92c41ec1fdd716([u'exe', u'good'])\n",
        "2d094b6c69020091b68d1bcf5d11fa4b([u'bad', u'exe']) ---peids--> 74855d03ee3999e56b785a33b956245d([u'exe', u'good'])\n",
        "2d09546831b17d2cc0583362b6d312ae([u'bad', u'exe']) ---peids--> 74855d03ee3999e56b785a33b956245d([u'exe', u'good'])\n",
        "2d09cc92bbe29d96bb3a91b350d1725f([u'bad', u'exe']) ---peids--> 74855d03ee3999e56b785a33b956245d([u'exe', u'good'])\n",
        "2d094b6c69020091b68d1bcf5d11fa4b([u'bad', u'exe']) ---peids--> 74855d03ee3999e56b785a33b956245d([u'exe', u'good'])\n",
        "2d09546831b17d2cc0583362b6d312ae([u'bad', u'exe']) ---peids--> 74855d03ee3999e56b785a33b956245d([u'exe', u'good'])\n",
        "2d09cc92bbe29d96bb3a91b350d1725f([u'bad', u'exe']) ---peids--> 74855d03ee3999e56b785a33b956245d([u'exe', u'good'])\n",
        "2d094b6c69020091b68d1bcf5d11fa4b([u'bad', u'exe']) ---peids--> a3661a61f7e7b7d37e6d037ed747e7ef([u'exe', u'good'])\n",
        "2d09546831b17d2cc0583362b6d312ae([u'bad', u'exe']) ---peids--> a3661a61f7e7b7d37e6d037ed747e7ef([u'exe', u'good'])\n",
        "2d09cc92bbe29d96bb3a91b350d1725f([u'bad', u'exe']) ---peids--> a3661a61f7e7b7d37e6d037ed747e7ef([u'exe', u'good'])\n",
        "2d094b6c69020091b68d1bcf5d11fa4b([u'bad', u'exe']) ---peids--> a3661a61f7e7b7d37e6d037ed747e7ef([u'exe', u'good'])\n",
        "2d09546831b17d2cc0583362b6d312ae([u'bad', u'exe']) ---peids--> a3661a61f7e7b7d37e6d037ed747e7ef([u'exe', u'good'])\n",
        "2d09cc92bbe29d96bb3a91b350d1725f([u'bad', u'exe']) ---peids--> a3661a61f7e7b7d37e6d037ed747e7ef([u'exe', u'good'])\n",
        "1cea13cf888cd8ce4f869029f1dbb601([u'bad', u'exe']) ---warnings--> e87f31116298d4d4839e50fce87b9f6f([u'exe', u'good'])\n",
        "d94da41e7e809f7366971b3b50f8ef68([u'bad', u'exe']) ---warnings--> e87f31116298d4d4839e50fce87b9f6f([u'exe', u'good'])\n",
        "1cea13cf888cd8ce4f869029f1dbb601([u'bad', u'exe']) ---warnings--> e87f31116298d4d4839e50fce87b9f6f([u'exe', u'good'])\n",
        "d94da41e7e809f7366971b3b50f8ef68([u'bad', u'exe']) ---warnings--> e87f31116298d4d4839e50fce87b9f6f([u'exe', u'good'])\n"
       ]
      }
     ],
     "prompt_number": 104
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# List every relationship touching one specific 'good' file, looked up by its md5\n",
      "neo_q = 'match (s{md5:\"74855d03ee3999e56b785a33b956245d\"})-[r]-(t) return type(r),t.md5,labels(t)'\n",
      "query = neo4j.CypherQuery(graph_db, neo_q)\n",
      "for record in query.stream():\n",
      "    v = record.values\n",
      "    print(v)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(u'warnings', u'a3661a61f7e7b7d37e6d037ed747e7ef', [u'exe', u'good'])\n",
        "(u'warnings', u'c6b6a394c597dfca84a2e98a9c0dc58f', [u'exe', u'good'])\n",
        "(u'warnings', u'a3661a61f7e7b7d37e6d037ed747e7ef', [u'exe', u'good'])\n",
        "(u'warnings', u'c6b6a394c597dfca84a2e98a9c0dc58f', [u'exe', u'good'])\n",
        "(u'peids', u'2d094b6c69020091b68d1bcf5d11fa4b', [u'bad', u'exe'])\n",
        "(u'peids', u'2d09546831b17d2cc0583362b6d312ae', [u'bad', u'exe'])\n",
        "(u'peids', u'2d09cc92bbe29d96bb3a91b350d1725f', [u'bad', u'exe'])\n",
        "(u'peids', u'73b459178a48657d5e92c41ec1fdd716', [u'exe', u'good'])\n",
        "(u'peids', u'a3661a61f7e7b7d37e6d037ed747e7ef', [u'exe', u'good'])\n",
        "(u'peids', u'2d094b6c69020091b68d1bcf5d11fa4b', [u'bad', u'exe'])\n",
        "(u'peids', u'2d09546831b17d2cc0583362b6d312ae', [u'bad', u'exe'])\n",
        "(u'peids', u'2d09cc92bbe29d96bb3a91b350d1725f', [u'bad', u'exe'])\n",
        "(u'peids', u'73b459178a48657d5e92c41ec1fdd716', [u'exe', u'good'])\n",
        "(u'peids', u'a3661a61f7e7b7d37e6d037ed747e7ef', [u'exe', u'good'])\n"
       ]
      }
     ],
     "prompt_number": 105
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# All shortest paths (at most 2 hops) from one specific 'good' file to any 'bad' file\n",
      "neo_q = 'match (s{md5:\"73b459178a48657d5e92c41ec1fdd716\"}),(t:bad), p=allShortestPaths((s)-[*..2]-(t)) return p'\n",
      "query = neo4j.CypherQuery(graph_db, neo_q)\n",
      "for record in query.stream():\n",
      "    v = record.values\n",
      "    print(v[0])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(76)-[:\"peids\"]->(18)-[:\"warnings\"]->(0)\n",
        "(76)-[:\"peids\"]->(18)-[:\"warnings\"]->(0)\n",
        "(76)-[:\"peids\"]->(18)-[:\"warnings\"]->(0)\n",
        "(76)-[:\"peids\"]->(18)-[:\"warnings\"]->(0)\n",
        "(76)-[:\"warnings\"]->(9)\n",
        "(76)-[:\"warnings\"]->(9)\n",
        "(76)-[:\"peids\"]->(16)\n",
        "(76)-[:\"peids\"]->(16)\n",
        "(76)-[:\"peids\"]->(18)\n",
        "(76)-[:\"peids\"]->(18)\n",
        "(76)-[:\"peids\"]->(25)\n",
        "(76)-[:\"peids\"]->(25)\n",
        "(76)-[:\"peids\"]->(18)-[:\"warnings\"]->(27)\n",
        "(76)-[:\"peids\"]->(18)-[:\"warnings\"]->(27)"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "(76)-[:\"peids\"]->(18)-[:\"warnings\"]->(27)\n",
        "(76)-[:\"peids\"]->(18)-[:\"warnings\"]->(27)\n",
        "(76)-[:\"peids\"]->(18)-[:\"warnings\"]->(43)\n",
        "(76)-[:\"peids\"]->(18)-[:\"warnings\"]->(43)\n",
        "(76)-[:\"peids\"]->(18)-[:\"warnings\"]->(43)\n",
        "(76)-[:\"peids\"]->(18)-[:\"warnings\"]->(43)\n",
        "(76)-[:\"warnings\"]->(46)\n",
        "(76)-[:\"warnings\"]->(46)\n"
       ]
      }
     ],
     "prompt_number": 106
    }
   ],
   "metadata": {}
  }
 ]
}