{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "## PE File Similarity Graph using Workbench:\n", "\n", "Here we're using the term Similarity Graph to mean a graph where the nodes are entities (PE Files in this case) and the edges are relationships between the nodes based on similar attributes. See [Semantic Network](http://en.wikipedia.org/wiki/Semantic_network) for more information.\n", "\n", "Workbench can be set up to utilize several indexers:\n", "\n", "- Straight up Indexing with [ElasticSearch](http://www.elasticsearch.org) \n", "- Super awesome [Neo4j](http://www.neo4j.org) as both an indexer and graph database.\n", "\n", "Neo4j also incorporates Lucene-based indexing, so not only can you capture a rich set of relationships between your data entities, but searches and queries are super quick.\n", "\n", "\n", "## Let's start up the workbench server...\n", "Run the workbench server (from anywhere; for the demo we're just going to start a local one)\n", "\n", "$ workbench_server\n", "\n", "\n", "#### Okay, so when the server starts up, it autoloads any worker plugins in the server/worker directory and dynamically monitors the directory. If a new Python file shows up, it's validated as a properly formed plugin and, if it passes, is added to the list of workers." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "# Let's start to interact with workbench. Please note there is NO specific client to workbench;\n", "# just use the ZeroRPC Python, Node.js, or CLI interfaces.\n", "import zerorpc\n", "c = zerorpc.Client()\n", "c.connect(\"tcp://127.0.0.1:4242\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "[None]" ] } ], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "# Load in 100 PE Files\n", "def workbench_load(file_list):\n", "    md5_list = []\n", "    for filename in file_list:\n", "        with open(filename,'rb') as f:\n", "            md5_list.append(c.store_sample(f.read(), filename, 'exe'))\n", "    print('Files loaded: %d' % len(md5_list))\n", "    return md5_list\n", "\n", "import os\n", "file_list = [os.path.join('../data/pe/bad', child) for child in os.listdir('../data/pe/bad')]\n", "md5s_bad = workbench_load(file_list)\n", "file_list = [os.path.join('../data/pe/good', child) for child in os.listdir('../data/pe/good')]\n", "md5s_good = workbench_load(file_list)\n", "md5_list = md5s_bad + md5s_good\n", "md5_list[:5]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Files loaded: 50\n", "Files loaded: 50" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "['033d91aae8ad29ed9fbb858179271232',\n", " '0cb9aa6fb9c4aa3afad7a303e21ac0f3',\n", " '0e882ec9b485979ea84c7843d41ba36f',\n", " '0e8b030fb6ae48ffd29e520fc16b5641',\n", " '0eb9e990c521b30428a379700ec5ab3e']" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Notice the level of control on our batch operations\n", "#### Your database may have tons of files of different types. We literally control execution on the per-sample level with md5 lists. 
Alternatively, we can specify particular types or simply query the database to get exactly what we want and build our own md5 list.\n", "\n", "#### Also notice that we can specify *exactly* what data we want, down to arbitrary depth. Here we want just the imported_symbols from the sparse features from the pe_features worker." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Compute pe_features on all files of type pe, just pull back the sparse features\n", "imports = c.batch_work_request('pe_features', {'md5_list': md5_list, 'subkeys':['md5','sparse_features.imported_symbols']})\n", "imports" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ "
df = pd.DataFrame(imports)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Sending generator output into a Pandas Dataframe constructor\n", "import pandas as pd\n", "df_imports = pd.DataFrame(imports)\n", "df_imports.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
|   | imported_symbols | md5 |
|---|---|---|
| 0 | [kernel32.dll:name=getenvironmentvariablew, ke... | 033d91aae8ad29ed9fbb858179271232 |
| 1 | [mfc42.dll:ordinal=2514, mfc42.dll:ordinal=516... | 0cb9aa6fb9c4aa3afad7a303e21ac0f3 |
| 2 | [msvbvm60.dll:ordinal=588 bound=1923206285, ms... | 0e882ec9b485979ea84c7843d41ba36f |
| 3 | [wsock32.dll:name=wsastartup, wsock32.dll:name... | 0e8b030fb6ae48ffd29e520fc16b5641 |
| 4 | [user32.dll:name=getwindow bound=2110180096, u... | 0eb9e990c521b30428a379700ec5ab3e |

5 rows × 2 columns
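
The imported-symbol lists above are the kind of attribute that can drive similarity edges between PE files. As a minimal sketch (not necessarily how Workbench or this notebook computes similarity), the Jaccard index of two symbol lists can be used as an edge weight. The names `jaccard` and `similarity_edges`, the 0.5 threshold, and the sample records are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    # Jaccard index of two iterables treated as sets: |A & B| / |A | B|
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty lists share nothing we can measure
    return len(set_a & set_b) / float(len(union))


def similarity_edges(records, key='imported_symbols', threshold=0.5):
    # Return (md5_a, md5_b, score) for every pair whose score meets the threshold
    edges = []
    for rec_a, rec_b in combinations(records, 2):
        score = jaccard(rec_a[key], rec_b[key])
        if score >= threshold:
            edges.append((rec_a['md5'], rec_b['md5'], score))
    return edges


# Hypothetical records shaped like the rows above (md5s and symbols are made up)
records = [
    {'md5': 'aaa', 'imported_symbols': ['kernel32.dll:name=a', 'kernel32.dll:name=b', 'user32.dll:name=c']},
    {'md5': 'bbb', 'imported_symbols': ['kernel32.dll:name=a', 'kernel32.dll:name=b']},
    {'md5': 'ccc', 'imported_symbols': ['wsock32.dll:name=wsastartup']},
]
edges = similarity_edges(records, threshold=0.5)
print(edges)
```

Comparing every pair is quadratic in the number of samples, which is fine for the 100 files loaded here but would likely need an index or a sketching technique for a large collection.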

|   | md5 | pe_warning_strings |
|---|---|---|
| 0 | 090a189f4eeb3c0b76e97acdb1a71c92 | [] |
| 1 | 093dee8d97fd9d35884ed52179b3d142 | [Suspicious flags set for section 5. Both IMAG... |
| 2 | 0dd74786d22edff0ce5b8e1b1e398618 | [] |
| 3 | 10328f92e7ec8735ea7846bf2c8254c2 | [] |
| 4 | 12fd4ef8f2cbbf98e0a5ced88258ddf3 | [] |

5 rows × 2 columns

|   | md5 | string_list |
|---|---|---|
| 0 | 090a189f4eeb3c0b76e97acdb1a71c92 | [!This program cannot be run in DOS mode., Ric... |
| 1 | 093dee8d97fd9d35884ed52179b3d142 | [!This program cannot be run in DOS mode., r\\R... |
| 2 | 0dd74786d22edff0ce5b8e1b1e398618 | [!This program cannot be run in DOS mode., qHy... |
| 3 | 10328f92e7ec8735ea7846bf2c8254c2 | [!This program cannot be run in DOS mode., .te... |
| 4 | 12fd4ef8f2cbbf98e0a5ced88258ddf3 | [!This program cannot be run in DOS mode., qHy... |

5 rows × 2 columns

|   | match_list | md5 |
|---|---|---|
| 0 | [Microsoft Visual C++ 8] | 090a189f4eeb3c0b76e97acdb1a71c92 |
| 1 | [] | 093dee8d97fd9d35884ed52179b3d142 |
| 2 | [Microsoft Visual C++ 8] | 0dd74786d22edff0ce5b8e1b1e398618 |
| 3 | [Microsoft Visual C# v7.0 / Basic .NET] | 10328f92e7ec8735ea7846bf2c8254c2 |
| 4 | [Microsoft Visual C++ 8] | 12fd4ef8f2cbbf98e0a5ced88258ddf3 |

5 rows × 2 columns

|   | md5 | sim_list |
|---|---|---|
| 0 | 090a189f4eeb3c0b76e97acdb1a71c92 | [] |
| 1 | 093dee8d97fd9d35884ed52179b3d142 | [] |
| 2 | 0dd74786d22edff0ce5b8e1b1e398618 | [{u'sim': 65, u'md5': u'e0b173f23d873286169995... |
| 3 | 10328f92e7ec8735ea7846bf2c8254c2 | [] |
| 4 | 12fd4ef8f2cbbf98e0a5ced88258ddf3 | [] |

5 rows × 2 columns
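
Each sim_list entry above carries a `sim` score and the `md5` of the matching sample, which is already close to an edge list. The sketch below shows one way such rows could be flattened into undirected graph edges. It assumes the score is symmetric between the two samples, and the helper name `edges_from_sim_lists`, the `min_sim` cutoff, and the sample rows are hypothetical.

```python
def edges_from_sim_lists(rows, min_sim=0):
    # Return sorted, de-duplicated (md5_a, md5_b, sim) tuples built from
    # each row's sim_list, keeping only matches scoring at least min_sim
    edges = set()
    for row in rows:
        for match in row['sim_list']:
            if match['sim'] >= min_sim:
                # Order the pair so A-B and B-A collapse into one undirected edge
                a, b = sorted((row['md5'], match['md5']))
                edges.add((a, b, match['sim']))
    return sorted(edges)


# Hypothetical rows shaped like the table above (md5 values are made up)
rows = [
    {'md5': 'bbb', 'sim_list': [{'sim': 65, 'md5': 'aaa'}]},
    {'md5': 'aaa', 'sim_list': [{'sim': 65, 'md5': 'bbb'}]},
    {'md5': 'ccc', 'sim_list': []},
]
sim_edges = edges_from_sim_lists(rows, min_sim=50)
print(sim_edges)
```

The resulting tuples could then be handed to whatever graph store is in use, with the md5s as node keys and the score as an edge property.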
\n", "