{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Finding Correlations in a CSV of Malware Events via Hypergraph Views\n", "\n", "To find patterns and outliers in CSVs and event data, Graphistry provides the hypergraph transform.\n", "\n", "As an example, this notebook examines malware files reported to a security vendor. It reveals patterns such as:\n", "\n", "* The malware files cluster into several families\n", "* Nodes at the center of a cluster reveal attributes shared across a strain of malware\n", "* Nodes on the border of a cluster reveal attributes that appear in a strain but take a unique value in each instance\n", "* Several families share attributes, suggesting they may have the same authors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load CSV" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "import pandas as pd\n", "import graphistry as g\n", "# g.register(key='...')  # uncomment and supply your API key" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('# samples', 999)\n" ] }, { "data": { "text/plain": [ "{'Campaign': 'TRANSFORMICE',\n", " 'Date': '2015-11-19 14:04:23',\n", " 'Domain': 'spynet1.ddns.net',\n", " 'InstallDir': 'TEMP',\n", " 'InstallFlag': 'True',\n", " 'InstallName': 'svchost.exe',\n", " 'NetworkSeparator': \"|'|'|\",\n", " 'Origin': 'vt',\n", " 'Port': '1177',\n", " 'RegistryValue': 'ba4c12bee3027d94da5c81db2d196bfd',\n", " 'Version': '0.6.4',\n", " 'compile_date': '2015-11-18 21:25:59',\n", " 'imphash': 'f34d5f2d4577ed6d9ceec516c1f5a744',\n", " 'magic': 'PE32 executable for MS Windows (GUI) Intel 80386 32-bit Mono/.Net assembly',\n", " 'md5': '007a8403b3281fd4d48c69f4c96da0b8',\n", " 'rat_name': 'njRat',\n", " 
'section_.RELOC': '7905c1aa858eb5484ad08a2e10b7e50e',\n", " 'section_.RSRC': '5b346ed223699f15252c1fdad182859f',\n", " 'section_.TEXT': 'f414cace41511d02fb8e278cf36fd2a3',\n", " 'sha1': 'd215edec90c5487800d961cc1ac2808e221818fa',\n", " 'sha256': '2beb53ca652d9d4f73516ce45365ae824370d2408d6b0d5a809cf3cd177ba694'}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import ast\n", "\n", "df = pd.read_csv('barncat.1k.csv', encoding='utf8')\n", "print(\"# samples\", len(df))\n", "# Parse the first record's 'value' field; literal_eval only accepts literals, so it cannot run code embedded in the CSV\n", "ast.literal_eval(df['value'].iloc[0])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>uuid</th><th>event_id</th><th>category</th><th>type</th><th>value</th><th>to_ids</th><th>date</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>56e1af55-22f4-4b76-881a-50feac1f3af3</td><td>417</td><td>External analysis</td><td>comment</td><td>{\"InstallFlag\": \"True\", \"RegistryValue\": \"ba4c...</td><td>0</td><td>20160310</td></tr>\n",
"  </tbody>\n",
"</table>
" ], "text/plain": [ " uuid event_id category type \\\n", "0 56e1af55-22f4-4b76-881a-50feac1f3af3 417 External analysis comment \n", "\n", " value to_ids date \n", "0 {\"InstallFlag\": \"True\", \"RegistryValue\": \"ba4c... 0 20160310 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#avoid double counting\n", "df3 = df[df['value'].str.contains(\"{\")]\n", "df3[:1]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>ActivateKeylogger</th><th>ActiveXKey</th><th>ActiveXStartup</th><th>BackupDNSServer</th><th>BypassUAC</th><th>Campaign</th><th>ChangeCreationDate</th><th>ClearAccessControl</th><th>ClearZoneIdentifier</th><th>ConnectDelay</th><th>...</th><th>section_.TEXT</th><th>section_.TLS</th><th>section_BSS</th><th>section_CODE</th><th>section_DATA</th><th>sha1</th><th>sha256</th><th>to_ids</th><th>type</th><th>uuid</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0.0</td><td>comment</td><td>56e1af55-22f4-4b76-881a-50feac1f3af3</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>1 rows × 116 columns</p>
\n", "
" ], "text/plain": [ " ActivateKeylogger ActiveXKey ActiveXStartup BackupDNSServer BypassUAC \\\n", "0 NaN NaN NaN NaN NaN \n", "\n", " Campaign ChangeCreationDate ClearAccessControl ClearZoneIdentifier \\\n", "0 NaN NaN NaN NaN \n", "\n", " ConnectDelay ... section_.TEXT \\\n", "0 NaN ... NaN \n", "\n", " section_.TLS section_BSS section_CODE section_DATA sha1 sha256 to_ids \\\n", "0 NaN NaN NaN NaN NaN NaN 0.0 \n", "\n", " type uuid \n", "0 comment 56e1af55-22f4-4b76-881a-50feac1f3af3 \n", "\n", "[1 rows x 116 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Unpack the JSON in 'value' into one column per key\n", "import json\n", "unpacked = df3['value'].apply(json.loads).apply(pd.Series)\n", "# axis=1 joins the unpacked columns to their source rows instead of appending them as extra rows\n", "df4 = pd.concat([df3.drop('value', axis=1), unpacked], axis=1)\n", "len(df4)\n", "df4[:1]" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Default Hypergraph Transform\n", "\n", "The hypergraph transform creates:\n", "* A node for every row\n", "* A node for every unique value in a column (so a value found in several columns yields one node per column)\n", "* An edge connecting each row to its values\n", "\n", "When multiple rows share values, they cluster together. When a row has unique values, those form a ring around only that row's node." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('# links', 200)\n", "('# event entities', 50)\n", "('# attrib entities', 102)\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g.hypergraph(df4[:50])['graph'].plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configured Hypergraph Transform\n", "We clean up the visualization in two ways:\n", "\n", "1. Group the hash columns into a single category. 
This simplifies coloring by the generated 'category' field. If several columns share a value, such as two columns holding the same md5, the value also gets one node per hash instead of one node per column.\n", "\n", "2. Skip attributes that add little as nodes, such as numbers and dates.\n", "\n", "Running `help(g.hypergraph)` lists more options." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('# links', 2350)\n", "('# event entities', 204)\n", "('# attrib entities', 1156)\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g.hypergraph(\n", "    df4,\n", "    opts={\n", "        # Columns listed under one category share nodes and a color\n", "        'CATEGORIES': {\n", "            'hash': ['sha1', 'sha256', 'md5'],\n", "            'section': [x for x in df4.columns if 'section_' in x]\n", "        },\n", "        # Columns that should not become attribute nodes\n", "        'SKIP': ['event_id', 'InstallFlag', 'type', 'val', 'Date', 'date', 'Port', 'FTPPort', 'Origin', 'category', 'comment', 'to_ids']\n", "    })['graph'].plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 0 }