{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# Zeek Network Data to Scikit-Learn\n", "In this notebook we use the zat Python module to go from Zeek data to Pandas to Scikit-Learn. Once the data is in a form that Scikit-Learn can use, a wide array of data analysis and machine learning algorithms is at our disposal.\n", "\n", "### Software\n", "- zat: https://github.com/SuperCowPowers/zat\n", "- Pandas: https://github.com/pandas-dev/pandas\n", "- Scikit-Learn: http://scikit-learn.org/stable/index.html\n", "\n", "### Techniques\n", "\n", "- One Hot Encoding: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html\n", "- t-SNE: https://distill.pub/2016/misread-tsne/\n", "- KMeans: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html\n", "\n", "### Thanks\n", "- The DataFrameToMatrix() class is inspired by a great talk from Tom Augspurger at PyData Chicago 2016: https://youtu.be/KLPtEBokqQ0\n", "\n", "### Code Availability\n", "All the code in this notebook is from the examples/bro_to_scikit.py file in the zat repository (https://github.com/SuperCowPowers/zat). If you have any questions or problems, please don't hesitate to open an Issue on GitHub or, even better, submit a PR. 
:) \n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "zat: 0.3.6\n", "Pandas: 0.25.1\n", "Numpy: 1.16.4\n", "Scikit Learn Version: 0.21.2\n" ] } ], "source": [ "# Third Party Imports\n", "import pandas as pd\n", "import sklearn\n", "from sklearn.manifold import TSNE\n", "from sklearn.decomposition import PCA\n", "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n", "from sklearn.cluster import KMeans\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "# Local imports\n", "import zat\n", "from zat.log_to_dataframe import LogToDataFrame\n", "from zat.dataframe_to_matrix import DataFrameToMatrix\n", "\n", "# Good to print out versions of stuff\n", "print('zat: {:s}'.format(zat.__version__))\n", "print('Pandas: {:s}'.format(pd.__version__))\n", "print('Numpy: {:s}'.format(np.__version__))\n", "print('Scikit Learn Version:', sklearn.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quickly go from Zeek log to Pandas DataFrame" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n",
"| ts | uid | id.orig_h | id.orig_p | id.resp_h | id.resp_p | proto | trans_id | query | qclass | qclass_name | ... | rcode | rcode_name | AA | TC | RD | RA | Z | answers | TTLs | rejected |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 2013-09-15 23:44:27.631940126 | CZGShC2znK1sV7jdI7 | 192.168.33.10 | 1030 | 4.2.2.3 | 53 | udp | 44949 | guyspy.com | 1 | C_INTERNET | ... | 0 | NOERROR | F | F | T | T | 0 | 54.245.228.191 | 36.000000 | F |\n",
"| 2013-09-15 23:44:27.696868896 | CZGShC2znK1sV7jdI7 | 192.168.33.10 | 1030 | 4.2.2.3 | 53 | udp | 50071 | www.guyspy.com | 1 | C_INTERNET | ... | 0 | NOERROR | F | F | T | T | 0 | guyspy.com,54.245.228.191 | 1000.000000,36.000000 | F |\n",
"| 2013-09-15 23:44:28.060639143 | CZGShC2znK1sV7jdI7 | 192.168.33.10 | 1030 | 4.2.2.3 | 53 | udp | 39062 | devrubn8mli40.cloudfront.net | 1 | C_INTERNET | ... | 0 | NOERROR | F | F | T | T | 0 | 54.230.86.87,54.230.86.18,54.230.87.160,54.230... | 60.000000,60.000000,60.000000,60.000000,60.000... | F |\n",
"| 2013-09-15 23:44:28.141794920 | CZGShC2znK1sV7jdI7 | 192.168.33.10 | 1030 | 4.2.2.3 | 53 | udp | 7312 | d31qbv1cthcecs.cloudfront.net | 1 | C_INTERNET | ... | 0 | NOERROR | F | F | T | T | 0 | 54.230.86.87,54.230.86.18,54.230.84.20,54.230.... | 60.000000,60.000000,60.000000,60.000000,60.000... | F |\n",
"| 2013-09-15 23:44:28.422703981 | CZGShC2znK1sV7jdI7 | 192.168.33.10 | 1030 | 4.2.2.3 | 53 | udp | 41872 | crl.entrust.net | 1 | C_INTERNET | ... | 0 | NOERROR | F | F | T | T | 0 | cdn.entrust.net.c.footprint.net,192.221.123.25... | 4993.000000,129.000000,129.000000,129.000000 | F |\n",
"\n",
"5 rows × 22 columns
\n", "\n",
"| ts | AA | RA | RD | TC | Z | rejected | proto | qclass_name | qtype_name | rcode_name | query_length |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 2013-09-15 23:44:27.631940126 | F | T | T | F | 0 | F | udp | C_INTERNET | A | NOERROR | 10.0 |\n",
"| 2013-09-15 23:44:27.696868896 | F | T | T | F | 0 | F | udp | C_INTERNET | A | NOERROR | 14.0 |\n",
"| 2013-09-15 23:44:28.060639143 | F | T | T | F | 0 | F | udp | C_INTERNET | A | NOERROR | 28.0 |\n",
"| 2013-09-15 23:44:28.141794920 | F | T | T | F | 0 | F | udp | C_INTERNET | A | NOERROR | 29.0 |\n",
"| 2013-09-15 23:44:28.422703981 | F | T | T | F | 0 | F | udp | C_INTERNET | A | NOERROR | 15.0 |\n",
"
\n",
"| ts | query | proto | x | y | cluster |\n",
"|---|---|---|---|---|---|\n",
"| 2013-09-15 23:44:27.631940126 | guyspy.com | udp | 39.936161 | -18.896549 | 0 |\n",
"| 2013-09-15 23:44:27.696868896 | www.guyspy.com | udp | 24.945557 | 3.408248 | 0 |\n",
"| 2013-09-15 23:44:28.060639143 | devrubn8mli40.cloudfront.net | udp | -28.303942 | 2.064817 | 0 |\n",
"| 2013-09-15 23:44:28.141794920 | d31qbv1cthcecs.cloudfront.net | udp | -22.633984 | 6.704806 | 0 |\n",
"| 2013-09-15 23:44:28.422703981 | crl.entrust.net | udp | 18.326878 | -2.018499 | 0 |\n",
"