{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# Clustering: Picking the 'K' hyperparameter\n", "Clustering, the unsupervised machine learning technique of grouping similar data points, is useful and fairly efficient in most cases. The hard part is often picking the number of clusters to make (the K hyperparameter). \n", "The best number of clusters may vary dramatically depending on the characteristics of the data, the types of variables (numeric or categorical), how the data is normalized/encoded, and the distance metric used.\n", "\n", "\n", "\n", "**For this notebook we're going to focus specifically on the following:**\n", "- Optimizing the number of clusters (K hyperparameter) using Silhouette Scoring\n", "- Using an algorithm (DBSCAN) that automatically determines the number of clusters\n", "\n", "\n", "### Software\n", "- Zeek Analysis Tools (ZAT): https://github.com/SuperCowPowers/zat\n", "- Pandas: https://github.com/pandas-dev/pandas\n", "- Scikit-Learn: http://scikit-learn.org/stable/index.html\n", "\n", "\n", "\n", "### Techniques\n", "- One Hot Encoding: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html\n", "- t-SNE: https://distill.pub/2016/misread-tsne/\n", "- K-means: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html\n", "- Silhouette Score: https://en.wikipedia.org/wiki/Silhouette_(clustering)\n", "- DBSCAN: https://en.wikipedia.org/wiki/DBSCAN" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ZAT: 0.3.6\n", "Pandas: 0.25.1\n", "Scikit Learn Version: 0.21.2\n" ] } ], "source": [ "# Third Party Imports\n", "import pandas as pd\n", "import numpy as np\n", "import sklearn\n", "from sklearn.manifold import TSNE\n", "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n", "from sklearn.cluster import KMeans, DBSCAN\n", "\n", "# Local imports\n", "import zat\n", "from zat.log_to_dataframe import LogToDataFrame\n", "from zat.dataframe_to_matrix import DataFrameToMatrix\n", "\n", "# Print library versions for reproducibility\n", "print('ZAT: {:s}'.format(zat.__version__))\n", "print('Pandas: {:s}'.format(pd.__version__))\n", "print('Scikit Learn Version:', sklearn.__version__)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
| ts | uid | id.orig_h | id.orig_p | id.resp_h | id.resp_p | trans_depth | method | host | uri | referrer | ... | info_msg | filename | tags | username | password | proxied | orig_fuids | orig_mime_types | resp_fuids | resp_mime_types |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-09-15 23:44:27.668081999 | CyIaMO7IheOh38Zsi | 192.168.33.10 | 1031 | 54.245.228.191 | 80 | 1 | GET | guyspy.com | / | NaN | ... | NaN | NaN | (empty) | NaN | NaN | NaN | NaN | NaN | Fnjq3r4R0VGmHVWiN5 | text/html |
| 2013-09-15 23:44:27.731701851 | CoyZrY2g74UvMMgp4a | 192.168.33.10 | 1032 | 54.245.228.191 | 80 | 1 | GET | www.guyspy.com | / | NaN | ... | NaN | NaN | (empty) | NaN | NaN | NaN | NaN | NaN | FCQ5aX37YzsjAKpcv8 | text/html |
| 2013-09-15 23:44:28.092921972 | CoyZrY2g74UvMMgp4a | 192.168.33.10 | 1032 | 54.245.228.191 | 80 | 2 | GET | www.guyspy.com | /wp-content/plugins/slider-pro/css/advanced-sl... | http://www.guyspy.com/ | ... | NaN | NaN | (empty) | NaN | NaN | NaN | NaN | NaN | FD9Xu815Hwui3sniSf | text/html |
| 2013-09-15 23:44:28.150300980 | CiCKTz4e0fkYYazBS3 | 192.168.33.10 | 1040 | 54.245.228.191 | 80 | 1 | GET | www.guyspy.com | /wp-content/plugins/contact-form-7/includes/cs... | http://www.guyspy.com/ | ... | NaN | NaN | (empty) | NaN | NaN | NaN | NaN | NaN | FMZXWm1yCdsCAU3K9d | text/plain |
| 2013-09-15 23:44:28.150601864 | C1YBkC1uuO9bzndRvh | 192.168.33.10 | 1041 | 54.245.228.191 | 80 | 1 | GET | www.guyspy.com | /wp-content/plugins/slider-pro/css/slider/adva... | http://www.guyspy.com/ | ... | NaN | NaN | (empty) | NaN | NaN | NaN | NaN | NaN | FA4NM039Rf9Y8Sn2Rh | text/plain |

5 rows × 26 columns
\n", "