{ "cells": [ { "cell_type": "markdown", "id": "c99eea00-b6af-40cb-8084-090dd7ef43d7", "metadata": {}, "source": [ "# Anomaly Detection - Network Activity" ] }, { "cell_type": "markdown", "id": "2e54f95d-7518-4f57-8ba3-b2cdd8718c28", "metadata": {}, "source": [ "After attackers gain initial access to a victim's network, their next techniques may involve executing scripts from one machine on the network against another. These follow-up activities can generate network activity that is *anomalous*. It might be anomalous in two different ways: compared to what *normally happens* on the network, or relative to what everything else on the network is *currently doing*. There is an important distinction between those two kinds of anomalies. The first assumes some ground-truth knowledge of what \"normal\" *looks* like -- this is \"novelty\" detection. The second doesn't need a ground truth -- it's just \"outlier\" detection.\n", "\n", "My favorite example of network anomaly detection gone wrong is the university network that detected an enormous spike in traffic on a high port number between midnight and about 3am every night. Were they under attack? Follow-up investigation determined that the involved IP addresses all belonged to on-campus freshman housing, and that the port was a common Minecraft server config. Oops, false alarm! The solution was to create a *new* \"normalcy\" model specific to on-campus housing in the wee nighttime hours, which wouldn't be alarmed by late-night Minecraft gaming. I'm probably not remembering that story correctly, but the point is that \"normalcy\" modeling can quickly spiral into requiring a huge number of hyper-specific models or parameters to reduce false positives.\n", "\n", "This notebook demonstrates a simple way to do outlier detection on netflow records, using one of the CTU-13 datasets. 
It clusters on standard-scaled `TotBytes`\n", "using DBSCAN with an `eps` of 2.5. Points labeled as cluster `-1` don't group with any other points, so they are treated as outliers. The 2.5 is borrowed from a common outlier threshold for Gaussian (normal) distributions.\n", "\n", "I'm using the [deargle/my-datascience-notebook](https://hub.docker.com/r/deargle/my-datascience-notebook) Docker image." ] }, { "cell_type": "code", "execution_count": 1, "id": "e6993eb8-09b8-4101-84d1-929d7b870312", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pandas version: 1.4.0\n", "sklearn version: 1.0.2\n" ] } ], "source": [ "import pandas as pd\n", "import sklearn\n", "from sklearn.cluster import DBSCAN\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "print(f'pandas version: {pd.__version__}')\n", "print(f'sklearn version: {sklearn.__version__}')" ] }, { "cell_type": "code", "execution_count": 2, "id": "c485e5ec-f53a-412d-9ac8-c7372a10aea1", "metadata": {}, "outputs": [], "source": [ "# Download the capture into the current directory if it isn't already there:\n", "# https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-42/detailed-bidirectional-flow-labels/capture20110810.binetflow\n", "from pathlib import Path\n", "\n", "path_to_file = 'capture20110810.binetflow'\n", "path = Path(path_to_file)\n", "\n", "if not path.is_file():\n", "    !wget https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-42/detailed-bidirectional-flow-labels/capture20110810.binetflow\n", "\n", "df = pd.read_csv(path_to_file)" ] }, { "cell_type": "code", "execution_count": 3, "id": "9b0845ac-77e2-4b6f-be51-714f7ae85423", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>StartTime</th><th>Dur</th><th>Proto</th><th>SrcAddr</th><th>Sport</th><th>Dir</th><th>DstAddr</th><th>Dport</th><th>State</th><th>sTos</th><th>dTos</th><th>TotPkts</th><th>TotBytes</th><th>SrcBytes</th><th>Label</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>2011/08/10 09:46:53.047277</td><td>3550.182373</td><td>udp</td><td>212.50.71.179</td><td>39678</td><td>&lt;-&gt;</td><td>147.32.84.229</td><td>13363</td><td>CON</td><td>0.0</td><td>0.0</td><td>12</td><td>875</td><td>413</td><td>flow=Background-UDP-Established</td></tr>\n",
"    <tr><th>1</th><td>2011/08/10 09:46:53.048843</td><td>0.000883</td><td>udp</td><td>84.13.246.132</td><td>28431</td><td>&lt;-&gt;</td><td>147.32.84.229</td><td>13363</td><td>CON</td><td>0.0</td><td>0.0</td><td>2</td><td>135</td><td>75</td><td>flow=Background-UDP-Established</td></tr>\n",
"    <tr><th>2</th><td>2011/08/10 09:46:53.049895</td><td>0.000326</td><td>tcp</td><td>217.163.21.35</td><td>80</td><td>&lt;?&gt;</td><td>147.32.86.194</td><td>2063</td><td>FA_A</td><td>0.0</td><td>0.0</td><td>2</td><td>120</td><td>60</td><td>flow=Background</td></tr>\n",
"    <tr><th>3</th><td>2011/08/10 09:46:53.053771</td><td>0.056966</td><td>tcp</td><td>83.3.77.74</td><td>32882</td><td>&lt;?&gt;</td><td>147.32.85.5</td><td>21857</td><td>FA_FA</td><td>0.0</td><td>0.0</td><td>3</td><td>180</td><td>120</td><td>flow=Background</td></tr>\n",
"    <tr><th>4</th><td>2011/08/10 09:46:53.053937</td><td>3427.768066</td><td>udp</td><td>74.89.223.204</td><td>21278</td><td>&lt;-&gt;</td><td>147.32.84.229</td><td>13363</td><td>CON</td><td>0.0</td><td>0.0</td><td>42</td><td>2856</td><td>1596</td><td>flow=Background-UDP-Established</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " StartTime Dur Proto SrcAddr Sport Dir \\\n", "0 2011/08/10 09:46:53.047277 3550.182373 udp 212.50.71.179 39678 <-> \n", "1 2011/08/10 09:46:53.048843 0.000883 udp 84.13.246.132 28431 <-> \n", "2 2011/08/10 09:46:53.049895 0.000326 tcp 217.163.21.35 80 <?> \n", "3 2011/08/10 09:46:53.053771 0.056966 tcp 83.3.77.74 32882 <?> \n", "4 2011/08/10 09:46:53.053937 3427.768066 udp 74.89.223.204 21278 <-> \n", "\n", " DstAddr Dport State sTos dTos TotPkts TotBytes SrcBytes \\\n", "0 147.32.84.229 13363 CON 0.0 0.0 12 875 413 \n", "1 147.32.84.229 13363 CON 0.0 0.0 2 135 75 \n", "2 147.32.86.194 2063 FA_A 0.0 0.0 2 120 60 \n", "3 147.32.85.5 21857 FA_FA 0.0 0.0 3 180 120 \n", "4 147.32.84.229 13363 CON 0.0 0.0 42 2856 1596 \n", "\n", " Label \n", "0 flow=Background-UDP-Established \n", "1 flow=Background-UDP-Established \n", "2 flow=Background \n", "3 flow=Background \n", "4 flow=Background-UDP-Established " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 4, "id": "b7fffc73-b2b5-457c-af2a-51c8f604dcdf", "metadata": {}, "outputs": [], "source": [ "# Set the StartTime as the row index\n", "df = df.set_index(pd.to_datetime(df['StartTime']))" ] }, { "cell_type": "code", "execution_count": 5, "id": "683dc161-eabf-4155-a1db-6a883a848ed9", "metadata": {}, "outputs": [], "source": [ "# Streaming network analytics would bin incoming netflows by some time window. 
Below, I set the window\n", "# to 1 minute. That choice is arbitrary; it just makes iterative testing faster than, say, a 10-minute window.\n", "grouped = df.groupby(pd.Grouper(freq='1min'))" ] }, { "cell_type": "code", "execution_count": 6, "id": "2a429f32-2b7d-4d21-80ee-5ea238b4a9fe", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Window: 2011-08-10 09:46:00\n" ] } ], "source": [ "# For demonstration purposes, I extract just the first window.\n", "# Each group is a (window start, DataFrame) pair; next(iter(...)) avoids materializing every window.\n", "window_start, data = next(iter(grouped))\n", "print(f'Window: {window_start}')" ] }, { "cell_type": "code", "execution_count": 7, "id": "d2827652-c3d8-43a3-b773-76edf4634dda", "metadata": {}, "outputs": [], "source": [ "# I'll cluster just based on netflow TotBytes, but this is also arbitrary. Other numerical features could be included.\n", "X = data[['TotBytes']]" ] }, { "cell_type": "code", "execution_count": 8, "id": "75443e56-3b58-44ae-8135-877470ba7097", "metadata": {}, "outputs": [], "source": [ "# Standardize the data: subtract the mean and divide by the standard deviation.\n", "X = StandardScaler().fit_transform(X)" ] }, { "cell_type": "code", "execution_count": 9, "id": "a28e9306-5eaa-41e2-a78e-d5830bc43492", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "n netflows: 1654\n", "n clusters: 1\n", "n outliers: 6\n" ] } ], "source": [ "# Since we standard-scaled, `eps` is measured in standard deviations. Setting `DBSCAN`'s `eps`\n", "# to 2.5 roughly corresponds to a common threshold for what is considered an \"outlier\"\n", "# in a Gaussian (normal) distribution.\n", "\n", "clf = DBSCAN(eps=2.5)\n", "y_preds = clf.fit_predict(X)\n", "n_clusters = len(set(y_preds) - {-1})  # label -1 marks noise, not a cluster\n", "n_outliers = int((y_preds == -1).sum())\n", "print(f'n netflows: {len(X)}')\n", "print(f'n clusters: {n_clusters}')\n", "print(f'n outliers: {n_outliers}')" ] }, { "cell_type": "code", "execution_count": 11, "id": "1af42383-78d1-4f8d-8ec2-1ab7f2ab6299", "metadata": {}, "outputs": [ { "data": { 
"text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>StartTime</th><th>Dur</th><th>Proto</th><th>SrcAddr</th><th>Sport</th><th>Dir</th><th>DstAddr</th><th>Dport</th><th>State</th><th>sTos</th><th>dTos</th><th>TotPkts</th><th>TotBytes</th><th>SrcBytes</th><th>Label</th></tr>\n",
"    <tr><th>StartTime</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>2011-08-10 09:46:53.078297</th><td>2011/08/10 09:46:53.078297</td><td>3599.972412</td><td>tcp</td><td>147.32.80.13</td><td>80</td><td>&lt;?&gt;</td><td>147.32.84.162</td><td>51769</td><td>PA_A</td><td>0.0</td><td>0.0</td><td>72157</td><td>61638544</td><td>60214264</td><td>flow=From-Background-CVUT-Proxy</td></tr>\n",
"    <tr><th>2011-08-10 09:46:53.106431</th><td>2011/08/10 09:46:53.106431</td><td>507.347626</td><td>tcp</td><td>147.32.80.13</td><td>80</td><td>&lt;?&gt;</td><td>147.32.85.112</td><td>10885</td><td>FPA_FA</td><td>0.0</td><td>0.0</td><td>162760</td><td>137136528</td><td>132816366</td><td>flow=From-Background-CVUT-Proxy</td></tr>\n",
"    <tr><th>2011-08-10 09:46:53.346951</th><td>2011/08/10 09:46:53.346951</td><td>3598.887695</td><td>tcp</td><td>195.250.146.99</td><td>554</td><td>&lt;?&gt;</td><td>147.32.86.99</td><td>16786</td><td>PA_PA</td><td>0.0</td><td>0.0</td><td>51576</td><td>60964440</td><td>60363789</td><td>flow=Background</td></tr>\n",
"    <tr><th>2011-08-10 09:46:53.709666</th><td>2011/08/10 09:46:53.709666</td><td>3599.047607</td><td>tcp</td><td>195.250.146.6</td><td>554</td><td>&lt;?&gt;</td><td>147.32.84.59</td><td>49375</td><td>PA_PA</td><td>0.0</td><td>0.0</td><td>55068</td><td>61164753</td><td>60334389</td><td>flow=Background-Established-cmpgw-CVUT</td></tr>\n",
"    <tr><th>2011-08-10 09:46:58.680289</th><td>2011/08/10 09:46:58.680289</td><td>3591.222656</td><td>tcp</td><td>147.32.85.103</td><td>49317</td><td>&lt;?&gt;</td><td>88.159.8.10</td><td>22</td><td>PA_PA</td><td>0.0</td><td>0.0</td><td>105555</td><td>70963308</td><td>68636560</td><td>flow=Background</td></tr>\n",
"    <tr><th>2011-08-10 09:46:59.367587</th><td>2011/08/10 09:46:59.367587</td><td>3350.645508</td><td>tcp</td><td>147.32.87.5</td><td>524</td><td>?&gt;</td><td>147.32.85.2</td><td>49350</td><td>PA_</td><td>0.0</td><td>NaN</td><td>214827</td><td>248405120</td><td>248405120</td><td>flow=Background</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " StartTime Dur Proto \\\n", "StartTime \n", "2011-08-10 09:46:53.078297 2011/08/10 09:46:53.078297 3599.972412 tcp \n", "2011-08-10 09:46:53.106431 2011/08/10 09:46:53.106431 507.347626 tcp \n", "2011-08-10 09:46:53.346951 2011/08/10 09:46:53.346951 3598.887695 tcp \n", "2011-08-10 09:46:53.709666 2011/08/10 09:46:53.709666 3599.047607 tcp \n", "2011-08-10 09:46:58.680289 2011/08/10 09:46:58.680289 3591.222656 tcp \n", "2011-08-10 09:46:59.367587 2011/08/10 09:46:59.367587 3350.645508 tcp \n", "\n", " SrcAddr Sport Dir DstAddr \\\n", "StartTime \n", "2011-08-10 09:46:53.078297 147.32.80.13 80 <?> 147.32.84.162 \n", "2011-08-10 09:46:53.106431 147.32.80.13 80 <?> 147.32.85.112 \n", "2011-08-10 09:46:53.346951 195.250.146.99 554 <?> 147.32.86.99 \n", "2011-08-10 09:46:53.709666 195.250.146.6 554 <?> 147.32.84.59 \n", "2011-08-10 09:46:58.680289 147.32.85.103 49317 <?> 88.159.8.10 \n", "2011-08-10 09:46:59.367587 147.32.87.5 524 ?> 147.32.85.2 \n", "\n", " Dport State sTos dTos TotPkts TotBytes \\\n", "StartTime \n", "2011-08-10 09:46:53.078297 51769 PA_A 0.0 0.0 72157 61638544 \n", "2011-08-10 09:46:53.106431 10885 FPA_FA 0.0 0.0 162760 137136528 \n", "2011-08-10 09:46:53.346951 16786 PA_PA 0.0 0.0 51576 60964440 \n", "2011-08-10 09:46:53.709666 49375 PA_PA 0.0 0.0 55068 61164753 \n", "2011-08-10 09:46:58.680289 22 PA_PA 0.0 0.0 105555 70963308 \n", "2011-08-10 09:46:59.367587 49350 PA_ 0.0 NaN 214827 248405120 \n", "\n", " SrcBytes Label \n", "StartTime \n", "2011-08-10 09:46:53.078297 60214264 flow=From-Background-CVUT-Proxy \n", "2011-08-10 09:46:53.106431 132816366 flow=From-Background-CVUT-Proxy \n", "2011-08-10 09:46:53.346951 60363789 flow=Background \n", "2011-08-10 09:46:53.709666 60334389 flow=Background-Established-cmpgw-CVUT \n", "2011-08-10 09:46:58.680289 68636560 flow=Background \n", "2011-08-10 09:46:59.367587 248405120 flow=Background " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show the 
anomalous netflows:\n", "data[y_preds == -1]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }