{
"cells": [
{
"cell_type": "markdown",
"id": "c99eea00-b6af-40cb-8084-090dd7ef43d7",
"metadata": {},
"source": [
"# Anomaly Detection - Network Activity"
]
},
{
"cell_type": "markdown",
"id": "2e54f95d-7518-4f57-8ba3-b2cdd8718c28",
"metadata": {},
"source": [
"After attackers gain initial access to a victim's network, their next techniques may involve executing scripts from one machine on the network against another. These follow-up activities can generate network activity that is *anomalous*, and it can be anomalous in two different ways: compared to what *normally happens* on the network, or relative to what everything else on the network is *currently doing*. The distinction matters. The first kind assumes some ground-truth knowledge of what \"normal\" *looks* like; this is \"novelty\" detection. The second kind needs no ground truth; it is simply \"outlier\" detection.\n",
"\n",
"My favorite example of network anomaly detection gone wrong is the university network that detected an enormous traffic spike on a high port number between midnight and about 3am every night. Were they under attack? Follow-up investigation determined that the IP addresses involved all belonged to on-campus freshman housing, and that the port was a common Minecraft server configuration. Oops, false alarm! The solution was to create a *new* \"normalcy\" model specific to on-campus housing in the wee hours of the night, one that wouldn't be alarmed by late-night Minecraft gaming. I'm probably not remembering that story correctly, but the point is that a \"normalcy\" model can quickly spiral into a huge number of hyper-specific models or parameters in order to reduce false positives.\n",
"\n",
"This notebook demonstrates a simple way to do outlier detection on netflow records, using one of the CTU-13 datasets. It clusters on standard-scaled `TotBytes`\n",
"using DBSCAN with an `eps` of 2.5. Any points labeled as cluster `-1` don't group with any other points, so they are treated as outliers. The choice of 2.5 assumes roughly Gaussian (normal) distributions.\n",
"\n",
"I'm using the [deargle/my-datascience-notebook](https://hub.docker.com/r/deargle/my-datascience-notebook) Docker image."
]
},
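{
"cell_type": "markdown",
"id": "sketch-novelty-vs-outlier-md",
"metadata": {},
"source": [
"To make the novelty/outlier distinction concrete, here is a minimal sketch on made-up byte counts (not CTU-13 data). The `baseline` and `current` arrays, the injected 5000-byte flow, and the use of a plain 2.5 z-score cutoff are all assumptions for illustration. The only difference between the two approaches is which data the scaler is fit on: trusted history for novelty detection, or the current window itself for outlier detection."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-novelty-vs-outlier-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"rng = np.random.default_rng(0)\n",
"\n",
"# Hypothetical data: `baseline` stands in for known-normal history,\n",
"# `current` is the window being judged, with one very large flow appended.\n",
"baseline = rng.normal(loc=500, scale=50, size=(200, 1))\n",
"current = np.vstack([rng.normal(loc=500, scale=50, size=(50, 1)), [[5000.0]]])\n",
"\n",
"# Novelty detection: z-scores relative to what normally happens (the baseline).\n",
"z_vs_baseline = StandardScaler().fit(baseline).transform(current)\n",
"\n",
"# Outlier detection: z-scores relative to what the current window is doing.\n",
"z_vs_window = StandardScaler().fit_transform(current)\n",
"\n",
"novel = np.abs(z_vs_baseline).ravel() > 2.5\n",
"outlying = np.abs(z_vs_window).ravel() > 2.5\n",
"print(f'flagged as novel: {novel.sum()}, flagged as outlying: {outlying.sum()}')"
]
},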
{
"cell_type": "code",
"execution_count": 1,
"id": "e6993eb8-09b8-4101-84d1-929d7b870312",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"pandas version: 1.4.0\n",
"sklearn version: 1.0.2\n"
]
}
],
"source": [
"import pandas as pd\n",
"import sklearn\n",
"from sklearn.cluster import DBSCAN\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"print(f'pandas version: {pd.__version__}')\n",
"print(f'sklearn version: {sklearn.__version__}')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c485e5ec-f53a-412d-9ac8-c7372a10aea1",
"metadata": {},
"outputs": [],
"source": [
"# The CTU-13 capture is downloaded into the current directory if it isn't already there.\n",
"from pathlib import Path\n",
"\n",
"path_to_file = 'capture20110810.binetflow'\n",
"path = Path(path_to_file)\n",
"\n",
"if not path.is_file():\n",
"    !wget https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-42/detailed-bidirectional-flow-labels/capture20110810.binetflow\n",
"\n",
"df = pd.read_csv(path_to_file)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9b0845ac-77e2-4b6f-be51-714f7ae85423",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>StartTime</th><th>Dur</th><th>Proto</th><th>SrcAddr</th><th>Sport</th><th>Dir</th><th>DstAddr</th><th>Dport</th><th>State</th><th>sTos</th><th>dTos</th><th>TotPkts</th><th>TotBytes</th><th>SrcBytes</th><th>Label</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>2011/08/10 09:46:53.047277</td><td>3550.182373</td><td>udp</td><td>212.50.71.179</td><td>39678</td><td>&lt;-&gt;</td><td>147.32.84.229</td><td>13363</td><td>CON</td><td>0.0</td><td>0.0</td><td>12</td><td>875</td><td>413</td><td>flow=Background-UDP-Established</td></tr>\n",
"    <tr><th>1</th><td>2011/08/10 09:46:53.048843</td><td>0.000883</td><td>udp</td><td>84.13.246.132</td><td>28431</td><td>&lt;-&gt;</td><td>147.32.84.229</td><td>13363</td><td>CON</td><td>0.0</td><td>0.0</td><td>2</td><td>135</td><td>75</td><td>flow=Background-UDP-Established</td></tr>\n",
"    <tr><th>2</th><td>2011/08/10 09:46:53.049895</td><td>0.000326</td><td>tcp</td><td>217.163.21.35</td><td>80</td><td>&lt;?&gt;</td><td>147.32.86.194</td><td>2063</td><td>FA_A</td><td>0.0</td><td>0.0</td><td>2</td><td>120</td><td>60</td><td>flow=Background</td></tr>\n",
"    <tr><th>3</th><td>2011/08/10 09:46:53.053771</td><td>0.056966</td><td>tcp</td><td>83.3.77.74</td><td>32882</td><td>&lt;?&gt;</td><td>147.32.85.5</td><td>21857</td><td>FA_FA</td><td>0.0</td><td>0.0</td><td>3</td><td>180</td><td>120</td><td>flow=Background</td></tr>\n",
"    <tr><th>4</th><td>2011/08/10 09:46:53.053937</td><td>3427.768066</td><td>udp</td><td>74.89.223.204</td><td>21278</td><td>&lt;-&gt;</td><td>147.32.84.229</td><td>13363</td><td>CON</td><td>0.0</td><td>0.0</td><td>42</td><td>2856</td><td>1596</td><td>flow=Background-UDP-Established</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" StartTime Dur Proto SrcAddr Sport Dir \\\n",
"0 2011/08/10 09:46:53.047277 3550.182373 udp 212.50.71.179 39678 <-> \n",
"1 2011/08/10 09:46:53.048843 0.000883 udp 84.13.246.132 28431 <-> \n",
"2 2011/08/10 09:46:53.049895 0.000326 tcp 217.163.21.35 80 <?> \n",
"3 2011/08/10 09:46:53.053771 0.056966 tcp 83.3.77.74 32882 <?> \n",
"4 2011/08/10 09:46:53.053937 3427.768066 udp 74.89.223.204 21278 <-> \n",
"\n",
" DstAddr Dport State sTos dTos TotPkts TotBytes SrcBytes \\\n",
"0 147.32.84.229 13363 CON 0.0 0.0 12 875 413 \n",
"1 147.32.84.229 13363 CON 0.0 0.0 2 135 75 \n",
"2 147.32.86.194 2063 FA_A 0.0 0.0 2 120 60 \n",
"3 147.32.85.5 21857 FA_FA 0.0 0.0 3 180 120 \n",
"4 147.32.84.229 13363 CON 0.0 0.0 42 2856 1596 \n",
"\n",
" Label \n",
"0 flow=Background-UDP-Established \n",
"1 flow=Background-UDP-Established \n",
"2 flow=Background \n",
"3 flow=Background \n",
"4 flow=Background-UDP-Established "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "b7fffc73-b2b5-457c-af2a-51c8f604dcdf",
"metadata": {},
"outputs": [],
"source": [
"# Set the StartTime as the row index\n",
"df = df.set_index(pd.to_datetime(df['StartTime']))"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "683dc161-eabf-4155-a1db-6a883a848ed9",
"metadata": {},
"outputs": [],
"source": [
"# Streaming network analytics would bin incoming netflows by some time window. Below, I set the window\n",
"# to 1 minute. The size is arbitrary; a short window just makes iterative testing faster than, say, a 10-minute one.\n",
"grouped = df.groupby(pd.Grouper(freq='1min'))"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "2a429f32-2b7d-4d21-80ee-5ea238b4a9fe",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Window: 2011-08-10 09:46:00\n"
]
}
],
"source": [
"# For demonstration purposes, I extract just the first window\n",
"grouped = list(grouped)\n",
"first_window = grouped[0]\n",
"print(f'Window: {first_window[0]}')\n",
"data = first_window[1]"
]
},
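{
"cell_type": "markdown",
"id": "sketch-all-windows-md",
"metadata": {},
"source": [
"In a streaming setting, every window would be checked rather than just the first. Here is a minimal sketch of that loop, assuming a hypothetical helper `flag_outliers` that repeats the scale-then-DBSCAN steps used in the rest of this notebook. The toy frame is made up so the cell runs without the CTU-13 file, and the empty-window guard is there because a time-based grouper can yield windows with no rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-all-windows-code",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.cluster import DBSCAN\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"\n",
"def flag_outliers(window, column='TotBytes', eps=2.5):\n",
"    # Hypothetical helper: return the rows of one window that DBSCAN labels as noise (-1).\n",
"    if len(window) == 0:\n",
"        return window  # nothing to scale or cluster in an empty window\n",
"    scaled = StandardScaler().fit_transform(window[[column]])\n",
"    labels = DBSCAN(eps=eps).fit_predict(scaled)\n",
"    return window[labels == -1]\n",
"\n",
"\n",
"# Made-up flows: one every 3 seconds for two minutes, with a single huge flow at the end.\n",
"toy_index = pd.date_range('2011-08-10 09:46:00', periods=40, freq=pd.Timedelta(seconds=3))\n",
"toy = pd.DataFrame({'TotBytes': [100] * 39 + [10_000_000]}, index=toy_index)\n",
"\n",
"outliers_per_window = {\n",
"    window_start: flag_outliers(window)\n",
"    for window_start, window in toy.groupby(pd.Grouper(freq='1min'))\n",
"}\n",
"for window_start, outliers in outliers_per_window.items():\n",
"    print(f'{window_start}: {len(outliers)} outlier(s)')"
]
},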
{
"cell_type": "code",
"execution_count": 7,
"id": "d2827652-c3d8-43a3-b773-76edf4634dda",
"metadata": {},
"outputs": [],
"source": [
"# I'll cluster just based on netflow TotBytes, but this is also arbitrary. Other numerical features could be included.\n",
"X = data[['TotBytes']]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "75443e56-3b58-44ae-8135-877470ba7097",
"metadata": {},
"outputs": [],
"source": [
"# Standardize: subtract the mean and divide by the standard deviation, so values become z-scores.\n",
"X = StandardScaler().fit_transform(X)"
]
},
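{
"cell_type": "markdown",
"id": "sketch-standard-scaling-md",
"metadata": {},
"source": [
"As a quick check on what the scaling step does, the sketch below compares `StandardScaler` against a by-hand z-score on the five `TotBytes` values shown in the `df.head()` output above. The array `tot_bytes` and the names `scaled_demo` and `by_hand` are introduced only for this illustration. `StandardScaler` uses the population standard deviation, which is also NumPy's default (`ddof=0`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-standard-scaling-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# The first five TotBytes values from df.head(), as a single-feature column.\n",
"tot_bytes = np.array([[875.0], [135.0], [120.0], [180.0], [2856.0]])\n",
"\n",
"scaled_demo = StandardScaler().fit_transform(tot_bytes)\n",
"\n",
"# Same transformation by hand: subtract the mean, divide by the (population) std.\n",
"by_hand = (tot_bytes - tot_bytes.mean(axis=0)) / tot_bytes.std(axis=0)\n",
"\n",
"print(np.allclose(scaled_demo, by_hand))"
]
},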
{
"cell_type": "code",
"execution_count": 9,
"id": "a28e9306-5eaa-41e2-a78e-d5830bc43492",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"n netflows: 1654\n",
"n clusters: 1\n",
"n outliers: 6\n"
]
}
],
"source": [
"# Since we standard-scaled, `TotBytes` is now in units of standard deviations, so we can set\n",
"# `DBSCAN`'s `eps` parameter to 2.5, which roughly corresponds to a common \"outlier\" threshold\n",
"# for a gaussian (normal) distribution. Caveat: `eps` is the maximum distance between\n",
"# neighboring points, not a distance from the mean, so the analogy to a z-score cutoff is loose.\n",
"\n",
"clf = DBSCAN(eps=2.5)\n",
"y_preds = clf.fit_predict(X)\n",
"n_clusters = len(set(y_preds) - {-1})  # DBSCAN labels noise points as -1\n",
"print(f'n netflows: {len(X)}')\n",
"print(f'n clusters: {n_clusters}')\n",
"print(f'n outliers: {(y_preds == -1).sum()}')"
]
},
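{
"cell_type": "markdown",
"id": "sketch-eps-neighbor-distance-md",
"metadata": {},
"source": [
"One subtlety of `eps=2.5` is worth seeing on a tiny example: DBSCAN measures distance between neighboring points, not distance from the mean. In the sketch below, the values in `z_values` are hypothetical and are treated as if they were already in standard-deviation units. The point at 4.0 is well past 2.5 but sits within `eps` of the point at 2.0, which is itself a core point, so it should join the cluster. The point at 9.0 has no neighbor within `eps` and should be labeled `-1`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-eps-neighbor-distance-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.cluster import DBSCAN\n",
"\n",
"# Hypothetical values: a dense group at 0.0 plus three increasingly distant points.\n",
"z_values = np.array([0.0] * 20 + [2.0, 4.0, 9.0]).reshape(-1, 1)\n",
"\n",
"# min_samples is left at scikit-learn's default of 5, as in the cell above.\n",
"demo_labels = DBSCAN(eps=2.5).fit_predict(z_values)\n",
"\n",
"for value, label in zip(z_values.ravel()[-3:].tolist(), demo_labels[-3:].tolist()):\n",
"    print(f'value {value}: label {label}')"
]
},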
{
"cell_type": "code",
"execution_count": 11,
"id": "1af42383-78d1-4f8d-8ec2-1ab7f2ab6299",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>StartTime</th><th>Dur</th><th>Proto</th><th>SrcAddr</th><th>Sport</th><th>Dir</th><th>DstAddr</th><th>Dport</th><th>State</th><th>sTos</th><th>dTos</th><th>TotPkts</th><th>TotBytes</th><th>SrcBytes</th><th>Label</th></tr>\n",
"    <tr><th>StartTime</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>2011-08-10 09:46:53.078297</th><td>2011/08/10 09:46:53.078297</td><td>3599.972412</td><td>tcp</td><td>147.32.80.13</td><td>80</td><td>&lt;?&gt;</td><td>147.32.84.162</td><td>51769</td><td>PA_A</td><td>0.0</td><td>0.0</td><td>72157</td><td>61638544</td><td>60214264</td><td>flow=From-Background-CVUT-Proxy</td></tr>\n",
"    <tr><th>2011-08-10 09:46:53.106431</th><td>2011/08/10 09:46:53.106431</td><td>507.347626</td><td>tcp</td><td>147.32.80.13</td><td>80</td><td>&lt;?&gt;</td><td>147.32.85.112</td><td>10885</td><td>FPA_FA</td><td>0.0</td><td>0.0</td><td>162760</td><td>137136528</td><td>132816366</td><td>flow=From-Background-CVUT-Proxy</td></tr>\n",
"    <tr><th>2011-08-10 09:46:53.346951</th><td>2011/08/10 09:46:53.346951</td><td>3598.887695</td><td>tcp</td><td>195.250.146.99</td><td>554</td><td>&lt;?&gt;</td><td>147.32.86.99</td><td>16786</td><td>PA_PA</td><td>0.0</td><td>0.0</td><td>51576</td><td>60964440</td><td>60363789</td><td>flow=Background</td></tr>\n",
"    <tr><th>2011-08-10 09:46:53.709666</th><td>2011/08/10 09:46:53.709666</td><td>3599.047607</td><td>tcp</td><td>195.250.146.6</td><td>554</td><td>&lt;?&gt;</td><td>147.32.84.59</td><td>49375</td><td>PA_PA</td><td>0.0</td><td>0.0</td><td>55068</td><td>61164753</td><td>60334389</td><td>flow=Background-Established-cmpgw-CVUT</td></tr>\n",
"    <tr><th>2011-08-10 09:46:58.680289</th><td>2011/08/10 09:46:58.680289</td><td>3591.222656</td><td>tcp</td><td>147.32.85.103</td><td>49317</td><td>&lt;?&gt;</td><td>88.159.8.10</td><td>22</td><td>PA_PA</td><td>0.0</td><td>0.0</td><td>105555</td><td>70963308</td><td>68636560</td><td>flow=Background</td></tr>\n",
"    <tr><th>2011-08-10 09:46:59.367587</th><td>2011/08/10 09:46:59.367587</td><td>3350.645508</td><td>tcp</td><td>147.32.87.5</td><td>524</td><td>?&gt;</td><td>147.32.85.2</td><td>49350</td><td>PA_</td><td>0.0</td><td>NaN</td><td>214827</td><td>248405120</td><td>248405120</td><td>flow=Background</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" StartTime Dur Proto \\\n",
"StartTime \n",
"2011-08-10 09:46:53.078297 2011/08/10 09:46:53.078297 3599.972412 tcp \n",
"2011-08-10 09:46:53.106431 2011/08/10 09:46:53.106431 507.347626 tcp \n",
"2011-08-10 09:46:53.346951 2011/08/10 09:46:53.346951 3598.887695 tcp \n",
"2011-08-10 09:46:53.709666 2011/08/10 09:46:53.709666 3599.047607 tcp \n",
"2011-08-10 09:46:58.680289 2011/08/10 09:46:58.680289 3591.222656 tcp \n",
"2011-08-10 09:46:59.367587 2011/08/10 09:46:59.367587 3350.645508 tcp \n",
"\n",
" SrcAddr Sport Dir DstAddr \\\n",
"StartTime \n",
"2011-08-10 09:46:53.078297 147.32.80.13 80 <?> 147.32.84.162 \n",
"2011-08-10 09:46:53.106431 147.32.80.13 80 <?> 147.32.85.112 \n",
"2011-08-10 09:46:53.346951 195.250.146.99 554 <?> 147.32.86.99 \n",
"2011-08-10 09:46:53.709666 195.250.146.6 554 <?> 147.32.84.59 \n",
"2011-08-10 09:46:58.680289 147.32.85.103 49317 <?> 88.159.8.10 \n",
"2011-08-10 09:46:59.367587 147.32.87.5 524 ?> 147.32.85.2 \n",
"\n",
" Dport State sTos dTos TotPkts TotBytes \\\n",
"StartTime \n",
"2011-08-10 09:46:53.078297 51769 PA_A 0.0 0.0 72157 61638544 \n",
"2011-08-10 09:46:53.106431 10885 FPA_FA 0.0 0.0 162760 137136528 \n",
"2011-08-10 09:46:53.346951 16786 PA_PA 0.0 0.0 51576 60964440 \n",
"2011-08-10 09:46:53.709666 49375 PA_PA 0.0 0.0 55068 61164753 \n",
"2011-08-10 09:46:58.680289 22 PA_PA 0.0 0.0 105555 70963308 \n",
"2011-08-10 09:46:59.367587 49350 PA_ 0.0 NaN 214827 248405120 \n",
"\n",
" SrcBytes Label \n",
"StartTime \n",
"2011-08-10 09:46:53.078297 60214264 flow=From-Background-CVUT-Proxy \n",
"2011-08-10 09:46:53.106431 132816366 flow=From-Background-CVUT-Proxy \n",
"2011-08-10 09:46:53.346951 60363789 flow=Background \n",
"2011-08-10 09:46:53.709666 60334389 flow=Background-Established-cmpgw-CVUT \n",
"2011-08-10 09:46:58.680289 68636560 flow=Background \n",
"2011-08-10 09:46:59.367587 248405120 flow=Background "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Show the anomalous netflows:\n",
"data[y_preds == -1]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}