{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![Banner](../media/banner2.png)\n", "\n", "---\n", "# Workshop 2.3: Advanced Pandas\n", "\n", "* **Contributors**:\n", " * Ashwin Patil (@ashwin)\n", " * Luis Francisco Monge Martinez (@LuckyLuke)\n", " * Jose Rodriguez (@Cyb3rPandah)\n", " * Ian Hellen (@ianhellen)\n", "

\n", "* **Agenda**:\n", " * [Joins and merges](#joins)\n", " * [Using styles](#styles)\n", " * [Reshaping/preprocessing data](#reshaping)\n", " * [Pivot tables](#pivots)\n", " * [Time manipulation](#time)\n", " * [Other useful operations](#otherops)\n", "

\n", "* **Notebook**: https://aka.ms/Jupyterthon-ws-2-3\n", "* **License**: [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/)\n", "\n", "* **Q&A** - OTR Discord **#Jupyterthon #WORKSHOP DAY 2 - ADVANCED PANDAS**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "# Joins and merges [Ashwin]\n", "\n", "**Pandas UserGuide :** [Merge, join, concatenate and compare](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ### Load some data and normalize it into:\n", " - Processes\n", " - ParentProcesses\n", " - Users" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:56:44.269255Z", "start_time": "2021-11-29T16:56:43.972401Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "original 117\n", "procs 117\n", "parents 3\n", "users 2\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NewProcessIdNewProcessNameCommandLineParentProcessIdTimeCreatedUtcSubjectUserSid
00x1580C:\\Diagnostics\\UserTmp\\ftp.exe.\\ftp -s:C:\\RECYCLER\\xxppyy.exe0xbc82019-01-15 05:15:15.677S-1-5-21-996632719-2361334927-4038480536-500
10x16fcC:\\Diagnostics\\UserTmp\\reg.exe.\\reg not /domain:everything that /sid:shines...0xbc82019-01-15 05:15:16.167S-1-5-21-996632719-2361334927-4038480536-500
20x1700C:\\Diagnostics\\UserTmp\\cmd.execmd /c \"systeminfo && systeminfo\"0xbc82019-01-15 05:15:16.277S-1-5-21-996632719-2361334927-4038480536-500
\n", "
" ], "text/plain": [ " NewProcessId NewProcessName \\\n", "0 0x1580 C:\\Diagnostics\\UserTmp\\ftp.exe \n", "1 0x16fc C:\\Diagnostics\\UserTmp\\reg.exe \n", "2 0x1700 C:\\Diagnostics\\UserTmp\\cmd.exe \n", "\n", " CommandLine ParentProcessId \\\n", "0 .\\ftp -s:C:\\RECYCLER\\xxppyy.exe 0xbc8 \n", "1 .\\reg not /domain:everything that /sid:shines... 0xbc8 \n", "2 cmd /c \"systeminfo && systeminfo\" 0xbc8 \n", "\n", " TimeCreatedUtc SubjectUserSid \n", "0 2019-01-15 05:15:15.677 S-1-5-21-996632719-2361334927-4038480536-500 \n", "1 2019-01-15 05:15:16.167 S-1-5-21-996632719-2361334927-4038480536-500 \n", "2 2019-01-15 05:15:16.277 S-1-5-21-996632719-2361334927-4038480536-500 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ProcessIdParentProcessName
00xbc8C:\\Windows\\System32\\cmd.exe
1150x440C:\\Windows\\System32\\svchost.exe
1160x1580C:\\Diagnostics\\UserTmp\\powershell.exe
\n", "
" ], "text/plain": [ " ProcessId ParentProcessName\n", "0 0xbc8 C:\\Windows\\System32\\cmd.exe\n", "115 0x440 C:\\Windows\\System32\\svchost.exe\n", "116 0x1580 C:\\Diagnostics\\UserTmp\\powershell.exe" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SubjectUserSidSubjectUserNameSubjectDomainName
0S-1-5-21-996632719-2361334927-4038480536-500MSTICAdminMSTICAlertsWin1
115S-1-5-18MSTICAlertsWin1$WORKGROUP
\n", "
" ], "text/plain": [ " SubjectUserSid SubjectUserName \\\n", "0 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin \n", "115 S-1-5-18 MSTICAlertsWin1$ \n", "\n", " SubjectDomainName \n", "0 MSTICAlertsWin1 \n", "115 WORKGROUP " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pandas as pd\n", "\n", "procs_df = pd.read_csv(\n", " \"../data/process_tree.csv\",\n", " parse_dates=[\"TimeCreatedUtc\", \"TimeGenerated\"],\n", " index_col=0\n", ")\n", "parents = procs_df[[\"ProcessId\", \"ParentProcessName\"]].drop_duplicates()\n", "procs = (\n", " procs_df[[\"NewProcessId\", \"NewProcessName\", \"CommandLine\", \"ProcessId\", \"TimeCreatedUtc\", \"SubjectUserSid\"]]\n", " .drop_duplicates()\n", " .rename(columns={\"ProcessId\": \"ParentProcessId\"})\n", ")\n", "users = procs_df[['SubjectUserSid', 'SubjectUserName', 'SubjectDomainName']].drop_duplicates()\n", "\n", "print(\"original\", len(procs_df))\n", "print(\"procs\", len(procs))\n", "print(\"parents\", len(parents))\n", "print(\"users\", len(users))\n", "display(procs.head(3))\n", "display(parents)\n", "display(users)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joining on Index using pd.concat\n", "\n", "

\n", "pd.concat([df1, df2...])\n", "

\n", "\n", "We saw using pd.concat to append rows in part 1" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:56:45.380847Z", "start_time": "2021-11-29T16:56:45.338334Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SubjectLogonId_dec
016428071
116428071
216428071
316428071
416428071
\n", "
" ], "text/plain": [ " SubjectLogonId_dec\n", "0 16428071\n", "1 16428071\n", "2 16428071\n", "3 16428071\n", "4 16428071" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Do some processing on the original DF\n", "dec_logon_id = (\n", " pd.DataFrame(procs_df.SubjectLogonId.apply(lambda x: int(x, base=16)))\n", " .rename(columns={\"SubjectLogonId\": \"SubjectLogonId_dec\"})\n", ")\n", "\n", "dec_logon_id.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### pd.concat with `axis=\"columns\"` or `axis=1` joins column-wise (horizontally)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:56:46.566520Z", "start_time": "2021-11-29T16:56:46.516902Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SubjectUserSidSubjectUserNameSubjectDomainNameSubjectLogonIdNewProcessIdNewProcessNameProcessIdParentProcessNameSubjectLogonId_dec
0S-1-5-21-996632719-2361334927-4038480536-500MSTICAdminMSTICAlertsWin10xfaac270x1580C:\\Diagnostics\\UserTmp\\ftp.exe0xbc8C:\\Windows\\System32\\cmd.exe16428071
1S-1-5-21-996632719-2361334927-4038480536-500MSTICAdminMSTICAlertsWin10xfaac270x16fcC:\\Diagnostics\\UserTmp\\reg.exe0xbc8C:\\Windows\\System32\\cmd.exe16428071
2S-1-5-21-996632719-2361334927-4038480536-500MSTICAdminMSTICAlertsWin10xfaac270x1700C:\\Diagnostics\\UserTmp\\cmd.exe0xbc8C:\\Windows\\System32\\cmd.exe16428071
3S-1-5-21-996632719-2361334927-4038480536-500MSTICAdminMSTICAlertsWin10xfaac270x1728C:\\Diagnostics\\UserTmp\\rundll32.exe0xbc8C:\\Windows\\System32\\cmd.exe16428071
4S-1-5-21-996632719-2361334927-4038480536-500MSTICAdminMSTICAlertsWin10xfaac270x175cC:\\Diagnostics\\UserTmp\\rundll32.exe0xbc8C:\\Windows\\System32\\cmd.exe16428071
\n", "
" ], "text/plain": [ " SubjectUserSid SubjectUserName \\\n", "0 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin \n", "1 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin \n", "2 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin \n", "3 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin \n", "4 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin \n", "\n", " SubjectDomainName SubjectLogonId NewProcessId \\\n", "0 MSTICAlertsWin1 0xfaac27 0x1580 \n", "1 MSTICAlertsWin1 0xfaac27 0x16fc \n", "2 MSTICAlertsWin1 0xfaac27 0x1700 \n", "3 MSTICAlertsWin1 0xfaac27 0x1728 \n", "4 MSTICAlertsWin1 0xfaac27 0x175c \n", "\n", " NewProcessName ProcessId ParentProcessName \\\n", "0 C:\\Diagnostics\\UserTmp\\ftp.exe 0xbc8 C:\\Windows\\System32\\cmd.exe \n", "1 C:\\Diagnostics\\UserTmp\\reg.exe 0xbc8 C:\\Windows\\System32\\cmd.exe \n", "2 C:\\Diagnostics\\UserTmp\\cmd.exe 0xbc8 C:\\Windows\\System32\\cmd.exe \n", "3 C:\\Diagnostics\\UserTmp\\rundll32.exe 0xbc8 C:\\Windows\\System32\\cmd.exe \n", "4 C:\\Diagnostics\\UserTmp\\rundll32.exe 0xbc8 C:\\Windows\\System32\\cmd.exe \n", "\n", " SubjectLogonId_dec \n", "0 16428071 \n", "1 16428071 \n", "2 16428071 \n", "3 16428071 \n", "4 16428071 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " pd.concat([procs_df, dec_logon_id], axis=\"columns\")\n", " .head()\n", " .filter(regex=\".*Process.*|Sub.*\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Key-based Joins\n", "\n", "

\n", "df1.merge(df2, ...)
\n", "df1.join(df2, ...)\n", "

\n", "\n", "\n", "Source tables" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:56:47.814523Z", "start_time": "2021-11-29T16:56:47.770687Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NewProcessIdNewProcessNameCommandLineParentProcessIdTimeCreatedUtcSubjectUserSid
00x1580C:\\Diagnostics\\UserTmp\\ftp.exe.\\ftp -s:C:\\RECYCLER\\xxppyy.exe0xbc82019-01-15 05:15:15.677S-1-5-21-996632719-2361334927-4038480536-500
10x16fcC:\\Diagnostics\\UserTmp\\reg.exe.\\reg not /domain:everything that /sid:shines...0xbc82019-01-15 05:15:16.167S-1-5-21-996632719-2361334927-4038480536-500
20x1700C:\\Diagnostics\\UserTmp\\cmd.execmd /c \"systeminfo && systeminfo\"0xbc82019-01-15 05:15:16.277S-1-5-21-996632719-2361334927-4038480536-500
30x1728C:\\Diagnostics\\UserTmp\\rundll32.exe.\\rundll32 /C 12345.exe0xbc82019-01-15 05:15:16.340S-1-5-21-996632719-2361334927-4038480536-500
40x175cC:\\Diagnostics\\UserTmp\\rundll32.exe.\\rundll32 /C c:\\users\\MSTICAdmin\\12345.exe0xbc82019-01-15 05:15:16.400S-1-5-21-996632719-2361334927-4038480536-500
\n", "
" ], "text/plain": [ " NewProcessId NewProcessName \\\n", "0 0x1580 C:\\Diagnostics\\UserTmp\\ftp.exe \n", "1 0x16fc C:\\Diagnostics\\UserTmp\\reg.exe \n", "2 0x1700 C:\\Diagnostics\\UserTmp\\cmd.exe \n", "3 0x1728 C:\\Diagnostics\\UserTmp\\rundll32.exe \n", "4 0x175c C:\\Diagnostics\\UserTmp\\rundll32.exe \n", "\n", " CommandLine ParentProcessId \\\n", "0 .\\ftp -s:C:\\RECYCLER\\xxppyy.exe 0xbc8 \n", "1 .\\reg not /domain:everything that /sid:shines... 0xbc8 \n", "2 cmd /c \"systeminfo && systeminfo\" 0xbc8 \n", "3 .\\rundll32 /C 12345.exe 0xbc8 \n", "4 .\\rundll32 /C c:\\users\\MSTICAdmin\\12345.exe 0xbc8 \n", "\n", " TimeCreatedUtc SubjectUserSid \n", "0 2019-01-15 05:15:15.677 S-1-5-21-996632719-2361334927-4038480536-500 \n", "1 2019-01-15 05:15:16.167 S-1-5-21-996632719-2361334927-4038480536-500 \n", "2 2019-01-15 05:15:16.277 S-1-5-21-996632719-2361334927-4038480536-500 \n", "3 2019-01-15 05:15:16.340 S-1-5-21-996632719-2361334927-4038480536-500 \n", "4 2019-01-15 05:15:16.400 S-1-5-21-996632719-2361334927-4038480536-500 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SubjectUserSidSubjectUserNameSubjectDomainName
0S-1-5-21-996632719-2361334927-4038480536-500MSTICAdminMSTICAlertsWin1
115S-1-5-18MSTICAlertsWin1$WORKGROUP
\n", "
" ], "text/plain": [ " SubjectUserSid SubjectUserName \\\n", "0 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin \n", "115 S-1-5-18 MSTICAlertsWin1$ \n", "\n", " SubjectDomainName \n", "0 MSTICAlertsWin1 \n", "115 WORKGROUP " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(procs.head())\n", "display(users)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Simple merge on common key" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:56:48.928329Z", "start_time": "2021-11-29T16:56:48.886277Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NewProcessIdNewProcessNameCommandLineParentProcessIdTimeCreatedUtcSubjectUserSidSubjectUserNameSubjectDomainName
00x1580C:\\Diagnostics\\UserTmp\\ftp.exe.\\ftp -s:C:\\RECYCLER\\xxppyy.exe0xbc82019-01-15 05:15:15.677S-1-5-21-996632719-2361334927-4038480536-500MSTICAdminMSTICAlertsWin1
10x16fcC:\\Diagnostics\\UserTmp\\reg.exe.\\reg not /domain:everything that /sid:shines...0xbc82019-01-15 05:15:16.167S-1-5-21-996632719-2361334927-4038480536-500MSTICAdminMSTICAlertsWin1
20x1700C:\\Diagnostics\\UserTmp\\cmd.execmd /c \"systeminfo && systeminfo\"0xbc82019-01-15 05:15:16.277S-1-5-21-996632719-2361334927-4038480536-500MSTICAdminMSTICAlertsWin1
30x1728C:\\Diagnostics\\UserTmp\\rundll32.exe.\\rundll32 /C 12345.exe0xbc82019-01-15 05:15:16.340S-1-5-21-996632719-2361334927-4038480536-500MSTICAdminMSTICAlertsWin1
40x175cC:\\Diagnostics\\UserTmp\\rundll32.exe.\\rundll32 /C c:\\users\\MSTICAdmin\\12345.exe0xbc82019-01-15 05:15:16.400S-1-5-21-996632719-2361334927-4038480536-500MSTICAdminMSTICAlertsWin1
...........................
1120x1434C:\\Diagnostics\\UserTmp\\rundll32.exe.\\rundll32.exe /C c:\\windows\\fonts\\conhost.exe0xbc82019-01-15 05:15:14.613S-1-5-21-996632719-2361334927-4038480536-500MSTICAdminMSTICAlertsWin1
1130x123cC:\\Diagnostics\\UserTmp\\regsvr32.exe.\\regsvr32 /u /s c:\\windows\\fonts\\csrss.exe0xbc82019-01-15 05:15:14.693S-1-5-21-996632719-2361334927-4038480536-500MSTICAdminMSTICAlertsWin1
1140x240C:\\Windows\\System32\\tasklist.exetasklist0xbc82019-01-15 05:15:14.770S-1-5-21-996632719-2361334927-4038480536-500MSTICAdminMSTICAlertsWin1
1150x15a0C:\\Windows\\System32\\win32calc.exe\"C:\\Windows\\System32\\win32calc.exe\"0x15802019-01-15 05:15:13.053S-1-5-21-996632719-2361334927-4038480536-500MSTICAdminMSTICAlertsWin1
1160xbc8C:\\Windows\\System32\\cmd.execmd.exe /c c:\\Diagnostics\\WindowsSimulateDetec...0x4402019-01-15 05:15:03.047S-1-5-18MSTICAlertsWin1$WORKGROUP
\n", "

117 rows × 8 columns

\n", "
" ], "text/plain": [ " NewProcessId NewProcessName \\\n", "0 0x1580 C:\\Diagnostics\\UserTmp\\ftp.exe \n", "1 0x16fc C:\\Diagnostics\\UserTmp\\reg.exe \n", "2 0x1700 C:\\Diagnostics\\UserTmp\\cmd.exe \n", "3 0x1728 C:\\Diagnostics\\UserTmp\\rundll32.exe \n", "4 0x175c C:\\Diagnostics\\UserTmp\\rundll32.exe \n", ".. ... ... \n", "112 0x1434 C:\\Diagnostics\\UserTmp\\rundll32.exe \n", "113 0x123c C:\\Diagnostics\\UserTmp\\regsvr32.exe \n", "114 0x240 C:\\Windows\\System32\\tasklist.exe \n", "115 0x15a0 C:\\Windows\\System32\\win32calc.exe \n", "116 0xbc8 C:\\Windows\\System32\\cmd.exe \n", "\n", " CommandLine ParentProcessId \\\n", "0 .\\ftp -s:C:\\RECYCLER\\xxppyy.exe 0xbc8 \n", "1 .\\reg not /domain:everything that /sid:shines... 0xbc8 \n", "2 cmd /c \"systeminfo && systeminfo\" 0xbc8 \n", "3 .\\rundll32 /C 12345.exe 0xbc8 \n", "4 .\\rundll32 /C c:\\users\\MSTICAdmin\\12345.exe 0xbc8 \n", ".. ... ... \n", "112 .\\rundll32.exe /C c:\\windows\\fonts\\conhost.exe 0xbc8 \n", "113 .\\regsvr32 /u /s c:\\windows\\fonts\\csrss.exe 0xbc8 \n", "114 tasklist 0xbc8 \n", "115 \"C:\\Windows\\System32\\win32calc.exe\" 0x1580 \n", "116 cmd.exe /c c:\\Diagnostics\\WindowsSimulateDetec... 0x440 \n", "\n", " TimeCreatedUtc SubjectUserSid \\\n", "0 2019-01-15 05:15:15.677 S-1-5-21-996632719-2361334927-4038480536-500 \n", "1 2019-01-15 05:15:16.167 S-1-5-21-996632719-2361334927-4038480536-500 \n", "2 2019-01-15 05:15:16.277 S-1-5-21-996632719-2361334927-4038480536-500 \n", "3 2019-01-15 05:15:16.340 S-1-5-21-996632719-2361334927-4038480536-500 \n", "4 2019-01-15 05:15:16.400 S-1-5-21-996632719-2361334927-4038480536-500 \n", ".. ... ... \n", "112 2019-01-15 05:15:14.613 S-1-5-21-996632719-2361334927-4038480536-500 \n", "113 2019-01-15 05:15:14.693 S-1-5-21-996632719-2361334927-4038480536-500 \n", "114 2019-01-15 05:15:14.770 S-1-5-21-996632719-2361334927-4038480536-500 \n", "115 2019-01-15 05:15:13.053 S-1-5-21-996632719-2361334927-4038480536-500 \n", "116 2019-01-15 05:15:03.047 S-1-5-18 \n", "\n", " SubjectUserName SubjectDomainName \n", "0 MSTICAdmin MSTICAlertsWin1 \n", "1 MSTICAdmin MSTICAlertsWin1 \n", "2 MSTICAdmin MSTICAlertsWin1 \n", "3 MSTICAdmin MSTICAlertsWin1 \n", "4 MSTICAdmin MSTICAlertsWin1 \n", ".. ... ... \n", "112 MSTICAdmin MSTICAlertsWin1 \n", "113 MSTICAdmin MSTICAlertsWin1 \n", "114 MSTICAdmin MSTICAlertsWin1 \n", "115 MSTICAdmin MSTICAlertsWin1 \n", "116 MSTICAlertsWin1$ WORKGROUP \n", "\n", "[117 rows x 8 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "procs.merge(users, on=\"SubjectUserSid\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Left joins (also right and outer)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:56:50.490622Z", "start_time": "2021-11-29T16:56:50.460080Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NewProcessIdNewProcessNameCommandLineParentProcessIdTimeCreatedUtcSubjectUserSidSubjectUserNameSubjectDomainName
00xbc8C:\\Windows\\System32\\cmd.execmd.exe /c c:\\Diagnostics\\WindowsSimulateDetec...0x4402019-01-15 05:15:03.047S-1-5-18MSTICAlertsWin1$WORKGROUP
\n", "
" ], "text/plain": [ " NewProcessId NewProcessName \\\n", "0 0xbc8 C:\\Windows\\System32\\cmd.exe \n", "\n", " CommandLine ParentProcessId \\\n", "0 cmd.exe /c c:\\Diagnostics\\WindowsSimulateDetec... 0x440 \n", "\n", " TimeCreatedUtc SubjectUserSid SubjectUserName SubjectDomainName \n", "0 2019-01-15 05:15:03.047 S-1-5-18 MSTICAlertsWin1$ WORKGROUP " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "procs.merge(users[1:], on=\"SubjectUserSid\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:56:51.087997Z", "start_time": "2021-11-29T16:56:51.043723Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NewProcessIdNewProcessNameCommandLineParentProcessIdTimeCreatedUtcSubjectUserSidSubjectUserNameSubjectDomainName
00x1580C:\\Diagnostics\\UserTmp\\ftp.exe.\\ftp -s:C:\\RECYCLER\\xxppyy.exe0xbc82019-01-15 05:15:15.677S-1-5-21-996632719-2361334927-4038480536-500NaNNaN
10x16fcC:\\Diagnostics\\UserTmp\\reg.exe.\\reg not /domain:everything that /sid:shines...0xbc82019-01-15 05:15:16.167S-1-5-21-996632719-2361334927-4038480536-500NaNNaN
20x1700C:\\Diagnostics\\UserTmp\\cmd.execmd /c \"systeminfo && systeminfo\"0xbc82019-01-15 05:15:16.277S-1-5-21-996632719-2361334927-4038480536-500NaNNaN
30x1728C:\\Diagnostics\\UserTmp\\rundll32.exe.\\rundll32 /C 12345.exe0xbc82019-01-15 05:15:16.340S-1-5-21-996632719-2361334927-4038480536-500NaNNaN
40x175cC:\\Diagnostics\\UserTmp\\rundll32.exe.\\rundll32 /C c:\\users\\MSTICAdmin\\12345.exe0xbc82019-01-15 05:15:16.400S-1-5-21-996632719-2361334927-4038480536-500NaNNaN
...........................
1120x1434C:\\Diagnostics\\UserTmp\\rundll32.exe.\\rundll32.exe /C c:\\windows\\fonts\\conhost.exe0xbc82019-01-15 05:15:14.613S-1-5-21-996632719-2361334927-4038480536-500NaNNaN
1130x123cC:\\Diagnostics\\UserTmp\\regsvr32.exe.\\regsvr32 /u /s c:\\windows\\fonts\\csrss.exe0xbc82019-01-15 05:15:14.693S-1-5-21-996632719-2361334927-4038480536-500NaNNaN
1140x240C:\\Windows\\System32\\tasklist.exetasklist0xbc82019-01-15 05:15:14.770S-1-5-21-996632719-2361334927-4038480536-500NaNNaN
1150xbc8C:\\Windows\\System32\\cmd.execmd.exe /c c:\\Diagnostics\\WindowsSimulateDetec...0x4402019-01-15 05:15:03.047S-1-5-18MSTICAlertsWin1$WORKGROUP
1160x15a0C:\\Windows\\System32\\win32calc.exe\"C:\\Windows\\System32\\win32calc.exe\"0x15802019-01-15 05:15:13.053S-1-5-21-996632719-2361334927-4038480536-500NaNNaN
\n", "

117 rows × 8 columns

\n", "
" ], "text/plain": [ " NewProcessId NewProcessName \\\n", "0 0x1580 C:\\Diagnostics\\UserTmp\\ftp.exe \n", "1 0x16fc C:\\Diagnostics\\UserTmp\\reg.exe \n", "2 0x1700 C:\\Diagnostics\\UserTmp\\cmd.exe \n", "3 0x1728 C:\\Diagnostics\\UserTmp\\rundll32.exe \n", "4 0x175c C:\\Diagnostics\\UserTmp\\rundll32.exe \n", ".. ... ... \n", "112 0x1434 C:\\Diagnostics\\UserTmp\\rundll32.exe \n", "113 0x123c C:\\Diagnostics\\UserTmp\\regsvr32.exe \n", "114 0x240 C:\\Windows\\System32\\tasklist.exe \n", "115 0xbc8 C:\\Windows\\System32\\cmd.exe \n", "116 0x15a0 C:\\Windows\\System32\\win32calc.exe \n", "\n", " CommandLine ParentProcessId \\\n", "0 .\\ftp -s:C:\\RECYCLER\\xxppyy.exe 0xbc8 \n", "1 .\\reg not /domain:everything that /sid:shines... 0xbc8 \n", "2 cmd /c \"systeminfo && systeminfo\" 0xbc8 \n", "3 .\\rundll32 /C 12345.exe 0xbc8 \n", "4 .\\rundll32 /C c:\\users\\MSTICAdmin\\12345.exe 0xbc8 \n", ".. ... ... \n", "112 .\\rundll32.exe /C c:\\windows\\fonts\\conhost.exe 0xbc8 \n", "113 .\\regsvr32 /u /s c:\\windows\\fonts\\csrss.exe 0xbc8 \n", "114 tasklist 0xbc8 \n", "115 cmd.exe /c c:\\Diagnostics\\WindowsSimulateDetec... 0x440 \n", "116 \"C:\\Windows\\System32\\win32calc.exe\" 0x1580 \n", "\n", " TimeCreatedUtc SubjectUserSid \\\n", "0 2019-01-15 05:15:15.677 S-1-5-21-996632719-2361334927-4038480536-500 \n", "1 2019-01-15 05:15:16.167 S-1-5-21-996632719-2361334927-4038480536-500 \n", "2 2019-01-15 05:15:16.277 S-1-5-21-996632719-2361334927-4038480536-500 \n", "3 2019-01-15 05:15:16.340 S-1-5-21-996632719-2361334927-4038480536-500 \n", "4 2019-01-15 05:15:16.400 S-1-5-21-996632719-2361334927-4038480536-500 \n", ".. ... ... \n", "112 2019-01-15 05:15:14.613 S-1-5-21-996632719-2361334927-4038480536-500 \n", "113 2019-01-15 05:15:14.693 S-1-5-21-996632719-2361334927-4038480536-500 \n", "114 2019-01-15 05:15:14.770 S-1-5-21-996632719-2361334927-4038480536-500 \n", "115 2019-01-15 05:15:03.047 S-1-5-18 \n", "116 2019-01-15 05:15:13.053 S-1-5-21-996632719-2361334927-4038480536-500 \n", "\n", " SubjectUserName SubjectDomainName \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN \n", ".. ... ... \n", "112 NaN NaN \n", "113 NaN NaN \n", "114 NaN NaN \n", "115 MSTICAlertsWin1$ WORKGROUP \n", "116 NaN NaN \n", "\n", "[117 rows x 8 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "procs.merge(users[1:], on=\"SubjectUserSid\", how=\"left\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Joins where no common key" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:03.202101Z", "start_time": "2021-11-29T16:57:03.179540Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NewProcessIdNewProcessNameParentProcessIdProcessIdParentProcessName
00x1580C:\\Diagnostics\\UserTmp\\ftp.exe0xbc80xbc8C:\\Windows\\System32\\cmd.exe
10x16fcC:\\Diagnostics\\UserTmp\\reg.exe0xbc80xbc8C:\\Windows\\System32\\cmd.exe
20x1700C:\\Diagnostics\\UserTmp\\cmd.exe0xbc80xbc8C:\\Windows\\System32\\cmd.exe
30x1728C:\\Diagnostics\\UserTmp\\rundll32.exe0xbc80xbc8C:\\Windows\\System32\\cmd.exe
40x175cC:\\Diagnostics\\UserTmp\\rundll32.exe0xbc80xbc8C:\\Windows\\System32\\cmd.exe
\n", "
" ], "text/plain": [ " NewProcessId NewProcessName ParentProcessId ProcessId \\\n", "0 0x1580 C:\\Diagnostics\\UserTmp\\ftp.exe 0xbc8 0xbc8 \n", "1 0x16fc C:\\Diagnostics\\UserTmp\\reg.exe 0xbc8 0xbc8 \n", "2 0x1700 C:\\Diagnostics\\UserTmp\\cmd.exe 0xbc8 0xbc8 \n", "3 0x1728 C:\\Diagnostics\\UserTmp\\rundll32.exe 0xbc8 0xbc8 \n", "4 0x175c C:\\Diagnostics\\UserTmp\\rundll32.exe 0xbc8 0xbc8 \n", "\n", " ParentProcessName \n", "0 C:\\Windows\\System32\\cmd.exe \n", "1 C:\\Windows\\System32\\cmd.exe \n", "2 C:\\Windows\\System32\\cmd.exe \n", "3 C:\\Windows\\System32\\cmd.exe \n", "4 C:\\Windows\\System32\\cmd.exe " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " procs.merge(parents, left_on=\"ParentProcessId\", right_on=\"ProcessId\")\n", " .head()\n", " .filter(regex=\".*Process.*\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "# Using Styles [Ian]\n", "\n", "- Max/min values \n", "- Value coloring \n", "- Inline bars \n", "\n", "

\n", "df.style(...)\n", "

" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:05.125828Z", "start_time": "2021-11-29T16:57:05.063273Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FlowsSumFlowsVarFlowsStdDevL7Prots
RemoteRegion
814.057.2670277.5674986
canadacentral5103.029.8112235.4599651
centralus236.04.6758972.1623821
eastus602.01.6461541.2830253
eastus21502.04.8309142.1979341
northeurope82.00.4924380.7017391
southcentralus817.08.8821862.9803001
westcentralus59.00.0172410.1313061
westus38.00.7826090.8846521
westus27.00.3000000.5477231
\n", "
" ], "text/plain": [ " FlowsSum FlowsVar FlowsStdDev L7Prots\n", "RemoteRegion \n", " 814.0 57.267027 7.567498 6\n", "canadacentral 5103.0 29.811223 5.459965 1\n", "centralus 236.0 4.675897 2.162382 1\n", "eastus 602.0 1.646154 1.283025 3\n", "eastus2 1502.0 4.830914 2.197934 1\n", "northeurope 82.0 0.492438 0.701739 1\n", "southcentralus 817.0 8.882186 2.980300 1\n", "westcentralus 59.0 0.017241 0.131306 1\n", "westus 38.0 0.782609 0.884652 1\n", "westus2 7.0 0.300000 0.547723 1" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df = pd.read_pickle(\"../data/az_net_comms_df.pkl\")\n", "\n", "# Generate a summary\n", "summary_df = (\n", " net_df[[\"RemoteRegion\", \"TotalAllowedFlows\", \"L7Protocol\"]]\n", " .groupby(\"RemoteRegion\")\n", " .agg(\n", " FlowsSum = pd.NamedAgg(\"TotalAllowedFlows\", \"sum\"),\n", " FlowsVar = pd.NamedAgg(\"TotalAllowedFlows\", \"var\"),\n", " FlowsStdDev = pd.NamedAgg(\"TotalAllowedFlows\", \"std\"),\n", " L7Prots = pd.NamedAgg(\"L7Protocol\", \"nunique\"),\n", " )\n", ")\n", "summary_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### highlight_max/highlight_mix\n", "\n", "

\n", "df.style.highlight_max(...)\n", "

" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:06.657172Z", "start_time": "2021-11-29T16:57:05.656786Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FlowsSum FlowsVar FlowsStdDev L7Prots
RemoteRegion
814.00000057.2670277.5674986
canadacentral5103.00000029.8112235.4599651
centralus236.0000004.6758972.1623821
eastus602.0000001.6461541.2830253
eastus21502.0000004.8309142.1979341
northeurope82.0000000.4924380.7017391
southcentralus817.0000008.8821862.9803001
westcentralus59.0000000.0172410.1313061
westus38.0000000.7826090.8846521
westus27.0000000.3000000.5477231
" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_style = summary_df.style.highlight_max(color=\"blue\").highlight_min(color=\"green\")\n", "df_style" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Color gradients\n", "\n", "

\n", "df.style.background_gradient(...)\n", "

" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:07.227310Z", "start_time": "2021-11-29T16:57:06.827914Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FlowsSum FlowsVar FlowsStdDev L7Prots
RemoteRegion
814.057.37.66.0
canadacentral5103.029.85.51.0
centralus236.04.72.21.0
eastus602.01.61.33.0
eastus21502.04.82.21.0
northeurope82.00.50.71.0
southcentralus817.08.93.01.0
westcentralus59.00.00.11.0
westus38.00.80.91.0
westus27.00.30.51.0
" ], "text/plain": [ "" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import seaborn as sns\n", "cm = sns.light_palette(\"blue\", as_cmap=True)\n", "\n", "summary_df.style.background_gradient(cmap=cm).format(\"{:.1f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inline bars\n", "\n", "

\n", "df.style.bar(...)\n", "

\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:07.430565Z", "start_time": "2021-11-29T16:57:07.410341Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FlowsSum FlowsVar FlowsStdDev L7Prots
RemoteRegion
814.0057.277.576.00
canadacentral5103.0029.815.461.00
centralus236.004.682.161.00
eastus602.001.651.283.00
eastus21502.004.832.201.00
northeurope82.000.490.701.00
southcentralus817.008.882.981.00
westcentralus59.000.020.131.00
westus38.000.780.881.00
westus27.000.300.551.00
" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary_df.style.bar(color=\"blue\").format(\"{:.2f}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 59, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:07.871292Z", "start_time": "2021-11-29T16:57:07.617135Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FlowsSum FlowsVar FlowsStdDev L7Prots
RemoteRegion
814.0057.277.576.00
canadacentral5103.0029.815.461.00
centralus236.004.682.161.00
eastus602.001.651.283.00
eastus21502.004.832.201.00
northeurope82.000.490.701.00
southcentralus817.008.882.981.00
westcentralus59.000.020.131.00
westus38.000.780.881.00
westus27.000.300.551.00
" ], "text/plain": [ "" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary_df.style.set_properties(**{\n", " 'background-color': 'black',\n", " 'color': 'lawngreen',\n", " 'font-family': 'consolas',\n", "}).format(\"{:.2f}\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "# Reshaping/preprocessing data? [Ian] \n", "\n", "- Dealing with nulls/NAs \n", "- Type conversion \n", "- Renaming columns\n", "- Pandas operations: melt, explode, transpose, indexing/stack/unstack \n", "- Dealing with complex Python objects - explode \n", "- Tidy data - melt \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dealing with nulls/NAs\n", "\n", "[Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data)\n", "\n", "pandas primarily uses `NaN` to represent missing data which is of `floattype`. \n", "IEEE 754 floating point representation of Not a Number (NaN).\n", "\n", "Tip: Often you will see TypeError exceptions about not being able to \n", "perform an expected operation on a float (when you were expecting the type\n", "to be a string or other object type). This is very likely due NaNs in your data.\n", "\n", "Also NaT - is the equivalent of NaN for DateTime data.\n", " \n", "Sometimes python also raises `None` for missing data. `NoneType` object. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:10.316650Z", "start_time": "2021-11-29T16:57:10.308913Z" } }, "outputs": [ { "data": { "text/plain": [ "1360" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "net_df = pd.read_pickle(\"../data/az_net_comms_df.pkl\")\n", "len(net_df)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:13.652979Z", "start_time": "2021-11-29T16:57:13.624651Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Null elements in DataFrame: 24 \n", "Rows with null elements: 8\n" ] } ], "source": [ "print(f\"Null elements in DataFrame: {net_df.isnull().values.sum()} \\n\\\n", "Rows with null elements: {net_df.shape[0] - net_df.dropna().shape[0]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Which columns have NAs?\n", "\n", "

\n", "df.isna()
\n", "series.isna()\n", "

" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:14.126373Z", "start_time": "2021-11-29T16:57:14.098944Z" } }, "outputs": [ { "data": { "text/plain": [ "TimeGenerated False\n", "FlowStartTime False\n", "FlowEndTime False\n", "FlowIntervalEndTime False\n", "FlowType False\n", "ResourceGroup True\n", "VMName True\n", "VMIPAddress False\n", "PublicIPs True\n", "SrcIP False\n", "DestIP False\n", "L4Protocol False\n", "L7Protocol False\n", "DestPort False\n", "FlowDirection False\n", "AllowedOutFlows False\n", "AllowedInFlows False\n", "DeniedInFlows False\n", "DeniedOutFlows False\n", "RemoteRegion False\n", "VMRegion False\n", "AllExtIPs False\n", "TotalAllowedFlows False\n", "dtype: bool" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df.isna().any()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:14.719395Z", "start_time": "2021-11-29T16:57:14.710600Z" } }, "outputs": [ { "data": { "text/plain": [ "asihuntomsworkspacerg 1352\n", "Name: ResourceGroup, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df.ResourceGroup.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filtering to see which columns have NaNs\n", "\n", "You can use `.isna()` on the whole DataFrame or a single column." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:15.475287Z", "start_time": "2021-11-29T16:57:15.431680Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeGeneratedFlowStartTimeFlowEndTimeFlowIntervalEndTimeFlowTypeResourceGroupVMNameVMIPAddressPublicIPsSrcIP...DestPortFlowDirectionAllowedOutFlowsAllowedInFlowsDeniedInFlowsDeniedOutFlowsRemoteRegionVMRegionAllExtIPsTotalAllowedFlows
3262019-02-13 01:23:47.6342019-02-13 00:55:102019-02-13 00:57:332019-02-13 01:00:00IntraVNetNoneNoneNone10.0.3.4...445.0O3.00.00.00.0eastus10.0.3.53.0
3272019-02-13 01:23:47.6342019-02-13 00:55:102019-02-13 00:57:332019-02-13 01:00:00IntraVNetNoneNoneNone10.0.3.4...445.0I0.03.00.00.0eastus10.0.3.43.0
3362019-02-12 22:23:05.3992019-02-12 21:55:432019-02-12 21:55:432019-02-12 22:00:00IntraVNetNoneNoneNone10.0.3.5...22.0O1.00.00.00.0eastus10.0.3.41.0
3452019-02-12 22:23:05.3842019-02-12 21:54:042019-02-12 21:55:362019-02-12 22:00:00IntraVNetNoneNoneNone10.0.3.5...22.0O6.00.00.00.0eastus104.211.30.16.0
3572019-02-12 23:23:59.5152019-02-12 22:22:352019-02-12 22:55:372019-02-12 23:00:00IntraVNetNoneNoneNone10.0.3.5...22.0O12.00.00.00.0eastus104.211.30.112.0
4132019-02-12 18:23:51.8532019-02-12 17:26:192019-02-12 17:44:092019-02-12 18:00:00IntraVNetNoneNoneNone10.0.3.4...445.0O6.00.00.00.0eastus10.0.3.56.0
4142019-02-12 18:23:51.8532019-02-12 17:26:192019-02-12 17:44:092019-02-12 18:00:00IntraVNetNoneNoneNone10.0.3.4...445.0I0.06.00.00.0eastus10.0.3.46.0
4662019-02-12 22:23:17.2362019-02-12 21:55:432019-02-12 21:55:432019-02-12 22:00:00IntraVNetNoneNoneNone10.0.3.5...22.0I0.01.00.00.0eastus10.0.3.51.0
\n", "

8 rows × 23 columns

\n", "
" ], "text/plain": [ " TimeGenerated FlowStartTime FlowEndTime \\\n", "326 2019-02-13 01:23:47.634 2019-02-13 00:55:10 2019-02-13 00:57:33 \n", "327 2019-02-13 01:23:47.634 2019-02-13 00:55:10 2019-02-13 00:57:33 \n", "336 2019-02-12 22:23:05.399 2019-02-12 21:55:43 2019-02-12 21:55:43 \n", "345 2019-02-12 22:23:05.384 2019-02-12 21:54:04 2019-02-12 21:55:36 \n", "357 2019-02-12 23:23:59.515 2019-02-12 22:22:35 2019-02-12 22:55:37 \n", "413 2019-02-12 18:23:51.853 2019-02-12 17:26:19 2019-02-12 17:44:09 \n", "414 2019-02-12 18:23:51.853 2019-02-12 17:26:19 2019-02-12 17:44:09 \n", "466 2019-02-12 22:23:17.236 2019-02-12 21:55:43 2019-02-12 21:55:43 \n", "\n", " FlowIntervalEndTime FlowType ResourceGroup VMName VMIPAddress PublicIPs \\\n", "326 2019-02-13 01:00:00 IntraVNet None None None \n", "327 2019-02-13 01:00:00 IntraVNet None None None \n", "336 2019-02-12 22:00:00 IntraVNet None None None \n", "345 2019-02-12 22:00:00 IntraVNet None None None \n", "357 2019-02-12 23:00:00 IntraVNet None None None \n", "413 2019-02-12 18:00:00 IntraVNet None None None \n", "414 2019-02-12 18:00:00 IntraVNet None None None \n", "466 2019-02-12 22:00:00 IntraVNet None None None \n", "\n", " SrcIP ... DestPort FlowDirection AllowedOutFlows AllowedInFlows \\\n", "326 10.0.3.4 ... 445.0 O 3.0 0.0 \n", "327 10.0.3.4 ... 445.0 I 0.0 3.0 \n", "336 10.0.3.5 ... 22.0 O 1.0 0.0 \n", "345 10.0.3.5 ... 22.0 O 6.0 0.0 \n", "357 10.0.3.5 ... 22.0 O 12.0 0.0 \n", "413 10.0.3.4 ... 445.0 O 6.0 0.0 \n", "414 10.0.3.4 ... 445.0 I 0.0 6.0 \n", "466 10.0.3.5 ... 22.0 I 0.0 1.0 \n", "\n", " DeniedInFlows DeniedOutFlows RemoteRegion VMRegion AllExtIPs \\\n", "326 0.0 0.0 eastus 10.0.3.5 \n", "327 0.0 0.0 eastus 10.0.3.4 \n", "336 0.0 0.0 eastus 10.0.3.4 \n", "345 0.0 0.0 eastus 104.211.30.1 \n", "357 0.0 0.0 eastus 104.211.30.1 \n", "413 0.0 0.0 eastus 10.0.3.5 \n", "414 0.0 0.0 eastus 10.0.3.4 \n", "466 0.0 0.0 eastus 10.0.3.5 \n", "\n", " TotalAllowedFlows \n", "326 3.0 \n", "327 3.0 \n", "336 1.0 \n", "345 6.0 \n", "357 12.0 \n", "413 6.0 \n", "414 6.0 \n", "466 1.0 \n", "\n", "[8 rows x 23 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df[net_df[\"PublicIPs\"].isna()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Removing NaNs with `.dropna`\n", "\n", "

\n", "df.dropna() # removes all rows with ANY NaNs
\n", "df.dropna(axis=1) # removes all columns with ANY NaNs\n", "

\n", "

\n", "df.dropna(how=\"all\") # removes all rows that are ALL NaNs
\n", "df.dropna(axis=1, how=\"all\") # removes all cols that are ALL NaNs
\n", "

\n", "\n", "`dropna()` also supports `inplace=True`. Don't do it!!!\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:16.014992Z", "start_time": "2021-11-29T16:57:15.988549Z" } }, "outputs": [ { "data": { "text/plain": [ "1352" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(net_df.dropna())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Replacing NaNs with values\n", "\n", "

\n", "df.fillna(replacement) # replace NaNs with 'replacement'

\n", "df[column] = df[column].fillna(replacement) # replace NaNs in a single column\n", "

\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:17.067541Z", "start_time": "2021-11-29T16:57:17.022938Z" } }, "outputs": [], "source": [ "net_df2 = net_df.fillna(value=\"N/A\")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:17.505737Z", "start_time": "2021-11-29T16:57:17.489845Z" } }, "outputs": [ { "data": { "text/plain": [ "asihuntomsworkspacerg 1352\n", "N/A 8\n", "Name: ResourceGroup, dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df2.ResourceGroup.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Type Conversion" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:21.279553Z", "start_time": "2021-11-29T16:57:21.272725Z" } }, "outputs": [ { "data": { "text/plain": [ "TimeGenerated datetime64[ns]\n", "FlowStartTime datetime64[ns]\n", "FlowEndTime datetime64[ns]\n", "FlowIntervalEndTime datetime64[ns]\n", "FlowType object\n", "ResourceGroup object\n", "VMName object\n", "VMIPAddress object\n", "PublicIPs object\n", "SrcIP object\n", "DestIP object\n", "L4Protocol object\n", "L7Protocol object\n", "DestPort float64\n", "FlowDirection object\n", "AllowedOutFlows float64\n", "AllowedInFlows float64\n", "DeniedInFlows float64\n", "DeniedOutFlows float64\n", "RemoteRegion object\n", "VMRegion object\n", "AllExtIPs object\n", "TotalAllowedFlows float64\n", "dtype: object" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "series.column.astype(target) # convert type
\n", "

\n", "\n", "Target can be a numpy type, a pandas dtype or a friendly string:\n", "- \"object\"\n", "- \"datetime\"\n", "- \"number\"\n", "\n", "

\n", "df.column.astype(target|{col1: type1, col2, type2...}) # convert multiple cols
\n", "

\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:22.249568Z", "start_time": "2021-11-29T16:57:22.234649Z" } }, "outputs": [], "source": [ "net_df.TotalAllowedFlows = net_df.TotalAllowedFlows.astype('str')" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:22.909576Z", "start_time": "2021-11-29T16:57:22.899133Z" } }, "outputs": [ { "data": { "text/plain": [ "dtype('O')" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df.TotalAllowedFlows.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convert using explicit pandas function\n", "\n", "Gives you more control over specific conversions (esp for DateTime)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:46.097851Z", "start_time": "2021-11-29T16:57:46.091318Z" } }, "outputs": [], "source": [ "net_df.TotalAllowedFlows = pd.to_numeric(net_df.TotalAllowedFlows)\n", "#pd.to_datetime\n", "#pd.to_timedelta" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:46.890283Z", "start_time": "2021-11-29T16:57:46.877279Z" } }, "outputs": [ { "data": { "text/plain": [ "dtype('float64')" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df.TotalAllowedFlows.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Renaming columns" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:49.831462Z", "start_time": "2021-11-29T16:57:49.820173Z" } }, "outputs": [ { "data": { "text/plain": [ "Index(['TimeGenerated', 'FlowStartTime', 'FlowEndTime', 'FlowIntervalEndTime',\n", " 'FlowType', 'ResourceGroup', 'VMName', 'VMIPAddress', 'PublicIPs',\n", " 'SrcIP', 'DestIP', 'L4Protocol', 'L7Protocol', 'DestPort',\n", " 'FlowDirection', 'AllowedOutFlows', 'AllowedInFlows', 'DeniedInFlows',\n", " 'DeniedOutFlows', 'RemoteRegion', 'VMRegion', 'AllExtIPs',\n", " 'TotalAllowedFlows'],\n", " dtype='object')" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "df.rename(columns={col1: col1_new, col2: ....}) # rename
\n", "

\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:50.653581Z", "start_time": "2021-11-29T16:57:50.640127Z" } }, "outputs": [ { "data": { "text/plain": [ "Index(['TimeGenerated', 'FlowStartDateTime', 'FlowEndDateTime',\n", " 'FlowIntervalEndTime', 'FlowType', 'ResourceGroup', 'VMName',\n", " 'VMIPAddress', 'PublicIPs', 'SrcIP', 'DestIP', 'L4Protocol',\n", " 'L7Protocol', 'DestPort', 'FlowDirection', 'AllowedOutFlows',\n", " 'AllowedInFlows', 'DeniedInFlows', 'DeniedOutFlows', 'RemoteRegion',\n", " 'VMRegion', 'AllExtIPs', 'TotalAllowedFlows'],\n", " dtype='object')" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df.rename(columns={\"FlowStartTime\": \"FlowStartDateTime\", \"FlowEndTime\": \"FlowEndDateTime\"}).columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "

\n", "df.rename(func, axis='columns')\n", "

" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:52.191594Z", "start_time": "2021-11-29T16:57:52.172578Z" } }, "outputs": [ { "data": { "text/plain": [ "Index(['timegenerated', 'flowstarttime', 'flowendtime', 'flowintervalendtime',\n", " 'flowtype', 'resourcegroup', 'vmname', 'vmipaddress', 'publicips',\n", " 'srcip', 'destip', 'l4protocol', 'l7protocol', 'destport',\n", " 'flowdirection', 'allowedoutflows', 'allowedinflows', 'deniedinflows',\n", " 'deniedoutflows', 'remoteregion', 'vmregion', 'allextips',\n", " 'totalallowedflows'],\n", " dtype='object')" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df.rename(str.lower, axis='columns').columns" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:57:56.345464Z", "start_time": "2021-11-29T16:57:56.330796Z" } }, "outputs": [ { "data": { "text/plain": [ "Index(['TimeGenerated', 'FlowStartTime', 'FlowEndTime', 'FlowIntervalEndTime',\n", " 'FlowType', 'ResourceGroup', 'VMName', 'VMIPAddress', 'PublicIPs',\n", " 'SrcIP', 'DestIP', 'L4Protocol', 'L7Protocol', 'DestPort',\n", " 'FlowDirection', 'AllowedOutFlows', 'AllowedInFlows', 'DeniedInFlows',\n", " 'DeniedOutFlows', 'RemoteRegion', 'VMRegion', 'AllExtIPs',\n", " 'TotalAllowedFlows'],\n", " dtype='object')" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Statically rename using assignment" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:00.745520Z", "start_time": "2021-11-29T16:59:00.731493Z" } }, "outputs": [], "source": [ "net_df.columns = [\n", " \"timegenerated\",\n", " \"flowstarttime\",\n", " \"flowendtime\",\n", " \"flowintervalendtime\",\n", " \"flowtype\",\n", " \"resourcegroup\",\n", " \"vmname\",\n", " \"vmipaddress\",\n", " \"publicips\",\n", " \"srcip\",\n", " \"destip\",\n", " \"l4protocol\",\n", " \"l7protocol\",\n", " \"destport\",\n", " \"flowdirection\",\n", " \"allowedoutflows\",\n", " \"allowedinflows\",\n", " \"deniedinflows\",\n", " \"deniedoutflows\",\n", " \"remoteregion\",\n", " \"vmregion\",\n", " \"allextips\",\n", " \"totalallowedflows\",\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas operations: melt, explode, transpose, indexing/stack/unstack [Ashwin]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dealing with complex Python objects - explode" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:04.263586Z", "start_time": "2021-11-29T16:59:04.235426Z" } }, "outputs": [ { "data": { "text/plain": [ "0 [13.67.143.117]\n", "1 [40.77.232.95]\n", "2 [13.65.107.32, 40.124.45.19]\n", "3 [13.65.107.32, 40.124.45.19]\n", "4 [20.38.98.100]\n", "5 [13.67.143.117]\n", "6 [13.71.172.128, 13.71.172.130]\n", "7 [13.71.172.128, 13.71.172.130]\n", "8 [65.55.44.109, 40.77.228.69, 65.55.44.108]\n", "9 [65.55.44.109, 40.77.228.69, 65.55.44.108]\n", "Name: PublicIPs, dtype: object" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df = pd.read_pickle(\"../data/az_net_comms_df.pkl\")\n", "net_df.PublicIPs.head(10)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:05.918978Z", "start_time": "2021-11-29T16:59:05.905221Z" } }, "outputs": [ { "data": { "text/plain": [ "1352" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df.PublicIPs.count()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:07.499080Z", "start_time": "2021-11-29T16:59:07.475692Z" } }, "outputs": [], "source": [ "net_df_ext = net_df.explode(\"PublicIPs\")" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:08.179437Z", "start_time": "2021-11-29T16:59:08.166135Z" } }, "outputs": [ { "data": { "text/plain": [ "0 13.67.143.117\n", "1 40.77.232.95\n", "2 13.65.107.32\n", "2 40.124.45.19\n", "3 13.65.107.32\n", "3 40.124.45.19\n", "4 20.38.98.100\n", "5 13.67.143.117\n", "6 13.71.172.128\n", "6 13.71.172.130\n", "Name: PublicIPs, dtype: object" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df_ext.PublicIPs.head(10)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:09.054072Z", "start_time": "2021-11-29T16:59:09.041999Z" } }, "outputs": [ { "data": { "text/plain": [ "123" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(net_df_ext.PublicIPs.unique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tidy data - melt\n", "Pandas.melt() unpivots a DataFrame from wide format to long format. \n", "melt() function is useful to message a DataFrame into a format where one or more columns are identifier variables, while all other columns, considered measured variables, are unpivoted to the row axis, leaving just two non-identifier columns, variable and value. " ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:10.842088Z", "start_time": "2021-11-29T16:59:10.831120Z" } }, "outputs": [], "source": [ "net_df_min = net_df[[\"FlowType\", \"AllExtIPs\", \"TotalAllowedFlows\"]]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:11.476008Z", "start_time": "2021-11-29T16:59:11.456710Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AllExtIPsvariablevalue
013.67.143.117FlowTypeAzurePublic
140.77.232.95FlowTypeAzurePublic
213.65.107.32FlowTypeAzurePublic
340.124.45.19FlowTypeAzurePublic
420.38.98.100FlowTypeAzurePublic
............
271513.71.172.130TotalAllowedFlows23.0
271640.77.232.95TotalAllowedFlows1.0
271752.168.138.145TotalAllowedFlows4.0
271823.215.98.90TotalAllowedFlows2.0
271972.21.81.240TotalAllowedFlows2.0
\n", "

2720 rows × 3 columns

\n", "
" ], "text/plain": [ " AllExtIPs variable value\n", "0 13.67.143.117 FlowType AzurePublic\n", "1 40.77.232.95 FlowType AzurePublic\n", "2 13.65.107.32 FlowType AzurePublic\n", "3 40.124.45.19 FlowType AzurePublic\n", "4 20.38.98.100 FlowType AzurePublic\n", "... ... ... ...\n", "2715 13.71.172.130 TotalAllowedFlows 23.0\n", "2716 40.77.232.95 TotalAllowedFlows 1.0\n", "2717 52.168.138.145 TotalAllowedFlows 4.0\n", "2718 23.215.98.90 TotalAllowedFlows 2.0\n", "2719 72.21.81.240 TotalAllowedFlows 2.0\n", "\n", "[2720 rows x 3 columns]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.melt(net_df_min, \n", " id_vars=['AllExtIPs'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Transpose" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:13.573096Z", "start_time": "2021-11-29T16:59:13.555565Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
FlowTypeAzurePublicAzurePublicAzurePublicAzurePublicAzurePublic
AllExtIPs13.67.143.11740.77.232.9513.65.107.3240.124.45.1920.38.98.100
TotalAllowedFlows1.01.04.04.01.0
\n", "
" ], "text/plain": [ " 0 1 2 3 \\\n", "FlowType AzurePublic AzurePublic AzurePublic AzurePublic \n", "AllExtIPs 13.67.143.117 40.77.232.95 13.65.107.32 40.124.45.19 \n", "TotalAllowedFlows 1.0 1.0 4.0 4.0 \n", "\n", " 4 \n", "FlowType AzurePublic \n", "AllExtIPs 20.38.98.100 \n", "TotalAllowedFlows 1.0 " ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df_min.head().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Indexing, Stack and Unstack" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:15.460085Z", "start_time": "2021-11-29T16:59:15.446382Z" } }, "outputs": [], "source": [ "net_df = pd.read_pickle(\"../data/az_net_comms_df.pkl\")" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:16.331231Z", "start_time": "2021-11-29T16:59:16.303347Z" } }, "outputs": [], "source": [ "net_df_agg = net_df.groupby(\"AllExtIPs\").agg({\"TotalAllowedFlows\":['mean', 'min', 'max'],\n", " \"AllowedOutFlows\":['mean', 'min', 'max'],\n", " \"AllowedInFlows\":['mean', 'min', 'max']})" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:16.964883Z", "start_time": "2021-11-29T16:59:16.927450Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TotalAllowedFlowsAllowedOutFlowsAllowedInFlows
meanminmaxmeanminmaxmeanminmax
AllExtIPs
10.0.3.43.3333331.06.00.3333330.01.03.0000000.06.0
10.0.3.53.3333331.06.03.0000000.06.00.3333330.01.0
104.211.30.19.0000006.012.09.0000006.012.00.0000000.00.0
104.40.17.1531.7500001.02.01.7500001.02.00.0000000.00.0
104.43.212.122.1666671.04.02.1666671.04.00.0000000.00.0
\n", "
" ], "text/plain": [ " TotalAllowedFlows AllowedOutFlows \\\n", " mean min max mean min max \n", "AllExtIPs \n", "10.0.3.4 3.333333 1.0 6.0 0.333333 0.0 1.0 \n", "10.0.3.5 3.333333 1.0 6.0 3.000000 0.0 6.0 \n", "104.211.30.1 9.000000 6.0 12.0 9.000000 6.0 12.0 \n", "104.40.17.153 1.750000 1.0 2.0 1.750000 1.0 2.0 \n", "104.43.212.12 2.166667 1.0 4.0 2.166667 1.0 4.0 \n", "\n", " AllowedInFlows \n", " mean min max \n", "AllExtIPs \n", "10.0.3.4 3.000000 0.0 6.0 \n", "10.0.3.5 0.333333 0.0 1.0 \n", "104.211.30.1 0.000000 0.0 0.0 \n", "104.40.17.153 0.000000 0.0 0.0 \n", "104.43.212.12 0.000000 0.0 0.0 " ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df_agg.head()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:18.906756Z", "start_time": "2021-11-29T16:59:18.886349Z" } }, "outputs": [ { "data": { "text/plain": [ "AllExtIPs\n", "10.0.3.4 3.333333\n", "10.0.3.5 3.333333\n", "104.211.30.1 9.000000\n", "104.40.17.153 1.750000\n", "104.43.212.12 2.166667\n", " ... \n", "90.130.70.73 1.000000\n", "99.84.104.63 7.000000\n", "99.84.106.178 10.000000\n", "99.84.106.27 10.000000\n", "99.84.106.92 10.000000\n", "Name: mean, Length: 125, dtype: float64" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df_agg[\"TotalAllowedFlows\"][\"mean\"]" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:19.702322Z", "start_time": "2021-11-29T16:59:19.658756Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TotalAllowedFlowsAllowedOutFlowsAllowedInFlows
meanmeanmean
AllExtIPs
10.0.3.43.3333330.3333333.000000
10.0.3.53.3333333.0000000.333333
104.211.30.19.0000009.0000000.000000
104.40.17.1531.7500001.7500000.000000
104.43.212.122.1666672.1666670.000000
............
90.130.70.731.0000001.0000000.000000
99.84.104.637.0000007.0000000.000000
99.84.106.17810.00000010.0000000.000000
99.84.106.2710.00000010.0000000.000000
99.84.106.9210.00000010.0000000.000000
\n", "

125 rows × 3 columns

\n", "
" ], "text/plain": [ " TotalAllowedFlows AllowedOutFlows AllowedInFlows\n", " mean mean mean\n", "AllExtIPs \n", "10.0.3.4 3.333333 0.333333 3.000000\n", "10.0.3.5 3.333333 3.000000 0.333333\n", "104.211.30.1 9.000000 9.000000 0.000000\n", "104.40.17.153 1.750000 1.750000 0.000000\n", "104.43.212.12 2.166667 2.166667 0.000000\n", "... ... ... ...\n", "90.130.70.73 1.000000 1.000000 0.000000\n", "99.84.104.63 7.000000 7.000000 0.000000\n", "99.84.106.178 10.000000 10.000000 0.000000\n", "99.84.106.27 10.000000 10.000000 0.000000\n", "99.84.106.92 10.000000 10.000000 0.000000\n", "\n", "[125 rows x 3 columns]" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "idx = pd.IndexSlice\n", "net_df_agg.loc[:,idx[:,'mean']]" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:21.223362Z", "start_time": "2021-11-29T16:59:21.200773Z" } }, "outputs": [], "source": [ "net_df_agg_stacked = net_df_agg.stack()" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:21.803040Z", "start_time": "2021-11-29T16:59:21.779226Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TotalAllowedFlowsAllowedOutFlowsAllowedInFlows
AllExtIPs
10.0.3.4mean3.3333330.3333333.000000
min1.0000000.0000000.000000
max6.0000001.0000006.000000
10.0.3.5mean3.3333333.0000000.333333
min1.0000000.0000000.000000
\n", "
" ], "text/plain": [ " TotalAllowedFlows AllowedOutFlows AllowedInFlows\n", "AllExtIPs \n", "10.0.3.4 mean 3.333333 0.333333 3.000000\n", " min 1.000000 0.000000 0.000000\n", " max 6.000000 1.000000 6.000000\n", "10.0.3.5 mean 3.333333 3.000000 0.333333\n", " min 1.000000 0.000000 0.000000" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df_agg_stacked.head()" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:23.170816Z", "start_time": "2021-11-29T16:59:23.158347Z" } }, "outputs": [ { "data": { "text/plain": [ "3.3333333333333335" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df_agg_stacked.loc[(\"10.0.3.4\",\"mean\"),\"TotalAllowedFlows\"]" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:24.037883Z", "start_time": "2021-11-29T16:59:23.986355Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TotalAllowedFlowsAllowedOutFlowsAllowedInFlows
meanminmaxmeanminmaxmeanminmax
AllExtIPs
10.0.3.43.3333331.06.00.3333330.01.03.0000000.06.0
10.0.3.53.3333331.06.03.0000000.06.00.3333330.01.0
104.211.30.19.0000006.012.09.0000006.012.00.0000000.00.0
104.40.17.1531.7500001.02.01.7500001.02.00.0000000.00.0
104.43.212.122.1666671.04.02.1666671.04.00.0000000.00.0
\n", "
" ], "text/plain": [ " TotalAllowedFlows AllowedOutFlows \\\n", " mean min max mean min max \n", "AllExtIPs \n", "10.0.3.4 3.333333 1.0 6.0 0.333333 0.0 1.0 \n", "10.0.3.5 3.333333 1.0 6.0 3.000000 0.0 6.0 \n", "104.211.30.1 9.000000 6.0 12.0 9.000000 6.0 12.0 \n", "104.40.17.153 1.750000 1.0 2.0 1.750000 1.0 2.0 \n", "104.43.212.12 2.166667 1.0 4.0 2.166667 1.0 4.0 \n", "\n", " AllowedInFlows \n", " mean min max \n", "AllExtIPs \n", "10.0.3.4 3.000000 0.0 6.0 \n", "10.0.3.5 0.333333 0.0 1.0 \n", "104.211.30.1 0.000000 0.0 0.0 \n", "104.40.17.153 0.000000 0.0 0.0 \n", "104.43.212.12 0.000000 0.0 0.0 " ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df_agg_stacked.unstack().head()" ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "---\n", "# Pivoting/pivot tables [Ashwin]\n", "\n", "### [Reshaping and Pivot Tables](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:26.497275Z", "start_time": "2021-11-29T16:59:26.429533Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeGeneratedFlowStartTimeFlowEndTimeFlowIntervalEndTimeFlowTypeResourceGroupVMNameVMIPAddressPublicIPsSrcIP...DestPortFlowDirectionAllowedOutFlowsAllowedInFlowsDeniedInFlowsDeniedOutFlowsRemoteRegionVMRegionAllExtIPsTotalAllowedFlows
02019-02-14 13:23:59.5122019-02-14 12:21:582019-02-14 12:21:582019-02-14 13:00:00AzurePublicasihuntomsworkspacergmsticalertswin110.0.3.5[13.67.143.117]...443.0O1.00.00.00.0centraluseastus13.67.143.1171.0
12019-02-14 13:23:59.5122019-02-14 12:29:022019-02-14 12:29:022019-02-14 13:00:00AzurePublicasihuntomsworkspacergmsticalertswin110.0.3.5[40.77.232.95]...443.0O1.00.00.00.0westcentraluseastus40.77.232.951.0
22019-02-14 03:26:06.7652019-02-14 02:08:462019-02-14 02:48:452019-02-14 03:00:00AzurePublicasihuntomsworkspacergmsticalertswin110.0.3.5[13.65.107.32, 40.124.45.19]...443.0O4.00.00.00.0southcentraluseastus13.65.107.324.0
32019-02-14 03:26:06.7652019-02-14 02:08:462019-02-14 02:48:452019-02-14 03:00:00AzurePublicasihuntomsworkspacergmsticalertswin110.0.3.5[13.65.107.32, 40.124.45.19]...443.0O4.00.00.00.0southcentraluseastus40.124.45.194.0
42019-02-14 03:26:06.8282019-02-14 02:30:562019-02-14 02:30:562019-02-14 03:00:00AzurePublicasihuntomsworkspacergmsticalertswin110.0.3.5[20.38.98.100]...443.0O1.00.00.00.0eastuseastus20.38.98.1001.0
\n", "

5 rows × 23 columns

\n", "
" ], "text/plain": [ " TimeGenerated FlowStartTime FlowEndTime \\\n", "0 2019-02-14 13:23:59.512 2019-02-14 12:21:58 2019-02-14 12:21:58 \n", "1 2019-02-14 13:23:59.512 2019-02-14 12:29:02 2019-02-14 12:29:02 \n", "2 2019-02-14 03:26:06.765 2019-02-14 02:08:46 2019-02-14 02:48:45 \n", "3 2019-02-14 03:26:06.765 2019-02-14 02:08:46 2019-02-14 02:48:45 \n", "4 2019-02-14 03:26:06.828 2019-02-14 02:30:56 2019-02-14 02:30:56 \n", "\n", " FlowIntervalEndTime FlowType ResourceGroup VMName \\\n", "0 2019-02-14 13:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 \n", "1 2019-02-14 13:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 \n", "2 2019-02-14 03:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 \n", "3 2019-02-14 03:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 \n", "4 2019-02-14 03:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 \n", "\n", " VMIPAddress PublicIPs SrcIP ... DestPort FlowDirection \\\n", "0 10.0.3.5 [13.67.143.117] ... 443.0 O \n", "1 10.0.3.5 [40.77.232.95] ... 443.0 O \n", "2 10.0.3.5 [13.65.107.32, 40.124.45.19] ... 443.0 O \n", "3 10.0.3.5 [13.65.107.32, 40.124.45.19] ... 443.0 O \n", "4 10.0.3.5 [20.38.98.100] ... 443.0 O \n", "\n", " AllowedOutFlows AllowedInFlows DeniedInFlows DeniedOutFlows \\\n", "0 1.0 0.0 0.0 0.0 \n", "1 1.0 0.0 0.0 0.0 \n", "2 4.0 0.0 0.0 0.0 \n", "3 4.0 0.0 0.0 0.0 \n", "4 1.0 0.0 0.0 0.0 \n", "\n", " RemoteRegion VMRegion AllExtIPs TotalAllowedFlows \n", "0 centralus eastus 13.67.143.117 1.0 \n", "1 westcentralus eastus 40.77.232.95 1.0 \n", "2 southcentralus eastus 13.65.107.32 4.0 \n", "3 southcentralus eastus 40.124.45.19 4.0 \n", "4 eastus eastus 20.38.98.100 1.0 \n", "\n", "[5 rows x 23 columns]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df = pd.read_pickle(\"../data/az_net_comms_df.pkl\")\n", "net_df.head()" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:27.624201Z", "start_time": "2021-11-29T16:59:27.611499Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FlowIntervalEndTimeFlowTypeAllExtIPs
02019-02-07 13:00:00AzurePublic3
12019-02-07 14:00:00AzurePublic8
22019-02-07 14:00:00ExternalPublic1
32019-02-07 15:00:00AzurePublic5
42019-02-07 15:00:00ExternalPublic3
\n", "
" ], "text/plain": [ " FlowIntervalEndTime FlowType AllExtIPs\n", "0 2019-02-07 13:00:00 AzurePublic 3\n", "1 2019-02-07 14:00:00 AzurePublic 8\n", "2 2019-02-07 14:00:00 ExternalPublic 1\n", "3 2019-02-07 15:00:00 AzurePublic 5\n", "4 2019-02-07 15:00:00 ExternalPublic 3" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Prepare groupby dataset to perform pivot. Does expect column with unique values\n", "net_df_grouped = net_df.groupby(['FlowIntervalEndTime','FlowType'])['AllExtIPs'].count().reset_index()\n", "net_df_grouped.head()" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:28.924283Z", "start_time": "2021-11-29T16:59:28.879893Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FlowTypeAzurePublicExternalPublicIntraVNet
FlowIntervalEndTime
2019-02-07 13:00:003.0NaNNaN
2019-02-07 14:00:008.01.0NaN
2019-02-07 15:00:005.03.0NaN
2019-02-07 16:00:007.01.0NaN
2019-02-07 17:00:008.0NaNNaN
............
2019-02-14 09:00:008.01.0NaN
2019-02-14 10:00:008.0NaNNaN
2019-02-14 11:00:007.02.0NaN
2019-02-14 12:00:009.0NaNNaN
2019-02-14 13:00:002.0NaNNaN
\n", "

141 rows × 3 columns

\n", "
" ], "text/plain": [ "FlowType AzurePublic ExternalPublic IntraVNet\n", "FlowIntervalEndTime \n", "2019-02-07 13:00:00 3.0 NaN NaN\n", "2019-02-07 14:00:00 8.0 1.0 NaN\n", "2019-02-07 15:00:00 5.0 3.0 NaN\n", "2019-02-07 16:00:00 7.0 1.0 NaN\n", "2019-02-07 17:00:00 8.0 NaN NaN\n", "... ... ... ...\n", "2019-02-14 09:00:00 8.0 1.0 NaN\n", "2019-02-14 10:00:00 8.0 NaN NaN\n", "2019-02-14 11:00:00 7.0 2.0 NaN\n", "2019-02-14 12:00:00 9.0 NaN NaN\n", "2019-02-14 13:00:00 2.0 NaN NaN\n", "\n", "[141 rows x 3 columns]" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df_grouped.pivot(index=\"FlowIntervalEndTime\", columns=\"FlowType\", values=\"AllExtIPs\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# Time manipulation [Ashwin]\n", "\n", "- Timezone conversions\n", "- Resample - Grouping by time " ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:31.451152Z", "start_time": "2021-11-29T16:59:31.388909Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeGeneratedFlowStartTimeFlowEndTimeFlowIntervalEndTimeFlowTypeResourceGroupVMNameVMIPAddressPublicIPsSrcIP...DestPortFlowDirectionAllowedOutFlowsAllowedInFlowsDeniedInFlowsDeniedOutFlowsRemoteRegionVMRegionAllExtIPsTotalAllowedFlows
02019-02-14 13:23:59.5122019-02-14 12:21:582019-02-14 12:21:582019-02-14 13:00:00AzurePublicasihuntomsworkspacergmsticalertswin110.0.3.5[13.67.143.117]...443.0O1.00.00.00.0centraluseastus13.67.143.1171.0
12019-02-14 13:23:59.5122019-02-14 12:29:022019-02-14 12:29:022019-02-14 13:00:00AzurePublicasihuntomsworkspacergmsticalertswin110.0.3.5[40.77.232.95]...443.0O1.00.00.00.0westcentraluseastus40.77.232.951.0
22019-02-14 03:26:06.7652019-02-14 02:08:462019-02-14 02:48:452019-02-14 03:00:00AzurePublicasihuntomsworkspacergmsticalertswin110.0.3.5[13.65.107.32, 40.124.45.19]...443.0O4.00.00.00.0southcentraluseastus13.65.107.324.0
32019-02-14 03:26:06.7652019-02-14 02:08:462019-02-14 02:48:452019-02-14 03:00:00AzurePublicasihuntomsworkspacergmsticalertswin110.0.3.5[13.65.107.32, 40.124.45.19]...443.0O4.00.00.00.0southcentraluseastus40.124.45.194.0
42019-02-14 03:26:06.8282019-02-14 02:30:562019-02-14 02:30:562019-02-14 03:00:00AzurePublicasihuntomsworkspacergmsticalertswin110.0.3.5[20.38.98.100]...443.0O1.00.00.00.0eastuseastus20.38.98.1001.0
\n", "

5 rows × 23 columns

\n", "
" ], "text/plain": [ " TimeGenerated FlowStartTime FlowEndTime \\\n", "0 2019-02-14 13:23:59.512 2019-02-14 12:21:58 2019-02-14 12:21:58 \n", "1 2019-02-14 13:23:59.512 2019-02-14 12:29:02 2019-02-14 12:29:02 \n", "2 2019-02-14 03:26:06.765 2019-02-14 02:08:46 2019-02-14 02:48:45 \n", "3 2019-02-14 03:26:06.765 2019-02-14 02:08:46 2019-02-14 02:48:45 \n", "4 2019-02-14 03:26:06.828 2019-02-14 02:30:56 2019-02-14 02:30:56 \n", "\n", " FlowIntervalEndTime FlowType ResourceGroup VMName \\\n", "0 2019-02-14 13:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 \n", "1 2019-02-14 13:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 \n", "2 2019-02-14 03:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 \n", "3 2019-02-14 03:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 \n", "4 2019-02-14 03:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 \n", "\n", " VMIPAddress PublicIPs SrcIP ... DestPort FlowDirection \\\n", "0 10.0.3.5 [13.67.143.117] ... 443.0 O \n", "1 10.0.3.5 [40.77.232.95] ... 443.0 O \n", "2 10.0.3.5 [13.65.107.32, 40.124.45.19] ... 443.0 O \n", "3 10.0.3.5 [13.65.107.32, 40.124.45.19] ... 443.0 O \n", "4 10.0.3.5 [20.38.98.100] ... 443.0 O \n", "\n", " AllowedOutFlows AllowedInFlows DeniedInFlows DeniedOutFlows \\\n", "0 1.0 0.0 0.0 0.0 \n", "1 1.0 0.0 0.0 0.0 \n", "2 4.0 0.0 0.0 0.0 \n", "3 4.0 0.0 0.0 0.0 \n", "4 1.0 0.0 0.0 0.0 \n", "\n", " RemoteRegion VMRegion AllExtIPs TotalAllowedFlows \n", "0 centralus eastus 13.67.143.117 1.0 \n", "1 westcentralus eastus 40.77.232.95 1.0 \n", "2 southcentralus eastus 13.65.107.32 4.0 \n", "3 southcentralus eastus 40.124.45.19 4.0 \n", "4 eastus eastus 20.38.98.100 1.0 \n", "\n", "[5 rows x 23 columns]" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df = pd.read_pickle(\"../data/az_net_comms_df.pkl\")\n", "net_df.head()" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:36:48.911801Z", "start_time": "2021-11-29T16:36:48.902616Z" } }, "source": [ "### Timezone considerations " ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:36.761041Z", "start_time": "2021-11-29T16:59:36.745145Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 2019-02-14 13:23:59.512000+00:00\n", "1 2019-02-14 13:23:59.512000+00:00\n", "2 2019-02-14 03:26:06.765000+00:00\n", "3 2019-02-14 03:26:06.765000+00:00\n", "4 2019-02-14 03:26:06.828000+00:00\n", " ... \n", "1355 2019-02-09 03:32:41.967000+00:00\n", "1356 2019-02-09 03:32:51.124000+00:00\n", "1357 2019-02-09 03:32:51.264000+00:00\n", "1358 2019-02-09 03:32:45.608000+00:00\n", "1359 2019-02-09 03:32:45.608000+00:00\n", "Name: TimeGenerated, Length: 1360, dtype: datetime64[ns, UTC]\n" ] } ], "source": [ "dti = pd.to_datetime(net_df['TimeGenerated'])\n", "dti_utc = dti.dt.tz_localize(\"UTC\")\n", "print(dti_utc)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T16:59:58.263860Z", "start_time": "2021-11-29T16:59:58.228637Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 2019-02-14 13:23:59.512000-08:00\n", "1 2019-02-14 13:23:59.512000-08:00\n", "2 2019-02-14 03:26:06.765000-08:00\n", "3 2019-02-14 03:26:06.765000-08:00\n", "4 2019-02-14 03:26:06.828000-08:00\n", " ... \n", "1355 2019-02-09 03:32:41.967000-08:00\n", "1356 2019-02-09 03:32:51.124000-08:00\n", "1357 2019-02-09 03:32:51.264000-08:00\n", "1358 2019-02-09 03:32:45.608000-08:00\n", "1359 2019-02-09 03:32:45.608000-08:00\n", "Name: TimeGenerated, Length: 1360, dtype: datetime64[ns, US/Pacific]\n" ] } ], "source": [ "dti_pst = dti.dt.tz_localize(\"US/Pacific\")\n", "print(dti_pst)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grouping by time\n", "\n", "[Resampling](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling) - `resample()` time-based groupby" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T17:00:00.414768Z", "start_time": "2021-11-29T17:00:00.391034Z" } }, "outputs": [ { "data": { "text/plain": [ "TimeGenerated\n", "2019-02-07 13:00:00 3\n", "2019-02-07 14:00:00 9\n", "2019-02-07 15:00:00 8\n", "2019-02-07 16:00:00 8\n", "2019-02-07 17:00:00 8\n", " ..\n", "2019-02-14 09:00:00 9\n", "2019-02-14 10:00:00 8\n", "2019-02-14 11:00:00 9\n", "2019-02-14 12:00:00 9\n", "2019-02-14 13:00:00 2\n", "Freq: H, Name: FlowType, Length: 169, dtype: int64" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df.set_index('TimeGenerated').resample('H')['FlowType'].count()" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "---\n", "# Other Useful operations [Ian] \n", "\n", "- Chaining multiple operations with \".\" \n", "- Including external functions with pipe \n", "- Apply, assign, others ???? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chaining multiple operations with \".\"" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T17:00:30.281446Z", "start_time": "2021-11-29T17:00:30.225861Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeGeneratedFlowStartTimeFlowEndTimeFlowIntervalEndTimeFlowTypeResourceGroupVMNameVMIPAddressPublicIPsSrcIP...DestPortFlowDirectionAllowedOutFlowsAllowedInFlowsDeniedInFlowsDeniedOutFlowsRemoteRegionVMRegionAllExtIPsTotalAllowedFlows
02019-02-14 13:23:59.5122019-02-14 12:21:582019-02-14 12:21:582019-02-14 13:00:00AzurePublicasihuntomsworkspacergmsticalertswin110.0.3.5[13.67.143.117]...443.0O1.00.00.00.0centraluseastus13.67.143.1171.0
12019-02-14 13:23:59.5122019-02-14 12:29:022019-02-14 12:29:022019-02-14 13:00:00AzurePublicasihuntomsworkspacergmsticalertswin110.0.3.5[40.77.232.95]...443.0O1.00.00.00.0westcentraluseastus40.77.232.951.0
22019-02-14 03:26:06.7652019-02-14 02:08:462019-02-14 02:48:452019-02-14 03:00:00AzurePublicasihuntomsworkspacergmsticalertswin110.0.3.5[13.65.107.32, 40.124.45.19]...443.0O4.00.00.00.0southcentraluseastus13.65.107.324.0
32019-02-14 03:26:06.7652019-02-14 02:08:462019-02-14 02:48:452019-02-14 03:00:00AzurePublicasihuntomsworkspacergmsticalertswin110.0.3.5[13.65.107.32, 40.124.45.19]...443.0O4.00.00.00.0southcentraluseastus40.124.45.194.0
42019-02-14 03:26:06.8282019-02-14 02:30:562019-02-14 02:30:562019-02-14 03:00:00AzurePublicasihuntomsworkspacergmsticalertswin110.0.3.5[20.38.98.100]...443.0O1.00.00.00.0eastuseastus20.38.98.1001.0
\n", "

5 rows × 23 columns

\n", "
" ], "text/plain": [ " TimeGenerated FlowStartTime FlowEndTime \\\n", "0 2019-02-14 13:23:59.512 2019-02-14 12:21:58 2019-02-14 12:21:58 \n", "1 2019-02-14 13:23:59.512 2019-02-14 12:29:02 2019-02-14 12:29:02 \n", "2 2019-02-14 03:26:06.765 2019-02-14 02:08:46 2019-02-14 02:48:45 \n", "3 2019-02-14 03:26:06.765 2019-02-14 02:08:46 2019-02-14 02:48:45 \n", "4 2019-02-14 03:26:06.828 2019-02-14 02:30:56 2019-02-14 02:30:56 \n", "\n", " FlowIntervalEndTime FlowType ResourceGroup VMName \\\n", "0 2019-02-14 13:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 \n", "1 2019-02-14 13:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 \n", "2 2019-02-14 03:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 \n", "3 2019-02-14 03:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 \n", "4 2019-02-14 03:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 \n", "\n", " VMIPAddress PublicIPs SrcIP ... DestPort FlowDirection \\\n", "0 10.0.3.5 [13.67.143.117] ... 443.0 O \n", "1 10.0.3.5 [40.77.232.95] ... 443.0 O \n", "2 10.0.3.5 [13.65.107.32, 40.124.45.19] ... 443.0 O \n", "3 10.0.3.5 [13.65.107.32, 40.124.45.19] ... 443.0 O \n", "4 10.0.3.5 [20.38.98.100] ... 443.0 O \n", "\n", " AllowedOutFlows AllowedInFlows DeniedInFlows DeniedOutFlows \\\n", "0 1.0 0.0 0.0 0.0 \n", "1 1.0 0.0 0.0 0.0 \n", "2 4.0 0.0 0.0 0.0 \n", "3 4.0 0.0 0.0 0.0 \n", "4 1.0 0.0 0.0 0.0 \n", "\n", " RemoteRegion VMRegion AllExtIPs TotalAllowedFlows \n", "0 centralus eastus 13.67.143.117 1.0 \n", "1 westcentralus eastus 40.77.232.95 1.0 \n", "2 southcentralus eastus 13.65.107.32 4.0 \n", "3 southcentralus eastus 40.124.45.19 4.0 \n", "4 eastus eastus 20.38.98.100 1.0 \n", "\n", "[5 rows x 23 columns]" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df = pd.read_pickle(\"../data/az_net_comms_df.pkl\")\n", "net_df.head()" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T17:00:05.252169Z", "start_time": "2021-11-29T17:00:05.112991Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeGenerated
AllExtIPs
10.0.3.43
10.0.3.53
104.211.30.12
104.40.17.1534
104.43.212.1212
......
90.130.70.732
99.84.104.631
99.84.106.1781
99.84.106.271
99.84.106.921
\n", "

125 rows × 1 columns

\n", "
" ], "text/plain": [ " TimeGenerated\n", "AllExtIPs \n", "10.0.3.4 3\n", "10.0.3.5 3\n", "104.211.30.1 2\n", "104.40.17.153 4\n", "104.43.212.12 12\n", "... ...\n", "90.130.70.73 2\n", "99.84.104.63 1\n", "99.84.106.178 1\n", "99.84.106.27 1\n", "99.84.106.92 1\n", "\n", "[125 rows x 1 columns]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df[[\"TimeGenerated\",\"AllExtIPs\"]].groupby(\"AllExtIPs\").agg(\"count\")" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "ExecuteTime": { "end_time": "2021-11-29T17:00:05.654200Z", "start_time": "2021-11-29T17:00:05.624771Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RemoteRegion
AllExtIPs
65.55.44.109139
13.71.172.130136
52.168.138.145117
40.124.45.19115
13.71.172.128114
......
23.45.181.1781
23.45.181.1761
23.45.181.1601
23.45.180.341
99.84.106.921
\n", "

125 rows × 1 columns

\n", "
" ], "text/plain": [ " RemoteRegion\n", "AllExtIPs \n", "65.55.44.109 139\n", "13.71.172.130 136\n", "52.168.138.145 117\n", "40.124.45.19 115\n", "13.71.172.128 114\n", "... ...\n", "23.45.181.178 1\n", "23.45.181.176 1\n", "23.45.181.160 1\n", "23.45.180.34 1\n", "99.84.106.92 1\n", "\n", "[125 rows x 1 columns]" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df[[\"RemoteRegion\",\"AllExtIPs\"]].groupby(\"AllExtIPs\").agg(\"count\").sort_values(by=\"RemoteRegion\", ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use parentheses to let you stack the functions vertically" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RemoteRegion
AllExtIPs
65.55.44.109139
13.71.172.130136
52.168.138.145117
40.124.45.19115
13.71.172.128114
......
23.45.181.1781
23.45.181.1761
23.45.181.1601
23.45.180.341
99.84.106.921
\n", "

125 rows × 1 columns

\n", "
" ], "text/plain": [ " RemoteRegion\n", "AllExtIPs \n", "65.55.44.109 139\n", "13.71.172.130 136\n", "52.168.138.145 117\n", "40.124.45.19 115\n", "13.71.172.128 114\n", "... ...\n", "23.45.181.178 1\n", "23.45.181.176 1\n", "23.45.181.160 1\n", "23.45.180.34 1\n", "99.84.106.92 1\n", "\n", "[125 rows x 1 columns]" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " net_df[[\"RemoteRegion\",\"AllExtIPs\"]]\n", " .groupby(\"AllExtIPs\")\n", " .agg(\"count\")\n", " .sort_values(by=\"RemoteRegion\", ascending=False)\n", ")" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RemoteRegion
AllExtIPs
65.55.44.109139
13.71.172.130136
52.168.138.145117
40.124.45.19115
13.71.172.128114
\n", "
" ], "text/plain": [ " RemoteRegion\n", "AllExtIPs \n", "65.55.44.109 139\n", "13.71.172.130 136\n", "52.168.138.145 117\n", "40.124.45.19 115\n", "13.71.172.128 114" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " net_df[[\"RemoteRegion\", \"AllExtIPs\"]]\n", " .groupby(\"AllExtIPs\")\n", " .agg(\"count\")\n", " .sort_values(\n", " by=\"RemoteRegion\", ascending=False\n", " )\n", " .head(5)\n", ")" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['65.55.44.109',\n", " '13.71.172.130',\n", " '52.168.138.145',\n", " '40.124.45.19',\n", " '13.71.172.128']" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " net_df[[\"RemoteRegion\",\"AllExtIPs\"]]\n", " .groupby(\"AllExtIPs\")\n", " .agg(\"count\")\n", " .sort_values(by=\"RemoteRegion\", ascending=False)\n", " .head(5)\n", " .index\n", " .to_list()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## External functions with `.pipe`\n", "\n", "

\n", "df.pipe(function)
\n", "

\n", "\n", "You can call functions to do processing on your data.\n", "The function must take a DataFrame as the first parameter and return a DataFrame" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "# Define a couple of (not very useful) functions\n", "\n", "def drop_duplicates(df, column_name):\n", " return df.drop_duplicates(subset=column_name)\n", "\n", "\n", "def fill_missing_values(df):\n", " df_result = df.copy()\n", " for col in df_result.columns:\n", " df_result[col].fillna(\"N/A\", inplace=True)\n", " return df_result\n", "\n" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeGeneratedResourceGroup
3262019-02-13 01:23:47.634None
3272019-02-13 01:23:47.634None
3362019-02-12 22:23:05.399None
3452019-02-12 22:23:05.384None
3572019-02-12 23:23:59.515None
4132019-02-12 18:23:51.853None
4142019-02-12 18:23:51.853None
4662019-02-12 22:23:17.236None
\n", "
" ], "text/plain": [ " TimeGenerated ResourceGroup\n", "326 2019-02-13 01:23:47.634 None\n", "327 2019-02-13 01:23:47.634 None\n", "336 2019-02-12 22:23:05.399 None\n", "345 2019-02-12 22:23:05.384 None\n", "357 2019-02-12 23:23:59.515 None\n", "413 2019-02-12 18:23:51.853 None\n", "414 2019-02-12 18:23:51.853 None\n", "466 2019-02-12 22:23:17.236 None" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "rows with NaNs: 24\n" ] } ], "source": [ "display(net_df[[\"TimeGenerated\", \"ResourceGroup\"]][net_df[\"ResourceGroup\"].isna()])\n", "print(\"rows with NaNs:\", net_df.isnull().values.sum())" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df.pipe(fill_missing_values).isnull().values.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the drop_duplicates function" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1360" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(net_df)" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(125, 23)" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df.pipe(drop_duplicates, \"AllExtIPs\").shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using both functions" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1360" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "net_df = pd.read_pickle(\"../data/az_net_comms_df.pkl\")\n", "len(net_df)" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [], "source": [ "net_df_cleaned = (\n", " net_df\n", " .pipe(drop_duplicates, \"AllExtIPs\")\n", " .pipe(fill_missing_values)\n", ")" ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeGeneratedResourceGroup
3262019-02-13 01:23:47.634None
3272019-02-13 01:23:47.634None
3362019-02-12 22:23:05.399None
3452019-02-12 22:23:05.384None
3572019-02-12 23:23:59.515None
4132019-02-12 18:23:51.853None
4142019-02-12 18:23:51.853None
4662019-02-12 22:23:17.236None
\n", "
" ], "text/plain": [ " TimeGenerated ResourceGroup\n", "326 2019-02-13 01:23:47.634 None\n", "327 2019-02-13 01:23:47.634 None\n", "336 2019-02-12 22:23:05.399 None\n", "345 2019-02-12 22:23:05.384 None\n", "357 2019-02-12 23:23:59.515 None\n", "413 2019-02-12 18:23:51.853 None\n", "414 2019-02-12 18:23:51.853 None\n", "466 2019-02-12 22:23:17.236 None" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\Ian\\Anaconda3\\envs\\condadev\\lib\\site-packages\\ipykernel_launcher.py:2: UserWarning: Boolean Series key will be reindexed to match DataFrame index.\n", " \n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeGeneratedResourceGroup
3262019-02-13 01:23:47.634N/A
3272019-02-13 01:23:47.634N/A
3452019-02-12 22:23:05.384N/A
\n", "
" ], "text/plain": [ " TimeGenerated ResourceGroup\n", "326 2019-02-13 01:23:47.634 N/A\n", "327 2019-02-13 01:23:47.634 N/A\n", "345 2019-02-12 22:23:05.384 N/A" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "display(net_df[[\"TimeGenerated\", \"ResourceGroup\"]][net_df[\"ResourceGroup\"].isna()])\n", "display(net_df_cleaned[[\"TimeGenerated\", \"ResourceGroup\"]][net_df[\"ResourceGroup\"].isna()])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## External functions with `.apply`\n", "\n", "apply is very inefficient but necessary sometimes.\n", "\n", "### `.apply` and pandas series\n", "\n", "

\n", "series.apply(function)
\n", "df.column_name.apply(function)
\n", "

" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 [10.0.3.5]\n", "1 [10.0.3.5]\n", "2 [10.0.3.5]\n", "3 [10.0.3.5]\n", "4 [10.0.3.5]\n", "Name: VMIPAddress, dtype: object" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0 Eastus\n", "1 Eastus\n", "2 Eastus\n", "3 Eastus\n", "4 Eastus\n", "Name: VMRegion, dtype: object" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(\n", " net_df.VMIPAddress.apply(str.split, \".\").head()\n", ")\n", "display(\n", " net_df.VMRegion.apply(str.capitalize).head()\n", ")" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 10_0_3_5\n", "1 10_0_3_5\n", "2 10_0_3_5\n", "3 10_0_3_5\n", "4 10_0_3_5\n", "Name: VMIPAddress, dtype: object" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Using a lambda (inline) function\n", "display(\n", " net_df.VMIPAddress.apply(\n", " lambda col: \"_\".join(col.split(\".\"))\n", " )\n", " .head()\n", ")" ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 (global, 13.67.143.117)\n", "1 (global, 40.77.232.95)\n", "2 (global, 13.65.107.32)\n", "3 (global, 40.124.45.19)\n", "4 (global, 20.38.98.100)\n", "5 (global, 13.67.143.117)\n", "6 (global, 13.71.172.128)\n", "7 (global, 13.71.172.130)\n", "8 (global, 65.55.44.109)\n", "9 (global, 40.77.228.69)\n", "Name: AllExtIPs, dtype: object" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import ipaddress\n", "\n", "def to_ip(ip_str):\n", " if ip_str:\n", " ip = ipaddress.ip_address(ip_str)\n", " if ip.is_global:\n", " return \"global\", ip\n", " else:\n", " return \"other\", ip\n", " return \"Unknown\"\n", "\n", "display(\n", " net_df.AllExtIPs.apply(to_ip)\n", " .head(10)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `.apply` and DataFrames\n", "\n", "

\n", "df.apply(function, axis=1) # apply by row
\n", "df.apply(function [, axis=0]) # apply by column
\n", "

" ] }, { "cell_type": "code", "execution_count": 159, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 CENTRALUS\n", "1 WESTCENTRALUS\n", "2 SOUTHCENTRALUS\n", "3 SOUTHCENTRALUS\n", "4 EASTUS\n", "dtype: object" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0 Centralus: 3997470178254466550\n", "1 Westcentralus: 2950529182713360191\n", "2 Southcentralus: 2388453837175337402\n", "3 Southcentralus: 2388453837175337402\n", "4 Eastus: -5448835124403651518\n", "dtype: object" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(\n", " net_df\n", " .apply(lambda row: row.RemoteRegion.upper(), axis=1)\n", " .head(5)\n", ")\n", "\n", "display(\n", " net_df\n", " .apply(lambda row: row.RemoteRegion.capitalize() + \": \" + str(hash(row.RemoteRegion)), axis=1)\n", " .head()\n", ")" ] }, { "cell_type": "code", "execution_count": 150, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
VMIPAddressPublicIPsSrcIPDestIPAllExtIPs
0IP private: 10.0.3.5[13.67.143.117]IP global: 13.67.143.117
1IP private: 10.0.3.5[40.77.232.95]IP global: 40.77.232.95
2IP private: 10.0.3.5[13.65.107.32, 40.124.45.19]IP global: 13.65.107.32
3IP private: 10.0.3.5[13.65.107.32, 40.124.45.19]IP global: 40.124.45.19
4IP private: 10.0.3.5[20.38.98.100]IP global: 20.38.98.100
..................
1355IP private: 10.0.3.5[13.71.172.128, 13.71.172.130]IP global: 13.71.172.130
1356IP private: 10.0.3.5[40.77.232.95]IP global: 40.77.232.95
1357IP private: 10.0.3.5[52.168.138.145]IP global: 52.168.138.145
1358IP private: 10.0.3.5[23.215.98.90, 72.21.81.240]IP global: 23.215.98.90
1359IP private: 10.0.3.5[23.215.98.90, 72.21.81.240]IP global: 72.21.81.240
\n", "

1360 rows × 5 columns

\n", "
" ], "text/plain": [ " VMIPAddress PublicIPs SrcIP DestIP \\\n", "0 IP private: 10.0.3.5 [13.67.143.117] \n", "1 IP private: 10.0.3.5 [40.77.232.95] \n", "2 IP private: 10.0.3.5 [13.65.107.32, 40.124.45.19] \n", "3 IP private: 10.0.3.5 [13.65.107.32, 40.124.45.19] \n", "4 IP private: 10.0.3.5 [20.38.98.100] \n", "... ... ... ... ... \n", "1355 IP private: 10.0.3.5 [13.71.172.128, 13.71.172.130] \n", "1356 IP private: 10.0.3.5 [40.77.232.95] \n", "1357 IP private: 10.0.3.5 [52.168.138.145] \n", "1358 IP private: 10.0.3.5 [23.215.98.90, 72.21.81.240] \n", "1359 IP private: 10.0.3.5 [23.215.98.90, 72.21.81.240] \n", "\n", " AllExtIPs \n", "0 IP global: 13.67.143.117 \n", "1 IP global: 40.77.232.95 \n", "2 IP global: 13.65.107.32 \n", "3 IP global: 40.124.45.19 \n", "4 IP global: 20.38.98.100 \n", "... ... \n", "1355 IP global: 13.71.172.130 \n", "1356 IP global: 40.77.232.95 \n", "1357 IP global: 52.168.138.145 \n", "1358 IP global: 23.215.98.90 \n", "1359 IP global: 72.21.81.240 \n", "\n", "[1360 rows x 5 columns]" ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def df_to_ip(row):\n", " for name in row.index:\n", " value = row[name]\n", " try:\n", " ip = ipaddress.ip_address(value)\n", " if ip.is_global:\n", " row[name] = f\"IP global: {ip}\"\n", " elif ip.is_private:\n", " row[name] = f\"IP private: {ip}\"\n", " else:\n", " row[name] = f\"IP other: {ip}\"\n", " except:\n", " pass\n", " return row\n", "\n", "net_df.apply(df_to_ip, axis=1).filter(regex=\".*IP.*\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# End of Session\n", "# Break: 5 Minutes\n", "\n", "![](../media/dog-leash-break.jpg)" ] } ], "metadata": { "hide_input": false, "interpreter": { "hash": "f1a957ba120dd441fca5345565e323085d222a237914b966f18f2f288f750033" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.11" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "465.391px" }, "toc_section_display": true, "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }