{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Obfuscation Library\n", "\n", "Sharing data, creating documents and doing public demonstrations often require that data containing\n", "PII or other sensitive material be obfuscated.\n", "\n", "MSTICPy contains a simple library to obfuscate data using hashing and random mapping of values.\n", "You can use these functions on single data items or entire DataFrames.\n", "\n", "## Contents\n", "- [Import the module](#Import-the-module)\n", "- [Individual Obfuscation Functions](#Individual-Obfuscation-Functions)\n", "- [Obfuscating DataFrames](#Obfuscating-DataFrames)\n", "- [Creating custom column mappings](#Creating-custom-mappings)\n", "- [Using hash_item with delimiters](#Using-hash_item-with-delimiters-to-preserve-the-structure/look-of-the-hashed-input)\n", "- [Checking Your Obfuscation](#Checking-Your-Obfuscation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import the module" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from msticpy.common.utility import md\n", "from msticpy.data import data_obfus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read in some data for the examples" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "\n", "netflow_df = pd.read_csv(\"data/az_net_flows.csv\")\n", "# list is imported as string from csv - convert back to list with eval\n", "def str_to_list(val):\n", "    if isinstance(val, str):\n", "        return eval(val)\n", "netflow_df[\"PublicIPs\"] = netflow_df[\"PublicIPs\"].apply(str_to_list)\n", "\n", "# Define subset of output columns\n", "out_cols = [\n", "    'TenantId', 'TimeGenerated', 'FlowStartTime',\n", "    'ResourceGroup', 'VMName', 'VMIPAddress', 'PublicIPs',\n", "    'SrcIP', 'DestIP', 'L4Protocol', 'AllExtIPs'\n", "]\n", "netflow_df = netflow_df[out_cols]" ] }, { "cell_type": "markdown", "metadata": {},
"source": [ "## Individual Obfuscation Functions\n", "\n", "Here we're importing individual functions but you can access them with the single\n", "import statement above as:\n", "```\n", "data_obfus.hash_string(...)\n", "```\n", "etc.\n", "\n", "> **Note** In the next cell we're using a function to output documentation and examples.
\n", "> You can ignore this. The usage of each function is shown in the output of
\n", "> the subsequent cells." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from msticpy.data.data_obfus import (\n", "    hash_dict,\n", "    hash_ip,\n", "    hash_item,\n", "    hash_list,\n", "    hash_sid,\n", "    hash_string,\n", "    replace_guid\n", ")\n", "\n", "# Function to automate/format the examples below. You can ignore this\n", "def show_func(func, examples):\n", "    func_name = func.__name__\n", "    if func.__name__.startswith(\"_\"):\n", "        func_name = func_name[1:]\n", "    md(func_name, \"bold\")\n", "    print(func.__doc__)\n", "    md(\"Examples\", \"bold\")\n", "    for example in examples:\n", "        if isinstance(example, tuple):\n", "            arg, delim = example\n", "            print(\n", "                f\"{func_name}('{arg}', delim='{delim}') =>\", func(*example)\n", "            )\n", "        else:\n", "            print(\n", "                f\"{func_name}('{example}') =>\", func(example)\n", "            )\n", "    md(\"


\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

hash_string

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

hash_string does a simple hash of the input. If the input is a numeric string it will output a numeric

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

hash_string

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", " Hash a simple string.\n", "\n", " Parameters\n", " ----------\n", " input_str : str\n", " The input string\n", "\n", " Returns\n", " -------\n", " str\n", " The obfuscated output string\n", "\n", " \n" ] }, { "data": { "text/html": [ "

Examples

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "hash_string('sensitive data') => jdiqcnrqmlidkd\n", "hash_string('42424') => 98478\n" ] }, { "data": { "text/html": [ "




" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "md(\"hash_string\", \"large, bold\")\n", "md(\"hash_string does a simple hash of the input. If the input is a numeric string it will output a numeric\")\n", "show_func(hash_string, [\"sensitive data\", \"42424\"])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

hash_item

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

hash_item allows specification of delimiters. Useful for preserving the look of domains, emails, etc.

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

hash_item

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", " Hash a simple string.\n", "\n", " Parameters\n", " ----------\n", " input_item : str\n", " The input string\n", " delim: str, optional\n", " A string of delimiters to use to split the input string\n", " prior to hashing.\n", "\n", " Returns\n", " -------\n", " str\n", " The obfuscated output string\n", "\n", " \n" ] }, { "data": { "text/html": [ "

Examples

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "hash_item('sensitive data', delim=' ') => kdneqoiia laoe\n", "hash_item('most-sensitive-data/here', delim=' /-') => kmea-kdneqoiia-laoe/fcec\n" ] }, { "data": { "text/html": [ "




" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "md(\"hash_item\", \"large, bold\")\n", "md(\"hash_item allows specification of delimiters. Useful for preserving the look of domains, emails, etc.\")\n", "show_func(hash_item, [(\"sensitive data\", \" \"), (\"most-sensitive-data/here\", \" /-\")])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

hash_ip

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

hash_ip will output random mappings of input IP V4 and V6 addresses.

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

Within a Python session the mapping will remain constant.

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

hash_ip

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", " Hash IP address or list of IP addresses.\n", "\n", " Parameters\n", " ----------\n", " input_item : Union[List[str], str]\n", " List of IP addresses or single IP address.\n", "\n", " Returns\n", " -------\n", " Union[List[str], str]\n", " List of hashed addresses or single address.\n", " (depending on input)\n", "\n", " \n" ] }, { "data": { "text/html": [ "

Examples

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "hash_ip('192.168.3.1') => 192.168.84.105\n", "hash_ip('2001:0db8:85a3:0000:0000:8a2e:0370:7334') => 85d6:7819:9cce:9af1:9af1:24ad:d338:7d03\n", "hash_ip('['192.168.3.1', '192.168.5.2', '192.168.10.2']') => ['192.168.84.105', '192.168.172.202', '192.168.232.202']\n" ] }, { "data": { "text/html": [ "




" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "md(\"hash_ip\", \"large, bold\")\n", "md(\"hash_ip will output random mappings of input IP V4 and V6 addresses.\")\n", "md(\"Within a Python session the mapping will remain constant.\")\n", "show_func(hash_ip, [\n", " \"192.168.3.1\", \n", " \"2001:0db8:85a3:0000:0000:8a2e:0370:7334\",\n", " [\"192.168.3.1\", \"192.168.5.2\", \"192.168.10.2\"],\n", "])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

hash_sid

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

hash_sid will randomize the domain-specific parts of a SID. It preserves built-in SIDs and well known RIDs (e.g. Admins -500)

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

hash_sid

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", " Hash a SID preserving well-known SIDs and the RID.\n", "\n", " Parameters\n", " ----------\n", " sid : str\n", " SID string\n", "\n", " Returns\n", " -------\n", " str\n", " Hashed SID\n", "\n", " \n" ] }, { "data": { "text/html": [ "

Examples

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "hash_sid('S-1-5-21-1180699209-877415012-3182924384-1004') => S-1-5-21-3321821741-636458740-4143214142-1004\n", "hash_sid('S-1-5-18') => S-1-5-18\n" ] }, { "data": { "text/html": [ "




" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "md(\"hash_sid\", \"large, bold\")\n", "md(\"hash_sid will randomize the domain-specific parts of a SID. It preserves built-in SIDs and well known RIDs (e.g. Admins -500)\")\n", "show_func(hash_sid, [\"S-1-5-21-1180699209-877415012-3182924384-1004\", \"S-1-5-18\"])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

hash_list

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

hash_list will randomize a list of items preserving the list structure.

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

hash_list

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", " Hash list of strings.\n", "\n", " Parameters\n", " ----------\n", " item_list : List[str]\n", " Input list\n", "\n", " Returns\n", " -------\n", " List[str]\n", " Hashed list\n", "\n", " \n" ] }, { "data": { "text/html": [ "

Examples

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "hash_list('['S-1-5-21-1180699209-877415012-3182924384-1004', 'S-1-5-18']') => ['elkbjiboklpknokdeflikamojqjflqmicqiorqfbqboqe', 'nrllmpbd']\n" ] }, { "data": { "text/html": [ "




" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "md(\"hash_list\", \"large, bold\")\n", "md(\"hash_list will randomize a list of items preserving the list structure.\")\n", "show_func(hash_list, [[\"S-1-5-21-1180699209-877415012-3182924384-1004\", \"S-1-5-18\"]])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

hash_dict

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

hash_dict will randomize a dict of items preserving the structure and the dict keys.

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

hash_dict

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", " Hash dictionary values.\n", "\n", " Parameters\n", " ----------\n", " item_dict : Dict[str, Union[Dict[str, Any], List[Any], str]]\n", " Input item can be a Dict of strings, lists or other\n", " dictionaries.\n", "\n", " Returns\n", " -------\n", " Dict[str, Any]\n", " Dictionary with hashed values.\n", "\n", " \n" ] }, { "data": { "text/html": [ "

Examples

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "hash_dict('{'SID1': 'S-1-5-21-1180699209-877415012-3182924384-1004', 'SID2': 'S-1-5-18'}') => {'SID1': 'elkbjiboklpknokdeflikamojqjflqmicqiorqfbqboqe', 'SID2': 'nrllmpbd'}\n" ] }, { "data": { "text/html": [ "




" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "md(\"hash_dict\", \"large, bold\")\n", "md(\"hash_dict will randomize a dict of items preserving the structure and the dict keys.\")\n", "show_func(hash_dict, [{\"SID1\": \"S-1-5-21-1180699209-877415012-3182924384-1004\", \"SID2\": \"S-1-5-18\"}])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

replace_guid

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

replace_guid will output a random UUID mapped to the input.

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

An input GUID will be mapped to the same newly-generated output UUID

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

You can see that UUID #4 is the same as #1 and mapped to the same output UUID.

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

replace_guid

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", " Replace GUID/UUID with mapped random UUID.\n", "\n", " Parameters\n", " ----------\n", " guid : str\n", " Input UUID.\n", "\n", " Returns\n", " -------\n", " str\n", " Mapped UUID\n", "\n", " \n" ] }, { "data": { "text/html": [ "

Examples

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "replace_guid('cf1b0b29-08ae-4528-839a-5f66eca2cce9') => 01ae8633-22e5-480f-b884-fc48588c25d9\n", "replace_guid('ed63d29e-6288-4d66-b10d-8847096fc586') => 52cd2814-b5e4-48bd-80f2-51b503e50467\n", "replace_guid('ac561203-99b2-4067-a525-60d45ea0d7ff') => ef059dc7-2d6e-4506-8619-05b346a6bc6b\n", "replace_guid('cf1b0b29-08ae-4528-839a-5f66eca2cce9') => 01ae8633-22e5-480f-b884-fc48588c25d9\n" ] }, { "data": { "text/html": [ "




" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "md(\"replace_guid\", \"large, bold\")\n", "md(\"replace_guid will output a random UUID mapped to the input.\")\n", "md(\"An input GUID will be mapped to the same newly-generated output UUID\")\n", "md(\"You can see that UUID #4 is the same as #1 and mapped to the same output UUID.\")\n", "show_func(replace_guid, [\n", " \"cf1b0b29-08ae-4528-839a-5f66eca2cce9\",\n", " \"ed63d29e-6288-4d66-b10d-8847096fc586\",\n", " \"ac561203-99b2-4067-a525-60d45ea0d7ff\",\n", " \"cf1b0b29-08ae-4528-839a-5f66eca2cce9\",\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Obfuscating DataFrames\n", "\n", "We can use the msticpy pandas extension to obfuscate an entire DataFrame.\n", "\n", "The obfuscation library contains a mapping for a number of common field names.\n", "You can view this list by displaying the attribute:\n", "```\n", "data_obfus.OBFUS_COL_MAP\n", "```\n", "\n", "In the first example, the TenantId, ResourceGroup, VMName have been obfuscated." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TenantIdTimeGeneratedFlowStartTimeResourceGroupVMNameVMIPAddressPublicIPsSrcIPDestIPL4ProtocolAllExtIPs
052b1ab41-869e-4138-9e40-2a4457f09bf02019-02-12 14:22:40.6972019-02-12 13:00:07.000asihuntomsworkspacergmsticalertswin110.0.3.5[65.55.44.109]NaNNaNT65.55.44.109
152b1ab41-869e-4138-9e40-2a4457f09bf02019-02-12 14:22:40.6812019-02-12 13:00:48.000asihuntomsworkspacergmsticalertswin110.0.3.5[13.71.172.130, 13.71.172.128]NaNNaNT13.71.172.128
252b1ab41-869e-4138-9e40-2a4457f09bf02019-02-12 14:22:40.6812019-02-12 13:00:48.000asihuntomsworkspacergmsticalertswin110.0.3.5[13.71.172.130, 13.71.172.128]NaNNaNT13.71.172.130
\n", "
" ], "text/plain": [ " TenantId TimeGenerated \\\n", "0 52b1ab41-869e-4138-9e40-2a4457f09bf0 2019-02-12 14:22:40.697 \n", "1 52b1ab41-869e-4138-9e40-2a4457f09bf0 2019-02-12 14:22:40.681 \n", "2 52b1ab41-869e-4138-9e40-2a4457f09bf0 2019-02-12 14:22:40.681 \n", "\n", " FlowStartTime ResourceGroup VMName \\\n", "0 2019-02-12 13:00:07.000 asihuntomsworkspacerg msticalertswin1 \n", "1 2019-02-12 13:00:48.000 asihuntomsworkspacerg msticalertswin1 \n", "2 2019-02-12 13:00:48.000 asihuntomsworkspacerg msticalertswin1 \n", "\n", " VMIPAddress PublicIPs SrcIP DestIP L4Protocol \\\n", "0 10.0.3.5 [65.55.44.109] NaN NaN T \n", "1 10.0.3.5 [13.71.172.130, 13.71.172.128] NaN NaN T \n", "2 10.0.3.5 [13.71.172.130, 13.71.172.128] NaN NaN T \n", "\n", " AllExtIPs \n", "0 65.55.44.109 \n", "1 13.71.172.128 \n", "2 13.71.172.130 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TenantIdTimeGeneratedFlowStartTimeResourceGroupVMNameVMIPAddressPublicIPsSrcIPDestIPL4ProtocolAllExtIPs
0f9ef3428-3ccb-4ecd-8466-dbedc70442932019-02-12 14:22:40.6972019-02-12 13:00:07.000ibmkajbmepnmiaeilfofafmlmbnlpdcbnbnn10.0.3.5[65.55.44.109]NaNNaNT65.55.44.109
1f9ef3428-3ccb-4ecd-8466-dbedc70442932019-02-12 14:22:40.6812019-02-12 13:00:48.000ibmkajbmepnmiaeilfofafmlmbnlpdcbnbnn10.0.3.5[13.71.172.130, 13.71.172.128]NaNNaNT13.71.172.128
2f9ef3428-3ccb-4ecd-8466-dbedc70442932019-02-12 14:22:40.6812019-02-12 13:00:48.000ibmkajbmepnmiaeilfofafmlmbnlpdcbnbnn10.0.3.5[13.71.172.130, 13.71.172.128]NaNNaNT13.71.172.130
\n", "
" ], "text/plain": [ " TenantId TimeGenerated \\\n", "0 f9ef3428-3ccb-4ecd-8466-dbedc7044293 2019-02-12 14:22:40.697 \n", "1 f9ef3428-3ccb-4ecd-8466-dbedc7044293 2019-02-12 14:22:40.681 \n", "2 f9ef3428-3ccb-4ecd-8466-dbedc7044293 2019-02-12 14:22:40.681 \n", "\n", " FlowStartTime ResourceGroup VMName \\\n", "0 2019-02-12 13:00:07.000 ibmkajbmepnmiaeilfofa fmlmbnlpdcbnbnn \n", "1 2019-02-12 13:00:48.000 ibmkajbmepnmiaeilfofa fmlmbnlpdcbnbnn \n", "2 2019-02-12 13:00:48.000 ibmkajbmepnmiaeilfofa fmlmbnlpdcbnbnn \n", "\n", " VMIPAddress PublicIPs SrcIP DestIP L4Protocol \\\n", "0 10.0.3.5 [65.55.44.109] NaN NaN T \n", "1 10.0.3.5 [13.71.172.130, 13.71.172.128] NaN NaN T \n", "2 10.0.3.5 [13.71.172.130, 13.71.172.128] NaN NaN T \n", "\n", " AllExtIPs \n", "0 65.55.44.109 \n", "1 13.71.172.128 \n", "2 13.71.172.130 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display(netflow_df.head(3))\n", "netflow_df.head(3).mp_mask.mask()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding custom column mappings\n", "\n", "Note in the previous example that the VMIPAddress, PublicIPs and AllExtIPs columns were unchanged.\n", "\n", "We can add these columns to a custom mapping dictionary and re-run the obfuscation.\n", "See the later section on [Creating Custom Mappings](#Creating-custom-mappings)." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TenantIdTimeGeneratedFlowStartTimeResourceGroupVMNameVMIPAddressPublicIPsSrcIPDestIPL4ProtocolAllExtIPs
0f9ef3428-3ccb-4ecd-8466-dbedc70442932019-02-12 14:22:40.6972019-02-12 13:00:07.000ibmkajbmepnmiaeilfofafmlmbnlpdcbnbnn10.112.51.93[100.11.187.82]NaNNaNT100.11.187.82
1f9ef3428-3ccb-4ecd-8466-dbedc70442932019-02-12 14:22:40.6812019-02-12 13:00:48.000ibmkajbmepnmiaeilfofafmlmbnlpdcbnbnn10.112.51.93[144.169.193.140, 144.169.193.144]NaNNaNT144.169.193.144
2f9ef3428-3ccb-4ecd-8466-dbedc70442932019-02-12 14:22:40.6812019-02-12 13:00:48.000ibmkajbmepnmiaeilfofafmlmbnlpdcbnbnn10.112.51.93[144.169.193.140, 144.169.193.144]NaNNaNT144.169.193.140
\n", "
" ], "text/plain": [ " TenantId TimeGenerated \\\n", "0 f9ef3428-3ccb-4ecd-8466-dbedc7044293 2019-02-12 14:22:40.697 \n", "1 f9ef3428-3ccb-4ecd-8466-dbedc7044293 2019-02-12 14:22:40.681 \n", "2 f9ef3428-3ccb-4ecd-8466-dbedc7044293 2019-02-12 14:22:40.681 \n", "\n", " FlowStartTime ResourceGroup VMName \\\n", "0 2019-02-12 13:00:07.000 ibmkajbmepnmiaeilfofa fmlmbnlpdcbnbnn \n", "1 2019-02-12 13:00:48.000 ibmkajbmepnmiaeilfofa fmlmbnlpdcbnbnn \n", "2 2019-02-12 13:00:48.000 ibmkajbmepnmiaeilfofa fmlmbnlpdcbnbnn \n", "\n", " VMIPAddress PublicIPs SrcIP DestIP L4Protocol \\\n", "0 10.112.51.93 [100.11.187.82] NaN NaN T \n", "1 10.112.51.93 [144.169.193.140, 144.169.193.144] NaN NaN T \n", "2 10.112.51.93 [144.169.193.140, 144.169.193.144] NaN NaN T \n", "\n", " AllExtIPs \n", "0 100.11.187.82 \n", "1 144.169.193.144 \n", "2 144.169.193.140 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col_map = {\n", " \"VMName\": \".\",\n", " \"VMIPAddress\": \"ip\", \n", " \"PublicIPs\": \"ip\",\n", " \"AllExtIPs\": \"ip\"\n", "}\n", "\n", "netflow_df.head(3).mp_mask.mask(column_map=col_map)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### obfuscate_df function\n", "\n", "You can also call the standard function `obfuscate_df` to perform the same operation\n", "on the dataframe passed as the `data` parameter." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TenantIdTimeGeneratedFlowStartTimeResourceGroupVMNameVMIPAddressPublicIPsSrcIPDestIPL4ProtocolAllExtIPs
0f9ef3428-3ccb-4ecd-8466-dbedc70442932019-02-12 14:22:40.6972019-02-12 13:00:07.000ibmkajbmepnmiaeilfofafmlmbnlpdcbnbnn10.112.51.93[100.11.187.82]NaNNaNT100.11.187.82
1f9ef3428-3ccb-4ecd-8466-dbedc70442932019-02-12 14:22:40.6812019-02-12 13:00:48.000ibmkajbmepnmiaeilfofafmlmbnlpdcbnbnn10.112.51.93[144.169.193.140, 144.169.193.144]NaNNaNT144.169.193.144
2f9ef3428-3ccb-4ecd-8466-dbedc70442932019-02-12 14:22:40.6812019-02-12 13:00:48.000ibmkajbmepnmiaeilfofafmlmbnlpdcbnbnn10.112.51.93[144.169.193.140, 144.169.193.144]NaNNaNT144.169.193.140
\n", "
" ], "text/plain": [ " TenantId TimeGenerated \\\n", "0 f9ef3428-3ccb-4ecd-8466-dbedc7044293 2019-02-12 14:22:40.697 \n", "1 f9ef3428-3ccb-4ecd-8466-dbedc7044293 2019-02-12 14:22:40.681 \n", "2 f9ef3428-3ccb-4ecd-8466-dbedc7044293 2019-02-12 14:22:40.681 \n", "\n", " FlowStartTime ResourceGroup VMName \\\n", "0 2019-02-12 13:00:07.000 ibmkajbmepnmiaeilfofa fmlmbnlpdcbnbnn \n", "1 2019-02-12 13:00:48.000 ibmkajbmepnmiaeilfofa fmlmbnlpdcbnbnn \n", "2 2019-02-12 13:00:48.000 ibmkajbmepnmiaeilfofa fmlmbnlpdcbnbnn \n", "\n", " VMIPAddress PublicIPs SrcIP DestIP L4Protocol \\\n", "0 10.112.51.93 [100.11.187.82] NaN NaN T \n", "1 10.112.51.93 [144.169.193.140, 144.169.193.144] NaN NaN T \n", "2 10.112.51.93 [144.169.193.140, 144.169.193.144] NaN NaN T \n", "\n", " AllExtIPs \n", "0 100.11.187.82 \n", "1 144.169.193.144 \n", "2 144.169.193.140 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_obfus.obfuscate_df(data=netflow_df.head(3), column_map=col_map)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating custom mappings\n", "\n", "A custom mapping dictionary has entries in the following form:\n", "```\n", " \"ColumnName\": \"operation\"\n", "```\n", "\n", "The `operation` defines the type of obfuscation method used for that column. Both the column\n", "name and the operation code must be quoted.\n", "\n", "|operation code | obfuscation function |\n", "|---------------|----------------------|\n", "| \"uuid\" | replace_guid |\n", "| \"ip\" | hash_ip |\n", "| \"str\" | hash_string |\n", "| \"dict\" | hash_dict |\n", "| \"list\" | hash_list |\n", "| \"sid\" | hash_sid |\n", "| \"null\" | \"null\"\\* |\n", "| None | hash_string\\* |\n", "| delims_str | hash_item\\* |\n", "\n", "\\*The last three items require some explanation:\n", "- null - the `null` operation code sets the value to empty, i.e. it deletes the value\n", " in the output frame.\n", "- None (i.e. the dictionary value is `None`) defaults to `hash_string`.\n", "- delims_str - any string other than those named above is assumed to be a string of delimiters.\n", " See the next section for a discussion of the use of delimiters.\n", "\n", "---\n", "\n", "> **NOTE** If you want to use *only* custom mappings and ignore the built-in\n", "> mapping table, specify `use_default=False` as a parameter to either\n", "> `mp_mask.mask()` or `obfuscate_df`.\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using `hash_item` with delimiters to preserve the structure/look of the hashed input\n", "\n", "Using `hash_item` with a delimiters string lets you create output that somewhat resembles the input\n", "type. The delimiters string is specified as a simple string of delimiter characters, e.g. `\"@\\,-\"`\n", "\n", "The input string is broken into substrings using each of the delimiters in the delims_str. The substrings\n", "are individually hashed and the resulting substrings joined together using the original delimiters.\n", "The string is split in the order of the characters in the delims string.\n", "\n", "This allows you to create hashed values that bear some resemblance to the original structure of the string.\n", "This might be useful for email addresses, qualified domain names and other structured text.\n", "\n", "For example, take the address `ian@mydomain.com`.\n", "\n", "Using the simple `hash_string` function, the output bears no resemblance to an email address." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'prqocjmdpbodrafn'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hash_string(\"ian@mydomain.com\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `hash_item` and specifying the expected delimiters, we get something like an email address in the output." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'bnm@blbbrfbk.pjb'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hash_item(\"ian@mydomain.com\", \"@.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You use `hash_item` in your Custom Mapping dictionary by specifying a delimiters string as the `operation`."
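, "\n", "\n", "The split, hash and rejoin steps described in this section can be sketched in plain Python. This is an illustration only, not MSTICPy's implementation: `sketch_hash_string` and `sketch_hash_item` are hypothetical names, the SHA-256-to-letters mapping is an assumption made for the sketch, and it splits on all delimiter characters in a single pass rather than in the order they appear in the delimiters string.

```python
import hashlib
import re


def sketch_hash_string(text):
    # Hypothetical stand-in for hash_string: map a SHA-256 digest onto
    # lowercase letters, producing output of the same length as the input.
    digest = hashlib.sha256(text.encode('utf-8')).digest()
    letters = 'abcdefghijklmnopqrstuvwxyz'
    return ''.join(
        letters[digest[i % len(digest)] % 26] for i in range(len(text))
    )


def sketch_hash_item(text, delims=None):
    # Hypothetical stand-in for hash_item: split on any delimiter character,
    # hash each piece, then put the original delimiters back in place.
    if not delims:
        return sketch_hash_string(text)
    pattern = '([' + re.escape(delims) + '])'
    parts = re.split(pattern, text)
    # With one capturing group, the odd-indexed parts are the delimiters.
    return ''.join(
        part if index % 2 else sketch_hash_string(part)
        for index, part in enumerate(parts)
    )


masked = sketch_hash_item('ian@mydomain.com', '@.')
print(masked)
```

The sketch leaves the `@` and `.` characters where they were, so the result still has the shape of an email address. The letters it produces will not match the output of the real `hash_item`."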
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking Your Obfuscation\n", "\n", "You should check that you have correctly masked all of the columns needed. \n", "There is a function `check_obfuscation` to do this.\n", "\n", "Use `silent=False` to print out the results.\n", "If you use `silent=True` (the default), it will return two lists: the `unchanged` and\n", "the `obfuscated` columns.\n", "\n", "```\n", "data_obfus.check_obfuscation(\n", "    data: pandas.core.frame.DataFrame,\n", "    orig_data: pandas.core.frame.DataFrame,\n", "    index: int = 0,\n", "    silent=True,\n", ") -> Union[Tuple[List[str], List[str]], NoneType]\n", "\n", "Check the obfuscation results for a row.\n", "Parameters\n", "----------\n", "data : pd.DataFrame\n", "    Obfuscated DataFrame\n", "orig_data : pd.DataFrame\n", "    Original DataFrame\n", "index : int, optional\n", "    The row to check, by default 0\n", "silent: bool\n", "    If True the function prints no output and\n", "    returns lists of unchanged and changed columns.\n", "    By default, True\n", "\n", "Returns\n", "-------\n", "Optional[Tuple[List[str], List[str]]] :\n", "    If silent is True returns a tuple of unchanged, changed\n", "    items.
If False, returns None.\n", "```\n", "\n", "> **Note** by default this will check only the first row of the data.\n", "> You can check other rows using the index parameter.\n", "\n", "> **Warning** The two DataFrames should have a matching index and ordering because\n", "> the check works by comparing the values in each column, judging that\n", "> column values that do not match have been obfuscated.\n", "\n", "**We first test the partially-obfuscated DataFrame from earlier.**" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "===== Start Check ====\n", "Unchanged columns:\n", "------------------\n", "AllExtIPs: 65.55.44.109\n", "FlowStartTime: 2019-02-12 13:00:07.000\n", "L4Protocol: T\n", "PublicIPs: ['65.55.44.109']\n", "TimeGenerated: 2019-02-12 14:22:40.697\n", "VMIPAddress: 10.0.3.5\n", "\n", "Obfuscated columns:\n", "--------------------\n", "DestIP: nan ----> nan\n", "ResourceGroup: asihuntomsworkspacerg ----> ibmkajbmepnmiaeilfofa\n", "SrcIP: nan ----> nan\n", "TenantId: 52b1ab41-869e-4138-9e40-2a4457f09bf0 ----> f9ef3428-3ccb-4ecd-8466-dbedc7044293\n", "VMName: msticalertswin1 ----> fmlmbnlpdcbnbnn\n", "====== End Check =====\n" ] } ], "source": [ "partly_obfus_df = netflow_df.head(3).mp_mask.mask()\n", "fully_obfus_df = netflow_df.head(3).mp_mask.mask(column_map=col_map)\n", "\n", "data_obfus.check_obfuscation(partly_obfus_df, netflow_df.head(3), silent=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Checking the fully-obfuscated data set**" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "===== Start Check ====\n", "Unchanged columns:\n", "------------------\n", "FlowStartTime: 2019-02-12 13:00:07.000\n", "L4Protocol: T\n", "TimeGenerated: 2019-02-12 14:22:40.697\n", "\n", "Obfuscated columns:\n", "--------------------\n", "AllExtIPs: 65.55.44.109 ----> 
100.11.187.82\n", "DestIP: nan ----> nan\n", "PublicIPs: ['65.55.44.109'] ----> ['100.11.187.82']\n", "ResourceGroup: asihuntomsworkspacerg ----> ibmkajbmepnmiaeilfofa\n", "SrcIP: nan ----> nan\n", "TenantId: 52b1ab41-869e-4138-9e40-2a4457f09bf0 ----> f9ef3428-3ccb-4ecd-8466-dbedc7044293\n", "VMIPAddress: 10.0.3.5 ----> 10.112.51.93\n", "VMName: msticalertswin1 ----> fmlmbnlpdcbnbnn\n", "====== End Check =====\n" ] } ], "source": [ "data_obfus.check_obfuscation(fully_obfus_df, netflow_df.head(3), silent=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Appendix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import tabulate\n", "# print(tabulate.tabulate(netflow_df.head(3), tablefmt=\"rst\", showindex=False, headers=\"keys\"))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 4 }