# Data Obfuscation Library

Sharing data, creating documents and doing public demonstrations often require that data containing
PII or other sensitive material be obfuscated.

MSTICPy contains a simple library to obfuscate data using hashing and random mapping of values.
You can use these functions on single data items or entire DataFrames.

## Contents
- [Import the module](#Import-the-module)
- [Individual Obfuscation Functions](#Individual-Obfuscation-Functions)
- [Obfuscating DataFrames](#Obfuscating-DataFrames)
- [Creating custom column mappings](#Creating-custom-mappings)
- [Using hash_item with delimiters](#Using-hash_item-with-delimiters-to-preserve-the-structure/look-of-the-hashed-input)
- [Checking Your Obfuscation](#Checking-Your-Obfuscation)

## Import the module

In [1]:
import pandas as pd
from msticpy.common.utility import md
from msticpy.data import data_obfus

### Read in some data for the examples

In [2]:

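import ast

netflow_df = pd.read_csv("data/az_net_flows.csv")
# Lists are imported as strings from the CSV - convert them back to lists.
# ast.literal_eval only parses Python literals, so it is safer than eval.
def str_to_list(val):
    if isinstance(val, str):
        return ast.literal_eval(val)
    # Leave non-string values (e.g. NaN) unchanged
    return val
netflow_df["PublicIPs"] = netflow_df["PublicIPs"].apply(str_to_list)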

# Define subset of output columns
out_cols = [
    'TenantId', 'TimeGenerated', 'FlowStartTime',
    'ResourceGroup', 'VMName', 'VMIPAddress', 'PublicIPs',
    'SrcIP', 'DestIP', 'L4Protocol', 'AllExtIPs'
]
netflow_df = netflow_df[out_cols]

## Individual Obfuscation Functions

Here we're importing individual functions but you can access them with the single
import statement above as:
```
data_obfus.hash_string(...)
```
etc.

> **Note** In the next cell we're using a function to output documentation and examples.<br>
> You can ignore this. The usage of each function is shown in the output of<br>
> the subsequent cells.

In [3]:
from msticpy.data.data_obfus import (
    hash_dict,
    hash_ip,
    hash_item,
    hash_list,
    hash_sid,
    hash_string,
    replace_guid
)

# Function to automate/format the examples below. You can ignore this
def show_func(func, examples):
    func_name = func.__name__
    if func.__name__.startswith("_"):
        func_name = func_name[1:]
    md(func_name, "bold")
    print(func.__doc__)
    md("Examples", "bold")
    for example in examples:
        if isinstance(example, tuple):
            arg, delim = example
            print(
                f"{func_name}('{arg}', delim='{delim}') =>", func(*example)
            )
        else:
            print(
                f"{func_name}('{example}') =>", func(example)
            )
    md("<br><hr><br>")

In [4]:
md("hash_string", "large, bold")
md("hash_string does a simple hash of the input. If the input is a numeric string it will output a numeric string.")
show_func(hash_string, ["sensitive data", "42424"])


    Hash a simple string.

    Parameters
    ----------
    input_str : str
        The input string

    Returns
    -------
    str
        The obfuscated output string

    


hash_string('sensitive data') => jdiqcnrqmlidkd
hash_string('42424') => 98478
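
The behaviour shown above (letters for text input, digits for numeric input, same length as the input) can be illustrated with a short sketch. This is not MSTICPy's implementation: the name `sketch_hash_string` and the choice of SHA-256 are assumptions made purely for illustration.

```python
import hashlib


def sketch_hash_string(input_str: str) -> str:
    """Hypothetical stand-in for hash_string (not the MSTICPy code)."""
    digest = hashlib.sha256(input_str.encode("utf-8")).hexdigest()
    if input_str.isdigit():
        # Numeric input: keep the output numeric and the same length.
        number = int(digest, 16) % (10 ** len(input_str))
        return str(number).zfill(len(input_str))
    # Otherwise map each hex digit of the digest to a letter a-p and
    # trim to the input length (inputs over 64 characters are shortened).
    letters = "".join(chr(ord("a") + int(ch, 16)) for ch in digest)
    return letters[: len(input_str)]


print(sketch_hash_string("sensitive data"))
print(sketch_hash_string("42424"))
```

Because the sketch is a pure function of its input, the same value always produces the same output, which keeps joins and group-bys on the obfuscated data meaningful.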


In [5]:
md("hash_item", "large, bold")
md("hash_item allows specification of delimiters. Useful for preserving the look of domains, emails, etc.")
show_func(hash_item, [("sensitive data", " "), ("most-sensitive-data/here", " /-")])


    Hash a simple string.

    Parameters
    ----------
    input_item : str
        The input string
    delim: str, optional
        A string of delimiters to use to split the input string
        prior to hashing.

    Returns
    -------
    str
        The obfuscated output string

    


hash_item('sensitive data', delim=' ') => kdneqoiia laoe
hash_item('most-sensitive-data/here', delim=' /-') => kmea-kdneqoiia-laoe/fcec


In [6]:
md("hash_ip", "large, bold")
md("hash_ip will output random mappings of input IPv4 and IPv6 addresses.")
md("Within a Python session the mapping will remain constant.")
show_func(hash_ip, [
    "192.168.3.1", 
    "2001:0db8:85a3:0000:0000:8a2e:0370:7334",
    ["192.168.3.1", "192.168.5.2", "192.168.10.2"],
])


    Hash IP address or list of IP addresses.

    Parameters
    ----------
    input_item : Union[List[str], str]
        List of IP addresses or single IP address.

    Returns
    -------
    Union[List[str], str]
        List of hashed addresses or single address.
        (depending on input)

    


hash_ip('192.168.3.1') => 192.168.84.105
hash_ip('2001:0db8:85a3:0000:0000:8a2e:0370:7334') => 85d6:7819:9cce:9af1:9af1:24ad:d338:7d03
hash_ip('['192.168.3.1', '192.168.5.2', '192.168.10.2']') => ['192.168.84.105', '192.168.172.202', '192.168.232.202']
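
One way to obtain mappings that stay constant within a session is to cache random replacements in a dictionary. The sketch below does this per IPv4 octet. The names `sketch_hash_ip` and `_OCTET_MAP` are hypothetical, and the sketch does not reproduce MSTICPy's actual logic (it ignores IPv6 and does not preserve private address prefixes).

```python
import random
from typing import Dict, List, Union

# Session-level cache: the same octet always maps to the same replacement.
# Collisions between different octets are possible in this simplified version.
_OCTET_MAP: Dict[str, str] = {}


def _map_octet(octet: str) -> str:
    if octet not in _OCTET_MAP:
        _OCTET_MAP[octet] = str(random.randint(1, 254))
    return _OCTET_MAP[octet]


def sketch_hash_ip(input_item: Union[List[str], str]) -> Union[List[str], str]:
    """Hypothetical IPv4-only stand-in for hash_ip (not the MSTICPy code)."""
    if isinstance(input_item, list):
        # Lists are handled element by element, preserving order.
        return [sketch_hash_ip(ip) for ip in input_item]
    return ".".join(_map_octet(octet) for octet in input_item.split("."))


first = sketch_hash_ip("192.168.3.1")
again = sketch_hash_ip(["192.168.3.1", "192.168.5.2"])
print(first, again)
```

The cache lives only as long as the Python process, so a new session produces a different mapping for the same address.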


In [7]:
md("hash_sid", "large, bold")
md("hash_sid will randomize the domain-specific parts of a SID. It preserves built-in SIDs and well known RIDs (e.g. Admins -500)")
show_func(hash_sid, ["S-1-5-21-1180699209-877415012-3182924384-1004", "S-1-5-18"])


    Hash a SID preserving well-known SIDs and the RID.

    Parameters
    ----------
    sid : str
        SID string

    Returns
    -------
    str
        Hashed SID

    


hash_sid('S-1-5-21-1180699209-877415012-3182924384-1004') => S-1-5-21-3321821741-636458740-4143214142-1004
hash_sid('S-1-5-18') => S-1-5-18
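
The idea of changing only the domain-specific part of a SID can be sketched as follows. This is an assumption-laden illustration rather than the library's code: `sketch_hash_sid` is a hypothetical name, SHA-256 is an arbitrary choice, and the sketch treats only `S-1-5-21-...` SIDs with three sub-authorities and a RID as domain SIDs.

```python
import hashlib


def sketch_hash_sid(sid: str) -> str:
    """Hypothetical stand-in for hash_sid (not the MSTICPy code)."""
    parts = sid.split("-")
    # Domain/machine account SIDs look like S-1-5-21-<a>-<b>-<c>-<RID>.
    if len(parts) != 8 or parts[:4] != ["S", "1", "5", "21"]:
        return sid  # built-in or well-known SID: leave unchanged
    # Replace the three domain sub-authorities with 32-bit hash values.
    hashed = [
        str(int(hashlib.sha256(part.encode("utf-8")).hexdigest(), 16) % 2**32)
        for part in parts[4:7]
    ]
    # Keep the S-1-5-21 prefix and the trailing RID as they were.
    return "-".join(parts[:4] + hashed + parts[7:])


print(sketch_hash_sid("S-1-5-21-1180699209-877415012-3182924384-1004"))
print(sketch_hash_sid("S-1-5-18"))
```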


In [8]:
md("hash_list", "large, bold")
md("hash_list will randomize a list of items preserving the list structure.")
show_func(hash_list, [["S-1-5-21-1180699209-877415012-3182924384-1004", "S-1-5-18"]])


    Hash list of strings.

    Parameters
    ----------
    item_list : List[str]
        Input list

    Returns
    -------
    List[str]
        Hashed list

    


hash_list('['S-1-5-21-1180699209-877415012-3182924384-1004', 'S-1-5-18']') => ['elkbjiboklpknokdeflikamojqjflqmicqiorqfbqboqe', 'nrllmpbd']


In [9]:
md("hash_dict", "large, bold")
md("hash_dict will randomize a dict of items preserving the structure and the dict keys.")
show_func(hash_dict, [{"SID1": "S-1-5-21-1180699209-877415012-3182924384-1004", "SID2": "S-1-5-18"}])


    Hash dictionary values.

    Parameters
    ----------
    item_dict : Dict[str, Union[Dict[str, Any], List[Any], str]]
        Input item can be a Dict of strings, lists or other
        dictionaries.

    Returns
    -------
    Dict[str, Any]
        Dictionary with hashed values.

    


hash_dict('{'SID1': 'S-1-5-21-1180699209-877415012-3182924384-1004', 'SID2': 'S-1-5-18'}') => {'SID1': 'elkbjiboklpknokdeflikamojqjflqmicqiorqfbqboqe', 'SID2': 'nrllmpbd'}
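
Preserving structure while hashing the contents amounts to walking the container recursively and replacing only the leaf strings. A minimal sketch, assuming a placeholder hash (`_hash_text`) and a hypothetical function name `sketch_hash_nested`; neither is part of MSTICPy, and the sample names are invented:

```python
import hashlib
from typing import Any


def _hash_text(text: str) -> str:
    # Placeholder hash for this sketch: first 8 hex digits of SHA-256.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]


def sketch_hash_nested(item: Any) -> Any:
    """Hypothetical: hash leaf strings, keep dict keys and list layout."""
    if isinstance(item, dict):
        return {key: sketch_hash_nested(val) for key, val in item.items()}
    if isinstance(item, list):
        return [sketch_hash_nested(val) for val in item]
    if isinstance(item, str):
        return _hash_text(item)
    return item  # non-string leaves are passed through unchanged


sample = {"SID1": "S-1-5-18", "Nested": {"Names": ["alice", "bob"]}}
result = sketch_hash_nested(sample)
print(result)
```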


In [10]:
md("replace_guid", "large, bold")
md("replace_guid will output a random UUID mapped to the input.")
md("An input GUID will be mapped to the same newly-generated output UUID")
md("You can see that UUID #4 is the same as #1 and mapped to the same output UUID.")
show_func(replace_guid, [
    "cf1b0b29-08ae-4528-839a-5f66eca2cce9",
    "ed63d29e-6288-4d66-b10d-8847096fc586",
    "ac561203-99b2-4067-a525-60d45ea0d7ff",
    "cf1b0b29-08ae-4528-839a-5f66eca2cce9",
])


        Replace GUID/UUID with mapped random UUID.

        Parameters
        ----------
        guid : str
            Input UUID.

        Returns
        -------
        str
            Mapped UUID

        


replace_guid('cf1b0b29-08ae-4528-839a-5f66eca2cce9') => 01ae8633-22e5-480f-b884-fc48588c25d9
replace_guid('ed63d29e-6288-4d66-b10d-8847096fc586') => 52cd2814-b5e4-48bd-80f2-51b503e50467
replace_guid('ac561203-99b2-4067-a525-60d45ea0d7ff') => ef059dc7-2d6e-4506-8619-05b346a6bc6b
replace_guid('cf1b0b29-08ae-4528-839a-5f66eca2cce9') => 01ae8633-22e5-480f-b884-fc48588c25d9
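
The repeatable mapping seen above (input #4 gives the same output as input #1) can be sketched with a dictionary that remembers each replacement. The names `sketch_replace_guid` and `_GUID_MAP` are hypothetical, and lower-casing the key is an assumption of this sketch rather than documented library behaviour.

```python
import uuid
from typing import Dict

# Remembers the replacement issued for each input GUID in this session.
_GUID_MAP: Dict[str, str] = {}


def sketch_replace_guid(guid: str) -> str:
    """Hypothetical stand-in for replace_guid (not the MSTICPy code)."""
    key = guid.lower()  # assumption: treat GUIDs case-insensitively
    if key not in _GUID_MAP:
        _GUID_MAP[key] = str(uuid.uuid4())
    return _GUID_MAP[key]


guid_a = sketch_replace_guid("cf1b0b29-08ae-4528-839a-5f66eca2cce9")
guid_b = sketch_replace_guid("ed63d29e-6288-4d66-b10d-8847096fc586")
guid_a_again = sketch_replace_guid("cf1b0b29-08ae-4528-839a-5f66eca2cce9")
print(guid_a, guid_b, guid_a_again)
```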


## Obfuscating DataFrames

We can use the msticpy pandas extension to obfuscate an entire DataFrame.

The obfuscation library contains a mapping for a number of common field names.
You can view this list by displaying the attribute:
```
data_obfus.OBFUS_COL_MAP
```

In the first example, the TenantId, ResourceGroup and VMName columns are obfuscated.

In [12]:
display(netflow_df.head(3))
netflow_df.head(3).mp_mask.mask()

Unnamed: 0,TenantId,TimeGenerated,FlowStartTime,ResourceGroup,VMName,VMIPAddress,PublicIPs,SrcIP,DestIP,L4Protocol,AllExtIPs
0,52b1ab41-869e-4138-9e40-2a4457f09bf0,2019-02-12 14:22:40.697,2019-02-12 13:00:07.000,asihuntomsworkspacerg,msticalertswin1,10.0.3.5,[65.55.44.109],,,T,65.55.44.109
1,52b1ab41-869e-4138-9e40-2a4457f09bf0,2019-02-12 14:22:40.681,2019-02-12 13:00:48.000,asihuntomsworkspacerg,msticalertswin1,10.0.3.5,"[13.71.172.130, 13.71.172.128]",,,T,13.71.172.128
2,52b1ab41-869e-4138-9e40-2a4457f09bf0,2019-02-12 14:22:40.681,2019-02-12 13:00:48.000,asihuntomsworkspacerg,msticalertswin1,10.0.3.5,"[13.71.172.130, 13.71.172.128]",,,T,13.71.172.130


Unnamed: 0,TenantId,TimeGenerated,FlowStartTime,ResourceGroup,VMName,VMIPAddress,PublicIPs,SrcIP,DestIP,L4Protocol,AllExtIPs
0,f9ef3428-3ccb-4ecd-8466-dbedc7044293,2019-02-12 14:22:40.697,2019-02-12 13:00:07.000,ibmkajbmepnmiaeilfofa,fmlmbnlpdcbnbnn,10.0.3.5,[65.55.44.109],,,T,65.55.44.109
1,f9ef3428-3ccb-4ecd-8466-dbedc7044293,2019-02-12 14:22:40.681,2019-02-12 13:00:48.000,ibmkajbmepnmiaeilfofa,fmlmbnlpdcbnbnn,10.0.3.5,"[13.71.172.130, 13.71.172.128]",,,T,13.71.172.128
2,f9ef3428-3ccb-4ecd-8466-dbedc7044293,2019-02-12 14:22:40.681,2019-02-12 13:00:48.000,ibmkajbmepnmiaeilfofa,fmlmbnlpdcbnbnn,10.0.3.5,"[13.71.172.130, 13.71.172.128]",,,T,13.71.172.130


### Adding custom column mappings

Note in the previous example that the VMIPAddress, PublicIPs and AllExtIPs columns were unchanged.

We can add these columns to a custom mapping dictionary and re-run the obfuscation.
See the later section on [Creating Custom Mappings](#Creating-custom-mappings).

In [14]:
col_map = {
    "VMName": ".",
    "VMIPAddress": "ip", 
    "PublicIPs": "ip",
    "AllExtIPs": "ip"
}

netflow_df.head(3).mp_mask.mask(column_map=col_map)

Unnamed: 0,TenantId,TimeGenerated,FlowStartTime,ResourceGroup,VMName,VMIPAddress,PublicIPs,SrcIP,DestIP,L4Protocol,AllExtIPs
0,f9ef3428-3ccb-4ecd-8466-dbedc7044293,2019-02-12 14:22:40.697,2019-02-12 13:00:07.000,ibmkajbmepnmiaeilfofa,fmlmbnlpdcbnbnn,10.112.51.93,[100.11.187.82],,,T,100.11.187.82
1,f9ef3428-3ccb-4ecd-8466-dbedc7044293,2019-02-12 14:22:40.681,2019-02-12 13:00:48.000,ibmkajbmepnmiaeilfofa,fmlmbnlpdcbnbnn,10.112.51.93,"[144.169.193.140, 144.169.193.144]",,,T,144.169.193.144
2,f9ef3428-3ccb-4ecd-8466-dbedc7044293,2019-02-12 14:22:40.681,2019-02-12 13:00:48.000,ibmkajbmepnmiaeilfofa,fmlmbnlpdcbnbnn,10.112.51.93,"[144.169.193.140, 144.169.193.144]",,,T,144.169.193.140


### obfuscate_df function

You can also call the standard function `obfuscate_df` to perform the same operation
on the dataframe passed as the `data` parameter.

In [15]:
data_obfus.obfuscate_df(data=netflow_df.head(3), column_map=col_map)

Unnamed: 0,TenantId,TimeGenerated,FlowStartTime,ResourceGroup,VMName,VMIPAddress,PublicIPs,SrcIP,DestIP,L4Protocol,AllExtIPs
0,f9ef3428-3ccb-4ecd-8466-dbedc7044293,2019-02-12 14:22:40.697,2019-02-12 13:00:07.000,ibmkajbmepnmiaeilfofa,fmlmbnlpdcbnbnn,10.112.51.93,[100.11.187.82],,,T,100.11.187.82
1,f9ef3428-3ccb-4ecd-8466-dbedc7044293,2019-02-12 14:22:40.681,2019-02-12 13:00:48.000,ibmkajbmepnmiaeilfofa,fmlmbnlpdcbnbnn,10.112.51.93,"[144.169.193.140, 144.169.193.144]",,,T,144.169.193.144
2,f9ef3428-3ccb-4ecd-8466-dbedc7044293,2019-02-12 14:22:40.681,2019-02-12 13:00:48.000,ibmkajbmepnmiaeilfofa,fmlmbnlpdcbnbnn,10.112.51.93,"[144.169.193.140, 144.169.193.144]",,,T,144.169.193.140


## Creating custom mappings

A custom mapping dictionary has entries in the following form:
```
    "ColumnName": "operation"
```

The `operation` defines the type of obfuscation method used for that column. Both the column
and the operation code must be quoted.

|operation code | obfuscation function |
|---------------|----------------------|
| "uuid"        | replace_guid         |
| "ip"          | hash_ip              |
| "str"         | hash_string          |
| "dict"        | hash_dict            |
| "list"        | hash_list            |
| "sid"         | hash_sid             |
| "null"        | "null"\*             |
| None          | hash_string\*        |
| delims_str    | hash_item\*          |

\*The last three items require some explanation:
- null - the `null` operation code means set the value to empty - i.e. delete the value
  in the output frame.
- None (i.e. the dictionary value is `None`) defaults to hash_string.
- delims_str - any string other than those named above is assumed to be a string of delimiters.
  See next section for a discussion of use of delimiters.
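
The rules in the table and notes above can be summarised as a small lookup. This is an illustrative sketch only: `resolve_operation`, `NAMED_OPERATIONS` and the `Comment` column are hypothetical, and the sketch returns function names as strings rather than calling into the library.

```python
from typing import Optional, Tuple

# Operation codes from the table above, mapped to function names.
NAMED_OPERATIONS = {
    "uuid": "replace_guid",
    "ip": "hash_ip",
    "str": "hash_string",
    "dict": "hash_dict",
    "list": "hash_list",
    "sid": "hash_sid",
}


def resolve_operation(operation: Optional[str]) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: return (function name, delimiters or None)."""
    if operation is None:
        return "hash_string", None
    if operation == "null":
        return "null", None  # value is blanked out rather than hashed
    if operation in NAMED_OPERATIONS:
        return NAMED_OPERATIONS[operation], None
    # Any other string is treated as a set of delimiters for hash_item.
    return "hash_item", operation


example_map = {"VMName": ".", "VMIPAddress": "ip", "Comment": None}
for column, operation in example_map.items():
    print(column, "->", resolve_operation(operation))
```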

---

> **NOTE** If you want to use *only* custom mappings and ignore the built-in<br>
> mapping table, specify `use_default=False` as a parameter to either<br>
> `mp_mask.mask()` or `obfuscate_df`.
---

## Using `hash_item` with delimiters to preserve the structure/look of the hashed input

Using hash_item with a delimiters string lets you create output that somewhat resembles the input
type. The delimiters string is specified as a simple string of delimiter characters, e.g. `"@\,-"`

The input string is broken into substrings using each of the delimiters in the delims_str. The substrings
are individually hashed and the resulting substrings joined together using the original delimiters.
The string is split in the order of the characters in the delims string.

This allows you to create hashed values that bear some resemblance to the original structure of the string.
This might be useful for email addresses, qualified domain names and other structured text.

For example:
    ian@mydomain.com
    
Using the simple `hash_string` function, the output bears no resemblance to an email address.

In [16]:
hash_string("ian@mydomain.com")

'prqocjmdpbodrafn'

Using `hash_item` and specifying the expected delimiters we get something like an email address in the output.

In [17]:
hash_item("ian@mydomain.com", "@.")

'bnm@blbbrfbk.pjb'

You use `hash_item` in your Custom Mapping dictionary by specifying a delimiters string as the `operation`.

## Checking Your Obfuscation

You should check that you have correctly masked all of the columns needed. 
There is a function `check_obfuscation` to do this.

Use `silent=False` to print out the results.
If you use `silent=True` (the default), it returns two lists: the `unchanged` and
the `obfuscated` columns.

```
data_obfus.check_obfuscation(
    data: pandas.core.frame.DataFrame,
    orig_data: pandas.core.frame.DataFrame,
    index: int = 0,
    silent=True,
) -> Union[Tuple[List[str], List[str]], NoneType]

Check the obfuscation results for a row.
Parameters
----------
data : pd.DataFrame
    Obfuscated DataFrame
orig_data : pd.DataFrame
    Original DataFrame
index : int, optional
    The row to check, by default 0
silent: bool
    If False the function returns no output and
    returns lists of changed and unchanged columns.
    By default, True

Returns
-------
Optional[Tuple[List[str], List[str]]] :
    If silent is True returns a tuple of unchanged, changed
    items. If False, returns None.
```

> **Note** by default this will check only the first row of the data.
> You can check other rows using the index parameter.

> **Warning** The two DataFrames should have a matching index and ordering because
> the check works by comparing the values in each column, judging that
> column values that do not match have been obfuscated.

**We first test the partially-obfuscated DataFrame from earlier.**

In [19]:
partly_obfus_df = netflow_df.head(3).mp_mask.mask()
fully_obfus_df = netflow_df.head(3).mp_mask.mask(column_map=col_map)

data_obfus.check_obfuscation(partly_obfus_df, netflow_df.head(3), silent=False)

===== Start Check ====
Unchanged columns:
------------------
AllExtIPs: 65.55.44.109
FlowStartTime: 2019-02-12 13:00:07.000
L4Protocol: T
PublicIPs: ['65.55.44.109']
TimeGenerated: 2019-02-12 14:22:40.697
VMIPAddress: 10.0.3.5

Obfuscated columns:
--------------------
DestIP:   nan ----> nan
ResourceGroup:   asihuntomsworkspacerg ----> ibmkajbmepnmiaeilfofa
SrcIP:   nan ----> nan
TenantId:   52b1ab41-869e-4138-9e40-2a4457f09bf0 ----> f9ef3428-3ccb-4ecd-8466-dbedc7044293
VMName:   msticalertswin1 ----> fmlmbnlpdcbnbnn


**Checking the fully-obfuscated data set**

In [20]:
data_obfus.check_obfuscation(fully_obfus_df, netflow_df.head(3), silent=False)

===== Start Check ====
Unchanged columns:
------------------
FlowStartTime: 2019-02-12 13:00:07.000
L4Protocol: T
TimeGenerated: 2019-02-12 14:22:40.697

Obfuscated columns:
--------------------
AllExtIPs:   65.55.44.109 ----> 100.11.187.82
DestIP:   nan ----> nan
PublicIPs:   ['65.55.44.109'] ----> ['100.11.187.82']
ResourceGroup:   asihuntomsworkspacerg ----> ibmkajbmepnmiaeilfofa
SrcIP:   nan ----> nan
TenantId:   52b1ab41-869e-4138-9e40-2a4457f09bf0 ----> f9ef3428-3ccb-4ecd-8466-dbedc7044293
VMIPAddress:   10.0.3.5 ----> 10.112.51.93
VMName:   msticalertswin1 ----> fmlmbnlpdcbnbnn


---
## Appendix

In [None]:
# import tabulate
# print(tabulate.tabulate(netflow_df.head(3), tablefmt="rst", showindex=False, headers="keys"))