# Title: msticpy - IoC Extraction
## Description:
The IoCExtract class extracts IoC (Indicator of Compromise) patterns from a string or a DataFrame.
Several patterns are built into the class; you can override these or supply new ones.


<a id='toc'></a>
## Table of Contents
- [Looking for IoC in a String](#cmdlineiocs)
- [Search DataFrame for IoCs](#dataframeiocs)
- [IoCExtract API](#iocextractapi)
  - [Predefined Regex Patterns](#regexpatterns)
  - [Adding your own pattern(s)](#addingpatterns)
  - [extract() method](#extractmethod)
  - [Merge the results with the input DataFrame](#mergeresults)

In [2]:
# Imports
import sys
MIN_REQ_PYTHON = (3,6)
if sys.version_info < MIN_REQ_PYTHON:
    print('Check the Kernel->Change Kernel menu and ensure that Python 3.6')
    print('or later is selected as the active kernel.')
    sys.exit("Python %s.%s or later is required.\n" % MIN_REQ_PYTHON)


import numpy as np
from IPython import get_ipython
from IPython.display import display, HTML
import ipywidgets as widgets

import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
sns.set()
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_colwidth', 100)

import os
module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)
   
import msticpy.sectools as sectools
import msticpy.nbtools as asi
import msticpy.nbtools.kql as qry
import msticpy.nbtools.nbdisplay as nbdisp


In [56]:
# Load test data
process_tree = pd.read_csv('data/process_tree.csv')
process_tree[['CommandLine']].head()

Unnamed: 0,CommandLine
0,.\ftp -s:C:\RECYCLER\xxppyy.exe
1,.\reg not /domain:everything that /sid:shines is /krbtgt:golden !
2,"cmd /c ""systeminfo && systeminfo"""
3,.\rundll32 /C 12345.exe
4,.\rundll32 /C c:\users\MSTICAdmin\12345.exe


<a id='cmdlineiocs'></a>[Contents](#toc)
## Looking for IoC in a String
Here we:
- Get a command line from our data set.
- Pass it to the IoC extractor.
- View the results.

In [8]:
# get a commandline from our data set
cmdline = process_tree['CommandLine'].loc[78]
cmdline

'netsh  start capture=yes IPv4.Address=1.2.3.4 tracefile=C:\\\\Users\\\\user\\\\AppData\\\\Local\\\\Temp\\\\bzzzzzz.txt'

In [13]:
# Instantiate an IoCExtract object
from msticpy.sectools import IoCExtract
ioc_extractor = IoCExtract()

# any IoCs in the string?
iocs_found = ioc_extractor.extract(cmdline)
    
if iocs_found:
    print('\nPotential IoCs found in alert process:')
    display(iocs_found)



Potential IoCs found in alert process:


defaultdict(set,
            {'ipv4': {'1.2.3.4'},
             'windows_path': {'C:\\\\Users\\\\user\\\\AppData\\\\Local\\\\Temp\\\\bzzzzzz.txt'}})
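
To make the shape of that result concrete, here is a minimal sketch of regex-based extraction into a `defaultdict(set)` using only the standard library. The two patterns and the `sketch_extract` helper are simplified, hypothetical stand-ins written for this illustration; they are not the regexes or the code that IoCExtract uses.

```python
import re
from collections import defaultdict

# Hypothetical, simplified patterns for illustration only.
# These are NOT the patterns built into IoCExtract.
SKETCH_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "windows_path": re.compile(r'\b[a-z]:\\[^\s"]+', re.IGNORECASE),
}


def sketch_extract(src):
    """Return a mapping of IoC type name to the set of matches found in src."""
    found = defaultdict(set)
    for ioc_type, pattern in SKETCH_PATTERNS.items():
        for match in pattern.finditer(src):
            found[ioc_type].add(match.group())
    return found


sample = (
    r"netsh start capture=yes IPv4.Address=1.2.3.4 "
    r"tracefile=C:\Users\user\AppData\Local\Temp\bzzzzzz.txt"
)
result = sketch_extract(sample)
for ioc_type, observables in result.items():
    print(ioc_type, observables)
```

Because each value is a set, an observable that appears several times in the same string is reported once.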

<a id='dataframeiocs'></a>[Contents](#toc)
## If we have a DataFrame, look for IoCs in the whole data set
Pass any DataFrame to ioc_extractor.extract() through the ```data``` parameter.
Use the ```columns``` parameter to specify the column or columns that you want to search.

In [10]:
ioc_extractor = IoCExtract()
ioc_df = ioc_extractor.extract(data=process_tree, columns=['CommandLine'], os_family='Windows')
if len(ioc_df):
    display(HTML("<h3>IoC patterns found in process tree.</h3>"))
    display(ioc_df)

Unnamed: 0,IoCType,Observable,SourceIndex
0,windows_path,C:\RECYCLER\xxppyy.exe,0
1,windows_path,.\ftp,0
2,windows_path,.\reg,1
3,windows_path,.\rundll32,3
4,windows_path,c:\users\MSTICAdmin\12345.exe,4
5,windows_path,.\rundll32,4
6,windows_path,.\rundll32,5
7,windows_path,c:\users\MSTICAdmin\1234.exe,6
8,windows_path,.\rundll32,6
9,windows_path,.\rundll32,7


<a id='iocextractapi'></a>[Contents](#toc)
## IoCExtract API


In [16]:
# IoCExtract docstring
IoCExtract?

<a id='regexpatterns'></a>[Contents](#toc)
### Predefined Regex Patterns

In [29]:
extractor = IoCExtract()

for ioc_type, pattern in extractor.ioc_types.items():
    display(HTML(f'<b>{ioc_type}</b>'))
    display(HTML(f'<div style="margin-left:20px"><pre>{pattern.comp_regex}</pre></div>'))

<a id='addingpatterns'></a>[Contents](#toc)
### Adding your own pattern(s)
Docstring:
```
Add an IoC type and regular expression to use to the built-in set.

Note: adding an ioc_type that exists in the internal set will overwrite that item
Regular expressions are compiled with re.I | re.X | re.M (Ignore case, Verbose
and MultiLine)
    :param: ioc_type{str} - a unique name for the IoC type
    :param: ioc_regex{str} - a regular expression used to search for the type
```

In [33]:
import re
# Check that the regular expression compiles before adding it
rcomp = re.compile(r'(?P<pipe>\\\\\.\\pipe\\[^\s\\]+)')

In [39]:
extractor.add_ioc_type(ioc_type='win_named_pipe', ioc_regex=r'(?P<pipe>\\\\\.\\pipe\\[^\s\\]+)')

# Check that it added ok
print(extractor.ioc_types['win_named_pipe'])

# Use it in our data set (search with the same instance that the pattern was added to)
extractor.extract(data=process_tree, columns=['CommandLine'], os_family='Windows').query("IoCType == 'win_named_pipe'")

IoCPattern(ioc_type='win_named_pipe', comp_regex=re.compile('(?P<pipe>\\\\\\\\\\.\\\\pipe\\\\[^\\s\\\\]+)', re.IGNORECASE|re.MULTILINE|re.VERBOSE), priority=0)


Unnamed: 0,IoCType,Observable,SourceIndex
116,win_named_pipe,"\\.\pipe\blahtest""",107


<a id='extractmethod'></a>[Contents](#toc)
### extract() method
Docstring:
```
Extract IoCs from either a string or pandas DataFrame.

    :param data: input DataFrame from which to read source strings
    :param columns: The list of columns to use as source strings,
        if the data parameter is used.
    :param src: source string in which to look for IoC patterns
    :param os_family: 'Linux' or 'Windows'

Returns:
    dict of found observables (if input is a string) or
    DataFrame of observables

Extract takes either a string or a pandas DataFrame as input.
When using the string option as an input extract will
return a dictionary of results.
When using a DataFrame the results will be returned as a new
DataFrame with the following columns:
- IoCType: the mnemonic used to distinguish different IoC Types
- Observable: the actual value of the observable
- SourceIndex: the index of the row in the input DataFrame from
which the source for the IoC observable was extracted.
```

**Note:** the os_family parameter is optional. If you are not interested in searching for Linux paths, omit it or set it to 'Windows'. Almost any character is legal in a Linux path name, so a regex that accepted every possible path would match too much to be useful. The built-in Linux path pattern is therefore more restrictive than the full set of legal path names, but it is still a very loose match.
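
As an illustration of why a loose Linux path match is noisy on Windows data, the sketch below runs a deliberately simple path pattern over one of the command lines from the test data. The `loose_linux_path` pattern is a hypothetical stand-in written for this example, not the pattern built into IoCExtract.

```python
import re

# Hypothetical loose pattern: one or more "/segment" groups.
# This is NOT the Linux path regex built into IoCExtract.
loose_linux_path = re.compile(r'(?:/[\w.-]+)+')

# A Windows command line from the test data shown earlier.
windows_cmdline = r'.\reg not /domain:everything that /sid:shines is /krbtgt:golden !'
# A made-up Linux command line for comparison.
linux_cmdline = 'cat /etc/passwd'

windows_hits = loose_linux_path.findall(windows_cmdline)
linux_hits = loose_linux_path.findall(linux_cmdline)

print(windows_hits)  # command switches are reported as paths
print(linux_hits)
```

The Windows command switches are indistinguishable from short absolute paths under this pattern, which is the kind of false positive that restricting the search to Windows paths avoids.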

In [42]:
# You can specify multiple columns
ioc_extractor.extract(data=process_tree.head(20), columns=['NewProcessName', 'CommandLine']).head(10)

Unnamed: 0,IoCType,Observable,SourceIndex
0,windows_path,C:\Diagnostics\UserTmp\ftp.exe,0
1,windows_path,C:\RECYCLER\xxppyy.exe,0
2,windows_path,.\ftp,0
3,windows_path,C:\Diagnostics\UserTmp\reg.exe,1
4,windows_path,.\reg,1
5,windows_path,C:\Diagnostics\UserTmp\cmd.exe,2
6,windows_path,C:\Diagnostics\UserTmp\rundll32.exe,3
7,windows_path,.\rundll32,3
8,windows_path,C:\Diagnostics\UserTmp\rundll32.exe,4
9,windows_path,c:\users\MSTICAdmin\12345.exe,4


<a id='mergeresults'></a>[Contents](#toc)
### SourceIndex column allows you to merge the results with the input DataFrame
Where an input row has multiple IoC matches, the merged output contains that input row multiple times (once per IoC match). The previous index is preserved in the second column (and in the SourceIndex column).

Note: you will need to set the type of the SourceIndex column. In the example below we are matching with the default numeric index, so we force the type to be numeric. If your index has a different dtype, you will need to convert the SourceIndex column (dtype=object) to match the type of your index.
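
The same steps can be tried on a toy example that does not need the test data file. The two frames below are made-up stand-ins: `input_df` plays the role of the source data and `output_df` mimics the column layout that extract() returns, with SourceIndex held as dtype=object.

```python
import pandas as pd

# Toy stand-ins for the notebook's data; all values are made up.
input_df = pd.DataFrame(
    {"CommandLine": ["ping 1.2.3.4", "copy c:\\a.exe c:\\b.exe", "whoami"]}
)
output_df = pd.DataFrame(
    {
        "IoCType": ["ipv4", "windows_path", "windows_path"],
        "Observable": ["1.2.3.4", "c:\\a.exe", "c:\\b.exe"],
        "SourceIndex": pd.Series([0, 1, 1], dtype=object),
    }
)

# Convert SourceIndex so that it matches the default integer index of input_df.
output_df["SourceIndex"] = pd.to_numeric(output_df["SourceIndex"])

merged_df = pd.merge(
    left=input_df,
    right=output_df,
    how="outer",
    left_index=True,
    right_on="SourceIndex",
)
print(merged_df)
```

Input row 1 has two matches and so appears twice in the merged frame, while the row with no matches is kept by the outer join with empty IoCType and Observable values.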

In [55]:
input_df = process_tree.head(20)
output_df = ioc_extractor.extract(data=input_df, columns=['NewProcessName', 'CommandLine'])
# set the type of the SourceIndex column. In this case we are matching with the default numeric index.
output_df['SourceIndex'] = pd.to_numeric(output_df['SourceIndex'])
merged_df = pd.merge(left=input_df, right=output_df, how='outer', left_index=True, right_on='SourceIndex')
merged_df.head()

Unnamed: 0.1,Unnamed: 0,TenantId,Account,EventID,TimeGenerated,Computer,SubjectUserSid,SubjectUserName,SubjectDomainName,SubjectLogonId,NewProcessId,NewProcessName,TokenElevationType,ProcessId,CommandLine,ParentProcessName,TargetLogonId,SourceComputerId,TimeCreatedUtc,NodeRole,Level,ProcessId1,NewProcessId1,IoCType,Observable,SourceIndex
0,0,802d39e1-9d70-404d-832c-2de5e2478eda,MSTICAlertsWin1\MSTICAdmin,4688,2019-01-15 05:15:15.677,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1,0xfaac27,0x1580,C:\Diagnostics\UserTmp\ftp.exe,%%1936,0xbc8,.\ftp -s:C:\RECYCLER\xxppyy.exe,C:\Windows\System32\cmd.exe,0x0,46fe7078-61bb-4bed-9430-7ac01d91c273,2019-01-15 05:15:15.677,source,0,,,windows_path,C:\Diagnostics\UserTmp\ftp.exe,0
1,0,802d39e1-9d70-404d-832c-2de5e2478eda,MSTICAlertsWin1\MSTICAdmin,4688,2019-01-15 05:15:15.677,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1,0xfaac27,0x1580,C:\Diagnostics\UserTmp\ftp.exe,%%1936,0xbc8,.\ftp -s:C:\RECYCLER\xxppyy.exe,C:\Windows\System32\cmd.exe,0x0,46fe7078-61bb-4bed-9430-7ac01d91c273,2019-01-15 05:15:15.677,source,0,,,windows_path,C:\RECYCLER\xxppyy.exe,0
2,0,802d39e1-9d70-404d-832c-2de5e2478eda,MSTICAlertsWin1\MSTICAdmin,4688,2019-01-15 05:15:15.677,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1,0xfaac27,0x1580,C:\Diagnostics\UserTmp\ftp.exe,%%1936,0xbc8,.\ftp -s:C:\RECYCLER\xxppyy.exe,C:\Windows\System32\cmd.exe,0x0,46fe7078-61bb-4bed-9430-7ac01d91c273,2019-01-15 05:15:15.677,source,0,,,windows_path,.\ftp,0
3,1,802d39e1-9d70-404d-832c-2de5e2478eda,MSTICAlertsWin1\MSTICAdmin,4688,2019-01-15 05:15:16.167,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1,0xfaac27,0x16fc,C:\Diagnostics\UserTmp\reg.exe,%%1936,0xbc8,.\reg not /domain:everything that /sid:shines is /krbtgt:golden !,C:\Windows\System32\cmd.exe,0x0,46fe7078-61bb-4bed-9430-7ac01d91c273,2019-01-15 05:15:16.167,sibling,1,,,windows_path,C:\Diagnostics\UserTmp\reg.exe,1
4,1,802d39e1-9d70-404d-832c-2de5e2478eda,MSTICAlertsWin1\MSTICAdmin,4688,2019-01-15 05:15:16.167,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1,0xfaac27,0x16fc,C:\Diagnostics\UserTmp\reg.exe,%%1936,0xbc8,.\reg not /domain:everything that /sid:shines is /krbtgt:golden !,C:\Windows\System32\cmd.exe,0x0,46fe7078-61bb-4bed-9430-7ac01d91c273,2019-01-15 05:15:16.167,sibling,1,,,windows_path,.\reg,1
