{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Guided Hunting - Base64-Encoded Linux Commands" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " **Notebook Version:** 1.0
\n", " **Python Version:** Python 3.6 or later (including Python 3.8 - AzureML)
\n", " **Required Packages**: kqlmagic, msticpy, pandas, numpy, matplotlib, networkx, seaborn, datetime, ipywidgets, ipython, dnspython, folium, maxminddb_geolite2, BeautifulSoup
\n", " **Platforms Supported**:\n", " - Azure Notebooks Free Compute\n", " - Azure Notebooks DSVM\n", " - OS Independent\n", "\n", " **Data Sources Required**:\n", " - Log Analytics/Azure Sentinel - Syslog, Security Alerts, Auditd, Azure Network Analytics.\n", " - VirusTotal, AlienVault OTX, and IBM XForce require an account and API key, both of which are free to create on their respective websites. If you'd prefer to use only one, or prefer one over the others, further instructions are provided in the following sections.\n", "
\n", "\n", "This notebook is a collection of tools for detecting malicious behavior when commands are Base64-encoded. It allows you to specify a workspace and time frame and will score and rank Base64 commands within those bounds.\n", "\n", "It utilizes multiple data sources, primarily focusing on Azure Sentinel Syslog data augmented by telemetry from the MSTIC research branch of the AUOMS audit collection tool. Make sure to install this agent and connect your virtual machines to Azure Sentinel before using this notebook. For more on this, please see this [blog post](https://techcommunity.microsoft.com/t5/azure-sentinel/hunting-threats-on-linux-with-azure-sentinel/ba-p/1344431#).\n", "\n", "This notebook also uses data from [GTFOBins](https://gtfobins.github.io/), a list of Unix binaries that can be exploited by attackers. These bash commands are labeled with preliminary functions that can help an investigator better understand what a command does.\n", "\n", "Finally, we use threat intelligence from [AlienVault OTX](https://otx.alienvault.com/), [VirusTotal](https://www.virustotal.com/gui/home/upload), and [IBM XForce](https://www.ibm.com/security/services/ibm-x-force-incident-response-and-intelligence) to highlight Base64 commands of particular concern. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Contents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell:\n", "- Checks for the correct Python version\n", "- Checks versions and optionally installs required packages\n", "- Imports the required packages into the notebook\n", "- Sets a number of configuration options.\n", "\n", "This should complete without errors. If you encounter errors or warnings look at the following two notebooks:\n", "- [TroubleShootingNotebooks](https://github.com/Azure/Azure-Sentinel-Notebooks/blob/master/TroubleShootingNotebooks.ipynb)\n", "- [ConfiguringNotebookEnvironment](https://github.com/Azure/Azure-Sentinel-Notebooks/blob/master/ConfiguringNotebookEnvironment.ipynb)\n", "\n", "If you are running in the Azure Sentinel Notebooks environment (Azure Notebooks or Azure ML) you can run live versions of these notebooks:\n", "- [Run TroubleShootingNotebooks](./TroubleShootingNotebooks.ipynb)\n", "- [Run ConfiguringNotebookEnvironment](./ConfiguringNotebookEnvironment.ipynb)\n", "\n", "You may also need to do some additional configuration to successfully use functions such as Threat Intelligence service lookup and Geo IP lookup. 
\n", "There are more details about this in the `ConfiguringNotebookEnvironment` notebook and in these documents:\n", "- [msticpy configuration](https://msticpy.readthedocs.io/en/latest/getting_started/msticpyconfig.html). The `msticpyconfig.yaml` file is found in the same folder as this notebook: [Azure Sentinel Notebooks](https://github.com/Azure/Azure-Sentinel-Notebooks).\n", "- [Threat intelligence provider configuration](https://msticpy.readthedocs.io/en/latest/data_acquisition/TIProviders.html#configuration-file)\n", "\n", "If you are unfamiliar with Jupyter notebooks, or want a more in-depth setup reference, check out these resources:\n", "- [Getting Started with Azure Sentinel Notebooks Guide](https://github.com/Azure/Azure-Sentinel-Notebooks/blob/master/A%20Getting%20Started%20Guide%20For%20Azure%20Sentinel%20Notebooks.ipynb)\n", "- [Security Investigation with Azure Sentinel and Jupyter Notebooks - Part 1](https://techcommunity.microsoft.com/t5/azure-sentinel/security-investigation-with-azure-sentinel-and-jupyter-notebooks/ba-p/432921)\n", "- [Why Use Jupyter for Security Investigations](https://msticpy.readthedocs.io/en/latest/getting_started/JupyterAndSecurity.html)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617824890131 }, "scrolled": true }, "outputs": [], "source": [ "from pathlib import Path\n", "from IPython.display import display, HTML\n", "\n", "REQ_PYTHON_VER = \"3.6\"\n", "REQ_MSTICPY_VER = \"1.0.0\"\n", "\n", "display(HTML(\"
Starting Notebook setup...
\"))\n", "if Path(\"./utils/nb_check.py\").is_file():\n", " from utils.nb_check import check_versions\n", " check_versions(REQ_PYTHON_VER, REQ_MSTICPY_VER)\n", " \n", "from msticpy.nbtools import nbinit\n", "nbinit.init_notebook(namespace=globals());\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Connect to Log Analytics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the cells below to connect to your Log Analytics workspace. If you haven't already, please fill in the relevant information in `msticpyconfig.yaml`. This file is found in the [Azure Sentinel Notebooks folder](https://github.com/Azure/Azure-Sentinel-Notebooks) this notebook is in. There is more information on how to do this in the Notebook Setup section above. You may need to restart the kernel after doing so and rerun any cells you've already run to update to the new information.\n", "\n", "If you are unfamiliar with connecting to Log Analytics or want a more in-depth walkthrough, check out the [Getting Started with Azure Sentinel Notebook](https://github.com/Azure/Azure-Sentinel-Notebooks/blob/master/A%20Getting%20Started%20Guide%20For%20Azure%20Sentinel%20Notebooks.ipynb)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617824899543 }, "scrolled": true }, "outputs": [], "source": [ "# See if we have an Azure Sentinel Workspace defined in our config file.\n", "# If not, let the user specify Workspace and Tenant IDs\n", "\n", "ws_config = WorkspaceConfig()\n", "if not ws_config.config_loaded:\n", " ws_config.prompt_for_ws()\n", " \n", "qry_prov = QueryProvider(data_environment=\"AzureSentinel\")\n", "print(\"done\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617824912258 }, "scrolled": true }, "outputs": [], "source": [ "# Authenticate to Azure Sentinel workspace\n", "qry_prov.connect(ws_config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set Time Parameters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the cell below, then use the sliding bar that pops up to adjust the time frame in which you want the query to find Base64 commands." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Decide the time frame of your query" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617824918576 } }, "outputs": [], "source": [ "query_times = nbwidgets.QueryTime(units='day',\n", " max_before=20, max_after=1, before=3)\n", "query_times.display()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get Base64 Commands" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell queries all Base64 commands in your Log Analytics workspace during the given time frame and queries data from AUOMS_EXECVE logs, which are discussed in this [blog post](https://techcommunity.microsoft.com/t5/azure-sentinel/hunting-threats-on-linux-with-azure-sentinel/ba-p/1344431#), which was mentioned earlier. This is the data the rest of the commands will run on. The query is written in KQL. 
If you would like to add additional information to the query results, you may do it here. Note that following cells rely on this output so the original columns must still be projected.\n", "\n", "If you prefer to use a different log (not AUOMS_EXECVE), you may write your own query and will potentially have to edit certain values throughout the rest of the notebook to get the correct values and data frames." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617824925882 } }, "outputs": [], "source": [ "pd.options.display.html.use_mathjax = False\n", "\n", "query = \"Syslog\" + f\"\"\" | where TimeGenerated between (datetime({query_times.start}) .. datetime({query_times.end})) \"\"\" + r\"\"\" \n", "| parse SyslogMessage with \"type=\" EventType \" audit(\" * \"): \" EventData\n", "| project TimeGenerated, EventType, Computer, EventData \n", "| where EventType == \"AUOMS_EXECVE\"\n", "| parse EventData with * \"cmdline=\" Cmdline \" containerid=\" containerid\n", "| where Cmdline has \"base64\" and Cmdline has \"echo\"\n", "| where Cmdline matches regex \"^(.*)([A-Za-z0-9])(.*)$\"\n", "| parse kind=regex Cmdline with * \"echo\\\\s*(-n\\\\s)?\\\\\\\\?[\\\"']?\" cmdextract \"\\\\\\\\?[\\\"']?[\\\\s\\\"'$]\"\n", "| extend cmdextract= trim_end(@\"(\\\\?)(\\'?)(\\s?)(\\|)(\\s?)(.*)(base64)(.*)\",cmdextract)\n", "| extend DecodedCommand=base64_decode_tostring(cmdextract)\n", "| project TimeGenerated, Computer, Cmdline, DecodedCommand\n", "\"\"\"\n", "\n", "print(\"Collecting base64 queries...\")\n", "b64_df= qry_prov.exec_query(query)\n", "b64_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic Command Categorization\n", "\n", "We will be categorizing commands in two ways: this cell categorizes commands by looking for commonly used commands we are aware of. 
The next section will use an open source compilation.\n", "\n", "This cell categorizes each decoded Base64 command by functionality based on what bash commands are present in the decoded version. For example, commands with \"wget\" or \"curl\" in them are categorized as \"Network connections/Downloading.\" Other categories include \"File Manipulation\", \"Host Enumeration\", and \"File/Process deletion/killing.\"\n", "\n", "This categorization is by no means exhaustive. Feel free to add commands and categories to our basic one." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617824930857 }, "scrolled": true }, "outputs": [], "source": [ "# Network connections/downloading (wget, curl, urllib.urlopen) \n", "# File manipulation (chmod, chattr, touch, cp, mv, ln, sed, awk, echo)\n", "# Host enumeration (uname, grep … /proc/cpuinfo) \n", "# File/process deletion/killing (rm, pkill) \n", "# Archive/compression programs (tar, zip, gzip, bzip2, lzma, xz)\n", "\n", "def categorize(cmds):\n", " ret = []\n", " for cmd in cmds:\n", " categories = []\n", " if (\"wget\" in cmd) or (\"curl\" in cmd) or (\"urllib.urlopen\" in cmd):\n", " categories.append(\"network connections/downloading\")\n", " if (\"chmod\" in cmd) or (\"chattr\" in cmd) or (\"touch\" in cmd) or (\"cp\" in cmd) or (\"mv\" in cmd) or (\"ln\" in cmd) or (\"sed\" in cmd) or (\"awk\" in cmd) or (\"echo\" in cmd):\n", " categories.append(\"file manipulation\")\n", " if (\"uname\" in cmd) or (\"grep\" in cmd) or (\"/proc/cpuinfo\" in cmd):\n", " categories.append(\"host enumeration\")\n", " if (\"rm\" in cmd) or (\"pkill\" in cmd):\n", " categories.append(\"file/process deletion/killing\")\n", " if (\"tar\" in cmd) or (\"zip\" in cmd) or (\"gzip\" in cmd) or (\"bzip2\" in cmd) or (\"lzma\" in cmd) or (\"xz\" in cmd):\n", " categories.append(\"archive/compression programs\")\n", " ret.append(categories)\n", " return ret\n", "\n", "print(\"Categorizing commands...\")\n", 
"b64_df['Categories'] = categorize(b64_df['DecodedCommand'])\n", "b64_df['Categories'] = b64_df['Categories'].apply(str) # For drop_duplicates to work\n", "b64_df[[\"Computer\", \"DecodedCommand\", \"Categories\"]].drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### GTFO Bins Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This cell categorizes the commands based on [GTFOBins](https://gtfobins.github.io/). GTFOBins is a vetted collection of bash commands frequently exploited by attackers, along with a reference for how each command may be abused. We use it to find potentially exploited commands in the dataset and tag them with their corresponding functionalities.\n", "\n", "Run the cell below to read about what each category means according to the GTFOBins website." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617824937794 }, "scrolled": true }, "outputs": [], "source": [ "from requests import get\n", "from bs4 import BeautifulSoup\n", "\n", "\n", "# Get HTML content from GTFOBins Website\n", "fn_url = 'https://gtfobins.github.io/functions/'\n", "fn_response = get(fn_url)\n", "\n", "fn_soup = BeautifulSoup(fn_response.text, 'html.parser')\n", "function_names = fn_soup.find_all('dt', class_ = 'function-name')\n", "function_descriptions = fn_soup.find_all('dd')\n", "\n", "display(HTML(\"
GTFOBins Functions
\"))\n", "\n", "# Print function headings and descriptions\n", "for fn in range(len(function_descriptions)):\n", " display(HTML(f\"{function_names[fn].text}: {function_descriptions[fn].text}
\")) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell tags commands with GTFOBins bins and functions and displays the dataframe again for viewing. You may click on the links in the 'GTFO Bins' column for easy access to the GTFOBins website and more information." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617824990809 }, "scrolled": true }, "outputs": [], "source": [ "# Get GTFOBins bins from the website and create a list\n", "print(\"Getting GTFO Bins...\")\n", "url = 'https://gtfobins.github.io/'\n", "response = get(url)\n", "gtfo_soup = BeautifulSoup(response.text, 'html.parser')\n", "gtfo_cmds = gtfo_soup.find_all('a', class_ = 'bin-name')\n", "gtfobinsList = [cmd.text for cmd in gtfo_cmds]\n", "\n", "# Get the GTFO functions corresponding to each bin\n", "print(\"Getting GTFO Functions...\")\n", "binsFunctions = []\n", "for b in gtfobinsList:\n", " bin_url = 'https://gtfobins.github.io/gtfobins/' + b\n", " bin_response = get(bin_url)\n", " bin_soup = BeautifulSoup(bin_response.text, 'html.parser')\n", " bin_fnnames = bin_soup.find_all('h2', class_ = 'function-name')\n", " names = [n.text for n in bin_fnnames]\n", " binsFunctions.append(names)\n", "\n", "# Create a dictionary where the keys are bins and the values are its functions\n", "binsDict = dict(zip(gtfobinsList, binsFunctions))\n", "\n", "# Return lists of bins and functions corresponding to each command\n", "def getGtfoBins(cmds):\n", " retBins = []\n", " retFns = []\n", " for cmd in cmds:\n", " bins_matched = []\n", " fns_matched = set()\n", " for b in binsDict.keys():\n", " if b in cmd:\n", " bins_matched.append('' + b + '')\n", " fns_matched.update(binsDict[b])\n", " retBins.append(bins_matched)\n", " retFns.append(fns_matched) \n", " return retBins, retFns\n", "\n", "print(\"Tagging GTFOBins...\")\n", "GTFOResult = getGtfoBins(b64_df['DecodedCommand'])\n", "\n", "print(\"Formatting result...\")\n", "b64_df['GTFO Bins'] = 
GTFOResult[0]\n", "b64_df['GTFO Functions'] = GTFOResult[1]\n", "\n", "b64_df['GTFO Bins'] = b64_df['GTFO Bins'].apply(str) # For drop_duplicates\n", "b64_df['GTFO Functions'] = b64_df['GTFO Functions'].apply(str)\n", "\n", "HTML(b64_df[[ 'GTFO Bins', 'GTFO Functions', 'Computer', 'DecodedCommand', 'Categories']].drop_duplicates().to_html(escape=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate Scores and Rankings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following sections generate scores for each unique Base64 command based on criteria such as frequency of the command, severity of TI lookup results, and related alerts raised. Each score is added to the dataframe at the end, so you can view and rank each individually or by the aggregate score.\n", "\n", "Scores are heuristic and are meant to help investigators highlight commands that are more likely to be malicious. They have no absolute mathematical meaning and are only comparable to one another: the higher the score, the more likely a command is to be malicious." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Frequency Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cell below creates a frequency score for each unique Base64 command by calculating (1 / # times the command occurred in the workspace). It then adds an additional score calculated as (1 / # times the command occurred on its host computer). Both of these scores are divided by 2 for normalization purposes.\n", "\n", "This results in rarer commands getting higher scores." 
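As a quick sanity check of the formula above, here is the same calculation in plain Python, using hypothetical counts (a command seen 4 times across 2 hosts) rather than real workspace data:

```python
# Frequency score sketch with made-up counts (not real workspace data).
command_count = 4  # times the command occurred in the workspace
total_hosts = 2    # distinct hosts the command was seen on

# Each reciprocal is halved so the two components contribute equally.
freq_score = (1 / command_count) / 2 + (1 / total_hosts) / 2
print(freq_score)  # 0.375
```

A command seen only once on a single host would score 1.0, the maximum; commonplace commands trend toward 0.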
] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617824991098 }, "scrolled": true }, "outputs": [], "source": [ "# Calculate Frequency Scores for the given data frame column \n", "def calcFreqScore(df):\n", " return 1 / df\n", "\n", "def num_unique(col):\n", " return len(col.unique())\n", "\n", "# Aggregate b64_df column \n", "b64_analytics = b64_df[[\"TimeGenerated\", \"Computer\", \"DecodedCommand\", \"Categories\", \"GTFO Bins\", \"GTFO Functions\"]].groupby(\"DecodedCommand\").agg(\n", " CommandCount=pd.NamedAgg(column=\"DecodedCommand\", aggfunc=\"count\"),\n", " TotalHosts=pd.NamedAgg(column=\"Computer\", aggfunc=num_unique),\n", " Hostnames=pd.NamedAgg(column=\"Computer\", aggfunc=\"unique\"),\n", " Categories= pd.NamedAgg(column=\"Categories\", aggfunc=\"first\"),\n", " GTFOBins=pd.NamedAgg(column=\"GTFO Bins\", aggfunc=\"first\"),\n", " GTFOFunctions=pd.NamedAgg(column=\"GTFO Functions\", aggfunc=\"first\"),\n", " FirstSeen=pd.NamedAgg(column=\"TimeGenerated\", aggfunc=\"min\"),\n", " LastSeen=pd.NamedAgg(column=\"TimeGenerated\", aggfunc=\"max\"),\n", ").reset_index()\n", "\n", "b64_analytics[\"FreqScore\"] = calcFreqScore(b64_analytics[\"CommandCount\"]) / 2\n", "b64_analytics[\"FreqScore\"] = b64_analytics[\"FreqScore\"] + ((calcFreqScore(b64_analytics[\"TotalHosts\"])) / 2)\n", "b64_analytics[\"TotalScore\"] = b64_analytics[\"FreqScore\"]\n", "\n", "# Display\n", "display_cols = [\n", " 'TotalScore','FreqScore', 'DecodedCommand', 'CommandCount', 'TotalHosts', \n", " 'Hostnames', 'Categories', 'GTFOBins', 'GTFOFunctions', 'FirstSeen', 'LastSeen'\n", "]\n", "summary_df = (\n", " b64_analytics[display_cols].sort_values(\"TotalScore\", ascending=False).reset_index().drop(['index'], axis=1)\n", ")\n", "HTML(summary_df.to_html(escape=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract IoCs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cell below extracts any IoCs from 
the decoded Base64 commands and adds them to the dataframe. It uses the MSTICpy IoC extraction features, which extract the following patterns:\n", "- ipv4\n", "- ipv6\n", "- dns\n", "- url\n", "- windows_path\n", "- linux_path\n", "- md5_hash\n", "- sha1_hash\n", "- sha256_hash\n", "\n", "If you want to look for an IoC pattern that is not included here, feel free to modify the MSTICpy class. See [this link](https://msticpy.readthedocs.io/en/latest/data_analysis/IoCExtract.html) for more information." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617825049100 }, "scrolled": true }, "outputs": [], "source": [ "ioc_extractor = IoCExtract()\n", "\n", "ioc_df = ioc_extractor.extract(data=b64_analytics, columns=['DecodedCommand'])\n", "\n", "if len(ioc_df):\n", " display(HTML(\"
IoC patterns found in the decoded commands.
\"))\n", " display(ioc_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Threat Intelligence Lookup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load and run TILookup on the IoCs found. Make sure you configure `msticpyconfig.yaml` with the appropriate TI sources. Check out the document below if you need help with this process.\n", "- [TI Configuration File](https://msticpy.readthedocs.io/en/latest/data_acquisition/TIProviders.html#configuration-file)\n", "\n", "We highly encourage you to add TI sources, but if you don't have any (e.g., API keys from AlienVault OTX, IBM XForce, or VirusTotal) and don't want to make accounts, you can skip this section and go directly to [Related Alerts Scoring](#Related-Alerts) below. Your rankings will then be based exclusively on frequency scores and related alerts scoring. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Confirm TI Sources\n", "\n", "The code below prints out your current TI Lookup configurations. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617825052658 }, "scrolled": true }, "outputs": [], "source": [ "ti_lookup = TILookup()\n", "\n", "ti_lookup.reload_providers()\n", "if not ti_lookup.provider_status:\n", " md_warn(\"You have no TI providers configured, please see the documentation link above.\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Choose which providers you would like to use during the TI lookup. You will need these to be configured in `msticpyconfig.yaml`. Additional directions are given above in the Notebook Setup section." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "providers = [t.split(' - ', 1)[0] for t in ti_lookup.provider_status]\n", "providers_ss = nbwidgets.SelectSubset(\n", " providers,\n", " default_selected=['OTX', 'VirusTotal', 'XForce']\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Choose IoCs to look up" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can choose specific IoCs to look up, or look them all up for scoring. Scores will be based exclusively on the Severity column. The following cells will also print a TI dataframe with added information." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "items = ioc_df[\"Observable\"].values\n", "ioc_ss = nbwidgets.SelectSubset(\n", " items,\n", " default_selected=list(items)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run this cell to look up the selected IoCs above." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "iocs_to_check = (ioc_df[ioc_df[\"Observable\"].isin(ioc_ss.selected_items)]\n", " [[\"IoCType\", \"Observable\"]].drop_duplicates())\n", "\n", "print(\"Looking up IoCs...\")\n", "ti_results = ti_lookup.lookup_iocs(data=iocs_to_check, obs_col=\"Observable\", providers=providers_ss.selected_items)\n", "\n", "ti_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Calculate TI Severity Scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell adds the most severe of the severity ratings returned by the providers for each IoC to the command's score. The more severe the IoC found, the higher the score the command will receive. Each unique IoC found will add to the score of that command." 
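The "most severe rating wins" rule can be illustrated with a minimal sketch; the provider severities below are hypothetical values, not real lookup results:

```python
# Severity weights mirroring the notebook's sev_scores table.
sev_scores = {"information": 0, "low": 1, "medium": 1.5, "high": 3, "unknown": 1}

# Hypothetical ratings returned by three providers for a single IoC.
provider_severities = ["low", "high", "medium"]

# The most severe rating determines this IoC's contribution to the command score.
ioc_contribution = max(sev_scores[s] for s in provider_severities)
print(ioc_contribution)  # 3
```

Because each unique IoC contributes separately, a command containing several high-severity IoCs accumulates a correspondingly higher score.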
] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617825071437 }, "scrolled": true }, "outputs": [], "source": [ "# Define severity scores\n", "sev_scores = {\"information\": 0, \"low\": 1, \"medium\": 1.5, \"high\": 3, \"unknown\": 1}\n", "\n", "# Calculate severity scores and add IoCs to the data frame\n", "def calc_severity(cmds):\n", " ret_iocs = []\n", " ret_scores = []\n", " for c in cmds:\n", " c_iocs = set()\n", " c_sev_score = []\n", " for ioc in ioc_df['Observable']:\n", " if ioc in c:\n", " c_iocs.add(ioc)\n", " for uniq_ioc in c_iocs:\n", " sev_df = ti_results[ti_results['Ioc'].values == uniq_ioc]\n", " \n", " # Add severities for selected providers\n", " az_sev = \"\"\n", " otx_sev = \"\"\n", " vt_sev = \"\"\n", " xf_sev = \"\"\n", " if 'AzSTI' in providers_ss.selected_items:\n", " az_sev = str(sev_df[sev_df['Provider'] == 'AzSTI']['Severity'])\n", " if 'OTX' in providers_ss.selected_items:\n", " otx_sev = str(sev_df[sev_df['Provider'] == 'OTX']['Severity'])\n", " if 'VirusTotal' in providers_ss.selected_items:\n", " vt_sev = str(sev_df[sev_df['Provider'] == 'VirusTotal']['Severity'])\n", " if 'XForce' in providers_ss.selected_items:\n", " xf_sev = str(sev_df[sev_df['Provider'] == 'XForce']['Severity'])\n", " \n", " # Add scores \n", " if 'high' in otx_sev or 'high' in vt_sev or 'high' in xf_sev:\n", " c_sev_score.append(sev_scores['high'])\n", " elif 'medium' in otx_sev or 'medium' in vt_sev or 'medium' in xf_sev:\n", " c_sev_score.append(sev_scores['medium'])\n", " elif 'low' in otx_sev or 'low' in vt_sev or 'low' in xf_sev:\n", " c_sev_score.append(sev_scores['low'])\n", " elif 'info' in otx_sev or 'info' in vt_sev or 'info' in xf_sev:\n", " c_sev_score.append(sev_scores['information'])\n", " else:\n", " c_sev_score.append(sev_scores['unknown'])\n", " ret_iocs.append(c_iocs)\n", " ret_scores.append(sum(c_sev_score))\n", " \n", " return ret_iocs, ret_scores\n", " \n", "ti_info = 
calc_severity(b64_analytics['DecodedCommand'])\n", "b64_analytics['IoCsFound'] = ti_info[0]\n", "b64_analytics['SevScore'] = ti_info[1]\n", "b64_analytics['TotalScore'] += b64_analytics['SevScore']\n", "\n", "display_cols = [\n", " 'TotalScore','SevScore','FreqScore', 'DecodedCommand', 'CommandCount', 'TotalHosts', \n", " 'Hostnames', 'Categories', 'GTFOBins', 'GTFOFunctions', 'FirstSeen', 'LastSeen'\n", "]\n", "summary_df = (\n", " b64_analytics[display_cols].sort_values(\"TotalScore\", ascending=False).reset_index().drop(['index'], axis=1)\n", ")\n", "HTML(summary_df.to_html(escape=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Related Alerts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section searches for any related Sentinel alerts on the hosts we've found Base64 commands on in the given time frame." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617825078094 }, "scrolled": true }, "outputs": [], "source": [ "ra_query_times = nbwidgets.QueryTime(\n", " units=\"day\",\n", " origin_time=query_times.origin_time,\n", " max_before=28,\n", " max_after=5,\n", " before=5,\n", " auto_display=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Points are added to the score depending on the severity of the alerts that occurred at this time. For example, high severity alerts around the Base64 commands will result in a higher score for those commands. Each unique alert's score is only added once. Alert information as well as timeline visualizations will also be printed out to provide context and enable further investigation. Be sure to scroll for information on all the hosts." 
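The dedup-then-sum rule described above can be sketched in isolation; the alert names and severities below are invented for illustration:

```python
# Alert severity weights matching the notebook's alert_scores table.
alert_scores = {"Informational": 0, "Low": 1, "Medium": 2, "High": 3}

# Hypothetical related alerts for one host; note the repeated alert name.
related = [
    ("Suspicious command executed", "High"),
    ("Suspicious command executed", "High"),
    ("Anomalous login", "Medium"),
]

# Each unique alert name contributes once, regardless of how often it fired.
unique_alerts = dict(related)
host_score = sum(alert_scores[sev] for sev in unique_alerts.values())
print(host_score)  # 5
```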
] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617825099112 }, "scrolled": true }, "outputs": [], "source": [ "# Define alert scores\n", "alert_scores = {\"Informational\": 0, \"Low\": 1, \"Medium\": 2, \"High\": 3}\n", "\n", "# Create list of hosts to search for related alerts on\n", "host_df = b64_df.groupby('Computer')\n", "hosts = [h for h in host_df.groups] \n", "\n", "def print_related_alerts(alertDict, entityType, entityName, df):\n", " if len(alertDict) > 0:\n", " display(\n", " Markdown(\n", " f\"### Found {len(alertDict)} different alert types related to this {entityType} (`{entityName}`)\"\n", " )\n", " )\n", " for (k, v) in alertDict.items():\n", " print(f\"- {k}, # Alerts: {v}\")\n", " display(df)\n", " else:\n", " print(f\"No alerts for {entityType} entity `{entityName}`\")\n", "\n", "host_alert_scores = []\n", "for host in hosts: \n", " alerts_found = []\n", " \n", " related_alerts = qry_prov.SecurityAlert.list_related_alerts(\n", " ra_query_times, host_name=host\n", " )\n", " \n", " if isinstance(related_alerts, pd.DataFrame) and not related_alerts.empty:\n", " host_alert_items = (\n", " related_alerts[[\"AlertName\", \"TimeGenerated\"]]\n", " .groupby(\"AlertName\")\n", " .TimeGenerated.agg(\"count\")\n", " .to_dict()\n", " )\n", " # Print related alerts in shorthand format\n", " print_related_alerts(host_alert_items, \"host\", host, related_alerts)\n", " if len(host_alert_items) > 1:\n", " nbdisplay.display_timeline(\n", " data=related_alerts, title=\"Alerts\", source_columns=[\"AlertName\"], height=200\n", " )\n", "\n", " # Add to Alert Scoring based on the severity of the found alerts\n", " # Only adds each unique alert, not repeats\n", " uniq_alerts_found = set(related_alerts['AlertName'].values)\n", " for a in uniq_alerts_found:\n", " sev = related_alerts[related_alerts['AlertName'] == a]['Severity'].values[0]\n", " alerts_found.append(alert_scores[sev])\n", " 
host_alert_scores.append(sum(alerts_found))\n", " \n", " else:\n", " display(Markdown(\"No related alerts found.\"))\n", " host_alert_scores.append(0)\n", "\n", "# Add the appropriate score for each command based on its hosts\n", "host_score_map = dict(zip(hosts, host_alert_scores))\n", "final_alert_scores = []\n", "for i in b64_analytics['Hostnames']:\n", " alert_val = 0\n", " for h in i:\n", " alert_val += host_score_map.get(h, 0)\n", " final_alert_scores.append(alert_val)\n", "\n", "b64_analytics['AlertScore'] = final_alert_scores\n", "b64_analytics['TotalScore'] += b64_analytics['AlertScore']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "View the scores again by running the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617825102642 }, "scrolled": true }, "outputs": [], "source": [ "# If no TI Scores, add 0 as TI Score for each row\n", "has_ti_scores = True\n", "if 'SevScore' not in b64_analytics.keys():\n", " b64_analytics['SevScore'] = 0\n", " has_ti_scores = False\n", " \n", "display_cols = [\n", " 'TotalScore','AlertScore', 'SevScore', 'FreqScore', 'DecodedCommand', 'CommandCount', 'TotalHosts', \n", " 'Hostnames', 'Categories', 'GTFOBins', 'GTFOFunctions', 'FirstSeen', 'LastSeen'\n", "]\n", "\n", "summary_df = (\n", " b64_analytics[display_cols].sort_values(\"TotalScore\", ascending=False).reset_index().drop(['index'], axis=1)\n", ")\n", "HTML(summary_df.to_html(escape=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Final Scores and Rankings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the cell below to choose the columns you would like to view. You must select TotalScore for rankings to work." 
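For reference, TotalScore is simply the sum of the component scores. A minimal pandas sketch with invented values shows the ranking logic:

```python
import pandas as pd

# Invented component scores for two commands (illustration only).
scores = pd.DataFrame({
    "FreqScore": [0.375, 1.0],
    "SevScore": [3.0, 0.0],
    "AlertScore": [5, 0],
})

# The aggregate score is the sum of the per-criterion scores.
scores["TotalScore"] = scores[["FreqScore", "SevScore", "AlertScore"]].sum(axis=1)
ranked = scores.sort_values("TotalScore", ascending=False)
print(ranked["TotalScore"].tolist())  # [8.375, 1.0]
```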
] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617825107478 }, "scrolled": true }, "outputs": [], "source": [ "column_names = b64_analytics.columns.values.tolist() \n", "columns_included = nbwidgets.SelectSubset(\n", " column_names,\n", " default_selected=['TotalScore', 'FreqScore', 'AlertScore', 'SevScore', 'DecodedCommand', 'Categories']\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run this cell to display the columns you chose above. Score columns will be colored a certain amount of red to help you visualize what percent of the total score is made up of each type of score and how these compare with other command scores.\n", "\n", "You can also choose to only view data with numerical columns over a given cutoff by selecting a column and choosing a cutoff point. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import ipywidgets as widgets\n", "from ipywidgets import interact, interact_manual\n", "\n", "# Get score colums and numerical columns to display\n", "score_cols = ['TotalScore', 'FreqScore', 'SevScore', 'AlertScore']\n", "numerical_cols = ['TotalScore', 'FreqScore', 'SevScore', 'CommandCount', 'TotalHosts']\n", "display_cols = [col for col in columns_included.selected_items if col in numerical_cols]\n", "subset_cols = [col for col in columns_included.selected_items if col in score_cols]\n", "\n", "# Get all display columns in order\n", "ordered_cols = ['TotalScore', 'FreqScore', 'SevScore', 'AlertScore', 'DecodedCommand', 'Categories', 'CommandCount', 'FirstSeen', 'LastSeen', 'TotalHosts']\n", "final_cols = ordered_cols.copy()\n", "for col in ordered_cols:\n", " if col not in columns_included.selected_items:\n", " final_cols.remove(col)\n", " \n", "@interact(Column=(display_cols))\n", "def show_articles_more_than(Column= 'TotalScore', \n", " Cutoff=(0, max(b64_analytics['TotalScore']), 0.1)):\n", " return 
b64_analytics[final_cols].sort_values('TotalScore', ascending=False).loc[b64_analytics[Column] > Cutoff].style.bar(subset=subset_cols, color=\"#d65f5f\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use the following bar chart to view the compositions of the scores in a visual manner. The horizontal axis represents the index of the command in the data frame, so you can reference the data frame above for context around any interesting data you see." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "fig, ax = plt.subplots(figsize=(12,10))\n", "plt.xlabel('Index')\n", "plt.ylabel('TotalScore')\n", "display(b64_analytics[['FreqScore','SevScore', 'AlertScore']].plot(ax=ax, kind='bar', stacked=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Behavior Timeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This timeline visualizes when commands occurred to identify potential windows of activity." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "nbdisplay.display_timeline(data=b64_df, source_columns=['DecodedCommand', 'Categories'])" ] } ], "metadata": { "kernel_info": { "name": "python38-azureml" }, "kernelspec": { "display_name": "Python 3.8 - AzureML", "language": "python", "name": "python38-azureml" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.1" }, "nteract": { "version": "nteract-front-end@1.0.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }