{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Sessionize, Model and Visualise Office Exchange Data\n", "\n", " Notebook Version: 1.0<br>
\n", " Python Version: Python 3.6 (including Python 3.6 - AzureML)<br>
\n", " Required Packages: msticpy, pandas, kqlmagic<br>
\n", "\n", "Data Sources Required:\n", "* Log Analytics - OfficeActivity\n", "\n", "Configuration Required:\n", "\n", "This notebook presumes you have your Microsoft Sentinel Workspace settings configured in a config file. If you do not have this in place, please [read the docs](https://msticpy.readthedocs.io/en/latest/getting_started/msticpyconfig.html) and [use this notebook](https://github.com/Azure/Azure-Sentinel-Notebooks/blob/master/ConfiguringNotebookEnvironment.ipynb) to set up and test your configuration.\n", "\n", "\n", "\n", "## Description:\n", "Various types of security logs can be broken up into sessions/sequences, where each session can be thought of as an ordered sequence of events. It can be useful to model these sessions in order to understand what the usual activity looks like, so that we can highlight anomalous sequences of events.\n", "\n", "In this hunting notebook, we treat the Office Exchange PowerShell cmdlets (\"Set-Mailbox\", \"Set-MailboxFolderPermission\" etc.) as \"events\" and then group the events into \"sessions\" on a per-user basis. We demonstrate the sessionizing, modelling and visualisation on the Office Exchange Admin logs; however, the methods used in this notebook can be applied to other log types as well.\n", "\n", "A new subpackage called anomalous_sequence has recently been added to [msticpy](https://github.com/microsoft/msticpy/tree/master/msticpy/analysis/anomalous_sequence). This library allows the user to sessionize, model and visualise their data via some high level functions. For more details on how to use this subpackage, please [read the docs](https://msticpy.readthedocs.io/en/latest/data_analysis/AnomalousSequence.html) and/or refer to this more [documentation-heavy notebook](https://github.com/microsoft/msticpy/blob/master/docs/notebooks/AnomalousSequence.ipynb). The documentation for this subpackage also includes some suggested guidance on how this library can be applied to other log types.\n", "\n", "\n", "High level sections of the notebook:\n", "* Sessionize your Office Exchange logs data using built-in KQL operators\n", "* Use the anomalous_sequence subpackage of msticpy to model the sessions\n", "* Use the anomalous_sequence subpackage of msticpy to visualise the scored sessions\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Contents\n", "* [Notebook Initialization](#init_notebook)\n", "  * [Imports](#imports)\n", "  * [Authenticate Log Analytics](#la_auth)\n", "* [Create Sessions from your Office Exchange Data](#create_sessions)\n", "  * [What is a Session?](#create_sessions)\n", "  * [Sessionize using Kusto's Native Functionality](#use_la)\n", "  * [Convert sessions into an allowed format for the modelling](#clean_sessions)\n", "* [Model the Sessions](#explain_model)\n", "  * [High Level function for modelling](#model_function)\n", "* [Visualise the Modelled Sessions](#visualize_function)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook initialization <a id='init_notebook'></a>\n", "\n", "The next cell:\n", "\n", "* Checks for the correct Python version\n", "* Checks versions and optionally installs required packages\n", "* Imports the required packages into the notebook\n", "* Sets a number of configuration options\n", "\n", "This should complete without errors.
If you encounter errors or warnings, please look at the following two notebooks:\n", "\n", "* [TroubleShootingNotebooks](https://github.com/Azure/Azure-Sentinel-Notebooks/blob/master/TroubleShootingNotebooks.ipynb)\n", "* [ConfiguringNotebookEnvironment](https://github.com/Azure/Azure-Sentinel-Notebooks/blob/master/ConfiguringNotebookEnvironment.ipynb)\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617756662780 }, "tags": [] }, "outputs": [], "source": [ "from pathlib import Path\n", "from IPython.display import display, HTML\n", "\n", "REQ_PYTHON_VER = \"3.6\"\n", "REQ_MSTICPY_VER = \"1.0.0\"\n", "\n", "\n", "display(HTML(\"

<h3>Starting Notebook setup...</h3>

\"))\n", "\n", "# If the installation fails, try to install manually using\n", "# %pip install --upgrade msticpy\n", "\n", "extra_imports = [\n", "    \"msticpy.analysis.anomalous_sequence.utils.data_structures, Cmd\",\n", "    \"msticpy.analysis.anomalous_sequence, anomalous\",\n", "    \"msticpy.analysis.anomalous_sequence.model, Model\",\n", "    \"typing, List\",\n", "    \"typing, Dict\",\n", "]\n", "\n", "from msticpy.nbtools import nbinit\n", "nbinit.init_notebook(\n", "    namespace=globals(),\n", "    extra_imports=extra_imports,\n", ");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using the Log Analytics Query Provider <a id='la_auth'></a>\n", "\n", "msticpy has a QueryProvider class which you can use to connect to your Log Analytics data environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617756706602 }, "tags": [] }, "outputs": [], "source": [ "# Collect Microsoft Sentinel Workspace details from our config file and use them to connect\n", "try:\n", "    # Update to WorkspaceConfig(workspace=\"WORKSPACE_NAME\") to use a Workspace other than your default one.\n", "    # Run WorkspaceConfig().list_workspaces() to see a list of configured workspaces.\n", "    ws_config = WorkspaceConfig()\n", "    md(\"Workspace details collected from config file\")\n", "    qry_prov = QueryProvider(data_environment='AzureSentinel')\n", "    qry_prov.connect(connection_str=ws_config.code_connect_str)\n", "except RuntimeError:\n", "    md(\"\"\"You do not have any Workspaces configured in your config files.\n", "    Please run the https://github.com/Azure/Azure-Sentinel-Notebooks/blob/master/ConfiguringNotebookEnvironment.ipynb notebook\n", "    to set up these files before proceeding\"\"\", 'bold')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Create Sessions from your Office Exchange logs <a id='create_sessions'></a>\n", "\n", "\n", "## What is a Session?\n", "\n", "In this context, a session is an ordered sequence of events/commands. The anomalous_sequence subpackage can handle 3 different formats for each of the sessions:\n", "\n", "1. sequence of just events/commands.\\\n", "\\[\"Set-User\", \"Set-Mailbox\"\\]

\n", "2. sequence of events/commands with accompanying parameters.\\\n", "\\[Cmd(name=\"Set-User\", params=\\{\"Identity\", \"Force\"\\}), Cmd(name=\"Set-Mailbox\", params=\\{\"Identity\", \"AuditEnabled\"\\})\\]

\n", "3. sequence of events/commands with accompanying parameters and their corresponding values.\\\n", "\\[Cmd(name=\"Set-User\", params=\\{\"Identity\": \"blahblah\", \"Force\": \"true\"\\}), Cmd(name=\"Set-Mailbox\", params=\\{\"Identity\": \"blahblah\", \"AuditEnabled\": \"false\"\\})\\]\n", "\n", "The Cmd datatype can be accessed from msticpy.analysis.anomalous_sequence.utils.data_structures\n", "\n", "\n", "## How will we sessionize the data?\n", "\n", "We discuss two possible approaches:\n", "\n", "1. Use the sessionize module from msticpy's anomalous_sequence subpackage\n", "2. Sessionize directly inside your KQL query to retrieve data from Log Analytics\n", "\n", "In this notebook, we use the second approach (KQL) to sessionize the Office Exchange logs. In order to do the sessionizing using KQL, we make use of the [row_window_session](https://docs.microsoft.com/azure/data-explorer/kusto/query/row-window-session-function) function.\n", "\n", "\n", "However, if you are interested in using msticpy's sessionizing capabilities, then please [read the docs](https://msticpy.readthedocs.io/en/latest/data_analysis/AnomalousSequence.html) and/or refer to this more [documentation-heavy notebook](https://github.com/microsoft/msticpy/blob/master/docs/notebooks/AnomalousSequence.ipynb); a rough sketch is also shown below.\n" ] },
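{ "cell_type": "markdown", "metadata": {}, "source": [ "For reference, approach 1 would look roughly like the sketch below. It is not run in this notebook: it assumes a hypothetical dataframe raw_events containing one event per row, and uses the sessionize_data signature as documented for msticpy 1.0 (check the docs linked above if it has changed):\n", "\n", "```python\n", "from msticpy.analysis.anomalous_sequence import sessionize\n", "\n", "# group one-event-per-row data into sessions on a per UserId - ClientIP basis,\n", "# with sessions capped at 20 mins and consecutive events at most 2 mins apart\n", "sessions_df = sessionize.sessionize_data(\n", "    data=raw_events,  # hypothetical dataframe of one event per row\n", "    user_identifier_cols=[\"UserId\", \"ClientIP\"],\n", "    time_col=\"TimeGenerated\",\n", "    max_session_time_mins=20,\n", "    max_event_separation_mins=2,\n", "    event_col=\"Operation\",\n", ")\n", "```" ] },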
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Use Kusto to Sessionize your Logs Data <a id='use_la'></a>\n", "\n", "The cell below contains a Kusto query which queries the OfficeActivity table. In this example, we wish for the sessions to be on a per UserId - ClientIP basis. In addition, we require that each session be no longer than 20 minutes in total, with consecutive commands no more than 2 minutes apart. (These requirements are somewhat arbitrary and can be adjusted for different data sets/use cases etc.)\n", "\n", "\n", "Here are some high level steps to the query:\n", "\n", "- Add a time filter which goes back far enough so you have enough data to train the model.\n", "- Filter to the desired type of logs.\n", "- Exclude some known automated users (optional).\n", "- Sort the rows by UserId, ClientIP, TimeGenerated in ascending order.\n", "- Use the native KQL function row_window_session to create an additional \"begin\" column to aid creating the sessions.\n", "- Summarize the commands (and optionally parameters) by UserId, ClientIP, begin.\n", "- Optionally exclude sessions which have only 1 command.\n", "\n", "Note that in KQL, comments are made using //" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617756713245 } }, "outputs": [], "source": [ "# write kql query\n", "query = \"\"\"\n", "let time_back = 60d;\n", "OfficeActivity\n", "| where TimeGenerated >= ago(time_back)\n", "//\n", "// filter to the event type of interest\n", "| where RecordType == 'ExchangeAdmin'\n", "//\n", "// exclude some known automated users (optional)\n", "| where UserId !startswith \"NT AUTHORITY\"\n", "| where UserId !contains \"prod.outlook.com\"\n", "//\n", "// create new dynamic variable with the command as the key, and the parameters as the values\n", "| extend params = todynamic(strcat('{\"', Operation, '\" : ', tostring(Parameters), '}'))\n", "| project TimeGenerated, UserId, ClientIP, Operation, params\n", "//\n", "// sort by the user related columns and the timestamp column in ascending order\n", "| sort by UserId asc, ClientIP asc, TimeGenerated asc\n", "//\n", "// calculate the start time of each session into the \"begin\" variable,\n", "// with each session at most 20 mins in length and each event at most 2 mins apart.\n", "// A new session is created each time one of the user related columns changes.\n", "| extend begin = row_window_session(TimeGenerated, 20m, 2m, UserId != prev(UserId) or ClientIP != prev(ClientIP))\n", "//\n", "// summarize the operations and the params by the user related variables and the \"begin\" variable\n", "| summarize cmds=makelist(Operation), end=max(TimeGenerated), nCmds=count(), nDistinctCmds=dcount(Operation),\n", "params=makelist(params) by UserId, ClientIP, begin\n", "//\n", "// optionally specify an order to the final columns\n", "| project UserId, ClientIP, nCmds, nDistinctCmds, begin, end, duration=end-begin, cmds, params\n", "//\n", "// optionally filter out sessions which contain only one event\n", "//| where nCmds > 1\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617756717744 } }, "outputs": [], "source": [ "# execute the query\n", "sessions_df = qry_prov.exec_query(query=query)\n", "# once this cell has run, you can comment it out so that re-running the notebook does not re-query the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617756719969 }, "tags": [] }, "outputs": [], "source": [ "# if exec_query returned the results via kqlmagic rather than as a dataframe, convert them\n", "try:\n", "    print(sessions_df.shape)\n", "except AttributeError:\n", "    sessions_df = _kql_raw_result_.to_dataframe()\n", "    print(sessions_df.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617822935527 } }, "outputs": [], "source": [ "# preview the sessions (errors=\"ignore\" because the reformatted session columns are only added further below)\n", "sessions_df.drop(columns=[\"params\", \"param_value_session\", \"param_session\"], errors=\"ignore\").head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convert Sessions to Correct Format for the Model <a id='clean_sessions'></a>\n", "\n", "Recall the allowed session types [here](#create_sessions).\n", "\n", "\n", "So let's see what needs to be done to the sessions_df.\n", "\n", "The \"cmds\" column is already in a suitable format of type (1), because it is a list of strings. However, if you are interested in including the parameters (and possibly the values) in the modelling stage, then we need to make use of the Cmd datatype.\n", "\n", "In particular, we need to define a custom cleaning function which will transform the \"params\" column slightly to become a list of the Cmd datatype. This cleaning function is specific to the format of the exchange demo data. Therefore, you may need to tweak it slightly before you can use it on other data sets.
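\n", "\n", "To make the target format concrete, here is a toy (made-up) example of one raw \"params\" entry and the output the cleaning function below should produce for it:\n", "\n", "```python\n", "raw_session = [{'Set-Mailbox': [{'Name': 'Identity', 'Value': 'blahblah@blah.com'}]}]\n", "# include_vals=False -> [Cmd(name='Set-Mailbox', params={'Identity'})]\n", "# include_vals=True  -> [Cmd(name='Set-Mailbox', params={'Identity': 'blahblah@blah.com'})]\n", "```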
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617756730838 } }, "outputs": [], "source": [ "# define a helper function for converting the sessions with params (and values) into a suitable format\n", "\n", "def process_exchange_session(session_with_params: List[Dict[str, List[Dict[str, str]]]], include_vals: bool) -> List[Cmd]:\n", "    \"\"\"\n", "    Convert an exchange session with params to an allowed format.\n", "\n", "    param session_with_params: example format:\n", "        [\n", "            {'Set-Mailbox': [{'Name': 'MessageCopyForSentAsEnabled', 'Value': 'True'},\n", "            {'Name': 'Identity', 'Value': 'blahblah@blah.com'}]}\n", "        ]\n", "    param include_vals: if True, then it will be transformed to a format which includes the values,\n", "        else the output will just contain the parameters\n", "\n", "    return: list of the Cmd data type which includes either just the parameters, or also the corresponding values\n", "    \"\"\"\n", "    new_ses = []\n", "    for cmd in session_with_params:\n", "        c = list(cmd.keys())[0]\n", "        par = list(cmd.values())[0]\n", "        # use a dict for param: value pairs, or a set for just the param names\n", "        new_pars = dict() if include_vals else set()\n", "        for p in par:\n", "            if include_vals:\n", "                new_pars[p['Name']] = p['Value']\n", "            else:\n", "                new_pars.add(p['Name'])\n", "        new_ses.append(Cmd(name=c, params=new_pars))\n", "    return new_ses" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617756733856 } }, "outputs": [], "source": [ "# let's create suitable sessions for params, and suitable sessions for params + values by applying the custom function\n", "sessions = sessions_df.cmds.values.tolist()\n", "param_sessions = []\n", "param_value_sessions = []\n", "\n", "for ses in sessions_df.params.values.tolist():\n", "    new_ses_set = process_exchange_session(session_with_params=ses, include_vals=False)\n", "    new_ses_dict = process_exchange_session(session_with_params=ses, include_vals=True)\n", "    param_sessions.append(new_ses_set)\n", "    param_value_sessions.append(new_ses_dict)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617756736237 }, "tags": [] }, "outputs": [], "source": [ "# let's see the differences between the three types of sessions\n", "ind = 0\n", "\n", "print(sessions[ind][:3])\n", "\n", "print(param_sessions[ind][:3])\n", "\n", "print(param_value_sessions[ind][:3])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617756738773 } }, "outputs": [], "source": [ "# let's add these reformatted sessions as columns to a dataframe\n", "data = sessions_df  # note: this is a reference to sessions_df, not a copy\n", "data['session'] = sessions\n", "data['param_session'] = param_sessions\n", "data['param_value_session'] = param_value_sessions\n", "\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Model the sessions <a id='explain_model'></a>\n", "\n", "We will give a brief description of how the modelling works under the hood for each of the three session types.\n", "\n", "* Commands only\n", "    - We treat the sessions as an ordered sequence of commands.\n", "    - We apply the Markov assumption, where we assume each command depends only on the command immediately before it.\n", "    - This means the likelihood of each session can be computed by multiplying a sequence of transition probabilities together.\n", "    - We use a sliding window (e.g. of length 3) throughout each session and then use the likelihood of the rarest window as the score for the session.

\n", "* Commands with Parameters\n", "    - All of the above (\"commands only\" case) except for one difference.\n", "    - This time, we include the parameters in the modelling.\n", "    - We make the assumption that the presence of each parameter is independent conditional on the command.\n", "    - We therefore model the presence of the parameters as independent Bernoulli random variables (conditional on the command).\n", "    - So to compute the likelihood of a session, each transition probability (of the commands) will be accompanied by a product of probabilities (for the parameters).\n", "    - A subtlety to note is that we take the geometric mean of the product of parameter probabilities. This is so we don't penalise commands which happen to have more parameters set than average.\n", "    - We use the same sliding window approach used with the \"commands only\" case.

\n", "* Commands with Parameters and their Values\n", "    - All of the above (\"commands with parameters\" case) except for one difference.\n", "    - This time, we include the values in the modelling.\n", "    - Some rough heuristics are used to determine which parameters have values which are categorical (e.g. \"true\" and \"false\" or \"high\", \"medium\" and \"low\") vs values which are arbitrary strings (such as email addresses). There is the option to override the \"modellable_params\" directly in the Model class.\n", "    - So to compute the likelihood of a session, each transition probability (of the commands) will be accompanied by a product of probabilities (for the parameters and categorical values).\n", "    - We use the same sliding window approach used with the \"commands only\" case (a rough sketch of the scoring follows below).\n", "\n", "\n", "#### Important note: \n", "If you set the window length to be k, then only sessions which have at least k-1 commands will have a valid (not np.nan) score. The reason for the -1 is that we append an end token to each session by default, so a session of length k-1 gets treated as length k during the scoring.\n", "\n" ] },
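{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the scoring above concrete, here is a rough sketch in our own notation (not msticpy's internal implementation; see the docs for the exact computation). Under the Markov assumption, the likelihood of a window of commands $c_1, \\ldots, c_k$ is\n", "\n", "$$P(c_1, \\ldots, c_k) = P(c_1)\\prod_{i=2}^{k} P(c_i \\mid c_{i-1})$$\n", "\n", "and the session score is the likelihood of its rarest window, $\\min_j P(c_j, \\ldots, c_{j+k-1})$. When parameters are included, each command factor is additionally multiplied by the geometric mean of its parameter probabilities, $\\big(\\prod_{p} P(p \\mid c_i)\\big)^{1/m}$, where $m$ is the number of parameters considered for $c_i$.\n" ] },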
{ "cell_type": "markdown", "metadata": {}, "source": [ "# There are 3 high level functions available in this library\n", "\n", "1. score_sessions\n", "2. visualise_scored_sessions\n", "3. score_and_visualise_sessions\n", "\n", "We will demonstrate the usage of the first two functions, but the \"score_and_visualise_sessions\" function can be used in a similar way.\n", "\n", "If you want to see more detail about any of the arguments to these functions, you can simply run: help(name_of_function)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## We will first demonstrate the high level function for modelling the sessions <a id='model_function'></a>\n", "\n", "We will do this for the \"Commands with Parameters and their Values\" session type.\n", "\n", "But because we created columns for all three session types, you can set the \"session_column\" parameter in the \"score_sessions\" function below to any of the following:\n", "\n", "1. session\n", "2. param_session\n", "3. param_value_session" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617756745486 } }, "outputs": [], "source": [ "# This function will return a dataframe with two additional columns appended:\n", "# \"rarest_window3_likelihood\" and \"rarest_window3\"\n", "\n", "modelled_df = anomalous.score_sessions(\n", "    data=data,\n", "    session_column='param_value_session',\n", "    window_length=3\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617756748482 } }, "outputs": [], "source": [ "# Let's view the resulting dataframe in ascending order of the computed likelihood metric\n", "\n", "modelled_df.sort_values('rarest_window3_likelihood').head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617756752041 } }, "outputs": [], "source": [ "# we can view individual sessions in more detail\n", "\n", "modelled_df.sort_values('rarest_window3_likelihood').rarest_window3.iloc[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Now we demonstrate the visualization component of the library <a id='visualize_function'></a>\n", "\n", "We do this using the \"visualise_scored_sessions\" function. This function returns an interactive timeline plot which allows you to zoom into different sections etc.\n", "\n", "* The time of the session will be on the x-axis.\n", "* The computed likelihood metric will be on the y-axis.\n", "* Lower likelihoods correspond to rarer sessions.\n", "\n", "Important note:\n", "\n", "During the scoring/modelling stage, if you set the window length to be k, then only sessions which have at least k-1 commands will appear in the interactive timeline plot. This is because sessions with fewer than k-1 commands will have a score of np.nan. The reason for the -1 is that we append an end token to each session by default, so a session of length k-1 gets treated as length k during the scoring." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1617756755036 } }, "outputs": [], "source": [ "# visualise the scored sessions in an interactive timeline plot\n", "\n", "anomalous.visualise_scored_sessions(\n", "    data_with_scores=modelled_df[modelled_df[\"rarest_window3_likelihood\"].notnull()],\n", "    time_column='begin',  # this will appear on the x-axis\n", "    score_column='rarest_window3_likelihood',  # this will appear on the y-axis\n", "    window_column='rarest_window3',  # this will represent the session in the tool-tips\n", "    source_columns=['UserId', 'ClientIP']  # specify any additional columns to appear in the tool-tips\n", ")" ] } ], "metadata": { "kernel_info": { "name": "python38-azureml" }, "kernelspec": { "display_name": "Python 3.8 - AzureML", "language": "python", "name": "python38-azureml" }, "language_info": { "name": "python", "version": "" }, "microsoft": { "host": { "AzureML": { "notebookHasBeenCompleted": true } } }, "nteract": { "version": "nteract-front-end@1.0.0" } }, "nbformat": 4, "nbformat_minor": 4 }