# MSTICPy - Matrix Plot

This notebook demonstrates the use of the MSTICPy matrix visualization built using the [Bokeh library](https://bokeh.pydata.org).

You must have msticpy installed:
```
%pip install --upgrade msticpy
```

The matrix plot is designed to show interactions between two items stored
in a pandas DataFrame in an x-y grid.

To take an example, if you have a DataFrame with source and destination IP addresses
(for example, a firewall log), you can plot the source IPs on the y axis and
destination IPs on the x axis. Where there is an event (row) that links a given
source and destination, the matrix plot will plot a circle.

By default, the size of the circle is proportional to the number of events
containing a given source/destination pair (x and y).

The matrix plot also has the following variations:
- You can use a named column from the input data (e.g. bytes transmitted) to control
  the size of the plotted circle.
- You can invert the circle plot size, so that rarer interactions are shown
  with a larger intersection point.
- You can plot just the presence of one or more interactions - this plots
  a fixed-size point and is useful if you only want to see the presence/
  absence of an interaction but don't care about the number of interactions.
- You can use a count of distinct values to control the size (e.g. you might
  specify "protocol" as the value column and want to see how many distinct
  protocols the source/destination interacted over).
- You can plot the log of any of the above counts/size - this is useful if
  the variance in the size is orders of magnitude.


In [1]:
# Imports

from msticpy.common.utility import check_py_version
MIN_REQ_PYTHON = (3, 6)
check_py_version(MIN_REQ_PYTHON)

import pandas as pd

from msticpy import init_notebook
init_notebook(globals())


True

## Creating some sample data


In [2]:
all_df = pd.read_csv(
    "data/az_net_flows.csv",
    index_col=0,
    parse_dates=[
        "TimeGenerated",
        "FlowStartTime",
        "FlowEndTime",
        "FlowIntervalEndTime",
    ],
)

# Create some sample data to work with
net_df = (
    all_df[["AllExtIPs", "L7Protocol", "TotalAllowedFlows"]]
    .rename(columns={"AllExtIPs": "SourceIP"})
    .sample(100)
)


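def get_dest_ip(row):
    """Pick a random SourceIP outside 10.x that differs from this row's SourceIP."""
    dest_ip = None
    # Candidate destinations: every SourceIP that does not start with "10."
    external_ips = net_df[~net_df["SourceIP"].str.startswith("10.")]
    # Re-sample until the destination is not the same address as the source
    while dest_ip is None or row.SourceIP == dest_ip:
        dest_ip = external_ips.sample(1)["SourceIP"].values[0]
    return dest_ip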

net_df["DestinationIP"] = net_df.apply(get_dest_ip, axis=1)
net_df.head(3)

| | SourceIP | L7Protocol | TotalAllowedFlows | DestinationIP |
|---|---|---|---|---|
| 690 | 20.38.98.100 | https | 1.0 | 65.55.44.109 |
| 544 | 13.67.143.117 | https | 1.0 | 13.71.172.130 |
| 957 | 65.55.163.76 | https | 5.0 | 13.65.107.32 |


## The basic matrix/interaction plot

The basic plot displays a circle at each interaction between the X and
Y axis items. The size of the circle is proportional to the number
of records/rows in which the X and Y parameters interact.

Here we are using the MSTICPy pandas accessor to plot the graph directly
from the DataFrame:

`data.mp_plot.matrix()`

In [3]:
net_df.mp_plot.matrix(x="SourceIP", y="DestinationIP", title="IP Interaction")
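
As a rough illustration of what the default circle size represents, the sketch below counts rows per (x, y) pair in plain Python. The sample rows and the `count_interactions` helper are hypothetical and are not part of MSTICPy; the library's own aggregation may be implemented differently.

```python
from collections import Counter

# Hypothetical firewall-style rows: (SourceIP, DestinationIP)
rows = [
    ("10.0.0.4", "13.71.172.130"),
    ("10.0.0.4", "13.71.172.130"),
    ("10.0.0.4", "65.55.44.109"),
    ("10.0.0.5", "65.55.44.109"),
]


def count_interactions(pairs):
    """Return the number of rows seen for each (x, y) pair."""
    return Counter(pairs)


sizes = count_interactions(rows)
for (src, dst), n in sorted(sizes.items()):
    print(f"{src} -> {dst}: {n} row(s)")
```

Each key in `sizes` corresponds to one circle on the grid, and its count is the quantity the circle size is scaled from.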

## Using the Bokeh interactive tools

The Bokeh graph is interactive and has the following features:
- Tooltip display for each event marker as you hover over it
- Toolbar with the following tools (most are toggles enabling or disabling the tool):
  - Panning 
  - Select zoom
  - Mouse wheel zoom
  - Reset to default view
  - Save image to PNG
  - Hover tool

## Sorting the X and Y values

You can use `sort` to sort both axes, or `sort_x` and `sort_y` to sort each axis individually.

The sort parameters accept the values "asc" (ascending), "desc" (descending), or `True` (ascending).
`None` and `False` produce no sorting.

In [25]:
net_df.mp_plot.matrix(
    x="SourceIP",
    y="DestinationIP",
    title="IP Interaction",
    sort_y="asc",
    sort_x=False,
)
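
To make the accepted sort values concrete, here is a minimal sketch of how they could be interpreted when ordering axis labels. `sort_axis_values` is a hypothetical helper written for this illustration, not an MSTICPy function, and it only mirrors the behavior described above.

```python
def sort_axis_values(values, sort=None):
    """Order unique axis labels following the documented sort values.

    Hypothetical helper, not part of MSTICPy.
    """
    unique = list(dict.fromkeys(values))  # de-duplicate, keep first-seen order
    if sort is None or sort is False:
        return unique  # no sorting requested
    if sort is True:
        return sorted(unique)  # True is treated as ascending
    order = str(sort).casefold()
    if order == "asc":
        return sorted(unique)
    if order == "desc":
        return sorted(unique, reverse=True)
    raise ValueError(f"Unsupported sort value: {sort!r}")


protocols = ["https", "dns", "ssh", "dns"]
print(sort_axis_values(protocols, "asc"))
print(sort_axis_values(protocols, "desc"))
print(sort_axis_values(protocols, False))
```

Because this sketch sorts labels as strings, IP addresses would be ordered lexically (for example, "10.0.0.4" sorts before "9.9.9.9").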

## You can also import and use the `plot_matrix` function directly

Supply the input DataFrame as the first parameter (or as the named
parameter `data`).

```python
from msticpy.vis.matrix_plot import plot_matrix

plot_matrix(data=net_df, x="SourceIP", y="DestinationIP", title="IP Interaction")
```

## Plotting interactions based on column value

Instead of a simple count of rows linking an X-Y pair of entities,
you can use a numeric column in the input DataFrame to control
the size of the plotted circle.

In this example, we're using the "TotalAllowedFlows" column.

In [13]:
all_df.mp_plot.matrix(
    x="L7Protocol",
    y="AllExtIPs",
    value_col="TotalAllowedFlows",
    title="External IP protocol flows",
    sort="asc",
)
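
The sketch below shows one way a numeric column could drive the size: it adds up the value column for each (x, y) pair. Summing is an assumption made for this illustration, and the rows and the `sum_by_pair` helper are hypothetical; check the MSTICPy documentation for the exact aggregation the library applies.

```python
from collections import defaultdict

# Hypothetical rows: (L7Protocol, AllExtIPs, TotalAllowedFlows)
flows = [
    ("https", "13.71.172.130", 1.0),
    ("https", "13.71.172.130", 5.0),
    ("http", "65.55.44.109", 2.0),
]


def sum_by_pair(records):
    """Sum the value column for each (x, y) pair (assumed aggregation)."""
    totals = defaultdict(float)
    for x_val, y_val, value in records:
        totals[(x_val, y_val)] += value
    return dict(totals)


totals = sum_by_pair(flows)
for pair, total in totals.items():
    print(pair, total)
```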

## Log scaling the size column

Because a few values in the data are very large, many of the points in the previous plot are difficult to see.
We can address this by plotting the log of the size values.

In [14]:
all_df.mp_plot.matrix(
    x="L7Protocol",
    y="AllExtIPs",
    value_col="TotalAllowedFlows",
    title="External IP protocol flows (log of size)",
    log_size=True,
    sort="asc",
)
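
To see why a log transform helps, the sketch below compares the spread of some hypothetical per-pair totals before and after taking the log. Using `math.log1p` is an assumption made here so that a value of 1 does not collapse to zero; the transform that MSTICPy applies may differ.

```python
import math

# Hypothetical per-pair totals spanning several orders of magnitude
totals = {"pair_a": 1.0, "pair_b": 100.0, "pair_c": 10000.0}

# log1p computes log(1 + x); this choice is an assumption for the sketch
log_sizes = {pair: math.log1p(value) for pair, value in totals.items()}

raw_ratio = max(totals.values()) / min(totals.values())
log_ratio = max(log_sizes.values()) / min(log_sizes.values())
print(f"largest/smallest before log: {raw_ratio:.0f}x, after log: {log_ratio:.1f}x")
```

The ordering of the sizes is preserved, but the gap between the smallest and largest circle shrinks considerably, which keeps the small points visible.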

## Size based on number of distinct values

Use the `dist_count` parameter together with the `value_col` parameter
to base the size on the number of distinct values in the `value_col` column.

The plot below plots the circle size in proportion to the number
of distinct Layer 7 protocols used between the endpoints.

In [21]:
net_df.mp_plot.matrix(
    x="SourceIP",
    y="DestinationIP",
    value_col="L7Protocol",
    dist_count=True,
    title="External IP protocol flows (distinct protocols)",
    sort="asc",
    max_label_font_size=9,
)
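
The idea of sizing by distinct values can be sketched in plain Python by collecting the set of protocols seen for each source/destination pair. The rows and the `distinct_by_pair` helper are hypothetical and only illustrate the concept; they are not MSTICPy code.

```python
from collections import defaultdict

# Hypothetical rows: (SourceIP, DestinationIP, L7Protocol)
rows = [
    ("10.0.0.4", "13.71.172.130", "https"),
    ("10.0.0.4", "13.71.172.130", "https"),
    ("10.0.0.4", "13.71.172.130", "ssh"),
    ("10.0.0.5", "65.55.44.109", "dns"),
]


def distinct_by_pair(records):
    """Count the distinct values seen for each (x, y) pair."""
    seen = defaultdict(set)
    for x_val, y_val, value in records:
        seen[(x_val, y_val)].add(value)
    return {pair: len(values) for pair, values in seen.items()}


distinct = distinct_by_pair(rows)
for pair, n in distinct.items():
    print(pair, n)
```

Note that the repeated "https" row does not increase the count: three rows for the first pair yield only two distinct protocols.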

## Inverting to show rare interactions as larger

Where you want to highlight unusual interactions, you can plot the
inverse of the `value_col` value or of the interaction count using the `invert=True` parameter.

This results in a plot with larger circles for rarer interactions.

In [16]:
net_df.mp_plot.matrix(
    x="SourceIP",
    y="DestinationIP",
    value_col="TotalAllowedFlows",
    title="External IP flows (rare flows == larger)",
    invert=True,
    sort="asc",
)
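
A simple way to picture inversion is to take the reciprocal of each size, so that small counts become large and large counts become small. The reciprocal is an assumption made for this sketch, and the counts are hypothetical; MSTICPy's exact inversion formula may differ.

```python
# Hypothetical interaction counts per (SourceIP, DestinationIP) pair
counts = {
    ("10.0.0.4", "13.71.172.130"): 50,  # common interaction
    ("10.0.0.5", "65.55.44.109"): 1,    # rare interaction
}

# Reciprocal sizing (assumed): rarer pairs receive larger values
inverted = {pair: 1 / n for pair, n in counts.items()}

for pair, size in sorted(inverted.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(size, 3))
```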

## Showing interactions only

Where you do not care about any value associated with the interaction
and only want to see whether there has been an interaction, you can use
the `intersect` parameter.

In [19]:
net_df.mp_plot.matrix(
    x="SourceIP",
    y="DestinationIP",
    title="External IP flows (intersection)",
    intersect=True,
    sort="asc",
)
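
Conceptually, an intersection-only plot reduces the data to the set of pairs that occur at least once and gives every pair the same size. The sketch below shows that reduction in plain Python; the rows, the `FIXED_SIZE` constant, and the `presence_only` helper are hypothetical and are not part of MSTICPy.

```python
FIXED_SIZE = 1  # every observed pair gets the same marker size (illustrative value)

# Hypothetical rows: (SourceIP, DestinationIP)
rows = [
    ("10.0.0.4", "13.71.172.130"),
    ("10.0.0.4", "13.71.172.130"),
    ("10.0.0.4", "13.71.172.130"),
    ("10.0.0.5", "65.55.44.109"),
]


def presence_only(pairs):
    """Map each pair that occurs at least once to the same fixed size."""
    return {pair: FIXED_SIZE for pair in set(pairs)}


presence = presence_only(rows)
for pair in sorted(presence):
    print(pair, presence[pair])
```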