{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 3: Statistical Analysis\n", "\n", "## Feature Pre-processing\n", "\n", "In the last section, we talked about how you can combine and transform data for the purpose of analysis. How do we achieve this? We need to have a clear understanding of what our input data is, and what we hope to achieve for our output, such that this can help inform our investigation. Therefore, we need to think about the input, the process, and the output. \n", "\n", "**How do we construct features from our data?** Essentially, features can be thought of as numerical values, and quite often, we may just be counting data. If I wanted to derive a feature to measure RAM usage, or email usage, I would need to specify some time interval to observe (e.g., per hour, or per day). Much like when reporting speed, we would refer to miles per hour, rather than recording the absolute speed of the vehicle at every observation. In this way, we generalise to the time interval that makes most sense in practice (we could examine miles per minute, but it’s more natural to state 70 mph rather than roughly 1.17 miles per minute). If I’m studying email usage, rather than simply the number of emails sent, I may start to form some classifications of my features – for example, number of emails sent to unique recipients, number of emails sent to a specific individual, number of new recipients per day, number of words in each email, number of unique words in each email, and the list goes on. Hopefully you can start to see that there are many possible ways that we could derive numerical count features about email (and many other data observations). Crucially, remember that we are interested in observations over time, so that we can compare time periods to understand where there may be an increase or decrease in the observed measure. 
As well as temporal features, we may be interested in spatial features – such as pixel locations in an image or GPS points on a map, or furthermore, we may be interested in spatiotemporal features – such as a pixel location in a video stream or a moving vehicle position.\n", "\n", "**Finding and cleaning data?** There are a number of excellent resources for gathering example datasets, such as [Kaggle](https://www.kaggle.com/), the [VAST Challenge](http://www.vacommunity.org/About+the+VAST+Challenge), the [UC Irvine Machine Learning repository](https://archive.ics.uci.edu/ml/index.php), and numerous other examples hosted online and in various data repositories. Whilst these are useful for learning about machine learning and visualisation, much of the hard work has already been done for us. **Web scraping** is often used to gather large amounts of data from online sources, for example, news story analysis, or examining CVE records. In such cases, there will be significant cleaning of data required, such as filtering out noise in the collected data, or correcting timestamps so that they are reported consistently. We will look at methods for cleaning data in our example practicals.\n", "\n", "## Types of Anomalies\n", "\n", "[Anomaly detection](https://www.datasciencecentral.com/profiles/blogs/anomaly-detection-for-the-oxford-data-science-for-iot-course) is widely discussed in terms of cyber security, and how “artificial intelligence can spot anomalies in your data”. However, it’s crucial to understand what we mean by anomalies, and what type of anomalies may exist. This will help us to understand what the anomalies actually mean, and to assess whether they pose some form of security concern. Here we will focus primarily on 3 types of outlier: point, contextual, and subsequence. In most applications here, we are thinking about [time-series anomalies](https://blog.statsbot.co/time-series-anomaly-detection-algorithms-1cef5519aef2): essentially how something changes over time. 
\n", "\n", "- **Point anomaly:** This is where a single point in a time series is anomalous compared to the rest of the data. This is the most typical kind of anomaly that we may think of, yet if we are graphing the correct data, it is also the most straightforward to identify.\n", "- **Contextual anomaly:** This is where a data instance in a time series is considered anomalous because of the context of the data. If we were measuring the temperature of different locations, and one location in the northern hemisphere reported low temperatures in the Summer, this may be an anomaly. Note that the temperature data alone is not sufficient to recognise this – we would need to have prior knowledge of temperature data for countries in the northern hemisphere during the Summer months, gathered historically, to be able to inform on the context here. Another example would be the presence of malware running on an infected machine, and the impact on CPU usage and process count. To recognise the anomaly here, we would need to know what the “typical” CPU usage and process count are in a state where the machine is deemed to be acting normally. In this manner, the anomaly is identified with respect to the historical data, where this observation may be much higher or much lower than the previous records. This historical data may be informed from some database of known anomalous (and non-anomalous) cases, or it may be informed directly from the data itself, where a repeating pattern is expected (e.g., seasonal).\n", "- **Subsequence anomaly:** This is where a sequence of individual events is deemed to be anomalous with regard to the rest of the data, although the individual data points themselves are not deemed anomalous. This could be seen as similar to contextual anomalies; however, the key difference is that subsequence anomalies may not be out-of-distribution, whereas contextual anomalies would be. 
For example, a recurring pattern that then suddenly flattens for a period, and then begins again, would be recognised as an anomaly. Yet, each individual data point is well within the prior distribution of the data. It is only anomalous because the sequence, or pattern, is anomalous. Consider another example related to insider threat detection. An employee may conduct the following steps in an activity: (1) log into payment application, (2) retrieve payment details, (3) record item to be purchased, (4) enter payment details, and (5) send email to line manager. Each individual step may not be anomalous on its own (i.e., all are legitimate actions). Likewise, the employee may be authorised to make purchases at any time of day, as needed. However, if something changed in this sequence - e.g., suppose they stopped emailing their line manager following payment, or stopped recording the item to be purchased; or added a new step – such as opening notepad and writing purchase details to a file – then there would be cause for concern. As mentioned, here each individual activity is legitimate; it is about observing an anomalous sequence of events, rather than an individual anomaly. Note how subsequence anomalies can be used for both numerical and discrete data – such as labels. Another example for text analytics could be, “The quick brown fox jumps over the lazy camel”. Many people will be familiar with this phrase, and will therefore recognise that the word camel is an anomaly – not because camel would necessarily be an incorrect statement, but because the well-known phrase would say ‘dog’.\n", "\n", "![Alt text](./images/image8.png)\n", "\n", "## Descriptive Statistics\n", "\n", "Statistics are at the very heart of understanding the properties of data. There are some [core concepts](https://elearningindustry.com/stats-101-need-know-statistics) that you should therefore understand. 
Firstly, when we talk about a set of data, we may refer to this as a distribution – it is a set of measured observations that are indicative of the real world. The Mean is the average value of the distribution – for example, if I was assessing the number of network packets received per minute, then the mean would be the average number of packets received per minute. This could be used to estimate a baseline for the activity (i.e., the expected behaviour). The Median would then be the middle value of the distribution, if I arranged all values from lowest to highest. The Mode is the most common value that has occurred in the distribution. Each of these gives us some indication of where the centre of our data lies – however each has its own weaknesses. If there are outliers in the data, the mean will be skewed by these – so a single point anomaly can shift the mean substantially. With the median, only the ordering of the values matters (i.e., if I have n sorted values, I count up to the middle one, around position n/2), which means that the magnitudes of the highest and lowest values are ignored. Therefore, it is good practice to assess all measures in case they help inform different stories about the data. Standard deviation is also an important measurement to understand. This informs about the spread of the data – whether it is narrow around the centre point, or spread out across the range of values (the range being essentially the difference between the largest and smallest values). In many applications, it is useful to consider the normal distribution (sometimes referred to as a bell curve, or a Gaussian distribution). The normal distribution can be expressed by the mean to define the centre point, and the standard deviation to define the spread. This becomes particularly useful when we want to consider whether a new observation is deemed to be inside or outside of the distribution, since, for normally distributed data, approximately 95% of the observations should be within 2 standard deviations of the mean. 
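\n", "\n", "As a small worked sketch of these measures (the symbols $x_i$, $n$, $\\bar{x}$, $s$ and $z$ are notation introduced here for illustration, and the numbers below are made up), the mean, the sample standard deviation, and a simple check of a new observation $x_{\\text{new}}$ against the 2 standard deviation band could be written as:\n", "\n", "$$\\bar{x} = \\frac{1}{n}\\sum_{i=1}^{n} x_i, \\qquad s = \\sqrt{\\frac{1}{n-1}\\sum_{i=1}^{n}\\left(x_i - \\bar{x}\\right)^2}, \\qquad z = \\frac{x_{\\text{new}} - \\bar{x}}{s}$$\n", "\n", "For example, if a baseline of packets per minute had $\\bar{x} = 100$ and $s = 10$, then an observation of 125 would give $z = 2.5$, which lies outside the 2 standard deviation band, whereas an observation of 115 would give $z = 1.5$, which lies inside it. This check assumes the data are approximately normally distributed.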
\n", "\n", "![Alt text](./images/image9.png)\n", "\n", "## Comparing Data\n", "\n", "Suppose we have four datasets that we wish to compare, to observe any deviations or anomalies that may occur. How may we approach this task? Let's assume that each dataset has two parameters (X and Y). We therefore have X1 and Y1 as dataset 1, X2 and Y2 as dataset 2, X3 and Y3 as dataset 3, and X4 and Y4 as dataset 4. We can use the code below to load in our sample dataset." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>X1</th><th>Y1</th><th>X2</th><th>Y2</th><th>X3</th><th>Y3</th><th>X4</th><th>Y4</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>10</td><td>8.04</td><td>10</td><td>9.14</td><td>10</td><td>7.46</td><td>8</td><td>6.58</td></tr>\n",
"    <tr><th>1</th><td>8</td><td>6.95</td><td>8</td><td>8.14</td><td>8</td><td>6.77</td><td>8</td><td>5.76</td></tr>\n",
"    <tr><th>2</th><td>13</td><td>7.58</td><td>13</td><td>8.74</td><td>13</td><td>12.74</td><td>8</td><td>7.71</td></tr>\n",
"    <tr><th>3</th><td>9</td><td>8.81</td><td>9</td><td>8.77</td><td>9</td><td>7.11</td><td>8</td><td>8.84</td></tr>\n",
"    <tr><th>4</th><td>11</td><td>8.33</td><td>11</td><td>9.26</td><td>11</td><td>7.81</td><td>8</td><td>8.47</td></tr>\n",
"    <tr><th>5</th><td>14</td><td>9.96</td><td>14</td><td>8.10</td><td>14</td><td>8.84</td><td>8</td><td>7.04</td></tr>\n",
"    <tr><th>6</th><td>6</td><td>7.24</td><td>6</td><td>6.13</td><td>6</td><td>6.08</td><td>8</td><td>5.25</td></tr>\n",
"    <tr><th>7</th><td>4</td><td>4.26</td><td>4</td><td>3.10</td><td>4</td><td>5.39</td><td>19</td><td>12.50</td></tr>\n",
"    <tr><th>8</th><td>12</td><td>10.84</td><td>12</td><td>9.13</td><td>12</td><td>8.15</td><td>8</td><td>5.56</td></tr>\n",
"    <tr><th>9</th><td>7</td><td>4.82</td><td>7</td><td>7.26</td><td>7</td><td>6.42</td><td>8</td><td>7.91</td></tr>\n",
"    <tr><th>10</th><td>5</td><td>5.68</td><td>5</td><td>4.74</td><td>5</td><td>5.73</td><td>8</td><td>6.89</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " X1 Y1 X2 Y2 X3 Y3 X4 Y4\n", "0 10 8.04 10 9.14 10 7.46 8 6.58\n", "1 8 6.95 8 8.14 8 6.77 8 5.76\n", "2 13 7.58 13 8.74 13 12.74 8 7.71\n", "3 9 8.81 9 8.77 9 7.11 8 8.84\n", "4 11 8.33 11 9.26 11 7.81 8 8.47\n", "5 14 9.96 14 8.10 14 8.84 8 7.04\n", "6 6 7.24 6 6.13 6 6.08 8 5.25\n", "7 4 4.26 4 3.10 4 5.39 19 12.50\n", "8 12 10.84 12 9.13 12 8.15 8 5.56\n", "9 7 4.82 7 7.26 7 6.42 8 7.91\n", "10 5 5.68 5 4.74 5 5.73 8 6.89" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "data = pd.read_csv('./data/anscombe.csv')\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First of all, we can calculate statistics for each of the four datasets to see how they may vary. We will try a set of common statistics in the next few cells, starting with the mean of both X and Y parameters, the variance of X and Y parameters, the correlation between X and Y parameters, and finally the line of best fit, or the regression line, of our X and Y parameters." 
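, "\n", "\n", "As a hedged sketch of what these statistics compute (the symbols $s_x^2$, $s_y$, $r$ and $b$ are notation introduced here for illustration; pandas `var()` uses the $n-1$ denominator by default), the sample variance, the Pearson correlation, and the slope of the least-squares regression line for one dataset can be written as:\n", "\n", "$$s_x^2 = \\frac{1}{n-1}\\sum_{i=1}^{n}\\left(x_i - \\bar{x}\\right)^2, \\qquad r = \\frac{\\sum_{i=1}^{n}\\left(x_i - \\bar{x}\\right)\\left(y_i - \\bar{y}\\right)}{(n-1)\\, s_x\\, s_y}, \\qquad b = r\\,\\frac{s_y}{s_x}$$\n", "\n", "For the first dataset, the cells that follow report $s_x^2 = 11.0$, $s_y^2 \\approx 4.127$ and $r \\approx 0.816$, so $b \\approx 0.816 \\times \\sqrt{4.127 / 11.0} \\approx 0.50$, which agrees with the regression coefficient printed for that dataset."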
] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean of X data:\n", "9.0\n", "9.0\n", "9.0\n", "9.0\n" ] } ], "source": [ "print (\"Mean of X data:\")\n", "for i in ['X1', 'X2', 'X3', 'X4']:\n", " print (data[i].mean())" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Variance of X data:\n", "11.0\n", "11.0\n", "11.0\n", "11.0\n" ] } ], "source": [ "print (\"Variance of X data:\")\n", "for i in ['X1', 'X2', 'X3', 'X4']:\n", " print (data[i].var())" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean of Y data:\n", "7.500909090909093\n", "7.50090909090909\n", "7.5\n", "7.500909090909091\n" ] } ], "source": [ "print (\"Mean of Y data:\")\n", "for i in ['Y1', 'Y2', 'Y3', 'Y4']:\n", " print (data[i].mean())" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Variance of Y data:\n", "4.127269090909091\n", "4.127629090909091\n", "4.12262\n", "4.123249090909091\n" ] } ], "source": [ "print (\"Variance of Y data:\")\n", "for i in ['Y1', 'Y2', 'Y3', 'Y4']:\n", " print (data[i].var())" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Correlation between X and Y:\n", " X1 Y1\n", "X1 1.000000 0.816421\n", "Y1 0.816421 1.000000\n", " X2 Y2\n", "X2 1.000000 0.816237\n", "Y2 0.816237 1.000000\n", " X3 Y3\n", "X3 1.000000 0.816287\n", "Y3 0.816287 1.000000\n", " X4 Y4\n", "X4 1.000000 0.816521\n", "Y4 0.816521 1.000000\n" ] } ], "source": [ "print (\"Correlation between X and Y:\")\n", "for i in [['X1','Y1'], ['X2','Y2'], ['X3','Y3'], ['X4','Y4']]:\n", " print (data[i].corr())" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0.50009091]]\n", "[[0.5]]\n", "[[0.49972727]]\n", "[[0.49990909]]\n" ] } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "for i in [['X1','Y1'], ['X2','Y2'], ['X3','Y3'], ['X4','Y4']]:\n", " lm = LinearRegression() \n", " lm.fit(data[i[0]].values.reshape(-1, 1), data[i[1]].values.reshape(-1, 1))\n", " print(lm.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Having performed our initial analysis, what can we observe about our four datasets? What is particularly intriguing in this example is that all four datasets have almost exactly the same statistical characteristics! To two decimal places, they all show the same mean for both X and Y, the same variance, the same correlation, and the same regression slope. So, presumably, these datasets are essentially all the same?\n", "\n", "This is a perfect case for data visualisation, and is actually a well-known problem known as Anscombe’s Quartet. When we visualise the data using four scatter plots, we can quickly determine that the four datasets are wildly different. However, on the surface, the descriptive statistics gave the same information. As data becomes increasingly large, we do need to use statistical measures and these are important, but it is also important that we do not rely on them solely. [Anscombe](https://www.sjsu.edu/faculty/gerstman/StatPrimer/anscombe1973.pdf) (1973) said:\n", "\n", "> *“make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding”*" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "for i in [['X1','Y1'], ['X2','Y2'], ['X3','Y3'], ['X4','Y4']]:\n", " plt.scatter(data[i[0]], data[i[1]])\n", " plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Datasaurus Dozen\n", "\n", "A modern take on the Anscombe quartet is the [Datasaurus Dozen](https://dl.acm.org/doi/10.1145/3025453.3025912). Given a distribution of points (in this example, a shape that looks remarkably like a dinosaur), the proposed system uses an optimisation technique (simulated annealing) to identify other configurations of the data points such that the underlying statistical properties match those of the original dataset. As the name suggests, the original depiction of a dinosaur can be mapped to 12 other data representations whilst also preserving the underlying statistical properties.\n", "\n", "![Alt text](./images/image11.gif)\n", "\n", "## Correlation does not imply Causation\n", "\n", "As a final point for discussion in this section, it is important to recognise a golden rule when working with statistics, and that is [correlation does not imply causation](https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation). Ice cream sales may increase when the weather is sunny, and likewise shark attacks may increase when the weather is sunny. However, shark attacks are not caused by ice cream sales (nor are ice cream sales caused by shark attacks). In this example, the hidden variable that both attributes rely on is sunny weather – although there are actually many other factors and neither case is caused by a single variable. \n", "\n", "![Alt text](./images/image12.png)\n", "\n", "When we are exploring data science for cyber security, we want to make well-informed decisions from the data. 
It is important to recognise that attributes observed in the SOC are not necessarily caused by other correlated attributes in your workforce. Further research explores [causal modelling in cyber security](https://www.astesj.com/publications/ASTESJ_050349.pdf) to determine how effective this can be.\n", "\n", "\n", "![Alt text](./images/image13.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further reading\n", "\n", "- [T. Mahmood and U. Afzal, \"Security Analytics: Big Data Analytics for cybersecurity: A review of trends, techniques and tools,\" 2013 2nd National Conference on Information Assurance (NCIA), 2013, pp. 129-134, doi: 10.1109/NCIA.2013.6725337.](https://ieeexplore.ieee.org/document/6725337)\n", "- [Weihs, C., Ickstadt, K. Data Science: the impact of statistics. Int J Data Sci Anal 6, 189–194 (2018). https://doi.org/10.1007/s41060-018-0102-5](https://link.springer.com/article/10.1007/s41060-018-0102-5)\n", "- [Calude, C.S., Longo, G. The Deluge of Spurious Correlations in Big Data. Found Sci 22, 595–612 (2017). https://doi.org/10.1007/s10699-016-9489-4](https://link.springer.com/article/10.1007/s10699-016-9489-4)\n", "- [Briggs, W.M. Common Statistical Fallacies. Journal of American Physicians and Surgeons, Volume 19, Number 2 (2014).](https://www.jpands.org/vol19no2/briggs.pdf)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }