{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 3: Statistical Analysis\n", "\n", "## Feature Pre-processing\n", "\n", "In the last section, we talked about how you can combine and transform data for the purpose of analysis. How do we achieve this? We need to have a clear understanding of what our input data is, and what we hope to achieve for our output, such that this can help inform our investigation. Therefore, we need to think about the input, the process, and the output. \n", "\n", "**How do we construct features from our data?** Essentially features can be thought of as numerical values, and quite often, we may just be counting data. If I wanted to derive a feature to measure RAM usage, or email usage, I would need to specify some time interval to observe (e.g., per hour, or per day). Much like when reporting speed, we would refer to miles per hour, rather than recording the absolute speed of the vehicle at every observation. In this way, we generalise to the time interval that makes most sense in practice (we could examine miles per minute, but it’s more natural to state 70 mph rather than 1.67 miles per minute). If I’m studying email usage, rather than simply just the number of emails sent, I may start to form some classifications of my features – for example, number of emails sent to unique recipients, number of emails sent to a specific individual, number of new recipients per day, number of words in each email, number of unique words in each email, and the list goes on. Hopefully you can start to see that there are many possible ways that we could derive numerical count features about email (and many other data observations). Crucially to remember is that we are interested in observations over time, so that we can compare time periods to understand where there may be an increase or decrease in the observed measure. As well as temporal features, we may be interested in spatial features – such as pixel locations in an image or GPS points on a map, or furthermore, we may be interested in sptiotemporal features – such as a pixel location in a video stream or a moving vehicle position.\n", "\n", "**Finding and cleaning data?** There are a number of excellent resources for gathering example datasets, such as [Kaggle](https://www.kaggle.com/), the [VAST Challenge](http://www.vacommunity.org/About+the+VAST+Challenge), the [UC Irvine Machine Learning repository](https://archive.ics.uci.edu/ml/index.php), and numerous other examples hosted online and in various data repositories. Whilst these are useful for learning about machine learning and visualisation, much of the hard work has already been done for us. **Web scraping** is often used to gather large amounts of data from online sources, for example, news story analysis, or examining CVE records. In such cases, there will be significant cleaning of data required, such as filtering out noise in the collected data, or correcting timestamps so that they are reported consistently. We will look at methods for cleaning data in our example practicals.\n", "\n", "## Types of Anomalies\n", "\n", "[Anomaly detection](https://www.datasciencecentral.com/profiles/blogs/anomaly-detection-for-the-oxford-data-science-for-iot-course) is widely discussed in terms of cyber security, and how “artificial intelligence can spot anomalies in your data”. However, it’s crucial to understand what we mean by anomalies, and what type of anomalies may exist. 
**How do we find and clean data?** There are a number of excellent resources for gathering example datasets, such as [Kaggle](https://www.kaggle.com/), the [VAST Challenge](http://www.vacommunity.org/About+the+VAST+Challenge), the [UC Irvine Machine Learning repository](https://archive.ics.uci.edu/ml/index.php), and numerous other examples hosted online and in various data repositories. Whilst these are useful for learning about machine learning and visualisation, much of the hard work has already been done for us. **Web scraping** is often used to gather large amounts of data from online sources, for example for news story analysis, or for examining CVE records. In such cases, significant cleaning of the data will be required, such as filtering out noise in the collected data, or correcting timestamps so that they are reported consistently. We will look at methods for cleaning data in our example practicals.

## Types of Anomalies

[Anomaly detection](https://www.datasciencecentral.com/profiles/blogs/anomaly-detection-for-the-oxford-data-science-for-iot-course) is widely discussed in terms of cyber security, and how “artificial intelligence can spot anomalies in your data”. However, it’s crucial to understand what we mean by anomalies, and what types of anomaly may exist. This will help us to understand what the anomalies actually mean, so that we can assess whether they pose some form of security concern. Here we will focus primarily on three types of outlier: point, contextual, and subsequence. In most applications here, we are thinking about [time-series anomalies](https://blog.statsbot.co/time-series-anomaly-detection-algorithms-1cef5519aef2): essentially, how something changes over time.

- **Point anomaly:** This is where a single point in a time series is anomalous compared to the rest of the data. This is the most typical kind of anomaly that we may think of, and, provided we are graphing the correct data, it is also the most straightforward to identify.
- **Contextual anomaly:** This is where a data instance in a time series is considered anomalous because of the context of the data. If we were measuring the temperature of different locations, and one location in the northern hemisphere reported low temperatures in the summer, this may be an anomaly. Note that the temperature data alone is not sufficient to recognise this – we would need prior knowledge of temperature data for countries in the northern hemisphere during the summer months, gathered historically, to provide the context. Another example would be the presence of malware running on an infected machine, and its impact on CPU usage and process count. To recognise the anomaly here, we would need to know what the “typical” CPU usage and process count are when the machine is deemed to be acting normally. In this manner, the anomaly is identified with respect to historical data, where the observation may be much higher or much lower than previous records. This historical data may come from some database of known anomalous (and non-anomalous) cases, or it may come directly from the data itself, where a repeating pattern is expected (e.g., seasonal).
- **Subsequence anomaly:** This is where a sequence of individual events is deemed anomalous with regard to the rest of the data, although the individual data points themselves are not. This may seem similar to a contextual anomaly; the key difference is that subsequence anomalies need not be out-of-distribution, whereas contextual anomalies would be. For example, a recurring pattern that suddenly flattens for a period, and then begins again, would be recognised as an anomaly, yet each individual data point is well within the prior distribution of the data. It is only anomalous because the sequence, or pattern, is anomalous. Consider another example related to insider threat detection. An employee may conduct the following steps in an activity: (1) log into the payment application, (2) retrieve payment details, (3) record the item to be purchased, (4) enter payment details, and (5) send an email to their line manager. Each individual step may not be anomalous on its own (i.e., all are legitimate actions). Likewise, the employee may be authorised to make purchases at any time of day, as needed. However, if something changed in this sequence – e.g., they stopped emailing their line manager following payment, stopped recording the item to be purchased, or added a new step such as opening Notepad and writing the purchase details to a file – then there would be cause for concern. As mentioned, each individual activity here is legitimate; the concern comes from observing an anomalous sequence of events, rather than an individual anomaly. Note how subsequence anomalies can apply to both numerical and discrete data, such as labels. Another example, for text analytics, could be: “The quick brown fox jumps over the lazy camel”. Many people will be familiar with this phrase, and will therefore recognise that the word camel is an anomaly – not because camel is necessarily an incorrect statement, but because the well-known quote says ‘dog’.
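To make the idea of a point anomaly concrete, the sketch below flags values that lie unusually far from the centre of a small synthetic series. The data and the two-standard-deviation threshold are illustrative assumptions, anticipating the descriptive statistics discussed in the next section.

```python
import numpy as np

# Synthetic observations, e.g. network packets received per minute (illustrative only).
counts = np.array([52, 48, 50, 47, 53, 49, 51, 120, 50, 48])

mean = counts.mean()
std = counts.std()

# Flag any observation more than 2 standard deviations away from the mean.
is_anomaly = np.abs(counts - mean) > 2 * std
print(counts[is_anomaly])  # only the spike (120) is flagged for this series
```

Contextual and subsequence anomalies require more than a simple threshold, since they depend on historical context or on the ordering of events, but the same principle of comparing observations against an expected baseline still applies.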
## Descriptive Statistics

Statistics are at the very heart of understanding the properties of data. There are some [core concepts](https://elearningindustry.com/stats-101-need-know-statistics) that you should therefore understand. Firstly, when we talk about a set of data, we may refer to this as a distribution – a set of measured observations that is indicative of some real-world process. The Mean is the average value of the distribution – for example, if I was assessing the number of network packets received per minute, then the mean would be the average number of packets received per minute. This could be used to estimate a baseline for the activity (i.e., the expected behaviour). The Median is the middle value of the distribution when all values are arranged from lowest to highest. The Mode is the most common value occurring in the distribution. Each of these gives some indication of where the centre of our data lies – however, each has its own weaknesses. If there are outliers in the data, the mean will be skewed by them – so a single point anomaly can change the mean completely. The median, on the other hand, reflects only the middle of the ordered values, so the magnitudes of the largest (and smallest) values are effectively ignored. It is therefore good practice to assess all of these measures, since they may tell different stories about the data.

Standard deviation is also an important measurement to understand. This describes the spread of the data – whether it is narrow around the centre point, or spread out across the range of values (the range being essentially the difference between the largest and smallest values). In many applications, it is useful to consider the normal distribution (sometimes referred to as a bell curve, or a Gaussian distribution). The normal distribution can be expressed by the mean, which defines the centre point, and the standard deviation, which defines the spread. This becomes particularly useful when we want to consider whether a new observation is inside or outside of the distribution, since approximately 95% of observations should fall within 2 standard deviations of the mean.

## Comparing Data

Suppose we have four datasets that we wish to compare, to observe any deviations or anomalies that may occur. How might we approach this task? Let's assume that each dataset has two parameters (X and Y). We therefore have X1 and Y1 as dataset 1, X2 and Y2 as dataset 2, X3 and Y3 as dataset 3, and X4 and Y4 as dataset 4. We can use the code below to load in our sample dataset.
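The original loading cell is not reproduced in this extract; as a minimal equivalent sketch, the DataFrame below is constructed directly from the values shown in the output table that follows (in practice the same data might be read from a file with, say, `pd.read_csv`).

```python
import pandas as pd

# The four paired datasets (X1/Y1 ... X4/Y4), entered directly for illustration.
df = pd.DataFrame({
    "X1": [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    "Y1": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "X2": [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    "Y2": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "X3": [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    "Y3": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "X4": [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
    "Y4": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
})
df
```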
| \n", " | X1 | \n", "Y1 | \n", "X2 | \n", "Y2 | \n", "X3 | \n", "Y3 | \n", "X4 | \n", "Y4 | \n", "
|---|---|---|---|---|---|---|---|---|
| 0 | \n", "10 | \n", "8.04 | \n", "10 | \n", "9.14 | \n", "10 | \n", "7.46 | \n", "8 | \n", "6.58 | \n", "
| 1 | \n", "8 | \n", "6.95 | \n", "8 | \n", "8.14 | \n", "8 | \n", "6.77 | \n", "8 | \n", "5.76 | \n", "
| 2 | \n", "13 | \n", "7.58 | \n", "13 | \n", "8.74 | \n", "13 | \n", "12.74 | \n", "8 | \n", "7.71 | \n", "
| 3 | \n", "9 | \n", "8.81 | \n", "9 | \n", "8.77 | \n", "9 | \n", "7.11 | \n", "8 | \n", "8.84 | \n", "
| 4 | \n", "11 | \n", "8.33 | \n", "11 | \n", "9.26 | \n", "11 | \n", "7.81 | \n", "8 | \n", "8.47 | \n", "
| 5 | \n", "14 | \n", "9.96 | \n", "14 | \n", "8.10 | \n", "14 | \n", "8.84 | \n", "8 | \n", "7.04 | \n", "
| 6 | \n", "6 | \n", "7.24 | \n", "6 | \n", "6.13 | \n", "6 | \n", "6.08 | \n", "8 | \n", "5.25 | \n", "
| 7 | \n", "4 | \n", "4.26 | \n", "4 | \n", "3.10 | \n", "4 | \n", "5.39 | \n", "19 | \n", "12.50 | \n", "
| 8 | \n", "12 | \n", "10.84 | \n", "12 | \n", "9.13 | \n", "12 | \n", "8.15 | \n", "8 | \n", "5.56 | \n", "
| 9 | \n", "7 | \n", "4.82 | \n", "7 | \n", "7.26 | \n", "7 | \n", "6.42 | \n", "8 | \n", "7.91 | \n", "
| 10 | \n", "5 | \n", "5.68 | \n", "5 | \n", "4.74 | \n", "5 | \n", "5.73 | \n", "8 | \n", "6.89 | \n", "