{ "cells": [ { "cell_type": "markdown", "id": "5aa74260", "metadata": {}, "source": [ "# Outliers in SageWorks: An Exploration \n", "\n", "
\n", "\n", "This notebook investigate data distributions and potential outlier metrics\n", "- Mean, Stddev (https://en.wikipedia.org/wiki/Standard_deviation)\n", "- IQR, Scale (https://en.wikipedia.org/wiki/Interquartile_range)\n", "\n", "## Data\n", "We're using scipy to generate data with different distribution properties.\n", "\n", "### References\n", "- Numpy (https://numpy.org/)\n", "- Scipy (https://scipy.org/)\n", "- Fitter (https://github.com/cokelaer/fitter)" ] }, { "cell_type": "markdown", "id": "2dd13ab2", "metadata": {}, "source": [ "# Outlier Approaches and Data Distributions\n", "For this notebook we're looking at a reasonable set of data distributions and how the IQR and Stddev method compare when computing outliers on those datasets.\n", "\n", "- TWO nice images/graphics of:\n", " - mean/stddev\n", " - IQR/scale" ] }, { "cell_type": "code", "execution_count": 99, "id": "2b4bacba", "metadata": {}, "outputs": [], "source": [ "plot_data('normal')\n", "plot_data('skewed')\n", "plot_data('outliers')\n", "plot_data('outliers_2')\n", "plot_data('outliers_3')\n", "plot_data('negative_binom')\n", "plot_data('poisson')\n", "plot_data('zip')\n", "plot_data('gamma')\n", "plot_data('gamma_smoosh')\n", "plot_data('bimodal')" ] }, { "cell_type": "markdown", "id": "4e9d535c", "metadata": {}, "source": [ "# Take Away\n", "**There's no perfect metric for outliers, it's really about the data distribution and the use cases for your organization.**\n", "\n", "## SageWorks provide BOTH methods but defaults to IQR\n", "Why default to IQR for outliers?\n", "### Fast\n", "Athena/Presto has a scalable/performant way to compute approximate quartiles based on Q-Digest/T-Digest\n", "- https://prestodb.io/docs/current/functions/tdigest.html\n", "- https://prestodb.io/docs/current/functions/qdigest.html\n", "\n", "### Robust\n", "The IQR method is less sensitive to extreme values or outliers in the data compared to the standard deviation method. 
Outliers can significantly skew the mean and standard deviation, leading to unreliable bounds. IQR, on the other hand, relies on medians and quartiles, making it more robust.\n", "\n", "### No Assumption of Normality\n", "The IQR method does not assume that the data follows a normal distribution. In contrast, the standard deviation method's effectiveness can be compromised if the data is not normally distributed. If the underlying distribution is skewed or has heavy tails, the IQR might provide a more accurate way to identify outliers." ] }, { "cell_type": "markdown", "id": "7fd86dd3", "metadata": {}, "source": [ "## What if I don't know what the distribution of my data is?\n", "Here's an approach you can follow:\n", "\n", "- Prepare the Data: Clean the data and ensure there are no missing values or other issues that might interfere with the fitting process.\n", "\n", "- Choose Candidate Distributions: Select a set of candidate distributions that you believe might be suitable. Common choices might include Normal, Exponential, Poisson, Negative Binomial, etc.\n", "\n", "- Fit the Distributions: For each candidate distribution, estimate the parameters that provide the best fit to the data. 
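For instance, each `scipy.stats` continuous distribution exposes a `fit` method; here is a minimal, self-contained sketch of the fit-and-compare steps, where the `data` array and the candidate list are illustrative placeholders for your own data and choices:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for your real data (assumption: a 1-D numeric array)
data = np.random.default_rng(0).normal(loc=0, scale=1, size=2000)

aic = {}
for dist in (stats.norm, stats.expon, stats.gamma):
    params = dist.fit(data)                        # maximum-likelihood estimates
    log_like = np.sum(dist.logpdf(data, *params))  # ln(L) under the fitted params
    aic[dist.name] = 2 * len(params) - 2 * log_like  # AIC = 2k - 2 ln(L); lower is better

best = min(aic, key=aic.get)
print(best, aic)
```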
You can use functions in scipy.stats for this.\n", "\n", "- Evaluate the Fit: Compare the goodness of fit for each distribution using statistical tests or information criteria like the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC).\n", "\n", "- Visual Inspection: It might also be helpful to plot the empirical data and the fitted probability density functions together to visually assess the fit.\n", "\n", "There's also a Python module called **Fitter** that might be helpful\n", "- Note: It could use a few more common distributions\n", "- Upvote this: https://github.com/cokelaer/fitter/issues/76" ] }, { "cell_type": "code", "execution_count": 95, "id": "fea63b62", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from scipy.stats import gamma\n", "from fitter import Fitter\n", "\n", "# Let's try two distributions\n", "gaussian = np.random.normal(loc=0, scale=1, size=5000)\n", "gamma_data = gamma.rvs(a=1, scale=1, size=5000)\n", "\n", "print(\"Gaussian Data\")\n", "f = Fitter(gaussian, distributions=\"common\")\n", "f.fit()\n", "f.summary()" ] }, { "cell_type": "code", "execution_count": 96, "id": "5be7fff4", "metadata": {}, "outputs": [], "source": [ "print(\"Gamma Data\")\n", "f = Fitter(gamma_data, distributions=\"common\")\n", "f.fit()\n", "f.summary()" ] }, { "cell_type": "markdown", "id": "2358b668", "metadata": {}, "source": [ "# Wrap up: Exploring Outliers with SageWorks\n", "\n", "\n", "\n", "This notebook used the SageWorks Science Toolkit to explore a range of data distributions and compare the IQR and standard deviation approaches to computing outliers. 
\n", "\n", "SageWorks made it easy:\n", "- Visibility into AWS services for every step of the process.\n", "- Managed the complexity of organizing the data and populating the AWS services.\n", "- Provided an easy to use API to perform Transformations and inspect Artifacts.\n", "\n", "Using SageWorks will minimizize the time and manpower needed to incorporate AWS ML into your organization. If your company would like to be a SageWorks Alpha Tester, contact us at [sageworks@supercowpowers.com](mailto:sageworks@supercowpowers.com)." ] }, { "cell_type": "markdown", "id": "3db353c5", "metadata": {}, "source": [ "# Helper Methods" ] }, { "cell_type": "code", "execution_count": 97, "id": "d9c6c579", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from scipy.stats import nbinom, norm, gamma, poisson\n", "import matplotlib.pyplot as plt\n", "\n", "\n", "def zero_inflated_poisson(zero_inflation, size):\n", " psi=zero_inflation # Percent of zero measurements\n", " lam=5 # Increasing lam will fatten the tail\n", " pois = np.random.poisson(lam=lam, size=size) \n", " zeros = np.random.binomial(n=1, p=psi, size=size) \n", " zero_poisson = pois * (1 - zeros)\n", " return zero_poisson\n", "\n", " # Tail bumps\n", " #tail_bumps = [np.random.normal(loc=15, scale=1, size=5), np.random.normal(loc=25, scale=1, size=5)]\n", " #return np.concatenate([zero_poisson] + tail_bumps)\n", "\n", "def negative_binomial(size):\n", " # Play around with these values\n", " r = 1.0 # Decreasing this will lengthen the tail\n", " p = 0.05 # Increasing this will put more data near zero\n", " return nbinom.rvs(r, p, size=size)" ] }, { "cell_type": "code", "execution_count": 98, "id": "5108a463", "metadata": {}, "outputs": [], "source": [ "def generate_data(dist_type, size):\n", " outlier_1 = np.random.normal(loc=30, scale=2, size=int(0.05*size))\n", " outlier_2 = np.random.normal(loc=40, scale=2, size=int(0.05*size))\n", " outlier_3 = np.random.normal(loc=50, scale=2, 
size=int(0.05*size))\n", " if dist_type == 'normal':\n", " label = \"Guassian/Normal Distribution\" \n", " dist = np.random.normal(loc=0, scale=1, size=size)\n", " return label, dist\n", " elif dist_type == 'skewed':\n", " label = \"Normal Distribution with Left Skew\"\n", " dist = np.random.normal(loc=0, scale=2, size=int(0.7*size))\n", " skew = np.random.normal(loc=5, scale=4, size=int(0.3*size))\n", " return label, np.concatenate([dist, skew])\n", " elif dist_type == 'outliers':\n", " label = \"Normal Distribution with Outliers\"\n", " dist = np.random.normal(loc=0, scale=2, size=int(0.9*size))\n", " return label, np.concatenate([dist, outlier_1])\n", " elif dist_type == 'outliers_2':\n", " label = \"Normal Distribution with two Outliers\"\n", " dist = np.random.normal(loc=0, scale=2, size=int(0.9*size))\n", " return label, np.concatenate([dist, outlier_1, outlier_2])\n", " elif dist_type == 'outliers_3':\n", " label = \"Normal Distribution with three Outliers\"\n", " dist = np.random.normal(loc=0, scale=2, size=int(0.9*size))\n", " return label, np.concatenate([dist, outlier_1, outlier_2, outlier_3])\n", " elif dist_type == 'bimodal':\n", " label = \"Bimodal Distribution\"\n", " mode_size = int(size/2)\n", " first_mode = np.random.normal(loc=-25, scale=10, size=mode_size)\n", " second_mode = np.random.normal(loc=25, scale=10, size=mode_size)\n", " return label, np.concatenate([first_mode, second_mode])\n", " elif dist_type == 'poisson':\n", " label = \"Poisson Distribution (Discrete)\"\n", " return label, zero_inflated_poisson(0.0, size)\n", " elif dist_type == 'zip':\n", " label = \"Zero Inflated Poisson Distribution (Discrete)\" \n", " return label, zero_inflated_poisson(0.05, size)\n", " elif dist_type == 'negative_binom':\n", " label = \"Negative Binomial Distribution\"\n", " return label, negative_binomial(size)\n", " elif dist_type == 'gamma':\n", " label = \"Gamma Distribution\" \n", " return label, gamma.rvs(a=1, scale=1, size=size)\n", " elif dist_type 
== 'gamma_smoosh':\n", "        label = \"Gamma Distribution Heavy 0 Skew\"\n", "        return label, gamma.rvs(a=0.5, scale=10, size=size)\n", "    else:\n", "        raise ValueError(f\"Unknown dist_type: {dist_type}\")\n", "\n", "\n", "def detect_outliers_stddev(data, sigma=3):\n", "    mean, std_dev = np.mean(data), np.std(data)\n", "    cut_off = std_dev * sigma\n", "    lower, upper = mean - cut_off, mean + cut_off\n", "    return lower, upper\n", "\n", "\n", "def detect_outliers_iqr(data, scale=1.72):\n", "    # Note: scale=1.72 puts the bounds at roughly +/- 3 sigma for normal data (IQR ~ 1.35 sigma)\n", "    q25, q75 = np.percentile(data, 25), np.percentile(data, 75)\n", "    iqr = q75 - q25\n", "    cut_off = iqr * scale\n", "    lower, upper = q25 - cut_off, q75 + cut_off\n", "    return lower, upper\n", "\n", "\n", "def plot_data(dist_type):\n", "    size = 5000\n", "    label, data = generate_data(dist_type, size)\n", "    lower_stddev, upper_stddev = detect_outliers_stddev(data)\n", "    lower_iqr, upper_iqr = detect_outliers_iqr(data)\n", "\n", "    plt.figure(figsize=(10, 3))\n", "    plt.hist(data, bins=50, alpha=1.0, label='Data')\n", "    plt.axvline(lower_stddev, color='r', linestyle='--', label='Std Dev Lower Bound')\n", "    plt.axvline(upper_stddev, color='r', linestyle='--', label='Std Dev Upper Bound')\n", "    plt.axvline(lower_iqr, color='g', linestyle='--', label='IQR Lower Bound')\n", "    plt.axvline(upper_iqr, color='g', linestyle='--', label='IQR Upper Bound')\n", "    plt.legend()\n", "    plt.title(label)\n", "    plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "fa6fc530", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 5 }