{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "DQ_birthdates_advanced_stats_v1.0.ipynb\n", "\n", "**Analysis approach of the data quality of dates using z-score (with Python)**" ], "metadata": { "id": "ZC8Y0xZhw2j9" } }, { "cell_type": "markdown", "source": [ "As a continuation of the article *Analysis approach of the data quality of dates using basic statistical methods (with Python)*, we are going to use the z-score, a statistical measure based on the standard deviation.\n", "\n", "**Z-score**\n", "\n", "After importing and exploring the dates-of-birth dataset (from the related article), we need to transform all dates to numbers: each date becomes both a plain integer (YYYYMMDD) and an age in years counted from today.\n", "\n", "Please note that we previously reported a potential issue with the date 01/01/2000 (mm/dd/YYYY), possibly due to a technical error or the use of that date as a dummy value, so we will again remove these dates for this analysis."
], "metadata": { "id": "qpMUDYkLxB4t" } }, { "cell_type": "code", "source": [ "import pandas as pd\n", "from datetime import date\n", "\n", "def years_from_now(d):\n", "    # Age in whole years as of today; negative for birthdates in the future\n", "    today = date.today()\n", "    # Subtract one year if this year's birthday has not happened yet\n", "    return today.year - d.year - ((today.month, today.day) < (d.month, d.day))\n", "\n", "# Import the birthdates dataset from GitHub\n", "url = \"https://raw.githubusercontent.com/mabrotons/datasets/master/birthdates.csv\"\n", "\n", "df = pd.read_csv(url, index_col=0, parse_dates=['birthdates'])\n", "\n", "# Transform dates to plain integers (YYYYMMDD)\n", "df['birthdates_num'] = [int(d.strftime(\"%Y%m%d\")) for d in df['birthdates']]\n", "# Remove the suspected dummy date 01/01/2000; copy() avoids pandas' SettingWithCopyWarning below\n", "df = df.loc[df['birthdates_num'] != 20000101].copy()\n", "\n", "# Transform dates to ages in years\n", "df['ages'] = [years_from_now(d) for d in df['birthdates']]" ], "metadata": { "id": "sQHn2cRVxHX5" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "We will build a couple of plots of the birthdates to get a first view of the dataset. Both plots show the same data, mirrored (a later birth date means a lower age), so we have to decide which representation will be most useful for the analysis."
], "metadata": { "id": "03XxFY6AvUch" } }, { "cell_type": "code", "source": [ "import matplotlib.pyplot as plt\n", "from matplotlib import ticker\n", "\n", "fig, axes = plt.subplots(1, 2, figsize=(20, 5))\n", "\n", "dates_num = df['birthdates_num']\n", "ages = df['ages']\n", "\n", "axes[0].hist(dates_num, bins=50, edgecolor='black')\n", "axes[1].hist(ages, bins=50, edgecolor='black')\n", "# Note: plt.xticks only affects the current (last) subplot\n", "plt.xticks(rotation=30)\n", "\n", "# Format the x-tick labels of the first subplot as years (YYYYMMDD / 10000, rounded)\n", "axes[0].xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: format(x / 10000, '1.0f')))\n", "axes[0].set_title('Dates (num format)')\n", "axes[1].set_title('Ages (from now)')\n", "\n", "plt.show()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 323 }, "id": "zFtR17L6vVdq", "outputId": "7ca072db-a9ec-4582-97f6-2e61491b241f" }, "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "[Figure: two histograms, 'Dates (num format)' and 'Ages (from now)']" ] }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "markdown", "source": [ "Now it's time to calculate the standard deviation to see how the values are spread around the mean and to detect possible outliers: outlier candidates lie multiple standard deviations away from the mean.\n", "As the computed means and standard deviations show, it's easier to work with dates transformed to ages than with dates transformed to plain numbers." ], "metadata": { "id": "gcxvOeIM3SLo" } }, { "cell_type": "code", "source": [ "std_birthdates_num = df['birthdates_num'].std()\n", "mean_birthdates_num = df['birthdates_num'].mean()\n", "print(\"Mean of birthdates_num: \" + str(mean_birthdates_num))\n", "print(\"Standard deviation of birthdates_num: \" + str(std_birthdates_num))\n", "\n", "std_ages = df['ages'].std()\n", "mean_ages = df['ages'].mean()\n", "print(\"\\nMean of ages: \" + str(mean_ages))\n", "print(\"Standard deviation of ages: \" + str(std_ages))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4RLNZ9x03tN1", "outputId": "f991e995-d80d-434e-8330-33e2ba9f6fd8" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Mean of birthdates_num: 19681984.715186547\n", "Standard deviation of birthdates_num: 336693.98125531274\n", "\n", "Mean of ages: 53.895953757225435\n", "Standard deviation of ages: 33.67419358794017\n" ] } ] }, { "cell_type": "markdown", "source": [ "Now, taking the ages as the easier representation, let's calculate the z-score:\n", "\n", "z = (x - μ) / σ\n", "\n", "where z is the score, x is the value being scored, μ is the mean and σ is the standard deviation.\n", "The z-score measures how many standard deviations a value lies from the mean, and therefore how unusual that value is within the distribution.\n", "The z-score works best with data that follow a normal distribution, because in a normal 
distribution over 99% of values fall within 3 standard deviations of the mean. With that in mind, we can use the absolute value of the z-score as a rule of thumb:\n", "- if |z| is at most 1, the value is probably normal\n", "- if |z| is greater than 1 and at most 3, the value could be an error\n", "- if |z| is greater than 3, the value is very likely an error" ], "metadata": { "id": "8rFVJx6n-WhT" } }, { "cell_type": "code", "source": [ "df['zscore'] = (df['ages'] - mean_ages) / std_ages\n", "\n", "plt.figure(figsize=(15, 5))\n", "\n", "# Split the ages by the absolute z-score: good (<= 1), regular (1 to 3), bad (> 3)\n", "abs_z = df['zscore'].abs()\n", "good_ages = df.loc[abs_z <= 1, 'ages']\n", "regular_ages = df.loc[(abs_z > 1) & (abs_z <= 3), 'ages']\n", "bad_ages = df.loc[abs_z > 3, 'ages']\n", "\n", "plt.hist([good_ages, regular_ages, bad_ages], color=['Green', 'Orange', 'Red'], label=['good', 'regular', 'bad'], edgecolor='black', bins=60, histtype='barstacked')\n", "\n", "# Add a vertical line at the mean age\n", "plt.axvline(x=mean_ages, color='blue', linewidth=3, label='mean')\n", "\n", "plt.title(\"Ages\")\n", "plt.legend()\n", "plt.show()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 425 }, "id": "39KesNitEt0r", "outputId": "e333bd54-b8e4-4c32-c7c3-d634c2849c78" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py:3208: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. 
If you meant to do this, you must specify 'dtype=object' when creating the ndarray.\n", " return asarray(a).size\n", "/usr/local/lib/python3.8/dist-packages/matplotlib/cbook/__init__.py:1376: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.\n", " X = np.atleast_1d(X.T if isinstance(X, np.ndarray) else np.asarray(X))\n" ] }, { "output_type": "display_data", "data": { "text/plain": [ "[Figure: stacked histogram 'Ages' with good, regular and bad values and a vertical line at the mean]" ] }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "markdown", "source": [ "Another way to calculate the z-score is the zscore function from scipy.stats. Note that it uses the population standard deviation (ddof=0) by default, whereas the pandas std() used above computes the sample standard deviation (ddof=1), so the values differ slightly:\n", "\n", "    import scipy.stats as stats\n", "    df['zscore'] = stats.zscore(df['ages'])\n", "\n", "Now we can print the ten most extreme values at each end of the z-score ranking." ], "metadata": { "id": "SkNYd0H_i6RP" } }, { "cell_type": "code", "source": [ "sorted_df = df.sort_values('zscore')\n", "print(\"Top 10 left: \")\n", "print(sorted_df.head(10))\n", "\n", "print(\"\\nTop 10 right: \")\n", "print(sorted_df.tail(10))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "KdnDPj3BhRgS", "outputId": "d50041be-c488-410c-fb26-7f2241d6a8d5" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Top 10 left: \n", "              ids  birthdates  birthdates_num  ages     zscore\n", "1902  C0000001903  2029-10-15        20291015    -7  -1.808386\n", "1901  C0000001902  2029-01-21        20290121    -7  -1.808386\n", "1900  C0000001901  2028-03-10        20280310    -6  -1.778690\n", "1540  C0000001541  2024-09-23        20240923    -2  -1.659905\n", "1284  C0000001285  2024-10-27        20241027    -2  -1.659905\n", "1028  C0000001029  2023-04-13        20230413    -1  -1.630208\n", "1564  C0000001565  2023-03-03        20230303    -1  -1.630208\n", "1675  C0000001676  2023-03-22        20230322    -1  -1.630208\n", "1703  C0000001704  2023-05-27        20230527    -1  -1.630208\n", "1667  C0000001668  2023-11-21        20231121    -1  -1.630208\n", "\n", "Top 10 right: \n", "              ids  birthdates  birthdates_num  ages     zscore\n", "1894  C0000001895  1808-10-03        18081003   214   4.754503\n", "1859  C0000001860  1808-04-01        18080401   214   4.754503\n", "1856  C0000001857  1807-01-19        18070119   215   4.784199\n", "1857  C0000001858  1806-01-10        18060110   216   4.813895\n", "1893  C0000001894  1806-12-10        18061210   216   4.813895\n", "1891  C0000001892  1805-01-28        18050128   217   4.843592\n", "1851  C0000001852  1803-04-27        18030427   219   4.902984\n", "1873  C0000001874  1802-11-22        18021122   220   4.932681\n", "1860  C0000001861  1801-12-29        18011229   221 
4.962377\n", "1898  C0000001899  1801-09-03        18010903   221   4.962377\n" ] } ] }, { "cell_type": "markdown", "source": [ "**Isolation Forest**\n", "\n", "Isolation Forest is an unsupervised learning algorithm that identifies possible anomalies by isolating outliers in a dataset. Like the Random Forest classification and regression algorithm that inspired it, it is built on an ensemble of decision trees.\n", "\n", "First, we are going to define and fit the model. We instantiate IsolationForest with the following three parameters:\n", "- n_estimators: number of base estimators (trees) in the ensemble. It's optional and the default value is 100.\n", "- max_samples: number of samples used to train each base estimator. The default value is 'auto', which means max_samples=min(256, n_samples).\n", "- contamination: expected proportion of outliers in the dataset. The default value is 'auto', which sets the threshold as in the original Isolation Forest paper.\n" ], "metadata": { "id": "Hqkstc-movRW" } }, { "cell_type": "code", "source": [ "from sklearn.ensemble import IsolationForest\n", "\n", "# 1000 trees; contamination=0.1 means about 10% of the rows will be flagged as outliers\n", "model = IsolationForest(n_estimators=1000, max_samples='auto', contamination=0.1)\n", "print(model.get_params())\n", "\n", "model.fit(df[['ages']])" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "OzaRC64uXT-0", "outputId": "a6c4980f-55fa-47ac-c636-4913a4adc89d" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "{'bootstrap': False, 'contamination': 0.1, 'max_features': 1.0, 'max_samples': 'auto', 'n_estimators': 1000, 'n_jobs': None, 'random_state': None, 'verbose': 0, 'warm_start': False}\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.8/dist-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but IsolationForest was fitted with feature names\n", " warnings.warn(\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "IsolationForest(contamination=0.1, n_estimators=1000)" ] }, "metadata": {}, "execution_count": 32 } ] }, { "cell_type": "markdown", "source": [ "Once the model is defined and fitted, the cell output shows the fitted IsolationForest instance.\n", "\n", "Now we will create two new columns from the decision function and the prediction:\n", "\n", "- decision_function(): average anomaly score of each sample across the base estimators. The lower the score, the more abnormal the sample; negative scores correspond to outliers.\n", "- predict(): tells whether a particular sample is an outlier (-1) or not (1).\n", "\n", "To show the results in a plot, we split the ages by the anomaly_score value." ], "metadata": { "id": "QCtBeEhxwWoj" } }, { "cell_type": "code", "source": [ "df['scores'] = model.decision_function(df[['ages']])\n", "df['anomaly_score'] = model.predict(df[['ages']])\n", "\n", "# predict() returns 1 for inliers and -1 for outliers\n", "ok = df[df['anomaly_score'] == 1]\n", "ko = df[df['anomaly_score'] == -1]\n", "\n", "plt.figure(figsize=(15, 5))\n", "\n", "plt.hist([ok['ages'], ko['ages']], color=['Green', 'Red'], label=['oks', 'kos'], edgecolor='black', bins=60, histtype='barstacked')\n", "\n", "plt.title(\"Ages\")\n", "plt.legend()\n", "plt.show()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 425 }, "id": "reWg5tpWwcoX", "outputId": "e1f44a58-4817-4aa2-e881-6606c99052dd" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py:3208: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. 
If you meant to do this, you must specify 'dtype=object' when creating the ndarray.\n", " return asarray(a).size\n", "/usr/local/lib/python3.8/dist-packages/matplotlib/cbook/__init__.py:1376: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.\n", " X = np.atleast_1d(X.T if isinstance(X, np.ndarray) else np.asarray(X))\n" ] }, { "output_type": "display_data", "data": { "text/plain": [ "[Figure: stacked histogram 'Ages' with oks in green and kos in red]" ] }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "markdown", "source": [ "Now we can print the top ten outliers detected by the Isolation Forest algorithm. They largely agree with the highest z-scores obtained earlier: eight of the ten rows also appear in the z-score top ten." ], "metadata": { "id": "Il4JJfpb0Y8w" } }, { "cell_type": "code", "source": [ "sorted_df = df.sort_values('scores')\n", "print(\"Top 10 left: \")\n", "print(sorted_df.head(10))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "eX1bBGOD0mi2", "outputId": "30f07fd0-6ac9-412d-f36a-6b2ee1090f1f" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Top 10 left: \n", "              ids  birthdates  birthdates_num  ages    zscore     scores \\\n", "1898  C0000001899  1801-09-03        18010903   221  4.962377  -0.206648 \n", "1860  C0000001861  1801-12-29        18011229   221  4.962377  -0.206648 \n", "1873  C0000001874  1802-11-22        18021122   220  4.932681  -0.204303 \n", "1851  C0000001852  1803-04-27        18030427   219  4.902984  -0.201246 \n", "1891  C0000001892  1805-01-28        18050128   217  4.843592  -0.195308 \n", "1857  C0000001858  1806-01-10        18060110   216  4.813895  -0.192998 \n", "1893  C0000001894  1806-12-10        18061210   216  4.813895  -0.192998 \n", "1856  C0000001857  1807-01-19        18070119   215  4.784199  -0.189079 \n", "1877  C0000001878  1849-05-16        18490516   173  3.536953  -0.186404 \n", "1865  C0000001866  1849-08-14        18490814   173  3.536953  -0.186404 \n", "\n", "      anomaly_score \n", "1898             -1 \n", "1860             -1 \n", "1873             -1 \n", "1851             -1 \n", "1891             -1 \n", "1857             -1 \n", "1893             -1 \n", "1856             -1 \n", "1877             -1 \n", "1865             -1 \n" ] } ] }, { "cell_type": "markdown", "source": [ "**Conclusion**\n", "\n", "The z-score and the Isolation Forest algorithm are two easy ways to identify possible anomalies in a dataset, both by assigning a score to each value and by visualizing the results." ], "metadata": { "id": "98F0fR_Nv7lk" } } ] }