{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Extra 2.1 - Unbalanced Data - Application 1: ProvStore Documents\n", "\n", "Identifying owners of provenance documents from their provenance network metrics.\n", "\n", "In this notebook, we compare the classification accuracy on the **unbalanced** (original) ProvStore dataset with that on a **balanced** version of the same dataset.\n", "\n", "* **Goal**: To determine if the provenance network analytics method can identify the owner of a provenance document from its provenance network metrics.\n", "* **Training data**: To ensure that there are sufficient samples to represent a user's provenance documents in the training phase, we limit our experiment to users who have at least 20 documents. There are fourteen such users (the authors were excluded to avoid bias), whom we name $u_{1}, u_{2}, \\ldots, u_{14}$. Their numbers of documents range between 21 and 6,745, and the data set contains 13,870 documents in total.\n", "* **Classification labels**: $\\mathcal{L} = \\left\\{ u_1, u_2, \\ldots, u_{14} \\right\\}$, where $l_{x} = u_i$ if the provenance document $x$ belongs to user $u_i$. Hence, there are 14 labels in total.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading data\n", "For each provenance document, we calculate the 22 provenance network metrics. The dataset provided contains those metric values for 13,870 provenance documents along with the owner identifier (i.e. $u_{1}, u_{2}, \\ldots, u_{14}$)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>label</th><th>entities</th><th>agents</th><th>activities</th><th>nodes</th><th>edges</th><th>diameter</th><th>assortativity</th><th>acc</th><th>acc_e</th><th>...</th><th>mfd_e_a</th><th>mfd_e_ag</th><th>mfd_a_e</th><th>mfd_a_a</th><th>mfd_a_ag</th><th>mfd_ag_e</th><th>mfd_ag_a</th><th>mfd_ag_ag</th><th>mfd_der</th><th>powerlaw_alpha</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>u_3</td><td>17</td><td>5</td><td>9</td><td>31</td><td>49</td><td>6</td><td>-0.196362</td><td>0.444709</td><td>0.466667</td><td>...</td><td>5</td><td>8</td><td>4</td><td>2</td><td>5</td><td>0</td><td>0</td><td>0</td><td>3</td><td>-1.0</td></tr>\n",
"<tr><th>1</th><td>u_2</td><td>7</td><td>0</td><td>2</td><td>9</td><td>0</td><td>-1</td><td>-1.000000</td><td>0.000000</td><td>0.000000</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>-1</td><td>-1.0</td></tr>\n",
"<tr><th>2</th><td>u_2</td><td>7</td><td>0</td><td>2</td><td>9</td><td>0</td><td>-1</td><td>-1.000000</td><td>0.000000</td><td>0.000000</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>-1</td><td>-1.0</td></tr>\n",
"<tr><th>3</th><td>u_2</td><td>7</td><td>0</td><td>2</td><td>9</td><td>0</td><td>-1</td><td>-1.000000</td><td>0.000000</td><td>0.000000</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>-1</td><td>-1.0</td></tr>\n",
"<tr><th>4</th><td>u_2</td><td>7</td><td>0</td><td>2</td><td>9</td><td>0</td><td>-1</td><td>-1.000000</td><td>0.000000</td><td>0.000000</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>-1</td><td>-1.0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows × 23 columns</p>
" ], "text/plain": [ " label entities agents activities nodes edges diameter assortativity \\\n", "0 u_3 17 5 9 31 49 6 -0.196362 \n", "1 u_2 7 0 2 9 0 -1 -1.000000 \n", "2 u_2 7 0 2 9 0 -1 -1.000000 \n", "3 u_2 7 0 2 9 0 -1 -1.000000 \n", "4 u_2 7 0 2 9 0 -1 -1.000000 \n", "\n", " acc acc_e ... mfd_e_a mfd_e_ag mfd_a_e mfd_a_a \\\n", "0 0.444709 0.466667 ... 5 8 4 2 \n", "1 0.000000 0.000000 ... 0 0 0 0 \n", "2 0.000000 0.000000 ... 0 0 0 0 \n", "3 0.000000 0.000000 ... 0 0 0 0 \n", "4 0.000000 0.000000 ... 0 0 0 0 \n", "\n", " mfd_a_ag mfd_ag_e mfd_ag_a mfd_ag_ag mfd_der powerlaw_alpha \n", "0 5 0 0 0 3 -1.0 \n", "1 0 0 0 0 -1 -1.0 \n", "2 0 0 0 0 -1 -1.0 \n", "3 0 0 0 0 -1 -1.0 \n", "4 0 0 0 0 -1 -1.0 \n", "\n", "[5 rows x 23 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"provstore/data.csv\")\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>entities</th><th>agents</th><th>activities</th><th>nodes</th><th>edges</th><th>diameter</th><th>assortativity</th><th>acc</th><th>acc_e</th><th>acc_a</th><th>...</th><th>mfd_e_a</th><th>mfd_e_ag</th><th>mfd_a_e</th><th>mfd_a_a</th><th>mfd_a_ag</th><th>mfd_ag_e</th><th>mfd_ag_a</th><th>mfd_ag_ag</th><th>mfd_der</th><th>powerlaw_alpha</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>count</th><td>13870.000000</td><td>13870.00000</td><td>13870.000000</td><td>13870.000000</td><td>13870.000000</td><td>13870.000000</td><td>13870.000000</td><td>13870.000000</td><td>13870.000000</td><td>13870.000000</td><td>...</td><td>13870.000000</td><td>13870.000000</td><td>13870.000000</td><td>13870.000000</td><td>13870.000000</td><td>13870.000000</td><td>13870.000000</td><td>13870.000000</td><td>13870.000000</td><td>13870.000000</td></tr>\n",
"<tr><th>mean</th><td>9.913338</td><td>2.08695</td><td>1.836193</td><td>13.836482</td><td>19.212689</td><td>0.868926</td><td>-0.628690</td><td>0.347835</td><td>0.341142</td><td>0.323606</td><td>...</td><td>1.312761</td><td>1.754939</td><td>1.073540</td><td>0.709229</td><td>0.752127</td><td>0.017448</td><td>0.014924</td><td>0.030353</td><td>2.185436</td><td>-0.916534</td></tr>\n",
"<tr><th>std</th><td>28.931915</td><td>2.27716</td><td>18.570823</td><td>43.352894</td><td>134.640366</td><td>1.943905</td><td>0.376718</td><td>0.394531</td><td>0.409577</td><td>0.395727</td><td>...</td><td>1.769329</td><td>1.314874</td><td>1.622606</td><td>1.343363</td><td>1.077628</td><td>0.200902</td><td>0.152351</td><td>0.209759</td><td>5.211118</td><td>0.612437</td></tr>\n",
"<tr><th>min</th><td>0.000000</td><td>0.00000</td><td>0.000000</td><td>1.000000</td><td>0.000000</td><td>-1.000000</td><td>-1.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>...</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>-1.000000</td><td>-1.000000</td></tr>\n",
"<tr><th>25%</th><td>2.000000</td><td>1.00000</td><td>0.000000</td><td>5.000000</td><td>5.000000</td><td>-1.000000</td><td>-1.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>...</td><td>0.000000</td><td>1.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>1.000000</td><td>-1.000000</td></tr>\n",
"<tr><th>50%</th><td>4.000000</td><td>1.00000</td><td>1.000000</td><td>7.000000</td><td>9.000000</td><td>1.000000</td><td>-0.592949</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>...</td><td>1.000000</td><td>2.000000</td><td>0.000000</td><td>0.000000</td><td>1.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>2.000000</td><td>-1.000000</td></tr>\n",
"<tr><th>75%</th><td>5.000000</td><td>3.00000</td><td>2.000000</td><td>10.000000</td><td>13.000000</td><td>2.000000</td><td>-0.350000</td><td>0.674147</td><td>0.750000</td><td>0.666667</td><td>...</td><td>2.000000</td><td>2.000000</td><td>2.000000</td><td>1.000000</td><td>1.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>2.000000</td><td>-1.000000</td></tr>\n",
"<tr><th>max</th><td>1188.000000</td><td>51.00000</td><td>1580.000000</td><td>2776.000000</td><td>6853.000000</td><td>10.000000</td><td>1.000000</td><td>1.000000</td><td>1.000000</td><td>1.000000</td><td>...</td><td>52.000000</td><td>44.000000</td><td>51.000000</td><td>52.000000</td><td>43.000000</td><td>4.000000</td><td>5.000000</td><td>6.000000</td><td>303.000000</td><td>8.184413</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>8 rows × 22 columns</p>
" ], "text/plain": [ " entities agents activities nodes edges \\\n", "count 13870.000000 13870.00000 13870.000000 13870.000000 13870.000000 \n", "mean 9.913338 2.08695 1.836193 13.836482 19.212689 \n", "std 28.931915 2.27716 18.570823 43.352894 134.640366 \n", "min 0.000000 0.00000 0.000000 1.000000 0.000000 \n", "25% 2.000000 1.00000 0.000000 5.000000 5.000000 \n", "50% 4.000000 1.00000 1.000000 7.000000 9.000000 \n", "75% 5.000000 3.00000 2.000000 10.000000 13.000000 \n", "max 1188.000000 51.00000 1580.000000 2776.000000 6853.000000 \n", "\n", " diameter assortativity acc acc_e acc_a \\\n", "count 13870.000000 13870.000000 13870.000000 13870.000000 13870.000000 \n", "mean 0.868926 -0.628690 0.347835 0.341142 0.323606 \n", "std 1.943905 0.376718 0.394531 0.409577 0.395727 \n", "min -1.000000 -1.000000 0.000000 0.000000 0.000000 \n", "25% -1.000000 -1.000000 0.000000 0.000000 0.000000 \n", "50% 1.000000 -0.592949 0.000000 0.000000 0.000000 \n", "75% 2.000000 -0.350000 0.674147 0.750000 0.666667 \n", "max 10.000000 1.000000 1.000000 1.000000 1.000000 \n", "\n", " ... mfd_e_a mfd_e_ag mfd_a_e mfd_a_a \\\n", "count ... 13870.000000 13870.000000 13870.000000 13870.000000 \n", "mean ... 1.312761 1.754939 1.073540 0.709229 \n", "std ... 1.769329 1.314874 1.622606 1.343363 \n", "min ... 0.000000 0.000000 0.000000 0.000000 \n", "25% ... 0.000000 1.000000 0.000000 0.000000 \n", "50% ... 1.000000 2.000000 0.000000 0.000000 \n", "75% ... 2.000000 2.000000 2.000000 1.000000 \n", "max ... 
52.000000 44.000000 51.000000 52.000000 \n", "\n", " mfd_a_ag mfd_ag_e mfd_ag_a mfd_ag_ag mfd_der \\\n", "count 13870.000000 13870.000000 13870.000000 13870.000000 13870.000000 \n", "mean 0.752127 0.017448 0.014924 0.030353 2.185436 \n", "std 1.077628 0.200902 0.152351 0.209759 5.211118 \n", "min 0.000000 0.000000 0.000000 0.000000 -1.000000 \n", "25% 0.000000 0.000000 0.000000 0.000000 1.000000 \n", "50% 1.000000 0.000000 0.000000 0.000000 2.000000 \n", "75% 1.000000 0.000000 0.000000 0.000000 2.000000 \n", "max 43.000000 4.000000 5.000000 6.000000 303.000000 \n", "\n", " powerlaw_alpha \n", "count 13870.000000 \n", "mean -0.916534 \n", "std 0.612437 \n", "min -1.000000 \n", "25% -1.000000 \n", "50% -1.000000 \n", "75% -1.000000 \n", "max 8.184413 \n", "\n", "[8 rows x 22 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "u_3 6745\n", "u_8 4449\n", "u_5 1327\n", "u_2 487\n", "u_12 312\n", "u_14 150\n", "u_9 141\n", "u_6 71\n", "u_7 66\n", "u_4 34\n", "u_1 25\n", "u_11 21\n", "u_10 21\n", "u_13 21\n", "Name: label, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The number of each label in the dataset\n", "df.label.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification on unbalanced (original) data" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from analytics import test_classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Cross Validation tests**: We now run the cross validation tests on the dataset (df) using all the features (combined), only the generic network metrics (generic), and only the provenance-specific network metrics (provenance). 
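
As a rough illustration of what evaluating the three feature sets involves, here is a minimal, self-contained sketch. It is not the code behind `test_classification`: the nearest-centroid classifier, the 5-fold split, the synthetic data, and the assignment of columns to the generic and provenance sets are all assumptions made for the example.

```python
import random
from statistics import mean

# Hypothetical column split, for illustration only; the real assignment of the
# 22 metrics to the generic and provenance sets is defined in the analytics module.
GENERIC = ['nodes', 'edges']
PROVENANCE = ['entities', 'agents']
FEATURE_SETS = {
    'combined': GENERIC + PROVENANCE,
    'generic': GENERIC,
    'provenance': PROVENANCE,
}


def sq_dist(a, b):
    '''Squared Euclidean distance between two equal-length vectors.'''
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_fold_accuracy(rows, labels, columns, k=5, seed=0):
    '''Mean accuracy of a nearest-centroid classifier over k shuffled folds.'''
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        test_idx = set(held_out)
        # Group the training vectors by label, using only the selected columns.
        grouped = {}
        for i in order:
            if i not in test_idx:
                grouped.setdefault(labels[i], []).append([rows[i][c] for c in columns])
        centroids = {lab: [mean(col) for col in zip(*vecs)] for lab, vecs in grouped.items()}
        correct = 0
        for i in held_out:
            x = [rows[i][c] for c in columns]
            predicted = min(centroids, key=lambda lab: sq_dist(x, centroids[lab]))
            correct += predicted == labels[i]
        scores.append(correct / len(held_out))
    return mean(scores)


# Synthetic stand-in for the ProvStore data: two made-up users, 40 documents each.
rng = random.Random(1)
rows, labels = [], []
for label, centre in (('u_a', 5.0), ('u_b', 9.0)):
    for _ in range(40):
        rows.append({name: rng.gauss(centre, 1.0) for name in FEATURE_SETS['combined']})
        labels.append(label)

accuracies = {name: k_fold_accuracy(rows, labels, cols) for name, cols in FEATURE_SETS.items()}
for name, acc in accuracies.items():
    print('Accuracy: {:.2%} <-- {}'.format(acc, name))
```

The same folds are reused for every feature set (the seed is fixed), so differences between the three accuracies come from the columns alone. The classifier here is only a stand-in for whatever `test_classification` uses.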
Please refer to [Cross Validation Code.ipynb](Cross%20Validation%20Code.ipynb) for a detailed description of the cross validation code." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 96.45% ±0.0209 <-- combined\n", "Accuracy: 95.36% ±0.0241 <-- generic\n", "Accuracy: 96.55% ±0.0209 <-- provenance\n" ] } ], "source": [ "results, importances = test_classification(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification on balanced data" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from analytics import balance_smote" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Balancing the data**\n", "\n", "With an unbalanced dataset like the one above, the resulting classifier will typically be skewed towards the majority labels. To mitigate this, we balance the dataset using the [SMOTE Oversampling Method](https://www.jair.org/media/953/live-953-2037-jair.pdf). After balancing, the dataset has 94,430 samples, which corresponds to 6,745 samples (the size of the largest class, $u_3$) for each of the 14 labels." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original data shapes: (13870, 22) (13870,)\n", "Balanced data shapes: (94430, 22) (94430,)\n" ] } ], "source": [ "df = balance_smote(df)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 98.14% ±0.0079 <-- combined\n", "Accuracy: 92.27% ±0.0159 <-- generic\n", "Accuracy: 98.13% ±0.0082 <-- provenance\n" ] } ], "source": [ "results_bal, importances_bal = test_classification(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Result**: The classifiers achieve higher accuracy on the balanced data when provenance-specific metrics are used, in either the combined or the provenance metric set (98.14% and 98.13%, against 96.45% and 96.55% on the original data). 
The classifier trained on the generic metric set, however, performs better on the original, unbalanced data (95.36% against 92.27%). A possible explanation is that some of the minority labels have more distinctive provenance-specific metrics than generic ones; when more such samples are introduced by the balancing process, the generic metrics alone cannot identify them as well, hence the lower accuracy." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }