{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Extra 2.1 - Unbalanced Data - Application 1: ProvStore Documents\n", "\n", "Identifying owners of provenance documents from their provenance network metrics.\n", "\n", "In this notebook, we compare the classification accuracy on the **unbalanced** (original) ProvStore dataset with that on a **balanced** ProvStore dataset.\n", "\n", "* **Goal**: To determine whether the provenance network analytics method can identify the owner of a provenance document from its provenance network metrics.\n", "* **Training data**: To ensure that there are sufficient samples to represent a user's provenance documents in the training phase, we limit our experiment to users who have at least 20 documents. There are fourteen such users (the authors were excluded to avoid bias), whom we name $u_{1}, u_{2}, \\ldots, u_{14}$. Their numbers of documents range between 21 and 6,745, and the data set contains 13,870 documents in total.\n", "* **Classification labels**: $\\mathcal{L} = \\left\\{ u_1, u_2, \\ldots, u_{14} \\right\\}$, where $l_{x} = u_i$ if the provenance document $x$ belongs to user $u_i$. Hence, there are 14 labels in total.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading data\n", "For each provenance document, we calculate the 22 provenance network metrics. The dataset provided contains those metric values for 13,870 provenance documents along with the owner identifier (i.e. $u_{1}, u_{2}, \\ldots, u_{14}$)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

|  | label | entities | agents | activities | nodes | edges | diameter | assortativity | acc | acc_e | ... | mfd_e_a | mfd_e_ag | mfd_a_e | mfd_a_a | mfd_a_ag | mfd_ag_e | mfd_ag_a | mfd_ag_ag | mfd_der | powerlaw_alpha |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | u_3 | 17 | 5 | 9 | 31 | 49 | 6 | -0.196362 | 0.444709 | 0.466667 | ... | 5 | 8 | 4 | 2 | 5 | 0 | 0 | 0 | 3 | -1.0 |
| 1 | u_2 | 7 | 0 | 2 | 9 | 0 | -1 | -1.000000 | 0.000000 | 0.000000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 | -1.0 |
| 2 | u_2 | 7 | 0 | 2 | 9 | 0 | -1 | -1.000000 | 0.000000 | 0.000000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 | -1.0 |
| 3 | u_2 | 7 | 0 | 2 | 9 | 0 | -1 | -1.000000 | 0.000000 | 0.000000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 | -1.0 |
| 4 | u_2 | 7 | 0 | 2 | 9 | 0 | -1 | -1.000000 | 0.000000 | 0.000000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 | -1.0 |

5 rows × 23 columns

|  | entities | agents | activities | nodes | edges | diameter | assortativity | acc | acc_e | acc_a | ... | mfd_e_a | mfd_e_ag | mfd_a_e | mfd_a_a | mfd_a_ag | mfd_ag_e | mfd_ag_a | mfd_ag_ag | mfd_der | powerlaw_alpha |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13870.000000 | 13870.00000 | 13870.000000 | 13870.000000 | 13870.000000 | 13870.000000 | 13870.000000 | 13870.000000 | 13870.000000 | 13870.000000 | ... | 13870.000000 | 13870.000000 | 13870.000000 | 13870.000000 | 13870.000000 | 13870.000000 | 13870.000000 | 13870.000000 | 13870.000000 | 13870.000000 |
| mean | 9.913338 | 2.08695 | 1.836193 | 13.836482 | 19.212689 | 0.868926 | -0.628690 | 0.347835 | 0.341142 | 0.323606 | ... | 1.312761 | 1.754939 | 1.073540 | 0.709229 | 0.752127 | 0.017448 | 0.014924 | 0.030353 | 2.185436 | -0.916534 |
| std | 28.931915 | 2.27716 | 18.570823 | 43.352894 | 134.640366 | 1.943905 | 0.376718 | 0.394531 | 0.409577 | 0.395727 | ... | 1.769329 | 1.314874 | 1.622606 | 1.343363 | 1.077628 | 0.200902 | 0.152351 | 0.209759 | 5.211118 | 0.612437 |
| min | 0.000000 | 0.00000 | 0.000000 | 1.000000 | 0.000000 | -1.000000 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 | -1.000000 |
| 25% | 2.000000 | 1.00000 | 0.000000 | 5.000000 | 5.000000 | -1.000000 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | -1.000000 |
| 50% | 4.000000 | 1.00000 | 1.000000 | 7.000000 | 9.000000 | 1.000000 | -0.592949 | 0.000000 | 0.000000 | 0.000000 | ... | 1.000000 | 2.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | -1.000000 |
| 75% | 5.000000 | 3.00000 | 2.000000 | 10.000000 | 13.000000 | 2.000000 | -0.350000 | 0.674147 | 0.750000 | 0.666667 | ... | 2.000000 | 2.000000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | -1.000000 |
| max | 1188.000000 | 51.00000 | 1580.000000 | 2776.000000 | 6853.000000 | 10.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 52.000000 | 44.000000 | 51.000000 | 52.000000 | 43.000000 | 4.000000 | 5.000000 | 6.000000 | 303.000000 | 8.184413 |

8 rows × 22 columns
\n", "