{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Extra 2.1 - Unbalanced Data - Application 1: ProvStore Documents\n",
    "\n",
    "Identifying owners of provenance documents from their provenance network metrics.\n",
    "\n",
    "In this notebook, we compared the classification accuracy on **unbalanced** (original) ProvStore dataset vs that on a **balanced** ProvStore dataset.\n",
    "\n",
    "* **Goal**: To determine if the provenance network analytics method can identify the owner of a provenance document from its provenance network metrics.\n",
    "* **Training data**: In order to ensure that there are sufficient samples to represent a user's provenance documents the Training phase, we limit our experiment to users who have at least 20 documents. There are fourteen such users (the authors were excluded to avoid bias), who we named $u_{1}, u_{2}, \\ldots, u_{14}$. Their numbers of documents range between 21 and 6,745, with the total number of documents in the data set is 13,870.\n",
    "* **Classification labels**: $\\mathcal{L} = \\left\\{ u_1, u_2, \\ldots, u_{14} \\right\\} $, where $l_{x} = u_i$ if the provenance document $x$ belongs to user $u_i$. Hence, there are 14 labels in total.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading data\n",
    "For each provenance document, we calculate the 22 provenance network metrics. The dataset provided contains those metrics values for 13,870 provenance documents along with the owner identifier (i.e. $u_{1}, u_{2}, \\ldots, u_{14}$)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>entities</th>\n",
       "      <th>agents</th>\n",
       "      <th>activities</th>\n",
       "      <th>nodes</th>\n",
       "      <th>edges</th>\n",
       "      <th>diameter</th>\n",
       "      <th>assortativity</th>\n",
       "      <th>acc</th>\n",
       "      <th>acc_e</th>\n",
       "      <th>...</th>\n",
       "      <th>mfd_e_a</th>\n",
       "      <th>mfd_e_ag</th>\n",
       "      <th>mfd_a_e</th>\n",
       "      <th>mfd_a_a</th>\n",
       "      <th>mfd_a_ag</th>\n",
       "      <th>mfd_ag_e</th>\n",
       "      <th>mfd_ag_a</th>\n",
       "      <th>mfd_ag_ag</th>\n",
       "      <th>mfd_der</th>\n",
       "      <th>powerlaw_alpha</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>u_3</td>\n",
       "      <td>17</td>\n",
       "      <td>5</td>\n",
       "      <td>9</td>\n",
       "      <td>31</td>\n",
       "      <td>49</td>\n",
       "      <td>6</td>\n",
       "      <td>-0.196362</td>\n",
       "      <td>0.444709</td>\n",
       "      <td>0.466667</td>\n",
       "      <td>...</td>\n",
       "      <td>5</td>\n",
       "      <td>8</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>-1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>u_2</td>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>u_2</td>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>u_2</td>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>u_2</td>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 23 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "  label  entities  agents  activities  nodes  edges  diameter  assortativity  \\\n",
       "0   u_3        17       5           9     31     49         6      -0.196362   \n",
       "1   u_2         7       0           2      9      0        -1      -1.000000   \n",
       "2   u_2         7       0           2      9      0        -1      -1.000000   \n",
       "3   u_2         7       0           2      9      0        -1      -1.000000   \n",
       "4   u_2         7       0           2      9      0        -1      -1.000000   \n",
       "\n",
       "        acc     acc_e       ...        mfd_e_a  mfd_e_ag  mfd_a_e  mfd_a_a  \\\n",
       "0  0.444709  0.466667       ...              5         8        4        2   \n",
       "1  0.000000  0.000000       ...              0         0        0        0   \n",
       "2  0.000000  0.000000       ...              0         0        0        0   \n",
       "3  0.000000  0.000000       ...              0         0        0        0   \n",
       "4  0.000000  0.000000       ...              0         0        0        0   \n",
       "\n",
       "   mfd_a_ag  mfd_ag_e  mfd_ag_a  mfd_ag_ag  mfd_der  powerlaw_alpha  \n",
       "0         5         0         0          0        3            -1.0  \n",
       "1         0         0         0          0       -1            -1.0  \n",
       "2         0         0         0          0       -1            -1.0  \n",
       "3         0         0         0          0       -1            -1.0  \n",
       "4         0         0         0          0       -1            -1.0  \n",
       "\n",
       "[5 rows x 23 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(\"provstore/data.csv\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>entities</th>\n",
       "      <th>agents</th>\n",
       "      <th>activities</th>\n",
       "      <th>nodes</th>\n",
       "      <th>edges</th>\n",
       "      <th>diameter</th>\n",
       "      <th>assortativity</th>\n",
       "      <th>acc</th>\n",
       "      <th>acc_e</th>\n",
       "      <th>acc_a</th>\n",
       "      <th>...</th>\n",
       "      <th>mfd_e_a</th>\n",
       "      <th>mfd_e_ag</th>\n",
       "      <th>mfd_a_e</th>\n",
       "      <th>mfd_a_a</th>\n",
       "      <th>mfd_a_ag</th>\n",
       "      <th>mfd_ag_e</th>\n",
       "      <th>mfd_ag_a</th>\n",
       "      <th>mfd_ag_ag</th>\n",
       "      <th>mfd_der</th>\n",
       "      <th>powerlaw_alpha</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>13870.000000</td>\n",
       "      <td>13870.00000</td>\n",
       "      <td>13870.000000</td>\n",
       "      <td>13870.000000</td>\n",
       "      <td>13870.000000</td>\n",
       "      <td>13870.000000</td>\n",
       "      <td>13870.000000</td>\n",
       "      <td>13870.000000</td>\n",
       "      <td>13870.000000</td>\n",
       "      <td>13870.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>13870.000000</td>\n",
       "      <td>13870.000000</td>\n",
       "      <td>13870.000000</td>\n",
       "      <td>13870.000000</td>\n",
       "      <td>13870.000000</td>\n",
       "      <td>13870.000000</td>\n",
       "      <td>13870.000000</td>\n",
       "      <td>13870.000000</td>\n",
       "      <td>13870.000000</td>\n",
       "      <td>13870.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>9.913338</td>\n",
       "      <td>2.08695</td>\n",
       "      <td>1.836193</td>\n",
       "      <td>13.836482</td>\n",
       "      <td>19.212689</td>\n",
       "      <td>0.868926</td>\n",
       "      <td>-0.628690</td>\n",
       "      <td>0.347835</td>\n",
       "      <td>0.341142</td>\n",
       "      <td>0.323606</td>\n",
       "      <td>...</td>\n",
       "      <td>1.312761</td>\n",
       "      <td>1.754939</td>\n",
       "      <td>1.073540</td>\n",
       "      <td>0.709229</td>\n",
       "      <td>0.752127</td>\n",
       "      <td>0.017448</td>\n",
       "      <td>0.014924</td>\n",
       "      <td>0.030353</td>\n",
       "      <td>2.185436</td>\n",
       "      <td>-0.916534</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>28.931915</td>\n",
       "      <td>2.27716</td>\n",
       "      <td>18.570823</td>\n",
       "      <td>43.352894</td>\n",
       "      <td>134.640366</td>\n",
       "      <td>1.943905</td>\n",
       "      <td>0.376718</td>\n",
       "      <td>0.394531</td>\n",
       "      <td>0.409577</td>\n",
       "      <td>0.395727</td>\n",
       "      <td>...</td>\n",
       "      <td>1.769329</td>\n",
       "      <td>1.314874</td>\n",
       "      <td>1.622606</td>\n",
       "      <td>1.343363</td>\n",
       "      <td>1.077628</td>\n",
       "      <td>0.200902</td>\n",
       "      <td>0.152351</td>\n",
       "      <td>0.209759</td>\n",
       "      <td>5.211118</td>\n",
       "      <td>0.612437</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.00000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>4.000000</td>\n",
       "      <td>1.00000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>7.000000</td>\n",
       "      <td>9.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>-0.592949</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>5.000000</td>\n",
       "      <td>3.00000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>13.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>-0.350000</td>\n",
       "      <td>0.674147</td>\n",
       "      <td>0.750000</td>\n",
       "      <td>0.666667</td>\n",
       "      <td>...</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>1188.000000</td>\n",
       "      <td>51.00000</td>\n",
       "      <td>1580.000000</td>\n",
       "      <td>2776.000000</td>\n",
       "      <td>6853.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>52.000000</td>\n",
       "      <td>44.000000</td>\n",
       "      <td>51.000000</td>\n",
       "      <td>52.000000</td>\n",
       "      <td>43.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>303.000000</td>\n",
       "      <td>8.184413</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>8 rows × 22 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "           entities       agents    activities         nodes         edges  \\\n",
       "count  13870.000000  13870.00000  13870.000000  13870.000000  13870.000000   \n",
       "mean       9.913338      2.08695      1.836193     13.836482     19.212689   \n",
       "std       28.931915      2.27716     18.570823     43.352894    134.640366   \n",
       "min        0.000000      0.00000      0.000000      1.000000      0.000000   \n",
       "25%        2.000000      1.00000      0.000000      5.000000      5.000000   \n",
       "50%        4.000000      1.00000      1.000000      7.000000      9.000000   \n",
       "75%        5.000000      3.00000      2.000000     10.000000     13.000000   \n",
       "max     1188.000000     51.00000   1580.000000   2776.000000   6853.000000   \n",
       "\n",
       "           diameter  assortativity           acc         acc_e         acc_a  \\\n",
       "count  13870.000000   13870.000000  13870.000000  13870.000000  13870.000000   \n",
       "mean       0.868926      -0.628690      0.347835      0.341142      0.323606   \n",
       "std        1.943905       0.376718      0.394531      0.409577      0.395727   \n",
       "min       -1.000000      -1.000000      0.000000      0.000000      0.000000   \n",
       "25%       -1.000000      -1.000000      0.000000      0.000000      0.000000   \n",
       "50%        1.000000      -0.592949      0.000000      0.000000      0.000000   \n",
       "75%        2.000000      -0.350000      0.674147      0.750000      0.666667   \n",
       "max       10.000000       1.000000      1.000000      1.000000      1.000000   \n",
       "\n",
       "            ...             mfd_e_a      mfd_e_ag       mfd_a_e       mfd_a_a  \\\n",
       "count       ...        13870.000000  13870.000000  13870.000000  13870.000000   \n",
       "mean        ...            1.312761      1.754939      1.073540      0.709229   \n",
       "std         ...            1.769329      1.314874      1.622606      1.343363   \n",
       "min         ...            0.000000      0.000000      0.000000      0.000000   \n",
       "25%         ...            0.000000      1.000000      0.000000      0.000000   \n",
       "50%         ...            1.000000      2.000000      0.000000      0.000000   \n",
       "75%         ...            2.000000      2.000000      2.000000      1.000000   \n",
       "max         ...           52.000000     44.000000     51.000000     52.000000   \n",
       "\n",
       "           mfd_a_ag      mfd_ag_e      mfd_ag_a     mfd_ag_ag       mfd_der  \\\n",
       "count  13870.000000  13870.000000  13870.000000  13870.000000  13870.000000   \n",
       "mean       0.752127      0.017448      0.014924      0.030353      2.185436   \n",
       "std        1.077628      0.200902      0.152351      0.209759      5.211118   \n",
       "min        0.000000      0.000000      0.000000      0.000000     -1.000000   \n",
       "25%        0.000000      0.000000      0.000000      0.000000      1.000000   \n",
       "50%        1.000000      0.000000      0.000000      0.000000      2.000000   \n",
       "75%        1.000000      0.000000      0.000000      0.000000      2.000000   \n",
       "max       43.000000      4.000000      5.000000      6.000000    303.000000   \n",
       "\n",
       "       powerlaw_alpha  \n",
       "count    13870.000000  \n",
       "mean        -0.916534  \n",
       "std          0.612437  \n",
       "min         -1.000000  \n",
       "25%         -1.000000  \n",
       "50%         -1.000000  \n",
       "75%         -1.000000  \n",
       "max          8.184413  \n",
       "\n",
       "[8 rows x 22 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "u_3     6745\n",
       "u_8     4449\n",
       "u_5     1327\n",
       "u_2      487\n",
       "u_12     312\n",
       "u_14     150\n",
       "u_9      141\n",
       "u_6       71\n",
       "u_7       66\n",
       "u_4       34\n",
       "u_1       25\n",
       "u_11      21\n",
       "u_10      21\n",
       "u_13      21\n",
       "Name: label, dtype: int64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The number of each label in the dataset\n",
    "df.label.value_counts()"
   ]
  },
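  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The label counts above are heavily skewed towards a few owners. As a rough reference point (a sketch added for illustration, not part of the original analysis), the snippet below computes the accuracy that a trivial classifier would reach by always predicting the largest class, and the ratio between the largest and smallest classes. The names `counts`, `majority_baseline` and `imbalance_ratio` are introduced here for illustration only.\n",
    "\n",
    "```python\n",
    "# Document counts per owner, copied from the value_counts() output above.\n",
    "counts = {\n",
    "    'u_3': 6745, 'u_8': 4449, 'u_5': 1327, 'u_2': 487, 'u_12': 312,\n",
    "    'u_14': 150, 'u_9': 141, 'u_6': 71, 'u_7': 66, 'u_4': 34,\n",
    "    'u_1': 25, 'u_11': 21, 'u_10': 21, 'u_13': 21,\n",
    "}\n",
    "\n",
    "total = sum(counts.values())      # all documents in the dataset\n",
    "largest = max(counts.values())    # size of the largest class (u_3)\n",
    "smallest = min(counts.values())   # size of the smallest classes\n",
    "\n",
    "# Accuracy of a trivial classifier that always predicts the largest class.\n",
    "majority_baseline = largest / total\n",
    "# Ratio between the largest and the smallest class.\n",
    "imbalance_ratio = largest / smallest\n",
    "\n",
    "print(f'Majority-class baseline accuracy: {majority_baseline:.2%}')\n",
    "print(f'Imbalance ratio (largest/smallest): {imbalance_ratio:.1f}')\n",
    "```\n",
    "\n",
    "Always predicting `u_3` would already be correct for about 48.6% of the documents, and the largest class is roughly 321 times the size of the smallest ones."
   ]
  },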
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Classification on unbalanced (original) data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from analytics import test_classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Cross Validation tests**: We now run the cross validation tests on the dataset (`df`) using all the features (`combined`), only the generic network metrics (`generic`), and only the provenance-specific network metrics (`provenance`). Please refer to [Cross Validation Code.ipynb](Cross%20Validation%20Code.ipynb) for the detailed description of the cross validation code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy: 96.45% ±0.0209 <-- combined\n",
      "Accuracy: 95.36% ±0.0241 <-- generic\n",
      "Accuracy: 96.55% ±0.0209 <-- provenance\n"
     ]
    }
   ],
   "source": [
    "results, importances = test_classification(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ## Classification on balanced data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from analytics import balance_smote"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Balancing the data**\n",
    "\n",
    "With an unbalanced like the above, the resulted trained classifier will typically be skewed towards the majority labels. In order to mitigate this, we balance the dataset using the [SMOTE Oversampling Method](https://www.jair.org/media/953/live-953-2037-jair.pdf)."
   ]
  },
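  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The core idea of SMOTE can be sketched in a few lines: a synthetic sample is placed at a random position on the line segment between a minority-class sample and one of its nearest neighbours from the same class. The code below is a simplified, pure-Python illustration of that idea only. It is not the implementation behind `balance_smote`, whose internals are not shown in this notebook, and the function names `smote_sample` and `oversample` are hypothetical.\n",
    "\n",
    "```python\n",
    "import random\n",
    "\n",
    "def smote_sample(minority, k=5, rng=random):\n",
    "    '''Create one synthetic point from a list of minority-class feature vectors.\n",
    "\n",
    "    Assumes the list holds at least two samples.\n",
    "    '''\n",
    "    # Pick a random minority sample as the starting point.\n",
    "    base = rng.choice(minority)\n",
    "    # Rank the other minority samples by squared Euclidean distance to it.\n",
    "    others = [p for p in minority if p is not base]\n",
    "    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))\n",
    "    # Choose one of the k nearest minority neighbours at random.\n",
    "    neighbour = rng.choice(others[:k])\n",
    "    # Interpolate at a random position between the two points.\n",
    "    gap = rng.random()\n",
    "    return [b + gap * (n - b) for b, n in zip(base, neighbour)]\n",
    "\n",
    "def oversample(minority, target_size, k=5, seed=0):\n",
    "    '''Append synthetic points until the class has target_size samples.'''\n",
    "    rng = random.Random(seed)\n",
    "    n_new = target_size - len(minority)\n",
    "    synthetic = [smote_sample(minority, k, rng) for _ in range(n_new)]\n",
    "    return minority + synthetic\n",
    "\n",
    "# Toy example: grow a class of 4 two-dimensional samples to 10 samples.\n",
    "minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]\n",
    "balanced = oversample(minority, target_size=10, k=3)\n",
    "print(len(balanced))\n",
    "```\n",
    "\n",
    "Because every synthetic point is an interpolation between two existing samples of the same class, the new samples stay within the region already covered by that class rather than being exact copies of existing rows."
   ]
  },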
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original data shapes: (13870, 22) (13870,)\n",
      "Balanced data shapes: (94430, 22) (94430,)\n"
     ]
    }
   ],
   "source": [
    "df = balance_smote(df)"
   ]
  },
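  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The balanced shape reported above is consistent with every one of the 14 classes being oversampled up to the size of the largest class (6,745 documents). This is an inference from the reported numbers, not a statement about how `balance_smote` is implemented. The arithmetic can be checked as follows (variable names are illustrative):\n",
    "\n",
    "```python\n",
    "n_classes = 14          # owners u_1 ... u_14\n",
    "majority_count = 6745   # documents owned by u_3, the largest class\n",
    "original_rows = 13870   # rows before balancing\n",
    "\n",
    "# Rows expected if each class is grown to the size of the largest one.\n",
    "balanced_rows = n_classes * majority_count\n",
    "# Rows that would have to be synthetic rather than original documents.\n",
    "synthetic_rows = balanced_rows - original_rows\n",
    "\n",
    "print(balanced_rows)   # 94430, matching the balanced shape reported above\n",
    "print(synthetic_rows)  # 80560\n",
    "```\n",
    "\n",
    "Under that assumption, roughly 85% of the rows in the balanced dataset would be synthetic, which is worth keeping in mind when comparing accuracy figures between the two datasets."
   ]
  },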
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy: 98.14% ±0.0079 <-- combined\n",
      "Accuracy: 92.27% ±0.0159 <-- generic\n",
      "Accuracy: 98.13% ±0.0082 <-- provenance\n"
     ]
    }
   ],
   "source": [
    "results_bal, importances_bal = test_classification(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Result**: The classifiers provide a higher performance on balanced data when provenance-specific metrics are used (either with the `combined` or `provenance` metrics sets). The classifiers trained on the `generic` metrics set, however, performs better on the original, unbalanced data. It is, perhaps, some of the minority labels have more distinctive provenance-specific metrics, compared to their generic one; when more such samples are introduced in the balacing process, using only generic metrics cannot identify those samples as well, hence a lower accuracy."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}