{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Extra 3.2 - Historical Provenance - Application 3: RRG Chat Messages\n", "Identifying instructions from chat messages in the Radiation Response Game.\n", "\n", "In this notebook, we explore the performance of classification using the provenance of a data entity instead of its dependencies (as shown [here](Application%203%20-%20RRG%20Messages.ipynb) and in the paper). In order to distinguish between the two, we call the former _historical_ provenance and the latter _forward_ provenance. Apart from using the historical provenance, all other steps are the same as [the original experiments](Application%203%20-%20RRG%20Messages.ipynb).\n", "\n", "* **Goal**: To determine if the provenance network analytics method can identify instructions from the provenance of a chat messages.\n", "* **Classification labels**: $\\mathcal{L} = \\left\\{ \\textit{instruction}, \\textit{other} \\right\\} $.\n", "* **Training data**: 69 chat messages manually categorised by HCI researchers.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading data\n", "\n", "The RRG dataset based on historical provenance is provided in the [`rrg/ancestor-graphs.csv`](rrg/ancestor-graphs.csv) file, which contains a table whose rows correspond to individual chat messages in RRG:\n", "* First column: the identifier of the chat message\n", "* `label`: the manual classification of the message (e.g., _instruction_, _information_, _requests_, etc.)\n", "* The remaining columns provide the provenance network metrics calculated from the *historical provenance* graph of the message.\n", "\n", "Note that in this extra experiment, we use the full (historical) provenance of a message, not limiting how far it goes. Hence, there is no $k$ parameter in this experiment." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "filepath = \"rrg/ancestor-graphs.csv\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelentitiesagentsactivitiesnodesedgesdiameterassortativityaccacc_e...mfd_e_amfd_e_agmfd_a_emfd_a_amfd_a_agmfd_ag_emfd_ag_amfd_ag_agmfd_derpowerlaw_alpha
21requests18672121446970.0121520.4883480.445533...2219342219000372.924960
20commissives18372021046170.0075460.4873860.446461...2219332219000372.858642
23assertives2167232465437-0.0015500.4890500.447828...2622382619000462.867888
25instruction22072425155370.0025910.4897520.447110...2622382619000462.891161
24instruction21972425055170.0022840.4898590.447021...2622382619000462.928098
\n", "

5 rows × 23 columns

\n", "
" ], "text/plain": [ " label entities agents activities nodes edges diameter \\\n", "21 requests 186 7 21 214 469 7 \n", "20 commissives 183 7 20 210 461 7 \n", "23 assertives 216 7 23 246 543 7 \n", "25 instruction 220 7 24 251 553 7 \n", "24 instruction 219 7 24 250 551 7 \n", "\n", " assortativity acc acc_e ... mfd_e_a mfd_e_ag \\\n", "21 0.012152 0.488348 0.445533 ... 22 19 \n", "20 0.007546 0.487386 0.446461 ... 22 19 \n", "23 -0.001550 0.489050 0.447828 ... 26 22 \n", "25 0.002591 0.489752 0.447110 ... 26 22 \n", "24 0.002284 0.489859 0.447021 ... 26 22 \n", "\n", " mfd_a_e mfd_a_a mfd_a_ag mfd_ag_e mfd_ag_a mfd_ag_ag mfd_der \\\n", "21 34 22 19 0 0 0 37 \n", "20 33 22 19 0 0 0 37 \n", "23 38 26 19 0 0 0 46 \n", "25 38 26 19 0 0 0 46 \n", "24 38 26 19 0 0 0 46 \n", "\n", " powerlaw_alpha \n", "21 2.924960 \n", "20 2.858642 \n", "23 2.867888 \n", "25 2.891161 \n", "24 2.928098 \n", "\n", "[5 rows x 23 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(filepath, index_col=0)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Labelling data\n", "\n", "Since we are only interested in the _instruction_ messages, we categorise the data entity into two sets: _instruction_ and _other_.\n", "\n", "Note: This section is just an example to show the data transformation to be applied on each dataset." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "label = lambda l: 'other' if l != 'instruction' else l" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelentitiesagentsactivitiesnodesedgesdiameterassortativityaccacc_e...mfd_e_amfd_e_agmfd_a_emfd_a_amfd_a_agmfd_ag_emfd_ag_amfd_ag_agmfd_derpowerlaw_alpha
21other18672121446970.0121520.4883480.445533...2219342219000372.924960
20other18372021046170.0075460.4873860.446461...2219332219000372.858642
23other2167232465437-0.0015500.4890500.447828...2622382619000462.867888
25instruction22072425155370.0025910.4897520.447110...2622382619000462.891161
24instruction21972425055170.0022840.4898590.447021...2622382619000462.928098
\n", "

5 rows × 23 columns

\n", "
" ], "text/plain": [ " label entities agents activities nodes edges diameter \\\n", "21 other 186 7 21 214 469 7 \n", "20 other 183 7 20 210 461 7 \n", "23 other 216 7 23 246 543 7 \n", "25 instruction 220 7 24 251 553 7 \n", "24 instruction 219 7 24 250 551 7 \n", "\n", " assortativity acc acc_e ... mfd_e_a mfd_e_ag \\\n", "21 0.012152 0.488348 0.445533 ... 22 19 \n", "20 0.007546 0.487386 0.446461 ... 22 19 \n", "23 -0.001550 0.489050 0.447828 ... 26 22 \n", "25 0.002591 0.489752 0.447110 ... 26 22 \n", "24 0.002284 0.489859 0.447021 ... 26 22 \n", "\n", " mfd_a_e mfd_a_a mfd_a_ag mfd_ag_e mfd_ag_a mfd_ag_ag mfd_der \\\n", "21 34 22 19 0 0 0 37 \n", "20 33 22 19 0 0 0 37 \n", "23 38 26 19 0 0 0 46 \n", "25 38 26 19 0 0 0 46 \n", "24 38 26 19 0 0 0 46 \n", "\n", " powerlaw_alpha \n", "21 2.924960 \n", "20 2.858642 \n", "23 2.867888 \n", "25 2.891161 \n", "24 2.928098 \n", "\n", "[5 rows x 23 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.label = df.label.apply(label).astype('category')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Balancing data\n", "\n", "This section explore the balance of the RRG datasets." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "other 37\n", "instruction 32\n", "Name: label, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Examine the balance of the dataset\n", "df.label.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since both labels have roughly the same number of data points, we decide not to balance the RRG datasets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross validation\n", "\n", "We now run the cross validation tests on the datasets using all the features (`combined`), only the generic network metrics (`generic`), and only the provenance-specific network metrics (`provenance`). Please refer to [Cross Validation Code.ipynb](Cross%20Validation%20Code.ipynb) for the detailed description of the cross validation code." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from analytics import test_classification" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 64.07% ±1.1212 <-- combined\n", "Accuracy: 66.20% ±1.1259 <-- generic\n", "Accuracy: 61.03% ±1.1090 <-- provenance\n" ] } ], "source": [ "results, importances = test_classification(df, n_iterations=1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Results**: Compared to the top accuracy achieved [using forward provenance](Application%203%20-%20RRG%20Messages.ipynb), 85%, using historical provenance in this application yield much lower accuracy, 66%. This supports our hypothesis that the forward provenance of a data entity correlates better with its nature/characteristic than its historical provenance (as the forward provenance records how the data entity was used)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }