{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Extra 3.2 - Historical Provenance - Application 3: RRG Chat Messages\n", "Identifying instructions from chat messages in the Radiation Response Game.\n", "\n", "In this notebook, we explore the performance of classification using the provenance of a data entity instead of its dependencies (as shown [here](Application%203%20-%20RRG%20Messages.ipynb) and in the paper). In order to distinguish between the two, we call the former _historical_ provenance and the latter _forward_ provenance. Apart from using the historical provenance, all other steps are the same as [the original experiments](Application%203%20-%20RRG%20Messages.ipynb).\n", "\n", "* **Goal**: To determine if the provenance network analytics method can identify instructions from the provenance of a chat messages.\n", "* **Classification labels**: $\\mathcal{L} = \\left\\{ \\textit{instruction}, \\textit{other} \\right\\} $.\n", "* **Training data**: 69 chat messages manually categorised by HCI researchers.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading data\n", "\n", "The RRG dataset based on historical provenance is provided in the [`rrg/ancestor-graphs.csv`](rrg/ancestor-graphs.csv) file, which contains a table whose rows correspond to individual chat messages in RRG:\n", "* First column: the identifier of the chat message\n", "* `label`: the manual classification of the message (e.g., _instruction_, _information_, _requests_, etc.)\n", "* The remaining columns provide the provenance network metrics calculated from the *historical provenance* graph of the message.\n", "\n", "Note that in this extra experiment, we use the full (historical) provenance of a message, not limiting how far it goes. Hence, there is no $k$ parameter in this experiment." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "filepath = \"rrg/ancestor-graphs.csv\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelentitiesagentsactivitiesnodesedgesdiameterassortativityaccacc_e...mfd_e_amfd_e_agmfd_a_emfd_a_amfd_a_agmfd_ag_emfd_ag_amfd_ag_agmfd_derpowerlaw_alpha
21requests18672121446970.0121520.4883480.445533...2219342219000372.924960
20commissives18372021046170.0075460.4873860.446461...2219332219000372.858642
23assertives2167232465437-0.0015500.4890500.447828...2622382619000462.867888
25instruction22072425155370.0025910.4897520.447110...2622382619000462.891161
24instruction21972425055170.0022840.4898590.447021...2622382619000462.928098
\n", "

5 rows × 23 columns

\n", "
" ], "text/plain": [ " label entities agents activities nodes edges diameter \\\n", "21 requests 186 7 21 214 469 7 \n", "20 commissives 183 7 20 210 461 7 \n", "23 assertives 216 7 23 246 543 7 \n", "25 instruction 220 7 24 251 553 7 \n", "24 instruction 219 7 24 250 551 7 \n", "\n", " assortativity acc acc_e ... mfd_e_a mfd_e_ag \\\n", "21 0.012152 0.488348 0.445533 ... 22 19 \n", "20 0.007546 0.487386 0.446461 ... 22 19 \n", "23 -0.001550 0.489050 0.447828 ... 26 22 \n", "25 0.002591 0.489752 0.447110 ... 26 22 \n", "24 0.002284 0.489859 0.447021 ... 26 22 \n", "\n", " mfd_a_e mfd_a_a mfd_a_ag mfd_ag_e mfd_ag_a mfd_ag_ag mfd_der \\\n", "21 34 22 19 0 0 0 37 \n", "20 33 22 19 0 0 0 37 \n", "23 38 26 19 0 0 0 46 \n", "25 38 26 19 0 0 0 46 \n", "24 38 26 19 0 0 0 46 \n", "\n", " powerlaw_alpha \n", "21 2.924960 \n", "20 2.858642 \n", "23 2.867888 \n", "25 2.891161 \n", "24 2.928098 \n", "\n", "[5 rows x 23 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(filepath, index_col=0)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Labelling data\n", "\n", "Since we are only interested in the _instruction_ messages, we categorise the data entity into two sets: _instruction_ and _other_.\n", "\n", "Note: This section is just an example to show the data transformation to be applied on each dataset." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "label = lambda l: 'other' if l != 'instruction' else l" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelentitiesagentsactivitiesnodesedgesdiameterassortativityaccacc_e...mfd_e_amfd_e_agmfd_a_emfd_a_amfd_a_agmfd_ag_emfd_ag_amfd_ag_agmfd_derpowerlaw_alpha
21other18672121446970.0121520.4883480.445533...2219342219000372.924960
20other18372021046170.0075460.4873860.446461...2219332219000372.858642
23other2167232465437-0.0015500.4890500.447828...2622382619000462.867888
25instruction22072425155370.0025910.4897520.447110...2622382619000462.891161
24instruction21972425055170.0022840.4898590.447021...2622382619000462.928098
\n", "

5 rows × 23 columns

\n", "
" ], "text/plain": [ " label entities agents activities nodes edges diameter \\\n", "21 other 186 7 21 214 469 7 \n", "20 other 183 7 20 210 461 7 \n", "23 other 216 7 23 246 543 7 \n", "25 instruction 220 7 24 251 553 7 \n", "24 instruction 219 7 24 250 551 7 \n", "\n", " assortativity acc acc_e ... mfd_e_a mfd_e_ag \\\n", "21 0.012152 0.488348 0.445533 ... 22 19 \n", "20 0.007546 0.487386 0.446461 ... 22 19 \n", "23 -0.001550 0.489050 0.447828 ... 26 22 \n", "25 0.002591 0.489752 0.447110 ... 26 22 \n", "24 0.002284 0.489859 0.447021 ... 26 22 \n", "\n", " mfd_a_e mfd_a_a mfd_a_ag mfd_ag_e mfd_ag_a mfd_ag_ag mfd_der \\\n", "21 34 22 19 0 0 0 37 \n", "20 33 22 19 0 0 0 37 \n", "23 38 26 19 0 0 0 46 \n", "25 38 26 19 0 0 0 46 \n", "24 38 26 19 0 0 0 46 \n", "\n", " powerlaw_alpha \n", "21 2.924960 \n", "20 2.858642 \n", "23 2.867888 \n", "25 2.891161 \n", "24 2.928098 \n", "\n", "[5 rows x 23 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.label = df.label.apply(label).astype('category')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Balancing data\n", "\n", "This section explore the balance of the RRG datasets." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "other 37\n", "instruction 32\n", "Name: label, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Examine the balance of the dataset\n", "df.label.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since both labels have roughly the same number of data points, we decide not to balance the RRG datasets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross validation\n", "\n", "We now run the cross validation tests on the datasets using all the features (`combined`), only the generic network metrics (`generic`), and only the provenance-specific network metrics (`provenance`). Please refer to [Cross Validation Code.ipynb](Cross%20Validation%20Code.ipynb) for the detailed description of the cross validation code." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from analytics import test_classification" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 64.07% ±1.1212 <-- combined\n", "Accuracy: 66.20% ±1.1259 <-- generic\n", "Accuracy: 61.03% ±1.1090 <-- provenance\n" ] } ], "source": [ "results, importances = test_classification(df, n_iterations=1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Results**: Compared to the top accuracy achieved [using forward provenance](Application%203%20-%20RRG%20Messages.ipynb), 85%, using historical provenance in this application yield much lower accuracy, 66%. This supports our hypothesis that the forward provenance of a data entity correlates better with its nature/characteristic than its historical provenance (as the forward provenance records how the data entity was used)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }