{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Common Cross Validation Test Code\n",
    "We used the same cross validation test procedure for the three applications described in the paper. This document provides explanations for the code in [analytics.py](analytics.py) used in those tests.\n",
    "\n",
    "See the tests carried out in each application:\n",
    "* [Application 1: ProvStore Documents](Application%201%20-%20ProvStore%20Documents.ipynb)\n",
    "* [Application 2: CollabMap](Application%202%20-%20CollabMap%20Data%20Quality.ipynb)\n",
    "* [Applicaiton 3: HAC-ER Messages](Application%203%20-%20RRG%20Messages.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lists of features\n",
    "\n",
    "In our experiments, we first test our trained classifiers using all 22 provenance network metrics as defined in the paper. We then repeat the test using only the generic network metrics (6) and only the provenance-specific network metrics (16). Comparing the performance from all three tests will help verify whether the provenance-specific\n",
    "network metrics bring added benefits to the classification application being discussed.\n",
    "\n",
    "The lists of metrics `combined`, `generic`, and `provenance` are defined below.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# The 'combined' list has all the 22 metrics\n",
    "feature_names_combined = (\n",
    "    'entities', 'agents', 'activities',  # PROV types (for nodes)\n",
    "    'nodes', 'edges', 'diameter', 'assortativity',  # standard metrics\n",
    "    'acc', 'acc_e', 'acc_a', 'acc_ag',  # average clustering coefficients\n",
    "    'mfd_e_e', 'mfd_e_a', 'mfd_e_ag',  # MFDs\n",
    "    'mfd_a_e', 'mfd_a_a', 'mfd_a_ag',\n",
    "    'mfd_ag_e', 'mfd_ag_a', 'mfd_ag_ag',\n",
    "    'mfd_der',  # MFD derivations\n",
    "    'powerlaw_alpha'  # Power Law\n",
    ")\n",
    "# The 'generic' list has 6 generic network metrics (that do not take provenance information into account)\n",
    "feature_names_generic = (\n",
    "    'nodes', 'edges', 'diameter', 'assortativity',  # standard metrics\n",
    "    'acc',\n",
    "    'powerlaw_alpha'  # Power Law\n",
    ")\n",
    "# The 'provenance' list has 16 provenance-specific network metrics\n",
    "feature_names_provenance = (\n",
    "    'entities', 'agents', 'activities',  # PROV types (for nodes)\n",
    "    'acc_e', 'acc_a', 'acc_ag',  # average clustering coefficients\n",
    "    'mfd_e_e', 'mfd_e_a', 'mfd_e_ag',  # MFDs\n",
    "    'mfd_a_e', 'mfd_a_a', 'mfd_a_ag',\n",
    "    'mfd_ag_e', 'mfd_ag_a', 'mfd_ag_ag',\n",
    "    'mfd_der',  # MFD derivations\n",
    ")\n",
    "# The utitility of above threes set of metrics will be assessed in our experiements to\n",
    "# understand whether provenance type information help us improve data classification performance\n",
    "feature_name_lists = (\n",
    "    ('combined', feature_names_combined),\n",
    "    ('generic', feature_names_generic),\n",
    "    ('provenance', feature_names_provenance)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Balancing Data\n",
    "\n",
    "This section defines the data balancing function by over-sampling using the SMOTE algorithm (see [SMOTE: Synthetic Minority Over-sampling Technique](https://www.jair.org/media/953/live-953-2037-jair.pdf)).\n",
    "\n",
    "It takes a dataframe where each row contains the label (in column `label`) and the feature vector corresponding to that label. It returns a new dataframe of the same format, but with added rows resulted from the SMOTE oversampling process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from imblearn.over_sampling import SMOTE\n",
    "from collections import Counter\n",
    "\n",
    "def balance_smote(df):\n",
    "    X = df.drop('label', axis=1)\n",
    "    Y = df.label\n",
    "    print('Original data shapes:', X.shape, Y.shape)\n",
    "    \n",
    "    smoX, smoY = X, Y\n",
    "    c = Counter(smoY)\n",
    "    while (min(c.values()) < max(c.values())):  # check if all classes are balanced, if not balance the first minority class\n",
    "        smote = SMOTE(ratio=\"auto\", kind='regular')\n",
    "        smoX, smoY = smote.fit_sample(smoX, smoY)\n",
    "        c = Counter(smoY)\n",
    "    \n",
    "    print('Balanced data shapes:', smoX.shape, smoY.shape)\n",
    "    df_balanced = pd.DataFrame(smoX, columns=X.columns)\n",
    "    df_balanced['label'] = smoY\n",
    "    return df_balanced"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The `t_confidence_interval` method below calculate the 95% confidence interval for a given list of values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def t_confidence_interval(an_array, alpha=0.95):\n",
    "    s = np.std(an_array)\n",
    "    n = len(an_array)\n",
    "    return stats.t.interval(alpha=alpha, df=(n - 1), scale=(s / np.sqrt(n)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cross Validation Methodology\n",
    "The following `cv_test` function carries out the cross validation test over `n_iterations` times and returns the accuracy scores and importance scores (for each feature). The cross validation steps are as follow:\n",
    "* Split the input dataset (X, Y) into a training set and a test set using Stratified K-fold method with k = 10\n",
    "* Train the [Decision Tree classifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) `clf` using the training set\n",
    "* Score the accuracy of the classifier `clf` on the test set\n",
    "* (Repeat the above until having done the required number of iterations)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def cv_test(X, Y, n_iterations=1000, test_id=\"\"):\n",
    "    accuracies = []\n",
    "    importances = []\n",
    "    while len(accuracies) < n_iterations:\n",
    "        skf = model_selection.StratifiedKFold(n_splits=10, shuffle=True)\n",
    "        for train, test in skf.split(X, Y):\n",
    "            clf = tree.DecisionTreeClassifier()\n",
    "            clf.fit(X.iloc[train], Y.iloc[train])\n",
    "            accuracies.append(clf.score(X.iloc[test], Y.iloc[test]))\n",
    "            importances.append(clf.feature_importances_)\n",
    "    print(\"Accuracy: %.2f%% ±%.4f <-- %s\" % (np.mean(accuracies) * 100, t_confidence_interval(accuracies)[1] * 100, test_id))\n",
    "    return accuracies, importances"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Experiments**: Having defined the cross validation method above, we now run it on the dataset (`df`) using all the features (`combined`), only the generic network metrics (`generic`), and only the provenance-specific network metrics (`provenance`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def test_classification(df, n_iterations=1000):\n",
    "    results = pd.DataFrame()\n",
    "    imps = pd.DataFrame()\n",
    "    Y = df.label\n",
    "    for feature_list_name, feature_names in feature_name_lists:\n",
    "        X = df[list(feature_names)]\n",
    "        accuracies, importances = cv_test(X, Y, n_iterations, test_id=feature_list_name)\n",
    "        rs = pd.DataFrame(\n",
    "            {\n",
    "                'Metrics': feature_list_name,\n",
    "                'Accuracy': accuracies}\n",
    "        )\n",
    "        results = results.append(rs, ignore_index=True)\n",
    "        if feature_list_name == \"combined\":  # we are interested in the relevance of all features (i.e. 'combined') \n",
    "            imps = pd.DataFrame(importances, columns=feature_names)\n",
    "    return results, imps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "In summary, the `test_classification()` function above takes a DataFrame with a special `label` column holding the labels for the intended classification. It runs the cross validation test three times:\n",
    "\n",
    "1. using *all 22 network metrics* available (in the remaining columns of the DataFrame),\n",
    "2. using *only generic network metrics*, and\n",
    "3. using *only provenance-specific network metrics*.\n",
    "\n",
    "The accuracy measures from those tests (1,000 values from each) are collated in the returned `results` DataFrame. The  the importance measures of all the 22 metrics calculated in test (1) are also collated and returned in the `imps` DataFrame."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}