{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Frequency Classifier\n",
    "\n",
    "In this notebook, we will set up a simple frequency classifier as the baseline classifier. It uses the frequency of positive samples in the training set as the predicted positive probability of all samples in the test set.\n",
    "\n",
    "## Table of Contents\n",
    "\n",
    "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#1.-Data-Preparation\" data-toc-modified-id=\"1.-Data-Preparation-1\">1. Data Preparation</a></span></li><li><span><a href=\"#2.-Model-Training\" data-toc-modified-id=\"2.-Model-Training-2\">2. Model Training</a></span></li><li><span><a href=\"#3.-Model-Testing\" data-toc-modified-id=\"3.-Model-Testing-3\">3. Model Testing</a></span></li><li><span><a href=\"#4.-Summary\" data-toc-modified-id=\"4.-Summary-4\">4. Summary</a></span></li></ul></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Data Preparation\n",
    "\n",
    "Since this classifier does not use any features but only the label counts, we use image filenames to build training data. Similar to other classifiers, donor 4 is not included in the train/test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from glob import glob\n",
    "from os.path import basename\n",
    "from functools import reduce\n",
    "from collections import Counter\n",
    "from sklearn import metrics\n",
    "from IPython.display import display, Markdown"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_data = {}\n",
    "image_path = './images/sample_images/processed/augmented/donor_{}/*/*.png'\n",
    "\n",
    "for d in [1, 2, 3, 5, 6]:\n",
    "    names = [basename(n) for n in glob(image_path.format(d))]\n",
    "    # 0 is negative and 1 is positive\n",
    "    labels = [0 if 'noact' in n else 1 for n in names]\n",
    "    \n",
    "    x = np.array(names)\n",
    "    y = np.array(labels)\n",
    "    all_data[d] = {'x': x, 'y': y}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Model Training\n",
    "\n",
    "Even though this is a trivial model, we can implement it using sk-learn's API. This makes it easier to compute the performance statistics and compare with other models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "class FrequencyClassifier():\n",
    "    \"\"\"\n",
    "    Build a trivial frequency classifier with sklearn interface.\n",
    "    \"\"\"\n",
    "    def __init__(self, pos_freq):\n",
    "        self.pos_freq = pos_freq\n",
    "    \n",
    "    def predict_proba(self, x):\n",
    "        \"\"\"\n",
    "        Use the positive sample frequency in the training set as the\n",
    "        positive probablity of all elements in x.\n",
    "        \n",
    "        Args:\n",
    "            x(np.array): the feature vector you want to predict on\n",
    "        \n",
    "        Return:\n",
    "            [p1, p2]: [probability of the negative label, probability of\n",
    "                the positive label]\n",
    "        \"\"\"\n",
    "        \n",
    "        probs = [1-self.pos_freq, self.pos_freq]\n",
    "        return np.vstack([probs for i in range(x.shape[0])])\n",
    "    \n",
    "    def predict(self, x):\n",
    "        \"\"\"\n",
    "        Use the majority class in the training set to predict all\n",
    "        elements in x.\n",
    "        \n",
    "        Args:\n",
    "            x(np.array): the feature vector you want to predict on\n",
    "        \n",
    "        Return:\n",
    "            [p1, p2]: [probability of the negative label, probability of\n",
    "                the positive label]\n",
    "        \"\"\"\n",
    "        return np.array([1 if self.pos_freq >= 0.5 else 0 for i in\n",
    "                         range(x.shape[0])])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we use this classifier to train 5 models for 5 test donors. For example, for test donor 1, the classifier counts the positive frequency in donors 2, 3, 5, 6."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The positive frequency in donor [2, 3, 5, 6] is 0.4966.\n",
      "The positive frequency in donor [1, 3, 5, 6] is 0.3243.\n",
      "The positive frequency in donor [1, 2, 5, 6] is 0.4314.\n",
      "The positive frequency in donor [1, 2, 3, 6] is 0.3278.\n",
      "The positive frequency in donor [1, 2, 3, 5] is 0.3774.\n"
     ]
    }
   ],
   "source": [
    "# Mapping test donor to its trained model\n",
    "trained_models = {}\n",
    "\n",
    "for d in [1, 2, 3, 5, 6]:\n",
    "    # Concatenate y labels to form the training set for current\n",
    "    # test donor d\n",
    "    train_donors = [i for i in [1, 2, 3, 5, 6] if i != d]\n",
    "    train_y = np.hstack([all_data[t]['y'] for t in train_donors])\n",
    "    \n",
    "    # Count the positive samples in these training labels\n",
    "    count = Counter(train_y)\n",
    "    pos_freq = count[1] / len(train_y)\n",
    "    print(\"The positive frequency in donor {} is {:.4f}.\".format(train_donors,\n",
    "                                                                 pos_freq))\n",
    "    \n",
    "    # Create a frequency classifier model for this test donor d\n",
    "    trained_models[d] = FrequencyClassifier(pos_freq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some of the positive frequencies differ from our whole-dataset frequency classifier reported in the paper. It is due to the rounding of subsamples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Model Testing\n",
    "\n",
    "For each test donor, we will run its trained model and compute performance statistics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_score(model, x_test, y_test, pos=1):\n",
    "    \"\"\"\n",
    "    This function runs the trained `model` on `x_test`, compares the test\n",
    "    result with `y_test`. Finally, it outputs a collection of various\n",
    "    classificaiton performance metrics.\n",
    "    \n",
    "    Args:\n",
    "        model: a trained sklearn model\n",
    "        x_test(np.array(m, n)): 2D feature array in the testset, m elements and\n",
    "            each element has n features\n",
    "        y_test(np.array(m)): 1D label array in the testset. There are m entries.\n",
    "        pos: the encoding of postive label in `y_test`\n",
    "\n",
    "    Return:\n",
    "        A dictionary containing the metrics information and predictions:\n",
    "            metrics scores: ['acc': accuracy, 'precision', 'recall',\n",
    "                             'ap': average precision,\n",
    "                             'aroc': area under ROC curve,\n",
    "                             'pr': PR curve points,\n",
    "                             'roc': ROC curve points]\n",
    "            predicitons: ['y_true': the groundtruth labels,\n",
    "                          'y_score': predicted probability]\n",
    "    \"\"\"\n",
    "\n",
    "    y_predict_prob = model.predict_proba(x_test)\n",
    "    y_predict = model.predict(x_test)\n",
    "\n",
    "    # Sklearn requires the prob list to be 1D\n",
    "    y_predict_prob = [x[pos] for x in y_predict_prob]\n",
    "    y_test_fixed = np.array(y_test)\n",
    "    \n",
    "    if pos == 0:\n",
    "        # Flip the array so 1 represents the positive class\n",
    "        y_test_fixed = 1 - np.array(y_test)\n",
    "\n",
    "    # Compute the PR-curve points\n",
    "    precisions, recalls, thresholds = metrics.precision_recall_curve(\n",
    "        y_test_fixed,\n",
    "        y_predict_prob,\n",
    "        pos_label=pos\n",
    "    )\n",
    "\n",
    "    # Compute the roc-curve points\n",
    "    fprs, tprs, roc_thresholds = metrics.roc_curve(y_test_fixed, y_predict_prob,\n",
    "                                                   pos_label=pos)\n",
    "\n",
    "    return ({'acc': metrics.accuracy_score(y_test_fixed, y_predict),\n",
    "             'precision': metrics.precision_score(y_test_fixed, y_predict,\n",
    "                                                  pos_label=pos),\n",
    "             'recall': metrics.recall_score(y_test_fixed, y_predict,\n",
    "                                            pos_label=pos),\n",
    "             'ap': metrics.average_precision_score(y_test_fixed,\n",
    "                                                   y_predict_prob),\n",
    "             'aroc': metrics.roc_auc_score(y_test_fixed,\n",
    "                                           y_predict_prob),\n",
    "             'pr': [precisions.tolist(), recalls.tolist(),\n",
    "                    thresholds.tolist()],\n",
    "             'roc': [fprs.tolist(), tprs.tolist(), roc_thresholds.tolist()],\n",
    "             'y_true': y_test,\n",
    "             'y_score': y_predict_prob})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def make_table(metric_dict, count_dict):\n",
    "    \"\"\"\n",
    "    Transfer the model performance metric dictionary into a Markdown table.\n",
    "    \n",
    "    Args:\n",
    "        metric_dict(dict): a dictionary encoding model performance statisitcs\n",
    "            and prediction information for all test donors\n",
    "        count_dict(dict): a dictionary encoding the count of activated and\n",
    "            quiescent samples for each test donor\n",
    "    \n",
    "    Return:\n",
    "        string: a Markdown syntax table\n",
    "    \"\"\"\n",
    "\n",
    "    # Define header and line template\n",
    "    table_str = \"\"\n",
    "    line = \"|{}|{:.2f}%|{:.2f}%|{:.2f}%|{:.2f}%|{:.2f}%|{}|{}|\\n\"\n",
    "    table_str += (\"|Test Donor|Accuracy|Precision|Recall|Average Precision|\" +\n",
    "                  \"AUC|Num of Activated|Num of Quiescent|\\n\")\n",
    "    table_str += \"|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\\n\"\n",
    "\n",
    "    for d in [1, 2, 3, 5, 6]:\n",
    "        result = metric_dict[d]\n",
    "        table_str += (line.format(\"donor_{}\".format(d),\n",
    "                                  result['acc'] * 100,\n",
    "                                  result['precision'] * 100,\n",
    "                                  result['recall'] * 100,\n",
    "                                  result['ap'] * 100,\n",
    "                                  result['aroc'] * 100,\n",
    "                                  count_dict[d]['activated'],\n",
    "                                  count_dict[d]['quiescent']))\n",
    "\n",
    "    return table_str"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/JayWong/miniconda3/envs/cellimage/lib/python3.6/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.\n",
      "  'precision', 'predicted', average, warn_for)\n"
     ]
    }
   ],
   "source": [
    "model_performance = {}\n",
    "\n",
    "for d in [1, 2, 3, 5, 6]:\n",
    "    # Collect the performance metrics for each test donor d\n",
    "    cur_y_test = all_data[d]['y']\n",
    "    cur_x_test = all_data[d]['x']\n",
    "    model_performance[d] = get_score(trained_models[d],\n",
    "                                     cur_x_test,\n",
    "                                     cur_y_test)\n",
    "\n",
    "# Save the model performance, so we can compare all models in transfer_learning\n",
    "# notebook\n",
    "np.savez('./resource/frequency_model_performance.npz',\n",
    "         model_performance=model_performance)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Count the labels for each donor\n",
    "count_dict = {}\n",
    "\n",
    "for d in [1, 2, 3, 5, 6]:\n",
    "    # Do not count augmented images\n",
    "    act_count = len(glob(\"./images/sample_images/processed/\" +\n",
    "                         \"augmented/donor_{}/activated/*.png\".format(d))) // 6\n",
    "    qui_count = len(glob(\"./images/sample_images/processed/\" +\n",
    "                         \"augmented/donor_{}/quiescent/*.png\".format(d))) // 6\n",
    "    count_dict[d] = {\n",
    "        'activated': act_count,\n",
    "        'quiescent': qui_count\n",
    "    }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "|Test Donor|Accuracy|Precision|Recall|Average Precision|AUC|Num of Activated|Num of Quiescent|\n",
       "|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\n",
       "|donor_1|88.11%|0.00%|0.00%|11.89%|50.00%|22|163|\n",
       "|donor_2|18.75%|0.00%|0.00%|81.25%|50.00%|65|15|\n",
       "|donor_3|73.41%|0.00%|0.00%|26.59%|50.00%|46|127|\n",
       "|donor_5|27.17%|0.00%|0.00%|72.83%|50.00%|67|25|\n",
       "|donor_6|56.86%|0.00%|0.00%|43.14%|50.00%|44|58|\n"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Create a table summary\n",
    "display(Markdown(make_table(model_performance, count_dict)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the prediction is constant, we failed to compute precision and recall. Also, there are some interesting relationships among accuracy, average precision and AUC. For example, when `pos_freq` is greater than $0.5$, $\\text{AP} = \\text{ACC}$. When `pos_freq` is less than or equal to $0.5$, then $\\text{AP} = 1 - \\text{ACC}$. You can read this [notebook](https://nbviewer.jupyter.org/gist/xiaohk/698ba7c174768a519d147aaea67dc1a0#Unique-Score-and-PR-and-AUC) to learn more."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Summary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook, we demonstrate the general workflow of training and testing classifiers for different test donors. We also set up our baseline model using a trivial frequency classifier."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "256px"
   },
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}