{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**Source of the materials**: Biopython cookbook (adapted)\n", "Status: Draft" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Supervised learning methods\n", "===========================\n", "\n", "Note the supervised learning methods described in this chapter all\n", "require Numerical Python (numpy) to be installed.\n", "\n", "The Logistic Regression Model {#sec:LogisticRegression}\n", "-----------------------------\n", "\n", "### Background and Purpose\n", "\n", "Logistic regression is a supervised learning approach that attempts to\n", "distinguish $K$ classes from each other using a weighted sum of some\n", "predictor variables $x_i$. The logistic regression model is used to\n", "calculate the weights $\\beta_i$ of the predictor variables. In\n", "Biopython, the logistic regression model is currently implemented for\n", "two classes only ($K = 2$); the number of predictor variables has no\n", "predefined limit.\n", "\n", "As an example, let’s try to predict the operon structure in bacteria. An\n", "operon is a set of adjacent genes on the same strand of DNA that are\n", "transcribed into a single mRNA molecule. Translation of the single mRNA\n", "molecule then yields the individual proteins. For *Bacillus\n", "subtilis*, whose data we will be using, the average number of\n", "genes in an operon is about 2.4.\n", "\n", "As a first step in understanding gene regulation in bacteria, we need to\n", "know the operon structure. For about 10% of the genes in *Bacillus\n", "subtilis*, the operon structure is known from experiments. A\n", "supervised learning method can be used to predict the operon structure\n", "for the remaining 90% of the genes.\n", "\n", "For such a supervised learning approach, we need to choose some\n", "predictor variables $x_i$ that can be measured easily and are somehow\n", "related to the operon structure. One predictor variable might be the\n", "distance in base pairs between genes. Adjacent genes belonging to the\n", "same operon tend to be separated by a relatively short distance, whereas\n", "adjacent genes in different operons tend to have a larger space between\n", "them to allow for promoter and terminator sequences. Another predictor\n", "variable is based on gene expression measurements. By definition, genes\n", "belonging to the same operon have equal gene expression profiles, while\n", "genes in different operons are expected to have different expression\n", "profiles. In practice, the measured expression profiles of genes in the\n", "same operon are not quite identical due to the presence of measurement\n", "errors. 
"errors. To assess the similarity in the gene expression profiles, we\n", "assume that the measurement errors follow a normal distribution and\n", "calculate the corresponding log-likelihood score.\n", "\n", "We now have two predictor variables that we can use to predict if two\n", "adjacent genes on the same strand of DNA belong to the same operon:\n", "\n", "- $x_1$: the number of base pairs between them;\n", "\n", "- $x_2$: their similarity in expression profile.\n", "\n", "In a logistic regression model, we use a weighted sum of these two\n", "predictors to calculate a joint score $S$:\n", "$$S = \\beta_0 + \\beta_1 x_1 + \\beta_2 x_2.$$ The logistic regression\n", "model gives us appropriate values for the parameters $\\beta_0$,\n", "$\\beta_1$, $\\beta_2$ using two sets of example genes:\n", "\n", "- OP: Adjacent genes, on the same strand of DNA, known to belong to\n", "  the same operon;\n", "\n", "- NOP: Adjacent genes, on the same strand of DNA, known to belong to\n", "  different operons.\n", "\n", "In the logistic regression model, the probability of belonging to a\n", "class depends on the score via the logistic function. For the two\n", "classes OP and NOP, we can write this as $$\\begin{aligned}\n", "\\Pr(\\mathrm{OP}|x_1, x_2) & = & \\frac{\\exp(\\beta_0 + \\beta_1 x_1 + \\beta_2 x_2)}{1+\\exp(\\beta_0 + \\beta_1 x_1 + \\beta_2 x_2)} \\label{eq:OP} \\\\\n", "\\Pr(\\mathrm{NOP}|x_1, x_2) & = & \\frac{1}{1+\\exp(\\beta_0 + \\beta_1 x_1 + \\beta_2 x_2)} \\label{eq:NOP}\\end{aligned}$$\n", "Using a set of gene pairs for which it is known whether they belong to\n", "the same operon (class OP) or to different operons (class NOP), we can\n", "calculate the weights $\\beta_0$, $\\beta_1$, $\\beta_2$ by maximizing the\n", "log-likelihood corresponding to the probability functions (\\[eq:OP\\])\n", "and (\\[eq:NOP\\]).\n", "\n",
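"Before fitting anything, it may help to see the logistic function in code. The sketch below is plain Python, not part of Biopython, and the $\\beta$ values in it are made-up placeholders rather than fitted weights; it evaluates both class probabilities for one data point and confirms that they sum to one:\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from math import exp\n", "\n", "def class_probabilities(beta, x):\n", "    # Joint score S = beta0 + beta1*x1 + beta2*x2\n", "    s = beta[0] + beta[1] * x[0] + beta[2] * x[1]\n", "    p_op = exp(s) / (1 + exp(s))  # Pr(OP|x1, x2)\n", "    p_nop = 1 / (1 + exp(s))      # Pr(NOP|x1, x2)\n", "    return p_op, p_nop\n", "\n", "# Placeholder weights, chosen only for illustration\n", "p, q = class_probabilities([9.0, -0.04, 0.02], [6, -173.1])\n", "print(p, q, p + q)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n",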
"### Training the logistic regression model {#subsec:LogisticRegressionTraining}\n", "\n",
"  Gene pair          Intergene distance ($x_1$)   Gene expression score ($x_2$)   Class\n",
"  ------------------ ---------------------------- -------------------------------- -------\n",
"  *cotJA*, *cotJB*   -53                          -200.78                          OP\n",
"  *yesK*, *yesL*     117                          -267.14                          OP\n",
"  *lplA*, *lplB*     57                           -163.47                          OP\n",
"  *lplB*, *lplC*     16                           -190.30                          OP\n",
"  *lplC*, *lplD*     11                           -220.94                          OP\n",
"  *lplD*, *yetF*     85                           -193.94                          OP\n",
"  *yfmT*, *yfmS*     16                           -182.71                          OP\n",
"  *yfmF*, *yfmE*     15                           -180.41                          OP\n",
"  *citS*, *citT*     -26                          -181.73                          OP\n",
"  *citM*, *yflN*     58                           -259.87                          OP\n",
"  *yfiI*, *yfiJ*     126                          -414.53                          NOP\n",
"  *lipB*, *yfiQ*     191                          -249.57                          NOP\n",
"  *yfiU*, *yfiV*     113                          -265.28                          NOP\n",
"  *yfhH*, *yfhI*     145                          -312.99                          NOP\n",
"  *cotY*, *cotX*     154                          -213.83                          NOP\n",
"  *yjoB*, *rapA*     147                          -380.85                          NOP\n",
"  *ptsI*, *splA*     93                           -291.13                          NOP\n",
"\n",
"  : Adjacent gene pairs known to belong to the same operon (class OP) or\n", "  to different operons (class NOP). Intergene distances are negative if\n", "  the two genes overlap.\n", "\n", "\\[table:training\\]\n", "\n", "Table \\[table:training\\] lists some of the *Bacillus\n", "subtilis* gene pairs for which the operon structure is known.\n", "Let’s calculate the logistic regression model from these data:\n", "\n" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from Bio import LogisticRegression\n", "xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30],\n", "      [11, -220.94], [85, -193.94], [16, -182.71], [15, -180.41],\n", "      [-26, -181.73], [58, -259.87], [126, -414.53], [191, -249.57],\n", "      [113, -265.28], [145, -312.99], [154, -213.83], [147, -380.85],\n", "      [93, -291.13]]\n", "ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]\n", "model = LogisticRegression.train(xs, ys)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Here, `xs` and `ys` are the training data: `xs` contains the predictor\n", "variables for each gene pair, and `ys` specifies if the gene pair\n", "belongs to the same operon (`1`, class OP) or to different operons (`0`,\n", "class NOP). The resulting logistic regression model is stored in\n", "`model`, which contains the weights $\\beta_0$, $\\beta_1$, and $\\beta_2$:\n", "\n" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[8.983029015714472, -0.035968960444850887, 0.021813956629835204]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.beta" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Note that $\\beta_1$ is negative, as gene pairs with a shorter intergene\n", "distance have a higher probability of belonging to the same operon\n", "(class OP). On the other hand, $\\beta_2$ is positive, as gene pairs\n", "belonging to the same operon typically have a higher similarity score\n", "for their gene expression profiles. The parameter $\\beta_0$ is positive\n", "because operon gene pairs are more prevalent than non-operon gene pairs\n", "in the training data.\n", "\n",
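"To check that these weights reproduce equation (\\[eq:OP\\]), we can recompute the class probability of the first training pair by hand. This is a small consistency check of our own; `model.beta` is the only Biopython attribute it uses:\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from math import exp\n", "\n", "# Fitted weights, and the predictors of the cotJA, cotJB pair\n", "b0, b1, b2 = model.beta\n", "x1, x2 = -53, -200.78\n", "s = b0 + b1 * x1 + b2 * x2  # joint score S\n", "print(\"Pr(OP) =\", exp(s) / (1 + exp(s)))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n",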
"The function `train` has two optional arguments: `update_fn` and\n", "`typecode`. The `update_fn` can be used to specify a callback function,\n", "taking as arguments the iteration number and the log-likelihood. With\n", "the callback function, we can, for example, track the progress of the\n", "model calculation (which uses a Newton-Raphson iteration to maximize the\n", "log-likelihood function of the logistic regression model):\n", "\n" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Iteration: 0 Log-likelihood function: -11.7835020695\n", "Iteration: 1 Log-likelihood function: -7.15886767672\n", "Iteration: 2 Log-likelihood function: -5.76877209868\n", "Iteration: 3 Log-likelihood function: -5.11362294338\n", "Iteration: 4 Log-likelihood function: -4.74870642433\n", "Iteration: 5 Log-likelihood function: -4.50026077146\n", "Iteration: 6 Log-likelihood function: -4.31127773737\n", "Iteration: 7 Log-likelihood function: -4.16015043396\n", "Iteration: 8 Log-likelihood function: -4.03561719785\n", "Iteration: 9 Log-likelihood function: -3.93073282192\n", "Iteration: 10 Log-likelihood function: -3.84087660929\n", "Iteration: 11 Log-likelihood function: -3.76282560605\n", "Iteration: 12 Log-likelihood function: -3.69425027154\n", "Iteration: 13 Log-likelihood function: -3.6334178602\n", "Iteration: 14 Log-likelihood function: -3.57900855837\n", "Iteration: 15 Log-likelihood function: -3.52999671386\n", "Iteration: 16 Log-likelihood function: -3.48557145163\n", "Iteration: 17 Log-likelihood function: -3.44508206139\n", "Iteration: 18 Log-likelihood function: -3.40799948447\n", "Iteration: 19 Log-likelihood function: -3.3738885624\n", "Iteration: 20 Log-likelihood function: -3.3423876581\n", "Iteration: 21 Log-likelihood function: -3.31319343769\n", "Iteration: 22 Log-likelihood function: -3.2860493346\n", "Iteration: 23 Log-likelihood function: -3.2607366863\n", "Iteration: 24 Log-likelihood function: -3.23706784091\n", "Iteration: 25 Log-likelihood function: -3.21488073614\n", "Iteration: 26 Log-likelihood function: -3.19403459259\n", "Iteration: 27 Log-likelihood function: -3.17440646052\n", "Iteration: 28 Log-likelihood function: -3.15588842703\n", "Iteration: 29 Log-likelihood function: -3.13838533947\n", "Iteration: 30 Log-likelihood function: -3.12181293595\n", "Iteration: 31 Log-likelihood function: -3.10609629966\n", "Iteration: 32 Log-likelihood function: -3.09116857282\n", "Iteration: 33 Log-likelihood function: -3.07696988017\n", "Iteration: 34 Log-likelihood function: -3.06344642288\n", "Iteration: 35 Log-likelihood function: -3.05054971191\n", "Iteration: 36 Log-likelihood function: -3.03823591619\n", "Iteration: 37 Log-likelihood function: -3.02646530573\n", "Iteration: 38 Log-likelihood function: -3.01520177394\n", "Iteration: 39 Log-likelihood function: -3.00441242601\n", "Iteration: 40 Log-likelihood function: -2.99406722296\n", "Iteration: 41 Log-likelihood function: -2.98413867259\n" ] } ], "source": [ "def show_progress(iteration, loglikelihood):\n", "    print(\"Iteration:\", iteration, \"Log-likelihood function:\", loglikelihood)\n", "\n", "model = LogisticRegression.train(xs, ys, update_fn=show_progress)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The iteration stops once the increase in the log-likelihood function is\n", "less than 0.01. If no convergence is reached after 500 iterations, the\n", "`train` function raises an `AssertionError`.\n", "\n", "The optional keyword `typecode` can almost always be ignored. This\n", "keyword allows the user to choose the type of Numeric matrix to use. In\n", "particular, to avoid memory problems for very large data sets, it may be\n", "necessary to use single-precision floats (Float8, Float16, etc.) rather\n", "than double, which is used by default.\n", "\n",
"### Using the logistic regression model for classification\n", "\n", "Classification is performed by calling the `classify` function. Given a\n", "logistic regression model and the values for $x_1$ and $x_2$ (e.g., for\n", "a gene pair of unknown operon structure), the `classify` function\n", "returns `1` or `0`, corresponding to class OP and class NOP,\n", "respectively. For example, let’s consider the gene pairs *yxcE*,\n", "*yxcD* and *yxiB*, *yxiA*:\n", "\n",
"  Gene pair        Intergene distance $x_1$   Gene expression score $x_2$\n",
"  ---------------- -------------------------- -----------------------------\n",
"  *yxcE*, *yxcD*   6                          -173.143442352\n",
"  *yxiB*, *yxiA*   309                        -271.005880394\n",
"\n",
"  : Adjacent gene pairs of unknown operon status.\n", "\n", "The logistic regression model classifies *yxcE*, *yxcD* as belonging to\n", "the same operon (class OP), while *yxiB*, *yxiA* are predicted to belong\n", "to different operons:\n", "\n" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "yxcE, yxcD: 1\n", "yxiB, yxiA: 0\n" ] } ], "source": [ "print(\"yxcE, yxcD:\", LogisticRegression.classify(model, [6, -173.143442352]))\n", "print(\"yxiB, yxiA:\", LogisticRegression.classify(model, [309, -271.005880394]))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "(which, by the way, agrees with the biological literature).\n", "\n", "To find out how confident we can be in these predictions, we can call\n", "the `calculate` function to obtain the probabilities (equations\n", "(\\[eq:OP\\]) and (\\[eq:NOP\\])) for class OP and NOP. For\n", "*yxcE*, *yxcD* we find\n", "\n" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "class OP: probability = 0.993242163503 class NOP: probability = 0.00675783649744\n" ] } ], "source": [ "q, p = LogisticRegression.calculate(model, [6, -173.143442352])\n", "print(\"class OP: probability =\", p, \"class NOP: probability =\", q)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "and for *yxiB*, *yxiA*\n", "\n" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "class OP: probability = 0.000321211251817 class NOP: probability = 0.999678788748\n" ] } ], "source": [ "q, p = LogisticRegression.calculate(model, [309, -271.005880394])\n", "print(\"class OP: probability =\", p, \"class NOP: probability =\", q)" ] },
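{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Incidentally, `classify` simply picks the class whose probability from `calculate` is larger, which we can verify ourselves (a quick consistency check of our own, not part of the original recipe):\n", "\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "x = [6, -173.143442352]\n", "q, p = LogisticRegression.calculate(model, x)\n", "# Class 1 (OP) wins when its probability p exceeds q, else class 0 (NOP)\n", "prediction = 1 if p > q else 0\n", "print(prediction == LogisticRegression.classify(model, x))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n",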
"To get some idea of the prediction accuracy of the logistic regression\n", "model, we can apply it to the training data:\n", "\n" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True: 1 Predicted: 1\n", "True: 1 Predicted: 0\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n" ] } ], "source": [ "for i in range(len(ys)):\n", "    print(\"True:\", ys[i], \"Predicted:\", LogisticRegression.classify(model, xs[i]))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "showing that the prediction is correct for all but one of the gene\n", "pairs. A more reliable estimate of the prediction accuracy can be found\n", "from a leave-one-out analysis, in which the model is recalculated from\n", "the training data after removing the gene pair to be predicted:\n", "\n" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True: 1 Predicted: 1\n", "True: 1 Predicted: 0\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n" ] } ], "source": [ "for i in range(len(ys)):\n", "    model = LogisticRegression.train(xs[:i]+xs[i+1:], ys[:i]+ys[i+1:])\n", "    print(\"True:\", ys[i], \"Predicted:\", LogisticRegression.classify(model, xs[i]))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The leave-one-out analysis shows that the prediction of the logistic\n", "regression model is incorrect for only two of the gene pairs, which\n", "corresponds to a prediction accuracy of 88%.\n", "\n", "### Logistic Regression, Linear Discriminant Analysis, and Support Vector Machines\n", "\n", "The logistic regression model is similar to linear discriminant\n", "analysis. In linear discriminant analysis, the class probabilities also\n", "follow equations (\\[eq:OP\\]) and (\\[eq:NOP\\]). However, instead of\n", "estimating the coefficients $\\beta$ directly, we first fit a normal\n", "distribution to the predictor variables $x$. The coefficients $\\beta$\n", "are then calculated from the means and covariances of the normal\n", "distribution. If the distribution of $x$ is indeed normal, then we\n", "expect linear discriminant analysis to perform better than the logistic\n", "regression model. The logistic regression model, on the other hand, is\n", "more robust to deviations from normality.\n", "\n", "Another similar approach is a support vector machine with a linear\n", "kernel. Such an SVM also uses a linear combination of the predictors,\n", "but estimates the coefficients $\\beta$ from the predictor variables $x$\n", "near the boundary region between the classes. If the logistic regression\n", "model (equations (\\[eq:OP\\]) and (\\[eq:NOP\\])) is a good description for\n", "$x$ away from the boundary region, we expect the logistic regression\n", "model to perform better than an SVM with a linear kernel, as it relies\n", "on more data. If not, an SVM with a linear kernel may perform better.\n", "\n", "Trevor Hastie, Robert Tibshirani, and Jerome Friedman: *The\n", "Elements of Statistical Learning. Data Mining, Inference, and\n", "Prediction*. Springer Series in Statistics, 2001. Section 4.4.\n", "\n", "$k$-Nearest Neighbors\n", "---------------------\n", "\n", "### Background and purpose\n", "\n", "The $k$-nearest neighbors method is a supervised learning approach that\n", "does not need to fit a model to the data. Instead, data points are\n", "classified based on the categories of the $k$ nearest neighbors in the\n", "training data set.\n", "\n", "In Biopython, the $k$-nearest neighbors method is available in\n", "`Bio.kNN`.\n",
"To illustrate the use of the $k$-nearest neighbors method in\n", "Biopython, we will use the same operon data set as in section\n", "\\[sec:LogisticRegression\\].\n", "\n", "### Initializing a $k$-nearest neighbors model\n", "\n", "Using the data in Table \\[table:training\\], we create and initialize a\n", "$k$-nearest neighbors model as follows:\n", "\n" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from Bio import kNN\n", "k = 3\n", "model = kNN.train(xs, ys, k)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "where `xs` and `ys` are the same as in Section\n", "\\[subsec:LogisticRegressionTraining\\]. Here, `k` is the number of\n", "neighbors that will be considered for the classification. For\n", "classification into two classes, choosing an odd number for $k$ lets you\n", "avoid tied votes. The function name `train` is a bit of a misnomer,\n", "since no model training is done: this function simply stores `xs`, `ys`,\n", "and `k` in `model`.\n", "\n", "### Using a $k$-nearest neighbors model for classification\n", "\n", "To classify new data using the $k$-nearest neighbors model, we use the\n", "`classify` function. This function takes a data point $(x_1, x_2)$ and\n", "finds the $k$ nearest neighbors in the training data set `xs`. The data\n", "point $(x_1, x_2)$ is then classified based on which category (`ys`)\n", "occurs most often among the $k$ neighbors.\n", "\n", "For the example of the gene pairs *yxcE*, *yxcD* and *yxiB*, *yxiA*, we\n", "find:\n", "\n" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "yxcE, yxcD: 1\n" ] } ], "source": [ "x = [6, -173.143442352]\n", "print(\"yxcE, yxcD:\", kNN.classify(model, x))" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "yxiB, yxiA: 0\n" ] } ], "source": [ "x = [309, -271.005880394]\n", "print(\"yxiB, yxiA:\", kNN.classify(model, x))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "In agreement with the logistic regression model, *yxcE*, *yxcD* are\n", "classified as belonging to the same operon (class OP), while *yxiB*,\n", "*yxiA* are predicted to belong to different operons.\n", "\n", "The `classify` function lets us specify both a distance function and a\n", "weight function as optional arguments. The distance function affects\n", "which $k$ neighbors are chosen as the nearest neighbors, as these are\n", "defined as the neighbors with the smallest distance to the query point\n", "$(x_1, x_2)$. By default, the Euclidean distance is used. Instead, we\n", "could, for example, use the city-block (Manhattan) distance:\n", "\n" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "yxcE, yxcD: 1\n" ] } ], "source": [ "def cityblock(x1, x2):\n", "    # City-block (Manhattan) distance between two 2-dimensional points\n", "    assert len(x1) == 2\n", "    assert len(x2) == 2\n", "    distance = abs(x1[0] - x2[0]) + abs(x1[1] - x2[1])\n", "    return distance\n", "\n", "x = [6, -173.143442352]\n", "print(\"yxcE, yxcD:\", kNN.classify(model, x, distance_fn=cityblock))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The weight function can be used for weighted voting.\n",
"For example, we may want to give closer neighbors a higher weight than\n", "neighbors that are farther away:\n", "\n" ] },
{ "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "yxcE, yxcD: 1\n" ] } ], "source": [ "from math import exp\n", "\n", "def weight(x1, x2):\n", "    # The weight decays exponentially with the city-block distance\n", "    assert len(x1) == 2\n", "    assert len(x2) == 2\n", "    return exp(-abs(x1[0] - x2[0]) - abs(x1[1] - x2[1]))\n", "\n", "x = [6, -173.143442352]\n", "print(\"yxcE, yxcD:\", kNN.classify(model, x, weight_fn=weight))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "By default, all neighbors are given an equal weight.\n", "\n", "To find out how confident we can be in these predictions, we can call\n", "the `calculate` function, which will calculate the total weight assigned\n", "to the classes OP and NOP. For the default weighting scheme, this\n", "reduces to the number of neighbors in each category. Note that\n", "`calculate` returns a dictionary keyed by class, so `weights[1]` is the\n", "total weight for class OP and `weights[0]` the total weight for class\n", "NOP. For *yxcE*, *yxcD*, we find\n", "\n" ] },
{ "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "class OP: weight = 3.0 class NOP: weight = 0.0\n" ] } ], "source": [ "x = [6, -173.143442352]\n", "weights = kNN.calculate(model, x)\n", "print(\"class OP: weight =\", weights[1], \"class NOP: weight =\", weights[0])\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "which means that all three nearest neighbors of this data point are in\n", "the OP class, in agreement with the classification above. As another\n", "example, for *yesK*, *yesL* we find\n", "\n" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "class OP: weight = 1.0 class NOP: weight = 2.0\n" ] } ], "source": [ "x = [117, -267.14]\n", "weights = kNN.calculate(model, x)\n", "print(\"class OP: weight =\", weights[1], \"class NOP: weight =\", weights[0])\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "which means that one neighbor is an operon pair (the gene pair itself,\n", "at distance zero) and two neighbors are non-operon pairs, so the\n", "majority vote puts this pair, incorrectly, in class NOP.\n", "\n", "To get some idea of the prediction accuracy of the $k$-nearest neighbors\n", "approach, we can apply it to the training data:\n", "\n" ] },
{ "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True: 1 Predicted: 1\n", "True: 1 Predicted: 0\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n" ] } ], "source": [ "for i in range(len(ys)):\n", "    print(\"True:\", ys[i], \"Predicted:\", kNN.classify(model, xs[i]))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "showing that the prediction is correct for all but two of the gene\n", "pairs.\n",
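"To express this as a single number, we can count the agreements directly\n", "(a small check of our own; given the output above it should print 15 of 17):\n", "\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Training accuracy: the fraction of known pairs the model reproduces\n", "correct = sum(kNN.classify(model, xs[i]) == ys[i] for i in range(len(ys)))\n", "print(correct, \"of\", len(ys))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n",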
"A more reliable estimate of the prediction accuracy can be found from a\n", "leave-one-out analysis, in which the model is recalculated from the\n", "training data after removing the gene pair to be predicted:\n", "\n" ] },
{ "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True: 1 Predicted: 1\n", "True: 1 Predicted: 0\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 1\n", "True: 1 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 1\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 0\n", "True: 0 Predicted: 1\n" ] } ], "source": [ "for i in range(len(ys)):\n", "    model = kNN.train(xs[:i]+xs[i+1:], ys[:i]+ys[i+1:], 3)\n", "    print(\"True:\", ys[i], \"Predicted:\", kNN.classify(model, xs[i]))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The leave-one-out analysis shows that the prediction of the $k$-nearest\n", "neighbors model is incorrect for four of the gene pairs, which\n", "corresponds to a prediction accuracy of 76%.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }