{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [
 "# Classifier models\n",
 "\n",
 "In our introductory materials on data, model, and algorithm objects, we used a regression task as a representative concrete example. Here we complement that material with implementations of some simple classifier objects."
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "__Contents:__\n",
 "\n",
 "- Classifier base class\n",
 "- Multi-class logistic regression\n",
 "\n",
 "___"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "\n",
 "## Classifier base class\n",
 "\n",
 "Here we put together `Classifier`, a base class for classification models, which naturally inherits from `Model`."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "import numpy as np\n",
 "\n",
 "import models\n",
 "\n",
 "\n",
 "class Classifier(models.Model):\n",
 "    '''\n",
 "    Generic classifier model, an object with methods\n",
 "    for both training and evaluating classifiers.\n",
 "    '''\n",
 "\n",
 "    def __init__(self, data=None, name=None):\n",
 "        super(Classifier, self).__init__(name=name)\n",
 "\n",
 "        # If given data, collect information about labels.\n",
 "        if data is not None:\n",
 "            self.labels = self.get_labels(data=data) # all unique labels.\n",
 "            self.nc = self.labels.size # number of unique labels.\n",
 "\n",
 "\n",
 "    def onehot(self, y):\n",
 "        '''\n",
 "        A function for encoding y into a matrix of one-hot rows.\n",
 "        Inputs:\n",
 "        - y is a (k,1) array, taking values in {0,1,...,nc-1}.\n",
 "        '''\n",
 "        nc = self.nc\n",
 "        k = y.shape[0]\n",
 "        C = np.zeros((k,nc), dtype=y.dtype)\n",
 "\n",
 "        for i in range(k):\n",
 "            j = y[i,0] # assumes y has only one column.\n",
 "            C[i,j] = 1\n",
 "\n",
 "        return C\n",
 "\n",
 "\n",
 "    def get_labels(self, data):\n",
 "        '''\n",
 "        Get all the (unique) labels that appear in the data.\n",
 "        '''\n",
 "        A = (data.y_tr is None)\n",
 "        B = (data.y_te is None)\n",
 "\n",
 "        if (A and B):\n",
 "            raise ValueError(\"No label data provided!\")\n",
 "        else:\n",
 "            if A:\n",
 "                out_labels = np.unique(data.y_te)\n",
 "            elif B:\n",
 "                out_labels = np.unique(data.y_tr)\n",
 "            else:\n",
 "                out_labels = np.unique(np.concatenate((data.y_tr,\n",
 "                                                       data.y_te), axis=0))\n",
 "            count = out_labels.size\n",
 "            return out_labels.reshape((count,1))\n",
 "\n",
 "\n",
 "    def classify(self, X):\n",
 "        '''\n",
 "        Must be implemented by sub-classes.\n",
 "        '''\n",
 "        raise NotImplementedError\n",
 "\n",
 "\n",
 "    def class_perf(self, y_est, y_true):\n",
 "        '''\n",
 "        Given class label estimates and true values, compute the\n",
 "        overall misclassification rate, along with per-label\n",
 "        precision/recall/F1 scores (treating each label as a\n",
 "        one-vs-rest binary classification problem).\n",
 "\n",
 "        Input:\n",
 "        y_est and y_true are (k x 1) matrices of labels.\n",
 "\n",
 "        Output:\n",
 "        Returns a dictionary with two components, (1) being\n",
 "        the fraction of incorrectly classified labels (the\n",
 "        misclassification rate), and (2) being a dict of\n",
 "        per-label precision/recall/F1 scores.\n",
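 "\n",
 "        The returned structure looks like the following\n",
 "        (values here are purely illustrative):\n",
 "        {'rate': 0.25,\n",
 "         'PRF1': {0: {'P': 0.9, 'R': 0.8, 'F1': ...}, 1: {...}, ...}}\n",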
 "        '''\n",
 "\n",
 "        # First, get the classification rate.\n",
 "        k = y_est.size\n",
 "        num_correct = (y_est == y_true).sum()\n",
 "        frac_correct = num_correct / k\n",
 "        frac_incorrect = 1.0 - frac_correct\n",
 "\n",
 "        # Then, get precision/recall for each class.\n",
 "        prec_rec = { i:None for i in range(self.nc) } # initialize\n",
 "\n",
 "        for c in range(self.nc):\n",
 "\n",
 "            idx_c = (y_true == c)\n",
 "            idx_notc = np.logical_not(idx_c)\n",
 "\n",
 "            TP = (y_est[idx_c] == c).sum()\n",
 "            FN = idx_c.sum() - TP\n",
 "            FP = (y_est[idx_notc] == c).sum()\n",
 "            TN = idx_notc.sum() - FP\n",
 "\n",
 "            # Precision.\n",
 "            if (TP == 0 and FP == 0):\n",
 "                prec = 0\n",
 "            else:\n",
 "                prec = TP / (TP+FP)\n",
 "\n",
 "            # Recall.\n",
 "            if (TP == 0 and FN == 0):\n",
 "                rec = 0\n",
 "            else:\n",
 "                rec = TP / (TP+FN)\n",
 "\n",
 "            # F1 (harmonic mean of precision and recall).\n",
 "            if (prec == 0 or rec == 0):\n",
 "                f1 = 0\n",
 "            else:\n",
 "                f1 = 2 * prec * rec / (prec + rec)\n",
 "\n",
 "            prec_rec[c] = {\"P\": prec,\n",
 "                           \"R\": rec,\n",
 "                           \"F1\": f1}\n",
 "\n",
 "        # Note that \"rate\" here is the misclassification rate.\n",
 "        return {\"rate\": frac_incorrect,\n",
 "                \"PRF1\": prec_rec}\n",
 "\n"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "__Exercises:__\n",
 "\n",
 "0. Explain what `onehot()`, `get_labels()`, and `class_perf()` each do.\n",
 "0. What does `class_perf` return?"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "\n",
 "## Multi-class logistic regression\n",
 "\n",
 "The multi-class logistic regression model is among the simplest and most popular classification models used in practice. Let us sort out the key elements.\n",
 "\n",
 "- Data: instance-label pairs $(x,y)$ where $x \\in \\mathbb{R}^{d}$ and $y \\in \\{0,1,\\ldots,K\\}$. There are $K+1$ *classes* here.\n",
 "\n",
 "- Model: with controllable parameters $w_{0},w_{1},\\ldots,w_{K} \\in \\mathbb{R}^{d}$, we model class probabilities, conditioned on data $x$, as\n",
 "\n",
 "\\begin{align*}\n",
 "P\\{y = j | x\\} = \\frac{\\exp(w_{j}^{T}x)}{\\sum_{k=0}^{K}\\exp(w_{k}^{T}x)}.\n",
 "\\end{align*}\n",
 "\n",
 "- Loss: the negative log-likelihood $(-1) \\log P\\{y | x\\}$, evaluated over the sample and minimized (details below).\n"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "Let's make the loss expression a bit easier to compute by introducing some new notation. Assume that we are given a sample of size $n$, namely $(x_{1},y_{1}), \\ldots, (x_{n},y_{n})$. Then for $i=1,\\ldots,n$ and $j=0,\\ldots,K$ define\n",
 "\n",
 "\\begin{align*}\n",
 "p_{ij} & = P\\{y_{i} = j | x_{i}\\}\\\\\n",
 "c_{ij} & = I\\{y_{i} = j\\}.\n",
 "\\end{align*}\n",
 "\n",
 "Collecting all the elements over index $j$, we get vectors of the form\n",
 "\n",
 "\\begin{align*}\n",
 "p_{i} & = (p_{i0},\\ldots,p_{iK})\\\\\n",
 "c_{i} & = (c_{i0},\\ldots,c_{iK}).\n",
 "\\end{align*}\n",
 "\n",
 "Note that $c_{i}$ gives us a handy \"one-hot\" vector representation of $y_{i}$. Using this notation, the probability our model assigns to the observed label $y_{i}$ can be written compactly as $\\prod_{j=0}^{K} p_{ij}^{c_{ij}}$.\n",
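 "\n",
 "For a quick concrete illustration of this notation (with an arbitrary choice of label), suppose $K = 2$ and $y_{i} = 1$, so that $c_{i} = (0,1,0)$. Then\n",
 "\n",
 "\\begin{align*}\n",
 "\\prod_{j=0}^{2} p_{ij}^{c_{ij}} = p_{i0}^{0} \\, p_{i1}^{1} \\, p_{i2}^{0} = p_{i1},\n",
 "\\end{align*}\n",
 "\n",
 "namely the probability the model assigns to the correct class.\n",
 "\n",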
 "As a loss function, we use the negative log-likelihood, defined as\n",
 "\n",
 "\\begin{align*}\n",
 "L(w;x_{i},y_{i}) & = (-1)\\log \\prod_{j=0}^{K} p_{ij}^{c_{ij}}\\\\\n",
 "& = (-1)\\sum_{j=0}^{K} c_{ij} \\log p_{ij}\\\\\n",
 "& = (-1)\\sum_{j=0}^{K} c_{ij}\\left(w_{j}^{T} x_{i} - \\log\\left( \\sum_{k=0}^{K}\\exp(w_{k}^{T}x_{i}) \\right) \\right)\\\\\n",
 "& = \\log\\left( \\sum_{k=0}^{K}\\exp(w_{k}^{T}x_{i}) \\right) - \\sum_{j=0}^{K} c_{ij}w_{j}^{T} x_{i}.\n",
 "\\end{align*}\n",
 "\n",
 "Following the usual ERM approach, the goal is to minimize the empirical mean (here rescaled by $n$, i.e., the sum):\n",
 "\n",
 "\\begin{align*}\n",
 "\\min_{w} \\sum_{i=1}^{n} L(w;x_{i},y_{i}) \\to \\hat{w}.\n",
 "\\end{align*}\n",
 "\n",
 "As a function of $w$, this objective is differentiable and convex, and gradient descent is a popular approach to minimizing it. The gradient of $L$ with respect to $w$, evaluated at $(w,x_{i},y_{i})$, can be compactly written as\n",
 "\n",
 "\\begin{align*}\n",
 "\\nabla L(w;x_{i},y_{i}) = (p_{i}-c_{i}) \\otimes x_{i}\n",
 "\\end{align*}\n",
 "\n",
 "where the binary operator $\\otimes$ is the \"Kronecker product\" defined between two matrices. If $U$ is $a \\times b$ and $V$ is $c \\times d$, then their Kronecker product is\n",
 "\n",
 "\\begin{align*}\n",
 "U \\otimes V =\n",
 "\\begin{bmatrix}\n",
 "u_{1,1}V & \\cdots & u_{1,b}V\\\\\n",
 "\\vdots & \\ddots & \\vdots\\\\\n",
 "u_{a,1}V & \\cdots & u_{a,b}V\n",
 "\\end{bmatrix}\n",
 "\\end{align*}\n",
 "\n",
 "where each $u_{i,j}V$ is naturally a block matrix of shape $c \\times d$, and thus $U \\otimes V$ has shape $ac \\times bd$. In the special case of $\\nabla L(w;x_{i},y_{i})$ here, since $(p_{i}-c_{i})$ has shape $1 \\times (K+1)$ and $x_{i}$ has shape $1 \\times d$, the gradient has shape $1 \\times d(K+1)$, as we would expect. The Kronecker product is implemented in NumPy as `np.kron`, and so implementing a logistic regression model is straightforward.\n",
 "\n",
 "__Effective degrees of freedom:__ It is perfectly fine to set $w = (w_{0},\\ldots,w_{K})$ and control $d(K+1)$ parameters, but note that since the class probabilities must sum to one, once we have computed $p_{i1},\\ldots,p_{iK}$ (via $w_{1},\\ldots,w_{K}$), we necessarily have $p_{i0} = 1 - \\sum_{j=1}^{K}p_{ij}$. Since this is guaranteed by the definition of the model output, we could perfectly well fix one of the weight vectors at $(0,\\ldots,0)$ (in the implementation below, the one for the last class) and optimize with respect to just $dK$ parameters, rather than $d(K+1)$. This is reflected in our implementation of `LogisticReg`, a sub-class of `Classifier`, below."
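] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "Before turning to that implementation, here is a minimal sketch of the gradient formula above, evaluated with `np.kron` for a single data point. All of the sizes, numbers, and variable names in this sketch are illustrative only; they are not part of the `LogisticReg` class that follows."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "import numpy as np\n",
 "\n",
 "# Illustrative sizes: d = 3 features, K+1 = 4 classes.\n",
 "d, nc = 3, 4\n",
 "rng = np.random.default_rng(0)\n",
 "\n",
 "x_i = rng.normal(size=(1,d)) # a single input, shape (1 x d).\n",
 "W = rng.normal(size=(nc,d))  # one weight vector per class.\n",
 "y_i = 2                      # made-up true label in {0,...,nc-1}.\n",
 "\n",
 "# Softmax class probabilities p_i, shape (1 x nc).\n",
 "scores = x_i.dot(W.T)\n",
 "p_i = np.exp(scores - scores.max())\n",
 "p_i = p_i / p_i.sum()\n",
 "\n",
 "# One-hot representation c_i of the label, shape (1 x nc).\n",
 "c_i = np.zeros((1,nc))\n",
 "c_i[0,y_i] = 1.0\n",
 "\n",
 "# Gradient via the Kronecker product, shape (1 x d*(K+1)).\n",
 "grad = np.kron(p_i-c_i, x_i)\n",
 "print(grad.shape)"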
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "class LogisticReg(Classifier):\n",
 "    '''\n",
 "    Multi-class logistic regression model.\n",
 "    '''\n",
 "\n",
 "    def __init__(self, data=None):\n",
 "\n",
 "        # Given data info, load up the (X,y) data.\n",
 "        super(LogisticReg, self).__init__(data=data)\n",
 "\n",
 "        # Convert original labels to a one-hot binary representation.\n",
 "        if data is not None:\n",
 "            if data.y_tr is not None:\n",
 "                self.C_tr = self.onehot(y=data.y_tr)\n",
 "            if data.y_te is not None:\n",
 "                self.C_te = self.onehot(y=data.y_te)\n",
 "\n",
 "\n",
 "    def classify(self, w, X):\n",
 "        '''\n",
 "        Given learned weights (w) and a matrix of one or\n",
 "        more observations, classify them as {0,...,nc-1}.\n",
 "\n",
 "        Input:\n",
 "        w is a (d x 1) matrix of weights, where d = numfeat*(nc-1)\n",
 "        since the last class has its weights fixed at zero.\n",
 "        X is a (k x numfeat) matrix of k observations.\n",
 "        NOTE: k can be anything (e.g., the training or test sample size).\n",
 "\n",
 "        Output:\n",
 "        A vector of length k, housing labels in {0,...,nc-1}.\n",
 "        '''\n",
 "\n",
 "        k, numfeat = X.shape\n",
 "        A = np.zeros((self.nc,k), dtype=np.float32)\n",
 "\n",
 "        # Get activations, with last row as zeros.\n",
 "        A[:-1,:] = w.reshape((self.nc-1, numfeat)).dot(X.T)\n",
 "\n",
 "        # Now convert activations to conditional probabilities.\n",
 "        maxes = np.max(A, axis=0) # largest score for each obs.\n",
 "        A = A - maxes\n",
 "        A = np.exp(A)\n",
 "        A = A / A.sum(axis=0) # (nc x k).\n",
 "\n",
 "        # Assign classes with highest probability, (k x 1) array.\n",
 "        return A.argmax(axis=0).reshape((k,1))\n",
 "\n",
 "\n",
 "    def l_imp(self, w, X, C, lamreg=None):\n",
 "        '''\n",
 "        Implementation of the multi-class logistic regression\n",
 "        loss function.\n",
 "\n",
 "        Input:\n",
 "        w is a (d x 1) matrix of weights.\n",
 "        X is a (k x numfeat) matrix of k observations.\n",
 "        C is a (k x nc) matrix giving a binarized encoding of the\n",
 "        class labels for each observation; each row a one-hot vector.\n",
 "        lamreg is a non-negative regularization parameter (or None\n",
 "        for no regularization).\n",
 "        NOTE: k can be anything (e.g., the training or test sample size).\n",
 "\n",
 "        Output:\n",
 "        A vector of length k with losses evaluated at k points.\n",
 "        '''\n",
 "\n",
 "        k, numfeat = X.shape\n",
 "        A = np.zeros((self.nc,k), dtype=np.float64)\n",
 "\n",
 "        # Get activations, with last row as zeros.\n",
 "        A[:-1,:] = w.reshape((self.nc-1, numfeat)).dot(X.T)\n",
 "\n",
 "        # Raw activations of all the correct weights.\n",
 "        cvec = (A*C.T).sum(axis=0)\n",
 "\n",
 "        # Compute the negative log-likelihoods.\n",
 "        maxes = np.max(A, axis=0)\n",
 "        err = (np.log(np.exp(A-maxes).sum(axis=0))+maxes)-cvec\n",
 "\n",
 "        # Return the losses (all data points), with penalty if needed.\n",
 "        if lamreg is None:\n",
 "            return err\n",
 "        else:\n",
 "            penalty = lamreg * np.linalg.norm(w)**2\n",
 "            return err + penalty\n",
 "\n",
 "\n",
 "    def l_tr(self, w, data, n_idx=None, lamreg=None):\n",
 "        if n_idx is None:\n",
 "            return self.l_imp(w=w, X=data.X_tr,\n",
 "                              C=self.C_tr,\n",
 "                              lamreg=lamreg)\n",
 "        else:\n",
 "            return self.l_imp(w=w, X=data.X_tr[n_idx,:],\n",
 "                              C=self.C_tr[n_idx,:],\n",
 "                              lamreg=lamreg)\n",
 "\n",
 "    def l_te(self, w, data, n_idx=None, lamreg=None):\n",
 "        if n_idx is None:\n",
 "            return self.l_imp(w=w, X=data.X_te,\n",
 "                              C=self.C_te,\n",
 "                              lamreg=lamreg)\n",
 "        else:\n",
 "            return self.l_imp(w=w, X=data.X_te[n_idx,:],\n",
 "                              C=self.C_te[n_idx,:],\n",
 "                              lamreg=lamreg)\n",
 "\n",
 "\n",
 "    def g_imp(self, w, X, C, lamreg=None):\n",
 "        '''\n",
 "        Implementation of the gradient of the loss function used in\n",
 "        multi-class logistic regression.\n",
 "\n",
 "        Input:\n",
 "        w is a (d x 1) matrix of weights.\n",
 "        X is a (k x numfeat) matrix of k observations.\n",
 "        C is a (k x nc) matrix giving a binarized encoding of the\n",
 "        class labels for each observation; each row a one-hot vector.\n",
 "        lamreg is a non-negative regularization parameter (or None\n",
 "        for no regularization).\n",
 "        NOTE: k can be anything (e.g., the training or test sample size).\n",
 "\n",
 "        Output:\n",
 "        A (k x d) matrix of gradients eval'd at k points.\n",
 "        '''\n",
 "\n",
 "        k, numfeat = X.shape\n",
 "        A = np.zeros((self.nc,k), dtype=np.float32)\n",
 "\n",
 "        # Get activations, with last row as zeros.\n",
 "        A[:-1,:] = w.reshape((self.nc-1, numfeat)).dot(X.T)\n",
 "\n",
 "        # Now convert activations to conditional probabilities.\n",
 "        maxes = np.max(A, axis=0) # largest score for each obs.\n",
 "        A = A - maxes\n",
 "        A = np.exp(A)\n",
 "        A = A / A.sum(axis=0) # (nc x k).\n",
 "\n",
 "        # Initialize a large matrix (k x d) to house per-point grads.\n",
 "        G = np.zeros((k,w.size), dtype=w.dtype)\n",
 "\n",
 "        for i in range(k):\n",
 "            # np.kron of two 1-D arrays returns a flat 1-D vector.\n",
 "            G[i,:] = np.kron(a=(A[:-1,i]-C[i,:-1]), b=X[i,:])\n",
 "            # Note the last class is dropped; its weights are fixed at zero.\n",
 "\n",
 "        if lamreg is None:\n",
 "            return G\n",
 "        else:\n",
 "            # Add the gradient of the squared-norm penalty to each row.\n",
 "            return G + lamreg*2*w.T\n",
 "\n",
 "\n",
 "    def g_tr(self, w, data, n_idx=None, lamreg=None):\n",
 "        if n_idx is None:\n",
 "            return self.g_imp(w=w, X=data.X_tr,\n",
 "                              C=self.C_tr,\n",
 "                              lamreg=lamreg)\n",
 "        else:\n",
 "            return self.g_imp(w=w, X=data.X_tr[n_idx,:],\n",
 "                              C=self.C_tr[n_idx,:],\n",
 "                              lamreg=lamreg)\n",
 "\n",
 "    def g_te(self, w, data, n_idx=None, lamreg=None):\n",
 "        if n_idx is None:\n",
 "            return self.g_imp(w=w, X=data.X_te,\n",
 "                              C=self.C_te,\n",
 "                              lamreg=lamreg)\n",
 "        else:\n",
 "            return self.g_imp(w=w, X=data.X_te[n_idx,:],\n",
 "                              C=self.C_te[n_idx,:],\n",
 "                              lamreg=lamreg)\n",
 "\n"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "__Exercises:__\n",
 "\n",
 "0. Explain what is being computed in each line of `classify`, `l_imp`, and `g_imp` above in `LogisticReg`.\n",
 "\n",
 "0. In the special case of just two classes, where $y \\in \\{0,1\\}$, the traditional model trains just one $d$-dimensional vector $w$, mapping $x \\mapsto f(w^{T}x) = P\\{y = 1 | x\\}$, where $f(u) = 1/(1+\\exp(-u))$ is the so-called logistic function. The gradient takes a very simple form in this case: try implementing a two-class logistic regression model object, without using the `np.kron` function. Numerically test to make sure the outputs of `g_imp` in your model and the general `LogisticReg` are the same.\n",
 "\n",
 "0. Why have we re-implemented `l_tr`, `l_te`, `g_tr`, `g_te`?\n",
 "\n",
 "0. What are we doing with `maxes`? Why are we doing this? For reference, given any vector $\\mathbf{u}$ and scalar $a$, the following identities hold.\n",
 "\n",
 "\\begin{align*}\n",
 "\\text{softmax}(\\mathbf{u}+a) & = \\text{softmax}(\\mathbf{u})\\\\\n",
 "\\log \\left( \\sum_{j} \\exp(u_{j}) \\right) & = a + \\log \\left( \\sum_{j} \\exp(u_{j}-a) \\right)\n",
 "\\end{align*}"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "___"
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }