{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [
 "# Classifier models\n",
 "\n",
 "In our introductory materials on data, model, and algorithm objects, we used a regression task as a representative concrete example. Here we complement that material with implementations of some simple classifier objects."
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "__Contents:__\n",
 "\n",
 "- Classifier base class\n",
 "- Multi-class logistic regression\n",
 "\n",
 "___"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "\n",
 "## Classifier base class\n",
 "\n",
 "Here we put together `Classifier`, a base class for classification models, which naturally inherits from `Model`."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "import numpy as np\n",
 "\n",
 "import models\n",
 "\n",
 "\n",
 "class Classifier(models.Model):\n",
 "    '''\n",
 "    Generic classifier model, an object with methods\n",
 "    for both training and evaluating classifiers.\n",
 "    '''\n",
 "\n",
 "    def __init__(self, data=None, name=None):\n",
 "        super(Classifier, self).__init__(name=name)\n",
 "\n",
 "        # If given data, collect information about labels.\n",
 "        if data is not None:\n",
 "            self.labels = self.get_labels(data=data) # all unique labels.\n",
 "            self.nc = self.labels.size # number of unique labels.\n",
 "\n",
 "\n",
 "    def onehot(self, y):\n",
 "        '''\n",
 "        A function for encoding y into a matrix of one-hot rows.\n",
 "        Inputs:\n",
 "        - y is a (k,1) array, taking values in {0,1,...,nc-1}.\n",
 "        '''\n",
 "        nc = self.nc\n",
 "        k = y.shape[0]\n",
 "        C = np.zeros((k,nc), dtype=y.dtype)\n",
 "\n",
 "        for i in range(k):\n",
 "            j = y[i,0] # assumes y has only one column.\n",
 "            C[i,j] = 1\n",
 "\n",
 "        return C\n",
 "\n",
 "\n",
 "    def get_labels(self, data):\n",
 "        '''\n",
 "        Get all the (unique) labels that appear in the data.\n",
 "        '''\n",
 "        A = (data.y_tr is None)\n",
 "        B = (data.y_te is None)\n",
 "\n",
 "        if (A and B):\n",
 "            raise ValueError(\"No label data provided!\")\n",
 "        else:\n",
 "            if A:\n",
 "                out_labels = np.unique(data.y_te)\n",
 "            elif B:\n",
 "                out_labels = np.unique(data.y_tr)\n",
 "            else:\n",
 "                out_labels = np.unique(np.concatenate((data.y_tr,\n",
 "                                                       data.y_te), axis=0))\n",
 "            count = out_labels.size\n",
 "            return out_labels.reshape((count,1))\n",
 "\n",
 "\n",
 "    def classify(self, X):\n",
 "        '''\n",
 "        Must be implemented by sub-classes.\n",
 "        '''\n",
 "        raise NotImplementedError\n",
 "\n",
 "\n",
 "    def class_perf(self, y_est, y_true):\n",
 "        '''\n",
 "        Given class label estimates and true values, compute the\n",
 "        overall misclassification rate, along with per-label\n",
 "        precision/recall/F1 scores (treating each label as a\n",
 "        one-vs-rest binary classification problem).\n",
 "\n",
 "        Input:\n",
 "        y_est and y_true are (k x 1) matrices of labels.\n",
 "\n",
 "        Output:\n",
 "        Returns a dictionary with two components, (1) being\n",
 "        the fraction of incorrectly classified labels (the\n",
 "        misclassification rate), and (2) being a dict of\n",
 "        per-label precision/recall/F1 scores.\n",
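 "\n",
 "        The returned structure looks like the following\n",
 "        (values here are purely illustrative):\n",
 "        {'rate': 0.25,\n",
 "         'PRF1': {0: {'P': 0.9, 'R': 0.8, 'F1': ...}, 1: {...}, ...}}\n",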
 "        '''\n",
 "\n",
 "        # First, get the classification rate.\n",
 "        k = y_est.size\n",
 "        num_correct = (y_est == y_true).sum()\n",
 "        frac_correct = num_correct / k\n",
 "        frac_incorrect = 1.0 - frac_correct\n",
 "\n",
 "        # Then, get precision/recall for each class.\n",
 "        prec_rec = { i:None for i in range(self.nc) } # initialize\n",
 "\n",
 "        for c in range(self.nc):\n",
 "\n",
 "            idx_c = (y_true == c)\n",
 "            idx_notc = np.logical_not(idx_c)\n",
 "\n",
 "            TP = (y_est[idx_c] == c).sum()\n",
 "            FN = idx_c.sum() - TP\n",
 "            FP = (y_est[idx_notc] == c).sum()\n",
 "            TN = idx_notc.sum() - FP\n",
 "\n",
 "            # Precision.\n",
 "            if (TP == 0 and FP == 0):\n",
 "                prec = 0\n",
 "            else:\n",
 "                prec = TP / (TP+FP)\n",
 "\n",
 "            # Recall.\n",
 "            if (TP == 0 and FN == 0):\n",
 "                rec = 0\n",
 "            else:\n",
 "                rec = TP / (TP+FN)\n",
 "\n",
 "            # F1 (harmonic mean of precision and recall).\n",
 "            if (prec == 0 or rec == 0):\n",
 "                f1 = 0\n",
 "            else:\n",
 "                f1 = 2 * prec * rec / (prec + rec)\n",
 "\n",
 "            prec_rec[c] = {\"P\": prec,\n",
 "                           \"R\": rec,\n",
 "                           \"F1\": f1}\n",
 "\n",
 "        # Note that \"rate\" here is the misclassification rate.\n",
 "        return {\"rate\": frac_incorrect,\n",
 "                \"PRF1\": prec_rec}\n",
 "\n"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "__Exercises:__\n",
 "\n",
 "0. Explain what `onehot()`, `get_labels()`, and `class_perf()` each do.\n",
 "0. What does `class_perf` return?"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "\n",
 "## Multi-class logistic regression\n",
 "\n",
 "The multi-class logistic regression model is among the simplest and most popular classification models used in practice. Let us sort out the key elements.\n",
 "\n",
 "- Data: instance-label pairs $(x,y)$ where $x \\in \\mathbb{R}^{d}$ and $y \\in \\{0,1,\\ldots,K\\}$. There are $K+1$ *classes* here.\n",
 "\n",
 "- Model: with controllable parameters $w_{0},w_{1},\\ldots,w_{K} \\in \\mathbb{R}^{d}$, we model class probabilities, conditioned on data $x$, as\n",
 "\n",
 "\\begin{align*}\n",
 "P\\{y = j | x\\} = \\frac{\\exp(w_{j}^{T}x)}{\\sum_{k=0}^{K}\\exp(w_{k}^{T}x)}.\n",
 "\\end{align*}\n",
 "\n",
 "- Loss: the negative log-likelihood $(-1) \\log P\\{y | x\\}$, evaluated over the sample and minimized (details below).\n"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "Let's make the loss expression a bit easier to compute by introducing some new notation. Assume that we are given a sample of size $n$, namely $(x_{1},y_{1}), \\ldots, (x_{n},y_{n})$. Then for $i=1,\\ldots,n$ and $j=0,\\ldots,K$ define\n",
 "\n",
 "\\begin{align*}\n",
 "p_{ij} & = P\\{y_{i} = j | x_{i}\\}\\\\\n",
 "c_{ij} & = I\\{y_{i} = j\\}.\n",
 "\\end{align*}\n",
 "\n",
 "Collecting all the elements over index $j$, we get vectors of the form\n",
 "\n",
 "\\begin{align*}\n",
 "p_{i} & = (p_{i0},\\ldots,p_{iK})\\\\\n",
 "c_{i} & = (c_{i0},\\ldots,c_{iK}).\n",
 "\\end{align*}\n",
 "\n",
 "Note that $c_{i}$ gives us a handy \"one-hot\" vector representation of $y_{i}$. Using this notation, the probability our model assigns to the observed label $y_{i}$ can be written compactly as $\\prod_{j=0}^{K} p_{ij}^{c_{ij}}$.\n",
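 "\n",
 "For a quick concrete illustration of this notation (with an arbitrary choice of label), suppose $K = 2$ and $y_{i} = 1$, so that $c_{i} = (0,1,0)$. Then\n",
 "\n",
 "\\begin{align*}\n",
 "\\prod_{j=0}^{2} p_{ij}^{c_{ij}} = p_{i0}^{0} \\, p_{i1}^{1} \\, p_{i2}^{0} = p_{i1},\n",
 "\\end{align*}\n",
 "\n",
 "namely the probability the model assigns to the correct class.\n",
 "\n",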
 "As a loss function, we use the negative log-likelihood, defined as\n",
 "\n",
 "\\begin{align*}\n",
 "L(w;x_{i},y_{i}) & = (-1)\\log \\prod_{j=0}^{K} p_{ij}^{c_{ij}}\\\\\n",
 "& = (-1)\\sum_{j=0}^{K} c_{ij} \\log p_{ij}\\\\\n",
 "& = (-1)\\sum_{j=0}^{K} c_{ij}\\left(w_{j}^{T} x_{i} - \\log\\left( \\sum_{k=0}^{K}\\exp(w_{k}^{T}x_{i}) \\right) \\right)\\\\\n",
 "& = \\log\\left( \\sum_{k=0}^{K}\\exp(w_{k}^{T}x_{i}) \\right) - \\sum_{j=0}^{K} c_{ij}w_{j}^{T} x_{i}.\n",
 "\\end{align*}\n",
 "\n",
 "Following the usual ERM approach, the goal is to minimize the empirical mean (here rescaled by $n$, i.e., the sum):\n",
 "\n",
 "\\begin{align*}\n",
 "\\min_{w} \\sum_{i=1}^{n} L(w;x_{i},y_{i}) \\to \\hat{w}.\n",
 "\\end{align*}\n",
 "\n",
 "As a function of $w$, this objective is differentiable and convex, and gradient descent is a popular approach to minimizing it. The gradient of $L$ with respect to $w$, evaluated at $(w,x_{i},y_{i})$, can be compactly written as\n",
 "\n",
 "\\begin{align*}\n",
 "\\nabla L(w;x_{i},y_{i}) = (p_{i}-c_{i}) \\otimes x_{i}\n",
 "\\end{align*}\n",
 "\n",
 "where the binary operator $\\otimes$ is the \"Kronecker product\" defined between two matrices. If $U$ is $a \\times b$ and $V$ is $c \\times d$, then their Kronecker product is\n",
 "\n",
 "\\begin{align*}\n",
 "U \\otimes V =\n",
 "\\begin{bmatrix}\n",
 "u_{1,1}V & \\cdots & u_{1,b}V\\\\\n",
 "\\vdots & \\ddots & \\vdots\\\\\n",
 "u_{a,1}V & \\cdots & u_{a,b}V\n",
 "\\end{bmatrix}\n",
 "\\end{align*}\n",
 "\n",
 "where each $u_{i,j}V$ is naturally a block matrix of shape $c \\times d$, and thus $U \\otimes V$ has shape $ac \\times bd$. In the special case of $\\nabla L(w;x_{i},y_{i})$ here, since $(p_{i}-c_{i})$ has shape $1 \\times (K+1)$ and $x_{i}$ has shape $1 \\times d$, the gradient has shape $1 \\times d(K+1)$, as we would expect. The Kronecker product is implemented in NumPy as `np.kron`, and so implementing a logistic regression model is straightforward.\n",
 "\n",
 "__Effective degrees of freedom:__ It is perfectly fine to set $w = (w_{0},\\ldots,w_{K})$ and control $d(K+1)$ parameters, but note that since the class probabilities must sum to one, once we have computed $p_{i1},\\ldots,p_{iK}$ (via $w_{1},\\ldots,w_{K}$), we necessarily have $p_{i0} = 1 - \\sum_{j=1}^{K}p_{ij}$. Since this is guaranteed by the definition of the model output, we could perfectly well fix one of the weight vectors at $(0,\\ldots,0)$ (in the implementation below, the one for the last class) and optimize with respect to just $dK$ parameters, rather than $d(K+1)$. This is reflected in our implementation of `LogisticReg`, a sub-class of `Classifier`, below."
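] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "Before turning to that implementation, here is a minimal sketch of the gradient formula above, evaluated with `np.kron` for a single data point. All of the sizes, numbers, and variable names in this sketch are illustrative only; they are not part of the `LogisticReg` class that follows."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "import numpy as np\n",
 "\n",
 "# Illustrative sizes: d = 3 features, K+1 = 4 classes.\n",
 "d, nc = 3, 4\n",
 "rng = np.random.default_rng(0)\n",
 "\n",
 "x_i = rng.normal(size=(1,d)) # a single input, shape (1 x d).\n",
 "W = rng.normal(size=(nc,d))  # one weight vector per class.\n",
 "y_i = 2                      # made-up true label in {0,...,nc-1}.\n",
 "\n",
 "# Softmax class probabilities p_i, shape (1 x nc).\n",
 "scores = x_i.dot(W.T)\n",
 "p_i = np.exp(scores - scores.max())\n",
 "p_i = p_i / p_i.sum()\n",
 "\n",
 "# One-hot representation c_i of the label, shape (1 x nc).\n",
 "c_i = np.zeros((1,nc))\n",
 "c_i[0,y_i] = 1.0\n",
 "\n",
 "# Gradient via the Kronecker product, shape (1 x d*(K+1)).\n",
 "grad = np.kron(p_i-c_i, x_i)\n",
 "print(grad.shape)"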
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "class LogisticReg(Classifier):\n",
 "    '''\n",
 "    Multi-class logistic regression model.\n",
 "    '''\n",
 "\n",
 "    def __init__(self, data=None):\n",
 "\n",
 "        # Given data info, load up the (X,y) data.\n",
 "        super(LogisticReg, self).__init__(data=data)\n",
 "\n",
 "        # Convert original labels to a one-hot binary representation.\n",
 "        if data is not None:\n",
 "            if data.y_tr is not None:\n",
 "                self.C_tr = self.onehot(y=data.y_tr)\n",
 "            if data.y_te is not None:\n",
 "                self.C_te = self.onehot(y=data.y_te)\n",
 "\n",
 "\n",
 "    def classify(self, w, X):\n",
 "        '''\n",
 "        Given learned weights (w) and a matrix of one or\n",
 "        more observations, classify them as {0,...,nc-1}.\n",
 "\n",
 "        Input:\n",
 "        w is a (d x 1) matrix of weights, where d = numfeat*(nc-1)\n",
 "        since the last class has its weights fixed at zero.\n",
 "        X is a (k x numfeat) matrix of k observations.\n",
 "        NOTE: k can be anything (e.g., the training or test sample size).\n",
 "\n",
 "        Output:\n",
 "        A vector of length k, housing labels in {0,...,nc-1}.\n",
 "        '''\n",
 "\n",
 "        k, numfeat = X.shape\n",
 "        A = np.zeros((self.nc,k), dtype=np.float32)\n",
 "\n",
 "        # Get activations, with last row as zeros.\n",
 "        A[:-1,:] = w.reshape((self.nc-1, numfeat)).dot(X.T)\n",
 "\n",
 "        # Now convert activations to conditional probabilities.\n",
 "        maxes = np.max(A, axis=0) # largest score for each obs.\n",
 "        A = A - maxes\n",
 "        A = np.exp(A)\n",
 "        A = A / A.sum(axis=0) # (nc x k).\n",
 "\n",
 "        # Assign classes with highest probability, (k x 1) array.\n",
 "        return A.argmax(axis=0).reshape((k,1))\n",
 "\n",
 "\n",
 "    def l_imp(self, w, X, C, lamreg=None):\n",
 "        '''\n",
 "        Implementation of the multi-class logistic regression\n",
 "        loss function.\n",
 "\n",
 "        Input:\n",
 "        w is a (d x 1) matrix of weights.\n",
 "        X is a (k x numfeat) matrix of k observations.\n",
 "        C is a (k x nc) matrix giving a binarized encoding of the\n",
 "        class labels for each observation; each row a one-hot vector.\n",
 "        lamreg is a non-negative regularization parameter (or None\n",
 "        for no regularization).\n",
 "        NOTE: k can be anything (e.g., the training or test sample size).\n",
 "\n",
 "        Output:\n",
 "        A vector of length k with losses evaluated at k points.\n",
 "        '''\n",
 "\n",
 "        k, numfeat = X.shape\n",
 "        A = np.zeros((self.nc,k), dtype=np.float64)\n",
 "\n",
 "        # Get activations, with last row as zeros.\n",
 "        A[:-1,:] = w.reshape((self.nc-1, numfeat)).dot(X.T)\n",
 "\n",
 "        # Raw activations of all the correct weights.\n",
 "        cvec = (A*C.T).sum(axis=0)\n",
 "\n",
 "        # Compute the negative log-likelihoods.\n",
 "        maxes = np.max(A, axis=0)\n",
 "        err = (np.log(np.exp(A-maxes).sum(axis=0))+maxes)-cvec\n",
 "\n",
 "        # Return the losses (all data points), with penalty if needed.\n",
 "        if lamreg is None:\n",
 "            return err\n",
 "        else:\n",
 "            penalty = lamreg * np.linalg.norm(w)**2\n",
 "            return err + penalty\n",
 "\n",
 "\n",
 "    def l_tr(self, w, data, n_idx=None, lamreg=None):\n",
 "        if n_idx is None:\n",
 "            return self.l_imp(w=w, X=data.X_tr,\n",
 "                              C=self.C_tr,\n",
 "                              lamreg=lamreg)\n",
 "        else:\n",
 "            return self.l_imp(w=w, X=data.X_tr[n_idx,:],\n",
 "                              C=self.C_tr[n_idx,:],\n",
 "                              lamreg=lamreg)\n",
 "\n",
 "    def l_te(self, w, data, n_idx=None, lamreg=None):\n",
 "        if n_idx is None:\n",
 "            return self.l_imp(w=w, X=data.X_te,\n",
 "                              C=self.C_te,\n",
 "                              lamreg=lamreg)\n",
 "        else:\n",
 "            return self.l_imp(w=w, X=data.X_te[n_idx,:],\n",
 "                              C=self.C_te[n_idx,:],\n",
 "                              lamreg=lamreg)\n",
 "\n",
 "\n",
 "    def g_imp(self, w, X, C, lamreg=None):\n",
 "        '''\n",
 "        Implementation of the gradient of the loss function used in\n",
 "        multi-class logistic regression.\n",
 "\n",
 "        Input:\n",
 "        w is a (d x 1) matrix of weights.\n",
 "        X is a (k x numfeat) matrix of k observations.\n",
 "        C is a (k x nc) matrix giving a binarized encoding of the\n",
 "        class labels for each observation; each row a one-hot vector.\n",
 "        lamreg is a non-negative regularization parameter (or None\n",
 "        for no regularization).\n",
 "        NOTE: k can be anything (e.g., the training or test sample size).\n",
 "\n",
 "        Output:\n",
 "        A (k x d) matrix of gradients eval'd at k points.\n",
 "        '''\n",
 "\n",
 "        k, numfeat = X.shape\n",
 "        A = np.zeros((self.nc,k), dtype=np.float32)\n",
 "\n",
 "        # Get activations, with last row as zeros.\n",
 "        A[:-1,:] = w.reshape((self.nc-1, numfeat)).dot(X.T)\n",
 "\n",
 "        # Now convert activations to conditional probabilities.\n",
 "        maxes = np.max(A, axis=0) # largest score for each obs.\n",
 "        A = A - maxes\n",
 "        A = np.exp(A)\n",
 "        A = A / A.sum(axis=0) # (nc x k).\n",
 "\n",
 "        # Initialize a large matrix (k x d) to house per-point grads.\n",
 "        G = np.zeros((k,w.size), dtype=w.dtype)\n",
 "\n",
 "        for i in range(k):\n",
 "            # np.kron of two 1-D arrays returns a flat 1-D vector.\n",
 "            G[i,:] = np.kron(a=(A[:-1,i]-C[i,:-1]), b=X[i,:])\n",
 "            # Note the last class is dropped; its weights are fixed at zero.\n",
 "\n",
 "        if lamreg is None:\n",
 "            return G\n",
 "        else:\n",
 "            # Add the gradient of the squared-norm penalty to each row.\n",
 "            return G + lamreg*2*w.T\n",
 "\n",
 "\n",
 "    def g_tr(self, w, data, n_idx=None, lamreg=None):\n",
 "        if n_idx is None:\n",
 "            return self.g_imp(w=w, X=data.X_tr,\n",
 "                              C=self.C_tr,\n",
 "                              lamreg=lamreg)\n",
 "        else:\n",
 "            return self.g_imp(w=w, X=data.X_tr[n_idx,:],\n",
 "                              C=self.C_tr[n_idx,:],\n",
 "                              lamreg=lamreg)\n",
 "\n",
 "    def g_te(self, w, data, n_idx=None, lamreg=None):\n",
 "        if n_idx is None:\n",
 "            return self.g_imp(w=w, X=data.X_te,\n",
 "                              C=self.C_te,\n",
 "                              lamreg=lamreg)\n",
 "        else:\n",
 "            return self.g_imp(w=w, X=data.X_te[n_idx,:],\n",
 "                              C=self.C_te[n_idx,:],\n",
 "                              lamreg=lamreg)\n",
 "\n"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "__Exercises:__\n",
 "\n",
 "0. Explain what is being computed in each line of `classify`, `l_imp`, and `g_imp` above in `LogisticReg`.\n",
 "\n",
 "0. In the special case of just two classes, where $y \\in \\{0,1\\}$, the traditional model trains just one $d$-dimensional vector $w$, mapping $x \\mapsto f(w^{T}x) = P\\{y = 1 | x\\}$, where $f(u) = 1/(1+\\exp(-u))$ is the so-called logistic function. The gradient takes a very simple form in this case: try implementing a two-class logistic regression model object, without using the `np.kron` function. Numerically test to make sure the outputs of `g_imp` in your model and the general `LogisticReg` are the same.\n",
 "\n",
 "0. Why have we re-implemented `l_tr`, `l_te`, `g_tr`, `g_te`?\n",
 "\n",
 "0. What are we doing with `maxes`? Why are we doing this? For reference, given any vector $\\mathbf{u}$ and scalar $a$, the following identities hold.\n",
 "\n",
 "\\begin{align*}\n",
 "\\text{softmax}(\\mathbf{u}+a) & = \\text{softmax}(\\mathbf{u})\\\\\n",
 "\\log \\left( \\sum_{j} \\exp(u_{j}) \\right) & = a + \\log \\left( \\sum_{j} \\exp(u_{j}-a) \\right)\n",
 "\\end{align*}"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "___"
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }