{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Classifier models\n",
"\n",
"In our introductory materials related to data, model, and algorithm objects, we used a regression task as a representative concrete example. Here we complement that material with some implementations of simple classifier objects."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Contents:__\n",
"\n",
"- Classifier base class\n",
"- Multi-class logistic regression\n",
"\n",
"___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Classifier base class\n",
"\n",
"Here we put together `Classifier`, a base class for classification models, which naturally inherits `Model`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import models\n",
"\n",
"class Classifier(models.Model):\n",
" '''\n",
" Generic classifier model, an object with methods\n",
" for both training and evaluating classifiers.\n",
" '''\n",
"\n",
" def __init__(self, data=None, name=None):\n",
" super(Classifier, self).__init__(name=name)\n",
"\n",
" # If given data, collect information about labels.\n",
" if data is not None:\n",
" self.labels = self.get_labels(data=data) # all unique labels.\n",
" self.nc = self.labels.size # number of unique labels.\n",
"\n",
"\n",
" def onehot(self, y):\n",
" '''\n",
" A function for encoding y into a one-hot vector.\n",
" Inputs:\n",
" - y is a (k,1) array, taking values in {0,1,...,nc-1}.\n",
" '''\n",
" nc = self.nc\n",
" k = y.shape[0]\n",
" C = np.zeros((k,nc), dtype=y.dtype)\n",
"\n",
" for i in range(k):\n",
" j = y[i,0] # assumes y has only one column.\n",
" C[i,j] = 1\n",
"\n",
" return C\n",
" \n",
"\n",
" def get_labels(self, data):\n",
" '''\n",
" Get all the (unique) labels that appear in the data.\n",
" '''\n",
" A = (data.y_tr is None)\n",
" B = (data.y_te is None)\n",
"\n",
" if (A and B):\n",
" raise ValueError(\"No label data provided!\")\n",
" else:\n",
" if A:\n",
" out_labels = np.unique(data.y_te)\n",
" elif B:\n",
" out_labels = np.unique(data.y_tr)\n",
" else:\n",
" out_labels = np.unique(np.concatenate((data.y_tr,\n",
" data.y_te), axis=0))\n",
" count = out_labels.size\n",
" return out_labels.reshape((count,1))\n",
"\n",
"\n",
" def classify(self, X):\n",
" '''\n",
" Must be implemented by sub-classes.\n",
" '''\n",
" raise NotImplementedError\n",
"\n",
"\n",
" def class_perf(self, y_est, y_true):\n",
" '''\n",
" Given class label estimates and true values,\n",
" compute the fraction of correct classifications\n",
" made for each label, yielding typical binary\n",
" classification performance metrics.\n",
"\n",
" Input:\n",
" y_est and y_true are (k x 1) matrices of labels.\n",
"\n",
" Output:\n",
" Returns a dictionary with two components, (1) being\n",
" the fraction of correctly classified labels, and\n",
" (2) being a dict of per-label precison/recall/F1\n",
" scores. \n",
" '''\n",
" \n",
" # First, get the classification rate.\n",
" k = y_est.size\n",
" num_correct = (y_est == y_true).sum()\n",
" frac_correct = num_correct / k\n",
" frac_incorrect = 1.0 - frac_correct\n",
"\n",
" # Then, get precision/recall for each class.\n",
" prec_rec = { i:None for i in range(self.nc) } # initialize\n",
"\n",
" for c in range(self.nc):\n",
"\n",
" idx_c = (y_true == c)\n",
" idx_notc = (idx_c == False)\n",
"\n",
" TP = (y_est[idx_c] == c).sum()\n",
" FN = idx_c.sum() - TP\n",
" FP = (y_est[idx_notc] == c).sum()\n",
" TN = idx_notc.sum() - FP\n",
"\n",
" # Precision.\n",
" if (TP == 0 and FP == 0):\n",
" prec = 0\n",
" else:\n",
" prec = TP / (TP+FP)\n",
"\n",
" # Recall.\n",
" if (TP == 0 and FN == 0):\n",
" rec = 0\n",
" else:\n",
" rec = TP / (TP+FN)\n",
"\n",
" # F1 (harmonic mean of precision and recall).\n",
" if (prec == 0 or rec == 0):\n",
" f1 = 0\n",
" else:\n",
" f1 = 2 * prec * rec / (prec + rec)\n",
"\n",
" prec_rec[c] = {\"P\": prec,\n",
" \"R\": rec,\n",
" \"F1\": f1}\n",
"\n",
" return {\"rate\": frac_incorrect,\n",
" \"PRF1\": prec_rec}\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Exercises:__\n",
"\n",
"0. Explain the functions of `onehot()`, `get_labels()`, and `class_perf()`.\n",
"0. What does `class_perf` return?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Multi-class logistic regression\n",
"\n",
"The multi-class logistic regression model is among the simplest and most popular classification models used in practice. Let us sort out the key elements.\n",
"\n",
"- Data: instance-label pairs $(x,y)$ where $x \\in \\mathbb{R}^{d}$ and $y \\in \\{0,1,\\ldots,K\\}$. There are $K+1$ *classes* here.\n",
"\n",
"- Model: with controllable parameters $w_{0},w_{1},\\ldots,w_{K} \\in \\mathbb{R}^{d}$, we model class probabilities, conditioned on data $x$, as\n",
"\n",
"\\begin{align*}\n",
"P\\{y = j | x\\} = \\frac{\\exp(w_{j}^{T}x)}{\\sum_{k=0}^{K}\\exp(w_{k}^{T}x)}.\n",
"\\end{align*}\n",
"\n",
"- Loss: minimum negative log-likelihood $(-1) \\log P\\{y | x\\}$ evaluated over the sample (details below).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's make the loss expression a bit easier to compute by introducing some new notation. Assume that we are given an $n$-sized sample $(x_{1},y_{1}), \\ldots, (x_{n},y_{n})$. Then for $i=1,\\ldots,n$ and $j=0,\\ldots,K$ define\n",
"\n",
"\\begin{align*}\n",
"p_{ij} & = P\\{y_{i} = j | x_{i}\\}\\\\\n",
"c_{ij} & = I\\{y_{i} = j\\}.\n",
"\\end{align*}\n",
"\n",
"Collecting all the elements over index $j$, we get vectors of the form\n",
"\n",
"\\begin{align*}\n",
"p_{i} & = (p_{i0},\\ldots,p_{iK})\\\\\n",
"c_{i} & = (c_{i0},\\ldots,c_{iK}).\n",
"\\end{align*}\n",
"\n",
"Note that $c_{i}$ gives us a handy \"one-hot\" vector representation of $y_{i}$. Using this notation, the example-specific probability returned by our model is given by $p_{ij}^{c_{ij}}$. As a loss function, use the negative log-likelihood, defined as\n",
"\n",
"\\begin{align*}\n",
"L(w;x_{i},y_{i}) & = (-1)\\log \\prod_{j=0}^{K} p_{ij}^{c_{ij}}\\\\\n",
"& = (-1)\\sum_{j=0}^{K} c_{ij} \\log p_{ij}\\\\\n",
"& = (-1)\\sum_{j=0}^{K} c_{ij}\\left(w_{j}^{T} x_{i} - \\log\\left( \\sum_{k=0}^{K}\\exp(w_{k}^{T}x_{i}) \\right) \\right)\\\\\n",
"& = \\log\\left( \\sum_{k=0}^{K}\\exp(w_{k}^{T}x_{i}) \\right) - \\sum_{j=0}^{K} c_{ij}w_{j}^{T} x_{i}.\n",
"\\end{align*}\n",
"\n",
"Following usual ERM fashion, the goal is to minimize the empirical mean (here rescaled by $n$):\n",
"\n",
"\\begin{align*}\n",
"\\min_{w} \\sum_{i=1}^{n} L(w;x_{i},y_{i}) \\to \\hat{w}.\n",
"\\end{align*}\n",
"\n",
"As a function of $w$, this objective is differentiable and convex. To actually minimize this objective as a function of $w$, gradient descent is a popular approach. The gradient of $L(w;x,y)$ evaluated with respect to $w$, and evaluated at $(w,x_{i},y_{i})$ can be compactly written as\n",
"\n",
"\\begin{align*}\n",
"\\nabla L(w;x_{i},y_{i}) = (p_{i}-c_{i}) \\otimes x_{i}\n",
"\\end{align*}\n",
"\n",
"where the binary operator $\\otimes$ is the \"Kronecker product\" defined between two matrices. If $U$ is $a \\times b$ and $V$ is $c \\times d$, then their Kronecker product is\n",
"\n",
"\\begin{align*}\n",
"U \\otimes V =\n",
"\\begin{bmatrix}\n",
"u_{1,1}V & \\cdots & u_{1,b}V\\\\\n",
"\\vdots & \\ddots & \\vdots\\\\\n",
"u_{a,1}V & \\cdots & u_{a,b}V\n",
"\\end{bmatrix}\n",
"\\end{align*}\n",
"\n",
"where each $u_{i,j1}V$ is naturally a block matrix of shape $c \\times d$, and thus $U \\otimes V$ has shape $ac \\times bd$. In the special case of $\\nabla L(w;x_{i},y_{i})$ here, since $(p_{i}-c_{i})$ has shape $1 \\times (K+1)$ and $x_{i}$ has shape $1 \\times d$, the gradient has the shape $1 \\times d(K+1)$, as we would expect. The Kronecker product is implemented as a part of Numpy, and so implementing a logistic regression model is extremely straightforward.\n",
"\n",
"__Effective degrees of freedom:__ It is perfectly fine to set $w = (w_{0},\\ldots,w_{K}$ and control $d(k+1)$ parameters, but note that because probabilities must sum to one, and by the definition of the model output, if say we have already computed $w_{1},\\ldots,w_{K}$ and thus $p_{i1},\\ldots,p_{iK}$, then naturally $p_{i0} = 1 - \\sum_{j=1}^{K}p_{ij}$. Note that this is of course guaranteed by our model, which means that we could perfectly well fix $w_{0}=(0,\\ldots,0)$ and just optimize with respect to $dK$ parameters, rather than $d(K+1)$. This is reflected in our implementation of `LogisticReg`, a sub-class of `Classifier`, below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class LogisticReg(Classifier):\n",
" '''\n",
" Multi-class logistic regression model.\n",
" '''\n",
"\n",
" def __init__(self, data=None):\n",
" \n",
" # Given data info, load up the (X,y) data.\n",
" super(LogisticReg, self).__init__(data=data)\n",
" \n",
" # Convert original labels to a one-hot binary representation.\n",
" if data.y_tr is not None:\n",
" self.C_tr = self.onehot(y=data.y_tr)\n",
" if data.y_te is not None:\n",
" self.C_te = self.onehot(y=data.y_te)\n",
" \n",
" \n",
" def classify(self, w, X):\n",
" '''\n",
" Given learned weights (w) and a matrix of one or\n",
" more observations, classify them as {0,...,nc-1}.\n",
"\n",
" Input:\n",
" w is a (d x 1) matrix of weights.\n",
" X is a (k x numfeat) matrix of k observations.\n",
" NOTE: k can be anything, the training/test sample size.\n",
"\n",
" Output:\n",
" A vector of length k, housing labels in {0,...,nc-1}.\n",
" '''\n",
" \n",
" k, numfeat = X.shape\n",
" A = np.zeros((self.nc,k), dtype=np.float32)\n",
" \n",
" # Get activations, with last row as zeros.\n",
" A[:-1,:] = w.reshape((self.nc-1, numfeat)).dot(X.T)\n",
" \n",
" # Now convert activations to conditional probabilities.\n",
" maxes = np.max(A, axis=0) # largest score for each obs.\n",
" A = A - maxes\n",
" A = np.exp(A)\n",
" A = A / A.sum(axis=0) # (nc x k).\n",
" \n",
" # Assign classes with highest probability, (k x 1) array.\n",
" return A.argmax(axis=0).reshape((k,1))\n",
"\n",
"\n",
" def l_imp(self, w, X, C, lamreg=None):\n",
" '''\n",
" Implementation of the multi-class logistic regression\n",
" loss function.\n",
"\n",
" Input:\n",
" w is a (d x 1) matrix of weights.\n",
" X is a (k x numfeat) matrix of k observations.\n",
" C is a (k x nc) matrix giving a binarized encoding of the\n",
" class labels for each observation; each row a one-hot vector.\n",
" lam is a non-negative regularization parameter.\n",
" NOTE: k can be anything, the training/test sample size.\n",
"\n",
" Output:\n",
" A vector of length k with losses evaluated at k points.\n",
" '''\n",
" \n",
" k, numfeat = X.shape\n",
" A = np.zeros((self.nc,k), dtype=np.float64)\n",
" \n",
" # Get activations, with last row as zeros.\n",
" A[:-1,:] = w.reshape((self.nc-1, numfeat)).dot(X.T)\n",
" \n",
" # Raw activations of all the correct weights.\n",
" cvec = (A*C.T).sum(axis=0)\n",
" \n",
" # Compute the negative log-likelihoods.\n",
" maxes = np.max(A, axis=0)\n",
" err = (np.log(np.exp(A-maxes).sum(axis=0))+maxes)-cvec\n",
"\n",
" # Return the losses (all data points), with penalty if needed.\n",
" if lamreg is None:\n",
" return err\n",
" else:\n",
" penalty = lamreg * np.linalg.norm(W)**2\n",
" return err + penalty\n",
" \n",
" \n",
" def l_tr(self, w, data, n_idx=None, lamreg=None):\n",
" if n_idx is None:\n",
" return self.l_imp(w=w, X=data.X_tr,\n",
" C=self.C_tr,\n",
" lamreg=lamreg)\n",
" else:\n",
" return self.l_imp(w=w, X=data.X_tr[n_idx,:],\n",
" C=self.C_tr[n_idx,:],\n",
" lamreg=lamreg)\n",
" \n",
" def l_te(self, w, data, n_idx=None, lamreg=None):\n",
" if n_idx is None:\n",
" return self.l_imp(w=w, X=data.X_te,\n",
" C=self.C_te,\n",
" lamreg=lamreg)\n",
" else:\n",
" return self.l_imp(w=w, X=data.X_te[n_idx,:],\n",
" C=self.C_te[n_idx,:],\n",
" lamreg=lamreg)\n",
" \n",
" \n",
" def g_imp(self, w, X, C, lamreg=0):\n",
" '''\n",
" Implementation of the gradient of the loss function used in\n",
" multi-class logistic regression.\n",
"\n",
" Input:\n",
" w is a (d x 1) matrix of weights.\n",
" X is a (k x numfeat) matrix of k observations.\n",
" C is a (k x nc) matrix giving a binarized encoding of the\n",
" class labels for each observation; each row a one-hot vector.\n",
" lamreg is a non-negative regularization parameter.\n",
" NOTE: k can be anything, the training/test sample size.\n",
"\n",
" Output:\n",
" A (k x d) matrix of gradients eval'd at k points.\n",
" '''\n",
" \n",
" k, numfeat = X.shape\n",
" A = np.zeros((self.nc,k), dtype=np.float32)\n",
" \n",
" # Get activations, with last row as zeros.\n",
" A[:-1,:] = w.reshape((self.nc-1, numfeat)).dot(X.T)\n",
" \n",
" # Now convert activations to conditional probabilities.\n",
" maxes = np.max(A, axis=0) # largest score for each obs.\n",
" A = A - maxes\n",
" A = np.exp(A)\n",
" A = A / A.sum(axis=0) # (nc x k).\n",
" \n",
" # Initialize a large matrix (k x d) to house per-point grads.\n",
" G = np.zeros((k,w.size), dtype=w.dtype)\n",
"\n",
" for i in range(k):\n",
" # A very tall vector (i.e., just one \"axis\").\n",
" G[i,:] = np.kron(a=(A[:-1,i]-C[i,:-1]), b=X[i,:])\n",
" # Note we carefully remove the last elements.\n",
" \n",
" if lamreg is None:\n",
" return G\n",
" else:\n",
" return G + lamreg*2*w.T\n",
" \n",
"\n",
" def g_tr(self, w, data, n_idx=None, lamreg=None):\n",
" if n_idx is None:\n",
" return self.g_imp(w=w, X=data.X_tr,\n",
" C=self.C_tr,\n",
" lamreg=lamreg)\n",
" else:\n",
" return self.g_imp(w=w, X=data.X_tr[n_idx,:],\n",
" C=self.C_tr[n_idx,:],\n",
" lamreg=lamreg)\n",
" \n",
" def g_te(self, w, data, n_idx=None, lamreg=None):\n",
" if n_idx is None:\n",
" return self.g_imp(w=w, X=data.X_te,\n",
" C=self.C_te,\n",
" lamreg=lamreg)\n",
" else:\n",
" return self.g_imp(w=w, X=data.X_te[n_idx,:],\n",
" C=self.C_te[n_idx,:],\n",
" lamreg=lamreg)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Exercise:__\n",
"\n",
"0. Explain what is being computed in each line of `classify`, `l_imp`, and `g_imp` above in `LogisticReg`.\n",
"\n",
"0. In the special case of just two classes, where $y \\in \\{0,1\\}$, the traditional model says to train just one $d$-dimensional vector $w$, to map $x \\mapsto f(w^{T}x) = P\\{y = 1 | x\\}$, where $f(u) = 1/(1+\\exp(-u))$ is the so-called logistic function. The gradient takes a very simple form in this case: try implementing a two-class logistic regression model object, without using the `np.kron` function. Numerically test to make sure the outputs of `g_imp` in your model and the general `LogisticReg` are the same.\n",
"\n",
"0. Why have we re-implemented `l_tr`, `l_te`, `g_tr`, `g_te`?\n",
"\n",
"0. What are we doing with`maxes`? Why are we doing this? For reference, given any vector $\\mathbf{u}$ and scalar $a$, the following identities hold.\n",
"\n",
"\\begin{align*}\n",
"\\text{softmax}(\\mathbf{u}+a) & = \\text{softmax}(\\mathbf{u})\\\\\n",
"\\log \\left( \\sum_{j} \\exp(u_{j}) \\right) & = a + \\log \\left( \\sum_{j} \\exp(u_{j}-a) \\right)\n",
"\\end{align*}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}