{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lecture 18: Softmax regression, multiclass classifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the last lecture we have learned the logistic regression to classify \"0\" digit or a \"1\" digit based on pixel intensities on a 28x28 grid.\n", "\n", "Today we will learn how to classify all 10 digits.\n", "\n", "Reference: adapted from the MATLAB tutorial in [Stanford Deep Learning tutorial](http://deeplearning.stanford.edu/tutorial/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from tqdm import tqdm_notebook\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# MNIST\n", "Let us load the [MNIST dataset of handwritten digits](http://yann.lecun.com/exdb/mnist/), both testing and training data. You can download the `npz` format file on Canvas file tab." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What does the data look like?\n", "* (If you have loaded the `csv` data from Kaggle digit recognizer competition) The first column of the sample data are the labels, and the rest 784 columns represent a 28x28 grayscale image. \n", "* If you have loaded the `npz` data (numpy native zip format), `X_train` and `X_test` both have 784 columns which represent a 28x28 grayscale image. `y_train` and `y_test`, which range from 0 to 9 total 10 classes, are the labels of the training samples, respectively." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_train = np.load('mnist_train.npz')\n", "X_train = data_train['X']\n", "y_train = data_train['y']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_test = np.load('mnist_test.npz')\n", "X_test = data_test['X']\n", "y_test = data_test['y']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# visualize the first 20 rows of the training data, with their labels.\n", "_, axes = plt.subplots(4,5, figsize=(16, 14))\n", "axes = axes.reshape(-1)\n", "\n", "# randomly choosing 20 samples to display\n", "idx = np.random.choice(60000, size=20)\n", "\n", "for i in range(20):\n", " axes[i].axis('off') # hide the axes ticks\n", " axes[i].imshow(X_train[idx[i],:].reshape(28,28), cmap = 'gray')\n", " axes[i].set_title(str(int(y_train[idx[i]])), color= 'black', fontsize=25)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Review: binary classification\n", "\n", "In logistic regression problem, for a certain labeled sample $(\\mathbf{x},y)$t hat is in class 0 (i.e., the image is a 0)\n", "* Ground truth: $ P(y=1) = 0$, $P(y=0) = 1 - P(y=1)=1$, its one-hot vector representation is $[0,1]$.\n", "* Hypothesis: after training, we use $h(\\mathbf{x};\\mathbf{w})$ to estimate the conditional probability $ P(y=1|\\mathbf{x})$, i.e., use features $\\mathbf{x}$ to predict the probability of $y=1$; and $1 - h(\\mathbf{x})$ to estimate $P(y=0|\\mathbf{x}) = 1 - P(y=1|\\mathbf{x})$. For a good model, $h(\\mathbf{x};\\mathbf{w}) \\approx 10^{-9}$, which is to say, we have trained a model that uses $[h(\\mathbf{x};\\mathbf{w}), 1- h(\\mathbf{x};\\mathbf{w})]$ to approximate the ground truth one-hot vector $[0,1]$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Softmax regression\n", "\n", "Suppose that we know for a certain sample, its label $y\\in \\{1,\\dots, K\\}$ where $K$ is the number of classes. Feature vector $\\mathbf{x}\\in \\mathbb{R}^n$ (in MNIST it is a 28x28 image flattened to a `(784,)` array), we want to estimate the probability that $P(y=k|\\mathbf{x})$.\n", "\n", "----\n", "\n", "## Model (hypothesis)\n", "$$\n", "h(\\mathbf{x};W) =\n", "\\begin{pmatrix}\n", "P(y = 1 | \\mathbf{x}; \\mathbf{w}) \\\\\n", "P(y = 2 | \\mathbf{x}; \\mathbf{w}) \\\\\n", "\\vdots \\\\\n", "P(y = K | \\mathbf{x}; \\mathbf{w})\n", "\\end{pmatrix}\n", "=\n", "\\frac{1}{ \\sum_{j=1}^{K}{\\exp\\big(\\mathbf{w}_j^{\\top} \\mathbf{x}\\big) }}\n", "\\begin{pmatrix}\n", "\\exp(\\mathbf{w}_1^{\\top} \\mathbf{x} ) \\\\\n", "\\exp(\\mathbf{w}_2^{\\top} \\mathbf{x} ) \\\\\n", "\\vdots \\\\\n", "\\exp(\\mathbf{w}_K^{\\top} \\mathbf{x} ) \\\\\n", "\\end{pmatrix}.\n", "$$\n", "where we have $K$ sets of parameters, $\\mathbf{w}_1, \\mathbf{w}_2, \\dots, \\mathbf{w}_K$, and the factor $\\sum_{j=1}^{K}{\\exp\\big(\\mathbf{w}_j^{\\top} \\mathbf{x}\\big)}$ normalizes the results to be a probability.\n", "\n", "$W$ is an $n\\times K$ matrix containing all $K$ sets of parameters, obtained by concatenating $\\mathbf{w}_1, \\mathbf{w}_2, \\dots, \\mathbf{w}_K$ into columns, so that $\\mathbf{w}_k = (w_{k1}, \\dots, w_{kn})^{\\top} = (w_{kl})$ for $l = 1,\\dots, n$\n", "\n", "$$\n", "\\mathbf{w} = \\left(\n", "\\begin{array}{cccc}| & | & | & | \\\\\n", "\\mathbf{w}_1 & \\mathbf{w}_2 & \\cdots & \\mathbf{w}_K \\\\\n", "| & | & | & |\n", "\\end{array}\\right),\n", "$$\n", "and $W^{\\top}\\mathbf{x}$ would be sensible and vectorized to be computed.\n", "\n", "----\n", "\n", "## Loss function\n", "\n", "Define the following indicator function:\n", "$$\n", "1_{\\{y = k\\}} = 1_{\\{k\\}}(y) = \\delta_{yk} = \\begin{cases}\n", "1 & \\text{when } y = k,\n", "\\\\[5pt]\n", "0 & \\text{otherwise}.\n", "\\end{cases}\n", "$$\n", "\n", "Loss function is again using the cross entropy:\n", "\n", "$$\n", "\\begin{aligned}\n", "L (\\mathbf{w};X,\\mathbf{y}) & = - \\frac{1}{N}\\sum_{i=1}^N \\sum_{k=1}^K\n", "\\Bigl\\{ 1_{\\{y^{(i)} = k\\}} \\ln P\\big(y^{(i)}=k | \\mathbf{x}^{(i)} ; \\mathbf{w} \\big) \\Bigr\\}\n", "\\\\\n", " & = - \\frac{1}{N}\\sum_{i=1}^N \\sum_{k=1}^K\n", "\\left\\{1_{\\{y^{(i)} = k\\}} \\ln \\Bigg( \\frac{\\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})}{\\sum_{j=1}^{K} \n", "\\exp\\big(\\mathbf{w}_j^{\\top} \\mathbf{x}^{(i)} \\big) } \\Bigg)\\right\\}.\n", "\\end{aligned}\n", "$$\n", "Notice for every term in the sum w.r.t. to the labels, $\\sum_{k=1}^K 1_{\\{y^{(i)} = k\\}} = 1$ for only one term among $K$ terms, and the rest is 0.\n", "\n", "For example, if a sample $(\\mathbf{x},y)$ is in the 5th class (representing digit 4), $y=5$, then its one-hot vector is \n", "$$[1_{\\{y^{(i)} = k\\}} ]_{k=1}^{10}= [0, 0, 0, 0, 1, 0, 0, 0, 0, 0].$$\n", "Hopefully, after training, the model above can predict something like:\n", "$$h(\\mathbf{x};W) = [0.009,\\; 0.01,\\; 0.01,\\; 0.009,\\; 0.91,\\;\n", " 0.009,\\; 0.009,\\; 0.01,\\; 0.009,\\; 0.010]$$\n", "\n", "----\n", "\n", "## Gradient descent\n", "Now the gradient of $L$ with respect the whole $k$-th set of weights is then:\n", "\n", "$$\n", "\\frac{\\partial L }{\\partial \\mathbf{w}_{k}}\n", "= - \\frac{1}{N}\\sum_{i=1}^N \n", "\\mathbf{x}^{(i)}\\left( 1_{\\{y^{(i)} = k\\} } - \\big(\\text{$k$-th component of } h(\\mathbf{x}^{(i)};W)\\big) \n", "\\right)\n", "=\n", "\\frac{1}{N}\\sum_{i=1}^N \n", "\\mathbf{x}^{(i)}\\left( \\frac{\\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})} {\\sum_{m=1}^{K} \n", "\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)} )} -1_{\\{y^{(i)} = k\\} } \n", "\\right).\n", "\\tag{$\\diamond$}\n", "$$\n", "\n", "One big challenge here is that the weights are now represented by a matrix.\n", "\n", "----\n", "\n", "## Prediction\n", "The biggest estimated probability's class as this sample's predicted label.\n", "$$\n", "\\hat{y} = \\operatorname{arg}\\max_{j} P\\big(y = j| \\mathbf{x}\\big),\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "N = len(y_train) # number of training samples\n", "n = np.shape(X_train)[1] # 784, which is number of pixels (number of features)\n", "K = 10 # number of classes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(42)\n", "w = 1e-4*np.random.random(n*K) \n", "# w: a (7840, ) array a small, random initial guess\n", "# 7840 = 784x10, 784 features, 10 classes\n", "# during computation it will be resized to total 10 columns standing for 10 sets of weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# model sigma(x; w) \n", "# w: 10 sets weights\n", "# X: training samples, of shape 60000 row by 784\n", "# output: (60000, 10), i-th row represent the probabilities of i-th sample\n", "# in the k-th class (k-th column entry)\n", "def sigma(X,w):\n", " W = w.reshape(n,K)\n", " s = np.exp(np.matmul(X,W))\n", " total = np.sum(s, axis=1).reshape(-1,1)\n", " prob = s / total\n", " return prob\n", "\n", "# loss function, modulo by N (size of training data)\n", "# a vectorized implementation with a for loop with only 10 iterations\n", "def loss(w,X,y):\n", " loss_components = np.zeros(N)\n", " for k in range(K):\n", " loss_components += np.log(sigma(X,w))[:,k] * (y == k)\n", " # above is a dimension (60000,) array\n", " return -np.mean(loss_components) # same with loss_components.sum()/N\n", "\n", "def gradient_loss(w,X,y):\n", " gradient_for_each_weight_class = np.empty([n,K]) \n", " # 10 columns, each column represent a graident\n", " for k in range(K):\n", " gradient_for_all_training_data_for_class_k = (sigma(X,w)[:,k] - (y==k)).reshape(-1,1)*X\n", " gradient_for_each_weight_class[:,k] = np.mean(gradient_for_all_training_data_for_class_k, axis=0)\n", " # we should return a (784,) array, which is averaging all 60000 training data\n", " return gradient_for_each_weight_class.reshape(n*K)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-validation function\n", "For a fixed set of weights `sigma(w)` gives 10 probabilities for each sample (training or testing), here we want to implement a cross-validation function, takes input of `X_train` or `X_test`, compute the class label of the highest probability for each samples, and returns the accuracy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we define a checking accuracy function computing \n", "def check_acc(w,X,y):\n", " prob = sigma(X,w) # for each sample, it computes 10 probabilities based on current weight w\n", " highest_prob_index = np.argmax(prob, axis=1)\n", " return np.mean(highest_prob_index == y.astype(int))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "eta = 1e-5 # step size (learning rate)\n", "num_steps = 5\n", "\n", "for i in tqdm_notebook(range(num_steps)):\n", " dw = gradient_loss(w,X_train,y_train)\n", " w -= eta * dw\n", "\n", " print(\"Training accuracy after\", i+1, \"iterations is: \", check_acc(w,X_train,y_train))\n", " print(\"Testing accuracy after\", i+1, \"iterations is: \", check_acc(w,X_test,y_test))\n", " # keep track of training and testing accuracy just making sure we are in the right direction\n", " \n", "print(\"loss after\", i+1, \"iterations is: \", loss(w,X_train,y_train))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Slow, slow, slow\n", "Because our dataset is big. One iteration of the gradient descent requires evaluating the gradient for all the training samples, and it takes takes $O(N\\cdot d)$ cpu time ($N$: number of training samples, $d$:number of features in each sample). Stochastic gradient descent to remedy next time..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# scikit-learn\n", "\n", "We can use `scikit-learn`'s [`LogisticRegression()` class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) in the `linear_model` to perform this task for us. Quoting the reference:\n", "> In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the 'multi_class' option is set to 'ovr', and uses the cross- entropy loss if the 'multi_class' option is set to 'multinomial'. (Currently the 'multinomial' option is supported only by the 'lbfgs', 'sag' and 'newton-cg' solvers.)\n", "\n", "> For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones.\n", "For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle multinomial loss; 'liblinear' is limited to one-versus-rest schemes.\n", "'newton-cg', 'lbfgs' and 'sag' only handle L2 penalty, whereas 'liblinear' and 'saga' handle L1 penalty." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# copy the example\n", "from sklearn.linear_model import LogisticRegression\n", "mnist_clf = LogisticRegression(random_state=42, \n", " solver='lbfgs', tol= 1e-5, max_iter = 2000, verbose=True, \n", " multi_class='multinomial')\n", "# a faster solver is sag according to the reference\n", "# verbose is printing output during training (only applies to lbfgs as solver)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mnist_clf.fit(X_train[:10000,:], y_train[:10000]) # we only use first 10000 images as training data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mnist_clf.predict(X_test[:10, :])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we visualize the first 10 rows of X_test, see how the prediction goes\n", "# visualize the first 20 rows of the training data, with their labels.\n", "_, axes = plt.subplots(2,5, figsize=(16, 7))\n", "axes = axes.reshape(-1)\n", "\n", "for i in range(10):\n", " axes[i].axis('off') # hide the axes ticks\n", " axes[i].imshow(X_test[i,:].reshape(28,28), cmap = 'gray')\n", " axes[i].set_title(str(int(y_test[i])), color= 'black', fontsize=25)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Result\n", "Our softmax got 8 out 10 correct, not a bad score. Run the following cell will give you the prediction accuracy for the first 500 testing samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mnist_clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reading: more details about softmax regression\n", "
\n", "Softmax regression (or multinomial logistic regression) is a generalization of logistic regression to the case where we want to handle multiple classes. In logistic regression we assumed that the labels were binary: $y^{(i)}\\in\\{0,1\\}$. In the last lecture, we have used such a binary classifier to distinguish between two kinds of handwritten digits. Softmax regression allows us to handle $y^{(i)}\\in \\{1,\\dots, K\\}$ where $K$ is the number of classes.\n", "\n", "Given a test input $\\mathbf{x}\\in \\mathbb{R}^n$ (a 28x28 image flattened to a `(784,)` array), we want to estimate the probability that $P(y=k|\\mathbf{x})$ for each value of $k=1,\\dots,K$ using certain model (hypothesis). In other words, from the input image, we want to estimate the probability of this image being classified as each label among $K$ labels, and we choose the highest probable one to label this image as our prediction based on the model. Thus, for each sample, our model (hypothesis) will output a $K$-dimensional vector (whose elements sum to $1$ to make it a probability) giving us our $K$ estimated probabilities. Concretely, our model $h(\\mathbf{x}; W)$, which stands for given the current weights $W$ the probability vector for $\\mathbf{x}$, takes the form:\n", "\n", "$$\n", "h(\\mathbf{x};W) =\n", "\\begin{pmatrix}\n", "P(y = 1 | \\mathbf{x}; \\mathbf{w}) \\\\\n", "P(y = 2 | \\mathbf{x}; \\mathbf{w}) \\\\\n", "\\vdots \\\\\n", "P(y = K | \\mathbf{x}; \\mathbf{w})\n", "\\end{pmatrix}\n", "=\n", "\\frac{1}{ \\sum_{j=1}^{K}{\\exp\\big(\\mathbf{w}_j^{\\top} \\mathbf{x}\\big) }}\n", "\\begin{pmatrix}\n", "\\exp(\\mathbf{w}_1^{\\top} \\mathbf{x} ) \\\\\n", "\\exp(\\mathbf{w}_2^{\\top} \\mathbf{x} ) \\\\\n", "\\vdots \\\\\n", "\\exp(\\mathbf{w}_K^{\\top} \\mathbf{x} ) \\\\\n", "\\end{pmatrix}.\n", "$$\n", "Totally we have $K$ sets of parameters, $\\mathbf{w}_1, \\mathbf{w}_2, \\dots, \\mathbf{w}_K$, and the factor $\\sum_{j=1}^{K}{\\exp\\big(\\mathbf{w}_j^{\\top} \\mathbf{x}\\big)}$ normalizes the results to be a probability.\n", "\n", "When we implement the softmax regression, it is usually convenient to represent $W$ containing all $K$ sets of parameters as a $n\\times K$ matrix obtained by concatenating $\\mathbf{w}_1, \\mathbf{w}_2, \\dots, \\mathbf{w}_K$ into columns, so that $\\mathbf{w}_k = (w_{k1}, \\dots, w_{kn})^{\\top} = (w_{kl})$ for $l = 1,\\dots, n$\n", "\n", "$$\n", "\\mathbf{w} = \\left(\n", "\\begin{array}{cccc}| & | & | & | \\\\\n", "\\mathbf{w}_1 & \\mathbf{w}_2 & \\cdots & \\mathbf{w}_K \\\\\n", "| & | & | & |\n", "\\end{array}\\right),\n", "$$\n", "and $\\mathbf{w}^{\\top}\\mathbf{x}$ would be sensible and vectorized to be computed.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Loss functions Logistic vs Softmax\n", "Define the following indicator function:\n", "$$\n", "1_{\\{y = k\\}} = 1_{\\{k\\}}(y) = \\delta_{yk} = \\begin{cases}\n", "1 & \\text{when } y = k,\n", "\\\\[5pt]\n", "0 & \\text{otherwise}.\n", "\\end{cases}\n", "$$\n", "First let us recall the loss function for the logistic regression, and we rewrite it as: we have $N$ training samples $(\\mathbf{x}^{(i)}, y^{(i)})$\n", "$$\n", "\\begin{aligned}\n", "L^{\\text{Logistic}} (\\mathbf{w}) &= - \\frac{1}{N}\\sum_{i=1}^N \n", "\\Bigl\\{y^{(i)} \\ln\\big( h(\\mathbf{x}^{(i)}) \\big) \n", "+ (1 - y^{(i)}) \\ln\\big( 1 - h(\\mathbf{x}^{(i)}) \\big) \\Bigr\\}\n", "\\\\\n", "& = - \\frac{1}{N}\\sum_{i=1}^N \\sum_{k=0}^1\n", "\\Bigl\\{ 1_{\\{y^{(i)} = k\\}} \\ln P\\big(y^{(i)}=k | \\mathbf{x}^{(i)} ; \\mathbf{w} \\big) \\Bigr\\}.\n", "\\end{aligned}\n", "$$\n", "\n", "Now our loss function for the softmax regression is then the generalization of above:\n", "\n", "$$\n", "\\begin{aligned}\n", "L (\\mathbf{w}) = L^{\\text{Softmax}} (\\mathbf{w}) & = - \\frac{1}{N}\\sum_{i=1}^N \\sum_{k=1}^K\n", "\\Bigl\\{ 1_{\\{y^{(i)} = k\\}} \\ln P\\big(y^{(i)}=k | \\mathbf{x}^{(i)} ; \\mathbf{w} \\big) \\Bigr\\}\n", "\\\\\n", " & = - \\frac{1}{N}\\sum_{i=1}^N \\sum_{k=1}^K\n", "\\left\\{1_{\\{y^{(i)} = k\\}} \\ln \\Bigg( \\frac{\\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})}{\\sum_{j=1}^{K} \n", "\\exp\\big(\\mathbf{w}_j^{\\top} \\mathbf{x}^{(i)} \\big) } \\Bigg)\\right\\}.\n", "\\end{aligned}\n", "$$\n", "Notice for every term in the sum w.r.t. to the labels, $\\sum_{k=1}^K$, $1_{\\{y^{(i)} = k\\}} = 1$ for only one term among $K$ terms, and the rest is 0. The loss function above is the average of the cross-entropy for each sample:\n", "$$\n", "H(p,q)\\ =\\ -\\sum^{K}_{k=1}p_{k}\\log q_{k},\n", "$$\n", "where $p_{k}$ is the ground truth probability of this sample is in $k$-th class (known), $q_{k}$ is our model's estimated/predicted probability." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Gradient of softmax loss function \n", "# (you might want to re-derive this on paper)\n", "\n", "Notice our weights to be trained have $K$ sets $\\mathbf{w}_1, \\mathbf{w}_2, \\dots, \\mathbf{w}_K$, and each of the $k$-th weights vector has $n$ components: $\\mathbf{w}_k = (w_{k1}, \\dots, w_{kl},\\dots, w_{kn})^{\\top}$. The first subscript is $1\\leq k \\leq K$ (label's index, we have this many set of weights), the second subscript is $1\\leq l\\leq n$ ($\\mathbf{x}$'s feature index). \n", "\n", "The indices involved are pretty complicated, to simplify the notation a bit, denote the probability predicted by our model of the $i$-th training sample being in the $k$-th class as:\n", "\n", "$$\n", "\\sigma_{k}^{(i)}:= P\\big(y^{(i)}=k | \\mathbf{x}^{(i)} ; \\mathbf{w} \\big) = \n", "\\frac{\\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})} {\\sum_{m=1}^{K} \n", "\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)} )}\n", "$$\n", "then \n", "$$\n", "\\frac{\\partial L }{\\partial w_{jl}}\n", "= - \\sum_{i=1}^N \\sum_{k=1}^K \n", "\\left\\{ 1_{\\{y^{(i)} = k\\} } \\frac{\\partial}{\\partial w_{jl}}\\Big( \\ln \\sigma_{k}^{(i)}\\Big)\n", "\\right\\}\n", "= - \\sum_{i=1}^N \\sum_{k=1}^K \n", "\\left\\{ 1_{\\{y^{(i)} = k\\} } \\frac{1}{\\sigma_{k}^{(i)}}\\frac{\\partial}{\\partial w_{jl}} \\sigma_{k}^{(i)}\n", "\\right\\}.\n", "\\tag{$\\star$}\n", "$$\n", "\n", "Now computing the partial derivative above:\n", "\n", "$$\n", "\\begin{aligned}\n", "\\frac{\\partial \\sigma_{k}^{(i)}}{\\partial w_{jl}} \n", "&= \n", "\\frac{\\partial }{\\partial w_{jl}} \\left( \\frac{\\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})} {\\sum_{m=1}^{K} \n", "\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)} )}\\right)\n", "\\\\\n", "&= \\frac{1}{\\sum_{m=1}^{K}\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)} )}\n", "\\frac{\\partial }{\\partial w_{jl}} \\left( \\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})\\right)\n", "- \\frac{\\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})}\n", "{ \\left(\\sum_{m=1}^{K}\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)}) \\right)^2}\n", "\\frac{\\partial }{\\partial w_{jl}} \\left( \\sum_{m=1}^{K}\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)}) \\right)\n", "\\\\\n", "&= \\frac{1}{\\sum_{m=1}^{K}\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)} )}\n", "1_{\\{j = k\\}} \\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})\n", "\\frac{\\partial }{\\partial w_{jl}} \\left( \\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)} \\right)\n", "- \\frac{\\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})}\n", "{ \\left(\\sum_{m=1}^{K}\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)}) \\right)^2}\n", "\\exp(\\mathbf{w}_j^{\\top} \\mathbf{x}^{(i)})\n", "\\frac{\\partial }{\\partial w_{jl}} \\left( \\mathbf{w}_j^{\\top} \\mathbf{x}^{(i)} \\right).\n", "\\end{aligned}\n", "\\tag{$\\dagger$}\n", "$$\n", "\n", "By the property of the indicator function, we have:\n", "\n", "$$\n", "1_{\\{j = k\\}} \\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})\n", "\\frac{\\partial }{\\partial w_{jl}} \\left( \\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)} \\right)\n", "= \\begin{cases}\n", "\\exp(\\mathbf{w}_j^{\\top} \\mathbf{x}^{(i)}) x_l^{(i)} & \\text{if } j=k,\n", "\\\\[3pt]\n", "0 & \\text{if }j\\neq k.\n", "\\end{cases}\n", "$$\n", "\n", "Hence, $(\\dagger)$ can be further written as:\n", "\n", "$$\n", "\\begin{aligned}\n", "\\frac{\\partial \\sigma_{k}^{(i)}}{\\partial w_{jl}} \n", "= \\frac{\\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})}\n", "{ \\sum_{m=1}^{K}\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)}) }\n", "\\left(\n", "1_{\\{j = k\\}} - \n", " \\frac{\\exp(\\mathbf{w}_j^{\\top} \\mathbf{x}^{(i)})}\n", "{ \\sum_{m=1}^{K}\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)}) }\n", "\\right) x^{(i)}_l\n", "= \\sigma_{k}^{(i)} \\left(\n", "1_{\\{j = k\\}} - \\sigma_{j}^{(i)} \\right) x^{(i)}_l.\n", "\\end{aligned}\n", "$$\n", "\n", "Now plugging this back to $(\\star)$:\n", "\n", "$$\n", "\\begin{aligned}\n", "\\frac{\\partial L }{\\partial w_{jl}}\n", "&= - \\sum_{i=1}^N \\sum_{k=1}^K \n", "\\left\\{ 1_{\\{y^{(i)} = k\\} } \n", "\\left(\n", "1_{\\{j = k\\}} - \\sigma_{j}^{(i)} \\right) x^{(i)}_l\n", "\\right\\}\n", "\\\\\n", "&=- \\sum_{i=1}^N \n", "x^{(i)}_l\\left\\{ \\sum_{k=1}^K \n", "1_{\\{y^{(i)} = k\\} } 1_{\\{j = k\\}} - \n", "\\sum_{k=1}^K 1_{\\{y^{(i)} = k\\} } \\sigma_{j}^{(i)} \n", "\\right\\}\n", "\\\\\n", "&=\n", "- \\sum_{i=1}^N \n", "x^{(i)}_l\\left( \n", "1_{\\{y^{(i)} = j\\} } - \\sigma_{j}^{(i)} \n", "\\right)\n", "\\\\\n", "& = \n", "- \\sum_{i=1}^N \n", "x^{(i)}_l\\Big( \n", "1_{\\{y^{(i)} = j\\} } - P\\big(y^{(i)}=j | \\mathbf{x}^{(i)} ; \\mathbf{w} \\big) \n", "\\Big).\n", "\\end{aligned}\n", "$$\n", "\n", "This is pretty simple, and it has a nice interpretation similar to the maximum likelihood function: the term in the parenthesis is the difference between the actual probability and the probability estimation in our model.\n", "\n", "Now the derivative of $L$ with respect the whole $k$-th set of weights is then:\n", "\n", "$$\n", "\\frac{\\partial L }{\\partial \\mathbf{w}_{k}}\n", "= - \\sum_{i=1}^N \n", "\\mathbf{x}^{(i)}\\left( 1_{\\{y^{(i)} = k\\} } - \\sigma_{k}^{(i)} \n", "\\right)\n", "=\n", "\\sum_{i=1}^N \n", "\\mathbf{x}^{(i)}\\left( \\frac{\\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})} {\\sum_{m=1}^{K} \n", "\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)} )} -1_{\\{y^{(i)} = k\\} } \n", "\\right).\n", "\\tag{$\\diamond$}\n", "$$" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }