{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lecture 18: Softmax regression, multiclass classifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the last lecture we have learned the logistic regression to classify \"0\" digit or a \"1\" digit based on pixel intensities on a 28x28 grid.\n", "\n", "Today we will learn how to classify all 10 digits.\n", "\n", "Reference: adapted from the MATLAB tutorial in [Stanford Deep Learning tutorial](http://deeplearning.stanford.edu/tutorial/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from tqdm import tqdm_notebook\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# MNIST\n", "Let us load the [MNIST dataset of handwritten digits](http://yann.lecun.com/exdb/mnist/), both testing and training data. You can download the `npz` format file on Canvas file tab." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What does the data look like?\n", "* (If you have loaded the `csv` data from Kaggle digit recognizer competition) The first column of the sample data are the labels, and the rest 784 columns represent a 28x28 grayscale image. \n", "* If you have loaded the `npz` data (numpy native zip format), `X_train` and `X_test` both have 784 columns which represent a 28x28 grayscale image. `y_train` and `y_test`, which range from 0 to 9 total 10 classes, are the labels of the training samples, respectively." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_train = np.load('mnist_train.npz')\n", "X_train = data_train['X']\n", "y_train = data_train['y']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_test = np.load('mnist_test.npz')\n", "X_test = data_test['X']\n", "y_test = data_test['y']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# visualize the first 20 rows of the training data, with their labels.\n", "_, axes = plt.subplots(4,5, figsize=(16, 14))\n", "axes = axes.reshape(-1)\n", "\n", "# randomly choosing 20 samples to display\n", "idx = np.random.choice(60000, size=20)\n", "\n", "for i in range(20):\n", " axes[i].axis('off') # hide the axes ticks\n", " axes[i].imshow(X_train[idx[i],:].reshape(28,28), cmap = 'gray')\n", " axes[i].set_title(str(int(y_train[idx[i]])), color= 'black', fontsize=25)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Review: binary classification\n", "\n", "In logistic regression problem, for a certain labeled sample $(\\mathbf{x},y)$t hat is in class 0 (i.e., the image is a 0)\n", "* Ground truth: $ P(y=1) = 0$, $P(y=0) = 1 - P(y=1)=1$, its one-hot vector representation is $[0,1]$.\n", "* Hypothesis: after training, we use $h(\\mathbf{x};\\mathbf{w})$ to estimate the conditional probability $ P(y=1|\\mathbf{x})$, i.e., use features $\\mathbf{x}$ to predict the probability of $y=1$; and $1 - h(\\mathbf{x})$ to estimate $P(y=0|\\mathbf{x}) = 1 - P(y=1|\\mathbf{x})$. For a good model, $h(\\mathbf{x};\\mathbf{w}) \\approx 10^{-9}$, which is to say, we have trained a model that uses $[h(\\mathbf{x};\\mathbf{w}), 1- h(\\mathbf{x};\\mathbf{w})]$ to approximate the ground truth one-hot vector $[0,1]$." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Softmax regression\n", "\n", "Suppose that we know for a certain sample, its label $y\\in \\{1,\\dots, K\\}$ where $K$ is the number of classes. Feature vector $\\mathbf{x}\\in \\mathbb{R}^n$ (in MNIST it is a 28x28 image flattened to a `(784,)` array), we want to estimate the probability that $P(y=k|\\mathbf{x})$.\n", "\n", "----\n", "\n", "## Model (hypothesis)\n", "$$\n", "h(\\mathbf{x};W) =\n", "\\begin{pmatrix}\n", "P(y = 1 | \\mathbf{x}; \\mathbf{w}) \\\\\n", "P(y = 2 | \\mathbf{x}; \\mathbf{w}) \\\\\n", "\\vdots \\\\\n", "P(y = K | \\mathbf{x}; \\mathbf{w})\n", "\\end{pmatrix}\n", "=\n", "\\frac{1}{ \\sum_{j=1}^{K}{\\exp\\big(\\mathbf{w}_j^{\\top} \\mathbf{x}\\big) }}\n", "\\begin{pmatrix}\n", "\\exp(\\mathbf{w}_1^{\\top} \\mathbf{x} ) \\\\\n", "\\exp(\\mathbf{w}_2^{\\top} \\mathbf{x} ) \\\\\n", "\\vdots \\\\\n", "\\exp(\\mathbf{w}_K^{\\top} \\mathbf{x} ) \\\\\n", "\\end{pmatrix}.\n", "$$\n", "where we have $K$ sets of parameters, $\\mathbf{w}_1, \\mathbf{w}_2, \\dots, \\mathbf{w}_K$, and the factor $\\sum_{j=1}^{K}{\\exp\\big(\\mathbf{w}_j^{\\top} \\mathbf{x}\\big)}$ normalizes the results to be a probability.\n", "\n", "$W$ is an $n\\times K$ matrix containing all $K$ sets of parameters, obtained by concatenating $\\mathbf{w}_1, \\mathbf{w}_2, \\dots, \\mathbf{w}_K$ into columns, so that $\\mathbf{w}_k = (w_{k1}, \\dots, w_{kn})^{\\top} = (w_{kl})$ for $l = 1,\\dots, n$\n", "\n", "$$\n", "\\mathbf{w} = \\left(\n", "\\begin{array}{cccc}| & | & | & | \\\\\n", "\\mathbf{w}_1 & \\mathbf{w}_2 & \\cdots & \\mathbf{w}_K \\\\\n", "| & | & | & |\n", "\\end{array}\\right),\n", "$$\n", "and $W^{\\top}\\mathbf{x}$ would be sensible and vectorized to be computed.\n", "\n", "----\n", "\n", "## Loss function\n", "\n", "Define the following indicator function:\n", "$$\n", "1_{\\{y = k\\}} = 1_{\\{k\\}}(y) = \\delta_{yk} = \\begin{cases}\n", "1 & \\text{when } y = k,\n", "\\\\[5pt]\n", "0 & \\text{otherwise}.\n", "\\end{cases}\n", "$$\n", "\n", "Loss function is again using the cross entropy:\n", "\n", "$$\n", "\\begin{aligned}\n", "L (\\mathbf{w};X,\\mathbf{y}) & = - \\frac{1}{N}\\sum_{i=1}^N \\sum_{k=1}^K\n", "\\Bigl\\{ 1_{\\{y^{(i)} = k\\}} \\ln P\\big(y^{(i)}=k | \\mathbf{x}^{(i)} ; \\mathbf{w} \\big) \\Bigr\\}\n", "\\\\\n", " & = - \\frac{1}{N}\\sum_{i=1}^N \\sum_{k=1}^K\n", "\\left\\{1_{\\{y^{(i)} = k\\}} \\ln \\Bigg( \\frac{\\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})}{\\sum_{j=1}^{K} \n", "\\exp\\big(\\mathbf{w}_j^{\\top} \\mathbf{x}^{(i)} \\big) } \\Bigg)\\right\\}.\n", "\\end{aligned}\n", "$$\n", "Notice for every term in the sum w.r.t. 
, "\n", "----\n", "\n", "## Gradient descent\n", "Now the gradient of $L$ with respect to the whole $k$-th set of weights is:\n", "\n", "$$\n", "\\frac{\\partial L }{\\partial \\mathbf{w}_{k}}\n", "= - \\frac{1}{N}\\sum_{i=1}^N \n", "\\mathbf{x}^{(i)}\\left( 1_{\\{y^{(i)} = k\\} } - \\big(\\text{$k$-th component of } h(\\mathbf{x}^{(i)};W)\\big) \n", "\\right)\n", "=\n", "\\frac{1}{N}\\sum_{i=1}^N \n", "\\mathbf{x}^{(i)}\\left( \\frac{\\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})} {\\sum_{m=1}^{K} \n", "\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)} )} -1_{\\{y^{(i)} = k\\} } \n", "\\right).\n", "\\tag{$\\diamond$}\n", "$$\n", "\n", "One big challenge here is that the weights are now represented by a matrix.\n", "\n", "----\n", "\n", "## Prediction\n", "We take the class with the highest estimated probability as the sample's predicted label:\n", "$$\n", "\\hat{y} = \\operatorname{arg}\\max_{j} P\\big(y = j| \\mathbf{x}\\big).\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "N = len(y_train) # number of training samples\n", "n = np.shape(X_train)[1] # 784, which is the number of pixels (number of features)\n", "K = 10 # number of classes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(42)\n", "w = 1e-4*np.random.random(n*K) \n", "# w: a (7840,) array, a small random initial guess\n", "# 7840 = 784x10: 784 features, 10 classes\n", "# during computation it will be reshaped to an (n, K) = (784, 10) matrix, one column per set of weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# model sigma(X, w)\n", "# w: the 10 sets of weights, flattened into one array\n", "# X: training samples, of shape 60000 rows by 784 columns\n", "# output: (60000, 10), the i-th row contains the probabilities of the i-th sample\n", "# being in the k-th class (k-th column entry)\n", "def sigma(X,w):\n", "    W = w.reshape(n,K)\n", "    s = np.exp(np.matmul(X,W))\n", "    total = np.sum(s, axis=1).reshape(-1,1)\n", "    prob = s / total\n", "    return prob\n", "\n", "# loss function, averaged over N (size of the training data)\n", "# a vectorized implementation with a for loop of only 10 iterations\n", "def loss(w,X,y):\n", "    loss_components = np.zeros(N)\n", "    for k in range(K):\n", "        loss_components += np.log(sigma(X,w))[:,k] * (y == k)\n", "        # loss_components is an array of shape (60000,)\n", "    return -np.mean(loss_components) # same as -loss_components.sum()/N\n", "\n", "def gradient_loss(w,X,y):\n", "    gradient_for_each_weight_class = np.empty([n,K]) \n", "    # 10 columns, each column represents a gradient\n", "    for k in range(K):\n", "        gradient_for_all_training_data_for_class_k = (sigma(X,w)[:,k] - (y==k)).reshape(-1,1)*X\n", "        gradient_for_each_weight_class[:,k] = np.mean(gradient_for_all_training_data_for_class_k, axis=0)\n", "        # each column is a (784,) gradient averaged over all 60000 training samples\n", "    return gradient_for_each_weight_class.reshape(n*K) # flattened to a (7840,) array, matching the shape of w" ] }
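, { "cell_type": "markdown", "metadata": {}, "source": [ "Before training, it is a good habit to sanity check the implementation. The cell below (a quick check sketch, not part of the original lecture code) verifies that every row of `sigma(X_train, w)` sums to 1, and compares a few entries of `gradient_loss` against a central finite-difference approximation of `loss`; the particular indices checked are arbitrary choices." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sanity checks (a quick sketch, not part of the original lecture code)\n", "# 1. every row of sigma(X,w) should sum to 1\n", "print(np.allclose(np.sum(sigma(X_train, w), axis=1), 1.0))\n", "\n", "# 2. compare a few entries of gradient_loss against a central finite difference of loss\n", "grad = gradient_loss(w, X_train, y_train)\n", "eps = 1e-6\n", "for idx_check in [0, 3505, 4063]:   # arbitrary entries of w (the latter two touch central pixels)\n", "    w_plus, w_minus = w.copy(), w.copy()\n", "    w_plus[idx_check] += eps\n", "    w_minus[idx_check] -= eps\n", "    fd = (loss(w_plus, X_train, y_train) - loss(w_minus, X_train, y_train))/(2*eps)\n", "    print(idx_check, np.isclose(fd, grad[idx_check], rtol=1e-3, atol=1e-7))" ] }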
Cross-validation function\n", "For a fixed set of weights `sigma(w)` gives 10 probabilities for each sample (training or testing), here we want to implement a cross-validation function, takes input of `X_train` or `X_test`, compute the class label of the highest probability for each samples, and returns the accuracy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we define a checking accuracy function computing \n", "def check_acc(w,X,y):\n", " prob = sigma(X,w) # for each sample, it computes 10 probabilities based on current weight w\n", " highest_prob_index = np.argmax(prob, axis=1)\n", " return np.mean(highest_prob_index == y.astype(int))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "eta = 1e-5 # step size (learning rate)\n", "num_steps = 5\n", "\n", "for i in tqdm_notebook(range(num_steps)):\n", " dw = gradient_loss(w,X_train,y_train)\n", " w -= eta * dw\n", "\n", " print(\"Training accuracy after\", i+1, \"iterations is: \", check_acc(w,X_train,y_train))\n", " print(\"Testing accuracy after\", i+1, \"iterations is: \", check_acc(w,X_test,y_test))\n", " # keep track of training and testing accuracy just making sure we are in the right direction\n", " \n", "print(\"loss after\", i+1, \"iterations is: \", loss(w,X_train,y_train))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Slow, slow, slow\n", "Because our dataset is big. One iteration of the gradient descent requires evaluating the gradient for all the training samples, and it takes takes $O(N\\cdot d)$ cpu time ($N$: number of training samples, $d$:number of features in each sample). Stochastic gradient descent to remedy next time..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# scikit-learn\n", "\n", "We can use `scikit-learn`'s [`LogisticRegression()` class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) in the `linear_model` to perform this task for us. Quoting the reference:\n", "> In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the 'multi_class' option is set to 'ovr', and uses the cross- entropy loss if the 'multi_class' option is set to 'multinomial'. (Currently the 'multinomial' option is supported only by the 'lbfgs', 'sag' and 'newton-cg' solvers.)\n", "\n", "> For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones.\n", "For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle multinomial loss; 'liblinear' is limited to one-versus-rest schemes.\n", "'newton-cg', 'lbfgs' and 'sag' only handle L2 penalty, whereas 'liblinear' and 'saga' handle L1 penalty." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# copy the example\n", "from sklearn.linear_model import LogisticRegression\n", "mnist_clf = LogisticRegression(random_state=42, \n", " solver='lbfgs', tol= 1e-5, max_iter = 2000, verbose=True, \n", " multi_class='multinomial')\n", "# a faster solver is sag according to the reference\n", "# verbose is printing output during training (only applies to lbfgs as solver)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mnist_clf.fit(X_train[:10000,:], y_train[:10000]) # we only use first 10000 images as training data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mnist_clf.predict(X_test[:10, :])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we visualize the first 10 rows of X_test, see how the prediction goes\n", "# visualize the first 20 rows of the training data, with their labels.\n", "_, axes = plt.subplots(2,5, figsize=(16, 7))\n", "axes = axes.reshape(-1)\n", "\n", "for i in range(10):\n", " axes[i].axis('off') # hide the axes ticks\n", " axes[i].imshow(X_test[i,:].reshape(28,28), cmap = 'gray')\n", " axes[i].set_title(str(int(y_test[i])), color= 'black', fontsize=25)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Result\n", "Our softmax got 8 out 10 correct, not a bad score. Run the following cell will give you the prediction accuracy for the first 500 testing samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mnist_clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reading: more details about softmax regression\n", "
\n", "Softmax regression (or multinomial logistic regression) is a generalization of logistic regression to the case where we want to handle multiple classes. In logistic regression we assumed that the labels were binary: $y^{(i)}\\in\\{0,1\\}$. In the last lecture, we have used such a binary classifier to distinguish between two kinds of handwritten digits. Softmax regression allows us to handle $y^{(i)}\\in \\{1,\\dots, K\\}$ where $K$ is the number of classes.\n", "\n", "Given a test input $\\mathbf{x}\\in \\mathbb{R}^n$ (a 28x28 image flattened to a `(784,)` array), we want to estimate the probability that $P(y=k|\\mathbf{x})$ for each value of $k=1,\\dots,K$ using certain model (hypothesis). In other words, from the input image, we want to estimate the probability of this image being classified as each label among $K$ labels, and we choose the highest probable one to label this image as our prediction based on the model. Thus, for each sample, our model (hypothesis) will output a $K$-dimensional vector (whose elements sum to $1$ to make it a probability) giving us our $K$ estimated probabilities. Concretely, our model $h(\\mathbf{x}; W)$, which stands for given the current weights $W$ the probability vector for $\\mathbf{x}$, takes the form:\n", "\n", "$$\n", "h(\\mathbf{x};W) =\n", "\\begin{pmatrix}\n", "P(y = 1 | \\mathbf{x}; \\mathbf{w}) \\\\\n", "P(y = 2 | \\mathbf{x}; \\mathbf{w}) \\\\\n", "\\vdots \\\\\n", "P(y = K | \\mathbf{x}; \\mathbf{w})\n", "\\end{pmatrix}\n", "=\n", "\\frac{1}{ \\sum_{j=1}^{K}{\\exp\\big(\\mathbf{w}_j^{\\top} \\mathbf{x}\\big) }}\n", "\\begin{pmatrix}\n", "\\exp(\\mathbf{w}_1^{\\top} \\mathbf{x} ) \\\\\n", "\\exp(\\mathbf{w}_2^{\\top} \\mathbf{x} ) \\\\\n", "\\vdots \\\\\n", "\\exp(\\mathbf{w}_K^{\\top} \\mathbf{x} ) \\\\\n", "\\end{pmatrix}.\n", "$$\n", "Totally we have $K$ sets of parameters, $\\mathbf{w}_1, \\mathbf{w}_2, \\dots, \\mathbf{w}_K$, and the factor $\\sum_{j=1}^{K}{\\exp\\big(\\mathbf{w}_j^{\\top} \\mathbf{x}\\big)}$ normalizes the results to be a probability.\n", "\n", "When we implement the softmax regression, it is usually convenient to represent $W$ containing all $K$ sets of parameters as a $n\\times K$ matrix obtained by concatenating $\\mathbf{w}_1, \\mathbf{w}_2, \\dots, \\mathbf{w}_K$ into columns, so that $\\mathbf{w}_k = (w_{k1}, \\dots, w_{kn})^{\\top} = (w_{kl})$ for $l = 1,\\dots, n$\n", "\n", "$$\n", "\\mathbf{w} = \\left(\n", "\\begin{array}{cccc}| & | & | & | \\\\\n", "\\mathbf{w}_1 & \\mathbf{w}_2 & \\cdots & \\mathbf{w}_K \\\\\n", "| & | & | & |\n", "\\end{array}\\right),\n", "$$\n", "and $\\mathbf{w}^{\\top}\\mathbf{x}$ would be sensible and vectorized to be computed.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Loss functions Logistic vs Softmax\n", "Define the following indicator function:\n", "$$\n", "1_{\\{y = k\\}} = 1_{\\{k\\}}(y) = \\delta_{yk} = \\begin{cases}\n", "1 & \\text{when } y = k,\n", "\\\\[5pt]\n", "0 & \\text{otherwise}.\n", "\\end{cases}\n", "$$\n", "First let us recall the loss function for the logistic regression, and we rewrite it as: we have $N$ training samples $(\\mathbf{x}^{(i)}, y^{(i)})$\n", "$$\n", "\\begin{aligned}\n", "L^{\\text{Logistic}} (\\mathbf{w}) &= - \\frac{1}{N}\\sum_{i=1}^N \n", "\\Bigl\\{y^{(i)} \\ln\\big( h(\\mathbf{x}^{(i)}) \\big) \n", "+ (1 - y^{(i)}) \\ln\\big( 1 - h(\\mathbf{x}^{(i)}) \\big) \\Bigr\\}\n", "\\\\\n", "& = - \\frac{1}{N}\\sum_{i=1}^N \\sum_{k=0}^1\n", "\\Bigl\\{ 1_{\\{y^{(i)} = k\\}} \\ln P\\big(y^{(i)}=k | \\mathbf{x}^{(i)} ; 
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Loss functions: Logistic vs. Softmax\n", "Define the following indicator function:\n", "$$\n", "1_{\\{y = k\\}} = 1_{\\{k\\}}(y) = \\delta_{yk} = \\begin{cases}\n", "1 & \\text{when } y = k,\n", "\\\\[5pt]\n", "0 & \\text{otherwise}.\n", "\\end{cases}\n", "$$\n", "First let us recall the loss function for logistic regression. With $N$ training samples $(\\mathbf{x}^{(i)}, y^{(i)})$, we can rewrite it as:\n", "$$\n", "\\begin{aligned}\n", "L^{\\text{Logistic}} (\\mathbf{w}) &= - \\frac{1}{N}\\sum_{i=1}^N \n", "\\Bigl\\{y^{(i)} \\ln\\big( h(\\mathbf{x}^{(i)}) \\big) \n", "+ (1 - y^{(i)}) \\ln\\big( 1 - h(\\mathbf{x}^{(i)}) \\big) \\Bigr\\}\n", "\\\\\n", "& = - \\frac{1}{N}\\sum_{i=1}^N \\sum_{k=0}^1\n", "\\Bigl\\{ 1_{\\{y^{(i)} = k\\}} \\ln P\\big(y^{(i)}=k | \\mathbf{x}^{(i)} ; \\mathbf{w} \\big) \\Bigr\\}.\n", "\\end{aligned}\n", "$$\n", "\n", "Now our loss function for softmax regression is the generalization of the above:\n", "\n", "$$\n", "\\begin{aligned}\n", "L (\\mathbf{w}) = L^{\\text{Softmax}} (\\mathbf{w}) & = - \\frac{1}{N}\\sum_{i=1}^N \\sum_{k=1}^K\n", "\\Bigl\\{ 1_{\\{y^{(i)} = k\\}} \\ln P\\big(y^{(i)}=k | \\mathbf{x}^{(i)} ; \\mathbf{w} \\big) \\Bigr\\}\n", "\\\\\n", " & = - \\frac{1}{N}\\sum_{i=1}^N \\sum_{k=1}^K\n", "\\left\\{1_{\\{y^{(i)} = k\\}} \\ln \\Bigg( \\frac{\\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})}{\\sum_{j=1}^{K} \n", "\\exp\\big(\\mathbf{w}_j^{\\top} \\mathbf{x}^{(i)} \\big) } \\Bigg)\\right\\}.\n", "\\end{aligned}\n", "$$\n", "Notice that in the sum over the labels, $\\sum_{k=1}^K$, the indicator $1_{\\{y^{(i)} = k\\}}$ equals $1$ for exactly one of the $K$ terms, and the rest are $0$. The loss function above is the average of the cross-entropy of each sample:\n", "$$\n", "H(p,q)\\ =\\ -\\sum^{K}_{k=1}p_{k}\\log q_{k},\n", "$$\n", "where $p_{k}$ is the (known) ground truth probability that this sample is in the $k$-th class, and $q_{k}$ is our model's estimated/predicted probability." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Gradient of softmax loss function \n", "# (you might want to re-derive this on paper)\n", "\n", "Notice that the weights to be trained come in $K$ sets $\\mathbf{w}_1, \\mathbf{w}_2, \\dots, \\mathbf{w}_K$, and the $k$-th weight vector has $n$ components: $\\mathbf{w}_k = (w_{k1}, \\dots, w_{kl},\\dots, w_{kn})^{\\top}$. The first subscript is $1\\leq k \\leq K$ (the label's index; we have this many sets of weights), and the second subscript is $1\\leq l\\leq n$ (the feature index of $\\mathbf{x}$). \n", "\n", "The indices involved are pretty complicated. To simplify the notation a bit, denote the probability predicted by our model that the $i$-th training sample is in the $k$-th class as:\n", "\n", "$$\n", "\\sigma_{k}^{(i)}:= P\\big(y^{(i)}=k | \\mathbf{x}^{(i)} ; \\mathbf{w} \\big) = \n", "\\frac{\\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})} {\\sum_{m=1}^{K} \n", "\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)} )}\n", "$$\n", "then \n", "$$\n", "\\frac{\\partial L }{\\partial w_{jl}}\n", "= - \\frac{1}{N}\\sum_{i=1}^N \\sum_{k=1}^K \n", "\\left\\{ 1_{\\{y^{(i)} = k\\} } \\frac{\\partial}{\\partial w_{jl}}\\Big( \\ln \\sigma_{k}^{(i)}\\Big)\n", "\\right\\}\n", "= - \\frac{1}{N}\\sum_{i=1}^N \\sum_{k=1}^K \n", "\\left\\{ 1_{\\{y^{(i)} = k\\} } \\frac{1}{\\sigma_{k}^{(i)}}\\frac{\\partial}{\\partial w_{jl}} \\sigma_{k}^{(i)}\n", "\\right\\}.\n", "\\tag{$\\star$}\n", "$$\n", "\n", "Now computing the partial derivative above:\n", "\n", "$$\n", "\\begin{aligned}\n", "\\frac{\\partial \\sigma_{k}^{(i)}}{\\partial w_{jl}} \n", "&= \n", "\\frac{\\partial }{\\partial w_{jl}} \\left( \\frac{\\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})} {\\sum_{m=1}^{K} \n", "\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)} )}\\right)\n", "\\\\\n", "&= \\frac{1}{\\sum_{m=1}^{K}\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)} )}\n", "\\frac{\\partial }{\\partial w_{jl}} \\left( \\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})\\right)\n", "- \\frac{\\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})}\n", "{ \\left(\\sum_{m=1}^{K}\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)}) \\right)^2}\n", "\\frac{\\partial }{\\partial w_{jl}} \\left( \\sum_{m=1}^{K}\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)}) \\right)\n", "\\\\\n", "&= \\frac{1}{\\sum_{m=1}^{K}\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)} )}\n", "1_{\\{j = k\\}} \\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})\n",
"\\frac{\\partial }{\\partial w_{jl}} \\left( \\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)} \\right)\n", "- \\frac{\\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})}\n", "{ \\left(\\sum_{m=1}^{K}\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)}) \\right)^2}\n", "\\exp(\\mathbf{w}_j^{\\top} \\mathbf{x}^{(i)})\n", "\\frac{\\partial }{\\partial w_{jl}} \\left( \\mathbf{w}_j^{\\top} \\mathbf{x}^{(i)} \\right).\n", "\\end{aligned}\n", "\\tag{$\\dagger$}\n", "$$\n", "\n", "By the property of the indicator function, we have:\n", "\n", "$$\n", "1_{\\{j = k\\}} \\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})\n", "\\frac{\\partial }{\\partial w_{jl}} \\left( \\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)} \\right)\n", "= \\begin{cases}\n", "\\exp(\\mathbf{w}_j^{\\top} \\mathbf{x}^{(i)}) x_l^{(i)} & \\text{if } j=k,\n", "\\\\[3pt]\n", "0 & \\text{if }j\\neq k.\n", "\\end{cases}\n", "$$\n", "\n", "Hence, $(\\dagger)$ can be further written as:\n", "\n", "$$\n", "\\begin{aligned}\n", "\\frac{\\partial \\sigma_{k}^{(i)}}{\\partial w_{jl}} \n", "= \\frac{\\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})}\n", "{ \\sum_{m=1}^{K}\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)}) }\n", "\\left(\n", "1_{\\{j = k\\}} - \n", " \\frac{\\exp(\\mathbf{w}_j^{\\top} \\mathbf{x}^{(i)})}\n", "{ \\sum_{m=1}^{K}\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)}) }\n", "\\right) x^{(i)}_l\n", "= \\sigma_{k}^{(i)} \\left(\n", "1_{\\{j = k\\}} - \\sigma_{j}^{(i)} \\right) x^{(i)}_l.\n", "\\end{aligned}\n", "$$\n", "\n", "Now plugging this back to $(\\star)$:\n", "\n", "$$\n", "\\begin{aligned}\n", "\\frac{\\partial L }{\\partial w_{jl}}\n", "&= - \\sum_{i=1}^N \\sum_{k=1}^K \n", "\\left\\{ 1_{\\{y^{(i)} = k\\} } \n", "\\left(\n", "1_{\\{j = k\\}} - \\sigma_{j}^{(i)} \\right) x^{(i)}_l\n", "\\right\\}\n", "\\\\\n", "&=- \\sum_{i=1}^N \n", "x^{(i)}_l\\left\\{ \\sum_{k=1}^K \n", "1_{\\{y^{(i)} = k\\} } 1_{\\{j = k\\}} - \n", "\\sum_{k=1}^K 1_{\\{y^{(i)} = k\\} } \\sigma_{j}^{(i)} \n", "\\right\\}\n", "\\\\\n", "&=\n", "- \\sum_{i=1}^N \n", "x^{(i)}_l\\left( \n", "1_{\\{y^{(i)} = j\\} } - \\sigma_{j}^{(i)} \n", "\\right)\n", "\\\\\n", "& = \n", "- \\sum_{i=1}^N \n", "x^{(i)}_l\\Big( \n", "1_{\\{y^{(i)} = j\\} } - P\\big(y^{(i)}=j | \\mathbf{x}^{(i)} ; \\mathbf{w} \\big) \n", "\\Big).\n", "\\end{aligned}\n", "$$\n", "\n", "This is pretty simple, and it has a nice interpretation similar to the maximum likelihood function: the term in the parenthesis is the difference between the actual probability and the probability estimation in our model.\n", "\n", "Now the derivative of $L$ with respect the whole $k$-th set of weights is then:\n", "\n", "$$\n", "\\frac{\\partial L }{\\partial \\mathbf{w}_{k}}\n", "= - \\sum_{i=1}^N \n", "\\mathbf{x}^{(i)}\\left( 1_{\\{y^{(i)} = k\\} } - \\sigma_{k}^{(i)} \n", "\\right)\n", "=\n", "\\sum_{i=1}^N \n", "\\mathbf{x}^{(i)}\\left( \\frac{\\exp(\\mathbf{w}_k^{\\top} \\mathbf{x}^{(i)})} {\\sum_{m=1}^{K} \n", "\\exp(\\mathbf{w}_m^{\\top} \\mathbf{x}^{(i)} )} -1_{\\{y^{(i)} = k\\} } \n", "\\right).\n", "\\tag{$\\diamond$}\n", "$$" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, 
"eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }