{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Stochastic Gradient Descent (SGD)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
\n", " This notebook presents Stochastic Gradient Descent (SGD), an optimization algorithm widely used in machine learning. The example describes a single-layer neural network with logistic regression for breast cancer prediction. The model is derived analytically and implemented with the NumPy library to demonstrate the core functionality of training and testing. A PyTorch equivalent of the model is given further below for validation purposes.
\n", "
last update: 23/06/2024\n", "
\n", "
\n", " Author

\n", " \n", " \n", "





\n", " Christopher
Hahne, PhD
\n", "
\n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data acquisition\n", "\n", "For our classification example, we use real data from the [UCI ML Breast Cancer Wisconsin (Diagnostic) dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)). It consists of $N=569$ subjects with 2 classes (malignant and benign) defined as $y_i \\in \\{0,1\\}$ and $J=30$ measured attributes per subject $i$ given as a feature vector $\\mathbf{x}_i\\in\\mathbb{R}^{J\\times 1}$." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(426, 30) (426,) (143, 30) (143,)\n" ] } ], "source": [ "# import required packages\n", "import numpy as np\n", "from sklearn.datasets import load_breast_cancer\n", "from sklearn.model_selection import train_test_split\n", "\n", "# load data\n", "bc = load_breast_cancer()\n", "\n", "# normalize: shift by the per-feature minimum and scale by three standard deviations\n", "x_min = np.min(bc.data, 0)\n", "x_scale = 3*np.std(bc.data-x_min, 0)\n", "bc_norm = (bc.data-x_min) / x_scale\n", "\n", "# split data into training and validation set\n", "train_X, val_X, train_y, val_y = train_test_split(bc_norm, bc.target, random_state=42)\n", "class_labels = bc.target_names\n", "\n", "# print shapes\n", "print(train_X.shape, train_y.shape, val_X.shape, val_y.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gradient descent\n", "\n", "Stochastic Gradient Descent (SGD) is a variant of the gradient descent algorithm used to learn the weight parameters of models such as a multi-layer perceptron, and it is particularly useful for training on large data sets. The weights of the model are updated iteratively. 
Models trained using SGD are often found to generalize well to unseen data.\n", "\n", "### Optimization\n", "\n", "For a linear regression, a predicted value $\\hat{y}_i$ is a scalar given by $\\hat{y}_i=\\mathbf{x}_i^\\intercal\\mathbf{w}$ where a vector $\\mathbf{w}=\\left[w^{(1)}, w^{(2)}, \\dots, w^{(J)}\\right]^\\intercal$ consists of weights $w^{(j)}$ for each feature $j$ and the vector $\\mathbf{x}_i=\\left[x_i^{(1)}, x_i^{(2)}, \\dots, x_i^{(J)}\\right]^\\intercal$ represents a data sample with $J$ features while $i$ is the sample index from a set with a total number of $N$ samples. Note that we add a feature vector of $[1, 1, \\dots, 1] \\in \\mathbb{R}^{N}$ to the data set to embed and train the bias as variable $w^{(1)}$ instead of treating it as a separate variable. Our predicted class value $\\hat{y}$ is supposed to match its actual class $y$, for which a least-squares cost metric $(\\hat{y}-y)^2$ may be a reasonable choice. Similar to conventional optimization, SGD aims to minimize an objective function $L(\\mathbf{w})$, which may be defined as a mean squared error\n", "\n", "$$\n", "L(\\mathbf{w})=\\frac{1}{N}\\sum_{i=1}^N\\left(\\hat{y}_i-y_i\\right)^2=\\frac{1}{N}\\sum_{i=1}^N\\left(\\mathbf{x}_i^\\intercal\\mathbf{w}-y_i\\right)^2\n", "$$\n", "\n", "where $\\left(\\mathbf{x}_i^\\intercal\\mathbf{w}-y_i\\right)^2$ is the **loss function**. 
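As a minimal sketch of the objective above, the mean squared error can be evaluated for a whole data set in one vectorized expression, with the bias folded into the weight vector through a leading column of ones. The toy numbers and the names `mse_objective`, `X_raw`, `X`, `y` and `w` are illustrative assumptions and not part of this notebook's pipeline:

```python
import numpy as np

def mse_objective(X, y, w):
    # mean of the squared residuals (x_i^T w - y_i)^2 over all N samples
    residuals = X @ w - y
    return np.mean(residuals ** 2)

# toy data: N=3 samples with 2 raw features (hypothetical values)
X_raw = np.array([[0.5, 1.0], [1.5, 0.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])

# prepend a column of ones so that the first weight acts as the bias
X = np.c_[np.ones(X_raw.shape[0]), X_raw]
w = np.array([0.1, 0.2, 0.3])

loss = mse_objective(X, y, w)
```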
**Training** refers to an optimization problem in which the weights $\\mathbf{w}$ are adjusted to solve\n", "\n", "$$\n", "\\text{arg min}_{\\mathbf{w}} \\, L(\\mathbf{w})\n", "$$\n", "\n", "To achieve this, SGD inherits the **Gradient Descent** update rule at iteration $k$ (in multi-layer networks, the required gradients are computed via **back-propagation**), which reads\n", "\n", "$$\n", "\\mathbf{w}_{k+1} = \\mathbf{w}_k - \\gamma \\, \\nabla_{\\mathbf{w}_k} \\left(\\mathbf{x}_i^\\intercal\\mathbf{w}_k-y_i\\right)^2 \\, , \\, \\forall i\n", "$$\n", "\n", "where $\\gamma$ denotes the learning rate and $\\nabla_{\\mathbf{w}_k} \\left(\\mathbf{x}_i^\\intercal\\mathbf{w}_k-y_i\\right)^2$ is the gradient of the loss function with respect to the weights $\\mathbf{w}_k$. This gradient is obtained by\n", "\n", "$$\n", "\\nabla_{\\mathbf{w}_k} \\left(\\mathbf{x}_i^\\intercal\\mathbf{w}_k-y_i\\right)^2 = \\frac{\\partial \\left(\\mathbf{x}_i^\\intercal\\mathbf{w}_k-y_i\\right)^2}{\\partial \\mathbf{w}_k}\n", "=\n", "\\begin{bmatrix} \n", "\\frac{\\partial}{\\partial w_k^{(1)}} \\left(\\mathbf{x}_i^\\intercal\\mathbf{w}_k-y_i\\right)^2 \\\\\n", "\\frac{\\partial}{\\partial w_k^{(2)}} \\left(\\mathbf{x}_i^\\intercal\\mathbf{w}_k-y_i\\right)^2 \\\\\n", "\\vdots \\\\\n", "\\frac{\\partial}{\\partial w_k^{(J)}} \\left(\\mathbf{x}_i^\\intercal\\mathbf{w}_k-y_i\\right)^2 \\\\\n", "\\end{bmatrix}\n", "=\n", "\\begin{bmatrix} \n", "2 x_i^{(1)} \\left(\\mathbf{x}_i^\\intercal\\mathbf{w}_k-y_i\\right) \\\\\n", "2 x_i^{(2)} \\left(\\mathbf{x}_i^\\intercal\\mathbf{w}_k-y_i\\right) \\\\\n", "\\vdots \\\\\n", "2 x_i^{(J)} \\left(\\mathbf{x}_i^\\intercal\\mathbf{w}_k-y_i\\right) \\\\\n", "\\end{bmatrix}\n", "= 2\\mathbf{x}_i\\left(\\mathbf{x}_i^\\intercal\\mathbf{w}_k-y_i\\right)\n", "$$\n", "\n", "where $^\\intercal$ denotes the transpose. 
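One way to sanity-check the closed-form gradient derived above is to compare it against central finite differences of the per-sample squared error. The following is only a sketch under assumed names (`loss`, `analytic_grad`, `numeric_grad`) and random toy values, none of which appear elsewhere in this notebook:

```python
import numpy as np

def loss(w, x, y):
    # squared error of a single sample
    return (x @ w - y) ** 2

def analytic_grad(w, x, y):
    # closed-form gradient: 2 * x * (x^T w - y)
    return 2 * x * (x @ w - y)

def numeric_grad(w, x, y, eps=1e-6):
    # central finite differences, perturbing one weight at a time
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (loss(w + step, x, y) - loss(w - step, x, y)) / (2 * eps)
    return grad

# random toy sample and weights (hypothetical values)
rng = np.random.default_rng(0)
x, w, y = rng.normal(size=4), rng.normal(size=4), 1.0

g_analytic = analytic_grad(w, x, y)
g_numeric = numeric_grad(w, x, y)
max_abs_diff = np.max(np.abs(g_analytic - g_numeric))
```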
One pass through the entire data set, $\\forall i \\in \\{1, \\dots, N\\}$, is referred to as one *epoch*. The resulting weights $\\mathbf{w}$ typically improve when the optimization procedure sees the training data several times, so SGD sweeps through the entire data set for several epochs.\n", "\n", "### Mini-Batching\n", "\n", "A single epoch is often subdivided into bundled subsets of samples, so-called *batches*, of size $B$ where $1 < B < N$, which helps reduce the variance in each parameter update. The batch size can be chosen to be a power of two for better performance from available matrix multiplication libraries. In practice, one often determines how many training examples fit into GPU or main memory and uses that number as the batch size." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Implementation" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [], "source": [ "# learning rate is the step size as in classical gradient descent\n", "l_rate = 1e-3\n", "\n", "# epochs is the maximum number of passes through the training set\n", "epochs = 700\n", "\n", "# batch size is the number of samples used per weight update (limited by hardware memory)\n", "b_size = 2**4\n", "assert b_size <= train_X.shape[0]\n", "\n", "# insert column of ones as first feature entry to cover bias as a trainable parameter within the weight vector (instead of separate variable)\n", "train_X, val_X = np.c_[np.ones(train_X.shape[0]), train_X], np.c_[np.ones(val_X.shape[0]), val_X]\n", "\n", "# initialize weight vector such that it has one entry per input column (including the bias column)\n", "np.random.seed(1111)\n", "w = np.random.uniform(size=(train_X.shape[1],)) * 0.1\n", "\n", "# initialize a list to track the loss value for each epoch\n", "loss_list = []\n", "\n", "# batch composition\n", "def next_batch(X, y, b_size):\n", "\n", "    # loop over our dataset in mini-batches\n", "    for i in np.arange(0, X.shape[0], b_size):\n", "\n", "        # yield a tuple for current batch of data and labels\n", "        yield (X[i:i+b_size], y[i:i+b_size])\n", "\n", "for epoch in range(epochs+1):\n", "\n", "    # reset list of per-sample losses for this epoch\n", "    epoch_loss = []\n", "\n", "    # loop over data in batches\n", "    for (batch_X, batch_y) in next_batch(train_X, train_y, b_size):\n", "\n", "        # take dot product between current feature batch and weights\n", "        preds_y = np.dot(batch_X, w)\n", "\n", "        # compare prediction and true values\n", "        diff = preds_y - batch_y\n", "\n", "        # collect squared errors of current batch\n", "        epoch_loss.extend(diff**2)\n", "\n", "        # compute the gradient summed over the batch\n", "        gradient = 2 * np.dot(batch_X.T, diff)\n", "\n", "        # step against the gradient, scaled by the learning rate\n", "        w -= l_rate * gradient\n", "\n", "    # update loss list by taking average across all samples of the epoch\n", "    loss_list.append(np.mean(epoch_loss))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "fig = plt.figure(figsize=(15, 5))\n", "# plot loss function\n", "plt.plot(range(len(loss_list)), loss_list)\n", "plt.title('Training Loss', fontsize=14)\n", "plt.xlabel('Epoch #', fontsize=14)\n", "plt.ylabel('$L(\\mathbf{w})$', fontsize=14)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Validation\n", "\n", "### Activation function\n", "In logistic regression, we desire a classification label $\\mathring{y}$ that only has two possible values whereas, so far, our model employs linear combinations $\\hat{y}_i=\\mathbf{x}_i^\\intercal\\mathbf{w}$ that yield results in the range $\\hat{y}_i \\in (-\\infty, \\infty)$. Thus, we seek a continuous function that maps real numbers $\\hat{y}_i \\in \\mathbb{R}$ to the $(0,1)$ codomain. A function that satisfies this condition is the *sigmoid function*, also known as the *standard logistic function*, given by\n", "\n", "$$\n", "\\sigma(\\hat{y}_i)=\\frac{1}{1+\\exp(-\\hat{y}_i)}\n", "$$" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "sigmoid = lambda y: 1.0 / (1 + np.exp(-y))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The returned value $\\sigma_i=\\sigma(\\hat{y}_i) \\in (0,1)$ of the activation function is then assigned a predicted label $\\mathring{y}_i \\in \\{0,1\\}$, which is positive if $\\sigma_i$ reaches a threshold and negative otherwise, so that\n", "\n", "$$\n", "\\mathring{y}_i=\n", "\\begin{cases}\n", " 1, & \\text{if } \\sigma_i \\geq \\tau\\\\\n", " 0, & \\text{otherwise}\n", "\\end{cases}\n", "$$\n", "\n", "where $\\tau$ is an adjustable threshold scalar. Here, we estimate an ideal threshold via [Youden's method](https://en.wikipedia.org/wiki/Youden%27s_J_statistic)." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# threshold estimation\n", "from sklearn.metrics import roc_curve, auc\n", "fpr, tpr, thresholds = roc_curve(train_y, np.dot(train_X, w))\n", "tau = thresholds[np.argmax(tpr - fpr)]\n", "\n", "# compute predictions from test set\n", "pred_y = (sigmoid(np.dot(val_X, w)) >= tau).astype('uint8')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# compose confusion matrix\n", "from sklearn.metrics import confusion_matrix\n", "conf_mat = confusion_matrix(y_true=val_y, y_pred=pred_y)\n", "\n", "# print confusion matrix\n", "fig, ax2 = plt.subplots(1, 1, figsize=(15,5))\n", "ax2.matshow(conf_mat, cmap=plt.cm.Blues, alpha=0.3)\n", "for i in range(conf_mat.shape[0]):\n", " for j in range(conf_mat.shape[1]):\n", " ax2.text(x=j, y=i, s=conf_mat[i, j], va='center', ha='center', size='xx-large')\n", "\n", "ax2.set_title('Validation Confusion Matrix', fontsize=14)\n", "ax2.set_xlabel('Predictions $\\mathbf{\\hat{y}}$', fontsize=14)\n", "ax2.set_ylabel('Actuals $\\mathbf{y}$', fontsize=14)\n", "ax2.set_yticks([0, 1])\n", "ax2.set_xticks([0, 1])\n", "ax2.set_yticklabels(class_labels)\n", "ax2.set_xticklabels(class_labels)\n", "ax2.tick_params(top=False, bottom=True, labeltop=False, labelbottom=True)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loss animation" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from matplotlib.animation import FuncAnimation\n", "from IPython.display import HTML\n", "#plt.style.use('seaborn-notebook')\n", "\n", "# animated figure that plots the loss over time\n", "fig, ax = (plt.figure(figsize=(8, 5)), plt.axes())\n", "ax.set_xlabel('Epoch #', fontsize=16)\n", "ax.set_ylabel('Loss $F(\\mathbf{w})$', fontsize=16)\n", "ax.xaxis.set_tick_params(labelsize=12)\n", "ax.yaxis.set_tick_params(labelsize=12)\n", "line, = ax.semilogy(range(len(loss_list)), loss_list, lw=2)\n", "point, = ax.plot(0, np.nan, 'r', marker='.', markersize=14)\n", "plt.tight_layout()\n", "plt.close()\n", "# animation\n", "div = 50\n", "def animate(i):\n", " line.set_data(np.arange(len(loss_list))[:i*div], loss_list[:i*div])\n", " point.set_data(i*div, loss_list[i*div])\n", " return line, point\n", "\n", "anim = FuncAnimation(fig, animate, 
interval=div, frames=epochs//div+1)\n", "\n", "if False:\n", " anim.save('./img/sgd_anim.gif', writer='imagemagick', fps=5)\n", "\n", " HTML(anim.to_jshtml())" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "98.11 % Precision\n", "96.30 % Recall\n", "97.90 % Accuracy\n" ] } ], "source": [ "# TP/(TP+FP)\n", "precision = lambda conf_mat: conf_mat[0, 0]/(conf_mat[0, 0]+conf_mat[1, 0]) * 100\n", "print('%.2f %% Precision' % precision(conf_mat))\n", "\n", "# TP/(TP+FN)\n", "recall = lambda conf_mat: conf_mat[0, 0]/(conf_mat[0, 0]+conf_mat[0, 1]) * 100\n", "print('%.2f %% Recall' % recall(conf_mat))\n", "\n", "# (TP+TN)/ALL\n", "accuracy = lambda conf_mat: (conf_mat[0, 0]+conf_mat[1, 1])/np.sum(conf_mat) * 100\n", "print('%.2f %% Accuracy' % accuracy(conf_mat))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overfitting validation\n", "\n", "In statistical learning, we want to make sure that the classification performs equally well on test and training data. Therefore, we employ the Mean-Squared Error (MSE) given by\n", "$$\n", "\\text{MSE}(\\mathbf{\\hat{y}}, \\mathbf{y})=\\frac{1}{N}\\sum_{i=1}^N \\left(\\hat{y}_i-y_i\\right)^2\n", "$$\n", "on both sets while aiming for\n", "$\\text{MSE}(\\mathbf{\\hat{y}}_{\\text{test}}, \\mathbf{y}_{\\text{test}}) \\approx \\text{MSE}(\\mathbf{\\hat{y}}_{\\text{train}}, \\mathbf{y}_{\\text{train}})$. If this fails, it gives indication for either under- or overfitting of the trained weights $\\mathbf{w}$." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSEs are close enough with 0.2238 (test) and 0.2582 (train).\n" ] } ], "source": [ "# mean squared error\n", "MSE = lambda y, pred_y: np.round(sum((y-pred_y)**2)/len(y), 4)\n", "\n", "# compute predictions of test and training sets\n", "pred_val_y = np.round(sigmoid(np.dot(val_X, w))).astype('uint8')\n", "pred_train_y = np.round(sigmoid(np.dot(train_X, w))).astype('uint8')\n", "\n", "# compare MSEs (note that rtol=.95 is a very loose relative tolerance)\n", "val_MSE = MSE(val_y, pred_val_y)\n", "train_MSE = MSE(train_y, pred_train_y)\n", "res = np.isclose(val_MSE, train_MSE, rtol=.95)\n", "\n", "if res:\n", "    print('MSEs are close enough with %s (test) and %s (train).' % (val_MSE, train_MSE))\n", "else:\n", "    print('Potential over-/underfitting from MSEs with %s (test) and %s (train).' % (val_MSE, train_MSE))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PyTorch equivalent\n", "\n", "Neural network models are often defined by subclassing PyTorch's *Module* class, following the inheritance concept of Object-Oriented Programming (OOP). Variables of other types (e.g. NumPy arrays) have to be converted to torch tensors to enable automatic gradient computation. The model, optimizer (here SGD) and loss function are instantiated before training." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "epoch 0\n", " Train set - loss: 0.445, accuracy: 12.587\n", " Validation set - loss: 0.448, accuracy: 12.587\n", " \n", "epoch 100\n", " Train set - loss: 0.102, accuracy: 93.007\n", " Validation set - loss: 0.087, accuracy: 93.007\n", " \n", "epoch 200\n", " Train set - loss: 0.081, accuracy: 95.105\n", " Validation set - loss: 0.073, accuracy: 95.105\n", " \n", "epoch 300\n", " Train set - loss: 0.073, accuracy: 95.804\n", " Validation set - loss: 0.067, accuracy: 95.804\n", " \n", "epoch 400\n", " Train set - loss: 0.069, accuracy: 96.503\n", " Validation set - loss: 0.064, accuracy: 96.503\n", " \n", "epoch 500\n", " Train set - loss: 0.066, accuracy: 96.503\n", " Validation set - loss: 0.063, accuracy: 96.503\n", " \n", "epoch 600\n", " Train set - loss: 0.065, accuracy: 96.503\n", " Validation set - loss: 0.061, accuracy: 96.503\n", " \n", "epoch 700\n", " Train set - loss: 0.064, accuracy: 96.503\n", " Validation set - loss: 0.06, accuracy: 96.503\n", " \n" ] } ], "source": [ "import torch\n", "torch.manual_seed(1111)\n", "\n", "# define single layer model\n", "class SingleLayerNet(torch.nn.Module):\n", " def __init__(self, n_features):\n", " super(SingleLayerNet, self).__init__()\n", " self.linear = torch.nn.Linear(n_features, 1, bias=True)\n", " torch.nn.init.uniform_(self.linear.weight, 0, 1e-1)\n", " \n", " def forward(self, X):\n", " z = self.linear(X)\n", " return torch.squeeze(z, 1)\n", " \n", "# convert to torch data types\n", "train_Xt = torch.autograd.Variable(torch.FloatTensor(train_X))\n", "train_yt = torch.autograd.Variable(torch.FloatTensor(train_y))\n", "val_Xt = torch.autograd.Variable(torch.FloatTensor(val_X))\n", "val_yt = torch.autograd.Variable(torch.FloatTensor(val_y))\n", "\n", "# instantiate model and loss\n", "model = SingleLayerNet(n_features=train_Xt.shape[1])\n", "optimizer = 
torch.optim.SGD(model.parameters(), lr=l_rate)\n", "criterion = torch.nn.MSELoss()\n", "\n", "# training\n", "loss_list = []\n", "for epoch in range(epochs+1):\n", " # loop over data in batches\n", " for i_X, i_y in next_batch(train_Xt, train_yt, b_size):\n", " pred_y = model(i_X)\n", " loss = criterion(pred_y, i_y)\n", " optimizer.zero_grad()\n", " loss.backward()\n", " optimizer.step()\n", " # track loss\n", " loss_list.append(loss.item())\n", " if epoch % 100 == 0:\n", " y_val_pred = model(val_Xt)\n", " y_train_pred = model(train_Xt)\n", " val_loss = criterion(y_val_pred, val_yt)\n", " train_loss = criterion(y_train_pred, train_yt)\n", " pred_yt = sigmoid(y_val_pred.detach().numpy()) >= tau\n", " conf_mat = confusion_matrix(y_true=val_yt.numpy(), y_pred=pred_yt)\n", " val_acc = accuracy(conf_mat)\n", " train_acc = accuracy(conf_mat)\n", " print(\n", " f'''epoch {epoch}\n", " Train set - loss: {np.round(train_loss.detach().numpy().astype('float'), 3)}, accuracy: {np.round(train_acc, 3)}\n", " Validation set - loss: {np.round(val_loss.detach().numpy().astype('float'), 3)}, accuracy: {np.round(val_acc, 3)}\n", " ''')\n", "\n", "if False:\n", " torch.save(model, 'torch_sgd_model.pth')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# threshold estimation\n", "from sklearn.metrics import roc_curve, auc\n", "fpr, tpr, thresholds = roc_curve(train_yt, model(train_Xt).detach())\n", "tau = thresholds[np.argmax(tpr - fpr)]\n", "\n", "# compute predictions from test set\n", "pred_yt = sigmoid(y_val_pred.detach()) >= tau" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "conf_mat = confusion_matrix(y_true=val_yt, y_pred=pred_yt)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))\n", "\n", "# plot loss function\n", "ax1.plot(range(len(loss_list)), loss_list)\n", "ax1.set_title('Training Loss', fontsize=14)\n", "ax1.set_xlabel('Epoch #', fontsize=14)\n", "ax1.set_ylabel('$L(\\mathbf{w})$', fontsize=14)\n", "\n", "# print confusion matrix\n", "ax2.matshow(conf_mat, cmap=plt.cm.Blues, alpha=0.3)\n", "for i in range(conf_mat.shape[0]):\n", " for j in range(conf_mat.shape[1]):\n", " ax2.text(x=j, y=i, s=conf_mat[i, j], va='center', ha='center', size='xx-large')\n", "\n", "ax2.set_title('Validation Confusion Matrix', fontsize=14)\n", "ax2.set_xlabel('Predictions $\\mathbf{\\hat{y}}$', fontsize=14)\n", "ax2.set_ylabel('Actuals $\\mathbf{y}$', fontsize=14)\n", "ax2.set_yticks([0, 1])\n", "ax2.set_xticks([0, 1])\n", "ax2.set_yticklabels(class_labels)\n", "ax2.set_xticklabels(class_labels)\n", "ax2.tick_params(top=False, bottom=True, labeltop=False, labelbottom=True)\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }