{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "0babfbfaf6c894c68fcbe35cd6875873", "grade": false, "grade_id": "cell-9c30f1e63ac50295", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "# Part 2: Logistic Regression and Gradient Checking" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "69d943c7d8973d9297dd6f9c10a78f3a", "grade": false, "grade_id": "cell-2762c2a88e3706e1", "locked": true, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "# Execute this code block to install dependencies when running on colab\n", "try:\n", " import torch\n", "except:\n", " from os.path import exists\n", " from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag\n", " platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())\n", " cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\\.\\([0-9]*\\)\\.\\([0-9]*\\)$/cu\\1\\2/'\n", " accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'\n", "\n", " !pip install -q http://download.pytorch.org/whl/{accelerator}/torch-1.0.0-{platform}-linux_x86_64.whl torchvision" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "5ea8e79f2dd44017849977c9fd421c3c", "grade": false, "grade_id": "cell-c0a45f6dfbe9a832", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "In the first part of the lab we saw how to make predictions of continously varying values with a linear regression model. Lets now turn our focus to binary classification using a simple classification algorithm known as Logistic regression.\n", "\n", "In linear regression we tried to predict the value of $y$ for an example $\\mathbf{x}$ using a linear function $y=\\mathbf{x}^\\top\\theta$ (where $\\mathbf{x}$ and $\\theta$ are column-vectors). This will clearly not be a great solution for predicting binary-valued labels ($y\\in\\{0,1\\}$). In logistic regression we use a different hypothesis class to try to predict the probability that a given example belongs to the \"1\" class versus the probability that it belongs to the \"0\" class. Specifically, we will try to learn a function of the form:\n", "\n", "\\begin{align}\n", "P(y=1|\\mathbf{x}) &= \\frac{1}{1 + \\exp(-\\mathbf{x}^\\top\\theta)} \\equiv \\sigma(\\mathbf{x}^\\top\\theta),\\\\\n", "P(y=0|\\mathbf{x}) &= 1 - P(y=1|\\mathbf{x}).\n", "\\end{align}\n", " \n", "The function $\\sigma(z) \\equiv \\frac{1}{1 + \\exp(-z)}$ is called the \"sigmoid\" or \"logistic\" function. The sigmoid function squashes any real valued input into the range $[0,1]$ enabling us to interprete the output as a probability. Our goal is to search for a value of $\\theta$ so that the probability $P(y=1|\\mathbf{x})=\\sigma(\\mathbf{x}^\\top\\theta)$ is large when $\\mathbf{x}$ belongs to the \"1\" class and small when $\\mathbf{x}$ belongs to the \"0\" class (so that $P(y=0|\\mathbf{x})$ is large). \n", "\n", "With Linear Regression, the natural cost function was one that measured the sum of squared residuals (the difference between the predicted value and true value). With logisitic regression we have a probabilisitic model, so it makes sense that we use a function that measures the likelihood of the data given the model (note that we want to maximise this function rather than minimise it). As an aside, note that in the case of linear regression if we assume that the data has errors that are IID (independently and identically distributed) according to a Normal distribution, then it can be shown that the maximising the likelihood is exactly the same as minimising the sum of squared residuals. For logistic regression, the likelihood function for a single data point is:\n", "\n", "\\begin{align}\n", "p(y|\\mathbf{x}; \\theta) &= \\sigma(\\mathbf{x}^\\top\\theta)^y(1-\\sigma(\\mathbf{x}^\\top\\theta)^{(1-y)}.\n", "\\end{align}\n", "\n", "For a complete dataset of points $(y_i, \\mathbf{x}_i)$, then the complete likelihood is:\n", "\n", "\\begin{align}\n", "L(\\theta) &= \\prod_i \\sigma(\\mathbf{x}_i^\\top\\theta)^{y_i}(1-\\sigma(\\mathbf{x}_i^\\top\\theta)^{(1-y_i)}\n", "\\end{align}\n", "\n", "However, it is considerably easier to maximise the log-likelihood function:\n", "\n", "\\begin{align}\n", "\\mathcal{l}(\\theta) &= \\log L(\\theta) \\\\\n", " & = \\log \\prod_i \\sigma(\\mathbf{x}_i^\\top\\theta)^{y_i}(1-\\sigma(\\mathbf{x}_i^\\top\\theta)^{(1-y_i)} \\\\\n", " & = \\sum_i y_i \\log(\\sigma(\\mathbf{x}_i^\\top\\theta)) + (1-y_i) \\log(1-\\sigma(\\mathbf{x}_i^\\top\\theta))\n", "\\end{align}\n", "\n", "Clearly, maximising the log-likelihood is equivalent to minimising the negative log-likelihood. The negative of the log-likelihood function having the form $-\\sum_i y_i \\log(p) + (1-y_i) \\log(p)$, where p is a function returning the predicted probability of class \"1\", is often called the __\"Binary Cross Entropy\"__ function, __\"Binary Cross Entropy Loss\"__ or sometimes the __\"log loss\"__.\n", "\n", "For conciseness and computational efficiency, we can write the negative logistic regression log-likelihood function in matrix form. Assuming the $y_i$ are stored in a column vector $\\mathbf{y}$ and the data vectors $x_i$ in the __rows__ of a matrix $\\mathbf{X}$, then: \n", "\n", "\\begin{align}\n", "\\mathrm{NLL}(\\theta) & = -(\\mathbf{y}^\\top \\log(\\sigma(\\mathbf{X}\\theta)) + (1-\\mathbf{y})^\\top \\log(1-\\sigma(\\mathbf{X}\\theta)))\n", "\\end{align}\n", "\n", "The gradients of this function are given by:\n", "\n", "\\begin{align}\n", "\\nabla_\\theta \\mathrm{NLL}(\\theta) & = \\mathbf{X}^\\top(\\sigma(\\mathbf{X}\\theta) - \\mathbf{y})\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "1921cfe81a296b412a9e7e9225f256fc", "grade": false, "grade_id": "cell-2e920801eca3f37a", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "__Use the box below to compute the gradients of the negative log-likelihood function $\\nabla_\\theta \\mathrm{NLL}(\\theta)$. You can use `torch.sigmoid()` to apply the sigmoid function.__" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "2bf657d9115afca5f85c2a9913405227", "grade": false, "grade_id": "cell-1d05572289571209", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "import torch\n", "\n", "# we wouldn't normally do this, but for this lab we want to work in double precision\n", "# as we'll need the numerical accuracy later on for doing checks on our gradients:\n", "torch.set_default_dtype(torch.float64) \n", "\n", "def logistic_regression_loss_grad(theta, X, y):\n", " # YOUR CODE HERE\n", " raise NotImplementedError()\n", " return grad" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "e3c122bc0ea83661f4775974a44646e7", "grade": true, "grade_id": "cell-56edf61ac19106c5", "locked": true, "points": 4, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "theta = torch.zeros(1)\n", "X = torch.Tensor([[1]])\n", "y = torch.Tensor([[0]])\n", "assert(logistic_regression_loss_grad(theta, X, y) == 0.5)\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "2636e164d2cd4afd3fef15a1f9ab8afc", "grade": false, "grade_id": "cell-25aaf7d157a2e997", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "## Training a Logistic Regressor with real data\n", "\n", "We'll now try gradient descent using our gradient function on a real dataset from `scikit-learn` called `digits`. \n", "\n", "The `digits` dataset contains handwritten characters (much like the `MNIST` dataset that you may have heard of - we'll explore `MNIST` in a future lab). As logistic regression is a binary classifier, we'll just use the first 2 characters (0 and 1) from the `digits` dataset, and make our own training and test splits:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "6431a75f69c02547330e7775148df963", "grade": false, "grade_id": "cell-9c6e2f914cfddeb5", "locked": true, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "from sklearn.datasets import load_digits\n", "\n", "X, y = tuple(torch.Tensor(z) for z in load_digits(2, True)) #convert to pytorch Tensors\n", "X = torch.cat((X, torch.ones((X.shape[0], 1))), 1) # append a column of 1's to the X's\n", "y = y.reshape(-1, 1) # reshape y into a column vector\n", "\n", "# We're also going to break the data into a training set for computing the regression parameters\n", "# and a test set to evaluate the predictive ability of those parameters\n", "perm = torch.randperm(y.shape[0])\n", "X_train = X[perm[0:260], :]\n", "y_train = y[perm[0:260]]\n", "X_test = X[perm[260:], :]\n", "y_test = y[perm[260:]]" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "1e6f5d66c52a8d510614a26c83f9b43f", "grade": false, "grade_id": "cell-220b2d0a770d55b3", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "Now we have the data, we can use our loss function to try and estimate the optimal parameters for the two-digit classification problem. We'll use `PyTorch`s `torch.nn.functional.binary_cross_entropy_with_logits` function to print out the Binary Cross Entropy of the training data at each iteration, and of the test data once the optimisation is complete. \n", "\n", "Note: `logits` refers to unscaled probabilities before the sigmoid is applied, so in the `binary_cross_entropy_with_logits` function we just pass in $\\mathbf{X}\\theta$. `PyTorch` does also have a `torch.nn.binary_cross_entropy` method that takes in probabilities, however, as we'll see when implementing neural networks in a later lab, we'll often choose to work with logits as they provide better numerical stability thanks to the _log-sum-exp_ trick. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "1446595438554769359790f27164e8fc", "grade": false, "grade_id": "cell-30175a88ab1d4440", "locked": true, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "alpha = 0.001\n", "theta_gd = torch.rand((X_train.shape[1], 1))\n", "for e in range(0, 10):\n", " gr = logistic_regression_loss_grad(theta_gd, X_train, y_train)\n", " theta_gd -= alpha * gr\n", " print(\"Epoch:\", e, \" BCE of training data:\", torch.nn.functional.binary_cross_entropy_with_logits(X_train @ theta_gd, y_train))\n", "\n", "print(\"Gradient Descent Theta:\", theta_gd.t())\n", "print(\"BCE of test data:\", torch.nn.functional.binary_cross_entropy_with_logits(X_test @ theta_gd, y_test))" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "7c162297646c249bcf53e3dc9560df73", "grade": false, "grade_id": "cell-18607d78a082c3aa", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "What do you observe from running the above? Write your answer in the following block:" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "nbgrader": { "cell_type": "markdown", "checksum": "ec5b74769f52b33d845d70b4ccb3c468", "grade": true, "grade_id": "cell-d3044f0cbdfe7969", "locked": false, "points": 3, "schema_version": 3, "solution": true } }, "source": [ "YOUR ANSWER HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gradient Checking\n", "\n", "How can we be sure that our gradient function was correct? We might have made an error in the analytical derivation or in the implementation of that derivation into code. Even though we observed the optimisation process on real data converge (hopefully!), you might have made a subtle error in the implementation...\n", "\n", "So far we have worked with relatively simple algorithms where it is straightforward to compute the objective function and its gradient with pen-and-paper, and then implement the necessary computations in PyTorch. For more complex models that we will see later, the gradient computation can be notoriously difficult to debug and get right. Sometimes a subtly buggy implementation will manage to learn something that can look surprisingly reasonable (while performing less well than a correct implementation). Thus, even with a buggy implementation, it may not at all be apparent that anything is amiss. In this section, we describe a method for numerically checking the derivatives computed by your code to make sure that your implementation is correct. Carrying out the derivative checking procedure described here will significantly increase your confidence in the correctness of your code.\n", "\n", "Suppose we want to minimize $J(\\theta)$ as a function of $\\theta$. For this example, suppose $J:\\mathbb{R}\\mapsto\\mathbb{R}$, so that $\\theta \\in \\mathbb{R}$. If we are using gradient descent (or other gradient-based optimisation function), then we usually have implemented some function $g(\\theta)$ that purportedly computes $\\frac{d}{d\\theta}J(\\theta)$.\n", "\n", "How can we check if our implementation of $g$ is correct?\n", "\n", "Recall the mathematical definition of the derivative is:\n", "\n", "\\begin{align}\n", "\\frac{d}{d\\theta}J(\\theta) = \\lim_{\\epsilon \\rightarrow 0}\n", "\\frac{J(\\theta+ \\epsilon) - J(\\theta-\\epsilon)}{2 \\epsilon}.\n", "\\end{align}\n", "\n", "Thus, at any specific value of $\\theta$, we can numerically approximate the derivative as follows:\n", "\n", "\\begin{align}\n", "\\frac{J(\\theta+{\\rm EPSILON}) - J(\\theta-{\\rm EPSILON})}{2 \\times {\\rm EPSILON}}\n", "\\end{align}\n", " \n", "In practice, we set ${\\rm EPSILON}$ to a small constant, say around $10^{−4}$. (There is a large range of values of EPSILON values that should work well, but we don’t set ${\\rm EPSILON}$ to be \"extremely\" small, say $10^{−20}$, as that would lead to numerical roundoff errors.)\n", "\n", "Thus, given a function $g(\\theta)$ that is supposedly computing $\\frac{d}{d\\theta}J(\\theta)$, we can now numerically verify its correctness by checking that\n", "\n", "\\begin{align}\n", "g(\\theta) \\approx\n", "\\frac{J(\\theta+{\\rm EPSILON}) - J(\\theta-{\\rm EPSILON})}{2 \\times {\\rm EPSILON}}.\n", "\\end{align}\n", " \n", "The degree to which these two values should approximate each other will depend on the details of $J$. But assuming ${\\rm EPSILON}=10^{−4}$, you’ll usually find that the left- and right-hand sides of the above will agree to at least 4 significant digits (and often many more).\n", "\n", "Now, consider the case where $\\theta \\in \\mathbb{R}^n$ is a vector rather than a single real number (so that we have $n$ parameters that we want to learn), and $J: \\mathbb{R}^n \\mapsto \\mathbb{R}$. We now generalize our derivative checking procedure to the case where $\\theta$ may be a vector (as in our linear regression and logistic regression examples).\n", "\n", "Suppose we have a function $g_i(\\theta)$ that purportedly computes $\\frac{\\partial}{\\partial\\theta_i}J(\\theta)$; we’d like to check if $g_i$ is outputting correct derivative values. Let $\\textstyle \\theta^{(i+)} = \\theta + {\\rm EPSILON} \\times \\vec{e}_i$, where\n", "\n", "\\begin{align}\n", "\\vec{e}_i = \\begin{bmatrix}0 \\\\ 0 \\\\ \\vdots \\\\ 1 \\\\ \\vdots \\\\ 0\\end{bmatrix}\n", "\\end{align}\n", "\n", "is the $i$-th basis vector (a vector of the same dimension as $\\theta$, with a \"1\" in the $i$-th position and \"0\"s everywhere else). So, $\\theta^{(i+)}$ is the same as $\\theta$, except its $i$-th element has been incremented by ${\\rm EPSILON}$. Similarly, let $\\theta^{(i−)}=\\theta−{\\rm EPSILON} \\times \\vec{e}_i$ be the corresponding vector with the $i$-th element decreased by ${\\rm EPSILON}$.\n", "\n", "We can now numerically verify $g_i(\\theta)$'s correctness by checking, for each $i$, that:\n", "\n", "\\begin{align}\n", "g_i(\\theta) \\approx\n", "\\frac{J(\\theta^{(i+)}) - J(\\theta^{(i-)})}{2 \\times {\\rm EPSILON}}.\n", "\\end{align}\n", "\n", "### Gradient checker code\n", "\n", "The following code block contains an implementation of the gradient checking proceedure described above. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "e150bfbcf79176283150a9c1e1d567fc", "grade": false, "grade_id": "cell-e035a05e7b6b4f48", "locked": true, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "from random import randrange\n", "\n", "def grad_check(f, x, analytic_grad, num_checks=10, h=1e-5):\n", " sum_error = 0\n", " for i in range(num_checks):\n", " ix = tuple([randrange(m) for m in x.shape]) #randomly sample value to change\n", "\n", " oldval = x[ix].item()\n", " x[ix] = oldval + h # increment by h\n", " fxph = f(x) # evaluate f(x + h)\n", " x[ix] = oldval - h # increment by h\n", " fxmh = f(x) # evaluate f(x - h)\n", " x[ix] = oldval # reset\n", "\n", " grad_numerical = (fxph - fxmh) / (2 * h)\n", " grad_analytic = analytic_grad[ix]\n", " rel_error = abs(grad_numerical - grad_analytic) / (abs(grad_numerical) + abs(grad_analytic) + 1e-8)\n", " sum_error += rel_error\n", " print('numerical: %f\\tanalytic: %f\\trelative error: %e' % (grad_numerical, grad_analytic, rel_error))\n", " return sum_error / num_checks" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "895bbc8a80d3679843b7d6fd156d2dd3", "grade": false, "grade_id": "cell-465484e88bb39e62", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "To use the gradient checker, we provide our analytical gradients, together with a function that computes the actual loss (rather than the gradients of the loss) and the parameters at which the gradient was computed:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "2861f02065d7ae522e4e37116f582c1e", "grade": true, "grade_id": "cell-6505f3d2a5b382af", "locked": true, "points": 3, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "#we'll use random parameters:\n", "theta = torch.rand_like(theta_gd)*0.001\n", "# and compute the analytic gradient (w.r.t the test data we loaded in this case)\n", "grad = logistic_regression_loss_grad(theta, X_test, y_test)\n", "\n", "# we need a function that computes the loss for a given theta (and implicitly the data)\n", "def func(th):\n", " sigm = torch.sigmoid(X_test @ th)\n", " f = -(y_test.t() @ torch.log(sigm) + (1 - y_test.t()) @ torch.log(1 - sigm));\n", " return f\n", "\n", "# and run the gradient checker\n", "relerr = grad_check(func, theta, grad)\n", "print(\"average error:\", relerr)\n", "\n", "assert relerr < 1e-5\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "fcd74f63e94ee4ae005b1295e46a1f29", "grade": false, "grade_id": "cell-d2228b469bb3b0e0", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "Running the above, you should have a very small average error, and the relative error for each trial should also be a very small value." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 2 }