{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Logistic Regression in plain Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In logistic regression, we model the outcome of a **binary variable** given a **linear combination of input features**. For example, we could try to predict the outcome of an election (win/lose) using information about how much money a candidate spent campaigning, how much time they spent campaigning, etc.\n", "\n", "### Model\n", "\n", "Logistic regression works as follows.\n", "\n", "**Given:**\n", "- a dataset $\\{(\\boldsymbol{x}^{(1)}, y^{(1)}), ..., (\\boldsymbol{x}^{(m)}, y^{(m)})\\}$ of $m$ training examples\n", "- with $\\boldsymbol{x}^{(i)}$ being a $d$-dimensional vector $\\boldsymbol{x}^{(i)} = (x^{(i)}_1, ..., x^{(i)}_d)$\n", "- and $y^{(i)}$ being a binary target variable, $y^{(i)} \\in \\{0,1\\}$\n", "\n", "The logistic regression model can be interpreted as a very **simple neural network:**\n", "- it has a real-valued weight vector $\\boldsymbol{w} = (w_1, ..., w_d)$\n", "- it has a real-valued bias $b$\n", "- it uses a sigmoid function as its activation function\n", "\n", "![title](figures/logistic_regression.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training\n", "\n", "Unlike [linear regression](linear_regression.ipynb), logistic regression has no closed-form solution. However, the cost function is convex, so we can train the model using gradient descent. Because a convex function has no local minima other than the global one, **gradient descent** cannot get stuck in a poor local minimum: with a small enough learning rate and enough training iterations, it approaches the optimal cost.\n", "\n", "Training a logistic regression model involves several steps. In the beginning (step 0) the parameters are initialized. 
The other steps are repeated for a specified number of training iterations or until the parameters converge.\n", "\n", "* * *\n", "**Step 0:** Initialize the weight vector and bias with zeros (or small random values).\n", "* * *\n", "\n", "**Step 1:** Compute a linear combination of the input features and weights. This can be done in one step for all training examples, using vectorization and broadcasting:\n", "\n", "$\\boldsymbol{a} = \\boldsymbol{X} \\cdot \\boldsymbol{w} + b$\n", "\n", "where $\\boldsymbol{X}$ is a matrix of shape $(n_{samples}, n_{features})$ that holds all training examples (so $n_{samples} = m$ and $n_{features} = d$), and $\\cdot$ denotes the dot product.\n", "* * *\n", "\n", "**Step 2:** Apply the sigmoid activation function, which returns values between 0 and 1:\n", "\n", "$\\boldsymbol{\\hat{y}} = \\sigma(\\boldsymbol{a}) = \\frac{1}{1 + \\exp(-\\boldsymbol{a})}$\n", "* * *\n", "\n", "**Step 3:** Compute the cost over the whole training set. We want the model output $\\hat{y}^{(i)}$ to represent the probability that the target value is 1. During training we therefore adapt the parameters so that the model outputs high values for examples with a positive label (true label being 1) and low values for examples with a negative label (true label being 0). This is reflected in the cost function:\n", "\n", "$J(\\boldsymbol{w},b) = - \\frac{1}{m} \\sum_{i=1}^m \\Big[ y^{(i)} \\log(\\hat{y}^{(i)}) + (1 - y^{(i)}) \\log(1 - \\hat{y}^{(i)}) \\Big]$\n", "* * *\n", "\n", "**Step 4:** Compute the gradient of the cost function with respect to the weight vector and bias. 
A detailed explanation of this derivation can be found [here](https://stats.stackexchange.com/questions/278771/how-is-the-cost-function-from-logistic-regression-derivated).\n", "\n", "The general formula is given by:\n", "\n", "$\\frac{\\partial J}{\\partial w_j} = \\frac{1}{m}\\sum_{i=1}^m\\left[\\hat{y}^{(i)}-y^{(i)}\\right]\\,x_j^{(i)}$\n", "\n", "For the bias, the same formula applies with the input $x_j^{(i)}$ replaced by 1:\n", "\n", "$\\frac{\\partial J}{\\partial b} = \\frac{1}{m}\\sum_{i=1}^m\\left[\\hat{y}^{(i)}-y^{(i)}\\right]$\n", "* * *\n", "\n", "**Step 5:** Update the weights and bias:\n", "\n", "$\\boldsymbol{w} = \\boldsymbol{w} - \\eta \\, \\nabla_w J$\n", "\n", "$b = b - \\eta \\, \\nabla_b J$\n", "\n", "where $\\eta$ is the learning rate." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2018-03-11T13:52:33.101260Z", "start_time": "2018-03-11T13:52:33.093590Z" } }, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.datasets import make_blobs\n", "import matplotlib.pyplot as plt\n", "np.random.seed(123)\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2018-03-11T13:52:34.810918Z", "start_time": "2018-03-11T13:52:34.493190Z" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# We will perform logistic regression using a simple toy dataset of two classes\n", "X, y_true = make_blobs(n_samples= 1000, centers=2)\n", "\n", "fig = plt.figure(figsize=(8,6))\n", "plt.scatter(X[:,0], X[:,1], c=y_true)\n", "plt.title(\"Dataset\")\n", "plt.xlabel(\"First feature\")\n", "plt.ylabel(\"Second feature\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2018-03-11T13:52:35.271152Z", "start_time": "2018-03-11T13:52:35.258996Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape X_train: (750, 2)\n", "Shape y_train: (750, 1)\n", "Shape X_test: (250, 2)\n", "Shape y_test: (250, 1)\n" ] } ], "source": [ "# Reshape targets to get column vector with shape (n_samples, 1)\n", "y_true = y_true[:, np.newaxis]\n", "# Split the data into a training and test set\n", "X_train, X_test, y_train, y_test = train_test_split(X, y_true)\n", "\n", "print(f'Shape X_train: {X_train.shape}')\n", "print(f'Shape y_train: {y_train.shape}')\n", "print(f'Shape X_test: {X_test.shape}')\n", "print(f'Shape y_test: {y_test.shape}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Logistic regression class" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2018-03-11T13:52:36.892670Z", "start_time": "2018-03-11T13:52:36.796510Z" } }, "outputs": [], "source": [ "class LogisticRegression:\n", " \n", " def __init__(self):\n", " pass\n", "\n", " def sigmoid(self, a):\n", " return 1 / (1 + np.exp(-a))\n", "\n", " def train(self, X, y_true, n_iters, learning_rate):\n", " \"\"\"\n", " Trains the logistic regression model on given data X and targets y\n", " \"\"\"\n", " # Step 0: Initialize the parameters\n", " n_samples, n_features = X.shape\n", " self.weights = np.zeros((n_features, 1))\n", " self.bias = 0\n", " costs = []\n", " \n", " for i in range(n_iters):\n", " # Step 
1 and 2: Compute a linear combination of the input features and weights, \n", " # apply the sigmoid activation function\n", " y_predict = self.sigmoid(np.dot(X, self.weights) + self.bias)\n", " \n", " # Step 3: Compute the cost over the whole training set.\n", " cost = (- 1 / n_samples) * np.sum(y_true * np.log(y_predict) + (1 - y_true) * (np.log(1 - y_predict)))\n", "\n", " # Step 4: Compute the gradients\n", " dw = (1 / n_samples) * np.dot(X.T, (y_predict - y_true))\n", " db = (1 / n_samples) * np.sum(y_predict - y_true)\n", "\n", " # Step 5: Update the parameters\n", " self.weights = self.weights - learning_rate * dw\n", " self.bias = self.bias - learning_rate * db\n", "\n", " costs.append(cost)\n", " if i % 100 == 0:\n", " print(f\"Cost after iteration {i}: {cost}\")\n", "\n", " return self.weights, self.bias, costs\n", "\n", " def predict(self, X):\n", " \"\"\"\n", " Predicts binary labels for a set of examples X.\n", " \"\"\"\n", " y_predict = self.sigmoid(np.dot(X, self.weights) + self.bias)\n", " y_predict_labels = [1 if elem > 0.5 else 0 for elem in y_predict]\n", "\n", " return np.array(y_predict_labels)[:, np.newaxis]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initializing and training the model" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2018-03-11T13:53:08.008432Z", "start_time": "2018-03-11T13:53:07.714396Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cost after iteration 0: 0.6931471805599453\n", "Cost after iteration 100: 0.046514002935609956\n", "Cost after iteration 200: 0.02405337743999163\n", "Cost after iteration 300: 0.016354408151412207\n", "Cost after iteration 400: 0.012445770521974634\n", "Cost after iteration 500: 0.010073981792906512\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "regressor = LogisticRegression()\n", "w_trained, b_trained, costs = regressor.train(X_train, y_train, n_iters=600, learning_rate=0.009)\n", "\n", "fig = plt.figure(figsize=(8,6))\n", "plt.plot(np.arange(600), costs)\n", "plt.title(\"Development of cost over training\")\n", "plt.xlabel(\"Number of iterations\")\n", "plt.ylabel(\"Cost\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Testing the model" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2018-03-11T13:53:48.726560Z", "start_time": "2018-03-11T13:53:48.714892Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train accuracy: 100.0%\n", "test accuracy: 100.0%\n" ] } ], "source": [ "y_p_train = regressor.predict(X_train)\n", "y_p_test = regressor.predict(X_test)\n", "\n", "print(f\"train accuracy: {100 - np.mean(np.abs(y_p_train - y_train)) * 100}%\")\n", "print(f\"test accuracy: {100 - np.mean(np.abs(y_p_test - y_test))}%\")" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }