{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Logistic Regression/Classification\n", "> This post covers the basic concept of Logistic Regression, which is widely used in classification tasks. It explains the hypothesis and the cost function, and how to minimize the cost with gradient descent, as we saw previously.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Tensorflow, Machine_Learning]\n", "- image: images/cross_entropy.png" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "import tensorflow as tf\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "\n", "plt.rcParams['figure.figsize'] = (16, 10)\n", "plt.rcParams['text.usetex'] = True\n", "plt.rc('font', size=15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification\n", "\n", "Classification is the task of assigning labels to data. If there are two possible labels, the task is called **binary classification**; if there are more than two, it is **multi-class classification**. In binary classification, the variable (or label) is either 0 or 1, or True or False. For example,\n", "\n", "- Exam: Pass or Fail\n", "- Spam: Not Spam or Spam\n", "- Face: Real or Fake\n", "- Tumor: Malignant or Benign (or Not Malignant)\n", "\n", "For the model to work with these labels, we must encode them numerically, for example as 0 and 1 in the binary case (one-hot encoding extends this to more than two classes).\n", "\n", "## Logistic Regression\n", "\n", "Unlike linear regression, **Logistic Regression** is a regression approach for handling classification problems. So what's the difference between Logistic Regression and Linear Regression? First, let's look at the type of data we care about.\n", "\n", "![logistic_linear](image/logistic_linear_diff.png)\n", "\n", "There are two types of data, discrete (counted) and continuous (measured). 
Roughly speaking, discrete data can be counted, so its values can be used directly as classes. Continuous data can only be measured, so its values cannot be used as classes directly.\n", "\n", "## Hypothesis Representation\n", "\n", "In linear regression, the output $Y$ is calculated as the product of the weight vector $\\theta$ and the observation $X$, plus the bias $b$. Mathematically,\n", "\n", "$$ H_{\\theta}(X) = \\theta^TX + b$$\n", "\n", "We can then define the cost function by measuring the error between the hypothesis and the actual data. But what about logistic regression? In a binary classification task, the output is only 0 or 1, so we cannot reuse the hypothesis and cost function from linear regression.\n", "\n", "We need a new function to classify the data correctly, and here is the new hypothesis.\n", "\n", "$$ H_{\\theta}(X) = g(\\theta^T X + b) $$\n", "\n", "Here $g(\\cdot)$ is a new function. If it outputs the probability that the label is True, we can compare that probability with a decision boundary and handle the classification problem with logistic regression.\n", "\n", "![logistic](image/logistic.png)\n", "\n", "## Sigmoid (Logistic) function\n", "\n", "This is where the sigmoid function (also called the logistic function) comes in. It is defined as\n", "\n", "$$ g(z) = \\frac{1}{1 + e^{-z}} $$\n", "\n", "where $z = \\theta^T X + b$, and its output always lies between 0 and 1. As $z$ goes from $-\\infty$ to $\\infty$,\n", "\n", "$$ \\lim_{z \\to -\\infty} g(z) = \\frac{1}{1 + \\infty} = 0, \\quad \\lim_{z \\to \\infty} g(z) = \\frac{1}{1 + 0} = 1 $$\n", "\n", "When $z = 0$, $g(z) = 0.5$, so it is reasonable to set the decision boundary at 0.5. If $g(z)$ is greater than 0.5, the data point is classified as positive; if it is less than 0.5, as negative. \n", "\n", "## Cost Function\n", "The cost function changes as well, because the cost function from linear regression is not well suited to measuring the difference between the hypothesis and the label. 
Intuitively, if $H_{\\theta}(x) = y$, the cost should be 0. To achieve this, we introduce a new type of cost function, [Cross-Entropy](https://en.wikipedia.org/wiki/Cross_entropy). Cross-entropy measures the average number of bits needed to identify an event. If the predicted probability of the event is $p$, it can be expressed as\n", "$$ \\text{C.E.} = -y\\log(p) - (1-y)\\log(1-p) $$\n", "\n", "so we can write the cross-entropy case by case in terms of $y$,\n", "\n", "$$ \\text{C.E.} = \\begin{cases} -\\log(p) & \\text{if } y = 1 \\\\ -\\log(1-p) & \\text{if } y=0 \\end{cases} $$\n", "\n", "Applying it as the cost function with $p = H_{\\theta}(x)$ gives\n", "\n", "$$ \\text{Cost}(H_{\\theta}(x), y) = -y \\log(H_{\\theta}(x)) - (1-y)\\log(1-H_{\\theta}(x)) $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the cost function." ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Cost as a function of the predicted probability, for y = 1 and y = 0\n", "x = np.arange(0.01, 1, 0.01)\n", "y_1 = -np.log(x)\n", "y_2 = -np.log(1-x)\n", "\n", "fig, ax = plt.subplots(1, 2, figsize=(16, 10))\n", "ax[0].plot(x, y_1, color='red', marker='s');\n", "ax[0].set_xlabel(r'$H_{\\theta}(x)$')\n", "ax[0].set_ylabel('Cost')\n", "ax[0].set_title('$y=1$')\n", "ax[1].plot(x, y_2, color='blue', marker='o');\n", "ax[1].set_xlabel(r'$H_{\\theta}(x)$')\n", "ax[1].set_ylabel('Cost')\n", "ax[1].set_title('$y=0$')\n", "\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Optimization\n", "\n", "Having defined the cost function with cross-entropy, we need to minimize it so that the model classifies the data correctly, and gradient descent can do this. As you saw before, it requires calculating the gradient.\n", "\n", "In practice, when you implement it with tensorflow, you don't need to calculate the gradient manually. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Logistic Regression in Tensorflow\n", "Let's build the Logistic Regression model with tensorflow. Before we begin, let's look at the sample data.\n", "\n", "We prepared sample data like this," ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Training data\n", "x_train = [[1., 2.], [2., 3.], [3., 1.], [4., 3.], [5., 3.], [6., 2.]]\n", "y_train = [[0.], [0.], [0.], [1.], [1.], [1.]]\n", "\n", "# Test data\n", "x_test = [[5., 2.]]\n", "y_test = [[1.]]\n", "\n", "x1 = [x[0] for x in x_train]\n", "x2 = [x[1] for x in x_train]\n", "\n", "# Visualize it (training points colored by label, test point in red)\n", "colors = [int(y[0] % 3) for y in y_train]\n", "plt.scatter(x1, x2, c=colors, marker='^')\n", "plt.scatter(x_test[0][0], x_test[0][1], c='red')\n", "\n", "plt.grid()\n", "plt.xlabel('x1')\n", "plt.ylabel('x2')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We defined the dataset as Python lists, but to handle it with tensorflow we need to convert it to tensors. Let's define the whole process." ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))\n", "\n", "# Initialize weight and bias\n", "W = tf.Variable(tf.zeros([2, 1]), name='weight')\n", "b = tf.Variable(tf.zeros([1]), name='bias')\n", "\n", "# Hypothesis: sigmoid applied to the linear model XW + b\n", "def sigmoid(X):\n", "    h = tf.divide(1., 1. + tf.exp(-(tf.matmul(X, W) + b)))\n", "    return h\n", "\n", "# Loss function (mean binary cross-entropy)\n", "def loss_fn(h, y):\n", "    cost = -tf.reduce_mean(y * tf.math.log(h) + (1 - y) * tf.math.log(1 - h))\n", "    return cost\n", "\n", "# Optimizer (Stochastic Gradient Descent)\n", "optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)\n", "\n", "# Accuracy function using the 0.5 decision boundary\n", "def accuracy_fn(h, y):\n", "    y_hat = tf.cast(h > 0.5, dtype=tf.float32)\n", "    # Cast to float so that reduce_mean does not truncate to an integer\n", "    accuracy = tf.reduce_mean(tf.cast(tf.equal(y_hat, y), dtype=tf.float32))\n", "    return accuracy\n", "\n", "# Gradient of the loss with respect to W and b\n", "def grad(X, y):\n", "    with tf.GradientTape() as tape:\n", "        h = sigmoid(X)\n", "        loss = loss_fn(h, y)\n", "    return tape.gradient(loss, [W, b])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With these functions, we can build the training step," ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch: 0, Loss: 0.6931\n", "Epoch: 100, Loss: 0.5781\n", "Epoch: 200, Loss: 0.5352\n", "Epoch: 300, Loss: 0.5056\n", "Epoch: 400, Loss: 0.4840\n", "Epoch: 500, Loss: 0.4673\n", "Epoch: 600, Loss: 0.4537\n", "Epoch: 700, Loss: 0.4421\n", "Epoch: 800, Loss: 0.4320\n", "Epoch: 900, Loss: 0.4229\n" ] } ], "source": [ "for e in range(1000):\n", "    # One full-batch gradient descent step per epoch\n", "    for x, y in dataset.batch(len(x_train)):\n", "        h = sigmoid(x)\n", "        grads = grad(x, y)\n", "        optimizer.apply_gradients(grads_and_vars=zip(grads, [W, b]))\n", "\n", "        if e % 100 == 0:\n", "            print('Epoch: {}, Loss: {:.4f}'.format(e, loss_fn(h, y)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that the cost decreases as training proceeds. The weight $W$ and bias $b$ are stored in memory, so we can evaluate the model on the test dataset."
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test Result = [[1]]\n", "Test Accuracy: 1.0000\n" ] } ], "source": [ "test_accuracy = accuracy_fn(sigmoid(x_test), y_test)\n", "print('Test Result = {}'.format(tf.cast(sigmoid(x_test) > 0.5, dtype=tf.int32)))\n", "print('Test Accuracy: {:.4f}'.format(test_accuracy))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the model correctly classified the test sample using the trained weight and bias." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Logistic Regression with diabetes classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at a more complicated dataset. The diabetes dataset, released by the [UCI ML repository](https://archive.ics.uci.edu/ml/datasets/diabetes), is a common dataset for classification. Let's apply the same approach to this dataset." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>pregnancies</th><th>glucose</th><th>diastolic</th><th>triceps</th><th>insulin</th><th>bmi</th><th>dpf</th><th>age</th><th>diabetes</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>6</td><td>148</td><td>72</td><td>35</td><td>0</td><td>33.6</td><td>0.627</td><td>50</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>85</td><td>66</td><td>29</td><td>0</td><td>26.6</td><td>0.351</td><td>31</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>8</td><td>183</td><td>64</td><td>0</td><td>0</td><td>23.3</td><td>0.672</td><td>32</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>89</td><td>66</td><td>23</td><td>94</td><td>28.1</td><td>0.167</td><td>21</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>137</td><td>40</td><td>35</td><td>168</td><td>43.1</td><td>2.288</td><td>33</td><td>1</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "   pregnancies  glucose  diastolic  triceps  insulin   bmi    dpf  age  \\\n", "0            6      148         72       35        0  33.6  0.627   50   \n", "1            1       85         66       29        0  26.6  0.351   31   \n", "2            8      183         64        0        0  23.3  0.672   32   \n", "3            1       89         66       23       94  28.1  0.167   21   \n", "4            0      137         40       35      168  43.1  2.288   33   \n", "\n", "   diabetes  \n", "0         1  \n", "1         0  \n", "2         1  \n", "3         0  \n", "4         1  " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./dataset/diabetes.csv')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, each variable has a different scale, so we need to standardize the data." ] },
{ "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(460, 8) (460, 1)\n", "[[-5.4791862e-01 -1.1859907e+00 -2.1224350e-01 ... 6.1015427e-01\n", " 4.7453222e-01 -7.8628618e-01]\n", " [ 4.6014335e-02 6.5895163e-02 5.6322277e-01 ... 9.4197702e-04\n", " -8.7209895e-02 6.4591348e-02]\n", " [ 4.6014335e-02 -3.4096771e-01 -1.6054575e-01 ... -1.1749996e-02\n", " -2.6465813e-03 -3.6084738e-01]\n", " ...\n", " [ 1.2338802e+00 2.1002097e+00 4.5982724e-01 ... 2.0189581e+00\n", " -1.0113661e+00 8.3038110e-01]\n", " [ 1.2338802e+00 -1.1546935e+00 2.5303626e-01 ... 8.0053312e-01\n", " -4.4928238e-02 4.9003014e-01]\n", " [-5.4791862e-01 -6.8523633e-01 4.6245251e-02 ... 
-1.4713213e+00\n", " -7.1539456e-01 -5.3102291e-01]]\n" ] } ], "source": [ "from sklearn.preprocessing import StandardScaler\n", "from sklearn.model_selection import train_test_split\n", "\n", "X = df.iloc[:, :-1].to_numpy()\n", "y = df.iloc[:, [-1]].to_numpy()\n", "\n", "X = X.astype(np.float32)\n", "y = y.astype(np.float32)\n", "\n", "# Standardize the data\n", "ss = StandardScaler()\n", "X = ss.fit_transform(X)\n", "\n", "# Split into training and test datasets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)\n", "\n", "print(X_train.shape, y_train.shape)\n", "print(X_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try!" ] },
{ "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(len(X_train))" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "W = tf.Variable(tf.random.normal((8, 1)), name='weight')\n", "b = tf.Variable(tf.random.normal((1,)), name='bias')" ] },
{ "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch: 0, Loss: 0.9894\n", "Epoch: 100, Loss: 0.8589\n", "Epoch: 200, Loss: 0.7700\n", "Epoch: 300, Loss: 0.7060\n", "Epoch: 400, Loss: 0.6584\n", "Epoch: 500, Loss: 0.6217\n", "Epoch: 600, Loss: 0.5928\n", "Epoch: 700, Loss: 0.5695\n", "Epoch: 800, Loss: 0.5504\n", "Epoch: 900, Loss: 0.5344\n", "Epoch: 1000, Loss: 0.5210\n", "Epoch: 1100, Loss: 0.5095\n", "Epoch: 1200, Loss: 0.4996\n", "Epoch: 1300, Loss: 0.4910\n", "Epoch: 1400, Loss: 0.4836\n", "Epoch: 1500, Loss: 0.4771\n", "Epoch: 1600, Loss: 0.4714\n", "Epoch: 1700, Loss: 0.4664\n", "Epoch: 1800, Loss: 0.4620\n", "Epoch: 1900, Loss: 0.4582\n" ] } ], "source": [ "for e in range(2000):\n", "    # The dataset is already batched above, so iterate over it directly\n", "    for x, y in dataset:\n", "        h = sigmoid(x)\n", "        grads = grad(x, y)\n", "        optimizer.apply_gradients(grads_and_vars=zip(grads, [W, b]))\n", "\n", "        if e % 100 == 0:\n", "            print('Epoch: {}, Loss: {:.4f}'.format(e, loss_fn(h, y)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After training, we can validate the model's performance on the test dataset. This time, we want to measure how many samples the model classifies correctly, so we compute the test accuracy manually." ] },
{ "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test Accuracy: 0.7468\n" ] } ], "source": [ "y_hat = tf.cast(sigmoid(X_test) > 0.5, dtype=tf.int32)\n", "# print('Test Result = {}'.format(tf.cast(sigmoid(X_test) > 0.5, dtype=tf.int32)))\n", "print('Test Accuracy: {:.4f}'.format(np.sum(y_test.reshape(-1) == y_hat.numpy().reshape(-1)) / y_test.shape[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "In this post, we covered the basic definition of logistic regression. Logistic regression is an approach to the classification task, so its hypothesis and cost function differ from those of linear regression. For the cost function, cross-entropy was introduced, and we implemented the whole process with tensorflow 2.x." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }