{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "ws7NIiiATD1b", "slideshow": { "slide_type": "slide" } }, "source": [ "# ECE 046211- Technion - Deep Learning\n", "---\n", "\n", "#### Tal Daniel\n", "\n", "## Tutorial 02 - Single Neuron\n", "---" ] }, { "cell_type": "markdown", "metadata": { "id": "1HM7LX6ATD1d", "slideshow": { "slide_type": "slide" } }, "source": [ "### Agenda\n", "---\n", "\n", "* [Discriminative Models](#-Discriminative-Models)\n", "* [The Perceptron](#-The-Perceptron)\n", "* [Logistic Regression](#-Logistic-Regression)\n", " * [Logistic Regression with PyTorch](#-Logistic-Regression-with-PyTorch)\n", "* [Multi-Class (Multinomial) Logistic Regression - Softmax Regression](#-Multi-Class-(Multinomial)-Logistic-Regression---Softmax-Regression)\n", "* [Activation Functions](#-Activation-Functions)\n", "* [Recommended Videos](#-Recommended-Videos)\n", "* [Credits](#-Credits)" ] }, { "cell_type": "markdown", "metadata": { "id": "LedcT50PTD1e", "slideshow": { "slide_type": "skip" } }, "source": [ "#### Additional Packages for Google Colab\n", "----\n", "If you are using Google Colab, you have to install additional packages. To do this, simply run the following cell." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1O-9ASIdTD1e", "outputId": "9e4a777b-5537-47ea-abab-5efd32eed6d1", "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting torchviz\n", "\u001b[?25l Downloading https://files.pythonhosted.org/packages/8f/8e/a9630c7786b846d08b47714dd363a051f5e37b4ea0e534460d8cdfc1644b/torchviz-0.0.1.tar.gz (41kB)\n", "\r", "\u001b[K |████████ | 10kB 15.6MB/s eta 0:00:01\r", "\u001b[K |████████████████ | 20kB 20.8MB/s eta 0:00:01\r", "\u001b[K |███████████████████████▉ | 30kB 11.4MB/s eta 0:00:01\r", "\u001b[K |███████████████████████████████▉| 40kB 9.6MB/s eta 0:00:01\r", "\u001b[K |████████████████████████████████| 51kB 3.7MB/s \n", "\u001b[?25hRequirement already satisfied: torch in /usr/local/lib/python3.6/dist-packages (from torchviz) (1.7.0+cu101)\n", "Requirement already satisfied: graphviz in /usr/local/lib/python3.6/dist-packages (from torchviz) (0.10.1)\n", "Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from torch->torchviz) (0.16.0)\n", "Requirement already satisfied: dataclasses in /usr/local/lib/python3.6/dist-packages (from torch->torchviz) (0.8)\n", "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.6/dist-packages (from torch->torchviz) (3.7.4.3)\n", "Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from torch->torchviz) (1.19.5)\n", "Building wheels for collected packages: torchviz\n", " Building wheel for torchviz (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n", " Created wheel for torchviz: filename=torchviz-0.0.1-cp36-none-any.whl size=3522 sha256=6a736ad7c5fb8745a0b4912099f190f303604fe6b847bd2e0ab2e6141c6b5828\n", " Stored in directory: /root/.cache/pip/wheels/2a/c2/c5/b8b4d0f7992c735f6db5bfa3c5f354cf36502037ca2b585667\n", "Successfully built torchviz\n", "Installing collected packages: torchviz\n", "Successfully installed torchviz-0.0.1\n" ] } ], "source": [ "# to work locally (win/linux/mac), first install 'graphviz': https://graphviz.org/download/ and restart your machine\n", "!pip install torchviz" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "Pslwr4ogTD1e", "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# imports for the tutorial\n", "import numpy as np\n", "import pandas as pd\n", "import torch\n", "import torch.nn as nn\n", "import torchviz\n", "import matplotlib.pyplot as plt\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import Perceptron, LogisticRegression\n", "from sklearn.preprocessing import StandardScaler\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": { "id": "rrVeTId5TD1f", "slideshow": { "slide_type": "slide" } }, "source": [ "## Discriminative Models\n", "---\n", "* **Discriminative models** are a class of models used in statistical classification, especially in supervised machine learning. A discriminative classifier tries to build a model just by depending on the observed data while learning how to do the classification from the given statistics. 
\n", " * Compared to **generative models**, discriminative models make fewer assumptions on the distributions but depend heavily on the quality of the data.\n", " * For example, given a set of labeled pictures of dogs and rabbits, a discriminative model matches a new, unlabeled picture to the most similar labeled pictures and outputs the corresponding class label: dog or rabbit.\n", "* Typical discriminative learning approaches include Logistic Regression (LR), Support Vector Machines (SVMs), conditional random fields (CRFs) (specified over an undirected graph), and others." ] }, { "cell_type": "markdown", "metadata": { "id": "XHOJ_WQrTD1f", "slideshow": { "slide_type": "slide" } }, "source": [ "### The Perceptron\n", "---\n", "* One of the first and simplest linear models.\n", "* Based on a *linear threshold unit* (LTU): the input and output are numbers (not binary values), and each connection is associated with a weight.\n", "* The LTU computes a weighted sum of its inputs: $z = w_1x_1 + w_2x_2 + \\dots + w_nx_n = w^Tx$, and then it applies a **step function** to that sum and outputs the result: $$ h_w(x) = step(z) = step(w^Tx) $$\n" ] }, { "cell_type": "markdown", "metadata": { "id": "3CdUq31wTD1f", "slideshow": { "slide_type": "subslide" } }, "source": [ "* Illustration:\n", "
\n", "* The most common step function used is the *Heaviside step function*, but sometimes the *sign function* is used (as in the illustration)." ] }, { "cell_type": "markdown", "metadata": { "id": "nJfMnoAMTD1g", "slideshow": { "slide_type": "subslide" } }, "source": [ "* **Perceptron Training** draws inspiration from biological neurons: the connection weight between two neurons is increased whenever they have **the same output**. Perceptrons are trained by considering the error made.\n", " * At each iteration, the Perceptron is fed one training instance and makes a prediction for it.\n", " * For every output that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction.\n", " * Criterion: $ E^{perc}(w) = - \\sum_{i \\in D_{miss}}w^T(x^iy^i) $, where $D_{miss}$ is the set of misclassified training instances.\n", "* **Perceptron Learning Rule (weight update)**: $$ w_{t+1} = w_t +\\eta(y_i -sign(w_t^Tx_i))x_i $$\n", " * $\\eta$ is the learning rate (hyper-parameter).\n", "* The learned decision boundary is linear, so the Perceptron is incapable of learning complex (non-linear) patterns." ] }, { "cell_type": "markdown", "metadata": { "id": "qti-1OD6TD1g", "slideshow": { "slide_type": "subslide" } }, "source": [ "* **Perceptron Convergence Theorem**: If the training instances are **linearly separable**, the algorithm will converge to a solution.\n", " * **There can be multiple solutions (multiple hyperplanes)**.\n", "* Perceptrons do not output a class probability; they just make predictions based on a **hard threshold**." 
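, "\n", "* **Worked update (a sketch with illustrative numbers that are assumed here, not taken from the dataset)**: one step of the learning rule above, assuming labels $y_i \\in \\{-1,+1\\}$, $\\eta = 0.1$, $w_t = (0.5, -0.5)^T$, $x_i = (1, 2)^T$ and $y_i = +1$:\n", " $$ w_t^Tx_i = 0.5 \\cdot 1 - 0.5 \\cdot 2 = -0.5 \\Rightarrow sign(w_t^Tx_i) = -1 \\neq y_i $$\n", " $$ w_{t+1} = w_t + \\eta(y_i - sign(w_t^Tx_i))x_i = (0.5, -0.5)^T + 0.1 \\cdot 2 \\cdot (1, 2)^T = (0.7, -0.1)^T $$\n", " * After the step, $w_{t+1}^Tx_i = 0.7 - 0.2 = 0.5 > 0$, so this instance is now classified correctly. For an instance that is already classified correctly, $y_i - sign(w_t^Tx_i) = 0$ and the weights do not change."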
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* **Pseudocode**:\n", " * **Require**: Learning rate $\\eta$\n", " * **Require**: Initial parameter $w$\n", " * **While** stopping criterion not met **do**\n", " * For $i=1,...,m$:\n", " * $ w_{t+1} \\leftarrow w_t +\\eta(y_i -sign(w_t^Tx_i))x_i $\n", " * $t \\leftarrow t + 1$\n", " * **end while**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 380 }, "id": "RcEnElbETD1g", "outputId": "d7f9afa5-fc91-45d9-b5bc-2a0d7ba62ad3", "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
[Rendered HTML table: 10 sampled rows × 33 columns of the cancer dataset; the same values appear in the text/plain output of this cell.]
" ], "text/plain": [ " id diagnosis radius_mean texture_mean perimeter_mean \\\n", "351 899667 M 15.750 19.22 107.10 \n", "27 852781 M 18.610 20.25 122.10 \n", "568 92751 B 7.760 24.54 47.92 \n", "535 919555 M 20.550 20.86 137.80 \n", "497 914580 B 12.470 17.31 80.45 \n", "358 9010333 B 8.878 15.49 56.74 \n", "75 8610404 M 16.070 19.65 104.10 \n", "464 911320502 B 13.170 18.22 84.28 \n", "323 895100 M 20.340 21.51 135.90 \n", "485 913063 B 12.450 16.41 82.85 \n", "\n", " area_mean smoothness_mean compactness_mean concavity_mean \\\n", "351 758.6 0.12430 0.23640 0.29140 \n", "27 1094.0 0.09440 0.10660 0.14900 \n", "568 181.0 0.05263 0.04362 0.00000 \n", "535 1308.0 0.10460 0.17390 0.20850 \n", "497 480.1 0.08928 0.07630 0.03609 \n", "358 241.0 0.08293 0.07698 0.04721 \n", "75 817.7 0.09168 0.08424 0.09769 \n", "464 537.3 0.07466 0.05994 0.04859 \n", "323 1264.0 0.11700 0.18750 0.25650 \n", "485 476.7 0.09514 0.15110 0.15440 \n", "\n", " concave points_mean ... texture_worst perimeter_worst area_worst \\\n", "351 0.12420 ... 24.17 119.40 915.3 \n", "27 0.07731 ... 27.26 139.90 1403.0 \n", "568 0.00000 ... 30.37 59.16 268.6 \n", "535 0.13220 ... 25.48 160.20 1809.0 \n", "497 0.02369 ... 24.34 92.82 607.3 \n", "358 0.02381 ... 17.70 65.27 302.0 \n", "75 0.06638 ... 24.56 128.80 1223.0 \n", "464 0.02870 ... 23.89 95.10 687.6 \n", "323 0.15040 ... 31.86 171.10 1938.0 \n", "485 0.04846 ... 
21.03 97.82 580.6 \n", "\n", " smoothness_worst compactness_worst concavity_worst \\\n", "351 0.15500 0.50460 0.68720 \n", "27 0.13380 0.21170 0.34460 \n", "568 0.08996 0.06444 0.00000 \n", "535 0.12680 0.31350 0.44330 \n", "497 0.12760 0.25060 0.20280 \n", "358 0.10150 0.12480 0.09441 \n", "75 0.15000 0.20450 0.28290 \n", "464 0.12820 0.19650 0.18760 \n", "323 0.15920 0.44920 0.53440 \n", "485 0.11750 0.40610 0.48960 \n", "\n", " concave points_worst symmetry_worst fractal_dimension_worst \\\n", "351 0.21350 0.4245 0.10500 \n", "27 0.14900 0.2341 0.07421 \n", "568 0.00000 0.2871 0.07039 \n", "535 0.21480 0.3077 0.07569 \n", "497 0.10530 0.3035 0.07661 \n", "358 0.04762 0.2434 0.07431 \n", "75 0.15200 0.2650 0.06387 \n", "464 0.10450 0.2235 0.06925 \n", "323 0.26850 0.5558 0.10240 \n", "485 0.13420 0.3231 0.10340 \n", "\n", " Unnamed: 32 \n", "351 NaN \n", "27 NaN \n", "568 NaN \n", "535 NaN \n", "497 NaN \n", "358 NaN \n", "75 NaN \n", "464 NaN \n", "323 NaN \n", "485 NaN \n", "\n", "[10 rows x 33 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's load the cancer dataset; we will shuffle it and separate it into train and test sets\n", "dataset = pd.read_csv('./datasets/cancer_dataset.csv')\n", "# number of rows in the dataset and size of the 80% training split\n", "number_of_rows = len(dataset)\n", "num_train = int(0.8 * number_of_rows)\n", "# reminder, the data looks like this\n", "dataset.sample(10)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "lZLemcOrTD1h", "outputId": "e4a6d0ee-7691-42f4-c526-c9a2c0b2c0c7", "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total training samples: 455, total test samples: 114\n" ] } ], "source": [ "# we will take the first 2 features as our data (X) and the diagnosis as labels (y)\n", "x = dataset[['radius_mean', 'texture_mean']].values\n", "y = 
dataset['diagnosis'].values == 'M' # True (1) for Malignant, False (0) for Benign\n", "# shuffle\n", "rand_gen = np.random.RandomState(0)\n", "shuffled_indices = rand_gen.permutation(np.arange(len(x)))\n", "\n", "x_train = x[shuffled_indices[:num_train]]\n", "y_train = y[shuffled_indices[:num_train]]\n", "x_test = x[shuffled_indices[num_train:]]\n", "y_test = y[shuffled_indices[num_train:]]\n", "\n", "print(\"total training samples: {}, total test samples: {}\".format(num_train, number_of_rows - num_train))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# fit scaler on training data (not on test data!)\n", "scaler = StandardScaler().fit(x_train)\n", "\n", "# transform training data\n", "x_train = scaler.transform(x_train)\n", "\n", "# transform testing data\n", "x_test = scaler.transform(x_test)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "FJ63i3NgTD1h", "outputId": "e73c7e1c-0e26-40b6-eed4-acde82a254a0", "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "perceptron accuracy: 85.088 %\n" ] } ], "source": [ "# perceptron using Scikit-Learn\n", "per_clf = Perceptron(max_iter=1000)\n", "per_clf.fit(x_train, y_train)\n", "y_pred = per_clf.predict(x_test)\n", "accuracy = np.sum(y_pred == y_test) / len(y_test)\n", "print(\"perceptron accuracy: {:.3f} %\".format(accuracy * 100))\n", "w = (per_clf.coef_).reshape(-1,)\n", "b = (per_clf.intercept_).reshape(-1,)\n", "boundary = (-b -w[0] * x_train[:, 0]) / w[1] " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "oKqXmJlLTD1h", "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "def plot_perceptron_result():\n", " fig = plt.figure(figsize=(10, 8))\n", " ax = fig.add_subplot(1,1,1)\n", " ax.scatter(x_train[y_train,0], x_train[y_train, 1], color='r', label=\"M, train\", alpha=0.5)\n", " ax.scatter(x_train[~y_train,0], x_train[~y_train, 1], color='b', label=\"B, train\", 
alpha=0.5)\n", " ax.scatter(x_test[y_test,0], x_test[y_test, 1], color='r', label=\"M, test\", alpha=1)\n", " ax.scatter(x_test[~y_test,0], x_test[~y_test, 1], color='b', label=\"B, test\", alpha=1)\n", " ax.plot(x_train[:,0], boundary, label=\"decision boundary\", color='g')\n", " ax.legend()\n", " ax.grid()\n", " ax.set_ylim([-5, 5])\n", " ax.set_xlabel(\"radius_mean\")\n", " ax.set_ylabel(\"texture_mean\")\n", " ax.set_title(\"texture_mean vs. radius_mean\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "Axek1oI8TD1i", "outputId": "5d5f8c0e-f59c-4f63-8c70-4a65c01ff89c", "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# plot\n", "plot_perceptron_result()" ] }, { "cell_type": "markdown", "metadata": { "id": "Ho1haAExTD1i", "slideshow": { "slide_type": "slide" } }, "source": [ "### MLE with Bernoulli Assumption\n", "---\n", "* Recall that there is a connection between maximum likelihood estimation (MLE) and linear regression when we assume that the data can be created as follows: $y=\\theta^Tx + \\epsilon$, where $\\epsilon \\sim \\mathcal{N}(0, \\sigma^2) \\to y \\sim \\mathcal{N}(\\theta^Tx, \\sigma^2)$.\n", " * In this case, **minimizing** the negative log-likelihood (NLL): $-\\log P(y|x;\\theta)$ results in the MSE error $(y-\\theta^Tx)^2$, and **minimizing** the NLL is the same as **maximizing** $\\log P(y|x;\\theta)$, which is exactly the MLE!\n", "* When we assume that the data is created in a different way, we get different loss functions, as we will now demonstrate. But the idea is the same --\n", "\n", " **maximizing the log-likelihood (MLE) = minimizing the negative log-likelihood (NLL)**.\n", " $$ \\log P(y|x;\\theta) = \\log \\left[\\frac{1}{\\sqrt{2 \\pi \\sigma^2}} \\exp{\\left(-\\frac{(y - \\theta^Tx)^2}{2\\sigma^2}\\right)} \\right] $$\n", " $$ = -0.5\\log(2\\pi\\sigma^2) -\\frac{1}{2\\sigma^2}(y -\\theta^Tx)^2$$\n", " $$ \\to \\max_{\\theta}\\log P(y|x;\\theta) = \\min_{\\theta}-\\log P(y|x;\\theta) = \\min_{\\theta}\\frac{1}{2}(y -\\theta^Tx)^2 = \\min_{\\theta} MSE $$\n", "* The *Sigmoid* function (also the Logistic Function): $$ \\sigma(x) = \\frac{1}{1+e^{-x}} = \\frac{e^x}{1+e^x} $$\n", " * The output is in $[0,1]$, which is exactly what we need to model a probability distribution.\n", "* We assume that: $$ P(y|x,\\theta) = Bern(y|\\sigma(\\theta^Tx)) $$\n", " * Bernoulli Distribution (coin flip): $$ P(y) = p^y(1-p)^{1-y} $$\n", " * $p = \\sigma(\\theta^Tx) \\in [0,1]$\n", "* We will use the following notations:\n", "\n", "$$P(y_i|x_i, w) = \\begin{cases}\n", " \\pi_{i1} = \\sigma(w^Tx) = 
\\frac{1}{1+e^{-w^Tx}} & \\quad \\text{if } y_i=1 \\\\\n", " \\pi_{i0} = 1 - \\sigma(w^Tx) = 1 - \\frac{1}{1+e^{-w^Tx}} & \\quad \\text{if } y_i = 0 \n", " \\end{cases} $$" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "rB2JSPbiTD1j", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# let's see the sigmoid function\n", "def sigmoid(x):\n", " return 1 / (1 + np.exp(-x))\n", "\n", "x = np.linspace(-5, 5, 1000)\n", "sig_x = sigmoid(x)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "FyelZ9cOTD1j", "outputId": "f0f5af02-45ef-4947-b708-ae01d73de8d3", "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, 'sigmoid(x)')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# plot\n", "fig = plt.figure(figsize=(8,5))\n", "ax = fig.add_subplot(111)\n", "ax.plot(x, sig_x)\n", "ax.grid()\n", "ax.set_title(\"The Sigmoid Function\")\n", "ax.set_xlabel(\"x\")\n", "ax.set_ylabel(\"sigmoid(x)\")" ] }, { "cell_type": "markdown", "metadata": { "id": "6JWfaOsRTD1j", "slideshow": { "slide_type": "slide" } }, "source": [ "### Logistic Regression\n", "---\n", "* Some regression algorithms can be used for **classification** as well.\n", "* *Logistic Regression* is commonly used to **estimate the probability** that an instance belongs to a particular class.\n", " * Typically, if the estimated proabibility is greater than 50%, then the model predicts that the instance belongs to that class (called the positive class, labeled \"1\"), or else it predicts that it does not - a binary classifier.\n", "* **Estimating Probabilities** - Similarly to *Linear Regression*, a Logistic Regression model computes a weighted sum of the input features (plus a bias term), but unlike Linear Regression, it outputs the **logistic** of the weighted sum - $\\sigma(w^Tx)$, which is a number between 0 and 1." 
] }, { "cell_type": "markdown", "metadata": { "id": "MqcWxurYTD1j", "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Training and Cost Function\n", "---\n", "* The objective of training is to set the parameter vector $\\theta$ (or $w$) so that the model estimates high probabilities for positive instances ($y=1$) and low probabilities for negative instances ($y=0$).\n", "* Writing out the negative log-likelihood (NLL) under the Bernoulli assumption: \n", " $$ P(y|x,\\theta) = Bern(y|\\sigma(\\theta^Tx)) \\rightarrow NLL(\\theta) = -\\frac{1}{m}\\sum_{i=1}^m \\log \\left[ \\sigma(\\theta^Tx_i)^{y_i}(1-\\sigma(\\theta^Tx_i))^{1-y_i} \\right] = -\\frac{1}{m} \\sum_{i=1}^m\\log \\left[ \\pi_{i1}^{y_i}\\pi_{i0}^{1-y_i} \\right] $$\n", " $$ = -\\frac{1}{m} \\sum_{i=1}^m \\left[y_i\\log\\pi_{i1} + (1-y_i)\\log\\pi_{i0} \\right]$$" ] }, { "cell_type": "markdown", "metadata": { "id": "o1UPe7nfTD1k", "slideshow": { "slide_type": "subslide" } }, "source": [ "* This yields **the Logistic Regression cost function (log loss)**: $$ J(\\theta) = -\\frac{1}{m} \\sum_{i=1}^m \\big[ y_i\\log \\pi_{i1} + (1-y_i)\\log \\pi_{i0} \\big] = -\\frac{1}{m} \\sum_{i=1}^m \\big[ y_i\\log \\pi_{i1} + (1-y_i)\\log (1 - \\pi_{i1}) \\big] $$\n", " * Intuition: $-\\log(t)$ grows very large when $t$ approaches 0, so the cost will be large if the model estimates a probability close to 0 for a **positive instance**, and it will also be very large if the estimated probability is close to 1 for a **negative instance**. On the other hand, $-\\log(t)$ is close to 0 when $t$ is close to 1, so the cost will be close to 0 if the estimated probability is close to 0 for a **negative instance** or close to 1 for a **positive instance**.\n", " * This expression is also called the **binary cross-entropy (BCE)** loss.\n", " * The cost function in the case of *Logistic Regression* is **convex**." 
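, "\n", " * Worked example (a sketch with illustrative probabilities that are assumed here; natural logarithm): for a positive instance ($y_i=1$), an estimate $\\pi_{i1} = 0.9$ costs $-\\log 0.9 \\approx 0.105$, while $\\pi_{i1} = 0.1$ costs $-\\log 0.1 \\approx 2.303$. A confident mistake is therefore penalized far more heavily than a confident correct prediction."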
] }, { "cell_type": "markdown", "metadata": { "id": "jrukBjrYTD1k", "slideshow": { "slide_type": "subslide" } }, "source": [ "* **Logistic cost function derivatives**: $$ \\frac{\\partial}{\\partial \\theta_j}J(\\theta) = \\frac{1}{m}\\sum_{i=1}^m \\big( \\sigma(\\theta^Tx^{i}) - y_i \\big) x_j^{i} $$\n", " * No closed-form solution.\n", " * Thanks to the convexity of the cost function (for the case of Logistic Regression), we can use **Gradient Descent** (or SGD, Mini-Batch GD)." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "eZHB4a4yTD1k", "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "def plot_lr_boundary(x_train, x_test, y_train, y_test, boundary):\n", " fig = plt.figure(figsize=(8, 5))\n", " ax = fig.add_subplot(1,1,1)\n", " ax.scatter(x_train[y_train,0], x_train[y_train, 1], color='r', label=\"M, train\", alpha=0.5)\n", " ax.scatter(x_train[~y_train,0], x_train[~y_train, 1], color='b', label=\"B, train\", alpha=0.5)\n", " ax.scatter(x_test[y_test,0], x_test[y_test, 1], color='r', label=\"M, test\", alpha=1)\n", " ax.scatter(x_test[~y_test,0], x_test[~y_test, 1], color='b', label=\"B, test\", alpha=1)\n", " ax.plot(x_train[:,0], boundary, label=\"decision boundary\", color='g')\n", " ax.legend()\n", " ax.grid()\n", " ax.set_ylim([-5, 5])\n", " ax.set_xlabel(\"radius_mean\")\n", " ax.set_ylabel(\"texture_mean\")\n", " ax.set_title(\"texture_mean vs. 
radius_mean\")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "182fuqu7TD1l", "outputId": "a8a004fc-d0f2-466a-b6c3-ad8ec3a49f18", "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Logistic Regression accuracy: 90.351 %\n" ] } ], "source": [ "# logistic regression with scikit-learn\n", "log_reg = LogisticRegression(solver='lbfgs')\n", "log_reg.fit(x_train, y_train)\n", "y_pred = log_reg.predict(x_test)\n", "accuracy = np.sum(y_pred == y_test) / len(y_test)\n", "print(\"Logistic Regression accuracy: {:.3f} %\".format(accuracy * 100))\n", "w = (log_reg.coef_).reshape(-1,)\n", "b = (log_reg.intercept_).reshape(-1,)\n", "boundary = (-b -w[0] * x_train[:, 0]) / w[1] " ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "id": "fqScvG9TTD1l", "outputId": "35add6eb-c3b4-4f06-f148-eb18d7732f26", "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# plot\n", "plot_lr_boundary(x_train, x_test, y_train, y_test, boundary)" ] }, { "cell_type": "markdown", "metadata": { "id": "u-HhhoBxTD1l", "slideshow": { "slide_type": "slide" } }, "source": [ "### Logistic Regression with PyTorch\n", "---\n", "* We will now get familiar with building neural networks with PyTorch.\n", "* All neural network models inherit from a parent class `nn.Module`. The user must implement the `__init__()` and `__forward()__` methods.\n", "* In `__init__()` we initialize the parameters of the neural networks, e.g., number of parameters (such as number of hidden units/layers, type of layers and etc...)\n", "* In `__forward()__` we implement the forward pass of the network, i.e., what happens to the input.\n", " * For example, if in `__init__()` you defined a linear layer and a ReLU activation, then in `__forward()__` you will define that the input goes first into the linear layer and then into the activation." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "Cfuod-TtTD1l", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# define our simple single neuron network\n", "class SingleNeuron(nn.Module):\n", "    # notice that we inherit from nn.Module\n", "    def __init__(self, input_dim):\n", "        super(SingleNeuron, self).__init__()\n", "        # here we initialize the building blocks of our network\n", "        # single neuron is just one linear (fully-connected) layer\n", "        self.fc = nn.Linear(input_dim, 1)\n", "        # non-linearity: the sigmoid function for binary classification\n", "        self.sigmoid = nn.Sigmoid()\n", "\n", "    def forward(self, x):\n", "        # here we define what happens to the input x in the forward pass\n", "        # that is, the order in which x goes through the building blocks\n", "        # in our case, x first goes through the single neuron and is then activated with the sigmoid\n", "        return self.sigmoid(self.fc(x))" ] }, { "cell_type": "markdown", "metadata": { "id": "2TG7ywsPTD1n", "slideshow": { "slide_type": "subslide" } }, "source": [ "* Okay, so we have our network, now we need to train it.\n", "* We need to define how the weights are optimized (the optimizer) and set the hyper-parameters, such as the learning rate and the number of epochs." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "-OS_6OIsTD1n", "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# define the device we are going to run calculations on (cpu or gpu)\n", "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")\n", "# create an instance of our model and send it to the device\n", "input_dim = x_train.shape[1]\n", "model = SingleNeuron(input_dim=input_dim).to(device)\n", "# define the optimizer, and give it the network's weights\n", "learning_rate = 0.1\n", "# every class that inherits from nn.Module has the .parameters() method to access the weights\n", "optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)\n", "# other hyper-parameters\n", "num_epochs = 5000\n", "# define loss function - BCE for binary classification\n", "criterion = nn.BCELoss()\n", "# preprocess the data\n", "scaler = StandardScaler()\n", "x_train_prep = scaler.fit_transform(x_train)\n", "x_test_prep = scaler.transform(x_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Kl2wqmFRTD1n", "outputId": "8f6adf9a-a2c0-4167-9886-1c801a1a9d45", "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "epoch: 0 loss: 0.6858659386634827\n", "epoch: 1000 loss: 0.26053810119628906\n", "epoch: 2000 loss: 0.25988084077835083\n", "epoch: 3000 loss: 0.25985535979270935\n", "epoch: 4000 loss: 0.25985416769981384\n" ] } ], "source": [ "# training loop for the model\n", "for epoch in range(num_epochs):\n", "    # get data\n", "    features = torch.tensor(x_train_prep, dtype=torch.float, device=device)\n", "    labels = torch.tensor(y_train, dtype=torch.float, device=device)\n", "    # forward pass: the model already applies the sigmoid, so the outputs are probabilities\n", "    probs = model(features)\n", "    # loss\n", "    loss = criterion(probs.view(-1), labels)\n", "    # backward pass\n", "    optimizer.zero_grad() # clean the gradients from previous iteration\n", "    loss.backward() # autograd backward to calculate gradients\n", "    optimizer.step() # apply update to the weights\n", "    if epoch % 1000 == 0:\n", "        print(f'epoch: {epoch} loss: {loss}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "D0e0JTKUTD1n", "outputId": "8f9d18df-933a-41c2-ca9c-0cc66879fa26", "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Logistic Regression accuracy: 90.351 %\n" ] } ], "source": [ "# predict and check accuracy\n", "test_features = torch.from_numpy(x_test_prep).float().to(device)\n", "# the model outputs probabilities; threshold them at 0.5 to get class predictions\n", "y_pred_probs = model(test_features).detach().cpu().view(-1).numpy()\n", "y_pred = (y_pred_probs > 0.5)\n", "accuracy = np.sum(y_pred == y_test) / len(y_test)\n", "print(\"Logistic Regression accuracy: {:.3f} %\".format(accuracy * 100))" ] }, { "cell_type": "markdown", "metadata": { "id": "QtWfWb3QTD1o", "slideshow": { "slide_type": "subslide" } }, "source": [ "Same as Scikit-Learn!" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 322 }, "id": "PuI11UkBTS_x", "outputId": "3c842acb-dac1-4075-82a8-cb3ceee51809", "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/svg+xml": [ "Computational graph (torchviz): fc.weight (1, 2) -> TBackward -> AddmmBackward; fc.bias (1) -> AddmmBackward; AddmmBackward -> SigmoidBackward\n" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": { "tags": [] }, "output_type": 
"execute_result" } ], "source": [ "# visualize computational graph\n", "x = torch.randn(1, input_dim).to(device)\n", "torchviz.make_dot(model(x), params=dict(model.named_parameters()))" ] }, { "cell_type": "markdown", "metadata": { "id": "cmZUQFrxTD1o", "slideshow": { "slide_type": "slide" } }, "source": [ "### Multi-Class (Multinomial) Logistic Regression - Softmax Regression\n", "---\n", "* The Logistic Regression model can be generalized to support multiple classes.\n", "* The idea: when given an instance $x$, the Softmax Regression model first computes a score $s_k(x)$ for each class $k$, then estimates a probability of each class by applying the *softmax function* (normalized exponential) to the scores.\n", "* The **Softmax score for class $k$**: $$ s_k(x) = \\big( \\theta^{(k)} \\big)^T \\cdot x $$ \n", " * Each class has its own dedicated parameter vector $\\theta^{(k)}$, which is usually stored in a row of the parameter matrix $\\Theta$." ] }, { "cell_type": "markdown", "metadata": { "id": "5Q4BBYwCTD1o", "slideshow": { "slide_type": "subslide" } }, "source": [ "* **The Softmax Function**: $$\\hat{p}_k = p(y=k|x,\\theta) = \\sigma(s(x))_k = \\frac{e^{s_k(x)}}{\\sum_{j=1}^K e^{s_j(x)}} $$\n", " * $K$ is the number of classes.\n", " * $s(x)$ is a *vector* containing the scores of each class for the instance $x$.\n", " * $\\sigma(s(x))_k$ is the estimated probability that the instance $x$ belongs to class $k$ given the scores of each class for that instance.\n", "* **The Softmax Regression classifier prediction**: $$\\hat{y} = \\underset{k}{\\mathrm{argmax}} \\sigma(s(x))_k = \\underset{k}{\\mathrm{argmax}} s_k(x) = \\underset{k}{\\mathrm{argmax}} \\big( (\\theta^{(k)})^Tx \\big) $$" ] }, { "cell_type": "markdown", "metadata": { "id": "OGBIchXMTD1o", "slideshow": { "slide_type": "subslide" } }, "source": [ "* **Cross-Entropy cost function**: $$ J(\\Theta) = -\\frac{1}{m} \\sum_{i=1}^m \\sum_{k=1}^K y_k^{(i)} \\log(\\hat{p}_k^{(i)}) $$\n", " * $y_k^{(i)}$ is 
equal to 1 if the target class for the $i^{th}$ instance is $k$, and 0 otherwise.\n", " * When $K=2$, it reduces to the BCE from the previous section.\n", "* **Cross-Entropy gradient vector for class $k$**: $$ \\nabla_{\\theta^{(k)}}J(\\Theta) = \\frac{1}{m}\\sum_{i=1}^m (\\hat{p}_k^{(i)} - y_k^{(i)})x^{(i)} $$\n", " * Use Gradient Descent or one of its variants to minimize $J(\\Theta)$.\n", "* In Scikit-Learn: `LogisticRegression(multi_class=\"multinomial\", solver=\"lbfgs\", C=10)`\n", " * $C$ is the inverse of the regularization strength: smaller values specify stronger regularization." ] }, { "cell_type": "markdown", "metadata": { "id": "yM_pQ-16TD1o", "slideshow": { "slide_type": "slide" } }, "source": [ "### Activation Functions\n", "---\n", "\n", "The key change made to the Perceptron that brought about the era of deep learning is the addition of **activation functions** to the output of each neuron. These allow the learning of **non-linear functions**. We will use three popular activation functions:\n", "1. **Logistic function (sigmoid)**: $\\sigma(z) = \\frac{1}{1 + e^{-z}}$. The output is in $(0,1)$, so it can be used for binary classification or interpreted as a probability.\n", "2. **Hyperbolic tangent function**: $\\tanh(z) = 2\\sigma(2z) - 1$. The output is in $(-1,1)$, which tends to make each layer's output more or less normalized at the beginning of the training (which may speed up convergence).\n", "3. **ReLU (Rectified Linear Unit) function**: $\\mathrm{ReLU}(z) = \\max(0,z)$. It is continuous but not differentiable at $z=0$. Nevertheless, it is the most common activation function, as it is fast to compute and does not bound the output from above (which helps with some issues during Gradient Descent)." ] }, { "cell_type": "markdown", "metadata": { "id": "gZes_GHxTD1o", "slideshow": { "slide_type": "subslide" } }, "source": [ "**Derivatives of the activation functions (for backpropagation)**:\n", "1. $\\frac{d\\sigma(z)}{dz} = \\sigma(z)(1-\\sigma(z))$\n", "2. $\\frac{d \\tanh(z)}{dz} = 1 - \\tanh^2(z)$\n", "3. 
We define the derivative at $z=0$ to be zero: $\\frac{d\\,\\mathrm{ReLU}(z)}{dz} = 1$ if $z>0$, else $0$" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "0aWjDI-STD1p", "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# activation functions\n", "def sigmoid(z, deriv=False):\n", "    output = 1 / (1 + np.exp(-1.0 * z))\n", "    if deriv:\n", "        return output * (1 - output)\n", "    return output\n", "\n", "def tanh(z, deriv=False):\n", "    output = np.tanh(z)\n", "    if deriv:\n", "        return 1 - np.square(output)\n", "    return output\n", "\n", "def relu(z, deriv=False):\n", "    # scalar version: expects a single number, not an array\n", "    output = z if z > 0 else 0\n", "    if deriv:\n", "        return 1 if z > 0 else 0\n", "    return output" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "wSAflgsZTD1p", "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "def plot_activations():\n", "    x = np.linspace(-5, 5, 1000)\n", "    y_sig = sigmoid(x)\n", "    y_tanh = tanh(x)\n", "    # relu is scalar-only, so apply it element by element\n", "    y_relu = list(map(lambda z: relu(z), x))\n", "    fig = plt.figure(figsize=(8, 5))\n", "    ax1 = fig.add_subplot(2,1,1)\n", "    ax1.plot(x, y_sig, label='sigmoid')\n", "    ax1.plot(x, y_tanh, label='tanh')\n", "    ax1.plot(x, y_relu, label='relu')\n", "    ax1.grid()\n", "    ax1.legend()\n", "    ax1.set_xlabel('x')\n", "    ax1.set_ylabel('y')\n", "    ax1.set_ylim([-2, 2])\n", "    ax1.set_title('Activation Functions')\n", "\n", "    y_sig_derv = sigmoid(x, deriv=True)\n", "    y_tanh_derv = tanh(x, deriv=True)\n", "    y_relu_derv = list(map(lambda z: relu(z, deriv=True), x))\n", "    ax2 = fig.add_subplot(2,1,2)\n", "    ax2.plot(x, y_sig_derv, label='sigmoid')\n", "    ax2.plot(x, y_tanh_derv, label='tanh')\n", "    ax2.plot(x, y_relu_derv, label='relu')\n", "    ax2.grid()\n", "    ax2.legend()\n", "    ax2.set_xlabel('x')\n", "    ax2.set_ylabel('y')\n", "    # ax2.set_ylim([-2, 2])\n", "    ax2.set_title('Activation Functions Derivatives')\n", "    plt.tight_layout()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "JIzEgdyCTD1p", "outputId": 
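"851d88d7-e1fc-4ef1-8e97-7e55aa278424", "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[Figure: two stacked panels, 'Activation Functions' and 'Activation Functions Derivatives', each plotting sigmoid, tanh and relu against x]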
" ] }, "metadata": { "needs_background": "light", "tags": [] }, "output_type": "display_data" } ], "source": [ "# plot\n", "plot_activations()" ] }, { "cell_type": "markdown", "metadata": { "id": "J68kd8QcTD1q", "slideshow": { "slide_type": "subslide" } }, "source": [ "### Recommended Videos\n", "---\n", "#### Warning!\n", "* These videos do not replace the lectures and tutorials.\n", "* Please use these to get a better understanding of the material, and not as an alternative to the written material.\n", "\n", "#### Videos by Subject\n", "\n", "* Perceptron - Perceptron\n", " * Perceptron Training\n", "* Logistic Regression - Lecture 3 | Machine Learning (Stanford)\n", " * StatQuest: Logistic Regression\n", "* Softmax Regression - Softmax Regression (C2W3L08)\n", "* Activation Functions - Activation Functions (C1W3L06)\n", " * Why Non-linear Activation Functions (C1W3L07)" ] }, { "cell_type": "markdown", "metadata": { "id": "ASKEitEBTD1q", "slideshow": { "slide_type": "skip" } }, "source": [ "## Credits\n", "---\n", "* Icons made by Becris from www.flaticon.com\n", "* Icons from Icons8.com - https://icons8.com\n", "* Datasets from Kaggle - https://www.kaggle.com/\n", "* Examples and code snippets were taken from \"Hands-On Machine Learning with Scikit-Learn and TensorFlow\"" ] } ], "metadata": { "colab": { "name": "ee046xxx_tutorial_04_single_neuron.ipynb", "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.18" } }, "nbformat": 4, "nbformat_minor": 4 }