{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "jupytext": { "cell_metadata_filter": "-all", "formats": "ipynb" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" }, "colab": { "name": "dropout-and-batch-normalization.ipynb", "provenance": [], "collapsed_sections": [] } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "rbbaAwB1XtJP" }, "source": [ "Modified from: https://www.kaggle.com/ryanholbrook/overfitting-and-underfitting" ] }, { "cell_type": "markdown", "metadata": { "id": "DKwljf9XZFYX" }, "source": [ "# Introduction #\n", "\n", "There's more to the world of deep learning than just dense layers. There are dozens of kinds of layers you might add to a model. (Try browsing through the [Keras docs](https://www.tensorflow.org/api_docs/python/tf/keras/layers/) for a sample!) Some are like dense layers and define connections between neurons, and others can do preprocessing or transformations of other sorts.\n", "\n", "In this lesson, we'll learn about two kinds of special layers. They contain no neurons themselves, but add functionality that can sometimes benefit a model in various ways. Both are commonly used in modern architectures.\n", "\n", "# Dropout #\n", "\n", "The first of these is the \"dropout layer\", which can help correct overfitting.\n", "\n", "In the last lesson we talked about how overfitting is caused by the network learning spurious patterns in the training data. To recognize these spurious patterns a network will often rely on very specific combinations of weights, a kind of \"conspiracy\" of weights. Being so specific, they tend to be fragile: remove one and the conspiracy falls apart.\n", "\n", "This is the idea behind **dropout**. 
To break up these conspiracies, we randomly *drop out* some fraction of a layer's input units every step of training, making it much harder for the network to learn those spurious patterns in the training data. Instead, it has to search for broad, general patterns, whose weight patterns tend to be more robust.\n", "\n", "
\n", "Figure: 50% dropout has been added between the two hidden layers.\n", "
\n", "\n", "You could also think about dropout as creating a kind of *ensemble* of networks. The predictions will no longer be made by one big network, but instead by a committee of smaller networks. Individuals in the committee tend to make different kinds of mistakes, but be right at the same time, making the committee as a whole better than any individual. (If you're familiar with random forests as an ensemble of decision trees, it's the same idea.)\n", "\n", "## Adding Dropout ##\n", "\n", "In Keras, the dropout rate argument `rate` defines what fraction of the input units to shut off. Put the `Dropout` layer just before the layer you want the dropout applied to:\n", "\n", "```\n", "keras.Sequential([\n", " # ...\n", " layers.Dropout(rate=0.3), # apply 30% dropout to the next layer\n", " layers.Dense(16),\n", " # ...\n", "])\n", "```\n", "\n", "# Batch Normalization #\n", "\n", "The next special layer we'll look at performs \"batch normalization\" (or \"batchnorm\"), which can help correct training that is slow or unstable.\n", "\n", "With neural networks, it's generally a good idea to put all of your data on a common scale, perhaps with something like scikit-learn's [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) or [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html). The reason is that SGD will shift the network weights in proportion to how large an activation the data produces. Features that tend to produce activations of very different sizes can make for unstable training behavior.\n", "\n", "Now, if it's good to normalize the data before it goes into the network, maybe also normalizing inside the network would be better! In fact, we have a special kind of layer that can do this, the **batch normalization layer**. 
A batch normalization layer looks at each batch as it comes in, first normalizing the batch with its own mean and standard deviation, and then also putting the data on a new scale with two trainable rescaling parameters. Batchnorm, in effect, performs a kind of coordinated rescaling of its inputs.\n", "\n", "Most often, batchnorm is added as an aid to the optimization process (though it can sometimes also help prediction performance). Models with batchnorm tend to need fewer epochs to complete training. Moreover, batchnorm can also fix various problems that can cause the training to get \"stuck\". Consider adding batch normalization to your models, especially if you're having trouble during training.\n", "\n", "## Adding Batch Normalization ##\n", "\n", "It seems that batch normalization can be used at almost any point in a network. You can put it after a layer...\n", "\n", "```\n", "layers.Dense(16, activation='relu'),\n", "layers.BatchNormalization(),\n", "```\n", "\n", "... or between a layer and its activation function:\n", "\n", "```\n", "layers.Dense(16),\n", "layers.BatchNormalization(),\n", "layers.Activation('relu'),\n", "```\n", "\n", "And if you add it as the first layer of your network it can act as a kind of adaptive preprocessor, standing in for something like scikit-learn's `StandardScaler`.\n", "\n", "# Example - Using Dropout and Batch Normalization #\n", "\n", "Let's continue developing the *Red Wine* model. Now we'll increase the capacity even more, but add dropout to control overfitting and batch normalization to speed up optimization. This time, we'll also leave off standardizing the data, to demonstrate how batch normalization can stabilize the training." 
] }, { "cell_type": "markdown", "metadata": { "id": "PP2KojJnY-S-" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "cbXj1tgXY9B3" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "3IZftzPwY9M1" }, "source": [ "" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "XcBINM75X1tn", "outputId": "5217fa57-3880-41b4-ac95-b94a9d0f61a9" }, "source": [ "from google.colab import drive\n", "drive.mount('/content/drive')" ], "execution_count": 1, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n" ] } ] }, { "cell_type": "code", "metadata": { "_kg_hide-input": true, "id": "meZZ7M-FXtJZ" }, "source": [ "\n", "# Setup plotting\n", "import matplotlib.pyplot as plt\n", "\n", "plt.style.use('seaborn-whitegrid')\n", "# Set Matplotlib defaults\n", "plt.rc('figure', autolayout=True)\n", "plt.rc('axes', labelweight='bold', labelsize='large',\n", " titleweight='bold', titlesize=18, titlepad=10)\n", "\n", "\n", "import pandas as pd\n", "red_wine = pd.read_csv('/content/drive/MyDrive/CommonFiles/MUSA650-Data/red-wine.csv')\n", "\n", "# Create training and validation splits\n", "df_train = red_wine.sample(frac=0.7, random_state=0)\n", "df_valid = red_wine.drop(df_train.index)\n", "\n", "# Split features and target\n", "X_train = df_train.drop('quality', axis=1)\n", "X_valid = df_valid.drop('quality', axis=1)\n", "y_train = df_train['quality']\n", "y_valid = df_valid['quality']" ], "execution_count": 2, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "D-k94iG7XtJa" }, "source": [ "When adding dropout, you may need to increase the number of units in your `Dense` layers." 
] }, { "cell_type": "code", "metadata": { "id": "qnHxSX4rXtJb" }, "source": [ "from tensorflow import keras\n", "from tensorflow.keras import layers\n", "\n", "model = keras.Sequential([\n", " layers.Dense(1024, activation='relu', input_shape=[11]),\n", " layers.Dropout(0.3),\n", " layers.BatchNormalization(),\n", " layers.Dense(1024, activation='relu'),\n", " layers.Dropout(0.3),\n", " layers.BatchNormalization(),\n", " layers.Dense(1024, activation='relu'),\n", " layers.Dropout(0.3),\n", " layers.BatchNormalization(),\n", " layers.Dense(1),\n", "])" ], "execution_count": 3, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "KDkVo9slXtJb" }, "source": [ "There's nothing to change this time in how we set up the training." ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 297 }, "id": "mcVqYYvmXtJb", "outputId": "e7284c27-a835-450e-97b0-c869e98ff6af" }, "source": [ "model.compile(\n", " optimizer='adam',\n", " loss='mae',\n", ")\n", "\n", "history = model.fit(\n", " X_train, y_train,\n", " validation_data=(X_valid, y_valid),\n", " batch_size=256,\n", " epochs=100,\n", " verbose=0,\n", ")\n", "\n", "\n", "# Show the learning curves\n", "history_df = pd.DataFrame(history.history)\n", "history_df.loc[:, ['loss', 'val_loss']].plot();" ], "execution_count": 4, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "code", "metadata": { "id": "u7CofNpqYo5S" }, "source": [ "" ], "execution_count": 4, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "MHw4gQJSXtJc" }, "source": [ "You'll typically get better performance if you standardize your data before using it for training. That we were able to use the raw data at all, however, shows how effective batch normalization can be on more difficult datasets.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "X1wFAwl-XtJc" }, "source": [ "---" ] } ] }