{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Regularization\n", "Authors: Brian Stucky, Carson Andorf\n", "\n", "## 1. Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Two loss function](../nb-images/Regularization.svg)\n", "
(Image from Google's Machine Learning Crash Course)
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### An example dataset\n", "\n", "This simple dataset contains information about insect, fish, and bird species and whether or not they can fly:\n", "\n", "|Name|Class|Can fly|\n", "|:--:|:---:|:-----:|\n", "|Pileated woodpecker|Birds|Yes|\n", "|Emu|Birds|No|\n", "|Northern cardinal|Birds|Yes|\n", "|Blacktip shark|Cartilaginous fishes|No|\n", "|Bluntnose stingray|Cartilaginous fishes|No|\n", "|Black drum|Bony fishes|No|\n", "|Florida carpenter ant|Insects|No|\n", "|Periodical cicada|Insects|Yes|\n", "|Luna moth|Insects|Yes|\n", "\n", "**Your task:** Develop a model to classify whether or not an animal can fly, based on information available in the dataset.\n", "\n", "### Model 1\n", "\n", " * If the animal is a bird or an insect, predict that it can fly.\n", " * Otherwise, predict that it cannot fly.\n", "\n", "Does this model make any mistakes? If so, can we improve it?\n", "\n", "\n", "### Model 2\n", "\n", " * If the species is a bird and has a one-word name, predict that it cannot fly.\n", " * If it is a bird with a two-word name, predict that it can fly.\n", " * If it is an insect with a three-word name, predict that it cannot fly.\n", " * If it is an insect with a two-word name, predict that it can fly.\n", " * Otherwise, predict that it cannot fly.\n", "\n", "Aha! That model classifies each training example perfectly!\n", "\n", "\n", "### Key points\n", "\n", " * We want our models to be general enough to work well on new examples.\n", " * Methods to help prevent overfitting are collectively referred to as *regularization* techniques.\n", " * Do not trust your training examples too much!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. L1 and L2 regularization\n", "\n", "For this lesson, we will focus on two widely used regularization methods: L1 and L2 regularization. 
Both of these methods represent model complexity as a function of the model's feature weights.\n", "\n", "Reminder: The general linear regression model looks like this:\n", "\n", "$$ y = w_0 + w_1 x_1 + w_2 x_2 + \\ldots + w_k x_k $$\n", "\n", "The L1 regularization penalty is:\n", "\n", "$$L_1\\text{ regularization penalty} = \\lambda\\sum_{i=1}^k |w_i|$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "weights = [-0.5, -0.2, 0.5, 0.7, 1.0, 2.5]\n", "\n", "# L1 penalty with lambda = 0.1: lambda times the sum of the absolute weights.\n", "0.1 * np.sum(np.abs(weights))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The L2 regularization penalty is:\n", "\n", "$$L_2\\text{ regularization penalty} = \\lambda\\sum_{i=1}^k w_i^2$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# L2 penalty with lambda = 0.1: lambda times the sum of the squared weights.\n", "0.1 * np.sum(np.square(weights))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.a. Adding regularization to a loss/cost function\n", "\n", "Recall that the usual loss function for linear regression is the *mean squared error*:\n", "\n", "$$ MSE = \\frac{1}{n} \\sum_{i=1}^n (y_i - (w_0 + w_1 x_{i,1} + w_2 x_{i,2} + \\ldots + w_k x_{i,k}))^2 $$\n", "\n", "To add L1 regularization, we want to minimize:\n", "\n", "$$ MSE + \\lambda\\sum_{i=1}^k |w_i|$$\n", "\n", "\n", "\n", "### 2.b. Lambda\n", "\n", "The coefficient $\\lambda$ (lambda) controls how strongly the penalty counts against the loss. Setting $\\lambda = 0$ disables regularization entirely; increasing $\\lambda$ shrinks the weights and simplifies the model; making $\\lambda$ too large can cause underfitting. (Scikit-learn calls this parameter `alpha`.)\n", "\n", "\n", "### 2.c. Practical differences between L1 and L2 regularization\n", "\n", " * L1 regularization can result in models where some of the feature weights are 0.\n", " * L2 regularization can decrease model weights but not drive them to 0.\n", " * L2 regularization results in a minimization problem with a unique solution, which is not always the case for L1 regularization.\n", " * Which is best depends on the specifics of the data, the modeling problem, and the goals of the analysis.\n", " \n", "\n", "## 3. 
Practice example / demonstration\n", "\n", "Let's analyze a dataset called `regularization.csv` that you can find in the `nb-datasets` folder.\n", "\n", "### First, try using regular old non-regularized linear regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.linear_model import LinearRegression, Ridge, Lasso\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import mean_squared_error as mse\n", "\n", "# Load the dataset and take a quick look at its columns.\n", "data = pd.read_csv('../nb-datasets/regularization.csv')\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Try using L1 regularization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Try experimenting with the value of the regularization parameter in the code above. How does changing the value of `alpha` affect the results? When do you get results that are misleading or just plain wrong?\n", "\n", "\n", "### Try using L2 regularization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Practice example using real data\n", "\n", "Let's try using regularization on a real dataset. We'll again use the iris dataset that you've already seen in previous lessons. We might not have time for this example during the workshop, and if not, I encourage you to explore it on your own.\n", "\n", "### Load the data and split out training and testing sets." 
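As a reference for the section 3 comparison, here is a self-contained sketch of the same workflow on simulated data (the workshop's `regularization.csv` is not reproduced here, so the dataset, feature counts, and variable names below are invented for illustration). It also demonstrates the claim from section 2.c that the L1 penalty can drive some weights exactly to zero, while ordinary least squares does not:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import train_test_split

# Simulated data: 5 informative features plus 15 pure-noise features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]
y = X @ true_w + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit a non-regularized model and an L1-regularized (Lasso) model.
ols = LinearRegression().fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)

# OLS assigns small, nonzero weights to every noise feature;
# the L1 penalty zeroes most of them out.
print('OLS zero weights:  ', np.sum(ols.coef_ == 0))
print('Lasso zero weights:', np.sum(lasso.coef_ == 0))
```

Try raising or lowering `alpha` here: larger values zero out more weights, and large enough values can zero out the informative features too.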
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "idata = pd.read_csv('../nb-datasets/iris_dataset.csv')\n", "idata['species'] = idata['species'].astype('category')\n", "\n", "# Convert the categorical variable \"species\" to 1-hot encoding (AKA \"dummy variables\"),\n", "# but eliminate the first dummy variable because it is collinear with the other two\n", "# and does not provide any additional information.\n", "idata_enc = pd.get_dummies(idata, drop_first=True)\n", "\n", "# Separate the x and y values.\n", "x = idata_enc.drop(columns='petal_length')\n", "y = idata_enc['petal_length']\n", "\n", "# Split the train and test sets.\n", "x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)\n", "\n", "# See what we have.\n", "idata_enc.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Give standard linear regression a try" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Try L1 regularization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Try L2 regularization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercises\n", "\n", "Try experimenting with the value of `alpha`/$\\lambda$ in the code above for both L1 regularization and L2 regularization. As you do so, consider these questions:\n", "\n", "1. How does changing the value of the regularization parameter affect the coefficient weights and training/test performance?\n", "2. What values of the regularization parameter give you the best test accuracy?\n", "3. For these data, does L1 or L2 regularization perform better?" 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }