# Regularization
Authors: Brian Stucky, Carson Andorf

## 1. Introduction

![Two loss functions](../nb-images/Regularization.svg)
<div style="text-align: right"> (Image from Google's Machine Learning Crash Course) </div>

### An example dataset

This simple dataset contains information about insect, fish, and bird species and whether or not they can fly:

|Name|Class|Can fly|
|:--:|:---:|:-----:|
|Pileated woodpecker|Birds|Yes|
|Emu|Birds|No|
|Northern cardinal|Birds|Yes|
|Blacktip shark|Cartilaginous fishes|No|
|Bluntnose stingray|Cartilaginous fishes|No|
|Black drum|Bony fishes|No|
|Florida carpenter ant|Insects|No|
|Periodical cicada|Insects|Yes|
|Luna moth|Insects|Yes|

**Your task:** Develop a model to classify whether or not an animal can fly, based on information available in the dataset.

### Model 1

  * If the animal is a bird or an insect, predict that it can fly.
  * Otherwise, predict that it cannot fly.

Does this model make any mistakes?  If so, can we improve it?
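
One way to check is to write Model 1 as a small function and run it over the table. The sketch below is plain Python; the names `animals` and `model_1` are made up for this illustration.

```python
# The nine training examples from the table above: (name, class, can_fly).
animals = [
    ("Pileated woodpecker", "Birds", True),
    ("Emu", "Birds", False),
    ("Northern cardinal", "Birds", True),
    ("Blacktip shark", "Cartilaginous fishes", False),
    ("Bluntnose stingray", "Cartilaginous fishes", False),
    ("Black drum", "Bony fishes", False),
    ("Florida carpenter ant", "Insects", False),
    ("Periodical cicada", "Insects", True),
    ("Luna moth", "Insects", True),
]

def model_1(name, animal_class):
    """Model 1: predict flight from the class alone (birds and insects fly)."""
    return animal_class in ("Birds", "Insects")

# Collect every example where the prediction disagrees with the label.
mistakes = [name for name, cls, can_fly in animals if model_1(name, cls) != can_fly]

print(f"Model 1 gets {len(animals) - len(mistakes)} of {len(animals)} right")
print("Misclassified:", mistakes)
```

Model 1 gets 7 of the 9 examples right. It misses the emu and the Florida carpenter ant, the two non-flying members of classes that usually fly.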


### Model 2

  * If the species is a bird and has a one-word name, predict that it cannot fly.
  * If it is a bird with a two-word name, predict that it can fly.
  * If it is an insect with a three-word name, predict that it cannot fly.
  * If it is an insect with a two-word name, predict that it can fly.
  * Otherwise, predict that it cannot fly.

Aha!  That model classifies each training example perfectly!
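
Perfect training performance says little about species the model has never seen. The sketch below encodes Model 2 and scores it on the training table and on three extra species. The extra species in `new_examples` are hypothetical additions that are not part of the original dataset, and the function names are made up for this illustration.

```python
def model_2(name, animal_class):
    """Model 2: rules keyed on the class and the number of words in the name."""
    n_words = len(name.split())
    if animal_class == "Birds" and n_words == 1:
        return False
    if animal_class == "Birds" and n_words == 2:
        return True
    if animal_class == "Insects" and n_words == 3:
        return False
    if animal_class == "Insects" and n_words == 2:
        return True
    return False

def accuracy(examples):
    """Fraction of (name, class, can_fly) examples that Model 2 gets right."""
    return sum(model_2(n, c) == flies for n, c, flies in examples) / len(examples)

training = [
    ("Pileated woodpecker", "Birds", True),
    ("Emu", "Birds", False),
    ("Northern cardinal", "Birds", True),
    ("Blacktip shark", "Cartilaginous fishes", False),
    ("Bluntnose stingray", "Cartilaginous fishes", False),
    ("Black drum", "Bony fishes", False),
    ("Florida carpenter ant", "Insects", False),
    ("Periodical cicada", "Insects", True),
    ("Luna moth", "Insects", True),
]

# Hypothetical new examples (NOT in the original dataset).
new_examples = [
    ("Emperor penguin", "Birds", False),    # two-word bird that cannot fly
    ("Mallard", "Birds", True),             # one-word bird that can fly
    ("Monarch butterfly", "Insects", True),
]

print(f"Training accuracy:     {accuracy(training):.2f}")
print(f"New-example accuracy:  {accuracy(new_examples):.2f}")
```

The name-length rules happen to fit the nine training species, but they encode nothing about flight, so they break as soon as a species does not follow the same accidental pattern.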


### Key points

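  * We want our models to be general enough to work well on new examples, not just on the training data.
  * A model that fits the quirks of its training examples, as Model 2 does, is *overfit*.  Methods to help prevent overfitting are collectively referred to as *regularization* techniques.
  * Do not trust your training examples too much!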

## 2. L<sub>1</sub> and L<sub>2</sub> regularization

For this lesson, we will focus on two widely used regularization methods: L<sub>1</sub> and L<sub>2</sub> regularization.  Both of these methods represent model complexity as a function of the model's feature weights.

Reminder: The general linear regression model looks like this:

$$ y = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_k x_k $$

The L<sub>1</sub> regularization penalty is:

$$L_1 \text{ regularization penalty} = \lambda\sum_{i=1}^k |w_i|$$

```python
import numpy as np

weights = [-0.5, -0.2, 0.5, 0.7, 1.0, 2.5]
```

The L<sub>2</sub> regularization penalty is:

$$L_2 \text{ regularization penalty} = \lambda\sum_{i=1}^k w_i^2$$
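
Both penalties are easy to compute by hand for the example weights. The sketch below assumes the list holds $w_1, \ldots, w_k$ (that is, the intercept $w_0$ is not in it) and uses $\lambda = 1$ purely for illustration.

```python
import numpy as np

# Example weights from the cell above, assumed to be w_1 ... w_k
# (assumption: the intercept w_0 is not included in the list).
weights = np.array([-0.5, -0.2, 0.5, 0.7, 1.0, 2.5])
lam = 1.0  # hypothetical value of lambda, chosen for illustration

l1_penalty = lam * np.sum(np.abs(weights))  # lambda * sum of |w_i|
l2_penalty = lam * np.sum(weights ** 2)     # lambda * sum of w_i^2

print(f"L1 penalty: {l1_penalty:.2f}")
print(f"L2 penalty: {l2_penalty:.2f}")
```

With these weights the L<sub>1</sub> penalty is 5.40 and the L<sub>2</sub> penalty is 8.28. The largest weight, 2.5, accounts for 6.25 of the L<sub>2</sub> total but only 2.5 of the L<sub>1</sub> total, which shows how strongly squaring penalizes large weights.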

### 2.a. Adding regularization to a loss/cost function

Recall that the usual loss function for linear regression is the *mean squared error*:

$$ MSE = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_{i,1} + w_2 x_{i,2} + \ldots + w_k x_{i,k}))^2 $$

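To add L<sub>1</sub> regularization, we add the L<sub>1</sub> penalty to the loss, so we want to minimize:

$$ MSE + \lambda\sum_{i=1}^k |w_i|$$

L<sub>2</sub> regularization works the same way, with the L<sub>2</sub> penalty in place of the L<sub>1</sub> penalty:

$$ MSE + \lambda\sum_{i=1}^k w_i^2$$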
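
As a rough sketch, the regularized loss can be written as a single NumPy function that adds either penalty to the MSE. The function name `regularized_mse` and the tiny dataset are made up for this illustration. The intercept `w0` is left out of the penalty, matching the sums above that start at $i = 1$.

```python
import numpy as np

def regularized_mse(X, y, w0, w, lam, penalty="l1"):
    """MSE plus an L1 or L2 penalty on the weights w (w0 is not penalized)."""
    predictions = w0 + X @ w
    mse = np.mean((y - predictions) ** 2)
    if penalty == "l1":
        return mse + lam * np.sum(np.abs(w))
    if penalty == "l2":
        return mse + lam * np.sum(w ** 2)
    raise ValueError("penalty must be 'l1' or 'l2'")

# Tiny made-up example: 3 observations, 2 features.
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([5.0, 4.0, 7.0])
w0, w = 0.0, np.array([2.0, 1.0])

plain = regularized_mse(X, y, w0, w, lam=0.0)
with_l1 = regularized_mse(X, y, w0, w, lam=0.5, penalty="l1")
with_l2 = regularized_mse(X, y, w0, w, lam=0.5, penalty="l2")

print(f"MSE only:       {plain:.4f}")
print(f"MSE + L1 (0.5): {with_l1:.4f}")
print(f"MSE + L2 (0.5): {with_l2:.4f}")
```

Here the fit itself contributes an MSE of about 0.33, and the penalties raise the total to about 1.83 (L<sub>1</sub>) and 2.83 (L<sub>2</sub>). An optimizer minimizing this quantity therefore has to trade fit against weight size.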



### 2.b. Lambda
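
In the penalties above, $\lambda$ sets how much the penalty counts relative to the MSE: with $\lambda = 0$ the penalty vanishes and we are back to ordinary linear regression, and larger values push the weights harder toward 0. The sketch below illustrates this for L<sub>2</sub> regularization, which has a closed-form solution. It assumes synthetic data with made-up true weights, and it centers the data so that no intercept is needed; `ridge_weights` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 5
X = rng.normal(size=(n, k))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.0])  # made-up weights
y = X @ true_w + rng.normal(scale=0.5, size=n)

# Center the data so the model needs no intercept term.
X = X - X.mean(axis=0)
y = y - y.mean()

def ridge_weights(X, y, lam):
    """Weights minimizing MSE + lam * sum(w_i^2), via the closed-form solution."""
    n_obs, n_feat = X.shape
    # Setting the gradient to zero gives (X^T X + n * lam * I) w = X^T y.
    return np.linalg.solve(X.T @ X + n_obs * lam * np.eye(n_feat), X.T @ y)

lambdas = [0.0, 0.1, 1.0, 10.0]
norms = []
for lam in lambdas:
    w = ridge_weights(X, y, lam)
    norms.append(np.linalg.norm(w))
    print(f"lambda = {lam:5.1f}  weights = {np.round(w, 3)}")
```

As $\lambda$ grows, the overall size of the weight vector shrinks. A very large $\lambda$ leaves a model that barely responds to its inputs.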


### 2.c. Practical differences between L<sub>1</sub> and L<sub>2</sub> regularization

  * L<sub>1</sub> regularization can result in models where some of the feature weights are exactly 0, so it can also act as a form of feature selection.
  * L<sub>2</sub> regularization shrinks model weights toward 0 but generally does not drive them all the way to 0.
  * L<sub>2</sub> regularization (with $\lambda > 0$) results in a minimization problem with a unique solution, which is not always the case for L<sub>1</sub> regularization.
  * Which is best depends on the specifics of the data, the modeling problem, and the goals of the analysis.
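
The first two points can be seen in a small experiment. The sketch below assumes synthetic data in which only the first two of ten features affect `y`, and fits scikit-learn's `Lasso` (L<sub>1</sub>) and `Ridge` (L<sub>2</sub>) models. scikit-learn calls the regularization parameter `alpha` and scales the loss terms somewhat differently from the formulas above, so the `alpha` values here are illustrative and not directly comparable between the two models.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration): 10 features,
# but only the first two actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.2).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty

lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))

print("Lasso weights:", np.round(lasso.coef_, 3))
print("Ridge weights:", np.round(ridge.coef_, 3))
print(f"Weights exactly 0: Lasso = {lasso_zeros}, Ridge = {ridge_zeros}")
```

The lasso fit sets weights on uninformative features to exactly 0, while the ridge fit keeps small nonzero weights on all of them.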
 

## 3. Practice example / demonstration

Let's analyze a dataset called `regularization.csv` that you can find in the `nb-datasets` folder.

### First, try using regular old non-regularized linear regression

```python
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
```
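
The columns of `regularization.csv` are not described in this lesson, so the sketch below uses a synthetic stand-in instead: a small dataset with many features, only three of which affect `y`. The data, sizes, and random seeds are all assumptions made for illustration. The workflow (split, fit, compare training and test MSE) is the part that should carry over to the real file.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse

# Synthetic stand-in data (an assumption, NOT regularization.csv):
# 60 observations, 30 features, only the first 3 affect y.
rng = np.random.default_rng(42)
x = rng.normal(size=(60, 30))
y = 4.0 * x[:, 0] - 3.0 * x[:, 1] + 2.0 * x[:, 2] + rng.normal(scale=1.0, size=60)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Ordinary (non-regularized) linear regression.
linreg = LinearRegression().fit(x_train, y_train)

train_mse = mse(y_train, linreg.predict(x_train))
test_mse = mse(y_test, linreg.predict(x_test))
print(f"Training MSE: {train_mse:.3f}")
print(f"Test MSE:     {test_mse:.3f}")
```

With this many features relative to the number of training examples, a large gap between training and test error is the typical sign that the model has overfit.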



### Try using L<sub>1</sub> regularization
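
A minimal sketch of an L<sub>1</sub>-regularized fit with scikit-learn's `Lasso`, where `alpha` plays the role of $\lambda$. Because the columns of `regularization.csv` are not described here, the code again uses a synthetic stand-in (60 observations, 30 features, 3 of them informative). The data and the value of `alpha` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse

# Synthetic stand-in data (an assumption, NOT regularization.csv).
rng = np.random.default_rng(42)
x = rng.normal(size=(60, 30))
y = 4.0 * x[:, 0] - 3.0 * x[:, 1] + 2.0 * x[:, 2] + rng.normal(scale=1.0, size=60)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

alpha = 0.2  # regularization strength (scikit-learn's name for lambda)
lasso = Lasso(alpha=alpha).fit(x_train, y_train)

# Count how many feature weights the L1 penalty set to exactly 0.
n_zero = int(np.sum(lasso.coef_ == 0))
print(f"alpha = {alpha}: {n_zero} of {lasso.coef_.size} weights are exactly 0")
print("Weights on the 3 informative features:", np.round(lasso.coef_[:3], 2))
print(f"Training MSE: {mse(y_train, lasso.predict(x_train)):.3f}")
print(f"Test MSE:     {mse(y_test, lasso.predict(x_test)):.3f}")
```

If `alpha` is made large enough, every weight is driven to 0 and the model predicts the training mean for every input.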

### Exercise

Try experimenting with the value of the regularization parameter in the code above.  How does changing the value of `alpha` (scikit-learn's name for $\lambda$) affect the results?  When do you get results that are misleading or just plain wrong?


### Try using L<sub>2</sub> regularization
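
A minimal sketch of an L<sub>2</sub>-regularized fit with scikit-learn's `Ridge`, compared against plain linear regression. As before, the data are a synthetic stand-in for `regularization.csv`, and the value of `alpha` is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse

# Synthetic stand-in data (an assumption, NOT regularization.csv).
rng = np.random.default_rng(42)
x = rng.normal(size=(60, 30))
y = 4.0 * x[:, 0] - 3.0 * x[:, 1] + 2.0 * x[:, 2] + rng.normal(scale=1.0, size=60)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

linreg = LinearRegression().fit(x_train, y_train)
ridge = Ridge(alpha=10.0).fit(x_train, y_train)  # alpha plays the role of lambda

# Compare the overall size of the weight vectors and the test error.
linreg_norm = np.linalg.norm(linreg.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(f"Weight norm: linear = {linreg_norm:.3f}, ridge = {ridge_norm:.3f}")
print(f"Ridge weights exactly 0: {int(np.sum(ridge.coef_ == 0))}")
print(f"Test MSE: linear = {mse(y_test, linreg.predict(x_test)):.3f}, "
      f"ridge = {mse(y_test, ridge.predict(x_test)):.3f}")
```

The ridge weights are smaller overall than the unregularized ones, but unlike the L<sub>1</sub> case none of them is expected to be exactly 0.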

## 4.  Practice example using real data

Let's try using regularization on a real dataset.  We'll again use the iris dataset that you've already seen in previous lessons.  We might not have time for this example during the workshop, and if not, I encourage you to explore it on your own.

### Load the data and split out training and testing sets.

```python
idata = pd.read_csv('../nb-datasets/iris_dataset.csv')
idata['species'] = idata['species'].astype('category')

# Convert the categorical variable "species" to 1-hot encoding (AKA "dummy variables"),
# but eliminate the first dummy variable because it is collinear with the other two
# and does not provide any additional information.
idata_enc = pd.get_dummies(idata, drop_first=True)

# Separate the x and y values.
x = idata_enc.drop(columns='petal_length')
y = idata_enc['petal_length']

# Split the train and test sets.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

# See what we have.
idata_enc.head()
```

### Give standard linear regression a try
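
A minimal, self-contained sketch of the baseline fit. So that it runs without the workshop files, it uses scikit-learn's bundled copy of the iris data as a stand-in for `../nb-datasets/iris_dataset.csv`. The column names other than `species` and `petal_length` are assumptions, and `random_state=0` is added only to make the split repeatable.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse

# Stand-in for iris_dataset.csv (assumption: column names below).
iris = load_iris()
idata = pd.DataFrame(
    iris.data,
    columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
)
idata['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
idata_enc = pd.get_dummies(idata, drop_first=True, dtype=float)

# Predict petal length from the other measurements and the species.
x = idata_enc.drop(columns='petal_length')
y = idata_enc['petal_length']
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

linreg = LinearRegression().fit(x_train, y_train)

# Show the fitted weight for each feature.
for name, coef in zip(x.columns, linreg.coef_):
    print(f"{name:>20s}: {coef: .3f}")

train_mse = mse(y_train, linreg.predict(x_train))
test_mse = mse(y_test, linreg.predict(x_test))
print(f"Training MSE: {train_mse:.3f}")
print(f"Test MSE:     {test_mse:.3f}")
```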

### Try L<sub>1</sub> regularization
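
A minimal sketch of the same fit with an L<sub>1</sub> penalty (`Lasso`). It again uses scikit-learn's bundled iris data as a stand-in for the workshop CSV, with assumed column names, and `alpha=0.1` is just a starting value to experiment with.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse

# Stand-in for iris_dataset.csv (assumption: column names below).
iris = load_iris()
idata = pd.DataFrame(
    iris.data,
    columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
)
idata['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
idata_enc = pd.get_dummies(idata, drop_first=True, dtype=float)

x = idata_enc.drop(columns='petal_length')
y = idata_enc['petal_length']
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

alpha = 0.1  # regularization strength; try other values
lasso = Lasso(alpha=alpha).fit(x_train, y_train)

# Show each weight; any exact zeros mark features the model dropped.
for name, coef in zip(x.columns, lasso.coef_):
    print(f"{name:>20s}: {coef: .3f}")
print(f"Weights exactly 0: {int(np.sum(lasso.coef_ == 0))}")

lasso_test_mse = mse(y_test, lasso.predict(x_test))
print(f"Training MSE: {mse(y_train, lasso.predict(x_train)):.3f}")
print(f"Test MSE:     {lasso_test_mse:.3f}")
```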

### Try L<sub>2</sub> regularization
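
A minimal sketch of the same fit with an L<sub>2</sub> penalty (`Ridge`), looping over a few `alpha` values to see how the weights and the test error respond. The data are scikit-learn's bundled iris data standing in for the workshop CSV, with assumed column names, and the `alpha` values are arbitrary starting points.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse

# Stand-in for iris_dataset.csv (assumption: column names below).
iris = load_iris()
idata = pd.DataFrame(
    iris.data,
    columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
)
idata['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
idata_enc = pd.get_dummies(idata, drop_first=True, dtype=float)

x = idata_enc.drop(columns='petal_length')
y = idata_enc['petal_length']
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Fit one ridge model per alpha and record the size of its weight vector.
alphas = [0.1, 1.0, 10.0]
coef_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(x_train, y_train)
    coef_norms.append(np.linalg.norm(ridge.coef_))
    print(f"alpha = {alpha:5.1f}  weights = {np.round(ridge.coef_, 3)}  "
          f"test MSE = {mse(y_test, ridge.predict(x_test)):.3f}")
```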

### Exercises

Try experimenting with the value of `alpha`/$\lambda$ in the code above for both L<sub>1</sub> regularization and L<sub>2</sub> regularization.  As you do so, consider these questions:

1. How does changing the value of the regularization parameter affect the coefficient weights and training/test performance?
2. What values of the regularization parameter give you the best test performance (for example, the lowest test MSE)?
3. For these data, does L<sub>1</sub> or L<sub>2</sub> regularization perform better?