{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 9 - Overfitting\n", "\n", "> What is overfitting and how can it be avoided?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/lvwerra/dslectures/master?urlpath=lab/tree/notebooks%2Flesson09_overfitting.ipynb)[![slides](https://img.shields.io/static/v1?label=slides&message=2021-lesson09.pdf&color=blue&logo=Google-drive)](https://drive.google.com/open?id=1KnV9j6Gnh0aJdhXnXJnMYH8Ppyn-H29U)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning Objectives\n", "Overfitting is a phenomenon that can occur whenever a model is fitted to data. It is therefore important to understand what it entails and how it can be avoided. In this notebook we will address three questions related to overfitting:\n", "1. What is overfitting?\n", "2. How can we measure overfitting?\n", "3. How can overfitting be avoided?\n", "\n", "## References\n", "* Chapter 5: Overfitting and Its Avoidance of _Data Science for Business_ by F. Provost and T. Fawcett\n", "\n", "\n", "## Homework\n", "* Work through part 2 of the notebook concerning the housing dataset.\n", "* Solve the exercises in the notebook. In particular, tune a random forest for the churn dataset in part 3." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is overfitting?\n", "\n", "Even John von Neumann, one of the founding fathers of computing, knew that fitting complex models to data is a tricky business:\n", ">With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.\n", ">\n", "> \\- John von Neumann\n", "\n", "
Figure reference:Irrelevant image.
\n", "Figure reference: https://scikit-learn.org/stable/modules/cross_validation.html
\n", "\n",
"| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | city | postal_code | rooms_per_household | bedrooms_per_household | bedrooms_per_room | population_per_household | ocean_proximity_INLAND | ocean_proximity_<1H OCEAN | ocean_proximity_NEAR BAY | ocean_proximity_NEAR OCEAN | ocean_proximity_ISLAND |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | 69 | 94705 | 6.984127 | 1.023810 | 0.146591 | 2.555556 | 0 | 0 | 1 | 0 | 0 |\n",
"| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | 620 | 94611 | 6.238137 | 0.971880 | 0.155797 | 2.109842 | 0 | 0 | 1 | 0 | 0 |\n",
"| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | 620 | 94618 | 8.288136 | 1.073446 | 0.129516 | 2.802260 | 0 | 0 | 1 | 0 | 0 |\n",
"| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | 620 | 94618 | 5.817352 | 1.073059 | 0.184458 | 2.547945 | 0 | 0 | 1 | 0 | 0 |\n",
"| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | 620 | 94618 | 6.281853 | 1.081081 | 0.172096 | 2.181467 | 0 | 0 | 1 | 0 | 0 |\n", "