{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Descending into ML\n", "\n", "Author: Gaurav Vaidya " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Source\n", "This content is based on the [Descending into ML](https://developers.google.com/machine-learning/crash-course/descending-into-ml/) section of Google's *Machine Learning Crash Course*.\n", "\n", "# Learning objectives\n", "\n", "* A model is a way to predict the label for a given set of features.\n", "* Loss is a way of measuring how far the predicted label is from the actual label.\n", "\n", "# Linear regression for fun and profit\n", "\n", "[Linear regression](https://en.wikipedia.org/wiki/Linear_regression) is a method for finding the straight line or hyperplane that best fits a set of points. \n", "\n", "> If you remember this from previous mathematical training -- great! If not, just think of it as drawing a *line of best fit* on your data. And if you don't know what that is, don't worry, I'll show you!\n", "\n", "# Working with the Iris flower dataset\n", "\n", "Let's start by loading the [Iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set) introduced earlier." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa\n", "5 5.4 3.9 1.7 0.4 setosa\n", "6 4.6 3.4 1.4 0.3 setosa\n", "7 5.0 3.4 1.5 0.2 setosa\n", "8 4.4 2.9 1.4 0.2 setosa\n", "9 4.9 3.1 1.5 0.1 setosa\n", "10 5.4 3.7 1.5 0.2 setosa\n", "11 4.8 3.4 1.6 0.2 setosa\n", "12 4.8 3.0 1.4 0.1 setosa\n", "13 4.3 3.0 1.1 0.1 setosa\n", "14 5.8 4.0 1.2 0.2 setosa\n", "15 5.7 4.4 1.5 0.4 setosa\n", "16 5.4 3.9 1.3 0.4 setosa\n", "17 5.1 3.5 1.4 0.3 setosa\n", "18 5.7 3.8 1.7 0.3 setosa\n", "19 5.1 3.8 1.5 0.3 setosa\n", "20 5.4 3.4 1.7 0.2 setosa\n", "21 5.1 3.7 1.5 0.4 setosa\n", "22 4.6 3.6 1.0 0.2 setosa\n", "23 5.1 3.3 1.7 0.5 setosa\n", "24 4.8 3.4 1.9 0.2 setosa\n", "25 5.0 3.0 1.6 0.2 setosa\n", "26 5.0 3.4 1.6 0.4 setosa\n", "27 5.2 3.5 1.5 0.2 setosa\n", "28 5.2 3.4 1.4 0.2 setosa\n", "29 4.7 3.2 1.6 0.2 setosa\n", ".. ... ... ... ... 
...\n", "120 6.9 3.2 5.7 2.3 virginica\n", "121 5.6 2.8 4.9 2.0 virginica\n", "122 7.7 2.8 6.7 2.0 virginica\n", "123 6.3 2.7 4.9 1.8 virginica\n", "124 6.7 3.3 5.7 2.1 virginica\n", "125 7.2 3.2 6.0 1.8 virginica\n", "126 6.2 2.8 4.8 1.8 virginica\n", "127 6.1 3.0 4.9 1.8 virginica\n", "128 6.4 2.8 5.6 2.1 virginica\n", "129 7.2 3.0 5.8 1.6 virginica\n", "130 7.4 2.8 6.1 1.9 virginica\n", "131 7.9 3.8 6.4 2.0 virginica\n", "132 6.4 2.8 5.6 2.2 virginica\n", "133 6.3 2.8 5.1 1.5 virginica\n", "134 6.1 2.6 5.6 1.4 virginica\n", "135 7.7 3.0 6.1 2.3 virginica\n", "136 6.3 3.4 5.6 2.4 virginica\n", "137 6.4 3.1 5.5 1.8 virginica\n", "138 6.0 3.0 4.8 1.8 virginica\n", "139 6.9 3.1 5.4 2.1 virginica\n", "140 6.7 3.1 5.6 2.4 virginica\n", "141 6.9 3.1 5.1 2.3 virginica\n", "142 5.8 2.7 5.1 1.9 virginica\n", "143 6.8 3.2 5.9 2.3 virginica\n", "144 6.7 3.3 5.7 2.5 virginica\n", "145 6.7 3.0 5.2 2.3 virginica\n", "146 6.3 2.5 5.0 1.9 virginica\n", "147 6.5 3.0 5.2 2.0 virginica\n", "148 6.2 3.4 5.4 2.3 virginica\n", "149 5.9 3.0 5.1 1.8 virginica\n", "\n", "[150 rows x 5 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "iris_dataset = pd.read_csv('../nb-datasets/iris_dataset.csv')\n", "iris_dataset" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_dataset.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width\n", "count 150.000000 150.000000 150.000000 150.000000\n", "mean 5.843333 3.054000 3.758667 1.198667\n", "std 0.828066 0.433594 1.764420 0.763161\n", "min 4.300000 2.000000 1.000000 0.100000\n", "25% 5.100000 2.800000 1.600000 0.300000\n", "50% 5.800000 3.000000 4.350000 1.300000\n", "75% 6.400000 3.300000 5.100000 1.800000\n", "max 7.900000 4.400000 6.900000 2.500000" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_dataset.describe()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "versicolor 50\n", "setosa 50\n", "virginica 50\n", "Name: species, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_dataset.species.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This refers to three species of plants:\n", "* [*Iris versicolor*](https://en.wikipedia.org/wiki/Iris_versicolor) (purple iris or poison flag),\n", "* [*Iris virginica*](https://en.wikipedia.org/wiki/Iris_virginica) (Virginia iris), and\n", "* [*Iris setosa*](https://en.wikipedia.org/wiki/Iris_setosa) (bristle-pointed iris)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# What's a \"Sepal\"?\n", "\n", "![An image illustrating petals and sepals side by side](../nb-images/Petal-sepal.jpg)\n", "![*Alcea rosea*, a plant with distinctive sepals and petals](../nb-images/Alcea_rosea3_ies.jpg)\n", "\n", "A sepal is the green leaf-like structure found underneath the petal in many flowers. [Sepals](https://en.wikipedia.org/wiki/Sepal) provides protection for the flower when budding, and support for it once it is blooming.\n", "\n", "Given that that is the case, we might expect plants with larger petals to also have larger sepals for additional support." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Can we predict the length of the petal of a plant from the length of its sepal?" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXwAAAELCAYAAADawD2zAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3X+cXXV95/HXZ5LJ78HMTrIRmIREJ6UPKhBg1MSUtALa+ivsFhZxy6awdeOvVdR1we7jsdrS1j5CrT9od7VZqBq0Vkz4pVtZcKONUsCdYAgIWKaCZICGZJzgBJIhyXz2j3vv5N6bufecM/ecc8+Z834+Hnkkc8699/u5R/zMuZ/7+X6/5u6IiMj019HuAEREJB1K+CIiBaGELyJSEEr4IiIFoYQvIlIQSvgiIgWhhC8iUhBK+CIiBaGELyJSEDPbHUC1RYsW+fLly9sdhohIruzcuXO/uy8OelymEv7y5csZGBhodxgiIrliZj8P8ziVdERECkIJX0SkIBJN+GZ2upntqvrzSzP7cJJjiojI5BKt4bv7T4FVAGY2A3gGuC3JMUVEZHJplnQuBP7Z3UN9uSAiIvFKM+FfDny9/qCZbTSzATMb2LdvX4rhiIgUSyoJ38xmAeuBb9afc/fN7t7v7v2LFwe2kYqIZMrwwTEe2nOA4YNj7Q4lUFp9+G8BHnT3vSmNJyKSuDt2PcO123bT2dHBkfFxrr/kLNavOrXdYTWUVknnXUxSzhERyavhg2Ncu203h4+MMzp2lMNHxrlm2+5M3+knnvDNbB7wJuDWpMcSEUnL0MghOjtqU2hnRwdDI4faFFGwxEs67v4S0JP0OCIiaertnsuR8fGaY0fGx+ntntumiIJppq2IyBT0LJjN9ZecxZzODrpmz2ROZwfXX3IWPQtmtzu0hjK1eJqISJ6sX3Uqp7xiDjue2M+6lYvoX5HtYoYSvojIFH3i9ofZcv/TANywfZANa5Zx3cVntjmqxlTSERGZgsG9oxPJvmLLfU8zuHe0TREFU8IXEZmCXXsORDqeBUr4IiINNJtFu2rpwkmf0+h4FqiGLyIyiaBZtH1LutiwZhlb7jte1tmwZhl9S7raEW4o5u7tjmFCf3+/a4tDEWm34YNjrN20ncNHjvfZz+ns4N5rLzih7XJw7yi79hxg1dKFbUv2ZrbT3fuDHqc7fBGROpVZtIc5nvArs2jrE37fkq5M39VXUw1fRKROHmfRhqGELyJSJ4+zaMNQSUdEZBLrV53K2r5FDI0cord7bqLJfvjgWCrjKOGLiDTQs2B24nf1aa6pr5KOiEibpL2mvhK+iEibpL2mvhK+iORSnvaSbSTtbiDV8EUkd/K2l2wjlW6ga+reS1LfGyjhi0iuVNe9KxOjrtm2m7V9i3LZNplmN5ASvojkSpRZsHmRRjcQqIYvIjkzXWfBpkEJX0RyZbrOgk2DSjoikknNZp+GqXunNXs1rXHioIQvIpkTpgunWd07rS6evHULqaQjIpnS6uzTtGavpj1LNg6JJ3wzW2hmW83scTN7zMzWJD2miORXq7NP05q9mvYs2TikUdL5PHCXu19qZrOAeSmMKZI5ear1tlOrXThpdfHksVso0Tt8MzsJWAfcBODuL7t7drd0F0nIHbueYe2m7Vxx4wOs3bSdO3c90+6QMqtnwWwuO6+35thl/b2hf0mm1cWTx26hRPe0NbNVwGbgU
eBsYCdwtbu/ONnjtaetTEdR9keV+K5Xkbp0wu5pm3QNfyZwLvAFdz8HeBH4ePUDzGyjmQ2Y2cC+ffsSDkckfXmr9aaxKFmzMeK6Xj0LZnP20oWJJ+G0xolD0jX8IWDI3R8o/7yVuoTv7pspfQqgv78/uY8bIm2Sp1pvGm2GQWPk6XrlTaJ3+O7+L8AeMzu9fOhCSuUdkcLIS603jTbDMGPk5XrlURpdOh8Evlbu0PkZcFUKY4pkSlwrIrZaL272/DQWJQs7xvpVp3LGySexa88BVi1dSN+SrljGL7rEE7677wICv0wQme5aXRGx1XJLFkopYcfI2wzWvNBMW5EcSGP2aRqllDBj5HEGa15oLR2RHGi13BKllJL0ZhxBY0zH9e6zQglfJAfSnH0aVHqKo++82Rjq0kmOSjoiOdBquaVnwWz6T+uuOfba07ojJ+w0ZgyrSyc5usMXyYlWyi2De0f54eBwzbEfDA4zuHc0dAdMmnvJprnPa5Eo4YuEEEcZY3DvaNM2w6DzMPVOn117Jl/CateeA6ETftq19bT2eS0SJXyRAHG0CH7i9ofZcv/TEz9vWLOM6y4+M/T5Vq1aujDS8cmotp5/quGLNBFHi+Dg3tGaZA6w5b6nGdw7Gup8HLrnz6LDao91WOl4WKqt55/u8EWaiKOMEVROiVJumWppaWjkEPNnzWR07OjEsfmzZk76PlrdS1aySwlfpIk4yhhB5ZSw5ZZWSktxznBVbT2/VNIRaSKOMkbfki42rFlWc2zDmmUTd+/d82cxo67eMqPDasotrZaWNMNVQHf4IoHiKGNcd/GZbFi9fNIunKGRQ8zrnFFTbpnXOaOm3BJHaUkzXEUJXySEOGaf9i3pmrQFMky5Ja4OmThmuGZhhyeZGpV0RFrU6uzTMOWWrCxspr158y3RPW2j0p62kjdx7lcb5s45jbvrRmNob97sCrunrUo6Ii2Is+498uLLPLF3lPmzZjR8bhodMo3GUI0//5TwRVoQV2096Zm2cdBM2/xTDV+kBXHU1tOYaRsHzbTNP93hi7So1bbNNGbaxvV8zbTNNyV8kRi0UltPY6ZtHM+v0Ezb/FJJR6TNgmbiQjp72sr0pzt8kQxoNhMX0tvTVqY3JXzJvYEnh9nxxH7WrVxE/4qeE84H1a3TmjkaNE6jmbiQ7p62QTTTNr+U8CXXrrjx/omt+27YPsj5fT3c/O7VE+eD6tZx1bWDtDpOz4LZXNbfy5b7jnfzXNbfG2lP2+svOYtr6mKYyp62aVwvSUbiM23N7ClgFDgGHG02G0wzbSWKgSeHufSv7z/h+Nb3rKZ/RU/gzNC0Zo7GMU5csbZyd66ZttkVdqZtWl/avtHdV4UJSKTa8MExHtpzYNIvF3c8sX/S51SOV+rW1Sp16zDn4xLHOGnFmvUYpDUq6UhmBZUP1q1cxA3bB0943rqVi4DgunVv91wOHTlac/7QkaOxzxyNY5w4avCtlmM00zb/0rjDd+BuM9tpZhtTGE+mgTBthP0reji/r/ZL2vP7eia+uA0zM9SsduOR+p/j0uo4rc5yjaMtUzNt8y+NO/y17v6smf1r4B4ze9zdd1ROln8JbARYtmxZo9eQaapRTTlsG+HN717dtEun2czQoZFDzJk5gyPHjt99z5k5Y9JWxVZq31HGaaaVWa5xtWWuX3UqZ5x8UsP2Ucm2xBO+uz9b/vt5M7sNeB2wo+r8ZmAzlL60TToeyY5mJYYo5YP+FT2TtmNWNJoZGuc+r830ds/lpSPHao69dOTYlEohU53lGlc5Rl06+ZZoScfM5ptZV+XfwJuBR5IcU/IhqMSQlQ0/4iiFjLz4MsfGa+9ljo07Iy++HNt7CRLH9dRs3fxL+g5/CXBbuV45E/hbd78r4TElB8KUGNJYqCuNfV6jLI6WpFavp2br5l+iCd/dfwacneQYkk9hSwxxLNTVSv09jn1ewy6OloYw17PRe1GXTv6pLVPaIq6Zn0FanWkbJs6g1
6gsjlY9S7Z+cbSsaPZe0vrfTJKjPW2lrZJclyXOmbZx7PM6uHc0090tYd+L1tLJnkT2tDWzNwDLq5/n7lsiRydSluTa6kE15yg16Ub7zUZ5je75s1i5pIvu+bPifquhNUvWYd9LK2Uhaa/QCd/MbgZeDeyitC4OlCZVKeFLJoWZaRumJt1sv9m0WjvjEBSDWjenvyhtmf2UJlG9390/WP7zoaQCE2lVUCtimFbFoP1m02rtbFWYGNS6Of1FKek8ArwSeC6hWERiF9SKGHQ+TEtlGq2dFVNd2z9sDGrdnN4CE76ZfYtS6aYLeNTMfgRM/Lp29/XJhSfSuqCac7PzYVsqm71GWqWSuGYut/K9ilo3sy1MSefTwF8Afwj8G+BT5Z8rf0Smre75s5jRUbvQ2YwOi/TFaxqlkizMXE5zHJmawDt8d/8HADPb5O7XVp8zs03APyQUm0jbDY0cYl7nDEbHji98Nq8z3YXPKnG02nGUxszlNMeR6KJ8afumSY69Ja5A5Lhmm35kSRxxBr1GGtdicO8oWwf2THwRWy3KTNugOHsWzObspQsbJsBmrxFXx1FQDHFJaxyJJkwN/33A+4FXmdnuqlNdwL1JBVZUeWlpiyPOLOw326zlEuKZaRtGqzN+NQtWwgicaWtmrwC6gT8DPl51atTdfxFnMEWfaZuXPUPT2KM1jWsxuHeUiz6744Tj3/3IuhNmwsYx07aROGb8hj0v01Nse9q6+wvu/hTwAUqbkVf+YGadLcYpVfKyZ2jYOJuVKLKw32yzlst6jUoUUeJsdD2ivEarpZK8lAslGVH68B8ElgIjgAELgefM7HngP7n7zgTiK5S8tLSFibPVWZ1pXIs4VrGMY6ZtFto2pRiifGl7F/BWd1/k7j2UvrC9hVJ9/38mEVzR5KWlLSjOOGZ1pnEtKqtYVou6imUcM22z0LYpxRDlDr/f3d9b+cHd7zazT7n7R80sWxkpx/LS0ha0V2wcszrTuBbXXXwmG1Yvb2kVy6B9XsO2TLayV2ycC8XJ9BUl4f/CzK4F/q788zuBETObAYw3fppEleQKknFqFGecszrTuBZ9S7paWq44jkXJ4tg3t90lMsm+KCWdfw/0ArcDdwDLysdmAJfFH5rkVV5KU3GIo3wVR7klCyUyyb7Qd/juvh/4YIPTg/GEI9NFXkpTrYqjfBVXuSULJTLJtijr4f8K8DFO3ADlgvjDkukgjo0y4tglKsne9DjKV3GWW7JQIpPsilLD/ybwReBGjm+AIjJlQXXroFmwcYzRqjhmuGqWrKQl9J625Zlc5yUZTNFn2hZJ0OzSKLNgpzpGnOL4FKFZsjJVsc20rfItM3u/mZ1sZv+q8qeFGKXAgmaXRpkFG8cM1lbFsViYFhyTpEUp6fxe+e//WnXMgVfFF44URW/3XA4fra0MHj56bKJuHXYWbBozWEWmi9B3+O6+YpI/SvYyZfXlxOqfw2w8kpVNP0TyIkqXzjzgo8Ayd99oZiuB09392yGeOwMYAJ5x97dPOVqZNoZGDjG3c2bNxiJzO2fWzAwN2ngkS5t+iORBlBr+l4CXgTeUfx4C/iTkc68GHoswlkwDrWwsEqYcE9emH2FWkNQqkzIdRKnhv9rd32lm7wJw90NmZkFPMrNe4G3An1L6hCAF0OrGImFaFXsWzOay83prxrmsvzfSXXyYtk2tMinTRZSE/7KZzaX0RS1m9mogzO3O54BrKO2QJQUwuHe0JgkDbLnvaTasXl7TUtnqzNDhg2PcsnOo5tgtA0NcfeGvhEr61d8BVMpC12zbzdq+RZMue9DoMSJ5EaWk80lKSyQvNbOvAf+XUiJvyMzeDjzfbK18M9toZgNmNrBv374I4UgrkixRRGmpDNKsHNNq22WY5+dlUxqRMKKspXOPmT0IrKa0AcrV5fV1mlkLrDeztwJzgJPM7KvufkXV624GNkNp4lXUNyDRJV2iiKOlMoyg1s4wz4/rewKRPAi8wzezcyt/g
NOA54BngWXlYw25+x+4e6+7LwcuB7ZXJ3tJXxobYYTZWCSuOJq1dgYJ07ap1k6ZTsLc4f9Fk3MOaPG0HElrI4zrLj6T9Wedwo4n9rNu5SL6V/TEHkdQa2cYYdo21dop00Vgwnf3N4Z5ITN7k7vf0+R1vg98P3Rkkoi0ShTV5ZrNP/jZlDYFCRLXewmzgqRWmZTpIMqXtkE2xfhakpA0ShRxbAqSlfciMp1EacsMEtiTL9mQdIkirj1tw1C5RSS8OBO+OmxyJMkSRZx72oahcotIOHGWdESA4zNgq0WdASsi8Ysz4T8V42tJjjWaAat1aETaK7CkY2a/0+y8u99a/rvp46Q4orRcapcnkfSEqeG/o8k5B26NKRaZJsLW8LUomUi6wvThX5VGIDJ9hFnpUouSiaQvUpeOmb0N+DVK6+IA4O7XxR2UNJeHMkhQu2TYsk8e3qtIXkTZ8eqLwDzgjcCNwKXAjxKKSxrIUxmkWbtkmLJPnt6rSB5E6dJ5g7tvAEbc/Y+ANcDSZMKSyaSx8FlagmbJTqf3KpIVUUo6lQXAXzKzU4BhYEX8IUkjaS18lpZmZZ/p9l5FsiBKwv+2mS0E/hx4kFKHzo2JRCWTmo5rszcq+0zH9yrSblFKOte7+wF330ZpXfxfJfwm5hKDIi0WVqT3KpIWC7thhJk96O7nBh1rRX9/vw8MDMT1ctNWkTpXivReRabKzHa6e3/Q48LMtH0lcCow18zO4fiqmCdR6tqRmCnJHaeF0UTiE6aG/1vAlUAv8Jmq478E/lsCMRVaUCuiWhVFZKqilHQuKdfvE1P0ks7wwTHWbtrO4SPHv6yc09nBvddeQM+C2YHnRaSYwpZ0onxpe6+Z3WRm3ykPcIaZ/f6UI5QTVFoRq1VaEcOcFxFpJkrC/xLwf4BTyj//E/Dh2CMqsKBWRLUqikgroiT8Re5+C5Rmwrj7UeBYIlEVVFAroloVRaQVUSZevWhmPZS3MjSz1cALiURVYEGLjmkPVxGZqigJ/6PAncCrzOxeYDGlBdQkZkGtiGpVFJGpiJLwHwVuA14CRoHbKdXxRUQkB6LU8LdQWk7hU8BfAiuBm5s9wczmmNmPzOwhM/uJmf3R1EMVEZFWRLnDP93dz676+Xtm9lDAc8aAC9z9oJl1Aj80s++4+/2RI5VINFtXROpFSfg/NrPVlWRtZq8H7m32BC/N6jpY/rGz/CfcTC+ZMs3GFZHJRCnpvB74RzN7ysyeAu4DfsPMHjaz3Y2eZGYzzGwX8Dxwj7s/0FLE0pQ2DhGRRqLc4f/2VAZw92PAqvJa+reZ2Wvc/ZHKeTPbCGwEWLZs2VSGkCraOEREGgmd8N39560M5O4HzOz7lH5xPFJ1fDOwGUpr6bQyhmg2rog0FqWkE5mZLS7f2WNmc4GLgMeTHLPoNBtXRBqJUtKZipOBr5jZDEq/XG5x928nPGbhaTauiEwm0YTv7ruBc5IcQyan2bgiUi/Rko6IiGSHEr6ISEEo4YuIFIQSvohIQSjhi4gUhBK+iEhBKOGLiBSEEr6ISEEo4YuIFIQSvohIQSjhi4gUhBK+iEhBKOGLiBSEEr6ISEEo4YuIFIQSvohIQSjhi4gUhBK+iEhBKOGLiBSEEr6ISEEo4YuIFIQSvohIQSjhi4gUhBK+iEhBKOGLiBREognfzJaa2ffM7DEz+4mZXZ3keCIi0tjMhF//KPBf3P1BM+sCdprZPe7+aMLjiohInUTv8N39OXd/sPzvUeAx4NQkxxQRkcmlVsM3s+XAOcADdcc3mtmAmQ3s27cvrXBERAonlYRvZguAbcCH3f2X1efcfbO797t7/+LFi9MIp62GD47x0J4DDB8cmxbjiEh+JF3Dx8w6KSX7r7n7rUmPl2V37HqGa7ftprOjgyPj41x/yVmsXxV/hSutcUQkX5Lu0jHgJuAxd/9MkmNl3fDBMa7dtpvDR8YZHTvK4SPjXLNtd+x34GmNIyL5k3RJZy3wH4ALzGxX+
c9bEx4zk4ZGDtHZUXu5Ozs6GBo5lMtxRCR/Ei3puPsPAUtyjLzo7Z7LkfHxmmNHxsfp7Z6by3FEJH800zYlPQtmc/0lZzGns4Ou2TOZ09nB9ZecRc+C2bkcR0TyJ/EvbeW49atOZW3fIoZGDtHbPTexJLx+1amccfJJ7NpzgFVLF9K3pCuRcQb3jiY+xvDBscSvl0hRKOGnrGfB7MQTVxpdOp+4/WG23P/0xM8b1izjuovPjHUMdRuJxEslnWkmjS6dwb2jNckeYMt9TzO4dzS2MdRtJBI/JfxpJo0unV17DkQ6PhXqNhKJnxJ+zIJmuA7uHWXrwJ6W7oYHnhzmM3f/lIEnh084l0aXzqqlCyMdnwp1G4nETzX8GAXVnOOoe19x4/38cLCU6G/YPsj5fT3c/O7VE+d7FszmsvN6a8a5rL831u8N+pZ0sWHNMrbcV/te4vzittJtdE3d9dQXtyJTZ+7e7hgm9Pf3+8DAQLvDmJLhg2Os3bSdw0eO35XO6ezg3msvoGfBbAb3jnLRZ3ec8LzvfmRd6EQ58OQwl/71/Scc3/qe1fSv6AkVR5zUpSOSDWa20937gx6nkk5EjUo2QTXnKHXvRiWbHU/sn/Q1qo+HrX0HlZa0+JrI9KOSTgTNSjZBNeewde9mJZt1Kxdxw/bBE15j3cpFE//u7Z7Liy8frTn/4stHa2rfQaWlMO2QassUyR/d4YcU1CYYNMO1UveuVl/3HnhyeCLZV/xgcHjiTn/F4gWTxlZ9/Ml9Bxmvq9KNe+k4BLdUhmmHVFumSD4V6g6/lXpwpVRymON38ZVSSeW1gmbSXnfxmfzGysXc/ehe3nzGEi4845U155uVbPpX9DA0coiu2TMZHTt+B981e2ZNDEGv0ay01LekK9T7DHqNiqSvt4hEU5iE32p5oLd7LgfHakslB8eOntAm2GwmbXUZ5BsDQyeUQYJKNmHKNUGvEVRaCtMOGaY8Fcf1VlumSLwKUdKJozww8uLL1Pczefl4GGHKIEElm5EXX560XFMdw8J5syZ9jcrx7vmz6Khbv7TDSsch3OJrQeWpOK63FoETiV8h7vDjKA+ELWO08vygkk2Y1whTspk/q3aM+bNmRipNQak8tWH18knbMuMqx6S12JxIURQi4YctDzSrOYftsmn0GmGeH0enTxwlGwi3yFv3/FmsXNI18ekg7PuIIo3F5kSKohAlnZ4Fs+k/rbvm2GtP665JJHfseoa1m7ZzxY0PsHbTdu7c9UzN48N02TR7jTDPD9Ppc35fT81rnN/XU/MaQePEVSpp9l4rs32rxT3bV0SiK8RM26BZrlFmpzaaXRr2NcLMTm30KSGOOIPGCCMojjRn+4pI+Jm2hSjpxNGKWNG3pGvSBBr2NRo9v1qjMkYccQaNEUZQHGm3VGr5BZFwCpHw46prN5NGG2Fv91wOHz1Wc+zw0WOptyoGvdc0Wyo1G1ckvELU8NOoa6fVRlhfgmtHSS7ovaZ1LTQbVySaaXOHH/SxvlkbIcTTAph0G+HQyCHmdta2VM7tnNmW2adB7zWNlkrNxhWJZlok/LAf65Osa8f5Go1kbfZp0HtNuqUya9dDJOtyX9Ip0sd6zT6tpeshEk2id/hm9jfA24Hn3f01SYxRtI/1mn1aS9dDJLykSzpfBv4K2JLUAGl/rM9CC6Bmn9bS9RAJJ9GSjrvvAH6R5BhpfqwPmo0rIpJl0+JL2zQ+1ld/V1ApH12zbTdr+xbp7lJEcqHtCd/MNgIbAZYtWxbw6MaS/lhftO8KRGT6aXuXjrtvdvd+d+9fvHhxu8NpSC2AIpJ3bU/4eaEWQBHJu6TbMr8O/CawyMyGgE+6+01JjpkktQCKSJ4lmvDd/V1Jvn47qAVQRPJKJR0RkYJQwhcRKQglfBGRglDCFxEpCCV8EZGCyNQm5ma2D/h5m8NYBOxvcwxhKM545SVOyE+sijNezeI8zd0DZ65mKuFngZkNhNn9vd0UZ7zyEifkJ1bFG
a844lRJR0SkIJTwRUQKQgn/RJvbHUBIijNeeYkT8hOr4oxXy3Gqhi8iUhC6wxcRKYhCJ3wzm2FmPzazb09y7koz22dmu8p/3t2mGJ8ys4fLMQxMct7M7AYzGzSz3WZ2bkbj/E0ze6Hqen6iTXEuNLOtZva4mT1mZmvqzmflegbFmZXreXpVDLvM7Jdm9uG6x7T9moaMMyvX9CNm9hMze8TMvm5mc+rOzzazb5Sv5wNmtjzsa7d9x6s2uxp4DDipwflvuPt/TjGeRt7o7o36b98CrCz/eT3whfLf7dAsToAfuPvbU4tmcp8H7nL3S81sFjCv7nxWrmdQnJCB6+nuPwVWQekGCngGuK3uYW2/piHjhDZfUzM7FfgQcIa7HzKzW4DLgS9XPez3gRF37zOzy4FNwDvDvH5h7/DNrBd4G3Bju2Np0cXAFi+5H1hoZie3O6gsMrOTgHXATQDu/rK7H6h7WNuvZ8g4s+hC4J/dvX7yZNuvaZ1GcWbFTGCumc2k9Iv+2brzFwNfKf97K3ChmVmYFy5swgc+B1wDjDd5zCXlj6BbzWxpSnHVc+BuM9tZ3v+33qnAnqqfh8rH0hYUJ8AaM3vIzL5jZr+WZnBlrwL2AV8ql/JuNLP5dY/JwvUMEye0/3rWuxz4+iTHs3BNqzWKE9p8Td39GeDTwNPAc8AL7n533cMmrqe7HwVeAHrCvH4hE76ZvR143t13NnnYt4Dl7n4W8F2O/0ZN21p3P5fSx+IPmNm6uvOT/WZvR+tVUJwPUpr+fTbwl8DtaQdI6c7pXOAL7n4O8CLw8brHZOF6hokzC9dzQrnstB745mSnJznWlvbAgDjbfk3NrJvSHfwK4BRgvpldUf+wSZ4a6noWMuEDa4H1ZvYU8HfABWb21eoHuPuwu4+Vf/xfwHnphjgRx7Plv5+nVHN8Xd1DhoDqTx+9nPgRMHFBcbr7L939YPnffw90mtmilMMcAobc/YHyz1spJdb6x7T7egbGmZHrWe0twIPuvneSc1m4phUN48zINb0IeNLd97n7EeBW4A11j5m4nuWyzyuAX4R58UImfHf/A3fvdffllD7ebXf3mt+idTXG9ZS+3E2Vmc03s67Kv4E3A4/UPexOYEO5E2I1pY+Az2UtTjN7ZaXOaGavo/Tf3nCacbr7vwB7zOz08qELgUfrHtb26xkmzixczzrvonGZpO3XtErDODNyTZ8GVpvZvHIsF3Ji7rkT+L3yvy+llL9C3eEXvUunhpldBwy4+53Ah8xsPXCU0m/PK9sQ0hLgtvJ/gzOBv3X3u8zsvQDu/kXg74G3AoPAS8BVGY3zUuB9ZnYUOARcHvY/0ph9EPha+aP9z4CrMng9w8SZleuJmc0D3gS8p+pY5q5piDhxEX6eAAADHElEQVTbfk3d/QEz20qpvHQU+DGwuS433QTcbGaDlHLT5WFfXzNtRUQKopAlHRGRIlLCFxEpCCV8EZGCUMIXESkIJXwRkYJQwhcRKQglfJGy8vK4JyyVXXX+SjP7qwTGvdLMTqn6+ak2z5qVaUoJX6T9rqS0bopIojTTVnKlvHTDLZTWY5kB/DGlGZyfARYA+4Er3f05M/s+sIvSuj4nAf/R3X9Unjb/OWAupRmVV5XXS48Sx2Lgi8Cy8qEPu/u9ZvaH5WOvKv/9OXe/ofyc/w78LqWVDvcDO4GngH5Ks2oPAZWNTj5oZu8AOoF/5+6PR4lPZDK6w5e8+W3gWXc/291fA9xFaWXDS939POBvgD+tevx8d38D8P7yOYDHgXXllSg/AXxqCnF8Hvisu78WuITafRV+FfgtSr9oPmlmnWbWX37cOcDvUEryuPtWYAD4XXdf5e6Hyq+xv7z66BeAj00hPpET6A5f8uZh4NNmtgn4NjACvAa4p7yWzwxK64hXfB3A3XeY2UlmthDoAr5iZispLSvbOYU4LgLOqNp34qTKAnLA/y6vtDpmZs9TWmvo14E7KgndzL4V8Pq3lv/eSekXhEjLl
PAlV9z9n8zsPEqLcf0ZcA/wE3df0+gpk/z8x8D33P3fWmk/0O9PIZQOYE3VHTkA5V8AY1WHjlH6/1moHYmqVF6j8nyRlqmkI7lS7mZ5yd2/SmlnoNcDi628yXe5fFK9U9E7y8d/ndKyvC9QWj/8mfL5K6cYyt3AxH7HZrYq4PE/BN5hZnPMbAGl7TUrRil96hBJlO4cJG/OBP7czMaBI8D7KC0je4OZvYLSf9OfA35SfvyImf0j5S9ty8eup1TS+SiwfYpxfAj4H2a2uzzmDuC9jR7s7v/PzO4EHgJ+Tqlu/0L59JeBL9Z9aSsSOy2PLNNWuUvnY+4+0O5YAMxsgbsfLK/LvgPY6O4PtjsuKQ7d4YukZ7OZnQHMAb6iZC9p0x2+SB0zuwq4uu7wve7+gXbEIxIXJXwRkYJQl46ISEEo4YuIFIQSvohIQSjhi4gUhBK+iEhB/H9PzZ1A5+UaYQAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "iris_dataset.plot(\"sepal_length\", \"petal_length\", kind=\"scatter\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "iris1 = iris_dataset.plot(\n", " \"sepal_length\",\n", " \"petal_length\",\n", " kind=\"scatter\",\n", " title=\"Petal and sepal length in three species of Iris\"\n", ")\n", "iris1.set_xlabel(\"Sepal length (cm)\")\n", "iris1.set_ylabel(\"Petal length (cm)\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks like the answer is... yes! If we draw a line across the plot, we can *predict* what the petal length might be for a plant given a particular sepal length.\n", "\n", "**That's all a model is!** -- something that can extrapolate from known data to predict what the value might be for a given input value." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Linear regression can define that model precisely\n", "\n", "Drawing a line by hand is fine, but we would like to determine exactly how the petal length varies as the sepal length varies. Luckily, `matplotlib` can run a linear regression for us easily." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 5.1\n", "1 4.9\n", "2 4.7\n", "3 4.6\n", "4 5.0\n", "Name: sepal_length, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_dataset.sepal_length.head()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.4\n", "1 1.4\n", "2 1.3\n", "3 1.5\n", "4 1.4\n", "Name: petal_length, dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_dataset.petal_length.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn's LinearRegression module needs the data as a two-dimensional array:\n", "* It expects each row to contain multiple features.\n", "* It expects as many rows as there are data points." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[5.1],\n", " [4.9],\n", " [4.7],\n", " [4.6],\n", " [5. ],\n", " [5.4],\n", " [4.6],\n", " [5. ],\n", " [4.4],\n", " [4.9],\n", " [5.4],\n", " [4.8],\n", " [4.8],\n", " [4.3],\n", " [5.8],\n", " [5.7],\n", " [5.4],\n", " [5.1],\n", " [5.7],\n", " [5.1],\n", " [5.4],\n", " [5.1],\n", " [4.6],\n", " [5.1],\n", " [4.8],\n", " [5. ],\n", " [5. ],\n", " [5.2],\n", " [5.2],\n", " [4.7],\n", " [4.8],\n", " [5.4],\n", " [5.2],\n", " [5.5],\n", " [4.9],\n", " [5. ],\n", " [5.5],\n", " [4.9],\n", " [4.4],\n", " [5.1],\n", " [5. ],\n", " [4.5],\n", " [4.4],\n", " [5. ],\n", " [5.1],\n", " [4.8],\n", " [5.1],\n", " [4.6],\n", " [5.3],\n", " [5. ],\n", " [7. ],\n", " [6.4],\n", " [6.9],\n", " [5.5],\n", " [6.5],\n", " [5.7],\n", " [6.3],\n", " [4.9],\n", " [6.6],\n", " [5.2],\n", " [5. ],\n", " [5.9],\n", " [6. ],\n", " [6.1],\n", " [5.6],\n", " [6.7],\n", " [5.6],\n", " [5.8],\n", " [6.2],\n", " [5.6],\n", " [5.9],\n", " [6.1],\n", " [6.3],\n", " [6.1],\n", " [6.4],\n", " [6.6],\n", " [6.8],\n", " [6.7],\n", " [6. ],\n", " [5.7],\n", " [5.5],\n", " [5.5],\n", " [5.8],\n", " [6. ],\n", " [5.4],\n", " [6. ],\n", " [6.7],\n", " [6.3],\n", " [5.6],\n", " [5.5],\n", " [5.5],\n", " [6.1],\n", " [5.8],\n", " [5. ],\n", " [5.6],\n", " [5.7],\n", " [5.7],\n", " [6.2],\n", " [5.1],\n", " [5.7],\n", " [6.3],\n", " [5.8],\n", " [7.1],\n", " [6.3],\n", " [6.5],\n", " [7.6],\n", " [4.9],\n", " [7.3],\n", " [6.7],\n", " [7.2],\n", " [6.5],\n", " [6.4],\n", " [6.8],\n", " [5.7],\n", " [5.8],\n", " [6.4],\n", " [6.5],\n", " [7.7],\n", " [7.7],\n", " [6. ],\n", " [6.9],\n", " [5.6],\n", " [7.7],\n", " [6.3],\n", " [6.7],\n", " [7.2],\n", " [6.2],\n", " [6.1],\n", " [6.4],\n", " [7.2],\n", " [7.4],\n", " [7.9],\n", " [6.4],\n", " [6.3],\n", " [6.1],\n", " [7.7],\n", " [6.3],\n", " [6.4],\n", " [6. 
],\n", " [6.9],\n", " [6.7],\n", " [6.9],\n", " [5.8],\n", " [6.8],\n", " [6.7],\n", " [6.7],\n", " [6.3],\n", " [6.5],\n", " [6.2],\n", " [5.9]])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_dataset.sepal_length.values.reshape(-1, 1)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1.85750967] -7.0953814782793145\n" ] } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "# Features must be 2-D: one row per flower, one column per feature.\n", "X = iris_dataset.sepal_length.values.reshape(-1, 1)\n", "# The label we want to predict.\n", "Y = iris_dataset.petal_length\n", "\n", "model = LinearRegression()\n", "model.fit(X, Y)\n", "# coef_ holds one slope per feature; intercept_ is the y-intercept.\n", "slopes = model.coef_\n", "intercept = model.intercept_\n", "\n", "print(slopes, intercept)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In other words, based on the available data, we can construct a model that predicts a petal length given a particular sepal length (coefficients rounded):\n", "\n", "$$petal\\_length = 1.8575 \\times sepal\\_length - 7.095$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Equation of a Line\n", "\n", "You may be familiar with this as the [equation of a line](https://en.wikipedia.org/wiki/Linear_equation#One_variable):\n", "\n", "$$ y = mx + c $$\n", "\n", "See how easy it is to predict a petal length given a sepal length: you just plug it into the equation! 
For example:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.1925\n" ] } ], "source": [ "# What is the predicted petal length when the sepal length is 5 cm?\n", "sepal_length = 5\n", "petal_length = 1.8575 * sepal_length - 7.095\n", "print(petal_length)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.192166848827913\n" ] } ], "source": [ "# We can be more precise by using the fitted slope and intercept instead of the rounded values.\n", "petal_length = slopes[0] * sepal_length + intercept\n", "print(petal_length)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine learning? For real?\n", "\n", "Yes! In machine-learning terms, this equation is conventionally written as:\n", "\n", "$$ y' = b + w_1x_1 $$\n", "\n", "Where:\n", "* $y'$ is the *predicted label* (the desired output), which in our example is the petal length in centimeters.\n", "* $b$ is the *bias* (the y-intercept, sometimes referred to as $w_0$), which in our example is about -7.095 cm.\n", "* $w_1$ is the *weight* of feature 1, which is the same concept as the \"slope\" in the traditional equation of the line. In our example, this is about 1.8575.\n", "* $x_1$ is a *feature* (a known input), which in our example is the sepal length in centimeters.\n", "\n", "Writing it this way makes it easy to extend our model to multiple features, such as sepal length, sepal width, and more. In that case, our equation would look like:\n", "\n", "$$ y' = b + w_1x_1 + w_2x_2 + w_3x_3 + \\ldots + w_nx_n $$\n", " \n", " \n", "# What does this model actually look like?\n", "\n", "We can draw this model onto our plot from earlier as a *line of best fit*.\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "iris1 = iris_dataset.plot(\"sepal_length\", \"petal_length\", kind=\"scatter\", title=\"Petal and sepal length in three species of Iris\", color=\"red\")\n", "iris1.set_xlabel(\"Sepal length (cm)\")\n", "iris1.set_ylabel(\"Petal length (cm)\")\n", "\n", "iris1.plot(iris_dataset.sepal_length, iris_dataset.sepal_length * slopes[0] + intercept, 'black')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Err...\n", "\n", "You might have noticed that this is *not* a very good model:\n", "\n", "* There are a lot of points at the bottom of the figure that don't follow the predicted relationship.\n", "* Petal lengths above 3 cm seem to increase more slowly with sepal length than the fitted line predicts.\n", "\n", "Let's calculate the predicted petal length for each flower and compare what our model predicts with the actual petal length we see." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>sepal_length</th><th>petal_length</th><th>predicted_petal_length</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>5.1</td><td>1.4</td><td>2.377918</td></tr>\n",
 "    <tr><th>1</th><td>4.9</td><td>1.4</td><td>2.006416</td></tr>\n",
 "    <tr><th>2</th><td>4.7</td><td>1.3</td><td>1.634914</td></tr>\n",
 "    <tr><th>3</th><td>4.6</td><td>1.5</td><td>1.449163</td></tr>\n",
 "    <tr><th>4</th><td>5.0</td><td>1.4</td><td>2.192167</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>
" ], "text/plain": [ " sepal_length petal_length predicted_petal_length\n", "0 5.1 1.4 2.377918\n", "1 4.9 1.4 2.006416\n", "2 4.7 1.3 1.634914\n", "3 4.6 1.5 1.449163\n", "4 5.0 1.4 2.192167" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame({\n", " 'sepal_length': iris_dataset.sepal_length,\n", " 'petal_length': iris_dataset.petal_length\n", "})\n", "data['predicted_petal_length'] = data.sepal_length * slopes[0] + intercept\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we plot the predicted petal length against the actual petal length, what would we expect to see?" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "predicted1 = data.plot('petal_length', 'predicted_petal_length', kind='scatter', color='red')\n", "predicted1.plot(data.petal_length, data.petal_length, 'black')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Loss\n", "\n", "Loss is a number indicating how bad the model's prediction was on one particular data point. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. **The goal of training a model is to find a set of *weights* and *biases* that have low loss, on average, across all examples.**\n", "\n", "In this example, we have **150 data points** that provide a feature (the sepal length) as well as the label (the petal length). We can use these to determine how much total loss our model has over this dataset by calculating predicted labels and comparing them to the actual labels.\n", "\n", "There are many different measures of loss. One common measure of loss that is particularly useful in linear regressions is *squared loss* (or $L_2$ loss). This is defined as *the square of the difference between the label and the prediction*. In other words, it is equal to:\n", "$$ squared\\ loss = (actual\\ label - predicted\\ label)^2 $$\n", "$$ = (observation - prediction(x))^2 $$\n", "$$ = (y - y')^2 $$\n", "\n", "We can use this equation to find the loss for a single data point. What does this look like in Python?" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>sepal_length</th><th>petal_length</th><th>predicted_petal_length</th><th>squared_error</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>5.1</td><td>1.4</td><td>2.377918</td><td>0.956323</td></tr>\n",
 "    <tr><th>1</th><td>4.9</td><td>1.4</td><td>2.006416</td><td>0.367740</td></tr>\n",
 "    <tr><th>2</th><td>4.7</td><td>1.3</td><td>1.634914</td><td>0.112167</td></tr>\n",
 "    <tr><th>3</th><td>4.6</td><td>1.5</td><td>1.449163</td><td>0.002584</td></tr>\n",
 "    <tr><th>4</th><td>5.0</td><td>1.4</td><td>2.192167</td><td>0.627528</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>
" ], "text/plain": [ " sepal_length petal_length predicted_petal_length squared_error\n", "0 5.1 1.4 2.377918 0.956323\n", "1 4.9 1.4 2.006416 0.367740\n", "2 4.7 1.3 1.634914 0.112167\n", "3 4.6 1.5 1.449163 0.002584\n", "4 5.0 1.4 2.192167 0.627528" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['squared_error'] = (data.petal_length - data.predicted_petal_length)**2\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But how can we measure our total loss across all our 150 data points?\n", "\n", "## Mean Square Error (MSE)\n", "\n", "The Mean Square Error (MSE) can be calculated as the arithmetic mean of all squared losses in a particular dataset $D$. We can calculate this as the total squared loss divided by the number of data points $N$, i.e.:\n", "\n", "$$ MSE = \\frac{1}{N} \\sum_{(x, y)\\ \\in\\ D}{(y - y')^2} $$\n", "\n", "What does this look like in Python?" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7423201713947026" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.squared_error.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tada!\n", "\n", "We now have:\n", "* A model that allows us to *predict* (or *infer*) the petal length from the sepal length within this dataset.\n", "* A measure for determining how much a single prediction varied from the actual *label*, i.e. its *loss*.\n", "* A measure for determining how well the model performed on a particular dataset, i.e. its *mean square error*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercises\n", "\n", "Here are a few exercises to test your understanding of this material." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1\n", "\n", "In the [Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), we looked at whether we could predict the petal length based on sepal length.\n", "\n", "For this exercise, try using the sepal width to predict the petal width: find the equation of the line of best fit, and plot that line on the same graph as the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas\n", "import numpy\n", "import matplotlib.pyplot as plt\n", "\n", "# Import Iris dataset.\n", "iris_dataset = # How can we load our dataset?\n", "\n", "# Plot sepal widths against petal widths.\n", "iris_dataset.plot(\n", " # How do we plot this dataset?\n", ")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hmm, this is NOT looking good. Oh well, let's see how awful it is!\n", "# Construct our model.\n", "slope, intercept = # How do we calculate the slope (weight) and intercept (bias)?\n", "print(slope, intercept)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris1 = iris_dataset.plot(\n", " # How do you plot a pretty graph?\n", ")\n", "iris1.set_xlabel(\"#TODO\")\n", "iris1.set_ylabel(\"#TODO\")\n", "iris1.plot(\n", " # How can you plot the line of best fit?\n", ")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2\n", "\n", "For the model that predicts petal width from sepal width in Exercise 1, calculate the Mean Square Error (MSE)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Calculate predicted petal widths.\n", "iris_dataset['predicted_petal_width'] = # How?\n", "iris_dataset['squared_error'] = # How??\n", "print(\"The mean squared error is: \",\n", " # How???\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }