{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "%%html\n", "\n", "\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# Install the necessary dependencies\n", "\n", "import os\n", "import sys\n", "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython " ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "---\n", "license:\n", " code: MIT\n", " content: CC-BY-4.0\n", "github: https://github.com/ocademy-ai/machine-learning\n", "venue: By Ocademy\n", "open_access: true\n", "bibliography:\n", " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n", "---" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Linear Regression" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## What is Linear Regression?\n", "\n", "Finding a straight line of best fit through the data. This works well when the true underlying function is linear.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Example\n", "\n", "We use features $\\mathbf{x}$ to predict a \"response\" $y$. For example we might want to regress `num_hours_studied` onto `exam_score` - in other words we predict exam score from number of hours studied.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let's generate some example data for this case and examine the relationship between $\\mathbf{x}$ and $y$.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "num_hours_studied = np.array([1, 3, 3, 4, 5, 6, 7, 7, 8, 8, 10])\n", "exam_score = np.array([18, 26, 31, 40, 55, 62, 71, 70, 75, 85, 97])\n", "plt.scatter(num_hours_studied, exam_score)\n", "plt.xlabel('num_hours_studied')\n", "plt.ylabel('exam_score')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can see the this is nearly a straight line. We suspect with such a high linear correlation that linear regression will be a successful technique for this task." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We will now build a linear model to fit this data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Linear Model\n", "\n", "#### Hypothesis\n", "\n", "A linear model makes a \"hypothesis\" about the true nature of the underlying function - that it is linear. We express this hypothesis in the univariate case as\n", "\n", "$$h_\\theta(x) = ax + b$$\n", "\n", "Our simple example above was an example of \"univariate regression\" - i.e. just one variable (or \"feature\") - number of hours studied. 
Below we will have more than one feature (\"multivariate regression\") which is given by\n", "\n", "$$h_\\theta(\\mathbf{x}) = \\mathbf{a}^\\top \\mathbf{x}$$\n", "\n", "Here $\\mathbf{a}$ is a vector of learned parameters, and $\\mathbf{x}$ is a single data point - one row of the \"design matrix\" $\\mathbf{X}$ that holds all the data points. In this formulation the intercept term has been added to the design matrix as the first column (of all ones), so the vector of predictions for the whole dataset can be written compactly as $\\mathbf{X}\\mathbf{a}$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Design Matrix\n", "\n", "In general, with $n$ data points and $p$ features, our design matrix will have $n$ rows and $p + 1$ columns - one column per feature, plus the leading bias column of ones.\n", "\n", "Returning to our exam score regression example, let's add one more feature - number of hours slept the night before the exam. If we have 4 data points and 2 features, then our matrix will be of shape $4 \\times 3$ (remember we add a bias column). It might look like\n", "\n", "$$\n", "\\begin{bmatrix}\n", " 1 & 1 & 8 \\\\\n", " 1 & 5 & 6 \\\\\n", " 1 & 7 & 6 \\\\\n", " 1 & 8 & 4 \\\\\n", "\\end{bmatrix}\n", "$$\n", "\n", "Notice we do **not** include the response (label/target) in the design matrix." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Univariate Example\n", "\n", "Let's now see what our univariate example looks like" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from sklearn import linear_model\n", "\n", "# Fit the model\n", "exam_model = linear_model.LinearRegression()\n", "x = np.expand_dims(num_hours_studied, 1)\n", "y = exam_score\n", "exam_model.fit(x, y)\n", "a = exam_model.coef_\n", "b = exam_model.intercept_\n", "print(exam_model.coef_)\n", "print(exam_model.intercept_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Visualize the results\n", "plt.scatter(num_hours_studied, exam_score)\n", "x = np.linspace(0, 10)\n", "y = a * x + b\n", "plt.plot(x, y, 'r')\n", "plt.xlabel('num_hours_studied')\n", "plt.ylabel('exam_score')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The line fits pretty well by eye, as it should, because the true function is linear and the data has just a little noise." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "But we need a mathematical way to define a good fit in order to find the optimal parameters for our hypothesis." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### What is a Good Fit?\n", "\n", "Typically we use \"mean squared error\" (MSE) to measure the goodness of fit in a regression problem.\n", "\n", "$$\\text{MSE} = \\frac{1}{n} \\sum_{i=1}^n (y^{(i)} - h_\\theta(\\mathbf{x}^{(i)}))^2$$\n", "\n", "You can see that this measures how far each real data point is from the corresponding prediction, which makes good sense. Here is a visualization\n", "\n", "![](../images/ml-fundamentals/linear-regression/residual.webp)\n", "\n", "\n", "This function is then taken to be our \"loss\" function - a measure of how badly we are doing. In general we want to minimize this."
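, "\n", "\n", "As a quick sketch, the next cell computes this quantity for the exam-score model we fitted above (it reuses `exam_model`, `num_hours_studied` and `exam_score` from the earlier cells)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Illustrative sketch: compute the MSE of the univariate exam-score model fitted above\n", "y_hat = exam_model.predict(np.expand_dims(num_hours_studied, 1))\n", "mse = np.mean((exam_score - y_hat) ** 2)\n", "print(mse)"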
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Optimization Problem\n", "\n", "The typical recipe for machine learning algorithms is to define a loss function of the parameters of a hypothesis, then to minimize the loss function. In our case we have the optimization problem\n", "\n", "$$\\min_{\\mathbf{a}} \\frac{1}{n} \\sum_{i=1}^n (y^{(i)} - \\mathbf{a}^\\top\\mathbf{X}^{(i)})^2$$\n", "\n", "Note that we have added the bias into the design matrix." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Normal Equations\n", "\n", "Linear regression actually has a closed-form solution - the normal equation. It is beyond our scope to show the derivation, but here it is:\n", "\n", "$$\\mathbf{a}^* = (\\mathbf{X}^\\top\\mathbf{X})^{-1}\\mathbf{X}^\\top \\mathbf{y}$$\n", "\n", "We won't be implementing this equation, but you should know this is what `sklearn.linear_model.LinearRegression` is doing under the hood. We will talk more about optimization in later tutorials, where we have no closed-form solution." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Normalization\n", "\n", "It is generally a good idea to normalize all the values in the design matrix. This means all values should be in the range $(0, 1)$ and centered around zero.\n", "\n", "![](../images/ml-fundamentals/linear-regression/normalization.jpeg)\n", "\n", "Normalization usually helps the learning algorithm perform better." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Example 1\n", "\n", "Simple Linear Regression" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Importing the libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Importing the dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "dataset = pd.read_csv('../../assets/data/Salary_Data.csv')\n", "X = dataset.iloc[:, :-1].values\n", "y = dataset.iloc[:, -1].values" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Splitting the dataset into the Training set and Test set" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Training the Simple Linear Regression model on the Training set" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "regressor = LinearRegression()\n", "regressor.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Predicting the Test set 
results" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "y_pred = regressor.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Visualising the Training set results" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "plt.scatter(X_train, y_train, color = 'red')\n", "plt.plot(X_train, regressor.predict(X_train), color = 'blue')\n", "plt.title('Salary vs Experience (Training set)')\n", "plt.xlabel('Years of Experience')\n", "plt.ylabel('Salary')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Visualising the Test set results" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "plt.scatter(X_test, y_test, color = 'red')\n", "plt.plot(X_train, regressor.predict(X_train), color = 'blue')\n", "plt.title('Salary vs Experience (Test set)')\n", "plt.xlabel('Years of Experience')\n", "plt.ylabel('Salary')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Example 2\n", "\n", "Multiple Linear Regression" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Importing the libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Importing the dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "dataset = pd.read_csv('../../assets/data/50_Startups.csv')\n", "X = dataset.iloc[:, :-1].values\n", "y = dataset.iloc[:, -1].values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "print(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "print(y)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Encoding categorical data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from sklearn.compose import ColumnTransformer\n", "from sklearn.preprocessing import OneHotEncoder\n", "ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')\n", "X = np.array(ct.fit_transform(X))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "print(X)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Splitting the dataset into the Training set and Test set" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = 
train_test_split(X, y, test_size = 0.2, random_state = 0)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Training the Multiple Linear Regression model on the Training set" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "regressor = LinearRegression()\n", "regressor.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Predicting the Test set results" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "y_pred = regressor.predict(X_test)\n", "np.set_printoptions(precision=2)\n", "print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Example 3\n", "\n", "Polynomial Linear Regression" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Importing the libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Importing the dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "dataset = pd.read_csv('../../assets/data/Position_Salaries.csv')\n", "X = dataset.iloc[:, 1:-1].values\n", "y = dataset.iloc[:, -1].values" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Training the Linear Regression model on the whole dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "lin_reg = LinearRegression()\n", "lin_reg.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Training the Polynomial Regression model on the whole dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from sklearn.preprocessing import PolynomialFeatures\n", "poly_transformer = PolynomialFeatures(degree = 4)\n", "X_poly = poly_transformer.fit_transform(X)\n", "lin_reg_2 = LinearRegression()\n", "lin_reg_2.fit(X_poly, y)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Visualising the Linear and Polynomial Regression results" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "plt.scatter(X, y, color = 'red')\n", "plt.plot(X, lin_reg.predict(X), color = 'blue', label = 'Linear Regression')\n", "plt.plot(X, lin_reg_2.predict(poly_transformer.transform(X)), color = 'green', label = 'Polynomial Regression')\n", "plt.title('Truth or Bluff (Linear vs Polynomial Regression)')\n", "plt.xlabel('Position Level')\n", "plt.ylabel('Salary')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Implementation of Linear Regression from scratch" ] }, { "cell_type": "code", 
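"execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# A minimal sketch of the closed-form normal equation from the \"Normal Equations\" section above\n", "# (illustrative only; it reuses `num_hours_studied` and `exam_score` defined at the top of this notebook).\n", "X_exam = np.column_stack([np.ones_like(num_hours_studied, dtype=float), num_hours_studied])\n", "a_star = np.linalg.inv(X_exam.T @ X_exam) @ X_exam.T @ exam_score\n", "print(a_star)  # [intercept, slope] - should match the sklearn fit from earlier" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Alternatively, we can minimize the MSE by gradient descent. Writing the prediction for each example as $\\hat{y}^{(i)} = \\mathbf{w}^\\top \\mathbf{x}^{(i)} + b$ (here the bias $b$ is kept as a separate parameter rather than a column of ones), the gradients of the MSE are\n", "\n", "$$\\frac{\\partial}{\\partial \\mathbf{w}} \\text{MSE} = \\frac{2}{n} \\mathbf{X}^\\top (\\hat{\\mathbf{y}} - \\mathbf{y}), \\qquad \\frac{\\partial}{\\partial b} \\text{MSE} = \\frac{2}{n} \\sum_{i=1}^n (\\hat{y}^{(i)} - y^{(i)})$$\n", "\n", "The class below repeatedly steps the parameters against these gradients (the constant factor 2 is dropped, since it is absorbed into the learning rate)." ] }, { "cell_type": "code", 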
"execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class MyOwnLinearRegression:\n", " def __init__(self, learning_rate=0.0001, n_iters=30000):\n", " self.lr = learning_rate\n", " self.n_iters = n_iters\n", " self.weights = None\n", " self.bias = None\n", "\n", " def fit(self, X, y):\n", " n_samples, n_features = X.shape\n", "\n", " # init parameters\n", " self.weights = np.zeros(n_features)\n", " self.bias = 0\n", "\n", " # gradient descent\n", " for _ in range(self.n_iters):\n", " # approximate y with linear combination of weights and x, plus bias\n", " y_predicted = np.dot(X, self.weights) + self.bias\n", "\n", " # compute gradients\n", " dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))\n", " db = (1 / n_samples) * np.sum(y_predicted - y)\n", " # update parameters\n", " self.weights = self.weights - self.lr * dw\n", " self.bias = self.bias - self.lr * db\n", "\n", " def predict(self, X):\n", " y_predicted = np.dot(X, self.weights) + self.bias\n", " return y_predicted" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Traning model with our own Linear Regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "dataset = pd.read_csv('../../assets/data/Salary_Data.csv')\n", "X = dataset.iloc[:, :-1].values\n", "y = dataset.iloc[:, -1].values\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)\n", "regressor = MyOwnLinearRegression()\n", "regressor.fit(X_train, y_train)\n", "y_pred = regressor.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Visualising the Training set results" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "plt.scatter(X_train, y_train, color = 'red')\n", "plt.plot(X_train, regressor.predict(X_train), color = 'blue')\n", "plt.title('Salary vs Experience (Training set)')\n", "plt.xlabel('Years of Experience')\n", "plt.ylabel('Salary')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Acknowledgments\n", "\n", "Thanks to TIM NIVEN for creating the open-source [Kaggle jupyter notebook](https://www.kaggle.com/code/timniven/linear-regression-tutorial), licensed under Apache 2.0. It inspires the majority of the content of this assignment." ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 }