{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# Install the necessary dependencies\n",
"\n",
"import os\n",
"import sys\n",
"!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython "
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": [
"remove-cell"
]
},
"source": [
"---\n",
"license:\n",
" code: MIT\n",
" content: CC-BY-4.0\n",
"github: https://github.com/ocademy-ai/machine-learning\n",
"venue: By Ocademy\n",
"open_access: true\n",
"bibliography:\n",
" - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Linear Regression"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## What is Linear Regression?\n",
"\n",
"Finding a straight line of best fit through the data. This works well when the true underlying function is linear.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### Example\n",
"\n",
"We use features $\\mathbf{x}$ to predict a \"response\" $y$. For example we might want to regress `num_hours_studied` onto `exam_score` - in other words we predict exam score from number of hours studied.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Let's generate some example data for this case and examine the relationship between $\\mathbf{x}$ and $y$.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"num_hours_studied = np.array([1, 3, 3, 4, 5, 6, 7, 7, 8, 8, 10])\n",
"exam_score = np.array([18, 26, 31, 40, 55, 62, 71, 70, 75, 85, 97])\n",
"plt.scatter(num_hours_studied, exam_score)\n",
"plt.xlabel('num_hours_studied')\n",
"plt.ylabel('exam_score')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"We can see the this is nearly a straight line. We suspect with such a high linear correlation that linear regression will be a successful technique for this task."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"We will now build a linear model to fit this data."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Linear Model\n",
"\n",
"#### Hypothesis\n",
"\n",
"A linear model makes a \"hypothesis\" about the true nature of the underlying function - that it is linear. We express this hypothesis in the univariate case as\n",
"\n",
"$$h_\\theta(x) = ax + b$$\n",
"\n",
"Our simple example above was an example of \"univariate regression\" - i.e. just one variable (or \"feature\") - number of hours studied. Below we will have more than one feature (\"multivariate regression\") which is given by\n",
"\n",
"$$h_\\theta(\\mathbf{x}) = \\mathbf{a}^\\top \\mathbf{X}$$\n",
"\n",
"Here $\\mathbf{a}$ is a vector of learned parameters, and $\\mathbf{X}$ is the \"design matrix\" with all the data points. In this formulation the intercept term has been added to the design matrix as the first column (of all ones)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Design Matrix\n",
"\n",
"In general with $n$ data points and $p$ features our design matrix will have $n$ rows and $p$ columns.\n",
"\n",
"Returning to our exam score regression example, let's add one more feature - number of hours slept the night before the exam. If we have 4 data points and 2 features, then our matrix will be of shape $4 \\times 3$ (remember we add a bias column). It might look like\n",
"\n",
"$$\n",
"\\begin{bmatrix}\n",
" 1 & 1 & 8 \\\\\n",
" 1 & 5 & 6 \\\\\n",
" 1 & 7 & 6 \\\\\n",
" 1 & 8 & 4 \\\\\n",
"\\end{bmatrix}\n",
"$$\n",
"\n",
"Notice we do **not** include the response (label/target) in the design matrix."
]
},
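{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"As a quick sketch, here is how the $4 \\times 3$ design matrix above could be assembled with NumPy. The feature values are the illustrative numbers from the example, not real data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# A minimal sketch: assembling the example design matrix with NumPy.\n",
"# The feature values below are the illustrative numbers from the text.\n",
"num_hours_studied_ex = np.array([1, 5, 7, 8])\n",
"num_hours_slept_ex = np.array([8, 6, 6, 4])\n",
"X_design = np.column_stack([\n",
"    np.ones_like(num_hours_studied_ex),  # bias column of ones\n",
"    num_hours_studied_ex,\n",
"    num_hours_slept_ex,\n",
"])\n",
"print(X_design)\n",
"print(X_design.shape)  # (4, 3): n = 4 rows, p + 1 = 3 columns"
]
},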
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Univariate Example\n",
"\n",
"Let's now see what our univariate example looks like"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn import linear_model\n",
"\n",
"# Fit the model\n",
"exam_model = linear_model.LinearRegression(normalize=True)\n",
"x = np.expand_dims(num_hours_studied, 1)\n",
"y = exam_score\n",
"exam_model.fit(x, y)\n",
"a = exam_model.coef_\n",
"b = exam_model.intercept_\n",
"print(exam_model.coef_)\n",
"print(exam_model.intercept_)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# Visualize the results\n",
"plt.scatter(num_hours_studied, exam_score)\n",
"x = np.linspace(0, 10)\n",
"y = a * x + b\n",
"plt.plot(x, y, 'r')\n",
"plt.xlabel('num_hours_studied')\n",
"plt.ylabel('exam_score')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The line fits pretty well using the eye, as it should, because the true function is linear and the data has just a little noise."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"But we need a mathematical way to define a good fit in order to find the optimal parameters for our hypothesis."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### What is a Good Fit?\n",
"\n",
"Typically we use \"mean squared error\" to measure the goodness of fit in a regression problem.\n",
"\n",
"$$\\text{MSE} = \\frac{1}{n} \\sum_{i=1}^n (y^{(i)} - h_\\theta^{(i)})^2$$\n",
"\n",
"You can see that this is measuring how far away each of the real data points are from our predicted point which makes good sense. Here is a visualization\n",
"\n",
"\n",
"\n",
"\n",
"This function is then taken to be our \"loss\" function - a measure of how badly we are doing. In general we want to minimize this."
]
},
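{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"As a quick check, we can compute the MSE of the line we fitted earlier. This is a minimal sketch assuming `exam_model`, `num_hours_studied` and `exam_score` from above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# A minimal sketch: the MSE of the line fitted above, computed two ways.\n",
"from sklearn.metrics import mean_squared_error\n",
"\n",
"preds = exam_model.predict(np.expand_dims(num_hours_studied, 1))\n",
"print(np.mean((exam_score - preds) ** 2))  # MSE by the formula\n",
"print(mean_squared_error(exam_score, preds))  # the same value via scikit-learn"
]
},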
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Optimization Problem\n",
"\n",
"The typical recipe for machine learning algorithms is to define a loss function of the parameters of a hypothesis, then to minimize the loss function. In our case we have the optimization problem\n",
"\n",
"$$\\min_{\\mathbf{a}} \\frac{1}{n} \\sum_{i=1}^n (y^{(i)} - \\mathbf{a}^\\top\\mathbf{X}^{(i)})^2$$\n",
"\n",
"Note that we have added the bias into the design matrix."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Normal Equations\n",
"\n",
"Linear regression actually has a closed-form solution - the normal equation. It is beyond our scope to show the derivation, but here it is:\n",
"\n",
"$$\\mathbf{a}^* = (\\mathbf{X}^\\top\\mathbf{X})^{-1}\\mathbf{X}^\\top \\mathbf{y}$$\n",
"\n",
"We won't be implementing this equation, but you should know this is what `sklearn.linear_model.LinearRegression` is doing under the hood. We will talk more about optimization in later tutorials, where we have no closed-form solution."
]
},
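{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Although we won't implement it in earnest, the normal equation is short enough to sketch with NumPy. Assuming `num_hours_studied` and `exam_score` from earlier, the result should match the coefficients scikit-learn found above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# A minimal sketch of the normal equation on the exam data from above.\n",
"# (In practice np.linalg.lstsq is more numerically stable than an explicit inverse.)\n",
"X_design = np.column_stack([np.ones_like(num_hours_studied), num_hours_studied])\n",
"a_star = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ exam_score\n",
"print(a_star)  # [intercept, slope] - should match sklearn's values above"
]
},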
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Normalization\n",
"\n",
"It is generally a good idea to normalize all the values in the design matrix. This means all values should be in the range $(0, 1)$ and centered around zero.\n",
"\n",
"\n",
"\n",
"Normalization usually helps the learning algorithm perform better."
]
},
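{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Here is a minimal sketch of standardization (zero mean, unit variance) using `sklearn.preprocessing.StandardScaler`, applied to the exam feature from above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# A minimal sketch: standardizing a feature to zero mean and unit variance.\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"scaler = StandardScaler()\n",
"x_scaled = scaler.fit_transform(np.expand_dims(num_hours_studied, 1).astype(float))\n",
"print(x_scaled.mean(), x_scaled.std())  # approximately 0 and 1"
]
},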
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Example 1\n",
"\n",
"Simple Linear Regression"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Importing the libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Importing the dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"dataset = pd.read_csv('../../assets/data/Salary_Data.csv')\n",
"X = dataset.iloc[:, :-1].values\n",
"y = dataset.iloc[:, -1].values"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Splitting the dataset into the Training set and Test set"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Training the Simple Linear Regression model on the Training set"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"\n",
"regressor = LinearRegression()\n",
"regressor.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Predicting the Test set results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"y_pred = regressor.predict(X_test)"
]
},
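{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"To put a number on the fit, we can score the test-set predictions. A small sketch using scikit-learn's built-in metrics:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# A small sketch: scoring the test-set predictions.\n",
"from sklearn.metrics import mean_squared_error, r2_score\n",
"\n",
"print(mean_squared_error(y_test, y_pred))\n",
"print(r2_score(y_test, y_pred))  # 1.0 would be a perfect fit"
]
},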
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Visualising the Training set results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"plt.scatter(X_train, y_train, color = 'red')\n",
"plt.plot(X_train, regressor.predict(X_train), color = 'blue')\n",
"plt.title('Salary vs Experience (Training set)')\n",
"plt.xlabel('Years of Experience')\n",
"plt.ylabel('Salary')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Visualising the Test set results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"plt.scatter(X_test, y_test, color = 'red')\n",
"plt.plot(X_train, regressor.predict(X_train), color = 'blue')\n",
"plt.title('Salary vs Experience (Test set)')\n",
"plt.xlabel('Years of Experience')\n",
"plt.ylabel('Salary')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Example 2\n",
"\n",
"Multiple Linear Regression"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Importing the libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Importing the dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"dataset = pd.read_csv('../../assets/data/50_Startups.csv')\n",
"X = dataset.iloc[:, :-1].values\n",
"y = dataset.iloc[:, -1].values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"print(X)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"print(y)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Encoding categorical data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')\n",
"X = np.array(ct.fit_transform(X))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"print(X)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Splitting the dataset into the Training set and Test set"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Training the Multiple Linear Regression model on the Training set"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"regressor = LinearRegression()\n",
"regressor.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Predicting the Test set results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"y_pred = regressor.predict(X_test)\n",
"np.set_printoptions(precision=2)\n",
"print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))"
]
},
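{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"As with the simple model, we can summarize how close these predictions are with the $R^2$ score. A small sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# A small sketch: summarizing the multiple-regression predictions with R^2.\n",
"from sklearn.metrics import r2_score\n",
"\n",
"print(r2_score(y_test, y_pred))"
]
},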
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Example 3\n",
"\n",
"Polynomial Linear Regression"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Importing the libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Importing the dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"dataset = pd.read_csv('../../assets/data/Position_Salaries.csv')\n",
"X = dataset.iloc[:, 1:-1].values\n",
"y = dataset.iloc[:, -1].values"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Training the Linear Regression model on the whole dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"lin_reg = LinearRegression()\n",
"lin_reg.fit(X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Training the Polynomial Regression model on the whole dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.preprocessing import PolynomialFeatures\n",
"poly_transformer = PolynomialFeatures(degree = 4)\n",
"X_poly = poly_transformer.fit_transform(X)\n",
"lin_reg_2 = LinearRegression()\n",
"lin_reg_2.fit(X_poly, y)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Visualising the Polynomial Linear Regression results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"plt.scatter(X, y, color = 'red')\n",
"plt.plot(X, lin_reg.predict(X), color = 'blue')\n",
"plt.title('Truth or Bluff (Linear Regression)')\n",
"plt.xlabel('Position Level')\n",
"plt.ylabel('Salary')\n",
"plt.show()"
]
},
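{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"We can also predict the salary for a single position level. A small sketch, assuming a hypothetical level of 6.5; note that the raw value must pass through the same polynomial feature transform before prediction."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# A small sketch: predicting for a single (hypothetical) position level of 6.5.\n",
"# The raw value must go through the same polynomial feature transform.\n",
"print(lin_reg.predict([[6.5]]))  # plain linear model\n",
"print(lin_reg_2.predict(poly_transformer.transform([[6.5]])))  # polynomial model"
]
},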
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Implementation of Linear Regression for scratch"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"class MyOwnLinearRegression:\n",
" def __init__(self, learning_rate=0.0001, n_iters=30000):\n",
" self.lr = learning_rate\n",
" self.n_iters = n_iters\n",
" self.weights = None\n",
" self.bias = None\n",
"\n",
" def fit(self, X, y):\n",
" n_samples, n_features = X.shape\n",
"\n",
" # init parameters\n",
" self.weights = np.zeros(n_features)\n",
" self.bias = 0\n",
"\n",
" # gradient descent\n",
" for _ in range(self.n_iters):\n",
" # approximate y with linear combination of weights and x, plus bias\n",
" y_predicted = np.dot(X, self.weights) + self.bias\n",
"\n",
" # compute gradients\n",
" dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))\n",
" db = (1 / n_samples) * np.sum(y_predicted - y)\n",
" # update parameters\n",
" self.weights = self.weights - self.lr * dw\n",
" self.bias = self.bias - self.lr * db\n",
"\n",
" def predict(self, X):\n",
" y_predicted = np.dot(X, self.weights) + self.bias\n",
" return y_predicted"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Traning model with our own Linear Regression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"dataset = pd.read_csv('../../assets/data/Salary_Data.csv')\n",
"X = dataset.iloc[:, :-1].values\n",
"y = dataset.iloc[:, -1].values\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)\n",
"regressor = MyOwnLinearRegression()\n",
"regressor.fit(X_train, y_train)\n",
"y_pred = regressor.predict(X_test)"
]
},
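{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"As a sanity check, we can compare the parameters our gradient-descent model learned with those found by scikit-learn's closed-form solution. A small sketch; the values may differ slightly, since gradient descent only approximates the exact solution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# A small sketch: comparing our gradient-descent fit with scikit-learn's.\n",
"# Values may differ slightly - gradient descent only approximates the exact solution.\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"sk_regressor = LinearRegression().fit(X_train, y_train)\n",
"print(regressor.weights, regressor.bias)  # our model\n",
"print(sk_regressor.coef_, sk_regressor.intercept_)  # scikit-learn"
]
},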
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Visualising the Training set results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"plt.scatter(X_train, y_train, color = 'red')\n",
"plt.plot(X_train, regressor.predict(X_train), color = 'blue')\n",
"plt.title('Salary vs Experience (Training set)')\n",
"plt.xlabel('Years of Experience')\n",
"plt.ylabel('Salary')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Acknowledgments\n",
"\n",
"Thanks to TIM NIVEN for creating the open-source [Kaggle jupyter notebook](https://www.kaggle.com/code/timniven/linear-regression-tutorial), licensed under Apache 2.0. It inspires the majority of the content of this assignment."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}