{ "cells": [ { "cell_type": "markdown", "id": "651019ae-1bd8-41b3-be75-474a841f1f8c", "metadata": {}, "source": [ "# Example of Multiple Linear Regression in Python\n", "https://datatofish.com/multiple-linear-regression-python/" ] }, { "cell_type": "markdown", "id": "de3b5be7-792e-4a94-8cd0-a17df627c419", "metadata": {}, "source": [ "## About Linear Regression\n", "Linear regression is a predictive model that assumes a linear relationship between the dependent variable (the variable we are trying to predict or estimate) and one or more independent variables (the input variables used in the prediction)." ] }, { "cell_type": "markdown", "id": "efd67629-b49f-47e2-9305-a7abb6f10366", "metadata": {}, "source": [ "## Example of Multiple Linear Regression in Python\n", "\n", "In the following example, we perform multiple linear regression for a fictitious economy, where `index_price` is the dependent variable and the two independent (input) variables are:\n", "\n", "* `interest_rate`\n", "* `unemployment_rate`\n", "\n", "Note that several assumptions must hold before you apply a linear regression model. Most notably, a linear relationship must exist between the dependent variable and each independent variable (see the Checking for Linearity section)." 
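, "\n", "\n", "As a sketch of what this example fits (the symbols below are notation introduced here for illustration, not taken from the original tutorial), write $y$ for `index_price`, $x_1$ for `interest_rate`, and $x_2$ for `unemployment_rate`. The model then assumes:\n", "\n", "$$y = \\beta_0 + \\beta_1 x_1 + \\beta_2 x_2 + \\varepsilon$$\n", "\n", "Here $\\beta_0$ is the intercept, $\\beta_1$ and $\\beta_2$ are the coefficients estimated from the data, and $\\varepsilon$ is the error term. Under the linearity assumption, a one-unit change in either input shifts the expected $y$ by a fixed amount (its coefficient) when the other input is held fixed."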
] }, { "cell_type": "code", "execution_count": 16, "id": "3dacd2dc-aa3e-40f3-a66d-252b7512bef7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " year month interest_rate unemployment_rate index_price\n", "0 2017 12 2.75 5.3 1464\n", "1 2017 11 2.50 5.3 1394\n", "2 2017 10 2.50 5.3 1357\n", "3 2017 9 2.50 5.3 1293\n", "4 2017 8 2.50 5.4 1256\n", "5 2017 7 2.50 5.6 1254\n", "6 2017 6 2.50 5.5 1234\n", "7 2017 5 2.25 5.5 1195\n", "8 2017 4 2.25 5.5 1159\n", "9 2017 3 2.25 5.6 1167\n", "10 2017 2 2.00 5.7 1130\n", "11 2017 1 2.00 5.9 1075\n", "12 2016 12 2.00 6.0 1047\n", "13 2016 11 1.75 5.9 965\n", "14 2016 10 1.75 5.8 943\n", "15 2016 9 1.75 6.1 958\n", "16 2016 8 1.75 6.2 971\n", "17 2016 7 1.75 6.1 949\n", "18 2016 6 1.75 6.1 884\n", "19 2016 5 1.75 6.1 866\n", "20 2016 4 1.75 5.9 876\n", "21 2016 3 1.75 6.2 822\n", "22 2016 2 1.75 6.2 704\n", "23 2016 1 1.75 6.1 719\n" ] } ], "source": [ "import pandas as pd\n", "\n", "data = {'year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],\n", " 'month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],\n", " 'interest_rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],\n", " 'unemployment_rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],\n", " 'index_price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719] \n", " }\n", "\n", "df = pd.DataFrame(data) \n", "\n", "print(df)" ] }, { "cell_type": "markdown", "id": "545f4540-581f-4fc9-89e8-02150f40cabd", "metadata": {}, "source": [ "## Checking for Linearity" ] }, { "cell_type": "code", "execution_count": 17, "id": "9a3bfbae-0877-442b-86fa-d477cf6153a2", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "plt.scatter(df['interest_rate'], df['index_price'], color='red')\n", "plt.title('Index Price Vs Interest Rate', fontsize=14)\n", "plt.xlabel('Interest Rate', fontsize=14)\n", "plt.ylabel('Index Price', fontsize=14)\n", "plt.grid(True)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "61835a39-e63e-430a-b5df-e614d96776d0", "metadata": {}, "source": [ "## Performing the Multiple Linear Regression" ] }, { "cell_type": "code", "execution_count": 18, "id": "4658f05a-3adc-44e4-a576-eed94de204d0", "metadata": {}, "outputs": [], "source": [ "x = df[['interest_rate','unemployment_rate']]\n", "y = df['index_price']" ] }, { "cell_type": "markdown", "id": "6b7443b5-1c4a-49ac-89e4-4d3bd29f584a", "metadata": {}, "source": [ "### The Python Code using Statsmodels" ] }, { "cell_type": "code", "execution_count": 19, "id": "9d2f8b51-ee14-4e3e-bd86-75343644d1e1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: index_price R-squared: 0.898\n", "Model: OLS Adj. R-squared: 0.888\n", "Method: Least Squares F-statistic: 92.07\n", "Date: Sun, 09 Jul 2023 Prob (F-statistic): 4.04e-11\n", "Time: 17:15:55 Log-Likelihood: -134.61\n", "No. 
Observations: 24 AIC: 275.2\n", "Df Residuals: 21 BIC: 278.8\n", "Df Model: 2 \n", "Covariance Type: nonrobust \n", "=====================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "-------------------------------------------------------------------------------------\n", "const 1798.4040 899.248 2.000 0.059 -71.685 3668.493\n", "interest_rate 345.5401 111.367 3.103 0.005 113.940 577.140\n", "unemployment_rate -250.1466 117.950 -2.121 0.046 -495.437 -4.856\n", "==============================================================================\n", "Omnibus: 2.691 Durbin-Watson: 0.530\n", "Prob(Omnibus): 0.260 Jarque-Bera (JB): 1.551\n", "Skew: -0.612 Prob(JB): 0.461\n", "Kurtosis: 3.226 Cond. No. 394.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n" ] } ], "source": [ "import statsmodels.api as sm\n", "\n", "# sm.OLS does not add an intercept by default, so add a 'const' column explicitly\n", "x = sm.add_constant(x)\n", " \n", "model = sm.OLS(y, x).fit()\n", "# In-sample fitted values for the training data\n", "predictions = model.predict(x) \n", " \n", "print_model = model.summary()\n", "print(print_model)" ] }, { "cell_type": "markdown", "id": "80f3744a-6b0f-4fef-bd77-587bdf7bbed0", "metadata": {}, "source": [ "### The Python Code using Sklearn" ] }, { "cell_type": "code", "execution_count": 20, "id": "3c6d8a8d-ba24-477a-bda4-c54e74e76be4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LinearRegression()" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import linear_model\n", "\n", "# Note: x still contains the 'const' column added by sm.add_constant above.\n", "# LinearRegression fits its own intercept, so it reports a coefficient of 0 for 'const'.\n", "regr = linear_model.LinearRegression()\n", "regr.fit(x, y)" ] }, { "cell_type": "code", "execution_count": 21, "id": "cd5a8b11-a49c-4fcd-bfdb-059b6b3e2ab5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Intercept: \n", " 1798.4039776258558\n", "Coefficients: \n", " [ 0. 
345.54008701 -250.14657137]\n" ] } ], "source": [ "print('Intercept: \\n', regr.intercept_)\n", "print('Coefficients: \\n', regr.coef_)" ] }, { "cell_type": "markdown", "id": "722a577e-9e89-4808-841e-0c9535b8fd99", "metadata": {}, "source": [ "### Conclusion\n", "\n", "Linear regression is widely used in machine learning. This notebook showed how to perform multiple linear regression in Python using both statsmodels and sklearn.\n", "\n", "Before applying a linear regression model, check that a linear relationship exists between the dependent variable (what you are trying to predict) and each independent variable (the inputs)." ] }, { "cell_type": "code", "execution_count": null, "id": "0b683e7d-d9ca-4ab2-87f9-c8c0d992ada3", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }