{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear and Polynomial Regression\n", "Aim: To practice Linear Regression and Polynomial Regression using Python code. The activity also helps in understanding the difference between Linear Regression and Polynomial Regression.\n", "\n", "Data: The dataset includes Annual Production-based Emissions of Carbon Dioxide (CO2) by China, measured in million tonnes per year, for the period 1902-2018.\n", "\n", "Data Source: Carbon Dioxide Information Analysis Center (CDIAC) and Global Carbon Project\n", "\n", "The code and the dataset are available for download.\n", "Developed by TROP ICSU (https://tropicsu.org).\n", "\n", "The complete Lesson Plan is available here: https://tropicsu.org/lesson-plan-data-science-linear-and-polynomial-regression/" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Import the necessary Python libraries\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import math\n", "from sklearn import metrics\n", "import sklearn\n", "import statistics" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Annual production-based emissions of carbon dioxide (CO2) by China\n" ] } ], "source": [ "print(\"Annual production-based emissions of carbon dioxide (CO2) by China\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First 10 entries from the dataset\n", " year co2\n", "0 1902 0.095\n", "1 1903 1.964\n", "2 1904 2.088\n", "3 1905 2.297\n", "4 1906 17.111\n", "5 1907 16.840\n", "6 1908 22.731\n", "7 1909 20.837\n", "8 1910 18.749\n", "9 1911 27.846\n", "(117, 2)\n", "['year' 'co2']\n", "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 117 entries, 0 to 116\n", "Data columns (total 2 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ 
-------------- ----- \n", " 0 year 117 non-null int64 \n", " 1 co2 117 non-null float64\n", "dtypes: float64(1), int64(1)\n", "memory usage: 2.0 KB\n" ] } ], "source": [ "# Read the dataset\n", "# Assumes china-co2-csv.csv is in the same folder as this notebook; adjust the path if it is stored elsewhere\n", "df = pd.read_csv('china-co2-csv.csv')\n", "# Know the basics of the dataset:\n", "print(\"First 10 entries from the dataset\")\n", "print(df.head(10)) # display first 10 entries\n", "print(df.shape) # display the dimensions of the dataset (rows and columns)\n", "print(df.columns.values) # display column names\n", "df.info() # display data types and memory usage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scatter Plot \n", "Let's plot the annual production-based CO2 emissions by China for 1902-2018." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Scatter plot of the annual CO2 emissions (column 'co2') against the year\n", "\n", "df.plot.scatter(x=\"year\",y=\"co2\")\n", "plt.xlabel('Year')\n", "plt.ylabel('CO2 emissions (million tonnes)')\n", "plt.title('Annual production-based CO2 emissions by China 1902-2018')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear Regression and Polynomial Regression\n", "We will apply Linear Regression and Polynomial Regression to the dataset. Comparing the two will help us understand which method describes the data better." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1 : Linear Regression\n", "Let us fit a line to the data. The equation of a line is y = b0 + b1*x, where b0 is the y-intercept and b1 is the slope." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Use NumPy to convert the DataFrame columns to NumPy arrays, which are used in the following steps. 
\n", "x = df['year'].to_numpy()\n", "y = df['co2'].to_numpy()\n", "n = np.size(x) # number of observations/points" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Function: Calculate Regression Coefficients: b0 is the y-intercept and b1 is the slope of the regression line y = b0 + b1*x\n", "def estimate_coef(x, y):\n", "    n = np.size(x)  # number of observations\n", "\n", "    # mean of x and y vector\n", "    m_x, m_y = np.mean(x), np.mean(y)\n", "\n", "    # calculating cross-deviation and deviation about x\n", "    SS_xy = np.sum(y*x) - n*m_y*m_x\n", "    SS_xx = np.sum(x*x) - n*m_x*m_x\n", "\n", "    # least-squares estimates of the slope and the intercept\n", "    b_1 = SS_xy / SS_xx\n", "    b_0 = m_y - b_1*m_x\n", "\n", "    return (b_0, b_1)\n", "\n", "# Function: Plot the scatter plot and the regression line for the estimated coefficients\n", "def plot_regression_line(x, y, b):\n", "    # plotting the actual points as scatter plot\n", "    plt.scatter(x, y, color=\"m\", marker=\"o\", s=30, label=\"Actual\")\n", "\n", "    # predicted response vector\n", "    y_pred = b[0] + b[1]*x\n", "\n", "    # plot the regression line\n", "    plt.plot(x, y_pred, color=\"g\", label=\"Linear Regression\")\n", "\n", "    # prepare and render the plot; the legend uses the labels set above\n", "    plt.xlabel('Year')\n", "    plt.ylabel('CO2 emissions (million tonnes)')\n", "    plt.title('Annual production-based CO2 emissions by China 1902-2018')\n", "    plt.legend(loc=\"lower right\")\n", "    plt.show()\n", "\n", "# Function: Calculate the RMSE (Root Mean Squared Error)\n", "def rmse(x, y, b):\n", "    predict = b[0] + b[1]*x  # predicted values on the fitted line\n", "    rmse_linear = np.sqrt(sklearn.metrics.mean_squared_error(y, predict))\n", "    return rmse_linear\n", "\n", "# Function: Call the functions in a particular order\n", "def main(x, y):\n", "    # Estimate Regression Coefficients\n", "    b = estimate_coef(x, y)\n", "    print(\"Estimated coefficients of the line y = b0 + b1*x are:\\nb0 = {} \\nb1 = {}\".format(b[0], b[1]))\n", "\n", "    # Check the Root Mean Squared Error\n", "    residual_error = rmse(x, y, b)\n", "    print(\"RMSE Value by using Linear Regression is=\", residual_error)\n", "\n", "    # Plot regression line\n", "    plot_regression_line(x, y, b)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Estimated coefficients of the line y = b0 + b1*x are:\n", "b0 = -127261.4302050189 \n", "b1 = 65.84592892895174\n", "RMSE Value by using Linear Regression is= 1705.0661924295737\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Call the main function for Linear Regression\n", "if __name__ == \"__main__\":\n", "    main(x,y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Polynomial Regression\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Polynomial Regression is used to fit a non-linear model to the data. In Polynomial Regression, the relationship between the independent variable x and the dependent variable y is modelled as an nth-degree polynomial in x." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Fitting Polynomial Regression to the dataset\n", "from sklearn.preprocessing import PolynomialFeatures\n", "from sklearn.linear_model import LinearRegression\n", "\n", "\n", "# Function: Visualizing the Polynomial Regression results\n", "# prepare and render the plot\n", "def viz_polynomial(X, Y, Y_poly, poly_degree):\n", "    plt.scatter(X, Y, color='red', label=\"Actual\")\n", "    plt.plot(X, Y_poly, color='blue', label=\"Polynomial Regression degree \" + str(poly_degree))\n", "    plt.xlabel('Year')\n", "    plt.ylabel('CO2 emissions (million tonnes)')\n", "    plt.title('Annual production-based CO2 emissions by China 1902-2018')\n", "    plt.legend(loc=\"upper left\")  # the legend uses the labels set above\n", "    plt.show()\n", "\n", "# Function: Calculating the Polynomial Regression values\n", "def poly_regression(degree):\n", "    X = df.iloc[:, 0:1].values  # Year column as a 2-D array of shape (n, 1)\n", "    poly_reg = PolynomialFeatures(degree)  # Generates the features 1, x, x^2, ..., x^degree\n", "    X_poly = poly_reg.fit_transform(X)  # Transform the data to polynomial features\n", "    lin_reg = LinearRegression()  # Ordinary least-squares linear model\n", "    lin_reg.fit(X_poly, y)  # Fit a linear regression to the transformed data\n", "    Y_poly = lin_reg.predict(X_poly)  # Predict values\n", "\n", "    # Check the Root Mean Squared Error\n", "    rmse_poly = np.sqrt(sklearn.metrics.mean_squared_error(y, Y_poly))\n", "    print(\"RMSE Value by using Polynomial Regression is=\", rmse_poly)\n", "\n", "    # Visualize the results\n", "    viz_polynomial(X, y, Y_poly, degree)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE Value by using Polynomial Regression is= 443.9556515605107\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Call the function for Polynomial Regression for a particular degree\n", "poly_regression(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Changing the degree of the Polynomial\n", "Try different values for the degree of the polynomial and compare the results both visually and by their RMSE values. For example, the linear fit above gives an RMSE of about 1705, while the degree-3 polynomial gives about 444." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }