{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Yearly Global Average CO2 Concentrations in parts per million (ppm) and Linear Regression\n", "\n", "AIM: To practice linear regression in Python.\n", "\n", "DATA: Yearly averages of the monthly mean carbon dioxide concentration, globally averaged over marine surface sites, for the span 1980-2020.\n", "\n", "Data Source: National Oceanic and Atmospheric Administration (NOAA)\n", "https://gml.noaa.gov/ccgg/trends/global.html" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Import the necessary Python libraries\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import math\n", "from sklearn import metrics\n", "import sklearn\n", "import statistics" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Yearly Global Average CO2 Concentrations in parts per million (ppm) and Linear Regression\n" ] } ], "source": [ "print(\"Yearly Global Average CO2 Concentrations in parts per million (ppm) and Linear Regression\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " year average_co2_concentrations\n", "0 1980 338.911667\n", "1 1981 340.105000\n", "2 1982 340.856667\n", "3 1983 342.530833\n", "4 1984 344.074167\n", "5 1985 345.544167\n", "6 1986 346.965833\n", "7 1987 348.674167\n", "8 1988 351.159167\n", "9 1989 352.782500\n", "(42, 2)\n", "['year' 'average_co2_concentrations']\n", "\n", "RangeIndex: 42 entries, 0 to 41\n", "Data columns (total 2 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 year 42 non-null int64 \n", " 1 average_co2_concentrations 42 non-null float64\n", "dtypes: float64(1), int64(1)\n", "memory usage: 800.0 bytes\n" ] } ], "source": [ "# Read the dataset\n", 
"df=pd.read_csv('global-atm-co2.csv')\n", "\n", "#Know the basics of the dataset\n", "print (df.head(10)) # display first 10 entries\n", "print(df.shape) # display the dimensions of the dataset (rows and columns)\n", "print(df.columns.values) #display columns names\n", "df.info() # display data types and memory usage" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "#Scatter plot : Plot the scatter plot of yearly average_co2_concentrations variable \n", "\n", "df.plot.scatter(x=\"year\",y=\"average_co2_concentrations\")\n", "plt.xlabel('Year') \n", "plt.ylabel('Global Average CO2 Concentrations (ppm)') \n", "plt.title ('Yearly Global Average CO2 Concentrations in parts per million (ppm)')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear Regression\n", "Let us try to fit a Line to the data. \n", "Equation of a line is y = b0 + b1*x, where b0 is Y-intercept and b1 is the slope." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Use NumPy library to convert the DataFrame to NumPy Array which would be used in the further steps. \n", "x=[]\n", "y=[]\n", "x=df['year'].to_numpy()\n", "y=df['average_co2_concentrations'].to_numpy()\n", "n = np.size(x) # number of observations/points " ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Estimated coefficients of the line y = b0 + b1*x are:\n", "b0 = -3291.4041041406904 \n", "b1 = 1.8315106823366656\n", "RMSE VALUE is 2.1644921537661395\n", "Normalized RMSE VALUE is 0.005810202214397372\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ " # Function: Calculate Regression Coefficients : b0 is Y-intercept and b1 is slope for a Regression Line b0 + b1*x \n", "def estimate_coef(x, y): \n", " \n", " # mean of x and y vector \n", " m_x, m_y = np.mean(x), np.mean(y) \n", " \n", " # calculating cross-deviation and deviation about x \n", " SS_xy = np.sum(y*x) - n*m_y*m_x \n", " SS_xx = np.sum(x*x) - n*m_x*m_x \n", " \n", " \n", " b_1 = SS_xy / SS_xx \n", " b_0 = m_y - b_1*m_x \n", " \n", " return(b_0, b_1) \n", " \n", "# Function: Plot the scatter plot and Regression Line as per the predicted coefficients\n", "def plot_regression_line(x, y, b): \n", " # plotting the actual points as scatter plot \n", " plt.scatter(x, y, color = \"m\", \n", " marker = \"o\", s = 30) \n", " \n", " # predicted response vector \n", " y_pred = b[0] + b[1]*x \n", " \n", " # plot the regression line \n", " plt.plot(x, y_pred, color = \"g\") \n", " \n", " # prepare and render the scatter plot \n", " plt.xlabel('Year') \n", " plt.ylabel('Global Average CO2 Concentrations (ppm)') \n", " plt.title ('Yearly Global Average CO2 Concentrations in parts per million (ppm) and Linear Regression') \n", " plt.show() \n", "\n", "# Function: Calculate RMSE (Root Mean-Squared Error values) \n", "def rmse(b,y):\n", " predict=[]\n", " for i in range(0,n):\n", " predict.append(b[0]+b[1]*x[i])\n", " predict=np.array(predict) \n", " mse = sklearn.metrics.mean_squared_error(y, predict)\n", " root_mse = math.sqrt(mse) # RMSE value\n", " nrmse = root_mse/statistics.mean(y) # Normalized RMSE value\n", " return(root_mse,nrmse)\n", "\n", "# Function: Call the functions in a particular order\n", "def main(x,y): \n", " # Estimate Regression Coefficients \n", " b = estimate_coef(x, y) \n", " print(\"Estimated coefficients of the line y = b0 + b1*x are:\\nb0 = {} \\nb1 = {}\".format(b[0], b[1])) \n", " \n", " # Plot regression line \n", " residual_error = 
rmse(b,y)\n", " print(\"RMSE VALUE is\",residual_error[0])\n", " print(\"Normalized RMSE VALUE is\",residual_error[1])\n", " plot_regression_line(x, y, b)\n", "\n", "#Call the main function \n", "if __name__ == \"__main__\": \n", " main(x,y)\n", "\n", "#EoF\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Root Mean Square Error,RMSE, is the standard deviation of the residuals (prediction errors).\n", "Residuals are a measure of how far from the regression line data points are.\n", "RMSE is a measure of how spread out these residuals are. It tells us how concentrated the data is around the line of best fit." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }