{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> All content here is under a Creative Commons Attribution [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) and all source code is released under a [BSD-2 clause license](https://en.wikipedia.org/wiki/BSD_licenses).\n", ">\n", ">Please reuse, remix, revise, and [reshare this content](https://github.com/kgdunn/python-basic-notebooks) in any way, keeping this notice." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Course overview\n", "\n", "This is the fifth module of several (11, 12, 13, 14, 15 and 16), which refocuses the course material in the [prior 10 modules](https://github.com/kgdunn/python-basic-notebooks) in a slightly different way. It places more emphasis on\n", "\n", "* dealing with data: importing, merging, filtering;\n", "* calculations from the data;\n", "* visualization of it.\n", "\n", "In short: ***how to extract value from your data***.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Module 15 Overview\n", "\n", "In this module we will cover\n", "\n", "* Visualization of a correlation matrix with a heat map\n", "* Fitting a linear regression model to the data\n", "* Visualization of the linear regression model\n", "* Accessing data from your data frame using `.iloc`\n", "\n", "**Requirements before starting**\n", "\n", "* Have your Python installation working as you had for modules 11 to 14, including the Pandas library installed. \n", "\n", "* An extra requirement: install the `scikit-learn` and `seaborn` libraries. See instructions below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Keeping Conda up to date and installing new packages\n", "\n", "\n", "Newer versions of packages are released frequently. You can update your packages (libraries), with these commands. Do this at the command line (not in Jupyter notebook!):\n", "```bash\n", "\n", " conda update -n base conda\n", " conda update --all\n", "```\n", "\n", "### Installing a new package in your virtual environment\n", "\n", "You will come across people recommending different packages in Python for all sorts of interesting applications. For example, the library `seaborn` is often recommended for visualization. But you might not have it installed yet. In this module we will use `seaborn` and also `scikit-learn`.\n", "\n", "This is how you can install the `seaborn` and `scikit-learn` packages in your virtual environment called ``myenv``:\n", "```bash\n", " conda activate myenv <--- change the last word in the command to the name of your actual environment\n", " conda install seaborn scikit-learn\n", "```\n", "\n", "Or in one line:\n", "```bash\n", " conda install -n myenv seaborn scikit-learn\n", "```\n", "\n", "\n", "### Updating an existing package\n", "\n", "Similar to the above, you can update a package to the latest version. Just change ``install`` to ``update`` instead.\n", "Or in one line:\n", "```bash\n", " conda update -n myenv seaborn scikit-learn\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building regression models\n", "\n", "In the [prior module](https://yint.org/pybasic14) you learned about setting the date and time when importing data, visualizing your data with box plots, histograms, line or time-series plots, and scatter plots. 
You applied these to your own data, and learned about the very powerful ``groupby`` function in Pandas.\n", "\n", "In this module we will take these skills a step further, but first we will learn about fitting regression models to some data.\n", "\n", "Start by importing Pandas, and also a tool to build regression models from the `scikit-learn` library, which is imported as `sklearn`. You can read more about scikit-learn at their website: https://scikit-learn.org/stable/" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.linear_model import LinearRegression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use a data set from a [distillation column](https://openmv.net/info/distillation-tower), and predict an important output variable, called the Reid Vapour Pressure (RVP).\n", "\n", "Read in the data set and set the column called \"Case\" to be the index:\n", "```python\n", "distill = pd.read_csv(\"https://openmv.net/file/distillation-tower.csv\", index_col=\"Case\")\n", "```\n", "\n", "In the [prior module](https://yint.org/pybasic14) you were asked to use your own data and:\n", "\n", "1. Calculate the correlation matrix of values and display that. Were you able to do so?\n", "2. Could you also visualize a scatter plot matrix of these values with the \"kde\" on the diagonal, squares for the markers and an alpha value of 0.8 for the points?\n", "\n", "We will do this interactively below, but also introduce a new plot, the *heat map*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing the correlation matrix\n", "\n", "The correlation matrix of numbers can be tedious to read on a screen. You can easily visualize it though:\n", "\n", "```python\n", "import seaborn as sns\n", "display(distill.corr())\n", "\n", "# Let's visualize it instead, as a heat map.\n", "sns.set(rc={'figure.figsize': (15, 15)})\n", "sns.heatmap(distill.corr());\n", "\n", "# This is not so attractive. Use a different colour map (cmap):\n", "cmap = sns.diverging_palette(220, 10, as_cmap=True)\n", "sns.heatmap(distill.corr(), cmap=cmap, square=True, linewidths=0.2, cbar_kws={\"shrink\": 0.5});\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing the scatter plot matrix\n", "\n", "We have seen scatter plots before, as well as the scatter plot matrix:\n", "\n", "```python\n", "from pandas.plotting import scatter_matrix\n", "scatter_matrix(distill, alpha=0.2, figsize=(15, 15), diagonal=\"kde\");\n", "```\n", "\n", "The data set is quite big and takes some time to generate all the scatter plot combinations."
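, "\n", "Seaborn also offers this plot through its `pairplot` function, if you prefer its styling. A minimal sketch (the `plot_kws` argument passes the marker transparency through to the scatter plots; this can also be slow on the full data set):\n", "\n", "```python\n", "import seaborn as sns\n", "sns.pairplot(distill, diag_kind=\"kde\", plot_kws={\"alpha\": 0.2});\n", "```"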
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use every third row instead.\n", "```python\n", "print(distill.shape)\n", "subset = distill.iloc[0::3, :]\n", "subset.shape\n", "```\n", "\n", "The `.iloc` function accesses the data by `index` (the `i` in `iloc`) and for a given `loc`ation, so `iloc`= *index location*.\n", "\n", "Some examples:\n", "* `mydata.iloc[0:10, :]` will return the 10 rows, and all columns\n", "* `mydata.iloc[20, 2:4]` will return only row \\_\\_\\_, and columns \\_\\_\\_\n", "* `mydata.iloc[0:10:2, :]` will return only rows with index \\_\\_\\_; and \\_\\_\\_ columns\n", "* `mydata.iloc[0::2, :]` will return \\_\\_\\_ row; and \\_\\_\\_ columns\n", "* `mydata.iloc[:, -1]` will return \\_\\_\\_ rows of the \\_\\_\\_ column\n", "\n", "Try some examples of using `.iloc` below on the `distill` data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that you understand what `.iloc` is doing, you can understand why this code is faster, because it uses half the data set to create the scatter plot matrix:\n", "```python\n", "scatter_matrix(distill.iloc[0::2,:], alpha = 0.2, figsize=(15, 15), diagonal = \"kde\");\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which 2 columns are the most correlated with the outcome variable called \"VapourPressure\"?\n", "\n", "```python\n", "distill.corr().shape\n", "distill.corr().iloc[...] # <-- fill in some code to show only the last row of the correlation matrix\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us build a regression model using the `InvTemp3` measurement, the inverse temperature measured at position 3 in the distillation column, to predict the `VapourPressure`.\n", "\n", "There are 253 measurements (rows) in the dataset. We will use the first 150 rows in the data set to build the model, and then use the remaining rows to test the model, to see how well we can predict vapour pressure. This is good statistical practice. Do not use all the data to build the prediction model; you will get an inflated sense of how well the model works. **Always keep some testing for validation/verification.**\n", "\n", "Create the model building data set from the first 150 rows:\n", "```python\n", "build = distill.iloc[___] # <-- what goes here?\n", "display(build)\n", "```\n", "\n", "Then create test testing data from the rest of the data frame:\n", "```python\n", "test = distill.iloc[___] # <-- what goes here?\n", "display(test)\n", "```\n", "\n", "Try it below, and use the `.shape` command to verify the `build`ing and `test`ing data have the correct shape." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we set up an instance of the linear regression model:\n", "```python\n", "mymodel = LinearRegression()\n", "type(mymodel)\n", "```\n", "\n", "The `mymodel` is an object. It is an object of a linear regression model, but it is empty at the moment. 
We will give it some training data to build the model:\n", "\n", "```python\n", "mymodel = LinearRegression()\n", "mymodel.fit(X, y)\n", "```\n", "\n", "but we first have to tell it what `X` and `y` are. So we take a small detour..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need numeric values for `X` and `y`. Scikit-learn requires the `X` data (the values used to predict `y`) to be a column vector or a matrix.\n", "\n", "Notice that a column vector is just a matrix with 1 column. This is because, as you will see later, you can use 1 or more columns to predict `y`. Therefore every input used to predict `y` must be in a column. Each row in the input matrix is one observation.\n", "\n", "There is a shortcut to force Pandas to return a single column as a one-column data frame, and not as a Series:\n", "\n", "```python\n", "print(build[\"InvTemp3\"].shape)\n", "print(build[[\"InvTemp3\"]].shape)\n", "```\n", "\n", "So this will work to build your regression model:\n", "\n", "```python\n", "# A single column in matrix X (capital X indicates one or more input columns)\n", "X = build[[\"InvTemp3\"]].values\n", "y = build[\"VapourPressure\"].values\n", "mymodel = LinearRegression()\n", "mymodel.fit(X, y)\n", "```\n", "\n", "If you run this code and see no error messages, then the model has been built. But it is not that exciting, ... yet." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So what can you do with this model? Use\n", "```python\n", "dir(mymodel)\n", "```\n", "to ask Python what can be done with that `object`. Note that the ``dir(...)`` function works on any object and is something that you will use regularly.\n", "\n", "There are several interesting *attributes and methods* there which we will get to use:\n", "\n", "* `coef_`\n", "* `intercept_`\n", "* `predict`\n", "* `score`\n", "\n", "The first two are, as you might guess, the coefficient (slope) and the intercept of the model.\n", "\n", "```python\n", "print(f\"The intercept is {mymodel.intercept_} and the slope is {mymodel.coef_}\")\n", "```\n", "\n", "It is not so handy having all those decimal places. Python allows you to limit them to a desired number:\n", "\n", "```python\n", "print(f\"The intercept is {mymodel.intercept_:.5g} and the slope is {mymodel.coef_}\")\n", "```\n", "\n", "We have to be a bit more careful with the slope. It is an array (*see the square brackets?*), so we need to extract the first entry from that vector before displaying it:\n", "```python\n", "print(f\"The intercept is {mymodel.intercept_:.5g} and the slope is {mymodel.coef_[0]:.5g}\")\n", "```\n", "\n", "Try this as well:\n", "```python\n", "print(f\"The intercept is {mymodel.intercept_:.5f} and the slope is {mymodel.coef_[0]:.5f}\")\n", "```\n", "There is a subtle difference between the `f` and the `g` format specifiers: `.5f` shows exactly 5 digits after the decimal point, while `.5g` shows 5 significant digits in total." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Detour: Visualization of the regression model\n", "\n", "After building the regression model it is helpful to visualize it. 
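One way is to do it manually with Matplotlib: draw the raw data as a scatter plot and overlay the model's predictions. A minimal sketch (assuming `mymodel`, `X` and `y` from the cells above):\n", "\n", "```python\n", "import matplotlib.pyplot as plt\n", "plt.scatter(X, y, alpha=0.4)                  # the raw data\n", "plt.plot(X, mymodel.predict(X), color=\"red\")  # the fitted straight line\n", "plt.xlabel(\"InvTemp3\")\n", "plt.ylabel(\"VapourPressure\");\n", "```\n", "\n", "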
The Seaborn library has a useful function that does all of this in one line:\n", "\n", "```python\n", "import seaborn as sns\n", "ax = sns.regplot(x=\"InvTemp3\", y=\"VapourPressure\", data=distill)\n", "ax.grid(True)\n", "```\n", "\n", "Take a look at the documentation for the `regplot` function for more options: https://seaborn.pydata.org/generated/seaborn.regplot.html\n", "\n", "The simple call above already does most of what you would expect:\n", "* it draws a scatter plot of the raw data\n", "* adds the regression line to the plot\n", "* shows the confidence interval for the regression (the interval expected for the true but unknown slope)\n", "* adds labels to the axes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An \"upgrade\" you might be interested in is the joint plot, which adds histograms to the axes:\n", "\n", "```python\n", "sns.jointplot(x=\"InvTemp3\", y=\"VapourPressure\", data=distill, kind=\"reg\");\n", "\n", "# Or show the kernel density estimate (kde) instead:\n", "sns.jointplot(x=\"InvTemp3\", y=\"VapourPressure\", data=distill, kind=\"kde\");\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How well the regression model works: training (building) data\n", "\n", "Next, we would like to get some idea of how the model performs. For that we can look at\n", "1. the predictions on the `build`ing data,\n", "2. the predictions on the `test`ing data. *This is a more accurate estimate.*\n", "\n", "Firstly, for the model building data:\n", "```python\n", "# Get the predicted values from the data used to build the model\n", "X_build = build[[\"InvTemp3\"]]\n", "y_build = build[\"VapourPressure\"]\n", "\n", "prediction_build = mymodel.predict(X_build)\n", "errors_build = y_build - prediction_build  # error = actual minus predicted\n", "\n", "# There are several ways to see \"how good\" the model is, but the average\n", "# of the absolute values of the errors gives a good feeling. Smaller is better.\n", "avg_absolute_error = pd.Series(errors_build).abs().mean()\n", "```\n", "\n", "1. Calculate this average absolute error below.\n", "2. Also calculate the standard deviation of the errors (this is another way to judge the model). Smaller is better.\n", "3. Lastly, plot the prediction errors for the building data (first 150 rows) to see what time-based trends there are." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How well the regression model works: testing data\n", "\n", "The above gives an idea of how the model works on the data used to build it. But of course, the idea is to use the model in the future, on data not seen before. So let's test the model on the remaining rows:\n", "\n", "```python\n", "# Create the testing data set\n", "test = distill.iloc[150:]\n", "X_test = test[[\"InvTemp3\"]]\n", "y_test = test[\"VapourPressure\"]\n", "```\n", "\n", "Then use the `predict(...)` function again, but on the testing data. 
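As a quick check, verify the shapes first (these counts follow from splitting the 253-row data set at row 150):\n", "```python\n", "print(X_test.shape, y_test.shape)  # expect (103, 1) and (103,)\n", "```\n", "\n", "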
Notice how simple scikit-learn makes this; just replace the input to the `predict` function with a different data frame:\n", "```python\n", "prediction_test = mymodel.predict(X_test)\n", "errors_test = y_test - prediction_test\n", "avg_absolute_error = pd.Series(errors_test).abs().mean()\n", "avg_absolute_error, errors_test.std()\n", "```\n", "\n", "1. Calculate the average absolute error below, but for the model testing data.\n", "2. Calculate the standard deviation of the prediction errors. Again, smaller is better.\n", "3. Lastly, plot the prediction errors for the testing data to see what time-based trends there are. Any concerns?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate the $R^2$ value\n", "\n", "Using the `score` method, we can get the $R^2$ value. The function needs two inputs:\n", "```python\n", "mymodel.score(X_build, y_build)\n", "```\n", "\n", "and you will get a value that shows how strongly the two variables are correlated. NOTE: the $R^2$ value is ***not a measure of prediction precision or accuracy***. It is only an estimate of the degree of correlation. High correlation is no guarantee of good prediction." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Improving the model: adding extra predictors\n", "\n", "A least squares model $y = b_0 + b_1x_1$ with an intercept $b_0$ and a single slope $b_1$ can be improved by adding a second (or more) predictor: $$ y = b_0 + b_1x_1 + b_2x_2 + \\ldots$$\n", "\n", "This is called a multiple linear regression (MLR) model. We will try to improve our model by adding an extra predictor, ``InvPressure1``:\n", "\n", "```python\n", "# Specify the predictors in a list:\n", "predictors = [\"InvTemp3\", \"InvPressure1\"]\n", "\n", "# Specify the training data:\n", "X_build_MLR = build[predictors]\n", "y_build = build[\"VapourPressure\"]\n", "\n", "# Fit the model\n", "full_model = LinearRegression()\n", "full_model.fit(X=X_build_MLR, y=y_build)\n", "\n", "# Print some stats:\n", "predict_MLR_build = full_model.predict(X_build_MLR)\n", "errors_MLR_build = y_build - predict_MLR_build\n", "avg_absolute_error_MLR_build = pd.Series(errors_MLR_build).abs().mean()\n", "print(f\"The average absolute error is {avg_absolute_error_MLR_build:.3f}\")\n", "pd.Series(errors_MLR_build).plot(grid=True, title=\"Error = Actual - Predicted\")\n", "```\n", "\n", "Notice the power here: you only have to change the first line to add a new predictor. The rest of the code is the same as before (just more generic variable names have been used)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Challenge: checking the MLR model on the testing data\n", "\n", "Try creating code for the testing data, which uses the 2 predictors.\n", "\n", "* Extract the values for ``X_test_MLR`` and ``y_test``.\n", "* Use the ``full_model.predict(...)`` function, but on the testing data.\n", "* As before, calculate the average absolute error and the standard deviation of the error. How does it compare to the prior model with a single predictor?\n", "* Plot the errors over time. Are they smaller than for the case where you had only 1 predictor?"
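, "\n", "\n", "If you get stuck, here is one possible sketch of the first two steps (assuming `full_model`, `test` and `predictors` exist from the cells above):\n", "\n", "```python\n", "# Extract the same two predictor columns from the testing rows:\n", "X_test_MLR = test[predictors]\n", "y_test = test[\"VapourPressure\"]\n", "\n", "# Predict with the fitted MLR model:\n", "prediction_test_MLR = full_model.predict(X_test_MLR)\n", "```"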
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" }, "toc": { "base_numbering": "1", "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "246.975px" }, "toc_section_display": true, "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }