{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> All content here is under a Creative Commons Attribution [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) and all source code is released under a [BSD-2 clause license](https://en.wikipedia.org/wiki/BSD_licenses).\n", ">\n", ">Please reuse, remix, revise, and [reshare this content](https://github.com/kgdunn/python-basic-notebooks) in any way, keeping this notice." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Course overview\n", "\n", "This is the fourth module of several (11, 12, 13, 14, 15 and 16), which refocuses the course material in the [prior 10 modules](https://github.com/kgdunn/python-basic-notebooks) in a slightly different way. It places more emphasis on\n", "\n", "* dealing with data: importing, merging, filtering;\n", "* calculations from the data;\n", "* visualization of it.\n", "\n", "In short: ***how to extract value from your data***.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Module 14 Overview\n", "\n", "In this module we will cover\n", "\n", "* Setting date and time stamps\n", "* More plots with Pandas\n", "* Filtering and grouping data\n", "\n", "**Requirements before starting**\n", "\n", "* Have your Python installation working as you had for modules 11, 12 and 13, including the Pandas library installed." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# General import. To ensure that Plotly, and not matplotlib, is the default plotting engine\n", "import pandas as pd\n", "import plotly.graph_objects as go\n", "import plotly.io as pio\n", "pio.renderers.default = \"iframe\" # \"notebook\" # jupyterlab\n", "pd.options.plotting.backend = \"plotly\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Date and time processing\n", "\n", "In the [prior module](https://yint.org/pybasic13) you learned about box plots, histogram plot, time-series (or sequence) plots, and scatter plots. We will revise some of those, and build on that knowledge a bit further.\n", "\n", "Start with the data from an actual plant, where we have 5 columns of measurements from a [flotation cell](https://en.wikipedia.org/wiki/Froth_flotation). Read the link if you need a quick overview of what flotation is." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "flot = pd.read_csv(\"https://openmv.net/file/flotation-cell.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some things to do with a new data set called `df`:\n", "\n", "* `df.head()` and `df.tail()` to check you have the right data\n", "* `df.describe()` to get some basic statistics\n", "* `df.info()` to see the data types\n", "\n", "In the space below, apply these to the data you just read in:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next plot sequence plots of all data columns, using this command\n", "\n", "```python\n", "ax = flot.plot()\n", "```\n", "\n", "Notice that the x-axis is not time-based, even though there is a column in data frame called `\"Date and time\"`. So what went wrong?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When reading in a new data frame you might need to first:\n", "* force a column to be of the `type` of date and time, so Pandas can use it in the plots\n", "* set that column to be the index of your data frame.\n", "\n", "and then you can proceed with your plotting and data analysis.\n", "\n", "To set a column to the right type, you can use the ``pd.to_datetime(...)`` function. Many times Pandas will get it right, but if it doesn't you can give it some help.\n", "\n", "So try this first below. If it works, you are lucky, and can continue.\n", "```python\n", "flot[\"Timestamp\"] = pd.to_datetime(flot[\"Date and time\"])\n", "```\n", "\n", "Note that we created a new column. Check it with ``flot.info()`` again, to see if it is of the right type. You can of course simply overwrite your previous column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the conversion did not work, you could have given it some guidance.\n", "\n", "For example:\n", "```python\n", "pd.to_datetime(\"20/12/21\", yearfirst=True) # it is supposed to be 21 December 2020\n", "pd.to_datetime(\"20/12/21\", dayfirst=True) # it is supposed to be 20 December 2021\n", "pd.to_datetime(\"20/12/21\", format=\"%d/%m\", exact=False)\n", "```\n", "\n", "For the `format` specifier, you can see all the options available from this page: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you have the column correctly as a date and time stamp, you probably want this to be your data frame index.\n", "``` python\n", "flot=flot.set_index(\"Timestamp\")\n", "\n", "# and drop the original \"Date and time\" column, since we don't need it anymore\n", "flot.drop(columns=\"Date and time\", inplace=True)\n", "\n", "flot.plot()\n", "```\n", "\n", "Now you will see a short break in the data around 09:00 on 16 December 2004 which was not visible before." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Date and time processing: Check your understanding\n", "\n", "From the provided Excel file, read in the data. Convert the date and time column to the desired format:\n", "\n", "* assuming the date in the first row is in American format: June 01, 2018\n", "* assuming the date in the first row is in the usual format: 06 January 2018" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recap: boxplot\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A box plot can be shown per column in one simple line for a data frame `df`:\n", "\n", "```python\n", "df.plot.box()\n", "```\n", "\n", "Does it make sense to plot box plots for all columns, especially when units and orders of magnitude are so different?\n", "\n", "So now rather plot only the box plot for \"Upstream pH\":\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that there are so many outliers beyond the whiskers. What is going on? Look at the time-based plot of that column:\n", "\n", "```python\n", "df[\"name of column\"].plot.line()\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recap: histogram\n", "\n", "Similar to ``df.plot.line()`` and ``df.plot.box()`` to get a line and box plot, you can also use ``df.plot.hist()`` to get a histogram. \n", "\n", "But this tries to put all histograms in one plot, which is not so useful. But we will see a better way below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recap: line plot, or sequence plot (and learning about for-loops!)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you use this code, you will get all the line plots in the same plot:\n", "```python\n", "flot[\"Timestamp\"] = pd.to_datetime(flot[\"Date and time\"])\n", "flot=flot.set_index(\"Timestamp\")\n", "flot.plot()\n", "```\n", "\n", "But if you want each plot in its own axis instead, you need to use a loop to create multiple plots:\n", "```python\n", "print(flot.columns)\n", "for column in flot.columns:\n", " print(column)\n", " display(flot[column].plot())\n", "```\n", "\n", "Pandas can only plot columns of numeric data. If the column is non-numeric, it will create an error. So to ensure the loop only goes through numeric columns, you can filter on that. Change the first lines to \n", "\n", "```python\n", "flot[\"artificial column\"] = \"abc\"\n", "flot.head()\n", "for column in flot.select_dtypes(\"number\"):\n", " # add the loop content here, indented appropriately\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recap: correlation, and introducing the scatter plot matrix\n", "\n", "We saw the correlation matrix can be calculated with this handy one-liner:\n", "\n", "```python\n", "df.corr()\n", "```\n", "\n", "Do this below for the flotation data. Any interesting leads to investigate?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The scatter plot matrix is a visual tool to help create a scatter plot of each combination. The plot on the diagonal would not be an interesting scatter plot, so this is often replaced with a histogram or a kernel density estimate (kde) plot.\n", "\n", "Use the code below to try creating both types of plots on the diagonal:\n", "```python\n", "from pandas.plotting import scatter_matrix\n", "\n", "scatter_matrix(df, alpha = 0.2, figsize=(10, 8), diagonal = 'kde');\n", "scatter_matrix(df, alpha = 0.2, figsize=(10, 8), diagonal = 'hist');\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learn about filtering and grouping (Blender Efficiency dataset)\n", "\n", "\n", "Filtering and grouping data is part of the daily work of anyone working with data. The reason is because once you have filtered the data or grouped it, then you want to calculate some statistics or create a visualization on the result. So your workflow becomes:\n", "\n", ">1. Import all your data;\n", ">2. Filter or group to get a subset of the data;\n", ">3. Do calculations and create visualizations on the subset of the data.\n", "\n", "\n", "Some typical examples of filtering and grouping:\n", "\n", "* Filter out and keep only the data after 1 January 2018. Throw the rest away.\n", "* Extract only the rows in the data frame where vessel V145 was used, and ignore the rest.\n", "* Group the data by type of product (we have product name A, B, C and D). Do the same set of calculations/visualizations for each product.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [\"Blender Efficiency\" data set](https://openmv.net/info/blender-efficiency) is related to a set of designed experiments. There are 4 factors being changed to affect the blending efficiency: \n", "* `particle size`\n", "* `mixer diameter`\n", "* `mixer rotational speed`, and \n", "* `blending time`.\n", "\n", "[Last time](https://yint.org/pybasic13) we mentioned 6 steps in a data workflow:\n", "\n", "1. **Define the objective**: understanding which factors can be changed to improve blending efficiency.\n", "2. **Get your data**: in this example it is given to you at the web site address above.\n", "3. **Explore** your data: use tables and plots.\n", "4. **Clean** your data, if needed. In this case the data are pre-cleaned.\n", "5. **Calculations and models**: later we will see how to build a regression model from these data. For now we will calculate correlations and show the results for groups.\n", "6. **Communicate your result**: show the correlations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Step 2 and 3: get your data and explore it\n", "\n", "```python\n", "import pandas as pd\n", "blender = pd.read_csv('http://openmv.net/file/blender-efficiency.csv')\n", "\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tips to explore your data.\n", "\n", "* Sort the table by the outcome value (the `BlendingEfficiency` column). Values from low to high. Visually, in the table, which columns appear to be related to it?\n", "\n", "* Is a box plot useful?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now move on to calculations and other visualizations inspired by those calculations:\n", "\n", "* Create and display the numeric correlation matrix. Why columns are most correlated with the outcome variable?\n", "* Instead of plotting a scatter plot for each and every interesting correlation, rather plot a scatter plot matrix. What interesting features do you observe?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some more models/calculations: the particle size (discrete values at 2, 5 and 8) seem to have an interesting relationship to the outcome variable.\n", "\n", "Let's look at this a bit more. Start with the scatter plot of just these 2 variables:\n", "```python\n", "blender.plot.scatter(x=\"ParticleSize\", y=\"BlendingEfficiency\")\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will create a subset of the data set showing just the results when the particle size is \"2\":\n", "\n", "```python\n", "blender[\"ParticleSize\"] == 2\n", "```\n", "\n", "will create an 'indicator' variable with `True` values where the condition is met. We only want the rows where the condition is true. \n", "\n", "In [module 12](https://yint.org/pybasic12#Accessing-entries), in the sub-section on \"Accessing entries\", you saw how you can do this.\n", "```python\n", "blender[blender[\"ParticleSize\"] == 2]\n", "```\n", "\n", "returns just the 4 rows where this condition is true.\n", "\n", "* Try it below. \n", "* Also, add code to return only the rows when `ParticleSize` $\\leq 5$.\n", "* Change the filter to return rows when `ParticleSize` $> 5$. How many rows are that?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you can do interesting things on this subset. The subset is just a regular data frame, so you can plot them or do further calculations with them.\n", "\n", "```python\n", "blender[blender[\"ParticleSize\"] == 2].mean()\n", "```\n", "\n", "will calculate the average of only these rows. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, calculate the average of only the \"BlendingEfficiency\" column when particle size is 2, 5 and 8. In other words, calculate 3 averages.\n", "\n", "You probably end up with something like this:\n", "```python\n", "print(blender[blender[\"ParticleSize\"] == 2][\"BlendingEfficiency\"].mean())\n", "print(blender[blender[\"ParticleSize\"] == 5][\"BlendingEfficiency\"].mean())\n", "print(blender[blender[\"ParticleSize\"] == 8][\"BlendingEfficiency\"].mean())\n", "```\n", "\n", "Can it be done more cleanly? Perhaps you could do it in a loop?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``df.groupby()`` function in Panadas is a way to do that in a single line.\n", "\n", "```python\n", "blender.groupby(by=\"ParticleSize\").mean() # simplify it: leave out the \"by=\"\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now go wild. Try it with different types of functions:\n", "\n", "* ``blender.groupby(\"ParticleSize\").std()``\n", "* ``blender.groupby(\"ParticleSize\").max()``\n", "* ``blender.groupby(\"ParticleSize\").plot()`` # what do you think this does? Guess before testing it!\n", "* ``blender.groupby(\"ParticleSize\").plot.scatter(x=\"BlendingTime\", y=\"BlendingEfficiency\")``" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will find, that if you use the plotly backend for plotting, you will notice\n", "you don't get plots displayed if you use the code above, while with the `matplotlib` \n", "backend it will show plots.\n", "\n", "\n", "So, to use Plotly, you will need to call use `groupby` in a loop instead:\n", "\n", "```python\n", "import time\n", "for psize, subset in blender.groupby(by=\"ParticleSize\"):\n", " print(psize)\n", " display(subset)\n", " subset.plot()\n", " \n", " # Then add code here to do something with the \"subset\" plot.\n", " # For example, such as changing the axis titles, or figure size\n", " \n", " time.sleep(0.2) # pause for 200 milliseconds\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Keeping Conda up to date and installing new packages\n", "\n", "Newer versions of packages are released frequently. You can update your packages (libraries), with this command::\n", "```bash\n", "\n", " conda update -n base conda\n", " conda update --all\n", "```\n", "\n", "### Installing a new package in your virtual environment\n", "\n", "You will come across people recommending different packages in Python for all sorts of interesting applications. For example, the library `seaborn` is often recommended for visualization. But you might not have it installed yet. \n", "\n", "This is how you can install the package called `seaborn` in your virtual environment called ``myenv``:\n", "```bash\n", " conda activate myenv <--- change the last word in the command to the name of your actual environment\n", " conda install seaborn\n", "```\n", "\n", "Or in one line:\n", "```bash\n", " conda install -n myenv seaborn\n", "```\n", "\n", "\n", "### Updating an existing package\n", "\n", "Similar to the above, you can update a package to the latest version. Just change ``install`` to ``update`` instead.\n", "Or in one line:\n", "```bash\n", " conda update -n myenv seaborn\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Homework: Try the above steps again on a different data set\n", "\n", "There is another data set, about the taste of Cheddar cheese: https://openmv.net/info/cheddar-cheese\n", "\n", "Read the data set in:\n", "```python\n", "cheese = pd.read_csv(\"https://openmv.net/file/cheddar-cheese.csv\")\n", "```\n", "\n", "1. Drop the column called \"Case\"\n", "2. Calculate the correlation matrix of values and display that\n", "3. Plot a scatter plot matrix of these values:\n", " \n", " * with the \"kde\" on the diagonal\n", " * squares for the markers\n", " * alpha value of 0.8 for the points. \n", " \n", "*Hint*: look at the documentation for `scatter_matrix` to see how to do this. You can look at the documentation inside Jupyter in several ways:\n", "* ``help(scatter_matrix)``\n", "* ``scatter_matrix?`` and then hit Ctrl-Enter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Homework: Testing your knowledge on another data set\n", "\n", "The pulp digester is an industrial unit operating in the pulp and paper industry. \n", "You can find the data on this page: https://openmv.net/info/kamyr-digester\n", "\n", "Some things to try when exploring the data:\n", "\n", "* Using a set of histograms, one per variable, find 2 variables which have a bimodal distribution. Which 2 have a normal distribution?\n", "* For the 2 variables with a bimodal distribution: make a time-series (sequence) plot, to visualize what they look like when their data are plotted in sequence order. Do you now see why they have a bimodal histogram?\n", "* Plot time sequence plots of some of the other variables, including the output variable called `'Y-Kappa'`.\n", "* Which interesting correlations are there with this variable? Write Python code to find the 3 strongest positively correlated columns, and 3 strongest negatively correlated columns.\n", "* Create a new data frame with only these 6 columns of strongly correlating variables and add the `Y-Kappa` as the 7th column.\n", "* Create a scatter plot matrix for only this group of data. Use a `kde` for the diagonal plots.\n", "* If you needed to increase the Kappa number for this process, which variables would you change and in which direction?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "toc": { "base_numbering": "1", "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "348.984px" }, "toc_section_display": true, "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }