{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": true
   },
   "source": [
    "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
    "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#Review-so-far\" data-toc-modified-id=\"Review-so-far-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Review so far</a></span></li><li><span><a href=\"#Data-set-import-and-basic-checks\" data-toc-modified-id=\"Data-set-import-and-basic-checks-2\"><span class=\"toc-item-num\">2&nbsp;&nbsp;</span>Data set import and basic checks</a></span></li><li><span><a href=\"#List-comprehensions\" data-toc-modified-id=\"List-comprehensions-3\"><span class=\"toc-item-num\">3&nbsp;&nbsp;</span>List comprehensions</a></span></li><li><span><a href=\"#Reading-only-certain-rows\" data-toc-modified-id=\"Reading-only-certain-rows-4\"><span class=\"toc-item-num\">4&nbsp;&nbsp;</span>Reading only certain rows</a></span></li><li><span><a href=\"#Import-specific-columns-only\" data-toc-modified-id=\"Import-specific-columns-only-5\"><span class=\"toc-item-num\">5&nbsp;&nbsp;</span>Import specific columns only</a></span></li><li><span><a href=\"#Dropping-columns-or-rows\" data-toc-modified-id=\"Dropping-columns-or-rows-6\"><span class=\"toc-item-num\">6&nbsp;&nbsp;</span>Dropping columns or rows</a></span></li><li><span><a href=\"#Setting-an-index\" data-toc-modified-id=\"Setting-an-index-7\"><span class=\"toc-item-num\">7&nbsp;&nbsp;</span>Setting an index</a></span></li><li><span><a href=\"#Visualizing-and-deleting-missing-values\" data-toc-modified-id=\"Visualizing-and-deleting-missing-values-8\"><span class=\"toc-item-num\">8&nbsp;&nbsp;</span>Visualizing and deleting missing values</a></span></li><li><span><a href=\"#Using-iloc-and-loc\" data-toc-modified-id=\"Using-iloc-and-loc-9\"><span class=\"toc-item-num\">9&nbsp;&nbsp;</span>Using iloc and loc</a></span></li><li><span><a href=\"#Dropping-missing-values,-specifying-a-threshold\" data-toc-modified-id=\"Dropping-missing-values,-specifying-a-threshold-10\"><span class=\"toc-item-num\">10&nbsp;&nbsp;</span>Dropping missing values, specifying a threshold</a></span></li><li><span><a href=\"#Filtering-rows\" data-toc-modified-id=\"Filtering-rows-11\"><span class=\"toc-item-num\">11&nbsp;&nbsp;</span>Filtering rows</a></span></li><li><span><a href=\"#Filtering-with-the-.query-function\" data-toc-modified-id=\"Filtering-with-the-.query-function-12\"><span class=\"toc-item-num\">12&nbsp;&nbsp;</span>Filtering with the <code>.query</code> function</a></span></li><li><span><a href=\"#New-data-set:-raw-material-properties\" data-toc-modified-id=\"New-data-set:-raw-material-properties-13\"><span class=\"toc-item-num\">13&nbsp;&nbsp;</span>New data set: raw material properties</a></span></li><li><span><a href=\"#Single-level-groupby\" data-toc-modified-id=\"Single-level-groupby-14\"><span class=\"toc-item-num\">14&nbsp;&nbsp;</span>Single level <code>groupby</code></a></span></li><li><span><a href=\"#Two-new-concepts:-cutting-and-subplots\" data-toc-modified-id=\"Two-new-concepts:-cutting-and-subplots-15\"><span class=\"toc-item-num\">15&nbsp;&nbsp;</span>Two new concepts: cutting and subplots</a></span></li><li><span><a href=\"#Extending-groupby-to-show-multiple-outputs\" data-toc-modified-id=\"Extending-groupby-to-show-multiple-outputs-16\"><span class=\"toc-item-num\">16&nbsp;&nbsp;</span>Extending groupby to show multiple outputs</a></span></li><li><span><a href=\"#Multilevel-groupby\" data-toc-modified-id=\"Multilevel-groupby-17\"><span class=\"toc-item-num\">17&nbsp;&nbsp;</span>Multilevel groupby</a></span></li><li><span><a href=\"#Multiple-groupby-summaries\" data-toc-modified-id=\"Multiple-groupby-summaries-18\"><span class=\"toc-item-num\">18&nbsp;&nbsp;</span>Multiple groupby summaries</a></span></li><li><span><a href=\"#Add-a-new-column,-not-at-the-end-(right-hand-side)\" data-toc-modified-id=\"Add-a-new-column,-not-at-the-end-(right-hand-side)-19\"><span class=\"toc-item-num\">19&nbsp;&nbsp;</span>Add a new column, not at the end (right-hand side)</a></span></li><li><span><a href=\"#Replace-values-in-a-column\" data-toc-modified-id=\"Replace-values-in-a-column-20\"><span class=\"toc-item-num\">20&nbsp;&nbsp;</span>Replace values in a column</a></span></li><li><span><a href=\"#Styling-table-displays\" data-toc-modified-id=\"Styling-table-displays-21\"><span class=\"toc-item-num\">21&nbsp;&nbsp;</span>Styling table displays</a></span></li><li><span><a href=\"#Smoothing-a-curve\" data-toc-modified-id=\"Smoothing-a-curve-22\"><span class=\"toc-item-num\">22&nbsp;&nbsp;</span>Smoothing a curve</a></span></li><li><span><a href=\"#Challenge\" data-toc-modified-id=\"Challenge-23\"><span class=\"toc-item-num\">23&nbsp;&nbsp;</span>Challenge</a></span></li></ul></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> All content here is under a Creative Commons Attribution [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) and all source code is released under a [BSD-2 clause license](https://en.wikipedia.org/wiki/BSD_licenses).\n",
    ">\n",
    ">Please reuse, remix, revise, and [reshare this content](https://github.com/kgdunn/python-basic-notebooks) in any way, keeping this notice."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course overview\n",
    "\n",
    "This is the sixth, and final, module of several (11, 12, 13, 14, 15 and 16), which refocuses the course material in the [first 10  modules](https://github.com/kgdunn/python-basic-notebooks) in a slightly different way. It places more emphasis on\n",
    "\n",
    "* dealing with data: importing, merging, filtering;\n",
    "* calculations from the data;\n",
    "* visualization of it.\n",
    "\n",
    "In short: ***how to extract value from your data***.\n",
    "\n",
    "## Review so far\n",
    "\n",
    "In [module 11](https://yint.org/pybasic11) we learned about\n",
    "* creating variables, and showing their `type`,\n",
    "* performing basic calculations, and the `math` library,\n",
    "* lists, as one of the most fundamental Python objects.\n",
    "\n",
    "In the [module 12](https://yint.org/pybasic12) we took this a step further:\n",
    "* and introduced the Pandas library, for `Series` and `DataFrame` objects,\n",
    "* learned how to import and write Excel files,\n",
    "* do basic operations on DataFrames, and \n",
    "* learned about another fundamental Python type, the `dict`ionary.\n",
    "\n",
    "[Module 13](https://yint.org/pybasic13) we introduced:\n",
    "* a general workflow for data processing\n",
    "* and how to visualize data with Pandas:\n",
    "\n",
    "    * box plot, \n",
    "    * time series (sequence) plot, and\n",
    "    * scatter plots [including showing how you can visualize 5 dimensions!]\n",
    "\n",
    "[Module 14](https://yint.org/pybasic14) we saw how to create:\n",
    "* for loops, for when we need to do things over and over,\n",
    "* but we also saw the `groupby` function, which does actions repeatedly on sub-groups of your data.\n",
    "* We also introduced the correlation matrix.\n",
    "\n",
    "Then in [module 15](https://yint.org/pybasic15) we saw:\n",
    "* that we could visualize the correlation matrix (2D histogram), to find candidates for regression,\n",
    "* using the `LinearRegression` tool from a new library, `scikit-learn`.\n",
    "* We also used another new library, `seaborn`, to visualize these regression models.\n",
    "\n",
    "\n",
    "# Module 16 Overview\n",
    "\n",
    "In this module we will cover a collection of last loose ends. Things you will use regularly in your work. \n",
    "\n",
    "* Reading in subsets of data\n",
    "* Handling missing values with Pandas\n",
    "* Filtering data, and the multi-level `groupby` capability of Panda\n",
    "* Effective table display in Pandas\n",
    "\n",
    "Most of them come from this list, with some modifications: https://towardsdatascience.com/30-examples-to-master-pandas-f8a2da751fa4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Data set import and basic checks\n",
    "\n",
    "We will use a data set that is related to food consumption. It shows, in a relative way, the food consumption habits of European (and soon to be former EU) countries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time \n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "from scipy.interpolate import UnivariateSpline\n",
    "from scipy.signal import savgol_filter\n",
    "import plotly.graph_objects as go\n",
    "import plotly.io as pio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pio.renderers.default = \"iframe\" # \"notebook\" # jupyterlab\n",
    "pd.options.plotting.backend = \"plotly\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\")\n",
    "display(df.info())\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Visualizing the correlation matrix is essential to help understanding relationships. Use the code and the plot below to help answer:\n",
    "\n",
    "* Countries which consume garlic more than average, also seem to consume a higher amount of ...\n",
    "* Which variables are negatively correlated with \"Real coffee\" consumption?\n",
    "* Countries with higher consumption of \"Crisp bread\" also show high consumption of which other products?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.set(rc={'figure.figsize': (15, 15)})\n",
    "cmap = sns.diverging_palette(220, 10, as_cmap=True)\n",
    "sns.heatmap(df.corr(), cmap=cmap,  square=True, linewidths=0.5, cbar_kws={\"shrink\": 0.8});"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## List comprehensions\n",
    "\n",
    "\"*List comprehensions*\" are a quick way to make a list. You can read more, and see some examples here: https://realpython.com/list-comprehension-python/#using-list-comprehensions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(  [i     for i in range(10)]  )\n",
    "print(  [i*2+1 for i in range(10)]  )\n",
    "print(  [i*2   for i in range(10) if i > 4]  )\n",
    "print(  [i     for i in range(10) if i % 2 == 1]  )\n",
    "print(  [i     for i in range(10) if i % 2 == 0]  )\n",
    "print(  [i     for i in range(10) if i % 4 == 1]  )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading only certain rows\n",
    "\n",
    "Imagine you had a large data set, and only needed certain rows for your calculations/visualization later on. You can use the `nrows` and `skiprows` arguments to read only a subset of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_subset = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\", nrows=5)\n",
    "display(df_subset)\n",
    "\n",
    "df_partial = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\", skiprows=[2, 3, 4])\n",
    "display(df_partial)\n",
    "\n",
    "# Requires an extra `engine` input\n",
    "df_bottom = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\", skipfooter=12) #, engine='python') <- intentionally left out for demo\n",
    "display(df_bottom)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Skipping every 3rd row, using a list comprehension...\n",
    "print([i for i in range(40) if i%3 ==1])\n",
    "df_partial = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\", \n",
    "                         skiprows=[i for i in range(40) if i%3 ==1])\n",
    "display(df_partial)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import specific columns only\n",
    "\n",
    "If you know the names of the columns you need, you can use the `usecols` input.\n",
    "\n",
    "Note: this also works for Excel files! You can say, for example, `usecols=\"F,G,BQ\"` if you need columns F, G and BQ only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_subset = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\", \n",
    "                        usecols=[\"Country\", \"Sweetener\", \"Biscuits\", \"Powder soup\", \"Tin soup\"])\n",
    "display(df_subset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dropping columns or rows\n",
    "\n",
    "Conversely, you can read in the whole data set, and drop away the columns or rows you do not need."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = (\n",
    "    pd.read_csv(\"https://openmv.net/file/food-consumption.csv\")\n",
    "    .drop([\"Sweetener\", \"Biscuits\", \"Powder soup\", \"Tin soup\"], axis=1)\n",
    ")\n",
    "display(df)\n",
    "df.shape\n",
    "\n",
    "# Also drop some rows: drop away every 3rd row.\n",
    "# You can also leave away 'axis=0' (because that's the default)\n",
    "df_subset = df.drop([i for i in range( df.shape[0] ) if i%3 ==1] , axis=0)  \n",
    "display(df_subset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting an index\n",
    "\n",
    "You can always make a column from your dataframe to be your `index`, using the `set_index` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\")\n",
    "df = df.set_index('Country')\n",
    "display(df)\n",
    "\n",
    "# Or, in a single line, in a chained operation\n",
    "df = (\n",
    "    pd.read_csv(\"https://openmv.net/file/food-consumption.csv\")\n",
    "    .drop([i for i in range( df.shape[0] ) if i%3 ==1] , axis=0)\n",
    "    .set_index('Country')\n",
    ")\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualizing and deleting missing values\n",
    "\n",
    "Pandas generally handles missing values well: for example, the ``df.mean()`` function will work even if there are missing values. But some mathematical tools cannot have missing values, such as when performing a linear regression. So deleting missing data first is an option. It is therefore helpful that you can:\n",
    "\n",
    "* Find how many missing values are there per column? Or per row?\n",
    "* Delete columns with missing values.\n",
    "* Deleting rows with any missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Which columns have missing values:\n",
    "df = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\").set_index('Country')\n",
    "display(df.isna().sum())\n",
    "\n",
    "# Which rows have missing values:\n",
    "df.isna().sum(axis=1)\n",
    "\n",
    "# Display missing values in a heat map\n",
    "sns.set(rc={'figure.figsize': (10, 10)})\n",
    "sns.heatmap(df.isna(), square=True, cbar_kws={\"shrink\": 0.5});"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Confirm that the \"Sweetener\", \"Biscuits\", and \"Yoghurt\" columns are not present after running this command (these columns had missing values in them):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Delete columns with missing values\n",
    "df.dropna(axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Confirm that the rows for \"Sweden\", \"Finland\", and \"Spain\", which had missing entries, are not present after this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Delete rows with missing values\n",
    "df.dropna(axis=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dropping missing values in all rows, but only for a subset of the columns is possible. For example, drop only rows in the columns for \"Sweetener\" and \"Yoghurt\" (ignore the column for \"Biscuits\"):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display(df.dropna(subset=[\"Sweetener\", \"Yoghurt\"], axis=0))\n",
    "\n",
    "# Note: you can also flip this around. Specify a subset of row names\n",
    "#       in `subset` and delete from all columns, using `axis=1`.\n",
    "df.dropna(subset=[\"Sweden\"], axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using iloc and loc\n",
    "\n",
    "We learned about `.iloc` in the [prior module](https://yint.org/pybasic15). Let's look at this again, and emphasize the difference between `.iloc` and `.loc`. [This article](https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/) gives more details about the two if you want some more explanation.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\").set_index('Country')\n",
    "\n",
    "# \"Instant coffee\" is column 1: make all these values missing\n",
    "df.iloc[:, 1] = np.nan\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# But what if don't know, or care, which column index it is? \n",
    "# When we know the column's name, then use \".loc\" \n",
    "df.loc[:, \"Tea\"] = np.nan\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Or you can use a list of column names:\n",
    "df.loc[:, [\"Potatoes\",\"Frozen fish\"]] = 98.76\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# You can use a mixture of .iloc and .loc:\n",
    "df.iloc[[0, 1, 2], :].loc[:, \"Tin soup\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# but this is less code:\n",
    "df.iloc[[0, 1, 2], :][\"Tin soup\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# or even less this way:\n",
    "df.iloc[[0, 1, 2]][\"Tin soup\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Or using .loc\n",
    "df.loc[\"Germany\":\"France\", \"Tin soup\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dropping missing values, specifying a threshold\n",
    "\n",
    "If you want to delete a column only if there are more than a certain number of missing values:\n",
    "\n",
    "* Read the data\n",
    "* Make a column have a high number of missing values (for demonstration purposes; normally the column is already problematic)\n",
    "* Remove that column, because it has a high degree of missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the data, and make every 3rd row a missing value for column \"Tea\"\n",
    "df = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\").set_index('Country')\n",
    "\n",
    "df.iloc[[i for i in range(16) if i%3 == 1]][\"Tea\"] = np.nan\n",
    "\n",
    "# The above code generates a warning. Why?\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# How to make this warning go away? As suggested by the warning, use \".loc\" instead.\n",
    "#     df.loc[row_indexer, col_indexer] = np.nan\n",
    "\n",
    "# Create a variable containing all row names:\n",
    "row_indexer = df.index\n",
    "\n",
    "# Now take every third row name:\n",
    "row_indexer = df.index[  [i for i in range(16) if i%3 ==1]  ]\n",
    "row_indexer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Then, set these rows to have missing values:\n",
    "df.loc[row_indexer, \"Tea\"] = np.nan\n",
    "display(df)\n",
    "df.isna().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Finally, we can now delete columns with a threshold (degree) of missing values\n",
    "# What value should you fill in here?\n",
    "display(df.dropna(thresh=11, axis=1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filtering rows\n",
    "\n",
    "* Find which countries have `\"Olive oil\"` consumption of more than 50?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\").set_index('Country')\n",
    "df[ df[\"Olive oil\"] > 50  ]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Which countries have `\"Olive oil\"` more than 50, **and** `\"Garlic\"` more than 40?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[   (df[\"Olive oil\"] > 50) & (df[\"Garlic\"] > 40)   ]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Which countries have `\"Tea\"` more than 80, **or** `\"Oranges\"` more than 90?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[(df[\"Tea\"] > 80) | (df[\"Oranges\"] > 90)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filtering with the `.query` function\n",
    "\n",
    "It is sometimes more natural to filter with the `.query` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display(df.query(\"30 < Tea < 80\"))\n",
    "\n",
    "# or use backticks if the column name has a space:\n",
    "df.query(\"10 < `Tin soup` < 20\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can have multiple queries:\n",
    "\n",
    "Find the countries which have \"Real coffee\" and \"Tea\" consumption above 70. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.query(\"(`Real coffee` > 70) or (Tea > 70)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Really powerful is the ability to reference one column against another.\n",
    "\n",
    "Find all countries where more `\"Instant coffee\"` is drunk more than `\"Real coffee\"`. *These are countries to avoid visiting*. What else do you notice about these countries eating habits?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.query(\"`Instant coffee` > `Real coffee`\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## New data set: raw material properties\n",
    "\n",
    "For the rest of the notebook we will switch to a new data set, where we characterize the properties of a raw material. As each batch of raw material is acquired, there are 6 measurements taken. There is also an indicator variable (categorical variable) on whether the raw materials outcome was (`Adequate`), or not (`Poor`).\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"https://openmv.net/file/raw-material-characterization.csv\").set_index(\"Lot number\")\n",
    "display(df)\n",
    "\n",
    "# Note that the Outcome column is an object. We can explicitly convert it to a categorical variable:\n",
    "df[\"Outcome\"] = df[\"Outcome\"].astype('category')\n",
    "display(df.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Single level `groupby`\n",
    "\n",
    "Recall the `groupby` function from [two modules ago](https://yint.org/pybasic14), which we applied as follows in a loop to create a plot for each group:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Groupby: for plotting\n",
    "for outcome, subset in df.groupby(\"Outcome\"):\n",
    "    fig=subset.plot.scatter(x='Size5', y=\"Size15\")\n",
    "    fig.update_layout(\n",
    "        xaxis_range=[10, 16], \n",
    "        yaxis_range=[18, 45],\n",
    "        width=500,\n",
    "    )\n",
    "    fig.show()\n",
    "    time.sleep(0.5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Two new concepts: cutting and subplots\n",
    "\n",
    "Sometimes you want the plots shown in a matrix, or a grid. This is called a subplot. You can read the Plotly documentation about [subplots](https://plotly.com/python/subplots/) on their site.\n",
    "\n",
    "Sometimes you want to split your (continuous) data into smaller groups (or categories). This can be done with the [cut function](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) in Pandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cut the \"TMA\" variable. These values lie between 46.2 and 68.0, so let's create 3 bins, with 4 cuts at [40, 50, 55, 100]\n",
    "display(pd.cut(df['TMA'], bins=[40, 50, 55, 100]))\n",
    "df['TMA categories'] = pd.cut(df['TMA'], bins=[40, 50, 55, 100])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f'Using groupby you can loop; there would be {df.groupby(\"TMA categories\").ngroups} groups. But we will use subplots instead ...')\n",
    "\n",
    "# You can use this code below as a general recipe for creating subplots.\n",
    "nrows = 2\n",
    "ncols = 2\n",
    "from plotly.subplots import make_subplots\n",
    "fig = make_subplots(\n",
    "    rows=nrows, \n",
    "    cols=ncols, \n",
    "    shared_xaxes=False,\n",
    "    shared_yaxes=False,\n",
    "    vertical_spacing=0.15,\n",
    "    horizontal_spacing=0.10,\n",
    "    subplot_titles=[str(val) for val in df.groupby(\"TMA categories\").groups.keys()],\n",
    "    start_cell=\"top-left\") # \"bottom-left\"\n",
    "\n",
    "# In a loop, create each subplot \n",
    "row = col = 1\n",
    "for category, subset in df.groupby(\"TMA categories\"):\n",
    "    fig.add_trace(\n",
    "        go.Scatter(\n",
    "            x=subset['TMA'], \n",
    "            y=subset['Size15'],\n",
    "            mode=\"markers\",\n",
    "            name=str(category),\n",
    "            showlegend=True,\n",
    "        ),\n",
    "        row=row, \n",
    "        col=col,       \n",
    "    )\n",
    "    \n",
    "    # Bump up the counters for the next plot ...\n",
    "    col += 1\n",
    "    if col > ncols:\n",
    "        col = 1\n",
    "        row += 1\n",
    "\n",
    "fig.update_layout(width=1000)\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extending groupby to show multiple outputs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Or using groupby for a single summary\n",
    "display(df.groupby(\"Outcome\").mean())\n",
    "display(df.groupby(\"Outcome\").std())\n",
    "\n",
    "# Or, call all the summaries together. We will explain the .agg function below.\n",
    "df.groupby(\"Outcome\").agg([\"mean\", \"std\"]).round(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multilevel groupby\n",
    "\n",
    "We can also use `groupby` for multiple levels. Imagine we have a second categorical variable, or some other variable with few discrete values:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using what you learned above, you can see how we can quickly create a new column with .cut(...)\n",
    "df[\"Size\"] = pd.cut(df['Size5'], bins=[0, 13, np.inf])   # --> intentionally left out for now:   labels=[\"Small\", \"Large\"]\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now you can use a multi-level groupby:\n",
    "display(df.groupby([\"Outcome\", \"Size\", ]).count())  # <-- redundant, use the next one instead\n",
    "display(df.groupby([\"Outcome\", \"Size\", ]).size())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multiple groupby summaries\n",
    "\n",
    "\n",
    "In the above, we had to write 2 lines with the `groupby` function: once for `size` and once for `mean`. But you can get them both in 1 table, using the `agg` function. `agg` is short hand for aggregation (which means to form things into a cluster)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# These 2 lines do exactly the same:\n",
    "display(  df.groupby([\"Outcome\", \"Size\"]).mean()  )\n",
    "display(  df.groupby([\"Outcome\", \"Size\"]).agg('mean') )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now extend it: we have 2 groups (vertical index axis) and 2 .agg functions (horizontal column axis):\n",
    "display(  df.groupby([\"Outcome\", \"Size\"]).agg([\"mean\", \"std\"]) )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# You can specify an entire collection of aggregations, and on which columns you want to do that:\n",
    "agg_func_math = ['count', 'mean', 'median', 'min', 'max', 'std']\n",
    "df.groupby(['Outcome'])[[\"Size5\", \"TGA\"]].agg(agg_func_math).round(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Add a new column, not at the end (right-hand side)\n",
    "\n",
    "We saw above that we can create a new column, but that it automatically gets added on the right-hand side of the data frame. If you would like the column elsewhere, use the `.insert()` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.insert(0, 'EmptyColumn', np.nan)\n",
    "df.insert(3, 'Column of ones', [1] * df.shape[0])\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Replace values in a column\n",
    "\n",
    "We can do a \"search and replace\" function on the values in a data frame. \n",
    "\n",
    "Imagine if we wanted to change the `Outcome` column, and instead of `Adequate` and `Poor` we would rather have `Good` and `Bad`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"Outcome-newname\"] = df['Outcome'].replace({\"Adequate\": \"Good\", \"Poor\": \"Bad\"})\n",
    "df\n",
    "\n",
    "# Try setting the `Outcome` column to numeric values: Adequate -> 1 and Poor -> 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Styling table displays\n",
    "\n",
    "To help emphasize your message in a table, you might want to colour your table appropriately.\n",
    "\n",
    "You can read about all the options on this page in the Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"https://openmv.net/file/raw-material-characterization.csv\").set_index(\"Lot number\")\n",
    "df.style.bar(color=['lightblue'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# How to style only a subset of the columns:\n",
    "(df.style\n",
    "   .hide_index() # if you don't need your index column, you can drop it away \n",
    "   .bar(color='green', subset=['Size5', 'Size10', 'Size15'])\n",
    "   .set_caption('Raw material outcomes')\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "cmap = cmap=sns.diverging_palette(0, 50, as_cmap=True)\n",
    "\n",
    "# Double sort: first on `Outcome`, then on `Size5`\n",
    "df.sort_values([\"Outcome\", \"Size5\"], inplace=True)\n",
    "\n",
    "(df[[\"Outcome\", \"Size5\", \"Size10\", \"Size15\"]].style\n",
    "    .background_gradient(cmap)\n",
    "    .format(precision=2)   # number of places after the decimal\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Show missing values with a colour. First, create an artificial missing value:\n",
    "df.iloc[4, 3] = np.nan\n",
    "df.head(7).style.format(precision=2).highlight_null('red')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Show the minimum and maximum values with different colours:\n",
    "(df.style\n",
    "   .format(precision=2)\n",
    "   .highlight_min(axis=0, color=\"lightblue\")\n",
    "   .highlight_max(axis=0, color='orange')\n",
    "   .highlight_null('red')\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Smoothing a curve\n",
    "\n",
    "We often want a smoother version of the raw data. One option to use the [Savitzky-Golay filter](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.savgol_filter.html); though there are a number of other options."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "seq = [1.87, 1.88, 1.89, 1.9, 1.92, 1.96, 2.0, 2.1, 2.12, 2.27, 2.29, 2.28, 2.44, 2.48, 2.52, 2.53, 2.54, 2.55, 2.56, 2.57]\n",
    "absorbances = pd.Series(seq)\n",
    "time_points = [    81 + i*9  for i in range(len(seq)) ]\n",
    "df = pd.DataFrame(dict(absorbances=absorbances, time_points=time_points))\n",
    "fig=df.plot.line(x=\"time_points\", y=\"absorbances\", title=\"Raw data\")\n",
    "fig.add_hline(y=2.25, line_color=\"purple\", line_dash=\"dash\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.interpolate import UnivariateSpline\n",
    "from scipy.signal import savgol_filter\n",
    "\n",
    "filtered_series = savgol_filter(\n",
    "    x=seq,             \n",
    "    window_length=5, \n",
    "    polyorder=3, \n",
    ")  \n",
    "\n",
    "# Create a data frame of this and plot it:\n",
    "df_smoothed = pd.DataFrame(\n",
    "    dict(\n",
    "        time_points=time_points, \n",
    "        filtered_series=filtered_series \n",
    "    )\n",
    ")\n",
    "\n",
    "# Plot the raw data and smoothed data:\n",
    "fig = go.Figure()\n",
    "fig.add_trace(\n",
    "    go.Scatter(\n",
    "        x= df['time_points'], \n",
    "        y= df[\"absorbances\"],\n",
    "        mode=\"markers\",\n",
    "        name=\"Raw data\",\n",
    "    )\n",
    ")\n",
    "fig.add_trace(\n",
    "    go.Scatter(\n",
    "        x=df_smoothed['time_points'], \n",
    "        y=df_smoothed[\"filtered_series\"],\n",
    "        mode=\"lines\",\n",
    "        name=\"Smoothed fit\",\n",
    "    )\n",
    ")\n",
    "fig.add_hline(y=2.25, line_color=\"purple\", line_dash=\"dash\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note! here we flip x and y around: we want to know what is the expected\n",
    "# values of time, given the absorbance value of 2.25.\n",
    "spline = UnivariateSpline(\n",
    "    x = df_smoothed[\"filtered_series\"],   \n",
    "    y = df_smoothed['time_points'], \n",
    "    #bc_type='not-a-knot',\n",
    "    #extrapolate=None\n",
    ")\n",
    "\n",
    "# Interpolate a new x-axis (absorbance axis) on a very fine scale, between 2.1 and 2.4\n",
    "interpolated_abs = np.arange(2.1, 2.4, 0.01)\n",
    "predicted_time = spline(interpolated_abs)\n",
    "\n",
    "# Create a data frame of this and plot it:\n",
    "df_interpolated = pd.DataFrame(\n",
    "    dict(\n",
    "        predicted_time=predicted_time, \n",
    "        interpolated_abs=interpolated_abs \n",
    "    )\n",
    ")\n",
    "# At what timepoint does the line cross 2.25? Answer is shown in the table: 166.357 seconds\n",
    "display(df_interpolated.iloc[(df_interpolated[\"interpolated_abs\"] - 2.25).abs().argmin()])\n",
    "\n",
    "# Plot the raw data and smoothed data:\n",
    "fig = go.Figure()\n",
    "fig.add_trace(\n",
    "    go.Scatter(\n",
    "        x= df['time_points'], \n",
    "        y= df[\"absorbances\"],\n",
    "        mode=\"markers\",\n",
    "        name=\"Raw data\",\n",
    "    )\n",
    ")\n",
    "fig.add_trace(\n",
    "    go.Scatter(\n",
    "        x=df_smoothed['time_points'], \n",
    "        y=df_smoothed[\"filtered_series\"],\n",
    "        mode=\"lines\",\n",
    "        name=\"Smoothed fit\",\n",
    "    )\n",
    ")\n",
    "fig.add_trace(\n",
    "    go.Scatter(\n",
    "        x= df_interpolated['predicted_time'], \n",
    "        y= df_interpolated[\"interpolated_abs\"],\n",
    "        mode=\"lines\",\n",
    "        name=\"Interpolated spline fit\",\n",
    "    )\n",
    ")\n",
    "fig.add_hline(y=2.25, line_color=\"purple\", line_dash=\"dash\")\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Challenge\n",
    "\n",
    "1. Apply table styling to the Foods data set, at the start of this notebook. Can you visualize, in a colourful way, some of the food consumption trends?\n",
    "2. Apply this same table styling to a small/medium sized data set of your own."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "End of this notebook."
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  },
  "toc": {
   "base_numbering": "1",
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": true,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "376.219px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  },
  "toc-autonumbering": true,
  "toc-showcode": false,
  "toc-showtags": false,
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}