{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p><font size=\"6\"><b> CASE - Observation data - analysis</b></font></p>\n",
    "\n",
    "> *© 2021, Joris Van den Bossche and Stijn Van Hoey  (<mailto:jorisvandenbossche@gmail.com>, <mailto:stijnvanhoey@gmail.com>). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "plt.style.use('seaborn-whitegrid')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Reading in the enriched observations data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 1**\n",
    "\n",
    "- Read in the `survey_data_completed.csv` file and save the resulting `DataFrame` as variable `survey_data_processed` (if you did not complete the previous notebook, a version of the csv file is available in the `data` folder).\n",
    "- Interpret the 'eventDate' column directly as python `datetime` objects and make sure the 'occurrenceID' column is used as the index of the resulting DataFrame (both can be done at once when reading the csv file using parameters of the `read_csv` function)\n",
    "- Inspect the first five rows of the DataFrame and the data types of each of the data columns. Verify that the 'eventDate' indeed has a datetime data type.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- All read functions in Pandas start with `pd.read_...`.\n",
    "- To check the documentation of a function, use the keystroke combination of SHIFT + TAB when the cursor is on the function.\n",
    "- Remember `.head()` and `.info()`?\n",
    "\n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis1.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis2.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis3.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Tackle missing values (NaN) and duplicate values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "See [pandas_08_missing_values.ipynb](pandas_08_missing_values.ipynb) for an overview of functionality to work with missing values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 2**\n",
    "\n",
    "How many records in the data set have no information about the `species`? Use the `isna()` method to find out.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- Do NOT use `survey_data_processed['species'] == np.nan`, but use the available method `isna()` to check if a value is NaN\n",
    "- The result of an (element-wise) condition returns a set of True/False values, corresponding to 1/0 values. The amount of True values is equal to the sum.\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis4.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 3**\n",
    "\n",
    "How many duplicate records are present in the dataset? Use the method `duplicated()` to check if a row is a duplicate.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- The result of an (element-wise) condition returns a set of True/False values, corresponding to 1/0 values. The amount of True values is equal to the sum.\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis5.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 4**\n",
    "\n",
    "- Select all duplicate data by filtering the `observations` data and assign the result to a new variable `duplicate_observations`. The `duplicated()` method provides a `keep` argument define which duplicates (if any) to mark.\n",
    "- Sort the `duplicate_observations` data on both the columns `eventDate` and `verbatimLocality` and show the first 9 records.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- Check the documentation of the `duplicated` method to find out which value the argument `keep` requires to select all duplicate data.\n",
    "- `sort_values()` can work with a single columns name as well as a list of names.\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis6.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 5**\n",
    "\n",
    "- Exclude the duplicate values (i.e. keep the first occurrence while removing the other ones) from the `observations` data set and save the result as `survey_data_unique`. Use the `drop duplicates()` method from Pandas.\n",
    "- How many observations are still left in the data set?\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- `keep=First` is the default option for `drop_duplicates`\n",
    "- The number of rows in a DataFrame is equal to the `len`gth\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis7.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis8.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 6**\n",
    "\n",
    "Use the `dropna()` method to find out:\n",
    "\n",
    "- For how many observations (rows) we have all the information available (i.e. no NaN values in any of the columns)?\n",
    "- For how many observations (rows) we do have the `species` data available ?\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- `dropna` by default removes by default all rows for which _any_ of the columns contains a `NaN` value.\n",
    "- To specify which specific columns to check, use the `subset` argument\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis9.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 7**\n",
    "\n",
    "Filter the  `survey_data_unique` data and select only those records that do not have a `species` while having information on the `sex`. Store the result as variable `not_identified`.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- To combine logical operators element-wise in Pandas, use the `&` operator.\n",
    "- Pandas provides both a `isna()` and a `notna()` method to check the existence of `NaN` values.\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis10.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "not_identified.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__NOTE!__\n",
    "\n",
    "The `DataFrame` we will use in the further analyses contains species information:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "survey_data = survey_data_unique.dropna(subset=['species']).copy()\n",
    "survey_data['name'] = survey_data['genus'] + ' ' + survey_data['species']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "**INFO**\n",
    "\n",
    "For biodiversity studies, absence values (knowing that something is not present) are useful as well to normalize the observations, but this is out of scope for these exercises.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Select subsets of the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "survey_data['taxa'].value_counts()\n",
    "#survey_data.groupby('taxa').size()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 8**\n",
    "\n",
    "- Select the observations for which the `taxa` is equal to 'Rabbit', 'Bird' or 'Reptile'. Assign the result to a variable `non_rodent_species`. Use the `isin` method for the selection.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- You do not have to combine three different conditions, but use the `isin` operator with a list of names.\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis11.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(non_rodent_species)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 9**\n",
    "\n",
    "Select the observations for which the `name` starts with the characters 'r' (make sure it does not matter if a capital character is used in the 'taxa' name). Call the resulting variable `r_species`.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- Remember the `.str.` construction to provide all kind of string functionalities? You can combine multiple of these after each other.\n",
    "- If the presence of capital letters should not matter, make everything lowercase first before comparing (`.lower()`)\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis12.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(r_species)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r_species[\"name\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 10**\n",
    "\n",
    "Select the observations that are not Birds. Call the resulting variable <code>non_bird_species</code>.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- Logical operators like `==`, `!=`, `>`,... can still be used.\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis13.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(non_bird_species)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 11**\n",
    "\n",
    "Select the __Bird__ (taxa is Bird) observations from 1985-01 till 1989-12 using the `eventDate` column. Call the resulting variable `birds_85_89`.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- No hints, you can do this! (with the help of some `<=` and `&`, and don't forget the put brackets around each comparison that you combine)\n",
    "\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis14.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternative solution:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis15.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 12**\n",
    "\n",
    "- Drop the observations for which no 'weight' (`wgt` column) information is available.\n",
    "- On the filtered data, compare the median weight for each of the species (use the `name` column)\n",
    "- Sort the output from high to low median weight (i.e. descending)\n",
    "\n",
    "__Note__ You can do this all in a single line statement, but don't have to do it as such!\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- You will need `dropna`, `groupby`, `median` and `sort_values`.\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis16.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis17.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Species abundance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 13**\n",
    "\n",
    "Which 8 species (use the `name` column to identify the different species) have been observed most over the entire data set?\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- Pandas provide a function to combine sorting and showing the first n records, see [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.nlargest.html)...\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis18.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis19.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 14**\n",
    "\n",
    "- What is the number of different species (`name`) in each of the `verbatimLocality` plots? Use the `nunique` method. Assign the output to a new variable `n_species_per_plot`.\n",
    "- Define a Matplotlib `Figure` (`fig`) and `Axes` (`ax`) to prepare a plot. Make an horizontal bar chart using Pandas `plot` function linked to the just created Matplotlib `ax`. Each bar represents the `species per plot/verbatimLocality`. Change the y-label to 'Plot number'.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- _...in each of the..._ should provide a hint to use `groupby` for this exercise. The `nunique` is the aggregation function for each of the groups.\n",
    "- `fig, ax = plt.subplots()` prepares a Matplotlib Figure and Axes.\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis20.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 15**\n",
    "\n",
    "- What is the number of plots (`verbatimLocality`) each of the species (`name`) have been observed in? Assign the output to a new variable `n_plots_per_species`. Sort the counts from low to high.\n",
    "- Make an horizontal bar chart using Pandas `plot` function to show the number of plots each of the species was found (using the `n_plots_per_species` variable).\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- Use the previous exercise to solve this one.\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis21.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 16**\n",
    "\n",
    "- Starting from the `survey_data`, calculate the amount of males and females present in each of the plots (`verbatimLocality`). The result should return the counts for each of the combinations of `sex` and `verbatimLocality`. Assign to a new variable `n_plot_sex` and ensure the counts are in a column named \"count\".\n",
    "- Use `pivot` to convert the `n_plot_sex` DataFrame to a new DataFrame with the `verbatimLocality` as index and `male`/`female` as column names. Assign to a new variable `pivoted`.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- _...for each of the combinations..._ `groupby` can also be used with multiple columns at the same time.\n",
    "- If a `groupby` operation gives a Series as result, you can give that Series a name with the `.rename(..)` method.\n",
    "- `reset_index()` is useful function to convert multiple indices into columns again.\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis22.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis23.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pivoted.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To check, we can use the variable `pivoted` to plot the result:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pivoted.plot(kind='bar', figsize=(12, 6), rot=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 17**\n",
    "\n",
    "Recreate the previous plot with the `catplot` function from the Seaborn library starting from `n_plot_sex`.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- Check the `kind` argument of the `catplot` function to figure out to specify you want a barplot with given x and y values.\n",
    "- To link a column to different colors, use the `hue` argument\n",
    "- Using `height` and `aspect`, the figure size can be optimized.\n",
    "\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis24.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 18**\n",
    "\n",
    "Recreate the previous plot with the `catplot` function from the Seaborn library directly starting from `survey_data`.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- Check the `kind`argument of the `catplot` function to find out how to use counts to define the bars instead of a `y` value.\n",
    "- To link a column to different colors, use the `hue` argument\n",
    "\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis25.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 19**\n",
    "\n",
    "- Make a summary table with the number of records of each of the species in each of the plots (also called `verbatimLocality`). Each of the species `name`s is a row index and each of the `verbatimLocality` plots is a column name.\n",
    "- Using the Seaborn <a href=\"http://seaborn.pydata.org/generated/seaborn.heatmap.html\">documentation</a> to make a heatmap.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- Make sure to pass the correct columns to respectively the `index`, `columns`, `values` and `aggfunc` parameters of the `pivot_table` function. You can use the `datasetName` to count the number of observations for each name/locality combination (when counting rows, the exact column doesn't matter).\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis26.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis27.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Observations over time"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 20**\n",
    "\n",
    "Make a plot visualizing the evolution of the number of observations for each of the individual __years__ (i.e. annual counts) using the `resample` method.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- You want to `resample` the data using the `eventDate` column to create annual counts. If the index is not a datetime-index, you can use the `on=` keyword to specify which datetime column to use.\n",
    "- `resample` needs an aggregation function on how to combine the values within a single 'group' (in this case data within a year). In this example, we want to know the `size` of each group, i.e. the number of records within each year.\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis28.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To evaluate the intensity or number of occurrences during different time spans, a heatmap is an interesting representation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 21**\n",
    "\n",
    "- Create a table, called `heatmap_prep`, based on the `survey_data` DataFrame with the row index the individual years, in the column the months of the year (1-> 12) and as values of the table, the counts for each of these year/month combinations.\n",
    "- Using the seaborn <a href=\"http://seaborn.pydata.org/generated/seaborn.heatmap.html\">documentation</a>, make a heatmap starting from the `heatmap_prep` variable.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- The `.dt` accessor can be used to get the `year`, `month`,... from a `datetime` column\n",
    "- Use `pivot_table` and provide the years to `index` and the months to `columns`. Do not forget to `count` the number for each combination (`aggfunc`).\n",
    "- Seaborn has an `heatmap` function which requires a short-form DataFrame, comparable to giving each element in a table a color value.\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis29.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remark that we started from a `tidy` data format (also called *long* format) and converted to *short* format with in the row index the years, in the column the months and the counts for each of these year/month combinations as values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## (OPTIONAL SECTION) 6. Evolution of species during monitoring period"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*In this section, all plots can be made with the embedded Pandas plot function, unless specificly asked*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 22**\n",
    "\n",
    "Plot using Pandas `plot` function the number of records for `Dipodomys merriami` for each month of the year (January (1) -> December (12)), aggregated over all years.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- _...for each month of..._ requires `groupby`.\n",
    "- `resample` is not useful here, as we do not want to change the time-interval, but look at month of the year (over all years)\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis30.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis31.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 23**\n",
    "\n",
    "Plot, for the species 'Dipodomys merriami', 'Dipodomys ordii', 'Reithrodontomys megalotis' and 'Chaetodipus baileyi', the monthly number of records as a function of time during the monitoring period. Plot each of the individual species in a separate subplot and provide them all with the same y-axis scale\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- `isin` is useful to select from within a list of elements.\n",
    "- `groupby` AND `resample` need to be combined. We do want to change the time-interval to represent data as a function of time (`resample`) and we want to do this _for each name/species_ (`groupby`). The order matters!\n",
    "- `unstack` is a Pandas function a bit similar to `pivot`. Check the [unstack documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html) as it might be helpful for this exercise.\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis32.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis33.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis34.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 24**\n",
    "\n",
    "Recreate the same plot as in the previous exercise using Seaborn `relplot` functon with the `month_evolution` variable.\n",
    "    \n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- We want to have the `counts` as a function of `eventDate`, so link these columns to y and x respectively.\n",
    "- To create subplots in Seaborn, the usage of _facetting_ (splitting data sets to multiple facets) is used by linking a column name to the `row`/`col` parameter. \n",
    "- Using `height` and `aspect`, the figure size can be optimized.\n",
    "    \n",
    "</details>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Uncomment the next cell (calculates `month_evolution`, the intermediate result of the previous excercise):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis35.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plotting with seaborn:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis36.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 25**\n",
    "\n",
    "Plot the annual amount of occurrences for each of the 'taxa' as a function of time using Seaborn. Plot each taxa in a separate subplot and do not share the y-axis among the facets.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- Combine `resample` and `groupby`!\n",
    "- Check out the previous exercise for the plot function.\n",
    "- Pass the `sharey=False` to the `facet_kws` argument as a dictionary.\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis37.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis38.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE 26**\n",
    "\n",
    "The observations where taken by volunteers. You wonder on which day of the week the most observations where done. Calculate for each day of the week (`dayofweek`) the number of observations and make a bar plot.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "- Did you know the Python standard Library has a module `calendar` which contains names of week days, month names,...?\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbtutor-solution"
    ]
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case2_observations_analysis39.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Nice work!"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Nbtutor - export exercises",
  "jupytext": {
   "formats": "ipynb,md:myst"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.2"
  },
  "nav_menu": {},
  "toc": {
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 6,
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": true
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}