\n",
"\n",
"**EXERCISE 6**\n",
"\n",
"Use the `dropna()` method to find out:\n",
"\n",
"- For how many observations (rows) we have all the information available (i.e. no NaN values in any of the columns)?\n",
"- For how many observations (rows) we do have the `species` data available ?\n",
"\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis9.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 7**\n",
"\n",
"Filter the `survey_data_unique` data and select only those records that do not have a `species` while having information on the `sex`. Store the result as variable `not_identified`.\n",
"\n",
"
Hints
\n",
"\n",
"- To combine logical operators element-wise in Pandas, use the `&` operator.\n",
"- Pandas provides both a `isna()` and a `notna()` method to check the existence of `NaN` values.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis10.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"not_identified.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__NOTE!__\n",
"\n",
"The `DataFrame` we will use in the further analyses contains species information:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"survey_data = survey_data_unique.dropna(subset=['species']).copy()\n",
"survey_data['name'] = survey_data['genus'] + ' ' + survey_data['species']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**INFO**\n",
"\n",
"For biodiversity studies, absence values (knowing that something is not present) are useful as well to normalize the observations, but this is out of scope for these exercises.\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Select subsets of the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"survey_data['taxa'].value_counts()\n",
"#survey_data.groupby('taxa').size()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 8**\n",
"\n",
"- Select the observations for which the `taxa` is equal to 'Rabbit', 'Bird' or 'Reptile'. Assign the result to a variable `non_rodent_species`. Use the `isin` method for the selection.\n",
"\n",
"
Hints
\n",
"\n",
"- You do not have to combine three different conditions, but use the `isin` operator with a list of names.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis11.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(non_rodent_species)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 9**\n",
"\n",
"Select the observations for which the `name` starts with the characters 'r' (make sure it does not matter if a capital character is used in the 'taxa' name). Call the resulting variable `r_species`.\n",
"\n",
"
Hints
\n",
"\n",
"- Remember the `.str.` construction to provide all kind of string functionalities? You can combine multiple of these after each other.\n",
"- If the presence of capital letters should not matter, make everything lowercase first before comparing (`.lower()`)\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis12.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(r_species)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"r_species[\"name\"].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 10**\n",
"\n",
"Select the observations that are not Birds. Call the resulting variable
non_bird_species
.\n",
"\n",
"
Hints
\n",
"\n",
"- Logical operators like `==`, `!=`, `>`,... can still be used.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis13.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(non_bird_species)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 11**\n",
"\n",
"Select the __Bird__ (taxa is Bird) observations from 1985-01 till 1989-12 using the `eventDate` column. Call the resulting variable `birds_85_89`.\n",
"\n",
"
Hints
\n",
"\n",
"- No hints, you can do this! (with the help of some `<=` and `&`, and don't forget the put brackets around each comparison that you combine)\n",
"\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis14.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternative solution:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis15.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 12**\n",
"\n",
"- Drop the observations for which no 'weight' (`wgt` column) information is available.\n",
"- On the filtered data, compare the median weight for each of the species (use the `name` column)\n",
"- Sort the output from high to low median weight (i.e. descending)\n",
"\n",
"__Note__ You can do this all in a single line statement, but don't have to do it as such!\n",
"\n",
"
Hints
\n",
"\n",
"- You will need `dropna`, `groupby`, `median` and `sort_values`.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis16.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis17.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Species abundance"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 13**\n",
"\n",
"Which 8 species (use the `name` column to identify the different species) have been observed most over the entire data set?\n",
"\n",
"
Hints
\n",
"\n",
"- Pandas provide a function to combine sorting and showing the first n records, see [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.nlargest.html)...\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis18.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis19.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 14**\n",
"\n",
"- What is the number of different species (`name`) in each of the `verbatimLocality` plots? Use the `nunique` method. Assign the output to a new variable `n_species_per_plot`.\n",
"- Define a Matplotlib `Figure` (`fig`) and `Axes` (`ax`) to prepare a plot. Make an horizontal bar chart using Pandas `plot` function linked to the just created Matplotlib `ax`. Each bar represents the `species per plot/verbatimLocality`. Change the y-label to 'Plot number'.\n",
"\n",
"
Hints
\n",
"\n",
"- _...in each of the..._ should provide a hint to use `groupby` for this exercise. The `nunique` is the aggregation function for each of the groups.\n",
"- `fig, ax = plt.subplots()` prepares a Matplotlib Figure and Axes.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis20.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 15**\n",
"\n",
"- What is the number of plots (`verbatimLocality`) each of the species (`name`) have been observed in? Assign the output to a new variable `n_plots_per_species`. Sort the counts from low to high.\n",
"- Make an horizontal bar chart using Pandas `plot` function to show the number of plots each of the species was found (using the `n_plots_per_species` variable).\n",
"\n",
"
Hints
\n",
"\n",
"- Use the previous exercise to solve this one.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis21.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 16**\n",
"\n",
"- Starting from the `survey_data`, calculate the amount of males and females present in each of the plots (`verbatimLocality`). The result should return the counts for each of the combinations of `sex` and `verbatimLocality`. Assign to a new variable `n_plot_sex` and ensure the counts are in a column named \"count\".\n",
"- Use `pivot` to convert the `n_plot_sex` DataFrame to a new DataFrame with the `verbatimLocality` as index and `male`/`female` as column names. Assign to a new variable `pivoted`.\n",
"\n",
"
Hints
\n",
"\n",
"- _...for each of the combinations..._ `groupby` can also be used with multiple columns at the same time.\n",
"- If a `groupby` operation gives a Series as result, you can give that Series a name with the `.rename(..)` method.\n",
"- `reset_index()` is useful function to convert multiple indices into columns again.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis22.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis23.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pivoted.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To check, we can use the variable `pivoted` to plot the result:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pivoted.plot(kind='bar', figsize=(12, 6), rot=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 17**\n",
"\n",
"Recreate the previous plot with the `catplot` function from the Seaborn library starting from `n_plot_sex`.\n",
"\n",
"
Hints
\n",
"\n",
"- Check the `kind` argument of the `catplot` function to figure out to specify you want a barplot with given x and y values.\n",
"- To link a column to different colors, use the `hue` argument\n",
"- Using `height` and `aspect`, the figure size can be optimized.\n",
"\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis24.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 18**\n",
"\n",
"Recreate the previous plot with the `catplot` function from the Seaborn library directly starting from `survey_data`.\n",
"\n",
"
Hints
\n",
"\n",
"- Check the `kind`argument of the `catplot` function to find out how to use counts to define the bars instead of a `y` value.\n",
"- To link a column to different colors, use the `hue` argument\n",
"\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis25.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 19**\n",
"\n",
"- Make a summary table with the number of records of each of the species in each of the plots (also called `verbatimLocality`). Each of the species `name`s is a row index and each of the `verbatimLocality` plots is a column name.\n",
"- Using the Seaborn
documentation to make a heatmap.\n",
"\n",
"
Hints
\n",
"\n",
"- Make sure to pass the correct columns to respectively the `index`, `columns`, `values` and `aggfunc` parameters of the `pivot_table` function. You can use the `datasetName` to count the number of observations for each name/locality combination (when counting rows, the exact column doesn't matter).\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis26.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis27.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Observations over time"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 20**\n",
"\n",
"Make a plot visualizing the evolution of the number of observations for each of the individual __years__ (i.e. annual counts) using the `resample` method.\n",
"\n",
"
Hints
\n",
"\n",
"- You want to `resample` the data using the `eventDate` column to create annual counts. If the index is not a datetime-index, you can use the `on=` keyword to specify which datetime column to use.\n",
"- `resample` needs an aggregation function on how to combine the values within a single 'group' (in this case data within a year). In this example, we want to know the `size` of each group, i.e. the number of records within each year.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis28.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To evaluate the intensity or number of occurrences during different time spans, a heatmap is an interesting representation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 21**\n",
"\n",
"- Create a table, called `heatmap_prep`, based on the `survey_data` DataFrame with the row index the individual years, in the column the months of the year (1-> 12) and as values of the table, the counts for each of these year/month combinations.\n",
"- Using the seaborn
documentation, make a heatmap starting from the `heatmap_prep` variable.\n",
"\n",
"
Hints
\n",
"\n",
"- The `.dt` accessor can be used to get the `year`, `month`,... from a `datetime` column\n",
"- Use `pivot_table` and provide the years to `index` and the months to `columns`. Do not forget to `count` the number for each combination (`aggfunc`).\n",
"- Seaborn has an `heatmap` function which requires a short-form DataFrame, comparable to giving each element in a table a color value.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis29.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remark that we started from a `tidy` data format (also called *long* format) and converted to *short* format with in the row index the years, in the column the months and the counts for each of these year/month combinations as values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## (OPTIONAL SECTION) 6. Evolution of species during monitoring period"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*In this section, all plots can be made with the embedded Pandas plot function, unless specificly asked*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 22**\n",
"\n",
"Plot using Pandas `plot` function the number of records for `Dipodomys merriami` for each month of the year (January (1) -> December (12)), aggregated over all years.\n",
"\n",
"
Hints
\n",
"\n",
"- _...for each month of..._ requires `groupby`.\n",
"- `resample` is not useful here, as we do not want to change the time-interval, but look at month of the year (over all years)\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis30.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis31.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 23**\n",
"\n",
"Plot, for the species 'Dipodomys merriami', 'Dipodomys ordii', 'Reithrodontomys megalotis' and 'Chaetodipus baileyi', the monthly number of records as a function of time during the monitoring period. Plot each of the individual species in a separate subplot and provide them all with the same y-axis scale\n",
"\n",
"
Hints
\n",
"\n",
"- `isin` is useful to select from within a list of elements.\n",
"- `groupby` AND `resample` need to be combined. We do want to change the time-interval to represent data as a function of time (`resample`) and we want to do this _for each name/species_ (`groupby`). The order matters!\n",
"- `unstack` is a Pandas function a bit similar to `pivot`. Check the [unstack documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html) as it might be helpful for this exercise.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis32.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis33.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis34.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 24**\n",
"\n",
"Recreate the same plot as in the previous exercise using Seaborn `relplot` functon with the `month_evolution` variable.\n",
" \n",
"
Hints
\n",
"\n",
"- We want to have the `counts` as a function of `eventDate`, so link these columns to y and x respectively.\n",
"- To create subplots in Seaborn, the usage of _facetting_ (splitting data sets to multiple facets) is used by linking a column name to the `row`/`col` parameter. \n",
"- Using `height` and `aspect`, the figure size can be optimized.\n",
" \n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Uncomment the next cell (calculates `month_evolution`, the intermediate result of the previous excercise):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis35.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plotting with seaborn:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis36.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 25**\n",
"\n",
"Plot the annual amount of occurrences for each of the 'taxa' as a function of time using Seaborn. Plot each taxa in a separate subplot and do not share the y-axis among the facets.\n",
"\n",
"
Hints
\n",
"\n",
"- Combine `resample` and `groupby`!\n",
"- Check out the previous exercise for the plot function.\n",
"- Pass the `sharey=False` to the `facet_kws` argument as a dictionary.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis37.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis38.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**EXERCISE 26**\n",
"\n",
"The observations where taken by volunteers. You wonder on which day of the week the most observations where done. Calculate for each day of the week (`dayofweek`) the number of observations and make a bar plot.\n",
"\n",
"Hints
\n",
"\n",
"- Did you know the Python standard Library has a module `calendar` which contains names of week days, month names,...?\n",
"\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"nbtutor-solution"
]
},
"outputs": [],
"source": [
"# %load _solutions/case2_observations_analysis39.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nice work!"
]
}
],
"metadata": {
"celltoolbar": "Nbtutor - export exercises",
"jupytext": {
"formats": "ipynb,md:myst"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.2"
},
"nav_menu": {},
"toc": {
"navigate_menu": true,
"number_sections": true,
"sideBar": true,
"threshold": 6,
"toc_cell": false,
"toc_section_display": "block",
"toc_window_display": true
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}