{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<p><font size=\"6\"><b> CASE - CurieuzeNeuzen citizen science air quality data</b></font></p>\n",
    "\n",
    "\n",
    "> *DS Python for GIS and Geoscience*  \n",
    "> *October, 2020*\n",
    ">\n",
    "> *© 2020, Joris Van den Bossche and Stijn Van Hoey. Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "\n",
    "\n",
    "Air pollution remains a key environmental problem in an increasingly urbanized world. While concentrations of traffic-related pollutants like nitrogen dioxide (NO2) are known to vary over short distances, official monitoring networks remain inherently sparse, as reference stations are costly to construct and operate.\n",
    "\n",
    "The [**CurieuzeNeuzen**](https://curieuzeneuzen.be/curieuzeneuzen-vlaanderen-2018/) citizen science project collected a large, spatially distributed dataset that can complement official monitoring. In a first edition in 2016, in Antwerp, 2000 citizens were involved. This success was followed by a second edition in 2018 engaging 20.000 citizens across Flanders, a highly urbanized, industrialized and densely populated region in Europe. The participants measured the NO2 concentrations in front of their house using a low-cost sampler design (see picture below, where passive sampling tubes are attached using a panel to a window at the facade). \n",
    "\n",
    "Source: preprint paper at https://eartharxiv.org/repository/view/19/\n",
    "\n",
    "In this case study, we are going to make use of the data collected across Flanders in 2018: explore the data and investigate relationships with other variables.\n",
    "\n",
    "<img src=\"../img/CN_measurement_setup.png\" alt=\"Measurement panel\" style=\"width:800px\">\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import geopandas\n",
    "\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Importing and exploring the data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Read the csv file from `data/CN_Flanders_open_dataset.csv` into a DataFrame `df` and inspect the data.\n",
    "* How many measurements do we have?\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality1.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality2.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The dataset contains longitude/latitude columns of the measurement locations, the measured NO2 concentration, and in addition also a \"campaign\" column indicating the type of measurement location (and an internal \"code\", which we will ignore here).\n",
    "\n",
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Check the unique values of the \"campaign\" columns and how many occurrences those have.\n",
    "\n",
    "    \n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* A pandas Series has a `value_counts()` method that counts the unique values of the column.\n",
    "\n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality3.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "clear_cell": false,
    "deletable": true,
    "editable": true
   },
   "source": [
    "Most of the measurements are performed at the facade of the house or building of a participant. In addition, some measurement tubes were also placed next to reference monitoring stations of the VMM and in background locations (e.g. nature reserve or park).\n",
    "\n",
    "Let's now explore the measured NO2 concentrations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Calculate the overall average NO2 concentration of all locations (i.e. the mean of the \"no2\" column). \n",
    "* Calculate a combination of descriptive statistics of the NO2 concentration using the `describe()` method.\n",
    "* Make a histogram of the NO2 concentrations to get a visual idea of the distribution of the concentrations.\n",
    "    \n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* To calculate the mean of a column, we first need to select it: using the square bracket notation `df[colname]`\n",
    "* The average can be calculate with the `mean()` method\n",
    "* A histogram of a column can be plotted with the `hist()` method\n",
    "\n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality4.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality5.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "A histogram:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality6.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "A more expanded histogram (not asked in the exercise, but uncomment to check the code!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality7.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Determine the percentage of locations that exceed the EU and WHO yearly limit value of 40 µg/m³.\n",
    "\n",
    "Tip: first create a boolean mask determining whether the NO2 concentration is above 40 or not. Using this boolean mask, you can determine the percentage of values that follow the condition.\n",
    "    \n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* To know the fraction of `True` values in a boolean Series, we can use the `sum()` method, which is equivalent as counting the True values (True=1, False=0, so the sum is a count of the True values) and dividing by the total number of values. \n",
    "\n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality8.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality9.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "So overall in Flanders, around 2.3% of the measurement locations exceeded the limit value. This might not seem much, but assuming that the dataset is representative for the population of Flanders (and effort was done to ensure this), around 150,000 inhabitants live in a place where the annual NO2 concentration at the front door exceeds the EU legal threshold value.  \n",
    "We will also later see that this exceedance has a large spatial variation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* What is the average measured concentration at the background location? Calculate this by first selecting the appropriate subset of the dataframe.\n",
    "* More generally, what is the average measured concentration grouped on the \"Campaign\" type? \n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* To calculate a grouped statistic, use the `groupby()` method. Pass as argument the name of the column on which you want to group.\n",
    "* After the `groupby()` operation, we can (similarly as for a normal DataFrame) select a column and call the aggregation column (`df.groupby(\"class\")[\"variable\"].method()`).\n",
    "\n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality10.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality11.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The background locations (parks, nature reserves) clearly show a lower concentration than the average location. Note that the number of observations in each class are very skewed, so those averages are not necessarily representative!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Converting to a geospatial dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The provided data was a CSV file, and we explored it above as a pandas DataFrame. To further explore the dataset using the spatial aspects (point data), we will first convert it to a geopandas GeoDataFrame.\n",
    "\n",
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Convert `df` into a GeoDataFrame, using the 'lat'/'lon' columns to create a Point geometry column. Also specify the correct Coordinate Reference System with the `crs` keyword. Call the result `gdf`.\n",
    "* Do a quick check to see the result is correct: look at the first rows, and make a simple plot of the GeoDataFrame with `.plot()` (you should recognize the shape of Flanders, if not, something went wrong)\n",
    "\n",
    "    \n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* A GeoDataFrame can be created from an existing pandas.DataFrame by using the `geopandas.GeoDataFrame(...)` constructor. This constructor needs a `geometry=` keyword specifying either the name of the column that holds the geometries or either the geometry values.\n",
    "* GeoPandas provides a helper function to create Point geometry values from a array or column of x and y coordinates: `geopandas.points_from_xy(x_values, y_values)`.\n",
    "* Remember! The order of coordinates is (x, y), and for geographical coordinates this means the (lon, lat) order.\n",
    "* Remember! The Coordinate Reference System typically used for geographical lon/lat coordinates is \"EPSG:4326\" (WGS84).\n",
    "\n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality12.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality13.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality14.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Let's make that last plot a bit more informative:\n",
    "\n",
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Make a plot of the point locations of `gdf` and use the \"no2\" column to color the points.\n",
    "* Make the figure a bit larger by specifying the `figsize=` keyword.\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality15.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We can already notice some spatial patterns: higher concentrations (and also more measurement locations) in urban areas. But, the visualization above is not really a good way to visualize many points. There are many alternatives (e.g. heatmaps, hexbins, etc), but in this case study we are going to make a [*choropleth* map](https://en.wikipedia.org/wiki/Choropleth_map): choropleths are maps onto which an attribute, a non-spatial variable, is displayed by coloring a certain area. \n",
    "\n",
    "As the unit of area, we will use the municipalities of Flanders."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Combining with municipalities\n",
    "\n",
    "We downloaded the publicly available municipality reference from geopunt.be ([Voorlopig referentiebestand gemeentegrenzen, toestand 16/05/2018](https://www.geopunt.be/catalogus/datasetfolder/9ff44cc4-5f16-4507-81a6-6810958b14df)), and added the Shapefile with the borders to the course repo: `data/VRBG/Refgem.shp`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Read the Shapefile with the municipalities into a variable called `muni`.\n",
    "* Inspect the data and do a quick plot.\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality16.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality17.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality18.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now we have a dataset with the municipalities, we want to know for each of the measurement locations in which municipality it is located. This is a \"point-in-polygon\" spatial join.\n",
    "\n",
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "Before performing the spatial join, we need to ensure the two datasets are using the same Coordinate Reference System (CRS).\n",
    "\n",
    "* Check the CRS of both `gdf` and `muni`. What kind of CRS are they using? Are they the same?\n",
    "* Reproject the measurements to the Lambert 72 (EPSG:31370) reference system. Call the result `gdf_lambert`.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* The CRS of a GeoDataFrame can be checked by looking at the `crs` attribute.\n",
    "* To reproject a GeoDataFrame to another CRS, we can use the `to_crs()` method.\n",
    "\n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality19.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality20.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality21.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The EPSG:31370 or \"Belgian Lambert 72\" (https://epsg.io/31370) is the local, projected CRS most often used in Belgium. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Add the municipality information to the measurements dataframe. We are mostly interested in the \"NAAM\" column of the municipalities dataframe (the name of the municipality). Call the result `gdf_combined`.\n",
    "\n",
    "TODO hints\n",
    "    \n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* Joining the measurement locations with the municipality information can be done with the `geopandas.sjoin()` function. The first argument is the dataframe to which we want to add information, the second argument the dataframe with the additional information.\n",
    "* You can select a subset of columns to pass the the `sjoin()` function. \n",
    "\n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality22.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality23.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* What is the average measured concentration in each municipality?\n",
    "* Call the result `muni_mean`. Ensure we have a DataFrame with a NO2 columns and a column with the municipality name by calling the `reset_index()` method after the groupby operation.\n",
    "* Merge those average concentrations with the municipalities GeoDataFrame (note: those have a common column \"NAAM\"). Call the merged dataframe `muni_no2`, and check the first rows.\n",
    "    \n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* Something like `df.groupby(\"class\")[\"variable\"].mean()` returns a Series, with the group variable as the index. Calling `reset_index()` on the result then converts this into a DataFrame with 2 columns: the group variable and the calculated statistic.\n",
    "* Merging two dataframe that have a common column can be done with the `pd.merge()` function. \n",
    "\n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality24.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality25.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality26.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Make a choropleth of the municipalities using the average NO2 concentration as variable to color the municipality polygons.\n",
    "* Set the figure size to be (16, 5), and add a legend.\n",
    "    \n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* To specify which column to use to color the geometries in the plot, use the `column=` keyword of the `plot()` method.\n",
    "* The figure size can be specified with the `figsize=` keyword of the `plot()` method.\n",
    "* Pass `legend=True` to add a legend to the plot. The type of legend (continuous color bar, discrete legend) will be inferred from the plotted values.\n",
    "\n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality27.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "When specifying a numerical column to color the polygons, by default this results in a continuous color scale, as you can see above.\n",
    "\n",
    "However, it is very difficult for the human eye to process small differences in color in a continuous scale. \n",
    "Therefore, to create effective choropleths, we typically classify the values into a set of discrete groups.\n",
    "\n",
    "With GeoPandas' `plot()` method, you can control this with the `scheme` keyword (indicating which classification scheme to use, i.e how to divide the continuous range into a set of discrete classes) and the `k` keyword to indicate how many classes to use. This uses the [mapclassify](https://pysal.org/mapclassify/) package under the hood. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Starting from the previous figure, specify a classification scheme and a number of classes to use. Check the help of the `plot()` method to see the different options, and experiment with those.\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality28.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* What is the percentage of exceedance for each municipality? Repeat the same calculating we did earlier on the original dataset `df`, but now using `gdf_combined` and grouped per municipality.\n",
    "* Show the 10 municipalities with the highest percentage of exceedances.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* For showing the 10 highest values, we can either use the `sort_values()` sorting the highest values on top and showing the first 10 rows, or either use the `nlargest()` method as a short cut for this operation.\n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality29.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality30.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality31.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Combining with Land Use data\n",
    "\n",
    "The CORINE Land Cover (https://land.copernicus.eu/pan-european/corine-land-cover) is a program by the European Environment Agency (EEA) to provide an inventory of land cover in 44 classes of the European Union. The data is provided in both raster as vector format and with a resolution of 100m.\n",
    "\n",
    "The data for the whole of Europe can be downloaded from the website (latest version: https://land.copernicus.eu/pan-european/corine-land-cover/clc2018?tab=download). This is however a large dataset, so we downloaded the raster file and cropped it to cover Flanders, and this subset is included in the repo as `data/CLC2018_V2020_20u1_flanders.tif` (the code to do this cropping can be see at [INCLUDE LINK]).\n",
    "\n",
    "The air quality is indirectly linked to land use, as the presence of pollution sources will depend on the land use. Therefore, we will determine here the land use for each of the measurement locations based on the CORINE dataset and explore the relationship of the NO2 concentration and land use."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Open the land cover raster file (`data/CLC2018_V2020_20u1_flanders.tif`) with rasterio or xarray, inspect the metadata, and do a quick visualization.\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "With rasterio:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality32.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "With xarray:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality33.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality34.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The goal is now to to query from the raster file the value of the land cover class for each of the measurement locations. This can be done with the `rasterstats` package and with the point locations of our GeoDataFrame. But first, we need to ensure that both our datasets are using the same CRS. In this case, it's easiest to reproject the point locations to the CRS of the raster file.\n",
    "\n",
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* What is the EPSG code of the Coordinate Reference System (CRS) of the raster file? You can find this in the metadata inspected with rasterio or xarray above.\n",
    "* Reproject the point dataframe (`gdf`) to the CRS of the raster and assign this to a temporary variable `gdf_raster`.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* Reprojecting can be done with the `to_crs()` method of a GeoDataFrame, and the CRS can be specified in the form of \"EPSG:xxxx\".\n",
    "\n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality35.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Use the `rasterstats.point_query()` function to determine the value of the raster for each of the points in the dataframe. Remember to use `gdf_raster` for passing the geometries.\n",
    "* Because we have a raster file with discrete classes, ensure to pass `interpolate=\"nearest\"` (the default \"bilinear\" will result in floats with decimals, not preserving the integers representing discrete classes).\n",
    "* Assign the result to a new column \"land_use\" in `gdf`.\n",
    "* Perform a `value_counts()` on this new column do get a quick idea of the new values obtained from the raster file.\n",
    "\n",
    "Note that the query operation can take a while. Don't worry if it runs for around 20 seconds!\n",
    "    \n",
    "    \n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* Don't forget to first import the `rasterstats` package.\n",
    "* The `point_query()` function takes as first argument the point geometries (this can be passed as a GeoSeries), and as second argument the path to the raster file (this will be opened by `rasterio` under the hood).\n",
    "\n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality36.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality37.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality38.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "As you can see, we have obtained a large variety of land cover classes. To make this more practical to work with, we a) want to convert the numbers into a class name, and b) reduce the number of classes.\n",
    "\n",
    "The full hierarchy (with 3 levels) of the 44 classes can be seen at https://wiki.openstreetmap.org/wiki/Corine_Land_Cover. For keeping the exercise a bit practical here, we prepared a simplified set of classes and provided a csv file with this information.  \n",
    "\n",
    "This has a column \"value\" corresponding to the values used in the raster file, and a \"group\" column with the simplified classes that we will use for this exercise.\n",
    "\n",
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Read the `\"data/CLC2018_V2018_legend_grouped.csv\"` as a dataframe and call it `legend`.\n",
    "\n",
    "The additional steps, provided for you, use this information to convert the column of integer land use classes to a column with the simplified names. After that, we again use `value_counts()` on this new column, and we can see that we now have less classes.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* Reading a csv file can be done with the `pandas.read_csv()` function.\n",
    "\n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality39.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Convert the \"land_use\" integer values to a column with land use class names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "value_to_group = dict(zip(legend['value'], legend['group']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "gdf['land_use_class'] = gdf['land_use'].replace(value_to_group)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "gdf['land_use_class'].value_counts()"
   ]
  },
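  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A side note on the conversion above: `replace()` leaves values that are missing from the dictionary unchanged, whereas `map()` turns them into missing values, which can make gaps in a legend easier to spot. The sketch below illustrates this on a small made-up legend (the values, the group names and the `_demo` variables are hypothetical, not the actual content of the csv file):\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# hypothetical legend: raster value -> simplified group name\n",
    "legend_demo = pd.DataFrame({\n",
    "    'value': [1, 2, 12],\n",
    "    'group': ['urban', 'urban', 'agricultural'],\n",
    "})\n",
    "value_to_group_demo = dict(zip(legend_demo['value'], legend_demo['group']))\n",
    "\n",
    "# codes as they could come out of a raster query; 99 is not in the legend\n",
    "codes = pd.Series([1, 12, 2, 99])\n",
    "\n",
    "replaced = codes.replace(value_to_group_demo)  # 99 is kept as is\n",
    "mapped = codes.map(value_to_group_demo)        # 99 becomes a missing value\n",
    "\n",
    "print(replaced.tolist())\n",
    "print(mapped.isna().sum())\n",
    "```\n",
    "\n",
    "Which of the two is preferable depends on whether unmapped codes should stay visible as numbers or be flagged as missing."
   ]
  },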
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now we have the land use data, let's explore the air quality in relation to this land use.\n",
    "\n",
    "We can see in the `value_counts` above that we have a few classes with only very few observations, though (<50). Calculating statistics for those small groups is not very reliable, so lets leave them out for this exercises (note, this is not necessarily the best strategy in real life! Amongst others, we could also inspect those points and re-assign to a dominant land use class in the surrounding region).\n",
    "\n",
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Assign the value counts of the \"land_use_class\" column to a variable `counts`. \n",
    "* Using \"counts\", we can determine which classes occur more then 50 times. \n",
    "* Using those frequent classes, filter the `gdf` to only include observations from those classes, and call this `subset`.\n",
    "* Based on this subset, calculate the average NO2 concentration per land use class.\n",
    "\n",
    "TODO hints\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* \n",
    "\n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "counts = gdf['land_use_class'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "frequent_categories = counts[counts > 50].index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "subset = gdf[gdf[\"land_use_class\"].isin(frequent_categories)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "subset.groupby(\"land_use_class\")['no2'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Using the `seaborn` package and the `subset` DataFrame, make a boxplot of the NO2 concentration, splitted per land use class.\n",
    "\n",
    "Don't forget to first import `seaborn`. We can use the `seaborn.boxplot()` function (check the help of this function to see which keywords to specify).\n",
    "    \n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* With the `seaborn.boxplot()`, we can specify the `data=` keyword to pass the DataFrame from which to plot values, and the `x=` and `y=` keywords to specify which columns of the DataFrame to use.\n",
    "\n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality40.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality41.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Tweaking the figure a bit more (not asked in the exercise, but uncomment and run to see):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality42.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The dense urban areas and areas close to large roads clearly have higher concentrations on average. On the countryside (indicated by \"agricultural\" areas) much lower concentrations are observed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## A focus on Gent\n",
    "\n",
    "Let's now focus on the measurements in Ghent. We first get the geometry of Gent from the municipalities dataframe, so we can use this to filter the measurements:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "gent = muni[muni[\"NAAM\"] == \"Gent\"].geometry.item()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "gent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "For the exercise here, we don't want to select just the measurements in the municipality of Gent, but those in the region. For this, we will create a bounding box of Gent:\n",
    "    \n",
    "* Create a new Shapely geometry, `gent_region`, that defines the bounding box of the Gent municipality.\n",
    "\n",
    "Check this section of the Shapely docs (https://shapely.readthedocs.io/en/latest/manual.html#constructive-methods) or experiment yourself to see which attribute is appropriate to use here (`convex_hull`, `envolope`, and `minimum_rotated_rectangle` all create a new polygon encompassing the original).\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality43.py"
   ]
  },
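  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get a feeling for how the three attributes mentioned above differ, the sketch below applies them to a small L-shaped polygon (a hypothetical shape, not the actual Gent boundary) and compares the resulting areas:\n",
    "\n",
    "```python\n",
    "from shapely.geometry import Polygon\n",
    "\n",
    "# hypothetical L-shaped polygon standing in for a municipality boundary\n",
    "shape = Polygon([(0, 0), (4, 0), (4, 1), (1, 1), (1, 3), (0, 3)])\n",
    "\n",
    "bbox = shape.envelope                      # axis-aligned bounding box\n",
    "hull = shape.convex_hull                   # smallest convex polygon around the shape\n",
    "rotated = shape.minimum_rotated_rectangle  # smallest rectangle, any orientation\n",
    "\n",
    "print(shape.bounds)  # (minx, miny, maxx, maxy)\n",
    "print(shape.area, hull.area, rotated.area, bbox.area)\n",
    "```\n",
    "\n",
    "The axis-aligned box shares its bounds with the original shape, while the convex hull follows the outline more tightly."
   ]
  },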
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Using the `gent_region` shape, create a subset the measurements dataframe with those measurements located within Gent. Call the result `gdf_gent`.\n",
    "* How many measurements are left in the subset?\n",
    "    \n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* Ensure to use `gdf_lambert` and not `gdf`, since the `gent` shape is extracted from the `muni` dataframe with EPSG:31370 CRS.\n",
    "* To check for a series of points whether they are located in a given polygon, use the `within()` method.\n",
    "* Use the resulting boolean Series to mask the original `gdf` GeoDataFrame using boolean indexing (`[]` notation).\n",
    "    \n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality44.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality45.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Alternatively, we can also use the `geopandas.clip()` function. For points there is not much difference with the method above, but for lines or polygons, it will actually \"clip\" the geometries, i.e. removing the parts that fall outside of the specified region (in addition, this method also uses a spatial index under the hood and will typically be faster for large datasets):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "gdf_gent = geopandas.clip(gdf_lambert, gent_region)\n",
    "len(gdf_gent)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Make a visualization of the measurements in Gent. Use contextily to add a background basemap.\n",
    "\n",
    "TODO hints\n",
    "    \n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* \n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality46.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality47.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Further zooming in on the city center (not asked in exercise, but uncomment and run to see):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality48.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Combining with OpenStreetMap information\n",
    "\n",
    "We downloaded and filtered OpenStreetMap data for the area of Gent, focusing on the street network information, and provided this as a GeoPackage file (`data/osm_network_gent.gpkg`, see this [notebook](./data/preprocess_data.ipynb) to check the code to download and preprocess the raw OSM data).\n",
    "\n",
    "The OpenStreetMap street network data includes information about the type of street in the \"highway\" tag. We can use this as a proxy for traffic intensity of the street, and relate that to the measured NO2 concentration."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Read the `\"data/osm_network_gent.gpkg\"` file into a `streets` variable, and check the first rows.\n",
    "* Convert the data to the appropriate CRS to combine with the `gdf_gent` data, if needed.\n",
    "* Make a quick plot to explore the data.\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality49.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality50.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality51.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality52.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "(Note: for interactively exploring such data, there are better solutions as the GeoPandas `.plot()` method, such as opening the data in QGIS, or using an interactive visualization library)\n",
    "\n",
    "To relate the measured concentration with the road type, we want to determine for each location at what type of street it is located. Since the measurements are not exactly located on one of the lines, we are going to look at the closest street for each location."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Calculate the distance between the point (defined below) and all streets. And what is the minimum distance?\n",
    "* Use the `idxmin()` method to know get the label of which row contains the minimum distance.\n",
    "* Using the result of `idxmin()`, we can get the row or the value in the \"highway\" column that corresponds to the street that is closest to `point`.\n",
    "\n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* The distance method of a Shapely geometry does not accept a GeoSeries of geometries, only a single other geometry (so `point.distance(series)` does not work). However, the distance method is commutative, so you can always switch the order to use the distance method of the GeoSeries (`series.distance(point)` does work).\n",
    "* Given a row label, you can get the value of a Series/column with `s[label]`, or the row of a DataFrame with `df.loc[label]`. With both row label and column name, you can get the corresponding value of a DataFrame with `df.loc[label, column_name]`.\n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We take the first point geometry in the Gent dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "point = gdf_gent[\"geometry\"].iloc[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "point"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality53.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality54.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality55.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality56.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality57.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality58.py"
   ]
  },
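  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The steps of this exercise can be summarized on a toy example. The sketch below uses a made-up network of three straight streets and a single point (all names ending in `_demo` and all coordinates are hypothetical), so that the distances are easy to check by hand:\n",
    "\n",
    "```python\n",
    "import geopandas\n",
    "from shapely.geometry import LineString, Point\n",
    "\n",
    "# hypothetical mini street network with planar coordinates (think: meters)\n",
    "streets_demo = geopandas.GeoDataFrame(\n",
    "    {'highway': ['primary', 'residential', 'footway']},\n",
    "    geometry=[\n",
    "        LineString([(0, 0), (100, 0)]),\n",
    "        LineString([(0, 50), (100, 50)]),\n",
    "        LineString([(0, 200), (100, 200)]),\n",
    "    ],\n",
    ")\n",
    "point_demo = Point(20, 40)\n",
    "\n",
    "# distance from the point to every street (method of the GeoDataFrame)\n",
    "dist = streets_demo.distance(point_demo)\n",
    "\n",
    "# row label of the closest street, and the road type found in that row\n",
    "idx_closest = dist.idxmin()\n",
    "print(dist.min())\n",
    "print(streets_demo.loc[idx_closest, 'highway'])\n",
    "```\n",
    "\n",
    "Here the point lies 40 units from the first street and 10 units from the second, so the second row is selected."
   ]
  },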
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We now want to repeat the above analysis for each measurement location. So let's start with writing a reusable function.\n",
    "\n",
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Define a function `closest_road_type` that accepts a single point and the streets dataframe, and returns the class of the closest street.\n",
    "* Check that the function works by using it on the `point` defined above.\n",
    "\n",
    "As help, you can start from this skeleton:\n",
    "    \n",
    "```python\n",
    "def closest_road_type(point, streets):\n",
    "    # determine \"highway\" tag of the closest street\n",
    "    idx_closest = ...\n",
    "    ...\n",
    "    return ... \n",
    "```\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality59.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality60.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now we can apply this function to each of the point locations. However, with this brute force method, applying it as is using `gdf_gent` and `streets` takes quite some time. We can speed up the distance calculation by reducing the number of linestrings in the `streets` dataframe to compare with.\n",
    "\n",
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Create a `streets_unioned` dataframe with a single line per road type (a union of all lines per road type). Check the `dissolve()` method for this.\n",
    "* Repeat the `apply` call, but now on all points and using `streets_unioned` instead of `streets`.\n",
    "* Assign the result of the apply to a new columns \"road_type\".\n",
    "* Do a value counts of this new column.\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "When running this, you can see it already takes a bit of time, even for the first 20 rows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "%time gdf_gent.geometry.head(20).apply(lambda point: closest_road_type(point, streets))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality61.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality62.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality63.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality64.py"
   ]
  },
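  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What `dissolve()` does to the street network can be seen on a toy example. The sketch below uses a made-up set of three streets (the `_demo` names, the street names and the coordinates are hypothetical) and merges them into one geometry per road type:\n",
    "\n",
    "```python\n",
    "import geopandas\n",
    "from shapely.geometry import LineString\n",
    "\n",
    "# hypothetical streets: one primary road and two residential streets\n",
    "streets_demo = geopandas.GeoDataFrame(\n",
    "    {\n",
    "        'highway': ['primary', 'residential', 'residential'],\n",
    "        'name': ['A', 'B', 'C'],\n",
    "    },\n",
    "    geometry=[\n",
    "        LineString([(0, 0), (100, 0)]),\n",
    "        LineString([(0, 50), (100, 50)]),\n",
    "        LineString([(0, 80), (100, 80)]),\n",
    "    ],\n",
    ")\n",
    "\n",
    "# one (multi-)line per road type; the road types become the index\n",
    "unioned_demo = streets_demo.dissolve(by='highway')\n",
    "\n",
    "print(len(streets_demo), len(unioned_demo))\n",
    "print(unioned_demo.geometry.geom_type)\n",
    "```\n",
    "\n",
    "Note that the road type ends up in the index of the result. If a function expects it as a regular column, passing `as_index=False` to `dissolve()` (or calling `reset_index()` afterwards) keeps it as one."
   ]
  },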
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "**Note!** We have been using a brute-force search for the closest street by calculating for each point the distance to all streets. This is a good exercise to learn the syntax, but there are however better methods for such \"nearest\" queries. See eg https://automating-gis-processes.github.io/site/notebooks/L3/nearest-neighbor-faster.html"
   ]
  },
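  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One such alternative is a nearest-neighbour spatial join. The sketch below assumes a recent GeoPandas version (0.10 or later, which is newer than the version this notebook was written for) that provides `geopandas.sjoin_nearest()`, and uses made-up points and streets (the `_demo` names, values and coordinates are hypothetical):\n",
    "\n",
    "```python\n",
    "import geopandas\n",
    "from shapely.geometry import LineString, Point\n",
    "\n",
    "# hypothetical streets and measurement points with planar coordinates\n",
    "streets_demo = geopandas.GeoDataFrame(\n",
    "    {'highway': ['primary', 'residential']},\n",
    "    geometry=[\n",
    "        LineString([(0, 0), (100, 0)]),\n",
    "        LineString([(0, 50), (100, 50)]),\n",
    "    ],\n",
    ")\n",
    "points_demo = geopandas.GeoDataFrame(\n",
    "    {'no2': [35.0, 18.0]},\n",
    "    geometry=[Point(20, 10), Point(60, 45)],\n",
    ")\n",
    "\n",
    "# for every point, attach the attributes of the nearest street\n",
    "joined = geopandas.sjoin_nearest(points_demo, streets_demo, distance_col='dist')\n",
    "joined = joined.sort_index()\n",
    "\n",
    "print(joined[['no2', 'highway', 'dist']])\n",
    "```\n",
    "\n",
    "This avoids the Python-level loop over all points, since the nearest street is looked up through a spatial index."
   ]
  },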
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "There are some uncommon categories. For the remainder of this demo, let's group some related categories, and filter out some others.\n",
    "    \n",
    "* Using the defined mapping, replace some values with \"pedestrian\" in the `\"road_type\"` column.\n",
    "* Using the defined subset of categories, create a subset of `gdf_gent` where the road type \"is in\" this subset of categories (look at the pandas `isin()` ([doc link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html)) method). Call the result `subset`, and do again a value counts to check the result.\n",
    "    \n",
    "<details><summary>Hints</summary>\n",
    "\n",
    "* The `replace()` method can be called on a column. Pass a dictionary to this method, and the keys of the dictionary present in the column will be replaces with the corresponding values defined in the dictionary.\n",
    "* The `Series.isin()` method accepts a list of values, and will return a boolean Series indicating for each element of the original Series whether it is present in the list of values or not. \n",
    "* The boolean Series can then be used to filter the original `gdf_gent` dataframe using boolean indexing.\n",
    "</details>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Replace categories:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "mapping = {\n",
    "    \"footway\": \"pedestrian\",\n",
    "    \"living_street\": \"pedestrian\",\n",
    "    \"path\": \"pedestrian\",\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality65.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Filter categories:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "categories = [\"primary\", \"secondary\", \"tertiary\", \"residential\", \"pedestrian\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality66.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality67.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Calculate the average measured concentration depending on the type of the road next to the measurement location.\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality68.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "\n",
    "**EXERCISE**:\n",
    "\n",
    "* Similarly, make a plot with `seaborn` of those results. Specify the `categories` as the order to use for the plot.\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality69.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# %load _solutions/case-curieuzeneuzen-air-quality70.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "This analysis confirms that the NO2 concentration is clearly related to traffic!"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Nbtutor - export exercises",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}