{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Spatial joins" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Goals of this notebook:\n", "\n", "- Based on the `countries` and `cities` dataframes, determine for each city the country in which it is located.\n", "- To solve this problem, we will use the concept of a 'spatial join' operation: combining information from geospatial datasets based on their spatial relationship." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import pandas as pd\n", "import geopandas\n", "\n", "pd.options.display.max_rows = 10" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries = geopandas.read_file(\"zip://./data/ne_110m_admin_0_countries.zip\")\n", "cities = geopandas.read_file(\"zip://./data/ne_110m_populated_places.zip\")\n", "rivers = geopandas.read_file(\"zip://./data/ne_50m_rivers_lake_centerlines.zip\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recap - joining dataframes\n", "\n", "Pandas provides functionality to join or merge dataframes in different ways, see https://chrisalbon.com/python/data_wrangling/pandas_join_merge_dataframe/ for an overview and https://pandas.pydata.org/pandas-docs/stable/merging.html for the full documentation." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate the concept of joining the information of two dataframes with pandas, let's take a small subset of our `cities` and `countries` datasets: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cities2 = cities[cities['name'].isin(['Bern', 'Brussels', 'London', 'Paris'])].copy()\n", "cities2['iso_a3'] = ['CHE', 'BEL', 'GBR', 'FRA']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cities2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries2 = countries[['iso_a3', 'name', 'continent']]\n", "countries2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We added an 'iso_a3' column to the `cities` dataset, holding the code of the country in which each city lies. This country code is also present in the `countries` dataset, which allows us to merge those two dataframes based on the common column.\n", "\n", "Joining the `cities` dataframe with `countries` will transfer extra information about the countries (the full name, the continent) to the `cities` dataframe, based on a common key:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cities2.merge(countries2, on='iso_a3')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**But** for this illustrative example we added the common column manually; it is not present in the original dataset. However, we can still determine how to join those two datasets based on their spatial coordinates." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recap - spatial relationships between objects\n", "\n", "In the previous notebook [02-spatial-relationships-operations.ipynb](./02-spatial-relationships-operations.ipynb), we have seen the notion of spatial relationships between geometry objects: within, contains, intersects, ...\n", "\n", "In this case, we know that each of the cities is located *within* one of the countries, or the other way around that each country can *contain* multiple cities.\n", "\n", "We can test such relationships using the methods we have seen in the previous notebook:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "france = countries.loc[countries['name'] == 'France', 'geometry'].squeeze()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cities.within(france)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above gives us a boolean series, indicating for each point in our `cities` dataframe whether it is located within the area of France or not. \n", "Because the result is a boolean series, we can use it to filter the original dataframe to only show those cities that are actually within France:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cities[cities.within(france)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could now repeat the above analysis for each of the countries, and add a column to the `cities` dataframe indicating this country. However, that would be tedious to do manually, and it is also exactly what the spatial join operation provides us.\n", "\n", "*(note: the above result is incorrect, but this is just because of the coarseness of the countries dataset)*" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Spatial join operation\n", "\n", "
\n", " \n", "**SPATIAL JOIN** = *transferring attributes from one layer to another based on their spatial relationship*

\n", "\n", "\n", "The different parts of this operation:\n", "\n", "* The GeoDataFrame to which we want to add information\n", "* The GeoDataFrame that contains the information we want to add\n", "* The spatial relationship we want to use to match both datasets ('intersects', 'contains', 'within')\n", "* The type of join: left or inner join\n", "\n", "\n", "![](img/illustration-spatial-join.svg)\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "In this case, we want to join the `cities` dataframe with the information of the `countries` dataframe, based on the spatial relationship between both datasets.\n", "\n", "We use the [`geopandas.sjoin`](http://geopandas.readthedocs.io/en/latest/reference/geopandas.sjoin.html) function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "joined = geopandas.sjoin(cities, countries, op='within', how='left')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "joined" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "joined['continent'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's practice!\n", "\n", "We will again use the Paris datasets to do some exercises. Let's start by importing them again:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "districts = geopandas.read_file(\"data/paris_districts_utm.geojson\")\n", "stations = geopandas.read_file(\"data/paris_sharing_bike_stations_utm.geojson\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: Make a plot of the density of bike stations by district\n", "

\n", "

\n", "

\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/03-spatial-joins1.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/03-spatial-joins2.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/03-spatial-joins3.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/03-spatial-joins4.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/03-spatial-joins5.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/03-spatial-joins6.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/03-spatial-joins7.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The overlay operation\n", "\n", "In the spatial join operation above, we are not changing the geometries themselves. We are not joining geometries, but joining attributes based on a spatial relationship between the geometries. This also means that the geometries need to at least overlap partially.\n", "\n", "If you want to create new geometries by combining the geometries of different dataframes into one new dataframe (e.g. by taking the intersection of the geometries), you want an **overlay** operation." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "africa = countries[countries['continent'] == 'Africa']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "africa.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cities['geometry'] = cities.buffer(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "geopandas.overlay(africa, cities, how='difference').plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "REMEMBER
\n", "\n", "* **Spatial join**: transfer attributes from one dataframe to another based on the spatial relationship\n", "* **Spatial overlay**: construct new geometries based on a spatial operation between both dataframes (and combine the attributes of both dataframes)\n", "\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }