{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to geospatial vector data in Python" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import pandas as pd\n", "import geopandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing geospatial data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Geospatial data is often available from specific GIS file formats or data stores, like ESRI shapefiles, GeoJSON files, geopackage files, PostGIS (PostgreSQL) database, ...\n", "\n", "We can use the GeoPandas library to read many of those GIS file formats (relying on the `fiona` library under the hood, which is an interface to GDAL/OGR), using the `geopandas.read_file` function.\n", "\n", "For example, let's start by reading a shapefile with all the countries of the world (adapted from http://www.naturalearthdata.com/downloads/110m-cultural-vectors/110m-admin-0-countries/, zip file is available in the `/data` directory), and inspect the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries = geopandas.read_file(\"zip://./data/ne_110m_admin_0_countries.zip\")\n", "# or if the archive is unpacked:\n", "# countries = geopandas.read_file(\"data/ne_110m_admin_0_countries/ne_110m_admin_0_countries.shp\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What can we observe:\n", "\n", "- Using `.head()` we can see the first rows of the dataset, just like we can do with Pandas.\n", "- There is a 'geometry' column and the different countries are represented as polygons\n", "- We can use the `.plot()` method to quickly get a *basic* 
visualization of the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What's a GeoDataFrame?\n", "\n", "We used the GeoPandas library to read in the geospatial data, and this returned a `GeoDataFrame`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(countries)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A GeoDataFrame contains a tabular, geospatial dataset:\n", "\n", "* It has a **'geometry' column** that holds the geometry information (or features in GeoJSON).\n", "* The other columns are the **attributes** (or properties in GeoJSON) that describe each of the geometries.\n", "\n", "Such a `GeoDataFrame` is just like a pandas `DataFrame`, but with some additional functionality for working with geospatial data:\n", "\n", "* A `.geometry` attribute that always returns the column with the geometry information (returning a GeoSeries). The column name itself does not necessarily need to be 'geometry', but it will always be accessible as the `.geometry` attribute.\n", "* It has some extra methods for working with spatial data (area, distance, buffer, intersection, ...), which we will see in later notebooks." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries.geometry" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(countries.geometry)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries.geometry.area" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**It's still a DataFrame**, so we have all the pandas functionality available to use on the geospatial dataset, and to do data manipulations with the attributes and geometry information together.\n", "\n", "For example, we can calculate the average population over all countries (by accessing the 'pop_est' column, and calling the `mean` method on it):" ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries['pop_est'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, we can use boolean filtering to select a subset of the dataframe based on a condition:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "africa = countries[countries['continent'] == 'Africa']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "africa.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "The rest of the tutorial assumes you already know some pandas basics, but we will try to give hints along the way for those who are not familiar with it. \n", "A few resources in case you want to learn more about pandas:\n", "\n", "- Pandas docs: https://pandas.pydata.org/pandas-docs/stable/10min.html\n", "- Other tutorials: the pandas chapter in https://jakevdp.github.io/PythonDataScienceHandbook/, https://github.com/jorisvandenbossche/pandas-tutorial, https://github.com/TomAugspurger/pandas-head-to-tail, ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**REMEMBER:**
\n", "\n", "* A `GeoDataFrame` lets you perform typical tabular data analysis together with spatial operations\n", "* A `GeoDataFrame` (or *Feature Collection*) consists of:\n", "    * **Geometries** or **features**: the spatial objects\n", "    * **Attributes** or **properties**: columns with information about each spatial object\n", "\n", "
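As a rough illustration of the Feature Collection idea, the sketch below writes a tiny GeoJSON-style collection by hand and flattens it into table rows in plain Python. The feature values and the names `feature_collection` and `rows` are made up for this example; it only mimics the layout of a `GeoDataFrame` and is not how GeoPandas stores data internally.

```python
# A hand-written GeoJSON-style Feature Collection (illustrative values only).
feature_collection = {
    'type': 'FeatureCollection',
    'features': [
        {
            'type': 'Feature',
            'geometry': {'type': 'Point', 'coordinates': [1.0, 1.0]},
            'properties': {'name': 'A', 'value': 1},
        },
        {
            'type': 'Feature',
            'geometry': {'type': 'Point', 'coordinates': [2.0, 2.0]},
            'properties': {'name': 'B', 'value': 2},
        },
    ],
}

# Each feature becomes one row: the geometry goes into one column,
# and every property becomes an attribute column.
rows = [
    {'geometry': feature['geometry'], **feature['properties']}
    for feature in feature_collection['features']
]

columns = sorted(rows[0])
print(columns)
for row in rows:
    print(row['name'], row['value'], row['geometry']['coordinates'])
```

Every row ends up with one geometry entry plus one entry per property, which is the geometry column / attribute columns split described above. For real data, GeoPandas can do this conversion itself, for instance with `geopandas.GeoDataFrame.from_features`.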
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Geometries: Points, Linestrings and Polygons\n", "\n", "Spatial **vector** data can consist of different types, and the 3 fundamental types are:\n", "\n", "![](img/simple_features_3_text.svg)\n", "\n", "* **Point** data: represents a single point in space.\n", "* **Line** data (\"LineString\"): represents a sequence of points that form a line.\n", "* **Polygon** data: represents a filled area.\n", "\n", "Each of them can also be combined into multi-part geometries (see https://shapely.readthedocs.io/en/stable/manual.html#geometric-objects for an extensive overview)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the example we have seen so far, the individual geometry objects are Polygons:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(countries.geometry[2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's import some other datasets with different types of geometry objects.\n", "\n", "A dataset of cities in the world (adapted from http://www.naturalearthdata.com/downloads/110m-cultural-vectors/110m-populated-places/, zip file is available in the `/data` directory), consisting of Point data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cities = geopandas.read_file(\"zip://./data/ne_110m_populated_places.zip\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(cities.geometry[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And a dataset of rivers in the world (from http://www.naturalearthdata.com/downloads/50m-physical-vectors/50m-rivers-lake-centerlines/, zip file is available in the `/data` directory) where each river is a (multi-)line:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rivers = 
geopandas.read_file(\"zip://./data/ne_50m_rivers_lake_centerlines.zip\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(rivers.geometry[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The `shapely` library\n", "\n", "The individual geometry objects are provided by the [`shapely`](https://shapely.readthedocs.io/en/stable/) library" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(countries.geometry[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To construct one ourselves:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from shapely.geometry import Point, Polygon, LineString" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p = Point(0, 0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(p)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "polygon = Polygon([(1, 1), (2,2), (2, 1)])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "polygon.area" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "polygon.distance(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**REMEMBER**:
\n", "\n", "Single geometries are represented by `shapely` objects:\n", "\n", "* If you access a single geometry of a GeoDataFrame, you get a shapely geometry object.\n", "* Those objects have similar functionality to geopandas objects (GeoDataFrame/GeoSeries). For example:\n", "    * `single_shapely_object.distance(other_point)` -> distance between two points\n", "    * `geodataframe.distance(other_point)` -> distance for each point in the geodataframe to the other point\n", "\n", "
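The difference between the single-object call and the element-wise call can be mimicked in plain Python. This is only an analogy, assuming planar (x, y) tuples: the helper `distance` and the sample points are hypothetical and are not part of shapely or GeoPandas.

```python
import math


def distance(a, b):
    # Euclidean distance between two (x, y) tuples in planar coordinates.
    return math.hypot(b[0] - a[0], b[1] - a[1])


other_point = (0.0, 0.0)
points = [(3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]

# Analogue of single_shapely_object.distance(other_point): one number.
single = distance(points[0], other_point)

# Analogue of geodataframe.distance(other_point): one number per row.
each = [distance(pt, other_point) for pt in points]

print(single)
print(each)
```

The single call returns one value, while the element-wise version returns as many values as there are geometries, in the same order as the rows.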
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting our different layers together" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = countries.plot(edgecolor='k', facecolor='none', figsize=(15, 10))\n", "rivers.plot(ax=ax)\n", "cities.plot(ax=ax, color='red')\n", "ax.set(xlim=(-20, 60), ylim=(-40, 40))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See the [04-more-on-visualization.ipynb](04-more-on-visualization.ipynb) notebook for more details on visualizing geospatial datasets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's practice!\n", "\n", "For the exercises, we are going to use some data about the city of Paris:\n", "\n", "- The administrative districts of Paris (https://opendata.paris.fr/explore/dataset/quartier_paris/): `paris_districts_utm.geojson`\n", "- A snapshot of the real-time information (as of the moment the data was downloaded) about the public bicycle sharing system in Paris (vélib, https://opendata.paris.fr/explore/dataset/stations-velib-disponibilites-en-temps-reel/information/): `paris_sharing_bike_stations_utm.geojson`\n", "\n", "Both datasets are provided as GeoJSON files.\n", "\n", "Let's explore those datasets:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**EXERCISE**:\n", "\n", "* Read the two datasets into GeoDataFrames called `districts` and `stations`.\n", "* Check the type of the returned objects (with `type(..)`).\n", "* Check the first rows of both dataframes. What kind of geometries do those datasets contain? \n", " \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/01-introduction-geospatial-data1.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/01-introduction-geospatial-data2.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/01-introduction-geospatial-data3.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/01-introduction-geospatial-data4.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**EXERCISE**:\n", "\n", "* Make a plot of the `districts` dataset.\n", "* Set the figure size to (12, 6) (hint: the `plot` method accepts a figsize keyword).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/01-introduction-geospatial-data5.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**EXERCISE**:\n", "\n", "* Make a plot of the `stations` dataset (also with a (12, 6) figsize).\n", "* Use the `'available_bikes'` column to determine the color of the points. For this, use the `column=` keyword.\n", "* Use the `legend=True` keyword to show a color bar.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/01-introduction-geospatial-data6.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**EXERCISE**:\n", "\n", "* Visualize the `stations` and `districts` datasets together on a single plot (with a figsize of (20, 10)).\n", "* Use a grey color for the `districts` dataset with an alpha of 0.5, but use black lines (tip: `edgecolor`).\n", "* You can use `ax.set_axis_off()` to remove the axis (tick) labels.\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/01-introduction-geospatial-data7.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE:\n", "

\n", "

\n", "

\n", "
Hint: You can find the location of the largest value with `.idxmax()`
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/01-introduction-geospatial-data8.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/01-introduction-geospatial-data9.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**EXERCISE**:\n", "\n", "* Make a histogram showing the distribution of the number of bike stands in the stations.\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/01-introduction-geospatial-data10.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**EXERCISE**:\n", "\n", "* Add a column `'population_density'` representing the number of inhabitants per square kilometer (Note: the area is given in square meters, so you will need to multiply the result by `10**6`).\n", "* Plot the districts using the `'population_density'` column to color the polygons.\n", "\n", "
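The unit conversion in the note can be checked with a small worked example. The population and area figures below are invented for illustration and are not taken from the Paris dataset.

```python
# Hypothetical district: 10 000 inhabitants on 2 500 000 square meters.
population = 10_000
area_m2 = 2_500_000.0

# One square kilometer is 1000 m * 1000 m = 10**6 square meters,
# so inhabitants per m2 times 10**6 gives inhabitants per km2.
density_per_m2 = population / area_m2
density_per_km2 = density_per_m2 * 10**6

# Cross-check: convert the area to km2 first, then divide.
area_km2 = area_m2 / 10**6
density_check = population / area_km2

print(density_per_km2, density_check)
```

Both routes should agree up to floating-point rounding, which is a quick sanity check on the direction of the conversion.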
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/01-introduction-geospatial-data11.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true }, "outputs": [], "source": [ "# %load _solved/solutions/01-introduction-geospatial-data12.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Coordinate reference systems\n", "\n", "A **coordinate reference system (CRS)** determines how the two-dimensional (planar) coordinates of the geometry objects should be related to actual places on the (non-planar) earth.\n", "\n", "For a nice in-depth explanation, see https://docs.qgis.org/2.8/en/docs/gentle_gis_introduction/coordinate_reference_systems.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A GeoDataFrame or GeoSeries has a `.crs` attribute which holds (optionally) a description of the coordinate reference system of the geometries:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries.crs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the `countries` dataframe, it indicates that the EPSG 4326 / WGS84 lon/lat reference system is used, which is one of the most widely used. \n", "Its coordinates are longitude and latitude in degrees, as can be seen from the x/y tick labels on the plot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.crs` attribute is given as a dictionary. In this case, it only indicates the EPSG code, but it can also contain the full \"proj4\" string (in dictionary form). \n", "\n", "Under the hood, GeoPandas uses the `pyproj` / `proj4` libraries to deal with the re-projections.\n", "\n", "For more information, see also http://geopandas.readthedocs.io/en/latest/projections.html." 
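, "\n", "\n", "To make the idea of a projected CRS more concrete, here is a sketch of the forward formula of an ellipsoidal Mercator projection on the WGS84 ellipsoid, written in plain Python. It is an illustration under stated assumptions only: the function name `to_world_mercator` is made up, the constants are the published WGS84 parameters, and in practice the conversion should be left to `pyproj` rather than re-implemented.

```python
import math

# WGS84 ellipsoid parameters (semi-major axis in meters, flattening).
A = 6378137.0
F = 1 / 298.257223563
E = math.sqrt(F * (2 - F))  # first eccentricity


def to_world_mercator(lon_deg, lat_deg):
    # Ellipsoidal Mercator forward projection: degrees in, meters out.
    # Hypothetical helper for illustration, not a pyproj or GeoPandas API.
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    x = A * lam
    esin = E * math.sin(phi)
    y = A * math.log(
        math.tan(math.pi / 4 + phi / 2) * ((1 - esin) / (1 + esin)) ** (E / 2)
    )
    return x, y


for lon, lat in [(0.0, 0.0), (180.0, 0.0), (0.0, 45.0), (0.0, 80.0)]:
    print((lon, lat), to_world_mercator(lon, lat))
```

Longitude maps linearly to x, while y grows faster and faster towards the poles, which is why Mercator-style maps cannot show the poles themselves."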
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "There are sometimes good reasons to change the coordinate reference system of your dataset, for example:\n", "\n", "- different sources with different CRS -> need to convert them to the same CRS\n", "- distance-based operations -> you need a CRS that has meter units (not degrees)\n", "- plotting in a certain CRS (e.g. to preserve area)\n", "\n", "We can convert a GeoDataFrame to another reference system using the `to_crs` method. \n", "\n", "For example, let's convert the countries to the World Mercator projection (http://epsg.io/3395):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# remove Antarctica, as the Mercator projection cannot deal with the poles\n", "countries = countries[(countries['name'] != \"Antarctica\")]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries_mercator = countries.to_crs(epsg=3395)  # or .to_crs({'init': 'epsg:3395'})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries_mercator.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the different scale of x and y: the coordinates are now in meters instead of degrees." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A bit more on importing and creating GeoDataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Note on `fiona`\n", "\n", "Under the hood, GeoPandas uses the [Fiona library](http://toblerity.org/fiona/) (a Pythonic interface to GDAL/OGR) to read and write data. GeoPandas provides a more user-friendly wrapper, which is sufficient for most use cases. 
But sometimes you want more control; in that case, you can read a file with fiona directly as follows:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import fiona\n", "from shapely.geometry import shape\n", "\n", "with fiona.Env():\n", "    with fiona.open(\"zip://./data/ne_110m_admin_0_countries.zip\") as collection:\n", "        for feature in collection:\n", "            # ... do something with geometry\n", "            geom = shape(feature['geometry'])\n", "            # ... do something with properties\n", "            print(feature['properties']['name'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Constructing a GeoDataFrame manually" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "geopandas.GeoDataFrame({\n", "    'geometry': [Point(1, 1), Point(2, 2)],\n", "    'attribute1': [1, 2],\n", "    'attribute2': [0.1, 0.2]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a GeoDataFrame from an existing dataframe\n", "\n", "For example, if you have lat/lon coordinates in two columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(\n", "    {'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],\n", "     'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],\n", "     'Latitude': [-34.58, -15.78, -33.45, 4.60, 10.48],\n", "     'Longitude': [-58.66, -47.91, -70.66, -74.08, -66.86]})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Coordinates'] = list(zip(df.Longitude, df.Latitude))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Coordinates'] = df['Coordinates'].apply(Point)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf = geopandas.GeoDataFrame(df, geometry='Coordinates')" ] }, { "cell_type": "code", "execution_count": null, 
"metadata": {}, "outputs": [], "source": [ "gdf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See http://geopandas.readthedocs.io/en/latest/gallery/create_geopandas_from_pandas.html#sphx-glr-gallery-create-geopandas-from-pandas-py for a full example." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }