{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to geospatial vector data in Python" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import pandas as pd\n", "import geopandas\n", "\n", "pd.options.display.max_rows = 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing geospatial data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Geospatial data is often available in GIS-specific file formats or data stores, such as ESRI shapefiles, GeoJSON files, GeoPackage files, or a PostGIS (PostgreSQL) database.\n", "\n", "We can use the GeoPandas library to read many of those GIS file formats with the `geopandas.read_file` function (under the hood it relies on an interface to GDAL/OGR: the `fiona` library in older releases, and `pyogrio` by default in recent ones).\n", "\n", "For example, let's start by reading a shapefile with all the countries of the world (adapted from http://www.naturalearthdata.com/downloads/110m-cultural-vectors/110m-admin-0-countries/, the zip file is available in the `data/` directory), and inspect the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries = geopandas.read_file(\"zip://./data/ne_110m_admin_0_countries.zip\")\n", "# or if the archive is unpacked:\n", "# countries = geopandas.read_file(\"data/ne_110m_admin_0_countries/ne_110m_admin_0_countries.shp\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we can observe:\n", "\n", "- Using `.head()` we can see the first rows of the dataset, just as we can with pandas.\n", "- There is a 'geometry' column and the different countries are represented as polygons.\n", "- We can use the `.plot()` 
method to quickly get a *basic* visualization of the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What's a GeoDataFrame?\n", "\n", "We used the GeoPandas library to read in the geospatial data, which returned a `GeoDataFrame`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(countries)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A GeoDataFrame contains a tabular, geospatial dataset:\n", "\n", "* It has a **'geometry' column** that holds the geometry information (the `geometry` member of each feature in GeoJSON).\n", "* The other columns are the **attributes** (or properties in GeoJSON) that describe each of the geometries.\n", "\n", "Such a `GeoDataFrame` is just like a pandas `DataFrame`, but with some additional functionality for working with geospatial data:\n", "\n", "* A `.geometry` attribute that always returns the column with the geometry information (returning a GeoSeries). The column name itself does not necessarily need to be 'geometry', but it will always be accessible as the `.geometry` attribute.\n", "* Extra methods for working with spatial data (area, distance, buffer, intersection, ...), which we will see in later notebooks." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries.geometry" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(countries.geometry)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries.geometry.area" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**It's still a DataFrame**, so we have all the pandas functionality available to use on the geospatial dataset, and to do data manipulations with the attributes and geometry information together.\n", "\n", "For example, we can calculate the average population over all countries (by accessing the 'pop_est' column, and calling 
the `mean` method on it):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries['pop_est'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, we can use boolean filtering to select a subset of the dataframe based on a condition:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "africa = countries[countries['continent'] == 'Africa']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "africa.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "The rest of the tutorial assumes you already know some pandas basics, but we will try to give hints for those who are not familiar with it.\n", "A few resources in case you want to learn more about pandas:\n", "\n", "- Pandas docs: https://pandas.pydata.org/pandas-docs/stable/10min.html\n", "- Other tutorials: the pandas chapter in https://jakevdp.github.io/PythonDataScienceHandbook/, https://github.com/jorisvandenbossche/pandas-tutorial, https://github.com/TomAugspurger/pandas-head-to-tail, ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "