{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Practical 6: Spatial Data

\n", "

Getting to grips with Geo-Data using Geopandas

\n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| Complete | Part 1: Foundations | Part 2: Data | Part 3: Analysis | |\n", "| :------- | :------------------ | :----------- | :--------------- | --: |\n", "| 50% | ▓▓▓▓▓▓▓▓ | ▓▓▓░░░ | ░░░░░░ | 6/10\n", "\n", "Last week we did some initial processing on the Inside Airbnb listings data, focussing on its *numeric* properties. This week we are going to focus on the *spatial* properties of the data set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preamble\n", "\n", "
💡 Tip: It makes life a lot easier if you gather all of the library import commands and configuration information (here having to do with `matplotlib`) in the first exectuable code block in a notebook or script. That way it's easy for you for others to see what what it is necessary to have installed before getting started!.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import numpy as np\n", "import pandas as pd\n", "import geopandas as gpd\n", "import matplotlib.cm as cm\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 1. Reading Geo-Data\n", "\n", "
\n", " 🔗 Connections: We're building on the work done in Practical 5 and Practical 4 (with a particular nod to the lecture on Data) to create some useful functions that we can call on at-need to improve the ease of doing data analysis.\n", "
\n", "\n", "I find GeoPackages and GeoParquet to be by far the easiest way to distribute geo-data now: they are a single file (in a database-like format that supports multiple types of data), include the projection information by default, and in some cases QGIS can even embed information about rendering style! \n", "\n", "We're going to do something _similar_ to the `get_url` function in `dtools` in order to download the file to our hard drive and save it there. The improvement is that we'll check to see if the file already exists and, if it does, return that so that you can don't have to keep downloading it week after week. \n", "\n", "You'll need to add the documentation yourself and I've left a few `??` to challenge you." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 1.1: Caching Data Files\n", "\n", "I've used the Numpy-style comments here, but the Google-style also look good in this context and all styles of answer are acceptable so long as they *work*. See overview of commenting styles [on DataCamp](https://www.datacamp.com/community/tutorials/docstrings-python).\n", "\n", "
💡 Tip: Use this as an opportunity to improve your ability to read code and to learn through documentation.
\n", "\n", "
Difficulty level: Moderate
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "from requests import get\n", "from urllib.parse import urlparse\n", "\n", "def cache_data(src:str, dest:str) -> str:\n", " \"\"\"\n", " \n", " ??\n", " \n", " \n", " \"\"\" \n", " url = urlparse(src) # We assume that this is some kind of valid URL \n", " fn = os.path.split(url.path)[??] # Extract the filename\n", " dfn = os.path.join(dest,fn) # Destination filename\n", " \n", " if not os.path.isfile(dfn):\n", " \n", " print(f\"{dfn} not found, downloading!\")\n", "\n", " path = os.path.split(dest)\n", " \n", " if len(path) >= 1 and path[0] != '':\n", " os.makedirs(os.path.join(*path), exist_ok=True)\n", " \n", " with open(dfn, \"wb\") as file:\n", " response = get(src)\n", " file.write(??.content)\n", " \n", " print(\"\\tDone downloading...\")\n", "\n", " else:\n", " print(f\"Found {dfn} locally!\")\n", "\n", " return dfn\n", "\n", "help(cache_data) # <- This should show the docstring you've written" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 1.2: Read Remote Geo-Data\n", "\n", "
Difficult level: Low, if your function works!
\n", "\n", "Use the function above to download and cache the GeoPackage files found [on GitHub](https://github.com/jreades/fsds/tree/master/data/src) for Boroughs, Water, and Greenspace, then pass the output of these to GeoPandas. If you have been having trouble downloading files from GitHub, then use the understanding of the function developed above to download the file manually and place it where this function expects to find it!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ddir = os.path.join('data','geo') # destination directory\n", "spath = 'https://github.com/jreades/i2p/blob/master/data/src/' # source path\n", "\n", "boros = gpd.read_file( ??(spath+'Boroughs.gpkg?raw=true', ddir) )\n", "water = gpd.read_file( ??(spath+'Water.gpkg?raw=true', ddir) )\n", "green = gpd.read_file( ??(spath+'Greenspace.gpkg?raw=true', ddir) )\n", "\n", "print('Done.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 1.3: Check the Projection\n", "\n", "
Difficult level: Low
\n", "\n", "Check the projection of each GeoDataFrame using a for loop and the `crs` attribute." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for g in [??]:\n", " print(g.??)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should see that all three are in the [EPSG:27700 CRS](https://epsg.io/27700) which is a common one for analysis using GB data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 1.4: Check the Data\n", "\n", "
Difficult level: Low
\n", "\n", "We'll see how you control figure-making more effectively later, but for now let's just see what they look like using plot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for ??:\n", " ??.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 2. An Introduction to Mapping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These files all contain polygons, and the adjustments for points are different, but it's worth seeing how you can tweak these before we start combining them. Behind the scenes, GeoPandas is using `matplotlib` to render the map, so let's play with the colours to get the _start_ of something map-like. \n", "\n", "You will want to look both at [how to make maps in GeoPandas](https://geopandas.org/mapping.html) and at the different ways to [specify colours in Matplotlib](https://matplotlib.org/3.1.1/tutorials/colors/colors.html#specifying-colors). For the greenspace map you are looking for information about tuples... which can have three or four elements." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 2.1: Work Out the Colour Scheme\n", "\n", "
Difficulty level: Moderate
\n", "\n", "
⚠ Warning: R and Python take very different approaches to plotting. Do not think of Python's output as being 'maps' in the GIS sense, they are composed of 'patches' of color on abstract 'axes' that can use any arbitrary coordinate space. So colours are 'really' triplet (or quadruplet if you have alpha-blending transparency) values in the range 0.0-1.0. Annotations are then added in similarly abstract fashion.
\n", "\n", "I'd **suggest** the following colour scheme as a way to test out different ways of specifying colour (though anything you like is fine so long as you manipulate the colours):\n", "\n", "- The boroughs can have red edges and white fill with a thick edge.\n", "- The water should have no edges and XKCD Lightblue fill.\n", "- The greenspace should have edges and faces specified using different 'alpha blending' (i.e. transparency) levels." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1.1 Boroughs\n", "\n", "By way of a hint, matplotlib uses `edgecolor` and `facecolor` for controlling 'patches' (which is what polygons are considered), but the thicker-than-default line-width is specified differently (you'll need to look this up). So the intention is:\n", "\n", "1. Thick red borough borders, and \n", "2. White fill colour.\n", "\n", "Just to drive home how different this is from R, you can find the answer to question 1 [on the page for bar plots](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boros.plot(??)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1.2: Water\n", "\n", "The process is the same as above, but I'd like you to work out how to specify:\n", "1. _No_ color for an edge, and \n", "2. An XKCD color for the face." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "water.plot(??)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1.3 Greenspace\n", "\n", "The process is _also_ the same as above, but I'd like you to work out how to specify colours and transparency using RGBA (red-green-blue-alpha transparency) tuples. So we're looking for:\n", "1. No edge color.\n", "2. A partially transparent green specified as a 'tuple' (4 numbers in parentheses in the range 0.0-1.0)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "green.plot(??)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 2.2 Combining Layers\n", "\n", "
⚠ Note: R and Python take very different approaches to plotting. Do not think of Python's output as being 'maps' in the GIS sense, they are composed of 'patches' of color on abstract 'axes' that can use any arbitrary coordinate space. So colours are 'really' numerical triplets (or quadruplets if you have transparency as well) in the range 0.0-1.0. Annotations are then added in similarly abstract fashion.
\n", "\n", "Now that we've got our layers looking roughly how we want them, it's time to combine them. This is also reliant on `matplotlib` and basically involves plotting items to _shared axes_ which is done by passing in `ax=` to each `plot(...)`. By convention, if you only have a single figure (e.g. a single map) then you create an axis object and name it `ax` so you will see a lot of `ax=ax` code in graphing libraries, but `=ax` is just saying 'assign to the axis object that I created'.\n", "\n", "Since the axes are how you control what is shown, see if you can find out by Googling how to set the x- and y-limits on the map so that it shows only London and trims out the much larger area of water that is outside of the Greater London Authority. **As a rough guideline, this has the Easting range 501,000 to 563,000, and the Northing range 155,000 to 202,000.**\n", "\n", "You can set these limits before or after you start adding layers to the 'map', but it's probably easier conceptually to add them after with the idea of 'zooming in' on the features of interest. It's also easier to debug since you can start by seeing if you can plot the elements at all, and _then_ add the limits to zoom.\n", "\n", "**So the steps are:**\n", "1. Write the code to plot every image on the same set of axes (I've given you something to get started).\n", "2. Google how to set the limits of the map and then use the ranges I've offered above.\n", "3. Work out how to change the width of the edges for the boroughs layer. \n", "4. Save it somewhere local so that you could, say, load it into a Markdown file!\n", "\n", "
💡 Hint: This is a first pass at a map, over the next few weeks we'll see how to add things like axis labels and titles to make it more 'map-like'. We don't have quite the built-in functionality of `ggplot` alas, but Python is advancing very quickly in this area. There is even an implementation of ggplot in Python, but it's functionality is more limited. In fact, there's more than one...
\n", "\n", "
Difficulty level: Hard
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Creates a new figure with specified number of\n", "# subplots (we'll see more of this later) and \n", "# and the specified size (in inches by default).\n", "fig, ax = plt.subplots(1,1, figsize=(12,9))\n", "\n", "# Plot all three GeoPackages to the same axes\n", "water.plot(??, ax=ax)\n", "green.??\n", "boros.??\n", "\n", "# Set the x and y limits\n", "\n", "\n", "# Save the image (dpi is 'dots per inch')\n", "plt.savefig('My_First_Map.png', dpi=150)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may wish for a different look, but here's one version of the output:\n", "\n", "![](https://github.com/jreades/i2p/raw/master/practicals/img/Map-First_Pass.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 3. 'Creating' Geo-Data\n", "\n", "Of course, you will also often encounter geo-data that is not yet 'geographically enabled'; the two most frequent contexts for this are:\n", "\n", "1. The data represents points and is provided with latitude and longitude (or similar) as separate columns in a non-geographic data set.\n", "2. The data represents polygons but is provided _separately_ from the polygons themselves and so cannot be shown on a map without being 'joined' to the geography first.\n", "\n", "We'll tackle each of these eventually, but for now we're going to focus on the first option." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 3.1: Parquet Pandas\n", "\n", "Let's re-use our `cache_data` function to download and save the full Inside Airbnb data set. Again, if you have trouble with downloading via code, use your understanding of the function to work out where to save your own copy of this file so that the function works as expected.\n", "\n", "
Difficulty level: Low
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set download URL\n", "host = 'https://orca.casa.ucl.ac.uk'\n", "path = '~jreades/data/2023-09-06-listings.parquet'\n", "url = f'{host}/{path}'\n", "\n", "# your code here\n", "df = pd.read_parquet( ??(??, os.path.join('data','raw')) )\n", "print(f\"Data frame is {df.shape[0]:,} x {df.shape[1]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should see that the file was 'not found' so 'downloading' happened and then the size of the data frame was printed out." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 3.2: Checking Lat/Long\n", "\n", "Before we mindlessly convert the data it might make sense to sanity-check the data. In GeoPandas we have a `total_bounds` method that gives us the bounding box for a GeoSeries, but how would we do that in Pandas?\n", "\n", "
💡 Hint: Think about what the 'total bounds' (or 'envelope') of a point data set is. You have already seen the pandas functions you'll need to find these...
\n", "\n", "
Difficulty Level: Moderate
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"The bottom-left corner is {??}, {??}\")\n", "print(f\"The top-right corner is {??}, {??}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your answer should be something *like* (it varies from year to year):\n", "```\n", "The bottom-left corner is -0.5, 51.3\n", "The top-right corner is 0.3, 51.7\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 3.4: Embedding Web Maps\n", "\n", "
Difficulty Level: Hard
\n", "\n", "This is more for the sake of demonstrating Python's features than because it's part of my workflow, but what the heck, let's do it! We will create and embed a zoomable web map in the notebook; to do _that_ we need to:\n", "\n", "1. Calculate the bounds of the map using the min/max x and y coordinates above.\n", "2. Calculate the centroid of the map from the bounds.\n", "3. Set an appropriate zoom level.\n", "\n", "If your work is going well, perhaps you may also want to experiment with [different basemaps](https://ipyleaflet.readthedocs.io/en/latest/api_reference/basemaps.html).\n", "\n", "
💡 Hint: You can't use round here because it it could round up or down depending on what's closest and, consequently, cut off data on your map. So you'll have to look for two other functions that do this predictably (e.g. always rounding down, even if the value is 4.999999). However, those functions don't handle decimals like round does, so you need to think about how you could turn a number like 4.99 into a number that those functions can work with and then turn it back into the decimal...
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from math import floor, ceil\n", "\n", "# Calculate min and max to *two* decimal places\n", "xmin = ??\n", "xmax = ??\n", "ymin = ??\n", "ymax = ??\n", "\n", "# Print them to *3* decimal places to check they end in 0\n", "print(f\"{xmin:.3f}, {xmax:.3f}, {ymin:.3f}, {ymax:.3f}\")\n", "\n", "# Calculate the centre of the map\n", "yctr = ??\n", "xctr = ??\n", "\n", "# Print this two ways to see an intriguing issue\n", "print(f\"{xctr:.5f}, {yctr:.5f}\")\n", "print(xctr, yctr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should end up with something like: \n", "```\n", "-0.530, 0.310, 51.270, 51.710\n", "-0.11000, 51.49000\n", "-0.10999999999999999 51.49\n", "```\n", "You'll see *why* this happens in the answer notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you've managed the calculations above, then this code should simply run!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from ipyleaflet import Map, basemaps, basemap_to_tiles, Rectangle, projections\n", "\n", "# Note the basemap can be easily changed\n", "watercolor = basemap_to_tiles(basemaps.Stamen.Watercolor)\n", "\n", "m = Map(layers=(watercolor, ), center=(yctr, xctr), zoom=8)\n", "\n", "rectangle = Rectangle(bounds=( (ymin, xmin), (ymax, xmax) ),\n", " crs=projections.EPSG4326\n", ")\n", "\n", "m.add_layer(rectangle)\n", "\n", "m" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your map should look like this:\n", "\n", "![](https://github.com/jreades/i2p/raw/master/practicals/img/Stamen.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 3.5: Lat/Long to GeoSeries\n", "\n", "
Difficulty Level: Low
\n", "\n", "Right, we're finally there! We need to convert our coordinates into some kind of geo-data. GeoPandas offers two ways to do this: the original way using `zip` and a new utility method called `points_from_xy`. Here's the old way:\n", "```python\n", "from shapely.geometry import Point\n", "gdf = gpd.GeoDataFrame(df, \n", " geometry=[Point(x,y) for x, y in zip(df.Longitude,df.Latitude)])\n", "```\n", "Note, however, that this did not automatically set a projection, unlike the new approach with the 'helper function':" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf = gpd.GeoDataFrame(df, \n", " geometry=gpd.??(df.??, df.??, crs='epsg:4326'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(type(gdf))\n", "print(type(gdf.geometry))" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### Task 3.6: Plotting\n", "\n", "
Difficulty Level: Low
\n", "\n", "Now that we've converted the InsideAirbnb data to a GeoDataFrame, we can plot it, reproject it, etc.\n", "\n", "See if you can work out how to plot the points coloured by their price using the appropriate BNG projection." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf.to_crs(??).plot(column=??, cmap=??, alpha=0.25, markersize=1, figsize=(12,8));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the Viridis colourmap I get the following:\n", "\n", "![Viridis Chloropleth Map of InsideAirbnb Listings](https://github.com/jreades/fsds/raw/master/practicals/img/Viridis.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 3.7: Saving Geo-Data Locally\n", "\n", "
Difficulty Level: Low
\n", "\n", "Since there are _many_ formats in which to save geo-data, rather than have multiple `to_format_x` methods, GeoPandas has _one_ for local files (`to_file`). If you are reading/writing a filename than ends in a valid extension (e.g. `.shp`, `.gpkg`, or `.geojson`) then GeoPandas will 'do the right thing'. Where you *may* run into trouble is if you are reading/writing a **URL** (e.g. [https://github.com/jreades/fsds/blob/master/data/src/Boroughs.gpkg?raw=true](https://github.com/jreades/fsds/blob/master/data/src/Boroughs.gpkg?raw=true)). With this URL ending in `?raw=true` there's no extension that GeoPandas can read so you *will* need to specify a driver. If in doubt, specify the driver as below:\n", "\n", "
Clarification: in this example we are reading from GitHub and I'm saying that we need to specify the driver. So why didn't we need to do with the cache_data function earlier as well? Well, this was a side-benefit of using the standard URL library: it automatically stripped off the query string (?raw=true) when I asked it for the file name, so we saved the file locally as a GeoPackage with .gpkg extension, which means that GeoPandas could read it without any problems.
\n", "\n", "So the following two bits of code are equivalent:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boros.to_file('test.gpkg') # This *may* not always do what we want, but should be fine for local files" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boros.to_file('test.gpkg', driver='GPKG') # This is safer if working across computers/the Internet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now save the InsideAirbnb GeoDataFrame to the 'geo' directory, but first let's see [what file formats are supported](https://geopandas.org/en/stable/docs/user_guide/io.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we have two options:\n", "\n", "- File formats with wider compatibility (the ones supported by Fiona) may not be fully-compatible with the Pandas/GeoPandas data formats. For example, **Categorical Data** is not supported by any standard format except parquet, and Shapefiles don't reliably support column labels (`df.columns`) longer than 10 characters!\n", "- File formats with Python compatibility and performance benefits are much less likely to be compatible with a GIS. So for these we have to write them `to_feather` and `read_feather`, and `to_parquet` and `read_parquet`.\n", "\n", "Notice the difference:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Using '{fn}' as basis for saving data...\")\n", "gdf.to_file(os.path.join('data','geo',fn.replace('.parquet','.gpkg')), driver='GPKG')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you try to save as a GeoPackage file then the code above typically throws a `TypeError` because of the presence of Categorical data.\n", "\n", "But the below, in which we specify as a 'geoparquet' because of the coordinate data, does not:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Using '{fn}' as basis for saving data...\")\n", "gdf.to_parquet(os.path.join('data','geo',fn.replace('.parquet','.geoparquet')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 3.8: Compare\n", "\n", "It's also worth comparing the output of a point with the output of a polygon or multi-polygon because you may well come across data in formats (e.g. WKT) resembling both of these in real data sets and they *can* be read as well. Notice too that we can use `loc` and `iloc` accessor methods to pull individual points and polygons out of a GeoDataFrame!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(gdf.geometry.iloc[1]) # Print out the object's contents\n", "gdf.geometry.iloc[1] # The object knows how to print itself as a point" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(str(boros.geometry.iloc[1])[:399] + \"...\") # Object to string then print out first 399 characters\n", "boros.geometry.iloc[1] # So this is a multi-polygon boundary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So each element of this Series has text indicating the type of shape the geometry applies to (e.g. _POLYGON_) followed by a bunch of numbers. These numbers are truncated here just to make things a little more legible." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 4. Dealing with (Geo)Data\n", "\n", "Let's dig a little deeper into GeoPandas' features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.1: Reprojection\n", "\n", "
Difficulty Level: Low
\n", "\n", "Let's start by taking our InsideAirbnb data in its original projection..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(gdf.geometry.crs)\n", "print(gdf.total_bounds)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... and reprojecting this into the OSGB1936/BNG CRS:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf = gdf.??(??) # There is no 'in_place=True' option here.\n", "print(gdf.geometry.crs)\n", "print(gdf.total_bounds)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the change in total bounds from lat/long to Northing/Easting." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.2: More on Mapping\n", "\n", "
Difficulty Level: Low
\n", "\n", "As we saw above with the point-plot, in its original form the pricing data will not reveal much of interest because of the range of the data. However, as you will have seen in QM already (and as we explore in greater detail in Weeks 7/8), using *transformations* we can manipulate the data to increase its tractability for analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.2.1 Work out the Data Range\n", "\n", "Let's start by getting a feel for the full data set in terms of the range of prices that it contains:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"The range of price is ${??:,.2f} to ${??:,.2f}\")\n", "print(f\"The mean and median of the price are ${??:,.2f} and ${??:,.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the neat little comma-separated thousands in there? That's fairly easy to do in English, but to use a thousands separator common to another language you would need to do something [a little more tricky](https://stackoverflow.com/questions/13082620/how-can-i-print-a-float-with-thousands-separators). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.2.2: Inheritance!\n", "\n", "We already know that GeoPandas _inherits_ functionality from Pandas, but let's formalise this... \n", "\n", "First, let's check what class of object `gdf` is using the [`isinstance`](https://www.w3schools.com/python/ref_func_isinstance.asp) function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if isinstance(gdf, gpd.GeoDataFrame): # Is gdf a GeoDataFrame object?\n", " print(\"\\tI'm a geopandas data frame!\")\n", "\n", "if isinstance(gdf, pd.DataFrame): # Is gdf *also* a DataFrame object?\n", " print(\"\\tI'm a pandas data frame!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.2.3 Benefiting from Inheritance\n", "\n", "*That* result means that we can also investigate the data using, for instance, a pandas histogram:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf.price.plot.??(bins=??, figsize=(13,3)); # Oooooh, let's use a *pandas* method here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how we've used our GeoDataFrame *as if* it's a plain old DataFrame here? That's the miracle of Object-Oriented Design: we can do *anything* we would with a regular Pandas `df` as we do with a GeoPandas `gdf` because GeoPandas *inherits* all the methods of its parent super-class.\n", "\n", "We can see that there's very little data above (at a guess) about $2,000, but at this scale it's hard to tell. We've already seen that you can use axes limits to adjust the display of a map, but the same technique applies to plain old plots because they're fundamentally the *same thing*.\n", "\n", "Try adjusting the axis so that the x-range is 0..2500:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = gdf.price.plot.??(bins=??, figsize=(13,3)); # Oooooh, let's use a *pandas* method here\n", "ax.??" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can do the same thing with a boxplot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = gdf.price.plot.??(vert=False, figsize=(13,3))\n", "ax.??" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[More complex](https://stackoverflow.com/a/54023612/4041902) formatting is also possible if you really know your pandas and your matplotlib:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf.price.plot(kind='box', vert=False, \n", " color=dict(boxes='r', whiskers='r', medians='r', caps='r'),\n", " boxprops=dict(linestyle='-', linewidth=1.5),\n", " flierprops=dict(linestyle='-', linewidth=1.5),\n", " medianprops=dict(linestyle='-', linewidth=1.5),\n", " whiskerprops=dict(linestyle='-', linewidth=1.5),\n", " capprops=dict(linestyle='-', linewidth=1.5),\n", " showfliers=False, grid=False, rot=0);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.3: Truncating and Transforming\n", "\n", "
Difficulty Level: Hard
" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "#### 4.3.1 Working it Out\n", "\n", "Anyway, drawing on everything we've seen over the past couple of weeks (and in *this* practical) I'd like you to:\n", "\n", "1. Try to take the natural-log of the price (*hint*: use `numpy`) and assign to a new Series called `lnprice`.\n", "2. Work out what the error means.\n", "3. Work out how to fix the error and *then* repeate step 1.\n", "4. Work out how many rows were affected.\n", "5. Report on the new min/max values.\n", "6. Work out if other outliers need to be removed (use code from above).\n", "7. Remove outliers and then continue with your work..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use this as a 'scratch space' to work out what's needed below..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"gdf has {gdf.shape[0]:,.0f} rows.\")\n", "\n", "# ---------- Do the processing -------------\n", "# You may need more than one of these 'drops'\n", "# to get the data the way you want...\n", "gdf.drop(gdf[??].index, axis=0, inplace=True)\n", "gdf['lnprice'] = np.log(gdf.price)\n", "\n", "# ---------- Check effects -----------\n", "print(f\"gdf now has {gdf.shape[0]:,.0f} rows.\")\n", "print(f\"The range of price is {gdf.price.min():,.2f} to {gdf.price.max():,.2f}\")\n", "print(f\"The range of ln(price) is {gdf.lnprice.min():,.4f} to {gdf.lnprice.max():,.4f}\")\n", "\n", "gdf.price.plot(kind='box', vert=False, \n", " color=dict(boxes='r', whiskers='r', medians='r', caps='r'),\n", " boxprops=dict(linestyle='-', linewidth=1.5),\n", " flierprops=dict(linestyle='-', linewidth=1.5),\n", " medianprops=dict(linestyle='-', linewidth=1.5),\n", " whiskerprops=dict(linestyle='-', linewidth=1.5),\n", " capprops=dict(linestyle='-', linewidth=1.5),\n", " showfliers=False, grid=False, rot=0);\n", "plt.title(\"Price (Outliers not shown)\")\n", "plt.show()\n", "\n", "gdf.lnprice.plot(kind='box', vert=False, \n", " color=dict(boxes='r', whiskers='r', medians='r', caps='r'),\n", " boxprops=dict(linestyle='-', linewidth=1.5),\n", " flierprops=dict(linestyle='-', linewidth=1.5),\n", " medianprops=dict(linestyle='-', linewidth=1.5),\n", " whiskerprops=dict(linestyle='-', linewidth=1.5),\n", " capprops=dict(linestyle='-', linewidth=1.5),\n", " showfliers=False, grid=False, rot=0);\n", "plt.title(\"Ln(Price) (Outliers not shown)\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.3.2 Plotting\n", "\n", "Now plot the ln(price) as a chloropleth using:\n", "1. A figure size of 12 x 8\n", "2. A marker size of 0.25\n", "3. The Viridis colourmap\n", "4. A legend\n", "5. A legend label of 'Natural Log of Price per Night ($)'\n", "\n", "I'd suggest [referring to the documentation](https://geopandas.org/mapping.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = gdf.plot(figsize=??, marker='*', markersize=0.25, \n", " column=??, cmap=??, \n", " legend=??, legend_kwds=??);\n", "ax.set_title(\"Plot of Natural Log of Nightly Price for Airbnb Listings (Outliers Removed)\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your map should look something like this:\n", " \n", "![Plot of Natural Log of Nightly Price for Airbnb Listings (Outliers Removed)](https://github.com/jreades/fsds/raw/master/practicals/img/Airbnb-lnprice.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.4: Zooming In/Out\n", "\n", "
Difficulty Level: Low
\n", "\n", "That's a little hard to see, let's try zooming in on Central London! Very roughly, let's call that an Easting range of 525,000 to 535,000 and a Northing range of 178,000 to 185,000.\n", "\n", "
💡 Hint: You will need to remember that GeoDataFrame.plot() returns an axis object.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = gdf.plot(figsize=(12,8), marker='*', markersize=0.25, \n", " column='lnprice', cmap='viridis', \n", " legend=True, legend_kwds={'label':'Natural Log of Price per Night ($)'});\n", "ax.set_title(\"Ln(Price/Night) for Airbnb Listings (Central London Detail)\")\n", "ax.??\n", "ax.??;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You result should look something like this:\n", "\n", "![Ln(Price/Night) for Airbnb Listings (Central London Detail)](https://github.com/jreades/fsds/raw/master/practicals/img/Airbnb-lnprice-zoom.png)\n", "\n", "That's a little better, but ideally we'd do more thinking about outliers... " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.5: Changing the Classification Scheme\n", "\n", "
Difficulty Level: Moderate (mainly computation time)
\n", "\n", "Let's give this one last try using a different classification scheme... such as Fisher-Jenks... for Central London!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = gdf.plot(figsize=(12,8), marker='*', markersize=0.25, \n", " column='lnprice', cmap='viridis', ??, k=5, \n", " legend=True); # Note that the legend *label* had to go -- there are other ways to add it\n", "\n", "ax.set_xlim([525000,535000])\n", "ax.set_ylim([178000,185000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In which case we can..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.6: Bringing it All Together\n", "\n", "
Difficulty Level: 🤯
\n", "\n", "And just to give a bit of a show of how we can put it all together:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pysal as p\n", "import mapclassify as mc\n", "import palettable.matplotlib as palmpl\n", "from legendgram import legendgram\n", "\n", "# We change fonts the hard way here...\n", "# but you can also do this to change\n", "# the font everywhere in one go:\n", "# matplotlib.rcParams['font.family'] = \"Liberation Sans Narrow\"\n", "fontname = \"Liberation Sans Narrow\"\n", "\n", "# We create a temporary data frame here because we want\n", "# the 'bins' to be created using only the data on the \n", "# map. Otherwise, we'd have a distribution on the map \n", "# that differed from the one in the legendgram and the\n", "# one used to calculate the breaks in the first place!\n", "tgdf = gdf[(gdf.geometry.x > 525000) & (gdf.geometry.x < 540000) & (gdf.geometry.y > 176000) & (gdf.geometry.y < 186000)].copy()\n", "\n", "# Here we use Mapclassify to calculate quantiles\n", "# (k=5) using the original price. You could use\n", "# any Mapclassify scheme at this point, though\n", "# note that for Fisher Jenks you might want to use\n", "# the 'Sampled' version to speed things up a bit.\n", "q = mc.??(tgdf.price.values, ??)\n", "\n", "# We then write these binned values *back* on to the data\n", "# frame so that we can use them with the GDF plot function.\n", "tgdf['bins'] = q.??\n", "\n", "# Set up the figure with its 'basemap'\n", "f,ax = plt.subplots(figsize=(15,9))\n", "green.plot(edgecolor=(0.7, 0.7, 0.14, 0.25), facecolor=(0.7, 0.7, 0.14, 0.25), zorder=1, ax=ax)\n", "water.plot(edgecolor=\"none\", facecolor='xkcd:lightblue', zorder=2, ax=ax)\n", "boros.plot(edgecolor=(0.8, 0, 0, 0.5), facecolor='none', linewidth=2.5, zorder=3, ax=ax)\n", "\n", "# Restrict the x and y axis to the data\n", "# Notice it's hard-coded here *and* above\n", "# this could be done *better* using a \n", "# custom bounding box so that we could \n", "# update both the tgdf and the display\n", "# area at the same time.\n", "ax.set_xlim([525000,540000])\n", "ax.set_ylim([176000,186000])\n", "\n", "ax.axis(??) # Don't plot the axes\n", "\n", "# Plot the bins using a categorical legend instead\n", "# of the price using a continuous legend.\n", "tgdf.plot(column='bins', categorical=True,\n", " cmap='viridis', legend=True, marker='.', markersize=1.5, zorder=4, ax=ax)\n", "\n", "# Set the title using a specified font, weight, and size\n", "ax.set_title('London Airbnb Listings Price Per Night', \n", " fontdict={'fontsize':'20', 'fontweight':'3', 'family':fontname}) #provide a title\n", "\n", "# This is where geopandas gets in the way -- the \n", "# categorical legend doesn't work for us so we need\n", "# to actually create the legend 'by hand' using this \n", "# code... which first has to *find* the layer containing\n", "# the data! Each layer is a 'patch collection', so we \n", "# loop through the collections looking for the one whose\n", "# z-order is 4 (which we set above to the data layer).\n", "#\n", "# I relied on this: https://stackoverflow.com/a/71419387/4041902\n", "# to work out how to do this!\n", "for c in ax.collections:\n", " # Find the layer with the data\n", " if c.get_zorder()==4:\n", " # *Now* we can create a legend... but we need to \n", " # first retrieve the colours from the layer. These\n", " # are returned as 'handles' and then we need to \n", " # associate these with the labels taken from the\n", " # Mapclassify object... Once we set that up, along\n", " # with fonts and such, we can add it as an 'artist'\n", " # to the figure.\n", " handles, _ = c.legend_elements(prop=\"colors\")\n", " legend1 = ax.legend(handles, q.get_legend_classes(fmt='{:.2f}'), \n", " loc=\"upper right\", title=\"Price per Night\", \n", " prop={'size':'10', 'weight':'1', 'family':fontname})\n", " ax.add_artist(legend1)\n", "\n", "# And don't forget to add a source!\n", "a = ax.text(tgdf.geometry.x.max(), tgdf.geometry.y.min(), 'Source: InsideAirbnb (2022)', \n", " horizontalalignment='right', verticalalignment='bottom', \n", " fontsize=14, fontweight=4, color='#333333', family=fontname)\n", "\n", "# And this is a nice feature: show the distribution!\n", "ax2 = legendgram(f, ax, \n", " tgdf.??, q.??, bins=round(gdf.price.max()/25),\n", " pal=palmpl.Viridis_5,\n", " legend_size=(0.3, 0.1), \n", " loc='lower left',\n", " clip=(0,1000),\n", " frameon=True\n", " )\n", "# But we have to fix the font manually here\n", "# for the legendgram too\n", "for tk in ax2.get_xticklabels():\n", " tk.set_fontname(fontname)\n", " \n", "#plt.savefig('Airbnb-price-all.png', dpi=150)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should have something like this:\n", " \n", "![London Airbnb Listings Price Per Night (Quantiles)](https://github.com/jreades/fsds/raw/master/practicals/img/Airbnb-price-all-together.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I'll be honest, I do find ggplot easier for making good-quality; this _is_ more customisable overall, but it's also much more 'magical' in the sense of 'search for `matplotlib` examples that do what you want then copy+paste them and tweak' being the main way that most people get things working how they want." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scalebars are awkward, and there's now a library that can help with this [on GitHub](https://github.com/ppinard/matplotlib-scalebar) that I've installed. But I'll leave that one to you.\n", "\n", "
💡 Tip: You can find a lot of possible solutions in this Stackoverflow thread that should work without needing to install new libraries but I've not had a chance to test them each individually. You would undoubtedly want to put this in an external package and import it when needed rather than paste this code into every file. But you might find it easier to test the solutions by pasting. If you're looking for glory (and my gratitude) then working out which of these is most generalisable (i.e. would work with both lat/long and OSGB coordinates) would be quite the challenge!
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting More Help/Applications\n", "\n", "A great resource for more help and more examples is Dani Arribas-Bel's _Geographic Data Science_ module: he has put all of his [module practicals online](http://darribas.org/gds17/) (as we have too), and you might find that something that he does makes more sense to you than what we've done... check it out!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Credits!\n", "\n", "#### Contributors:\n", "The following individuals have contributed to these teaching materials: Jon Reades (jonathan.reades@kcl.ac.uk), James Millington (james.millington@kcl.ac.uk)\n", "\n", "#### License\n", "These teaching materials are licensed under a mix of [The MIT License](https://opensource.org/licenses/mit-license.php) and the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/).\n", "\n", "#### Acknowledgements:\n", "Supported by the [Royal Geographical Society](https://www.rgs.org/HomePage.htm) (with the Institute of British Geographers) with a Ray Y Gildea Jr Award.\n", "\n", "#### Potential Dependencies:\n", "This notebook may depend on the following libraries: pandas, matplotlib, seaborn" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 4 }