{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to points, lines, areas, rasters, and trimeshes, Datashader can quickly render large collections of polygons (filled polylines). Datashader's polygon support depends on data structures provided by the [spatialpandas](https://nbviewer.org/github/holoviz/spatialpandas/blob/main/examples/Overview.ipynb) library. SpatialPandas extends Pandas and Parquet to support efficient storage and manipulation of \"ragged\" (variable length) data like polygons. Instructions for installing spatialpandas are given at the bottom of this page.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SpatialPandas Data Structures\n", "Pandas supports custom column types using an \"ExtensionArray\" interface. SpatialPandas provides two Pandas ExtensionArrays that support polygons:\n", "\n", "- `spatialpandas.geometry.PolygonArray`: Each row in the column is a single `Polygon` instance. As with shapely and geopandas, each Polygon may contain zero or more holes. \n", " \n", "- `spatialpandas.geometry.MultiPolygonArray`: Each row in the column is a `MultiPolygon` instance, each of which can store one or more polygons, with each polygon containing zero or more holes.\n", "\n", "Datashader assumes that the vertices of the outer filled polygon will be listed as `x1`, `y1`, `x2`, `y2`, etc. in counter-clockwise (CCW) order around the polygon edge, while the holes will be in clockwise (CW) order. All polygons (both filled and holes) must be \"closed\", with the first vertex of each polygon repeated as the last vertex." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```mermaid\n", "graph TD;\n", " subgraph SP[ ]\n", " A[Polygon]:::spatialpandas -->|used in| E[PolygonArray]:::spatialpandas\n", " B[MultiPolygon]:::spatialpandas -->|used in| F[MultiPolygonArray]:::spatialpandas\n", " E -->|used in| G[GeoDataFrame or
GeoSeries]:::spatialpandas\n", " spatialpandas(SpatialPandas):::spatialpandas_title\n", " F -->|used in| G\n", " end\n", "\n", " subgraph SH[ ]\n", " shapely(Shapely):::shapely_title\n", " H[Polygon]:::shapely -->|converts to| A\n", " I[MultiPolygon]:::shapely -->|converts to| B\n", " end\n", "\n", " subgraph GP[ ]\n", " geopandas(GeoPandas):::geopandas_title\n", " J[GeoDataFrame or
GeoSeries]:::geopandas\n", " end\n", "\n", " subgraph PD[ ]\n", " pandas(Pandas):::pandas_title\n", " K[DataFrame or
Series]:::pandas\n", " end\n", "\n", " subgraph SPD[ ]\n", " spatialpandas.dask(SpatialPandas.Dask):::dask_title\n", " L[DaskGeoDataFrame or
DaskGeoSeries]:::dask\n", " end\n", "\n", " M(Datashader):::datashader_title\n", "\n", " F -->|usable in| K\n", " E -->|usable in| K\n", " \n", " G -->|converts to| L\n", "\n", " G <-->|converts to| J\n", "\n", " G -->|usable by| M\n", " L -->|usable by| M\n", " J -->|usable by| M\n", " K -->|usable by| M\n", "\n", " classDef spatialpandas fill:#4e79a7,stroke:#000,stroke-width:0px,color:black;\n", " classDef shapely fill:#f28e2b,stroke:#000,stroke-width:0px,color:black;\n", " classDef geopandas fill:#59a14f,stroke:#000,stroke-width:0px,color:black;\n", " classDef dask fill:#76b7b2,stroke:#000,stroke-width:0px,color:black;\n", " classDef datashader fill:#b07aa1,stroke:#000,stroke-width:0px,color:black;\n", " classDef pandas fill:#edc948,stroke:#000,stroke-width:0px,color:black;\n", " classDef spatialpandas_title fill:#fff,stroke:#4e79a7,stroke-width:7px,color:black,fill-opacity:.9,font-weight: bold;\n", " classDef shapely_title fill:#fff,stroke:#f28e2b,stroke-width:7px,color:black,fill-opacity:.9,font-weight: bold;\n", " classDef geopandas_title fill:#fff,stroke:#59a14f,stroke-width:7px,color:black,fill-opacity:.9,font-weight: bold;\n", " classDef dask_title fill:#fff,stroke:#76b7b2,stroke-width:7px,color:black,fill-opacity:.9,font-weight: bold;\n", " classDef pandas_title fill:#fff,stroke:#edc948,stroke-width:7px,color:black,fill-opacity:.9,font-weight: bold;\n", " classDef datashader_title fill:#fff,stroke:#b07aa1,stroke-width:7px,color:black,fill-opacity:.9,font-weight: bold;\n", " classDef subgraph_style fill:#fff;\n", "\n", " style SP fill:grey,stroke:#fff,stroke-width:0px\n", " style SH fill:grey,stroke:#fff,stroke-width:0px\n", " style GP fill:grey,stroke:#fff,stroke-width:0px\n", " style PD fill:grey,stroke:#fff,stroke-width:0px\n", " style SPD fill:grey,stroke:#fff,stroke-width:0px\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing Required Libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import dask.dataframe as dd\n", "import colorcet as cc\n", "import datashader as ds\n", "import datashader.transfer_functions as tf\n", "import spatialpandas as sp\n", "import spatialpandas.geometry\n", "import spatialpandas.dask" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simple Example\n", "Here is a simple example of a two-element `MultiPolygonArray`. The first element specifies two filled polygons, the first with two holes. The second element contains one filled polygon with one hole." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "polygons = sp.geometry.MultiPolygonArray([\n", " # First Element\n", " [[[0, 0, 1, 0, 2, 2, -1, 4, 0, 0], # Filled quadrilateral (CCW order)\n", " [0.5, 1, 1, 2, 1.5, 1.5, 0.5, 1], # Triangular hole (CW order)\n", " [0, 2, 0, 2.5, 0.5, 2.5, 0.5, 2, 0, 2]], # Rectangular hole (CW order)\n", "\n", " [[-0.5, 3, 1.5, 3, 1.5, 4, -0.5, 3]],], # Filled triangle\n", "\n", " # Second Element\n", " [[[1.25, 0, 1.25, 2, 4, 2, 4, 0, 1.25, 0], # Filled rectangle (CCW order)\n", " [1.5, 0.25, 3.75, 0.25, 3.75, 1.75, 1.5, 1.75, 1.5, 0.25]],]]) # Rectangular hole (CW order)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The filled quadrilateral starts at (x,y) (0,0) and goes to (1,0), then (2,2), then (-1,4), and back to (0,0); others similarly go left to right (but drawing in either CCW or CW order around the edge of the polygon depending on whether they are filled or holes).\n", "\n", "Since a `MultiPolygonArray` is a pandas/dask extension array, it can be added as a column to a `DataFrame`. For convenience, you can define your DataFrame as `sp.GeoDataFrame` instead of `pd.DataFrame`, which will automatically include support for polygon columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = sp.GeoDataFrame({'polygons': polygons, 'v': range(1, len(polygons)+1)})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Rasterizing Polygons\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Polygons are rasterized to a `Canvas` by Datashader using the `Canvas.polygons` method. This method works like the existing glyph methods (`.points`, `.line`, etc), except it does not have `x` and `y` arguments. Instead, it has a single `geometry` argument that should be passed the name of a `PolygonArray` or `MultiPolygonArray` column in the supplied `DataFrame`. For comparison we'll also show the polygon outlines rendered with `Canvas.line` (which also supports geometry columns):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "cvs = ds.Canvas()\n", "agg = cvs.polygons(df, geometry='polygons', agg=ds.sum('v'))\n", "filled = tf.shade(agg)\n", "float(agg.min()), float(agg.max())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "cvs = ds.Canvas()\n", "agg = cvs.line(df, geometry='polygons', agg=ds.sum('v'), line_width=4)\n", "unfilled = tf.shade(agg)\n", "float(agg.min()), float(agg.max())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tf.Images(filled, unfilled)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here as you can see each polygon is filled or outlined with the indicated value from the `v` column for that polygon specification. The `sum` aggregator for the filled polygons specifies that the rendered colors should indicate the sum of the `v` values of polygons that overlap that pixel, and so the first element (`v=1`) has an overlapping area with value 2, and the second element has a value of 2 except in overlap areas where it gets a value of 3. Each plot is normalized separately, so the filled plot uses the colormap for the range 1 (light blue) to 3 (dark blue), while the outlined plot ranges from 1 to 2.\n", "\n", "You can use polygons from within interactive plotting programs with axes to see the underlying values by hovering, which helps for comparing Datashader's aggregation-based approach (right) to standard polygon rendering (left):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import holoviews as hv\n", "from holoviews.operation.datashader import rasterize\n", "from holoviews.streams import PlotSize\n", "PlotSize.scale=2 # Sharper plots on Retina displays\n", "hv.extension(\"bokeh\")\n", "\n", "hvpolys = hv.Polygons(df, vdims=['v']).opts(color='v', tools=['hover'])\n", "hvpolys + rasterize(hvpolys, aggregator=ds.sum('v')).opts(tools=['hover'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Realistic Example\n", "Here is a more realistic example, plotting the unemployment rate of the counties in Texas. To run you need to run `bokeh sampledata` or install the package `bokeh_sampledata`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bokeh.sampledata.us_counties import data as counties\n", "from bokeh.sampledata.unemployment import data as unemployment\n", "\n", "counties = { code: county for code, county in counties.items()\n", " if county[\"state\"] in [\"tx\"] }\n", "\n", "county_boundaries = [[[*zip(county[\"lons\"] + county[\"lons\"][:1],\n", " county[\"lats\"] + county[\"lats\"][:1])]\n", " for county in counties.values()]]\n", "\n", "county_rates = [unemployment[county_id] for county_id in counties]\n", "\n", "boundary_coords = [[np.concatenate(list(\n", " zip(county[\"lons\"][::-1] + county[\"lons\"][-1:],\n", " county[\"lats\"][::-1] + county[\"lats\"][-1:])\n", "))] for county in counties.values()]\n", "\n", "boundaries = sp.geometry.PolygonArray(boundary_coords)\n", "\n", "county_info = sp.GeoDataFrame({'boundary': boundaries,\n", " 'unemployment': county_rates})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Discard the output from one aggregation of each type, to avoid measuring Numba compilation times\n", "tf.shade(cvs.polygons(county_info, geometry='boundary', agg=ds.mean('unemployment')));\n", "tf.shade(cvs.line(county_info, geometry='boundary', agg=ds.any()));\n", "tf.shade(cvs.line(county_info, geometry='boundary', agg=ds.any(), line_width=1.5));" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "cvs = ds.Canvas(plot_width=600, plot_height=600)\n", "agg = cvs.polygons(county_info, geometry='boundary', agg=ds.mean('unemployment'))\n", "filled = tf.shade(agg)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "agg = cvs.line(county_info, geometry='boundary', agg=ds.any())\n", "unfilled = tf.shade(agg)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "agg = cvs.line(county_info, geometry='boundary', agg=ds.any(), line_width=3.5)\n", "antialiased = tf.shade(agg)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tf.Images(filled, unfilled, antialiased)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using GeoPandas and Converting to SpatialPandas\n", "\n", "You can utilize a GeoPandas `GeoSeries` of `Polygon`/`MultiPolygon` objects with Datashader:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import geodatasets as gds\n", "import geopandas\n", "\n", "geodf = geopandas.read_file(gds.get_path('geoda health'))\n", "geodf = geodf.to_crs(epsg=4087) # simple cylindrical projection\n", "geodf['boundary'] = geodf.geometry.boundary\n", "geodf['centroid'] = geodf.geometry.centroid\n", "geodf.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Optionally, you can convert the GeoPandas GeoDataFrame to a SpatialPandas GeoDataFrame. Since version 0.16, Datashader supports direct use of `geopandas` `GeoDataFrame`s without having to convert them to `spatialpandas`. See [GeoPandas](13_Geopandas.ipynb).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "sgeodf = sp.GeoDataFrame(geodf)\n", "sgeodf.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting from SpatialPandas to GeoPandas/Shapely\n", "A `MultiPolygonArray` can be converted to a geopandas `GeometryArray` using the `to_geopandas` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "pd.Series(sgeodf.boundary.array.to_geopandas())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Individual elements of a `MultiPolygonArray` can be converted into `shapely` shapes using the `to_shapely` method" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sgeodf.geometry.array[3].to_shapely()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting Polygons with Datashader" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting as filled polygons" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Datashader can also plot polygons as filled shapes. Here’s an example using a `spatialpandas` `GeoDataFrame`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Discard the output to avoid measuring Numba compilation times\n", "tf.shade(cvs.polygons(sgeodf, geometry='geometry', agg=ds.mean('cty_pop200')));" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "cvs = ds.Canvas(plot_width=650, plot_height=400)\n", "agg = cvs.polygons(sgeodf, geometry='geometry', agg=ds.mean('cty_pop200'))\n", "tf.shade(agg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting as centroid points" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Discard the output to avoid measuring Numba compilation times\n", "cvs.points(sgeodf, geometry='centroid', agg=ds.mean('cty_pop200'));" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "agg = cvs.points(sgeodf, geometry='centroid', agg=ds.mean('cty_pop200'))\n", "tf.spread(tf.shade(agg), 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Polygon Perimeter/Area calculation\n", "\n", "The spatialpandas library provides highly optimized versions of some of the geometric operations supported by a geopandas `GeoSeries`, including parallelized [Numba](https://numba.pydata.org) implementations of the `length` and `area` properties:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "{\"MultiPolygon2dArray length\": sgeodf.geometry.array.length[:4],\n", " \"GeoPandas length\": geodf.geometry.array.length[:4],\n", " \"MultiPolygonArray area\": sgeodf.geometry.array.area[:4],\n", " \"GeoPandas area\": geodf.geometry.array.area[:4],}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Speed differences:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Duplicate 500 times\n", "sgeodf_large = pd.concat([sgeodf.geometry] * 500)\n", "geodf_large = pd.concat([geodf.geometry] * 500)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "length_ds = %timeit -o -n 3 -r 1 geodf_large.array.length\n", "length_gp = %timeit -o -n 3 -r 1 sgeodf_large.array.length\n", "print(\"\\nMultiPolygonArray.length speedup: %.2f\" % (length_ds.average / length_gp.average))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "area_ds = %timeit -o -n 3 -r 1 geodf_large.array.area\n", "area_gp = %timeit -o -n 3 -r 1 sgeodf_large.array.area\n", "print(\"\\nMultiPolygonArray.area speedup: %.2f\" % (area_ds.average / area_gp.average))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Other GeoPandas operations could be sped up similarly by [adding them to spatialpandas](https://github.com/holoviz/spatialpandas/issues/1)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parquet support\n", "\n", "SpatialPandas geometry arrays can be stored in Parquet files, which support efficient chunked columnar access that is particularly important when working with Dask for large files. To create such a file, use `.to_parquet`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sgeodf.to_parquet('sgeodf.parq')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A parquet file containing geometry arrays should be read using the `spatialpandas.io.read_parquet` function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from spatialpandas.io import read_parquet\n", "read_parquet('sgeodf.parq').head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dask support\n", "\n", "For large collections of polygons, you can use [Dask](https://dask.org) to parallelize the rendering. If you are starting with a Pandas dataframe with a geometry column or a SpatialPandas `GeoDataFrame`, just use the standard `dask.dataframe.from_pandas` method:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " If you are seeing an error when running this example, ensure that you have configured Dask for SpatialPandas compatibility, as described at the bottom of this page.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ddf = dd.from_pandas(sgeodf, npartitions=2).pack_partitions(npartitions=100).persist()\n", "\n", "tf.shade(cvs.polygons(ddf, geometry='geometry', agg=ds.mean('cty_pop200')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we've used `pack_partitions` to re-sort and re-partition the dataframe such that each partition contains geometry objects that are relatively close together in space. This partitioning makes it faster for Datashader to identify which partitions are needed in order to render a particular view, which is useful when zooming into local regions of large datasets. (This particular dataset is quite small, and unlikely to benefit, of course!)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Interactive example using HoloViews\n", "\n", "As you can see above, HoloViews can easily invoke Datashader on polygons using `rasterize`, with full interactive redrawing at each new zoom level **as long as you have a live Python process running**. The code for the world population example would be:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "out = rasterize(hv.Polygons(ddf, vdims=['cty_pop200']), aggregator=ds.sum('cty_pop200'))\n", "out.opts(width=700, height=500, tools=[\"hover\"]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, we've used a semicolon to suppress the output, because we'll actually use a more complex example with a custom callback function `cb` to update the plot title whenever you zoom in. That way, you'll be able to see the number of partitions that the spatial index has determined are needed to cover the current viewport:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def compute_partitions(el):\n", " n = ddf.cx_partitions[slice(*el.range('x')), slice(*el.range('y'))].npartitions\n", " return el.opts(title=f'Population by county (npartitions: {n})')\n", "\n", "out.apply(compute_partitions).opts(frame_width=700, frame_height=500, tools=[\"hover\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, if you zoom in with this page backed by a live Python process, you'll not only see the plot redraw at full resolution whenever you zoom in, you'll see how many partitions of the dataset are needed to render it, ranging from 100 for the full map down to 1 when zoomed into a small area.\n", "\n", "A larger example with polygons for the footprints of a million buildings in the New York City area can be run online at [PyViz examples](https://examples.pyviz.org/nyc_buildings).\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installing SpatialPandas\n", "\n", "\n", "Before running these examples, you will need spatialpandas installed with pip:\n", "\n", "```\n", "$ pip install spatialpandas\n", "```\n", "\n", "or conda:\n", "```\n", "$ conda install -c pyviz/label/dev spatialpandas\n", "```\n", "\n", "## Configuring Dask for SpatialPandas Compatibility\n", "\n", "By default, Dask version [2024.3.0](https://docs.dask.org/en/stable/changelog.html#:~:text=2024.3.0-,%C2%B6,-Released%20on%20March) enabled a new query-planning (`dask-expr`) version of their DataFrame API. SpatialPandas v0.4.11 does not yet support query-planning. Until SpatialPandas supports query planning, you can use one of the following steps to use the classic Dask DataFrame API:\n", "\n", "#### Method 1: Dask Configuration\n", "\n", "Run the following command in your terminal:\n", "\n", "```bash\n", "$ dask config set dataframe.query-planning False\n", "```\n", "\n", "#### Method 2: Environment Variable\n", "Set the environment variable by running:\n", "\n", "```bash\n", "$ export DASK_DATAFRAME__QUERY_PLANNING=False\n", "```" ] } ], "metadata": { "language_info": { "name": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 4 }