{ "cells": [ { "cell_type": "markdown", "id": "61a040ce-418a-4a97-9850-0813cfb02422", "metadata": {}, "source": [ "## Accessing GBIF data with the Planetary Computer STAC API\n", "\n", "This notebook provides an example of accessing [Global Biodiversity Information Facility](https://planetarycomputer.microsoft.com/dataset/gbif) (GBIF) occurrence data from the Planetary Computer STAC API. Periodic snapshots of the data are stored in Parquet format." ] }, { "cell_type": "code", "execution_count": 1, "id": "377c1939-0449-4a4a-b2e1-5927a438e282", "metadata": {}, "outputs": [], "source": [ "import pystac_client\n", "import planetary_computer" ] }, { "cell_type": "markdown", "id": "8843db71-467d-4537-837c-844b30f0cf68", "metadata": {}, "source": [ "To access the data stored in Azure Blob Storage, we'll use the Planetary Computer's [STAC API](https://planetarycomputer.microsoft.com/api/stac/v1/docs)." ] }, { "cell_type": "code", "execution_count": 2, "id": "f456e340-2e97-4534-9548-28e94b03806b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['gbif-2022-10-01',\n", " 'gbif-2022-09-01',\n", " 'gbif-2022-08-01',\n", " 'gbif-2022-07-01',\n", " 'gbif-2022-06-01',\n", " 'gbif-2022-05-01',\n", " 'gbif-2022-04-01',\n", " 'gbif-2022-03-01',\n", " 'gbif-2022-02-01',\n", " 'gbif-2022-01-01',\n", " 'gbif-2021-12-01',\n", " 'gbif-2021-11-01',\n", " 'gbif-2021-10-01',\n", " 'gbif-2021-09-01',\n", " 'gbif-2021-08-01',\n", " 'gbif-2021-07-01',\n", " 'gbif-2021-06-01',\n", " 'gbif-2021-04-13']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "catalog = pystac_client.Client.open(\n", " \"https://planetarycomputer.microsoft.com/api/stac/v1\",\n", " modifier=planetary_computer.sign_inplace,\n", ")\n", "search = catalog.search(collections=[\"gbif\"])\n", "items = search.item_collection()\n", "items = {x.id: x for x in items}\n", "list(items)" ] }, { "cell_type": "markdown", "id": "15a7e8ac-5128-49f7-8781-96c578044b99", "metadata": {}, 
"source": [ "We'll take the most recent item, which is listed first." ] }, { "cell_type": "code", "execution_count": 4, "id": "1a786e42-a134-48a9-b8d4-1046b8c78556", "metadata": {}, "outputs": [], "source": [ "item = list(items.values())[0]" ] }, { "cell_type": "markdown", "id": "435ec017-8d3c-49a1-8c03-748ec2ba2613", "metadata": {}, "source": [ "We'll use [Dask](https://docs.dask.org/en/latest/) to read the partitioned Parquet Dataset." ] }, { "cell_type": "code", "execution_count": 10, "id": "aeee8190-2183-4600-8820-c3d323270488", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Dask DataFrame Structure: npartitions=1960, 50 columns from gbifid through issue (the text/plain output that follows lists every column and dtype). Dask Name: read-parquet, 1 graph layer
" ], "text/plain": [ "Dask DataFrame Structure:\n", " gbifid datasetkey occurrenceid kingdom phylum class order family genus species infraspecificepithet taxonrank scientificname verbatimscientificname verbatimscientificnameauthorship countrycode locality stateprovince occurrencestatus individualcount publishingorgkey decimallatitude decimallongitude coordinateuncertaintyinmeters coordinateprecision elevation elevationaccuracy depth depthaccuracy eventdate day month year taxonkey specieskey basisofrecord institutioncode collectioncode catalognumber recordnumber identifiedby dateidentified license rightsholder recordedby typestatus establishmentmeans lastinterpreted mediatype issue\n", "npartitions=1960 \n", " int64 object object object object object object object object object object object object object object object object object object int32 object float64 float64 float64 float64 float64 float64 float64 float64 datetime64[ns] int32 int32 int32 int32 int32 object object object object object object datetime64[ns] object object object object object datetime64[ns] object object\n", " ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...\n", "... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...\n", " ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...\n", " ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 
...\n", "Dask Name: read-parquet, 1 graph layer" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import dask.dataframe as dd\n", "\n", "asset = item.assets[\"data\"]\n", "df = dd.read_parquet(\n", " asset.href,\n", " storage_options=asset.extra_fields[\"table:storage_options\"],\n", " parquet_file_extension=None,\n", " arrow_to_pandas=dict(timestamp_as_object=True),\n", ")\n", "df" ] }, { "cell_type": "markdown", "id": "2141fc59-b96c-4e87-a594-3ce15e86fa9d", "metadata": {}, "source": [ "As indicated by `npartitions`, this Parquet dataset is made up of many individual Parquet files. We can read in a specific partition with `.get_partition`." ] }, { "cell_type": "code", "execution_count": 11, "id": "4cc8cfbc-23db-4b55-9a94-2840f2f0f0e0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
pandas DataFrame preview of partition 0: 1947848 rows × 50 columns, from gbifid through issue (the text/plain output that follows shows the first and last five rows).
" ], "text/plain": [ " gbifid datasetkey \\\n", "0 2141788029 4fa7b334-ce0d-4e88-aaae-2e0c138d049e \n", "1 2142511629 4fa7b334-ce0d-4e88-aaae-2e0c138d049e \n", "2 2126155116 4fa7b334-ce0d-4e88-aaae-2e0c138d049e \n", "3 2143587663 4fa7b334-ce0d-4e88-aaae-2e0c138d049e \n", "4 2152055267 4fa7b334-ce0d-4e88-aaae-2e0c138d049e \n", "... ... ... \n", "1947843 1674906857 9e932f70-0c61-11dd-84ce-b8a03c50a862 \n", "1947844 2012826837 9e932f70-0c61-11dd-84ce-b8a03c50a862 \n", "1947845 1674906890 9e932f70-0c61-11dd-84ce-b8a03c50a862 \n", "1947846 1675104176 9e932f70-0c61-11dd-84ce-b8a03c50a862 \n", "1947847 1674906836 9e932f70-0c61-11dd-84ce-b8a03c50a862 \n", "\n", " occurrenceid kingdom \\\n", "0 URN:catalog:CLO:EBIRD:OBS590769735 Animalia \n", "1 URN:catalog:CLO:EBIRD:OBS591457037 Animalia \n", "2 URN:catalog:CLO:EBIRD:OBS592106103 Animalia \n", "3 URN:catalog:CLO:EBIRD:OBS592831487 Animalia \n", "4 URN:catalog:CLO:EBIRD:OBS593603038 Animalia \n", "... ... ... \n", "1947843 urn:lsid:slu.aqua.rom.sers:ObservedProperty:14... Animalia \n", "1947844 urn:lsid:slu.aqua.rom.sers:ObservedProperty:14... Animalia \n", "1947845 urn:lsid:slu.aqua.rom.sers:ObservedProperty:14... Animalia \n", "1947846 urn:lsid:slu.aqua.rom.sers:ObservedProperty:14... Animalia \n", "1947847 urn:lsid:slu.aqua.rom.sers:ObservedProperty:14... Animalia \n", "\n", " phylum class order family \\\n", "0 Chordata Aves Passeriformes Polioptilidae \n", "1 Chordata Aves Columbiformes Columbidae \n", "2 Chordata Aves Accipitriformes Pandionidae \n", "3 Chordata Aves Passeriformes Icteridae \n", "4 Chordata Aves Passeriformes Regulidae \n", "... ... ... ... ... \n", "1947843 Chordata Actinopterygii Perciformes Percidae \n", "1947844 Arthropoda Malacostraca Decapoda Astacidae \n", "1947845 Chordata Actinopterygii Salmoniformes Salmonidae \n", "1947846 Chordata Actinopterygii Salmoniformes Salmonidae \n", "1947847 Chordata Actinopterygii Perciformes Percidae \n", "\n", " genus species ... 
identifiedby \\\n", "0 Polioptila Polioptila caerulea ... [] \n", "1 Patagioenas Patagioenas cayennensis ... [] \n", "2 Pandion Pandion haliaetus ... [] \n", "3 Quiscalus Quiscalus quiscula ... [] \n", "4 Regulus Regulus calendula ... [] \n", "... ... ... ... ... \n", "1947843 Perca Perca fluviatilis ... [] \n", "1947844 Astacus Astacus astacus ... [] \n", "1947845 Salmo Salmo trutta ... [] \n", "1947846 Thymallus Thymallus thymallus ... [] \n", "1947847 Perca Perca fluviatilis ... [] \n", "\n", " dateidentified license rightsholder \\\n", "0 None CC_BY_4_0 None \n", "1 None CC_BY_4_0 None \n", "2 None CC_BY_4_0 None \n", "3 None CC_BY_4_0 None \n", "4 None CC_BY_4_0 None \n", "... ... ... ... \n", "1947843 None CC0_1_0 None \n", "1947844 None CC0_1_0 None \n", "1947845 None CC0_1_0 None \n", "1947846 None CC0_1_0 None \n", "1947847 None CC0_1_0 None \n", "\n", " recordedby typestatus \\\n", "0 [{'array_element': 'obsr233099'}] [] \n", "1 [{'array_element': 'obsr767103'}] [] \n", "2 [{'array_element': 'obsr370369'}] [] \n", "3 [{'array_element': 'obsr29404'}] [] \n", "4 [{'array_element': 'obsr383648'}] [] \n", "... ... ... \n", "1947843 [{'array_element': 'Fiskeriverkets utredningsk... [] \n", "1947844 [{'array_element': 'Fiskeriverkets utredningsk... [] \n", "1947845 [{'array_element': 'Fiskeriverkets utredningsk... [] \n", "1947846 [{'array_element': 'Fiskeriverkets utredningsk... [] \n", "1947847 [{'array_element': 'Konsult'}] [] \n", "\n", " establishmentmeans lastinterpreted mediatype \\\n", "0 None 2022-09-08 14:29:55.344000 [] \n", "1 None 2022-09-08 14:29:56.715000 [] \n", "2 None 2022-09-08 14:29:58.118000 [] \n", "3 None 2022-09-08 14:29:59.965000 [] \n", "4 None 2022-09-08 14:30:01.862000 [] \n", "... ... ... ... 
\n", "1947843 None 2022-09-25 05:23:40.367000 [] \n", "1947844 None 2022-09-25 05:23:40.367000 [] \n", "1947845 None 2022-09-25 05:23:40.367000 [] \n", "1947846 None 2022-09-25 05:23:40.367000 [] \n", "1947847 None 2022-09-25 05:23:40.367000 [] \n", "\n", " issue \n", "0 [] \n", "1 [{'array_element': 'COORDINATE_ROUNDED'}] \n", "2 [] \n", "3 [] \n", "4 [] \n", "... ... \n", "1947843 [{'array_element': 'COORDINATE_ROUNDED'}] \n", "1947844 [{'array_element': 'COORDINATE_ROUNDED'}] \n", "1947845 [{'array_element': 'COORDINATE_ROUNDED'}] \n", "1947846 [{'array_element': 'COORDINATE_ROUNDED'}] \n", "1947847 [{'array_element': 'COORDINATE_ROUNDED'}] \n", "\n", "[1947848 rows x 50 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chunk = df.get_partition(0).compute()\n", "chunk" ] }, { "cell_type": "markdown", "id": "8a3a79a9-8ec1-4e17-9844-50b19a12be61", "metadata": {}, "source": [ "To get a sense for the most commonly observed species, we'll group the dataset and get the count of each species." 
] }, { "cell_type": "code", "execution_count": 12, "id": "31d9b0f4-2d61-40dc-8465-2ae039401fac", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "kingdom phylum class family genus species \n", "Animalia Chordata Actinopterygii Salmonidae Salmo Salmo trutta 36867\n", " Cyprinidae Phoxinus Phoxinus phoxinus 16818\n", " Aves Sturnidae Sturnus Sturnus vulgaris 12594\n", " Actinopterygii Salmonidae Salmo Salmo salar 12313\n", " Aves Passeridae Passer Passer domesticus 12047\n", " Turdidae Turdus Turdus migratorius 12024\n", " Corvidae Corvus Corvus brachyrhynchos 11941\n", " Actinopterygii Cottidae Cottus Cottus gobio 11923\n", " Aves Columbidae Zenaida Zenaida macroura 11370\n", " Actinopterygii Lotidae Lota Lota lota 11167\n", " Aves Cardinalidae Cardinalis Cardinalis cardinalis 10779\n", " Anatidae Anas Anas platyrhynchos 10139\n", " Corvidae Cyanocitta Cyanocitta cristata 10009\n", " Emberizidae Melospiza Melospiza melodia 9357\n", " Anatidae Branta Branta canadensis 8980\n", "Name: species, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chunk.groupby([\"kingdom\", \"phylum\", \"class\", \"family\", \"genus\"])[\n", " \"species\"\n", "].value_counts().sort_values(ascending=False).head(15)" ] }, { "cell_type": "markdown", "id": "100d4a53-a1cf-4812-899c-247e754f3bf3", "metadata": {}, "source": [ "Let's create a map with the number of unique species per country. First, we'll group by country code and compute the number of unique species (per country)." ] }, { "cell_type": "code", "execution_count": 13, "id": "e06d8774-c648-44d4-9994-32acc129d2b4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "countrycode\n", "AD 4\n", "AE 208\n", "AF 10\n", "AG 18\n", "AI 11\n", " ... 
\n", "YT 1\n", "ZA 482\n", "ZM 111\n", "ZW 117\n", "ZZ 15\n", "Name: species, Length: 239, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "species_per_country = chunk.groupby(\"countrycode\").species.nunique()\n", "species_per_country" ] }, { "cell_type": "markdown", "id": "41e1c79c-1fb6-4c98-b250-d0af3eea69ba", "metadata": {}, "source": [ "Finally, we can plot the counts on a map using geopandas, by joining `species_per_country` to a dataset with country boundaries." ] }, { "cell_type": "code", "execution_count": 14, "id": "6af8d4a4-e491-422f-b33d-c4b64b2f2c57", "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import geopandas\n", "import cartopy\n", "\n", "countries = geopandas.read_file(\n", " \"https://raw.githubusercontent.com/datasets/geo-countries/master/data/countries.geojson\"\n", ")\n", "gdf = geopandas.GeoDataFrame(\n", " countries.merge(species_per_country, left_on=\"ISO_A2\", right_index=True)\n", ")\n", "crs = cartopy.crs.Robinson()\n", "ax = gdf.to_crs(crs.proj4_init).plot(\n", " column=\"species\", legend=True, scheme=\"natural_breaks\", k=5, figsize=(15, 15)\n", ")\n", "ax.set_axis_off()" ] }, { "cell_type": "markdown", "id": "91946b8e-db90-4bfa-8960-09b91c6e3b73", "metadata": { "tags": [] }, "source": [ "### Working with the full dataset\n", "\n", "Thus far, we've just used a single partition from the full GBIF dataset. All of the examples shown in this notebook work on the entire dataset using `dask.dataframe` to read in the Parquet dataset.\n", "\n", "You might want to create a cluster to process the data in parallel on many machines.\n", "\n", "```python\n", "from dask_gateway import GatewayCluster\n", "\n", "cluster = GatewayCluster()\n", "cluster.scale(16)\n", "client = cluster.get_client()\n", "```\n", "\n", "Then use `dask.dataframe.read_parquet` to read in the files. 
To speed things up even more, we'll specify a subset of columns to read in.\n", "\n", "```python\n", "df = dd.read_parquet(\n", " asset.href,\n", " columns=[\"countrycode\", \"species\"],\n", " storage_options=asset.extra_fields[\"table:storage_options\"],\n", " parquet_file_extension=None,\n", ")\n", "```\n", "\n", "Now you can repeat the species-per-country computation above, replacing `chunk` with `df` and calling `.compute()` on the result.\n", "\n", "### Next Steps\n", "\n", "Now that you've had an introduction to the GBIF dataset, learn more with:\n", "\n", "* The [Reading tabular data quickstart](https://planetarycomputer.microsoft.com/docs/quickstarts/reading-tabular-data/) for an introduction to tabular data on the Planetary Computer\n", "* [Scale with Dask](https://planetarycomputer.microsoft.com/docs/quickstarts/scale-with-dask/) for more on using Dask to work with large datasets" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }