{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# 2.2 Data Formats\n", "\n", "In this tutorial, we will manipulate the data structure from and to several data formats.\n", "\n", "\n", "## JSON (JavaScript Object Notation)\n", "JSON is a lightweight, human-readable data format used in web applications and APIs for data exchange. SON is used to store metadata, configuration files, and small datasets, particularly when working with web-based applications or interacting with APIs (e.g., querying weather data or geospatial information from an API).\n", "Here is a simple example:\n", "```json\n", "{\n", " \"location\": \"Yellowstone\",\n", " \"coordinates\": {\n", " \"latitude\": 44.423691,\n", " \"longitude\": -110.588516\n", " },\n", " \"elevation_m\": 2399,\n", " \"temperature_c\": 22.5\n", "}\n", "\n", "```\n", "The character encoding is UTF-8. The data types in JSON files may be numbers, string, boolean, array, object (collection of name-value pairs), or null. More information on JSON from the [EarthDataScience course](!https://www.earthdatascience.org/courses/use-data-open-source-python/intro-to-apis/apis-in-python/).\n", "\n", "In Python, you create simply a JSON file with the JSON library:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "data = {\n", " \"location\": \"Yellowstone\",\n", " \"coordinates\": {\"latitude\": 44.423691, \"longitude\": -110.588516},\n", " \"elevation_m\": 2399,\n", " \"temperature_c\": 22.5\n", "}\n", "\n", "with open('data.json', 'w') as outfile:\n", " json.dump(data, outfile)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tabular Data Formats \n", "\n", "Tabular data are common in geosciences. The *CSV* format is convienent, human-readable files for *small data sets* that store data in rows. The *Parquet* format is machine-readable for *large data sets* that supports compression.\n", "\n", "### **CSV (Comma-Separated Values)**:\n", "\n", "CSV is a simple, widely-used format for tabular data, often used in geoscience for sharing and storing smaller datasets (e.g., soil samples, environmental readings).\n", "It stores data as plain text, making it easy to read but can be inefficient for large datasets. It is most often read using ``pandas``." 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Reading a CSV file into a DataFrame\n", "df = pd.read_csv('data.csv')\n", "\n", "# Writing a DataFrame back to a CSV file\n", "df.to_csv('output.csv', index=False)\n" ] },
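{ "cell_type": "markdown", "metadata": {}, "source": [ "Because CSV stores everything as plain text, column types are inferred when the file is read back. A quick check (sketch):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Inspect the dtypes that pandas inferred from the plain-text file\n", "print(df.dtypes)" ] },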
{ "cell_type": "markdown", "metadata": {}, "source": [ "### **Parquet**:\n", "Parquet is a binary, columnar storage format optimized for efficiency, particularly for large datasets. It is widely used in big-data environments (e.g., storing satellite imagery or climate model outputs)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Writing the dictionary data to a Parquet file\n", "df = pd.DataFrame([data])\n", "df.to_parquet('data.parquet', index=False)\n", "\n", "# Reading the Parquet file\n", "df_read = pd.read_parquet('data.parquet')\n", "print(df_read)" ] },
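{ "cell_type": "markdown", "metadata": {}, "source": [ "Because Parquet is columnar, a reader can load just the columns it needs instead of scanning entire rows. A minimal sketch, using the column names of the table we just wrote:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read back only two columns; the other columns are never loaded\n", "subset = pd.read_parquet('data.parquet', columns=['location', 'elevation_m'])\n", "print(subset)" ] },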
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Geospatial Data\n", "The main formats that support pixelized raster data are:\n", "- **GeoTIFF**: a metadata standard that embeds georeferencing information in a TIFF (Tagged Image File Format) file. A Cloud Optimized GeoTIFF variant exists for efficient access on cloud storage.\n", "- **GeoJSON**: a format for encoding a variety of geographic data structures in JSON.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests, zipfile, os, io\n", "import folium\n", "import geopandas as gpd\n", "import h5py\n", "import matplotlib.pyplot as plt\n", "import netCDF4 as nc\n", "import numpy as np\n", "import pandas as pd\n", "import rasterio\n", "import wget\n", "\n", "from folium.plugins import MarkerCluster\n", "from rasterio.mask import mask\n", "from rasterio.plot import show" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Raster data\n", "\n", "### 1.1 rasterio to read GeoTIFF\n", "\n", "Raster data is any pixelated (or gridded) data where each pixel is associated with a specific geographical location. The value of a pixel can be continuous (e.g., elevation) or categorical (e.g., land use).\n", "\n", "The Python package ``rasterio`` ([documentation](https://rasterio.readthedocs.io/en/latest/)) reads raster formats such as ``GeoTIFF``; vector formats such as ``GeoJSON`` are covered with ``geopandas`` below.\n", "\n", "See additional introductory materials from [EarthDataScience](https://www.earthdatascience.org/courses/use-data-open-source-python/intro-raster-data-python/), and tutorials from the [GeoHackweek](https://geohackweek.github.io/raster/).\n", "\n", "\n", "\n", "We will download topography files found on this [page](https://www.naturalearthdata.com/downloads/50m-raster-data/50m-cross-blend-hypso/), but stored in a Dropbox folder.\n", "\n", "The file, `HYP_50M_SR`, is zipped.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Download the data using wget; dl=1 asks Dropbox for the file itself\n", "fname = 'HYP_50M_SR'\n", "wget.download(\"https://www.dropbox.com/s/j5lxhd8uxrtsxko/\"+str(fname)+\"?dl=1\")\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The downloaded file lands in the current working directory; we move it into a ``data`` folder:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Make sure the data folder exists, then move the zip archive into it\n", "os.makedirs('./data', exist_ok=True)\n", "os.replace(fname+\".zip\", './data/'+fname+\".zip\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Unzip the file:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "os.makedirs(\"./data/\"+fname+\"/\", exist_ok=True)\n", "z = zipfile.ZipFile('./data/'+fname+\".zip\")\n", "z.extractall(\"./data/\"+fname+\"/\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now let's get the Digital Elevation Map. We open the unzipped file using the package ``rasterio``." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "elevation = rasterio.open(\"./data/\"+fname+\"/\"+fname+\".tif\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Dataset metadata: driver, dtype, dimensions, CRS, and affine transform\n", "print(elevation.meta)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Let us have a look at the dimensions of the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "elevation.height" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "elevation.width" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "elevation.indexes" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Can you guess how to query the data types of the file's bands?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# type below" ] },
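{ "cell_type": "markdown", "metadata": {}, "source": [ "One possible answer: ``rasterio`` datasets expose a ``dtypes`` attribute, with one entry per band." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Data type of each band\n", "print(elevation.dtypes)" ] },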
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now look at the boundaries of the dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "elevation.bounds" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(elevation.transform * (0, 0)) # North West corner\n", "print(elevation.transform * (elevation.width, elevation.height)) # South East corner" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Here is the projection used for the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "elevation.crs" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "How to interpret the data: there are three bands, one for each of the colors red, green, and blue:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(elevation.colorinterp[0])\n", "print(elevation.colorinterp[1])\n", "print(elevation.colorinterp[2])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Value range of each band\n", "print(np.min(elevation.read(1)), np.max(elevation.read(1)))\n", "print(np.min(elevation.read(2)), np.max(elevation.read(2)))\n", "print(np.min(elevation.read(3)), np.max(elevation.read(3)))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Let us now plot the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "image = elevation.read()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "show(image)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Geopandas to read GeoJSON\n", "\n", "Rasters are not the only kind of geospatial data we can read in Python. Let us look at an example of reading vector data from a GeoJSON file (a special case of a JSON file with geographical coordinates) using ``geopandas``." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = 'https://www.nps.gov/lib/npmap.js/4.0.0/examples/data/national-parks.geojson'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "parks = gpd.read_file(url)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "parks.head()" ] },
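{ "cell_type": "markdown", "metadata": {}, "source": [ "A few quick sanity checks on the vector layer; a minimal sketch of common ``geopandas`` inspections:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Coordinate reference system of the layer\n", "print(parks.crs)\n", "# Geometry types and number of features\n", "print(parks.geometry.geom_type.value_counts())\n", "print(len(parks), 'features')" ] },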
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Let us plot the data.\n", "\n", "Folium is a nice Python package for visualization. The [Geohackweek tutorial on Folium](https://github.com/geohackweek/tutorial_contents/blob/master/visualization/notebooks/foliumTutorial.ipynb) is also informative.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m = folium.Map(location=[40, -100], zoom_start=4)\n", "folium.GeoJson(parks).add_to(m)\n", "m" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We are going to focus on the parks in Washington State:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Indices of the Washington State parks in this dataset\n", "parks_WA = parks.iloc[[94, 127, 187, 228, 286, 294, 295, 297, 299, 300, 302]].reset_index()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We create a list of locations to add popups to the map." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "locations = []\n", "for index in range(0, len(parks_WA)):\n", "    location = [parks_WA['geometry'][index].y, parks_WA['geometry'][index].x]\n", "    locations.append(location)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m = folium.Map(location=[47, -121], zoom_start=7)\n", "marker_cluster = MarkerCluster().add_to(m)\n", "for point in range(0, len(locations)):\n", "    folium.Marker(location = locations[point], popup=parks_WA['Name'].iloc[point]).add_to(marker_cluster)\n", "m" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Hierarchical formats: NETCDF4 & HDF5\n", "\n", "Hierarchical data formats are designed to store large amounts of data in a single file. They mimic a file system (i.e., a tree-like structure of nested directories) inside a single file. There are two dominant hierarchical data formats (HDF5 and NETCDF4), and one emerging for the cloud (Zarr). Hierarchical formats can in general store many data types (numeric and string).\n", "\n", "## HDF5\n", "The Hierarchical Data Format version 5 (HDF5) is an open-source file format that supports large, complex, heterogeneous data. HDF5 uses a \"file directory\"-like structure that allows you to organize data within the file in many different structured ways, as you might do with files on your computer. The format is also _self-describing_: all elements (the file itself, groups, and datasets) can carry embedded metadata that describes the information they contain.\n", "\n", "\n", "HDF structure example:\n", "- Datasets, which are typed multidimensional arrays\n", "- Groups, which are container structures that can hold datasets and other groups\n", "\n", "\n", "Figure: HDF5 data example, from [neonscience](https://www.neonscience.org/resources/learning-hub/tutorials/about-hdf5).\n" ] },
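{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the group/dataset structure concrete, here is a minimal sketch with ``h5py`` (the file name ``example.h5``, the group name, and the attribute are all arbitrary choices for illustration):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Groups act like directories, datasets like files; both can carry metadata\n", "with h5py.File('example.h5', 'w') as f:\n", "    grp = f.create_group('yellowstone')\n", "    dset = grp.create_dataset('temperature', data=np.random.rand(4, 5))\n", "    dset.attrs['units'] = 'degC'  # self-describing metadata\n", "\n", "with h5py.File('example.h5', 'r') as f:\n", "    print(f['yellowstone/temperature'][:])\n", "    print(f['yellowstone/temperature'].attrs['units'])" ] },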
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## NetCDF\n", "\n", "The Network Common Data Form, or **netCDF**, was created in the early 1990s to address some of the challenges of working with N-dimensional arrays. NetCDF is a collection of self-describing, machine-independent binary data formats and software tools that facilitate the creation, access, and sharing of scientific data stored in N-dimensional arrays, along with metadata describing the contents of each array. NetCDF was built by the climate science community at a time when regional climate models were beginning to produce larger and larger output files. NetCDF version 4 is built on top of HDF5: a netCDF-4 file is also a valid HDF5 file.\n", "\n", "\n", "### Handling large arrays\n", "The netCDF and HDF5 formats place no limit on file size. However, any analysis tool that reads data from a netCDF array into memory for a computational operation is limited by that machine's available memory.\n", "\n", "### But slow at I/O\n", "When reading a hierarchical file, the whole tree of the data structure is scanned from the root node down. Since this happens for every inquiry, reading HDF5 and netCDF files is **slow**. Faster, cloud-optimized alternatives such as Zarr (see below) address this.\n", "\n", "\n", "We will download a geological map stored in netCDF format on Dropbox. The original data can be found in the [USGS database](https://www.sciencebase.gov/catalog/item/5cfeb4cce4b0156ea5645056) (https://doi.org/10.3133/ofr20191081)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Download the geological framework\n", "file1 = wget.download(\"https://www.dropbox.com/s/wdb25puxh3u07dj/NCM_GeologicFrameworkGrids.nc?dl=1\")\n", "# Download the coordinate grids\n", "file2 = wget.download(\"https://www.dropbox.com/s/wdb25puxh3u07dj/NCM_SpatialGrid.nc?dl=1\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Move the files into the data folder\n", "os.replace(file1, './data/'+file1)\n", "os.replace(file2, './data/'+file2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read the two netCDF files\n", "geology = nc.Dataset('./data/'+file1)\n", "grid = nc.Dataset('./data/'+file2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "geology" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "geology['Surface Elevation']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.shape(geology['Surface Elevation'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "geology['Surface Elevation'][3246, 1234]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = grid['x'][0:4901, 0:3201]\n", "y = grid['y'][0:4901, 0:3201]\n", "elevation = geology['Surface Elevation'][0:4901, 0:3201]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.contourf(x, y, elevation)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Zarr\n", "\n", "\n", "Zarr is a cloud-optimized data format that handles heterogeneous data sets.\n", "\n", "In the following exercise, we will use the Xarray open data set ``air_temperature`` and save it into a Zarr store.\n", "\n", "Let's work in groups to:\n", "- Download the xarray data\n", "- Save it to file, and report the write time and the size of the data set\n", "- Read it back, and compare the write and read times" ] } ], "metadata": { "kernelspec": { "display_name": "mlgeo", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }