{ "cells": [ { "cell_type": "markdown", "id": "4f72dbcb-b44b-455d-a3f9-59325dab619c", "metadata": {}, "source": [ "# Tutorial One - Data Preparation\n", "\n", "This tutorial gives an overview of two methods of preparing sample data for use in the tutorial notebooks demonstrating the use of each scoring function. The result of this tutorial will be two files - \"forecast_grid.nc\" and \"analysis_grid.nc\". The two methods covered will be \n", "\n", "1. data download from the Austration National Computational Infrastructure, \n", "2. generation of synthetic data. \n", "\n", "Weather prediction data is typically large in size, and comes in many formats and structures. Common structures will include:\n", "\n", " - a single value for a prediction such as predicted maximum temperature (e.g. \"The forecast for your town tomorrow is a maximum of 25 degrees\"); \n", " - a data array representing that value across a geographical area (such as a region, a country or the whole planet); \n", " - a data array representing multiple possible predicted values (arising from ensemble predictions), or an expected value and confidence intervals. \n", "\n", "There is often also a time dimension to the data.\n", "\n", "The data in the `scores` tutorials starts with typical numerical model data outputs and ensembles. This will be made clearer with an example.\n", "\n", "Downloaded data is usually covered by a license agreement. You should check the license agreements for the downloaded data in these examples." 
] }, { "cell_type": "code", "execution_count": 1, "id": "269aa167-afdb-402c-9fe1-0f5b82e69eb4", "metadata": { "tags": [] }, "outputs": [], "source": [ "import hashlib # Used to check the downloaded file is what is expected\n", "import matplotlib # Used to improve plot outputs\n", "import numpy # Used if generating synthetic data\n", "import os # Used to check if files are already downloaded\n", "import pandas # Used if generating synthetic data\n", "import requests # Used to retrieve files\n", "import xarray # Used for opening and inspecting the data" ] }, { "cell_type": "markdown", "id": "806f1c25-6af1-4509-8ca9-919c6b1be5db", "metadata": {}, "source": [ "## Method One - Downloaded Data from the Australian National Computational Infrastructure\n", "\n", "- This information was collected on 13 April 2023\n", "- Information about the data can be found at geonetwork.nci.org.au/geonetwork/srv/eng/catalog.search#metadata/f5394_0782_5339_8313\n", "- The DOI for this dataset is https://dx.doi.org/10.25914/608a9940ce85d\n", "- The license listed for this dataset is Creative Commons Attribution 4.0 International (https://creativecommons.org/licenses/by/4.0)\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "d15e0ef8-f445-465a-8f60-2d37c921fffd", "metadata": { "tags": [] }, "outputs": [], "source": [ "# This is a basic file fetch only. Do not rely on it for untrusted data. 
A basic check has been included to make sure the\n", "# downloaded file matches what was expected.\n", "\n", "# Increase chunk_size if the fetch is slow\n", "def basic_fetch(url, filename, expected_hash, *, chunk_size=8192):\n", "    if os.path.exists(filename):\n", "        print(\"File already exists, skipping download\")\n", "    else:\n", "        digest = hashlib.sha256()\n", "        with requests.get(url, stream=True) as r:\n", "            r.raise_for_status()\n", "            total_size = int(r.headers.get(\"content-length\", 0))\n", "            downloaded = 0\n", "            print(f\"downloading {url} -> {filename}\\n(size={total_size} bytes), please wait a moment...\")\n", "            with open(filename, 'wb') as f:\n", "                for chunk in r.iter_content(chunk_size):\n", "                    f.write(chunk)\n", "                    # update digest on the fly\n", "                    digest.update(chunk)\n", "                    downloaded += len(chunk)\n", "        print(\"Download complete, checking digest...\")\n", "\n", "        # check sha256 hash against the expected value\n", "        found_hash = digest.hexdigest()\n", "        print(f\"Hash validation:\\n-> expected | {expected_hash}\\n-> found | {found_hash}\")\n", "        if found_hash != expected_hash:\n", "            os.remove(filename)\n", "            print(\"File has unexpected contents. 
The file has been removed - please download manually and check the data carefully\")\n", "    print(\"Done.\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "ffc12f9a-e179-4907-a502-08b833b36bf4", "metadata": { "tags": [] }, "outputs": [], "source": [ "forecast_url = 'https://thredds.nci.org.au/thredds/fileServer/wr45/ops_aps3/access-g/1/20221120/0000/fc/sfc/temp_scrn.nc'\n", "forecast_hash = '7956d95ea3a7edee2a01c989b1f9e089199da5b1924b4c2d4611088713fbcb44' # Recorded on 13/5/2023\n", "analysis_url = 'https://thredds.nci.org.au/thredds/fileServer/wr45/ops_aps3/access-g/1/20221124/0000/an/sfc/temp_scrn.nc'\n", "analysis_hash = '163c5de55e721ad2a76518242120044bedfec805e3397cfb0008435521630042' # Recorded on 13/5/2023" ] }, { "cell_type": "code", "execution_count": 4, "id": "842010cb-a826-4526-939c-40ebc3d67bb2", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.73 s, sys: 1.94 s, total: 5.67 s\n", "Wall time: 13.5 s\n" ] } ], "source": [ "%%time\n", "basic_fetch(forecast_url, 'forecast_grid.nc', forecast_hash)" ] }, { "cell_type": "code", "execution_count": 5, "id": "3e9a31ab-da4d-4867-a209-bf87d5aa7786", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 56.4 ms, sys: 11.4 ms, total: 67.8 ms\n", "Wall time: 439 ms\n" ] } ], "source": [ "%%time\n", "basic_fetch(analysis_url, 'analysis_grid.nc', analysis_hash)" ] }, { "cell_type": "code", "execution_count": 7, "id": "cc91de18-db74-48e1-b77d-862a0bb63b07", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<xarray.DataArray 'temp_scrn' (time: 240, lat: 1536, lon: 2048)>\n",
"[754974720 values with dtype=float32]\n",
"Coordinates:\n",
" * time (time) datetime64[ns] 2022-11-20T01:00:00 ... 2022-11-30\n",
" * lat (lat) float64 89.94 89.82 89.71 89.59 ... -89.71 -89.82 -89.94\n",
" * lon (lon) float64 0.08789 0.2637 0.4395 0.6152 ... 359.6 359.7 359.9\n",
"Attributes:\n",
" grid_type: spatial\n",
" level_type: single\n",
" units: K\n",
" long_name: screen level temperature\n",
" stash_code: 3236\n",
" accum_type: instantaneous<xarray.DataArray 'temp_scrn' (time: 1, lat: 1536, lon: 2048)>\n",
"[3145728 values with dtype=float32]\n",
"Coordinates:\n",
" * time (time) datetime64[ns] 2022-11-24\n",
" * lat (lat) float64 89.94 89.82 89.71 89.59 ... -89.71 -89.82 -89.94\n",
" * lon (lon) float64 0.08789 0.2637 0.4395 0.6152 ... 359.6 359.7 359.9\n",
"Attributes:\n",
" grid_type: spatial\n",
" level_type: single\n",
" units: K\n",
" long_name: screen level temperature\n",
" stash_code: 3236\n",
" accum_type: instantaneous<xarray.Dataset>\n",
"Dimensions: (time: 1, lat: 1536, lon: 2048)\n",
"Coordinates:\n",
" * time (time) datetime64[ns] 2022-11-24\n",
" * lat (lat) float64 -90.0 -89.88 -89.77 -89.65 ... 89.77 89.88 90.0\n",
" * lon (lon) float64 0.0 0.1759 0.3517 0.5276 ... 359.6 359.8 360.0\n",
"Data variables:\n",
" temp_scrn (time, lat, lon) float64 24.77 24.44 24.86 ... 24.07 24.39 24.2<xarray.Dataset>\n",
"Dimensions: (time: 240, lat: 1536, lon: 2048)\n",
"Coordinates:\n",
" * time (time) datetime64[ns] 2022-11-20T01:00:00 ... 2022-11-30\n",
" * lat (lat) float64 -90.0 -89.88 -89.77 -89.65 ... 89.77 89.88 90.0\n",
" * lon (lon) float64 0.0 0.1759 0.3517 0.5276 ... 359.6 359.8 360.0\n",
"Data variables:\n",
" temp_scrn (time, lat, lon) float64 26.77 26.44 26.86 ... 18.07 18.39 18.2<xarray.DataArray 'temp_scrn' ()>\n",
"array(26.49989251)\n",
"Coordinates:\n",
" time datetime64[ns] 2022-11-20T01:00:00<xarray.DataArray 'temp_scrn' ()>\n",
"array(23.31997619)\n",
"Coordinates:\n",
" time datetime64[ns] 2022-11-24