Siphon (remote_open)

Unidata AMS 2021 Student Conference

\n", "\n", "---\n", "\n", "This notebook demonstrates Siphon's `remote_open` method, which opens a remote dataset from a TDS catalog for random access. `remote_open` returns a file-like object that can be read like a local file to get the dataset's raw bytes.\n", "
\n", "\n", "\n", "### Focuses\n", "* Open remote datasets on the TDS\n", "* Use the returned object to read the dataset as raw bytes\n", "* Work with the dataset as if it were stored in a local file\n", "\n", "### Objectives\n", "1. [Find a dataset in a TDS Catalog](#1.-Find-a-dataset-in-a-TDS-Catalog)\n", "2. [Open the dataset using remote_open](#2.-Open-the-dataset-using-remote_open)\n", "3. [Read the returned object like a local file](#3.-Read-the-returned-object-like-a-local-file)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports\n", "Before beginning, let's import the packages to be used throughout this training:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "from siphon.catalog import TDSCatalog" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Find a dataset in a TDS Catalog\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we use `remote_open`, we need to find a dataset that we'd like to access. \n", "As an example, we'll use this [dataset](https://www.ncei.noaa.gov/thredds/catalog/model-namanl/202101/20210104/catalog.html?dataset=model-namanl/202101/20210104/nam_218_20210104_0600_006.grb2) from the NOAA NCEI THREDDS catalog." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To access a dataset, we need to know two things:\n", "* the URL of the catalog where the dataset lives\n", "* the dataset name \n", "\n", "The dataset name can be found on the [dataset HTML page](https://www.ncei.noaa.gov/thredds/catalog/model-namanl/202101/20210104/catalog.html?dataset=model-namanl/202101/20210104/nam_218_20210104_0600_006.grb2), e.g. \"nam_218_20210104_0600_006.grb2\". 
\n", "The catalog URL is the part of the dataset page URL up to and including \"catalog.html\", with \".html\" replaced by \".xml\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "catUrl = \"https://www.ncei.noaa.gov/thredds/catalog/model-namanl/202101/20210104/catalog.xml\"\n", "datasetName = \"nam_218_20210104_0600_006.grb2\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we access the catalog using the catalog URL:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "catalog = TDSCatalog(catUrl)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And then select our dataset using the dataset name:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds = catalog.datasets[datasetName]\n", "ds.name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now view the access protocols available for our dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "list(ds.access_urls)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The list of services available for this dataset includes `HTTPServer`, which we'll need to open the dataset using `remote_open`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Open the dataset using `remote_open`\n", "\n", "We'll now use Siphon's `remote_open` to obtain a file-like object representing the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_file = ds.remote_open()\n", "data_file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have an object that we can read much like a local file. 
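As a rough, offline illustration of what "file-like" means here, the sketch below uses an `io.BytesIO` object as a stand-in for the object returned by `remote_open`. The stand-in and its sample bytes are assumptions made up for this example; they are not real GRIB data.

```python
import io

# Hypothetical stand-in for the object returned by remote_open();
# the contents are made-up sample bytes, not real GRIB data.
fake_file = io.BytesIO(b"GRIB\x00\x00\x00\x02first line\nsecond line\n")

first = fake_file.readline()   # reads up to and including the first b"\n"
position = fake_file.tell()    # current "cursor" position, in bytes
rest = fake_file.read()        # reads everything that remains

print(first)
print(position)
print(rest)
```

In binary data a newline byte can appear anywhere, so `readline` on a binary file returns an arbitrary chunk of bytes rather than a meaningful record.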
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = data_file.readline()\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note:* When we use `remote_open` to read a dataset, we are reading raw data from a file-like object, rather than formatted data. The `b` at the start of the data indicates that the string should be interpreted as bytes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Top\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Read the returned object like a local file\n", "We can now read our dataset using random access." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can read a line, as we did in the previous section, or we can read a specified number of bytes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = data_file.read(100)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can change the our position in the file using `seek`, similar to moving a cursor in a file. The position is given as bytes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_file.seek(0) # move \"cursor\" to start of file\n", "print(data_file.read(4)) # print first 4 bytes\n", "data_file.seek(50) # move \"cursor\" to byte 50\n", "print(data_file.read(10)) # print 10 more bytes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can read the data directly into a byte array." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "b = bytearray(100) # create a byte array of length 100\n", "data_file.readinto(b) # read 100 bytes into the byte array\n", "b[:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling `getbuffer` returns the location in memory where the dataset is being stored locally." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "b = data_file.getbuffer()\n", "b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the memory buffer to make local writes. Writing to the buffer changes the contents of `data_file` in memory, but does not write to the remote file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_file.seek(100) # move \"cursor\" position to byte 100\n", "b[100:110] = b\"helloworld\" # the `b` prefix before \"helloworld\" tells Python to interpret it as bytes\n", "data_file.seek(100) # return \"cursor\" to byte 100\n", "n = data_file.read(10) # read back the written bytes\n", "n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have opened a remote dataset and read parts of it using random access! Use `remote_open` when you want access to the raw data in a dataset, e.g., if you have Python code to read bytes in a particular format.\n", "\n", "*Note:* Without some prior knowledge about the format of the dataset, `remote_open` is not an effective method of parsing data. Since we are reading a raw file object, we need to know the layout of the data and the data types (e.g. ints, floats, etc.). 
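As a minimal sketch of what that prior knowledge looks like in practice, the example below decodes the 16-byte indicator section that begins a GRIB edition 2 message, using Python's `struct` module. The header bytes are synthetic and built inside the example (an assumption for illustration); with a real GRIB2 dataset, the same `read` call could be made on the object returned by `remote_open` after `seek(0)`.

```python
import io
import struct

# Layout of the GRIB2 indicator section (16 bytes, big-endian):
#   4s  literal b"GRIB"
#   2s  reserved
#   B   discipline
#   B   GRIB edition number
#   Q   total message length in bytes
HEADER = struct.Struct(">4s2sBBQ")

# Synthetic header for illustration only, not read from a real dataset.
fake_file = io.BytesIO(HEADER.pack(b"GRIB", b"\x00\x00", 0, 2, 1234) + b"payload")

fake_file.seek(0)
magic, _reserved, discipline, edition, length = HEADER.unpack(
    fake_file.read(HEADER.size)
)
print(magic, discipline, edition, length)
```

Knowing the field widths and byte order up front is what turns the raw bytes into usable values; without that layout, the same 16 bytes are just an opaque sequence.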