{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Select Data in Columns for Data Science\n", "*Pivot the row-based data in a STOQS database to fit into a column-based dataframe*\n", "\n", "This Notebook explores options raised by this [GitHub Issue](https://github.com/stoqs/stoqs/issues/837#issuecomment-763176111). We want to be able to efficiently consume mass quantities of data from a STOQS database and have it organized for efficient data analysis and visualization using modern data frame orientied tools.\n", "\n", "Executing this Notebook requires a personal STOQS server. It can be run from either a Docker installation or from a development Vagrant Virtual Machine. \n", "\n", "### Docker Instructions\n", "Install and start the software as \n", "[detailed in the README](https://github.com/stoqs/stoqs#production-deployment-with-docker). (Note that on MacOS you will need to modify settings in your `docker-compose.yml` and `.env` files — look for comments referencing 'HOST_UID'.)\n", "\n", "Then, from your `$STOQS_HOME/docker` directory start the Jupyter Notebook server - you can query from the remote database or from a copy that you've made to your local system: \n", "\n", "#### Option A: Query from MBARI's master database\n", "Start the Jupyter Notebook server pointing to MBARI's master STOQS database server. (Note: firewall rules limit unprivileged access to such resources):\n", "\n", " docker-compose exec \\\n", " -e DATABASE_URL=postgis://everyone:guest@kraken.shore.mbari.org:5432/stoqs \\\n", " stoqs stoqs/manage.py shell_plus --notebook\n", "\n", "#### Option B: Query from your local Docker Desktop\n", "Restore the `stoqs_canon_october2020` database from MBARI's server onto your local database and start the Jupyter Notebook server using the default DATABASE_URL, which should be your local system, also **make sure that your Docker Desktop has at least 16 GB of RAM allocated to it**:\n", "\n", " cd $STOQS_HOME/docker\n", " docker-compose exec stoqs createdb -U postgres stoqs_canon_october2020\n", " curl -k https://stoqs.shore.mbari.org/media/pg_dumps/stoqs_canon_october2020.pg_dump | \\\n", " docker exec -i stoqs pg_restore -Fc -U postgres -d stoqs_canon_october2020\n", " docker-compose exec stoqs stoqs/manage.py shell_plus --notebook\n", "\n", "### Opening this Notebook\n", "Following execution of the `stoqs/manage.py shell_plus --notebook` command a message is displayed giving a URL for you to use in a browser on your host, e.g.:\n", "\n", " http://127.0.0.1:8888/?token=\n", "\n", "In the browser window opened to this URL navigate to this file (`select_data_in_columns_for_data_science.ipynb`) and open it. You will then be able to execute the cells and modify the code to suit your needs.\n", "\n", "The information in the output cells result from execution on a 2019 MacBook Pro with a 2.4 GHz 8-Core Intel Core i9 processor, 32 GB 2667 MHz DDR4 RAM, running Docker Desktop 3.1.0 with 16 GB with 4 CPUs and 16 GB allocated." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import time\n", "\n", "# Prevent SynchronousOnlyOperation exceptions\n", "os.environ[\"DJANGO_ALLOW_ASYNC_UNSAFE\"] = \"true\"\n", "\n", "# Set do_plots to True for visualizations (do not commit with this setting)\n", "do_plots = False\n", "t_start = time.time()\n", "\n", "# Use a recent database available at DATABASE_URL\n", "db = 'stoqs_canon_october2020'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 0. 
Perform a straightforward query using the STOQS data model.\n", "Collect all the sea_water_temperature and sea_water_salinity data into dictionaries keyed by platform name. This is to examine the landscape of data we are querying." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting data for: dorado                  #temps: 169159 #salts: 169159\n", "Collecting data for: M1_Mooring              #temps:   6600 #salts:   6600\n", "Collecting data for: makai                   #temps: 154793 #salts: 154797 - not equal\n", "Collecting data for: makai_ESPmv1_filtering  #temps:  29857 #salts:  29857\n", "Collecting data for: makai_Sipper            #temps:    595 #salts:    595\n", "Collecting data for: NPS_Glider_29           #temps:  33143 #salts:  33143\n", "Collecting data for: NPS_Glider_34           #temps:  32759 #salts:  32759\n", "Collecting data for: pontus                  #temps: 132945 #salts: 132947 - not equal\n", "Collecting data for: wg_Tiny_Glider          #temps:    386 #salts:    386\n", "Done\n" ] } ], "source": [ "# To make sure we collect temperatures and salinities that are properly associated\n", "# we will first find all the Platforms that have T & S and then from each Measurement\n", "# from the Platform collect the temperatures and salinities into lists for plotting.\n", "# Assume that Platforms that have sea_water_salinity also have sea_water_temperature.\n", "platforms = (ActivityParameter.objects.using(db)\n", "             .filter(parameter__standard_name='sea_water_salinity')\n", "             .values_list('activity__platform__name', flat=True)\n", "             .distinct().order_by('activity__platform__name'))\n", "temps = {}\n", "salts = {}\n", "for platform in platforms:\n", "    print(f\"Collecting data for: {platform:23}\", end=' ')\n", "    mps = (MeasuredParameter.objects.using(db)\n", "           .filter(measurement__instantpoint__activity__platform__name=platform))\n", "\n", "    temps[platform] = (mps.filter(parameter__standard_name='sea_water_temperature')\n", "                       .values_list('datavalue', flat=True))\n", "    salts[platform] = (mps.filter(parameter__standard_name='sea_water_salinity')\n", "                       .values_list('datavalue', flat=True))\n", "    print(f\"#temps: {len(temps[platform]):6} #salts: {len(salts[platform]):6}\", end='')\n", "    if len(temps[platform]) != len(salts[platform]):\n", "        print(' - not equal')\n", "    else:\n", "        print()\n", "print('Done')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "if do_plots:\n", "    # Make a T/S plot of data from all the platforms\n", "    import pylab as plt\n", "    plt.rcParams['figure.figsize'] = (18, 6)\n", "    for platform in temps.keys():\n", "        if len(temps[platform]) == len(salts[platform]):\n", "            plt.scatter(salts[platform], temps[platform], s=1, label=platform)\n", "    plt.xlabel('Salinity')\n", "    plt.ylabel('Temperature (DegC)')\n", "    plt.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Approach 1. Use the same kind of self-join query used for selecting data for Parameter-Parameter plots. \n", "A sample SQL statement was copied from the STOQS UI and then modified to select sea_water_temperature and sea_water_salinity from all platforms."
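, "\n", "\n", "The two-Parameter query in the next cell follows a general pattern: each Parameter that is pivoted into its own column gets its own trio of stoqs_measurement/stoqs_measuredparameter/stoqs_parameter joins, plus a matching WHERE clause. As a minimal sketch (a hypothetical helper, not part of STOQS), the extra clauses could be generated dynamically for an arbitrary list of standard_names:\n", "\n", "    def self_join_clauses(standard_names):\n", "        # Build one SELECT column, one trio of JOINs, and one WHERE term per Parameter\n", "        selects, joins, wheres = [], [], []\n", "        for i, sn in enumerate(standard_names):\n", "            selects.append(f\"mp_{i}.datavalue AS col_{i}\")\n", "            joins.append(f\"INNER JOIN stoqs_measurement m_{i} ON m_{i}.instantpoint_id = stoqs_instantpoint.id\")\n", "            joins.append(f\"INNER JOIN stoqs_measuredparameter mp_{i} ON mp_{i}.measurement_id = m_{i}.id\")\n", "            joins.append(f\"INNER JOIN stoqs_parameter p_{i} ON mp_{i}.parameter_id = p_{i}.id\")\n", "            wheres.append(f\"p_{i}.standard_name = '{sn}'\")\n", "        return selects, joins, wheres"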
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "sql_multp = '''SELECT DISTINCT \n", " stoqs_platform.name,\n", " stoqs_instantpoint.timevalue,\n", " stoqs_measurement.depth,\n", " mp_salt.datavalue AS salt,\n", " mp_temp.datavalue AS temp\n", "FROM stoqs_measuredparameter\n", "INNER JOIN stoqs_measurement ON (stoqs_measuredparameter.measurement_id = stoqs_measurement.id)\n", "INNER JOIN stoqs_instantpoint ON (stoqs_measurement.instantpoint_id = stoqs_instantpoint.id)\n", "INNER JOIN stoqs_activity ON (stoqs_instantpoint.activity_id = stoqs_activity.id)\n", "INNER JOIN stoqs_platform ON (stoqs_activity.platform_id = stoqs_platform.id)\n", "INNER JOIN stoqs_measurement m_salt ON m_salt.instantpoint_id = stoqs_instantpoint.id\n", "INNER JOIN stoqs_measuredparameter mp_salt ON mp_salt.measurement_id = m_salt.id\n", "INNER JOIN stoqs_parameter p_salt ON mp_salt.parameter_id = p_salt.id\n", "INNER JOIN stoqs_measurement m_temp ON m_temp.instantpoint_id = stoqs_instantpoint.id\n", "INNER JOIN stoqs_measuredparameter mp_temp ON mp_temp.measurement_id = m_temp.id\n", "INNER JOIN stoqs_parameter p_temp ON mp_temp.parameter_id = p_temp.id\n", "WHERE (p_salt.standard_name = 'sea_water_salinity')\n", " AND (p_temp.standard_name = 'sea_water_temperature')\n", " AND stoqs_platform.name IN ({})\n", "ORDER BY stoqs_instantpoint.timevalue, stoqs_measurement.depth'''" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SELECT DISTINCT \n", " stoqs_platform.name,\n", " stoqs_instantpoint.timevalue,\n", " stoqs_measurement.depth,\n", " mp_salt.datavalue AS salt,\n", " mp_temp.datavalue AS temp\n", "FROM stoqs_measuredparameter\n", "INNER JOIN stoqs_measurement ON (stoqs_measuredparameter.measurement_id = stoqs_measurement.id)\n", "INNER JOIN stoqs_instantpoint ON (stoqs_measurement.instantpoint_id = stoqs_instantpoint.id)\n", "INNER JOIN stoqs_activity ON (stoqs_instantpoint.activity_id = stoqs_activity.id)\n", "INNER JOIN stoqs_platform ON (stoqs_activity.platform_id = stoqs_platform.id)\n", "INNER JOIN stoqs_measurement m_salt ON m_salt.instantpoint_id = stoqs_instantpoint.id\n", "INNER JOIN stoqs_measuredparameter mp_salt ON mp_salt.measurement_id = m_salt.id\n", "INNER JOIN stoqs_parameter p_salt ON mp_salt.parameter_id = p_salt.id\n", "INNER JOIN stoqs_measurement m_temp ON m_temp.instantpoint_id = stoqs_instantpoint.id\n", "INNER JOIN stoqs_measuredparameter mp_temp ON mp_temp.measurement_id = m_temp.id\n", "INNER JOIN stoqs_parameter p_temp ON mp_temp.parameter_id = p_temp.id\n", "WHERE (p_salt.standard_name = 'sea_water_salinity')\n", " AND (p_temp.standard_name = 'sea_water_temperature')\n", " AND stoqs_platform.name IN ('dorado','makai_ESPmv1_filtering','makai_Sipper','NPS_Glider_29','NPS_Glider_34','wg_Tiny_Glider')\n", "ORDER BY stoqs_instantpoint.timevalue, stoqs_measurement.depth\n" ] } ], "source": [ "# Build the SQL with optional selection of platforms to use\n", "db = 'stoqs_canon_october2020'\n", "platforms = (ActivityParameter.objects.using(db)\n", " .filter(parameter__standard_name='sea_water_salinity')\n", " .values_list('activity__platform__name', flat=True)\n", " .order_by('activity__platform__name').distinct())\n", "plats = ''\n", "plat_list = []\n", "for platform in platforms:\n", " if platform == 'M1_Mooring' or platform == 'makai' or platform == 'pontus':\n", " # Continue to omit some platforms for shorter execution times\n", " continue\n", " plats += 
f\"'{platform}',\"\n", " plat_list.append(platform)\n", "plats = plats[:-2] + \"'\"\n", "sql = sql_multp.format(plats)\n", "print(sql)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 614 ms, sys: 91.6 ms, total: 706 ms\n", "Wall time: 8.11 s\n", "(266285, 2)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salttemp
nametimevaluedepth
wg_Tiny_Glider2020-10-05 23:20:00-1.033.49912614.309250
0.533.49912614.309250
2020-10-05 23:30:00-1.033.49525114.408250
0.533.49525114.408250
2020-10-05 23:40:00-1.033.49124914.349625
\n", "
" ], "text/plain": [ " salt temp\n", "name timevalue depth \n", "wg_Tiny_Glider 2020-10-05 23:20:00 -1.0 33.499126 14.309250\n", " 0.5 33.499126 14.309250\n", " 2020-10-05 23:30:00 -1.0 33.495251 14.408250\n", " 0.5 33.495251 14.408250\n", " 2020-10-05 23:40:00 -1.0 33.491249 14.349625" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from django.db import connections\n", "\n", "# It takes about 15 seconds to read about 0.5 million rows from the local STOQS database.\n", "%time df1 = pd.read_sql_query(sql, connections[db], index_col=['name', 'timevalue', 'depth'])\n", "##%time df1 = pd.read_sql_query(sql, connections[db])\n", "print(df1.shape)\n", "df1.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 132 ms, sys: 37.2 ms, total: 170 ms\n", "Wall time: 233 ms\n" ] } ], "source": [ "# Writing the Parquet file takes about 0.6 seconds\n", "%time df1.to_parquet('all_plats.parquet')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 164 ms, sys: 20.9 ms, total: 185 ms\n", "Wall time: 226 ms\n" ] }, { "data": { "text/plain": [ "(266285, 2)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reading the Parquest file takes about 0.4 seconds\n", "%time df1b = pd.read_parquet('all_plats.parquet')\n", "df1b.shape" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salttemp
nametimevaluedepth
wg_Tiny_Glider2020-10-05 23:20:00-1.00000033.49912614.309250
0.50000033.49912614.309250
2020-10-05 23:30:00-1.00000033.49525114.408250
0.50000033.49525114.408250
2020-10-05 23:40:00-1.00000033.49124914.349625
...............
dorado2020-10-28 14:42:431.63863133.48735114.753714
2020-10-28 14:42:441.35610233.48845714.751194
2020-10-28 14:42:460.79172933.48915514.750566
2020-10-28 14:42:470.51377233.48915514.744095
2020-10-28 14:42:480.16938333.48915514.743799
\n", "

266285 rows × 2 columns

\n", "
" ], "text/plain": [ " salt temp\n", "name timevalue depth \n", "wg_Tiny_Glider 2020-10-05 23:20:00 -1.000000 33.499126 14.309250\n", " 0.500000 33.499126 14.309250\n", " 2020-10-05 23:30:00 -1.000000 33.495251 14.408250\n", " 0.500000 33.495251 14.408250\n", " 2020-10-05 23:40:00 -1.000000 33.491249 14.349625\n", "... ... ...\n", "dorado 2020-10-28 14:42:43 1.638631 33.487351 14.753714\n", " 2020-10-28 14:42:44 1.356102 33.488457 14.751194\n", " 2020-10-28 14:42:46 0.791729 33.489155 14.750566\n", " 2020-10-28 14:42:47 0.513772 33.489155 14.744095\n", " 2020-10-28 14:42:48 0.169383 33.489155 14.743799\n", "\n", "[266285 rows x 2 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Datashader plots must be left justified on last line, use this variable to do that\n", "ts_points = None\n", "if do_plots:\n", " # See: http://holoviews.org/user_guide/Large_Data.html\n", " # https://stackoverflow.com/a/18835121/1281657\n", " import colorcet\n", " import holoviews as hv\n", " from holoviews.operation.datashader import rasterize\n", " hv.extension(\"bokeh\")\n", " ropts = dict(height=380, width=300, colorbar=True, colorbar_position=\"bottom\", cmap=colorcet.fire)\n", " plots = [(rasterize(hv.Points(df1.iloc[df1.index.get_level_values('name') == p], kdims=['salt', 'temp']))\n", " .opts(**ropts).relabel(p)) for p in plat_list]\n", " ts_points = hv.Layout(plots).cols(3)\n", "ts_points" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This approach could be used in a general way to extract all Parameters for each Platform by dynamically generating the SQL (with dozens more self joins) and executing it. We do need more scalable methods than `.read_sql_query()` and `.to_parquet()`, which need to read and write all the data in to and out of allocated random access memory. This is why at least a resource of 16 GB of RAM is needed in Docker Desktop for this query - larger data requests would require more memory - this is not scalable. \n", "\n", "This SQL is not forgiving in terms of *ad hoc* modification; for example, adding a column to the SELECT can increase the volume of results by unexpectedly returning a type of cross join with repeated salt and temp values. \n", "\n", "This sort of self-join query returns a lot of duplicate records (kind of a cross join) for the 'M1_Mooring' platform which has a 'stationprofile' CF featureType, resulting in a different relational cardinality that would require special treatment.\n", "\n", "---\n", "\n", "#### Approach 2. Use Brent's trimSTOQS program to convert the MeasuredParameter Data Access output:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.12user 0.80system 0:11.25elapsed 8%CPU (0avgtext+0avgdata 6744maxresident)k\n", "0inputs+0outputs (0major+363minor)pagefaults 0swaps\n", "(33176, 11)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
timevaluedepthgeom.xgeom.yaltitudeTEMP (Celsius)PSAL (0.001)OXYG (ml/L)sigmatspiceNPS_Glider_29
02020-10-06 19:07:40.8000020.000000-121.85399636.803799280.56773213.95633.4620026.00225.0078210.682620NaN
12020-10-06 19:07:57.9310360.794097-121.85405036.803795280.17034513.84533.4669995.91825.0346450.663049NaN
22020-10-06 19:08:15.0620711.885975-121.85410336.803787279.50258813.73133.4679995.81525.0588880.639810NaN
32020-10-06 19:08:32.1931052.878586-121.85414936.803780278.85872813.62633.4720005.71625.0834880.620954NaN
42020-10-06 19:08:49.3241403.970453-121.85420236.803772278.15241213.59033.4690025.62425.0885410.611061NaN
\n", "
" ], "text/plain": [ " timevalue depth geom.x geom.y altitude \\\n", "0 2020-10-06 19:07:40.800002 0.000000 -121.853996 36.803799 280.567732 \n", "1 2020-10-06 19:07:57.931036 0.794097 -121.854050 36.803795 280.170345 \n", "2 2020-10-06 19:08:15.062071 1.885975 -121.854103 36.803787 279.502588 \n", "3 2020-10-06 19:08:32.193105 2.878586 -121.854149 36.803780 278.858728 \n", "4 2020-10-06 19:08:49.324140 3.970453 -121.854202 36.803772 278.152412 \n", "\n", " TEMP (Celsius) PSAL (0.001) OXYG (ml/L) sigmat spice \\\n", "0 13.956 33.462002 6.002 25.007821 0.682620 \n", "1 13.845 33.466999 5.918 25.034645 0.663049 \n", "2 13.731 33.467999 5.815 25.058888 0.639810 \n", "3 13.626 33.472000 5.716 25.083488 0.620954 \n", "4 13.590 33.469002 5.624 25.088541 0.611061 \n", "\n", " NPS_Glider_29 \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# It takes about 5 minutes to read in 0.17 million dorado CSV rows and convert using trimSTOQS\n", "##! time wget https://stoqs.mbari.org/stoqs_canon_october2020/api/measuredparameter.csv?measurement__instantpoint__activity__platform__name=dorado \\\n", "## -q -O - | /srv/stoqs/contrib/trimSTOQS/trimSTOQS parameter__name --separator=, > october2020_dorado_parms.cvs\n", "##df2 = pd.read_csv('/srv/stoqs/contrib/trimSTOQS/october2020_dorado_parms.cvs')\n", "\n", "# It takes about 40 seconds (on a fast network) to read in just 0.033 million NPS_Glider_29 CSV rows and convert using trimSTOQS\n", "! time wget http://stoqs.mbari.org/stoqs_canon_october2020/api/measuredparameter.csv?measurement__instantpoint__activity__platform__name=NPS_Glider_29 \\\n", " -q -O - | /srv/stoqs/contrib/trimSTOQS/trimSTOQS parameter__name --separator=, \\\n", " > /srv/stoqs/contrib/trimSTOQS/october2020_NPS_Glider_29_parms.cvs\n", " \n", "df2 = pd.read_csv('/srv/stoqs/contrib/trimSTOQS/october2020_NPS_Glider_29_parms.cvs')\n", "print(df2.shape)\n", "df2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The advantage of this approach is that all parameters get transformed into the columns we want. The disadvantage is that it takes a long time to extract the data in CSV format. Approach 1 reads at a rate of about 30,000 rows/sec, approach 2 reads at a rate of 1000 rows/sec - **orders of magnitude slower**.\n", "\n", "---\n", "\n", "#### 3. Do a direct Postgresql query to transform the data, perhaps using the [crosstab() function](https://stackoverflow.com/questions/3002499/postgresql-crosstab-query/11751905#11751905). 
\n", "\n", "Need to do this on the database first:\n", "\n", " % docker-compose exec postgis psql -U postgres \n", " postgres=# \\c stoqs_canon_october2020\n", " stoqs_canon_october2020=# CREATE EXTENSION IF NOT EXISTS tablefunc;\n", " CREATE EXTENSION\n", " " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SELECT stoqs_platform.name as platform, stoqs_activity.name as activity__name,\n", " stoqs_instantpoint.timevalue, stoqs_measurement.depth, \n", " ST_X(stoqs_measurement.geom) as longitude, ST_Y(stoqs_measurement.geom) as latitude,\n", " stoqs_parameter.name, standard_name, datavalue \n", "FROM public.stoqs_measuredparameter\n", "INNER JOIN stoqs_measurement ON (stoqs_measuredparameter.measurement_id = stoqs_measurement.id)\n", "INNER JOIN stoqs_instantpoint ON (stoqs_measurement.instantpoint_id = stoqs_instantpoint.id)\n", "INNER JOIN stoqs_activity ON (stoqs_instantpoint.activity_id = stoqs_activity.id)\n", "INNER JOIN stoqs_platform ON (stoqs_activity.platform_id = stoqs_platform.id)\n", "INNER JOIN stoqs_parameter ON (stoqs_measuredparameter.parameter_id = stoqs_parameter.id)\n", "WHERE stoqs_platform.name IN ('dorado','makai_ESPmv1_filtering','makai_Sipper','NPS_Glider_29','NPS_Glider_34','wg_Tiny_Glider')\n", "ORDER BY stoqs_platform.name, stoqs_instantpoint.timevalue, stoqs_measurement.depth, stoqs_parameter.name\n" ] } ], "source": [ "# Base query that's similar to the one behind the api/measuredparameter.csv request\n", "sql_base = '''SELECT stoqs_platform.name as platform, stoqs_activity.name as activity__name,\n", " stoqs_instantpoint.timevalue, stoqs_measurement.depth, \n", " ST_X(stoqs_measurement.geom) as longitude, ST_Y(stoqs_measurement.geom) as latitude,\n", " stoqs_parameter.name, standard_name, datavalue \n", "FROM public.stoqs_measuredparameter\n", "INNER JOIN stoqs_measurement ON (stoqs_measuredparameter.measurement_id = stoqs_measurement.id)\n", "INNER JOIN stoqs_instantpoint ON (stoqs_measurement.instantpoint_id = stoqs_instantpoint.id)\n", "INNER JOIN stoqs_activity ON (stoqs_instantpoint.activity_id = stoqs_activity.id)\n", "INNER JOIN stoqs_platform ON (stoqs_activity.platform_id = stoqs_platform.id)\n", "INNER JOIN stoqs_parameter ON (stoqs_measuredparameter.parameter_id = stoqs_parameter.id)\n", "WHERE stoqs_platform.name IN ({})\n", "ORDER BY stoqs_platform.name, stoqs_instantpoint.timevalue, stoqs_measurement.depth, stoqs_parameter.name'''\n", "sql = sql_base.format(plats)\n", "print(sql)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 12.4 s, sys: 2.39 s, total: 14.8 s\n", "Wall time: 41.8 s\n", "(3683189, 4)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
activity__namenamestandard_namedatavalue
platformtimevaluedepthlatitudelongitude
dorado2020-10-08 20:08:11-0.02826436.796672-121.849978Dorado389_2020_282_01_282_01_decim.ncaltitudeheight_above_sea_floor2.707653e+02
-121.849978Dorado389_2020_282_01_282_01_decim.ncbbp420 (m-1)None1.269240e-03
-121.849978Dorado389_2020_282_01_282_01_decim.ncbbp700 (m-1)None8.974295e-03
-121.849978Dorado389_2020_282_01_282_01_decim.ncbiolumeNone3.691703e+09
-121.849978Dorado389_2020_282_01_282_01_decim.ncfl700_uncorrNone1.431676e-03
\n", "
" ], "text/plain": [ " activity__name \\\n", "platform timevalue depth latitude longitude \n", "dorado 2020-10-08 20:08:11 -0.028264 36.796672 -121.849978 Dorado389_2020_282_01_282_01_decim.nc \n", " -121.849978 Dorado389_2020_282_01_282_01_decim.nc \n", " -121.849978 Dorado389_2020_282_01_282_01_decim.nc \n", " -121.849978 Dorado389_2020_282_01_282_01_decim.nc \n", " -121.849978 Dorado389_2020_282_01_282_01_decim.nc \n", "\n", " name \\\n", "platform timevalue depth latitude longitude \n", "dorado 2020-10-08 20:08:11 -0.028264 36.796672 -121.849978 altitude \n", " -121.849978 bbp420 (m-1) \n", " -121.849978 bbp700 (m-1) \n", " -121.849978 biolume \n", " -121.849978 fl700_uncorr \n", "\n", " standard_name \\\n", "platform timevalue depth latitude longitude \n", "dorado 2020-10-08 20:08:11 -0.028264 36.796672 -121.849978 height_above_sea_floor \n", " -121.849978 None \n", " -121.849978 None \n", " -121.849978 None \n", " -121.849978 None \n", "\n", " datavalue \n", "platform timevalue depth latitude longitude \n", "dorado 2020-10-08 20:08:11 -0.028264 36.796672 -121.849978 2.707653e+02 \n", " -121.849978 1.269240e-03 \n", " -121.849978 8.974295e-03 \n", " -121.849978 3.691703e+09 \n", " -121.849978 1.431676e-03 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Identify the columns used as the context (index) for the measurements\n", "context = ['platform', 'timevalue', 'depth', 'latitude', 'longitude']\n", "\n", "# It takes about 1 minute to read all the Parameters for the selected platforms - about 13.5 million rows\n", "%time df3a = pd.read_sql_query(sql, connections[db], index_col=context)\n", "print(df3a.shape)\n", "df3a.head()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 14.6 s, sys: 1.59 s, total: 16.2 s\n", "Wall time: 43.1 s\n", "(3683189, 3)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namestandard_namedatavalue
platformactivity__nametimevaluedepthlatitudelongitude
doradoDorado389_2020_282_01_282_01_decim.nc2020-10-08 20:08:11-0.02826436.796672-121.849978altitudeheight_above_sea_floor2.707653e+02
-121.849978bbp420 (m-1)None1.269240e-03
-121.849978bbp700 (m-1)None8.974295e-03
-121.849978biolumeNone3.691703e+09
-121.849978fl700_uncorrNone1.431676e-03
\n", "
" ], "text/plain": [ " name \\\n", "platform activity__name timevalue depth latitude longitude \n", "dorado Dorado389_2020_282_01_282_01_decim.nc 2020-10-08 20:08:11 -0.028264 36.796672 -121.849978 altitude \n", " -121.849978 bbp420 (m-1) \n", " -121.849978 bbp700 (m-1) \n", " -121.849978 biolume \n", " -121.849978 fl700_uncorr \n", "\n", " standard_name \\\n", "platform activity__name timevalue depth latitude longitude \n", "dorado Dorado389_2020_282_01_282_01_decim.nc 2020-10-08 20:08:11 -0.028264 36.796672 -121.849978 height_above_sea_floor \n", " -121.849978 None \n", " -121.849978 None \n", " -121.849978 None \n", " -121.849978 None \n", "\n", " datavalue \n", "platform activity__name timevalue depth latitude longitude \n", "dorado Dorado389_2020_282_01_282_01_decim.nc 2020-10-08 20:08:11 -0.028264 36.796672 -121.849978 2.707653e+02 \n", " -121.849978 1.269240e-03 \n", " -121.849978 8.974295e-03 \n", " -121.849978 3.691703e+09 \n", " -121.849978 1.431676e-03 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "context = ['platform', 'activity__name', 'timevalue', 'depth', 'latitude', 'longitude']\n", "\n", "# It takes about 1 minute to read all the Parameters for the selected platforms - about 13.5 million rows\n", "%time df3a = pd.read_sql_query(sql, connections[db], index_col=context)\n", "print(df3a.shape)\n", "df3a.head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 27.3 s, sys: 1.51 s, total: 28.8 s\n", "Wall time: 56.6 s\n", "(3683189, 3)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namestandard_namedatavalue
platformactivity__nametimevaluedepthlatitudelongitude
doradoDorado389_2020_282_01_282_01_decim.nc2020-10-08 20:08:11+00:00-0.02826436.796672-121.849978altitudeheight_above_sea_floor2.707653e+02
-121.849978bbp420 (m-1)None1.269240e-03
-121.849978bbp700 (m-1)None8.974295e-03
-121.849978biolumeNone3.691703e+09
-121.849978fl700_uncorrNone1.431676e-03
\n", "
" ], "text/plain": [ " name \\\n", "platform activity__name timevalue depth latitude longitude \n", "dorado Dorado389_2020_282_01_282_01_decim.nc 2020-10-08 20:08:11+00:00 -0.028264 36.796672 -121.849978 altitude \n", " -121.849978 bbp420 (m-1) \n", " -121.849978 bbp700 (m-1) \n", " -121.849978 biolume \n", " -121.849978 fl700_uncorr \n", "\n", " standard_name \\\n", "platform activity__name timevalue depth latitude longitude \n", "dorado Dorado389_2020_282_01_282_01_decim.nc 2020-10-08 20:08:11+00:00 -0.028264 36.796672 -121.849978 height_above_sea_floor \n", " -121.849978 None \n", " -121.849978 None \n", " -121.849978 None \n", " -121.849978 None \n", "\n", " datavalue \n", "platform activity__name timevalue depth latitude longitude \n", "dorado Dorado389_2020_282_01_282_01_decim.nc 2020-10-08 20:08:11+00:00 -0.028264 36.796672 -121.849978 2.707653e+02 \n", " -121.849978 1.269240e-03 \n", " -121.849978 8.974295e-03 \n", " -121.849978 3.691703e+09 \n", " -121.849978 1.431676e-03 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas.io.sql as sqlio\n", "import psycopg2\n", "\n", "# Use psycopg2 for direct from Postgres query - still explodes stoqs container RAM as this Notebook runs there\n", "conn = psycopg2.connect(\"host='{}' port={} dbname='{}' user={} password={}\".format(\n", " 'postgis', 5432, 'stoqs_canon_october2020', 'stoqsadm', 'CHANGEME'))\n", "# Takes about 5 minutes to read 13.5 million rows\n", "%time df3b = sqlio.read_sql_query(sql, conn, index_col=context)\n", "print(df3b.shape)\n", "df3b.head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "It would be nice if chunksize helped with memory usage in docker, but it doesn't.\n" ] } ], "source": [ "# Setting chunksize doesn't help reduce memory requirements in the stoqs or stoqs-postgis containers\n", "# See: https://stackoverflow.com/a/31843091/1281657\n", "# https://github.com/pandas-dev/pandas/issues/12265#issuecomment-181809005\n", "# https://github.com/pandas-dev/pandas/issues/35689\n", "df3c = pd.DataFrame()\n", "# Still takes about 2.5 minutes to read 13.5 rows (chunking happens only on client in Pandas)\n", "##%time chunker = pd.read_sql_query(sql, connections[db], index_col=context, chunksize=1000000)\n", "##for chunk in chunker:\n", " ##print(chunk.shape)\n", " ##df3c.add(chunk.pivot_table(index=context, columns='name', values='datavalue'))\n", "print(\"It would be nice if chunksize helped with memory usage in docker, but it doesn't.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The syntax of crosstab() is arcane and it will take some work to figure out a way to preserve datetime objects as they are read into a DataFrame. It's likely that performing a pivot on the data closer to the database will be more performant than say doing it after reading records into a DataFrame.\n", "\n", "---\n", "#### Approach 4. Use Pandas do a pivot on data read into a DataFrame\n", "Similar to Approach 2, but this may be more efficient as conversion to and from CSV text format is avoided." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.9 s, sys: 53.1 ms, total: 4.96 s\n", "Wall time: 5.02 s\n", "(312470, 51)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameO2_conc (umol/kg)OXYGOXYG (ml/L)PAR (umol/s/m2)PSAL (0.001)TEMP (Celsius)air_temp (C)altitudeatm_press (mbar)avg_wind_spd (m/s)...salinitysalinity (psu)sepCountList (count)sigmatspicetemperature (Celsius)temperature (degC)water_temp (C)wind_dir (deg (True))yaw (degree)
platformtimevaluedepthlatitudelongitude
NPS_Glider_292020-10-06 19:07:40.8000020.00000036.803799-121.853996NaNNaN6.002NaN33.46200213.956NaN280.567732NaNNaN...NaNNaNNaN25.0078210.682620NaNNaNNaNNaNNaN
2020-10-06 19:07:57.9310360.79409736.803795-121.854050NaNNaN5.918NaN33.46699913.845NaN280.170345NaNNaN...NaNNaNNaN25.0346450.663049NaNNaNNaNNaNNaN
2020-10-06 19:08:15.0620711.88597536.803787-121.854103NaNNaN5.815NaN33.46799913.731NaN279.502588NaNNaN...NaNNaNNaN25.0588880.639810NaNNaNNaNNaNNaN
2020-10-06 19:08:32.1931052.87858636.803780-121.854149NaNNaN5.716NaN33.47200013.626NaN278.858728NaNNaN...NaNNaNNaN25.0834880.620954NaNNaNNaNNaNNaN
2020-10-06 19:08:49.3241403.97045336.803772-121.854202NaNNaN5.624NaN33.46900213.590NaN278.152412NaNNaN...NaNNaNNaN25.0885410.611061NaNNaNNaNNaNNaN
\n", "

5 rows × 51 columns

\n", "
" ], "text/plain": [ "name O2_conc (umol/kg) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name OXYG \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name OXYG (ml/L) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 6.002 \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 5.918 \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 5.815 \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 5.716 \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 5.624 \n", "\n", "name PAR (umol/s/m2) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name PSAL (0.001) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 33.462002 \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 33.466999 \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 33.467999 \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 33.472000 \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 33.469002 \n", "\n", "name TEMP (Celsius) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 13.956 \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 13.845 \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 13.731 \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 13.626 \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 13.590 \n", "\n", "name air_temp (C) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name altitude \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 280.567732 \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 280.170345 \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 279.502588 \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 278.858728 \n", " 
2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 278.152412 \n", "\n", "name atm_press (mbar) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name avg_wind_spd (m/s) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name ... \\\n", "platform timevalue depth latitude longitude ... \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 ... \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 ... \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 ... \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 ... \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 ... \n", "\n", "name salinity \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name salinity (psu) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name sepCountList (count) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name sigmat \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 25.007821 \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 25.034645 \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 25.058888 \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 25.083488 \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 25.088541 \n", "\n", "name spice \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 0.682620 \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 0.663049 \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 0.639810 \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 
-121.854149 0.620954 \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 0.611061 \n", "\n", "name temperature (Celsius) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name temperature (degC) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name water_temp (C) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name wind_dir (deg (True)) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name yaw (degree) \n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "[5 rows x 51 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Identify the columns used as the index for the pivot\n", "context = ['platform', 'timevalue', 'depth', 'latitude', 'longitude']\n", "\n", "%time df4 = df3a.pivot_table(index=context, columns='name', values='datavalue')\n", "print(df4.shape)\n", "df4.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This approach looks promising. Some advantages:\n", "\n", "1. The SQL query is a simple inner join of the tables - similar to that used for MeasuredParameter Data Access\n", "2. No complicated self joins are needed\n", "3. It will work for any Parameter names from any platforms\n", "4. Missing values are preserved as None or NaN\n", "5. Pandas pivot_table() method is efficient, taking only about 5 seconds\n", " \n", "Some disadvantages:\n", "\n", "1. This uses the stoqs Docker image and the Django api - it could be closer to the database\n", "2. More than 16 GB needs to be resourced to Docker Desktop to read the entire stoqs_canon_october2020 db\n", "3. 
When memory is exhausted there is no error message provided; the process annoyingly quits silently\n", "\n", "---\n", "\n", "These experiments led to the script stoqs/contrib/parquet/extract_columns.py, which implements this capability at the command line. This notebook can still serve as a \"playground\" for testing out various ways to get STOQS data into modern data science tools. " ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "if do_plots:\n", "    # See: https://datashader.org/getting_started/Pipeline.html\n", "    import colorcet\n", "    import holoviews as hv\n", "    from holoviews.operation.datashader import datashade\n", "    hv.extension(\"bokeh\")\n", "    pts1 = hv.Points(df1, kdims=['salt', 'temp'])\n", "    pts2 = hv.Points(df2, kdims=['PSAL (0.001)', 'TEMP (Celsius)'])\n", "    pts4a = hv.Points(df4, kdims=['salinity', 'temperature (Celsius)'])\n", "    pts4b = hv.Points(df4, kdims=['PSAL (0.001)', 'TEMP (Celsius)'])\n", "    ts_points = (datashade(pts1, cmap=colorcet.fire).opts(title='df1')\n", "                 + datashade(pts2, cmap=colorcet.fire).opts(title='df2')\n", "                 + datashade(pts4a, cmap=colorcet.fire).opts(title='df4a')\n", "                 + datashade(pts4b, cmap=colorcet.fire).opts(title='df4b'))\n", "ts_points" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "if do_plots:\n", "    # See: http://holoviews.org/user_guide/Large_Data.html\n", "    from holoviews.operation.datashader import rasterize\n", "    ##ropts = dict(tools=[\"pan,wheel_zoom,box_zoom\"], height=380, width=330, colorbar=True, colorbar_position=\"bottom\")\n", "    ropts = dict(height=380, width=330, colorbar=True, colorbar_position=\"bottom\")\n", "\n", "    ts_points = hv.Layout([rasterize(hv.Points(df1.iloc[df1.index.get_level_values('name') == p], kdims=['temp', 'salt']))\n", "                           .opts(**ropts).relabel(p) for p in plat_list])\n", "\n", "ts_points" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time to execute this notebook: 176.5 seconds\n" ] } ], "source": [ "print(f\"Time to execute this notebook: {(time.time() - t_start):.1f} seconds\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Monitoring with `docker stats` shows that the stoqs container required **7.5 GB** of memory to execute this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Django Shell-Plus", "language": "python", "name": "django_extensions" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, "nbformat_minor": 2 }