{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Select Data in Columns for Data Science\n", "*Pivot the row-based data in a STOQS database to fit into a column-based dataframe*\n", "\n", "This Notebook explores options raised by this [GitHub Issue](https://github.com/stoqs/stoqs/issues/837#issuecomment-763176111). We want to be able to efficiently consume mass quantities of data from a STOQS database and have it organized for efficient data analysis and visualization using modern data frame orientied tools.\n", "\n", "Executing this Notebook requires a personal STOQS server. It can be run from either a Docker installation or from a development Vagrant Virtual Machine. \n", "\n", "### Docker Instructions\n", "Install and start the software as \n", "[detailed in the README](https://github.com/stoqs/stoqs#production-deployment-with-docker). (Note that on MacOS you will need to modify settings in your `docker-compose.yml` and `.env` files — look for comments referencing 'HOST_UID'.)\n", "\n", "Then, from your `$STOQS_HOME/docker` directory start the Jupyter Notebook server - you can query from the remote database or from a copy that you've made to your local system: \n", "\n", "#### Option A: Query from MBARI's master database\n", "Start the Jupyter Notebook server pointing to MBARI's master STOQS database server. (Note: firewall rules limit unprivileged access to such resources):\n", "\n", " docker-compose exec \\\n", " -e DATABASE_URL=postgis://everyone:guest@kraken.shore.mbari.org:5432/stoqs \\\n", " stoqs stoqs/manage.py shell_plus --notebook\n", "\n", "#### Option B: Query from your local Docker Desktop\n", "Restore the `stoqs_canon_october2020` database from MBARI's server onto your local database and start the Jupyter Notebook server using the default DATABASE_URL, which should be your local system, also **make sure that your Docker Desktop has at least 16 GB of RAM allocated to it**:\n", "\n", " cd $STOQS_HOME/docker\n", " docker-compose exec stoqs createdb -U postgres stoqs_canon_october2020\n", " curl -k https://stoqs.shore.mbari.org/media/pg_dumps/stoqs_canon_october2020.pg_dump | \\\n", " docker exec -i stoqs pg_restore -Fc -U postgres -d stoqs_canon_october2020\n", " docker-compose exec stoqs stoqs/manage.py shell_plus --notebook\n", "\n", "### Opening this Notebook\n", "Following execution of the `stoqs/manage.py shell_plus --notebook` command a message is displayed giving a URL for you to use in a browser on your host, e.g.:\n", "\n", " http://127.0.0.1:8888/?token=\n", "\n", "In the browser window opened to this URL navigate to this file (`select_data_in_columns_for_data_science.ipynb`) and open it. You will then be able to execute the cells and modify the code to suit your needs.\n", "\n", "The information in the output cells result from execution on a 2019 MacBook Pro with a 2.4 GHz 8-Core Intel Core i9 processor, 32 GB 2667 MHz DDR4 RAM, running Docker Desktop 3.1.0 with 16 GB with 4 CPUs and 16 GB allocated." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import time\n", "\n", "# Prevent SynchronousOnlyOperation exceptions\n", "os.environ[\"DJANGO_ALLOW_ASYNC_UNSAFE\"] = \"true\"\n", "\n", "# Set do_plots to True for visualizations (do not commit with this setting)\n", "do_plots = False\n", "t_start = time.time()\n", "\n", "# Use a recent database available at DATABASE_URL\n", "db = 'stoqs_canon_october2020'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 0. 
Perform a straightforward query using the STOQS data model.\n", "Collect all the sea_water_temperature and sea_water_salinity data into dictionaries keyed by platform name. This is to examine the landscape of data we are querying." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting data for: dorado                  #temps: 169159 #salts: 169159\n", "Collecting data for: M1_Mooring              #temps:   6600 #salts:   6600\n", "Collecting data for: makai                   #temps: 154793 #salts: 154797 - not equal\n", "Collecting data for: makai_ESPmv1_filtering  #temps:  29857 #salts:  29857\n", "Collecting data for: makai_Sipper            #temps:    595 #salts:    595\n", "Collecting data for: NPS_Glider_29           #temps:  33143 #salts:  33143\n", "Collecting data for: NPS_Glider_34           #temps:  32759 #salts:  32759\n", "Collecting data for: pontus                  #temps: 132945 #salts: 132947 - not equal\n", "Collecting data for: wg_Tiny_Glider          #temps:    386 #salts:    386\n", "Done\n" ] } ], "source": [ "# To make sure we collect temperatures and salinities that are properly associated\n", "# we will first find all the Platforms that have T & S and then from each Measurement\n", "# from the Platform collect the temperatures and salinities into lists for plotting.\n", "# Assume that Platforms that have sea_water_salinity also have sea_water_temperature.\n", "platforms = (ActivityParameter.objects.using(db)\n", "             .filter(parameter__standard_name='sea_water_salinity')\n", "             .values_list('activity__platform__name', flat=True)\n", "             .distinct().order_by('activity__platform__name'))\n", "temps = {}\n", "salts = {}\n", "for platform in platforms:\n", "    print(f\"Collecting data for: {platform:23}\", end=' ')\n", "    mps = (MeasuredParameter.objects.using(db)\n", "           .filter(measurement__instantpoint__activity__platform__name=platform))\n", "\n", "    temps[platform] = (mps.filter(parameter__standard_name='sea_water_temperature')\n", "                       .values_list('datavalue', flat=True))\n", "    salts[platform] = (mps.filter(parameter__standard_name='sea_water_salinity')\n", "                       .values_list('datavalue', flat=True))\n", "    print(f\"#temps: {len(temps[platform]):6} #salts: {len(salts[platform]):6}\", end='')\n", "    if len(temps[platform]) != len(salts[platform]):\n", "        print(' - not equal')\n", "    else:\n", "        print()\n", "print('Done')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "if do_plots:\n", "    # Make a T/S plot of data from all the platforms\n", "    import pylab as plt\n", "    plt.rcParams['figure.figsize'] = (18, 6)\n", "    for platform in temps.keys():\n", "        if len(temps[platform]) == len(salts[platform]):\n", "            plt.scatter(salts[platform], temps[platform], s=1, label=platform)\n", "    plt.xlabel('Salinity')\n", "    plt.ylabel('Temperature (DegC)')\n", "    plt.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Approach 1. Use the same kind of self-join query used for selecting data for Parameter-Parameter plots. \n", "A sample SQL statement was copied from the STOQS UI and then modified to select sea_water_temperature and sea_water_salinity from all platforms."
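, "\n", "\n", "The two-Parameter query in the next cell follows a general pattern: each Parameter that is pivoted into its own column gets its own trio of stoqs_measurement/stoqs_measuredparameter/stoqs_parameter joins, plus a matching WHERE clause. As a minimal sketch (a hypothetical helper, not part of STOQS), the extra clauses could be generated dynamically for an arbitrary list of standard_names:\n", "\n", "    def self_join_clauses(standard_names):\n", "        # Build one SELECT column, one trio of JOINs, and one WHERE term per Parameter\n", "        selects, joins, wheres = [], [], []\n", "        for i, sn in enumerate(standard_names):\n", "            selects.append(f\"mp_{i}.datavalue AS col_{i}\")\n", "            joins.append(f\"INNER JOIN stoqs_measurement m_{i} ON m_{i}.instantpoint_id = stoqs_instantpoint.id\")\n", "            joins.append(f\"INNER JOIN stoqs_measuredparameter mp_{i} ON mp_{i}.measurement_id = m_{i}.id\")\n", "            joins.append(f\"INNER JOIN stoqs_parameter p_{i} ON mp_{i}.parameter_id = p_{i}.id\")\n", "            wheres.append(f\"p_{i}.standard_name = '{sn}'\")\n", "        return selects, joins, wheres"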
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "sql_multp = '''SELECT DISTINCT \n", " stoqs_platform.name,\n", " stoqs_instantpoint.timevalue,\n", " stoqs_measurement.depth,\n", " mp_salt.datavalue AS salt,\n", " mp_temp.datavalue AS temp\n", "FROM stoqs_measuredparameter\n", "INNER JOIN stoqs_measurement ON (stoqs_measuredparameter.measurement_id = stoqs_measurement.id)\n", "INNER JOIN stoqs_instantpoint ON (stoqs_measurement.instantpoint_id = stoqs_instantpoint.id)\n", "INNER JOIN stoqs_activity ON (stoqs_instantpoint.activity_id = stoqs_activity.id)\n", "INNER JOIN stoqs_platform ON (stoqs_activity.platform_id = stoqs_platform.id)\n", "INNER JOIN stoqs_measurement m_salt ON m_salt.instantpoint_id = stoqs_instantpoint.id\n", "INNER JOIN stoqs_measuredparameter mp_salt ON mp_salt.measurement_id = m_salt.id\n", "INNER JOIN stoqs_parameter p_salt ON mp_salt.parameter_id = p_salt.id\n", "INNER JOIN stoqs_measurement m_temp ON m_temp.instantpoint_id = stoqs_instantpoint.id\n", "INNER JOIN stoqs_measuredparameter mp_temp ON mp_temp.measurement_id = m_temp.id\n", "INNER JOIN stoqs_parameter p_temp ON mp_temp.parameter_id = p_temp.id\n", "WHERE (p_salt.standard_name = 'sea_water_salinity')\n", " AND (p_temp.standard_name = 'sea_water_temperature')\n", " AND stoqs_platform.name IN ({})\n", "ORDER BY stoqs_instantpoint.timevalue, stoqs_measurement.depth'''" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SELECT DISTINCT \n", " stoqs_platform.name,\n", " stoqs_instantpoint.timevalue,\n", " stoqs_measurement.depth,\n", " mp_salt.datavalue AS salt,\n", " mp_temp.datavalue AS temp\n", "FROM stoqs_measuredparameter\n", "INNER JOIN stoqs_measurement ON (stoqs_measuredparameter.measurement_id = stoqs_measurement.id)\n", "INNER JOIN stoqs_instantpoint ON (stoqs_measurement.instantpoint_id = stoqs_instantpoint.id)\n", "INNER JOIN stoqs_activity ON (stoqs_instantpoint.activity_id = stoqs_activity.id)\n", "INNER JOIN stoqs_platform ON (stoqs_activity.platform_id = stoqs_platform.id)\n", "INNER JOIN stoqs_measurement m_salt ON m_salt.instantpoint_id = stoqs_instantpoint.id\n", "INNER JOIN stoqs_measuredparameter mp_salt ON mp_salt.measurement_id = m_salt.id\n", "INNER JOIN stoqs_parameter p_salt ON mp_salt.parameter_id = p_salt.id\n", "INNER JOIN stoqs_measurement m_temp ON m_temp.instantpoint_id = stoqs_instantpoint.id\n", "INNER JOIN stoqs_measuredparameter mp_temp ON mp_temp.measurement_id = m_temp.id\n", "INNER JOIN stoqs_parameter p_temp ON mp_temp.parameter_id = p_temp.id\n", "WHERE (p_salt.standard_name = 'sea_water_salinity')\n", " AND (p_temp.standard_name = 'sea_water_temperature')\n", " AND stoqs_platform.name IN ('dorado','makai_ESPmv1_filtering','makai_Sipper','NPS_Glider_29','NPS_Glider_34','wg_Tiny_Glider')\n", "ORDER BY stoqs_instantpoint.timevalue, stoqs_measurement.depth\n" ] } ], "source": [ "# Build the SQL with optional selection of platforms to use\n", "db = 'stoqs_canon_october2020'\n", "platforms = (ActivityParameter.objects.using(db)\n", " .filter(parameter__standard_name='sea_water_salinity')\n", " .values_list('activity__platform__name', flat=True)\n", " .order_by('activity__platform__name').distinct())\n", "plats = ''\n", "plat_list = []\n", "for platform in platforms:\n", " if platform == 'M1_Mooring' or platform == 'makai' or platform == 'pontus':\n", " # Continue to omit some platforms for shorter execution times\n", " continue\n", " plats += 
f\"'{platform}',\"\n", " plat_list.append(platform)\n", "plats = plats[:-2] + \"'\"\n", "sql = sql_multp.format(plats)\n", "print(sql)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 614 ms, sys: 91.6 ms, total: 706 ms\n", "Wall time: 8.11 s\n", "(266285, 2)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salttemp
nametimevaluedepth
wg_Tiny_Glider2020-10-05 23:20:00-1.033.49912614.309250
0.533.49912614.309250
2020-10-05 23:30:00-1.033.49525114.408250
0.533.49525114.408250
2020-10-05 23:40:00-1.033.49124914.349625
\n", "
" ], "text/plain": [ " salt temp\n", "name timevalue depth \n", "wg_Tiny_Glider 2020-10-05 23:20:00 -1.0 33.499126 14.309250\n", " 0.5 33.499126 14.309250\n", " 2020-10-05 23:30:00 -1.0 33.495251 14.408250\n", " 0.5 33.495251 14.408250\n", " 2020-10-05 23:40:00 -1.0 33.491249 14.349625" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from django.db import connections\n", "\n", "# It takes about 15 seconds to read about 0.5 million rows from the local STOQS database.\n", "%time df1 = pd.read_sql_query(sql, connections[db], index_col=['name', 'timevalue', 'depth'])\n", "##%time df1 = pd.read_sql_query(sql, connections[db])\n", "print(df1.shape)\n", "df1.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 132 ms, sys: 37.2 ms, total: 170 ms\n", "Wall time: 233 ms\n" ] } ], "source": [ "# Writing the Parquet file takes about 0.6 seconds\n", "%time df1.to_parquet('all_plats.parquet')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 164 ms, sys: 20.9 ms, total: 185 ms\n", "Wall time: 226 ms\n" ] }, { "data": { "text/plain": [ "(266285, 2)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reading the Parquest file takes about 0.4 seconds\n", "%time df1b = pd.read_parquet('all_plats.parquet')\n", "df1b.shape" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salttemp
nametimevaluedepth
wg_Tiny_Glider2020-10-05 23:20:00-1.00000033.49912614.309250
0.50000033.49912614.309250
2020-10-05 23:30:00-1.00000033.49525114.408250
0.50000033.49525114.408250
2020-10-05 23:40:00-1.00000033.49124914.349625
...............
dorado2020-10-28 14:42:431.63863133.48735114.753714
2020-10-28 14:42:441.35610233.48845714.751194
2020-10-28 14:42:460.79172933.48915514.750566
2020-10-28 14:42:470.51377233.48915514.744095
2020-10-28 14:42:480.16938333.48915514.743799
\n", "

266285 rows × 2 columns

\n", "
" ], "text/plain": [ " salt temp\n", "name timevalue depth \n", "wg_Tiny_Glider 2020-10-05 23:20:00 -1.000000 33.499126 14.309250\n", " 0.500000 33.499126 14.309250\n", " 2020-10-05 23:30:00 -1.000000 33.495251 14.408250\n", " 0.500000 33.495251 14.408250\n", " 2020-10-05 23:40:00 -1.000000 33.491249 14.349625\n", "... ... ...\n", "dorado 2020-10-28 14:42:43 1.638631 33.487351 14.753714\n", " 2020-10-28 14:42:44 1.356102 33.488457 14.751194\n", " 2020-10-28 14:42:46 0.791729 33.489155 14.750566\n", " 2020-10-28 14:42:47 0.513772 33.489155 14.744095\n", " 2020-10-28 14:42:48 0.169383 33.489155 14.743799\n", "\n", "[266285 rows x 2 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Datashader plots must be left justified on last line, use this variable to do that\n", "ts_points = None\n", "if do_plots:\n", " # See: http://holoviews.org/user_guide/Large_Data.html\n", " # https://stackoverflow.com/a/18835121/1281657\n", " import colorcet\n", " import holoviews as hv\n", " from holoviews.operation.datashader import rasterize\n", " hv.extension(\"bokeh\")\n", " ropts = dict(height=380, width=300, colorbar=True, colorbar_position=\"bottom\", cmap=colorcet.fire)\n", " plots = [(rasterize(hv.Points(df1.iloc[df1.index.get_level_values('name') == p], kdims=['salt', 'temp']))\n", " .opts(**ropts).relabel(p)) for p in plat_list]\n", " ts_points = hv.Layout(plots).cols(3)\n", "ts_points" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This approach could be used in a general way to extract all Parameters for each Platform by dynamically generating the SQL (with dozens more self joins) and executing it. We do need more scalable methods than `.read_sql_query()` and `.to_parquet()`, which need to read and write all the data in to and out of allocated random access memory. This is why at least a resource of 16 GB of RAM is needed in Docker Desktop for this query - larger data requests would require more memory - this is not scalable. \n", "\n", "This SQL is not forgiving in terms of *ad hoc* modification; for example, adding a column to the SELECT can increase the volume of results by unexpectedly returning a type of cross join with repeated salt and temp values. \n", "\n", "This sort of self-join query returns a lot of duplicate records (kind of a cross join) for the 'M1_Mooring' platform which has a 'stationprofile' CF featureType, resulting in a different relational cardinality that would require special treatment.\n", "\n", "---\n", "\n", "#### Approach 2. Use Brent's trimSTOQS program to convert the MeasuredParameter Data Access output:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.12user 0.80system 0:11.25elapsed 8%CPU (0avgtext+0avgdata 6744maxresident)k\n", "0inputs+0outputs (0major+363minor)pagefaults 0swaps\n", "(33176, 11)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
timevaluedepthgeom.xgeom.yaltitudeTEMP (Celsius)PSAL (0.001)OXYG (ml/L)sigmatspiceNPS_Glider_29
02020-10-06 19:07:40.8000020.000000-121.85399636.803799280.56773213.95633.4620026.00225.0078210.682620NaN
12020-10-06 19:07:57.9310360.794097-121.85405036.803795280.17034513.84533.4669995.91825.0346450.663049NaN
22020-10-06 19:08:15.0620711.885975-121.85410336.803787279.50258813.73133.4679995.81525.0588880.639810NaN
32020-10-06 19:08:32.1931052.878586-121.85414936.803780278.85872813.62633.4720005.71625.0834880.620954NaN
42020-10-06 19:08:49.3241403.970453-121.85420236.803772278.15241213.59033.4690025.62425.0885410.611061NaN
\n", "
" ], "text/plain": [ " timevalue depth geom.x geom.y altitude \\\n", "0 2020-10-06 19:07:40.800002 0.000000 -121.853996 36.803799 280.567732 \n", "1 2020-10-06 19:07:57.931036 0.794097 -121.854050 36.803795 280.170345 \n", "2 2020-10-06 19:08:15.062071 1.885975 -121.854103 36.803787 279.502588 \n", "3 2020-10-06 19:08:32.193105 2.878586 -121.854149 36.803780 278.858728 \n", "4 2020-10-06 19:08:49.324140 3.970453 -121.854202 36.803772 278.152412 \n", "\n", " TEMP (Celsius) PSAL (0.001) OXYG (ml/L) sigmat spice \\\n", "0 13.956 33.462002 6.002 25.007821 0.682620 \n", "1 13.845 33.466999 5.918 25.034645 0.663049 \n", "2 13.731 33.467999 5.815 25.058888 0.639810 \n", "3 13.626 33.472000 5.716 25.083488 0.620954 \n", "4 13.590 33.469002 5.624 25.088541 0.611061 \n", "\n", " NPS_Glider_29 \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# It takes about 5 minutes to read in 0.17 million dorado CSV rows and convert using trimSTOQS\n", "##! time wget https://stoqs.mbari.org/stoqs_canon_october2020/api/measuredparameter.csv?measurement__instantpoint__activity__platform__name=dorado \\\n", "## -q -O - | /srv/stoqs/contrib/trimSTOQS/trimSTOQS parameter__name --separator=, > october2020_dorado_parms.cvs\n", "##df2 = pd.read_csv('/srv/stoqs/contrib/trimSTOQS/october2020_dorado_parms.cvs')\n", "\n", "# It takes about 40 seconds (on a fast network) to read in just 0.033 million NPS_Glider_29 CSV rows and convert using trimSTOQS\n", "! time wget http://stoqs.mbari.org/stoqs_canon_october2020/api/measuredparameter.csv?measurement__instantpoint__activity__platform__name=NPS_Glider_29 \\\n", " -q -O - | /srv/stoqs/contrib/trimSTOQS/trimSTOQS parameter__name --separator=, \\\n", " > /srv/stoqs/contrib/trimSTOQS/october2020_NPS_Glider_29_parms.cvs\n", " \n", "df2 = pd.read_csv('/srv/stoqs/contrib/trimSTOQS/october2020_NPS_Glider_29_parms.cvs')\n", "print(df2.shape)\n", "df2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The advantage of this approach is that all parameters get transformed into the columns we want. The disadvantage is that it takes a long time to extract the data in CSV format. Approach 1 reads at a rate of about 30,000 rows/sec, approach 2 reads at a rate of 1000 rows/sec - **orders of magnitude slower**.\n", "\n", "---\n", "\n", "#### 3. Do a direct Postgresql query to transform the data, perhaps using the [crosstab() function](https://stackoverflow.com/questions/3002499/postgresql-crosstab-query/11751905#11751905). 
\n", "\n", "Need to do this on the database first:\n", "\n", " % docker-compose exec postgis psql -U postgres \n", " postgres=# \\c stoqs_canon_october2020\n", " stoqs_canon_october2020=# CREATE EXTENSION IF NOT EXISTS tablefunc;\n", " CREATE EXTENSION\n", " " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SELECT stoqs_platform.name as platform, stoqs_activity.name as activity__name,\n", " stoqs_instantpoint.timevalue, stoqs_measurement.depth, \n", " ST_X(stoqs_measurement.geom) as longitude, ST_Y(stoqs_measurement.geom) as latitude,\n", " stoqs_parameter.name, standard_name, datavalue \n", "FROM public.stoqs_measuredparameter\n", "INNER JOIN stoqs_measurement ON (stoqs_measuredparameter.measurement_id = stoqs_measurement.id)\n", "INNER JOIN stoqs_instantpoint ON (stoqs_measurement.instantpoint_id = stoqs_instantpoint.id)\n", "INNER JOIN stoqs_activity ON (stoqs_instantpoint.activity_id = stoqs_activity.id)\n", "INNER JOIN stoqs_platform ON (stoqs_activity.platform_id = stoqs_platform.id)\n", "INNER JOIN stoqs_parameter ON (stoqs_measuredparameter.parameter_id = stoqs_parameter.id)\n", "WHERE stoqs_platform.name IN ('dorado','makai_ESPmv1_filtering','makai_Sipper','NPS_Glider_29','NPS_Glider_34','wg_Tiny_Glider')\n", "ORDER BY stoqs_platform.name, stoqs_instantpoint.timevalue, stoqs_measurement.depth, stoqs_parameter.name\n" ] } ], "source": [ "# Base query that's similar to the one behind the api/measuredparameter.csv request\n", "sql_base = '''SELECT stoqs_platform.name as platform, stoqs_activity.name as activity__name,\n", " stoqs_instantpoint.timevalue, stoqs_measurement.depth, \n", " ST_X(stoqs_measurement.geom) as longitude, ST_Y(stoqs_measurement.geom) as latitude,\n", " stoqs_parameter.name, standard_name, datavalue \n", "FROM public.stoqs_measuredparameter\n", "INNER JOIN stoqs_measurement ON (stoqs_measuredparameter.measurement_id = stoqs_measurement.id)\n", "INNER JOIN stoqs_instantpoint ON (stoqs_measurement.instantpoint_id = stoqs_instantpoint.id)\n", "INNER JOIN stoqs_activity ON (stoqs_instantpoint.activity_id = stoqs_activity.id)\n", "INNER JOIN stoqs_platform ON (stoqs_activity.platform_id = stoqs_platform.id)\n", "INNER JOIN stoqs_parameter ON (stoqs_measuredparameter.parameter_id = stoqs_parameter.id)\n", "WHERE stoqs_platform.name IN ({})\n", "ORDER BY stoqs_platform.name, stoqs_instantpoint.timevalue, stoqs_measurement.depth, stoqs_parameter.name'''\n", "sql = sql_base.format(plats)\n", "print(sql)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 12.4 s, sys: 2.39 s, total: 14.8 s\n", "Wall time: 41.8 s\n", "(3683189, 4)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
activity__namenamestandard_namedatavalue
platformtimevaluedepthlatitudelongitude
dorado2020-10-08 20:08:11-0.02826436.796672-121.849978Dorado389_2020_282_01_282_01_decim.ncaltitudeheight_above_sea_floor2.707653e+02
-121.849978Dorado389_2020_282_01_282_01_decim.ncbbp420 (m-1)None1.269240e-03
-121.849978Dorado389_2020_282_01_282_01_decim.ncbbp700 (m-1)None8.974295e-03
-121.849978Dorado389_2020_282_01_282_01_decim.ncbiolumeNone3.691703e+09
-121.849978Dorado389_2020_282_01_282_01_decim.ncfl700_uncorrNone1.431676e-03
\n", "
" ], "text/plain": [ " activity__name \\\n", "platform timevalue depth latitude longitude \n", "dorado 2020-10-08 20:08:11 -0.028264 36.796672 -121.849978 Dorado389_2020_282_01_282_01_decim.nc \n", " -121.849978 Dorado389_2020_282_01_282_01_decim.nc \n", " -121.849978 Dorado389_2020_282_01_282_01_decim.nc \n", " -121.849978 Dorado389_2020_282_01_282_01_decim.nc \n", " -121.849978 Dorado389_2020_282_01_282_01_decim.nc \n", "\n", " name \\\n", "platform timevalue depth latitude longitude \n", "dorado 2020-10-08 20:08:11 -0.028264 36.796672 -121.849978 altitude \n", " -121.849978 bbp420 (m-1) \n", " -121.849978 bbp700 (m-1) \n", " -121.849978 biolume \n", " -121.849978 fl700_uncorr \n", "\n", " standard_name \\\n", "platform timevalue depth latitude longitude \n", "dorado 2020-10-08 20:08:11 -0.028264 36.796672 -121.849978 height_above_sea_floor \n", " -121.849978 None \n", " -121.849978 None \n", " -121.849978 None \n", " -121.849978 None \n", "\n", " datavalue \n", "platform timevalue depth latitude longitude \n", "dorado 2020-10-08 20:08:11 -0.028264 36.796672 -121.849978 2.707653e+02 \n", " -121.849978 1.269240e-03 \n", " -121.849978 8.974295e-03 \n", " -121.849978 3.691703e+09 \n", " -121.849978 1.431676e-03 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Identify the columns used as the context (index) for the measurements\n", "context = ['platform', 'timevalue', 'depth', 'latitude', 'longitude']\n", "\n", "# It takes about 1 minute to read all the Parameters for the selected platforms - about 13.5 million rows\n", "%time df3a = pd.read_sql_query(sql, connections[db], index_col=context)\n", "print(df3a.shape)\n", "df3a.head()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 14.6 s, sys: 1.59 s, total: 16.2 s\n", "Wall time: 43.1 s\n", "(3683189, 3)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namestandard_namedatavalue
platformactivity__nametimevaluedepthlatitudelongitude
doradoDorado389_2020_282_01_282_01_decim.nc2020-10-08 20:08:11-0.02826436.796672-121.849978altitudeheight_above_sea_floor2.707653e+02
-121.849978bbp420 (m-1)None1.269240e-03
-121.849978bbp700 (m-1)None8.974295e-03
-121.849978biolumeNone3.691703e+09
-121.849978fl700_uncorrNone1.431676e-03
\n", "
" ], "text/plain": [ " name \\\n", "platform activity__name timevalue depth latitude longitude \n", "dorado Dorado389_2020_282_01_282_01_decim.nc 2020-10-08 20:08:11 -0.028264 36.796672 -121.849978 altitude \n", " -121.849978 bbp420 (m-1) \n", " -121.849978 bbp700 (m-1) \n", " -121.849978 biolume \n", " -121.849978 fl700_uncorr \n", "\n", " standard_name \\\n", "platform activity__name timevalue depth latitude longitude \n", "dorado Dorado389_2020_282_01_282_01_decim.nc 2020-10-08 20:08:11 -0.028264 36.796672 -121.849978 height_above_sea_floor \n", " -121.849978 None \n", " -121.849978 None \n", " -121.849978 None \n", " -121.849978 None \n", "\n", " datavalue \n", "platform activity__name timevalue depth latitude longitude \n", "dorado Dorado389_2020_282_01_282_01_decim.nc 2020-10-08 20:08:11 -0.028264 36.796672 -121.849978 2.707653e+02 \n", " -121.849978 1.269240e-03 \n", " -121.849978 8.974295e-03 \n", " -121.849978 3.691703e+09 \n", " -121.849978 1.431676e-03 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "context = ['platform', 'activity__name', 'timevalue', 'depth', 'latitude', 'longitude']\n", "\n", "# It takes about 1 minute to read all the Parameters for the selected platforms - about 13.5 million rows\n", "%time df3a = pd.read_sql_query(sql, connections[db], index_col=context)\n", "print(df3a.shape)\n", "df3a.head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 27.3 s, sys: 1.51 s, total: 28.8 s\n", "Wall time: 56.6 s\n", "(3683189, 3)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namestandard_namedatavalue
platformactivity__nametimevaluedepthlatitudelongitude
doradoDorado389_2020_282_01_282_01_decim.nc2020-10-08 20:08:11+00:00-0.02826436.796672-121.849978altitudeheight_above_sea_floor2.707653e+02
-121.849978bbp420 (m-1)None1.269240e-03
-121.849978bbp700 (m-1)None8.974295e-03
-121.849978biolumeNone3.691703e+09
-121.849978fl700_uncorrNone1.431676e-03
\n", "
" ], "text/plain": [ " name \\\n", "platform activity__name timevalue depth latitude longitude \n", "dorado Dorado389_2020_282_01_282_01_decim.nc 2020-10-08 20:08:11+00:00 -0.028264 36.796672 -121.849978 altitude \n", " -121.849978 bbp420 (m-1) \n", " -121.849978 bbp700 (m-1) \n", " -121.849978 biolume \n", " -121.849978 fl700_uncorr \n", "\n", " standard_name \\\n", "platform activity__name timevalue depth latitude longitude \n", "dorado Dorado389_2020_282_01_282_01_decim.nc 2020-10-08 20:08:11+00:00 -0.028264 36.796672 -121.849978 height_above_sea_floor \n", " -121.849978 None \n", " -121.849978 None \n", " -121.849978 None \n", " -121.849978 None \n", "\n", " datavalue \n", "platform activity__name timevalue depth latitude longitude \n", "dorado Dorado389_2020_282_01_282_01_decim.nc 2020-10-08 20:08:11+00:00 -0.028264 36.796672 -121.849978 2.707653e+02 \n", " -121.849978 1.269240e-03 \n", " -121.849978 8.974295e-03 \n", " -121.849978 3.691703e+09 \n", " -121.849978 1.431676e-03 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas.io.sql as sqlio\n", "import psycopg2\n", "\n", "# Use psycopg2 for direct from Postgres query - still explodes stoqs container RAM as this Notebook runs there\n", "conn = psycopg2.connect(\"host='{}' port={} dbname='{}' user={} password={}\".format(\n", " 'postgis', 5432, 'stoqs_canon_october2020', 'stoqsadm', 'CHANGEME'))\n", "# Takes about 5 minutes to read 13.5 million rows\n", "%time df3b = sqlio.read_sql_query(sql, conn, index_col=context)\n", "print(df3b.shape)\n", "df3b.head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "It would be nice if chunksize helped with memory usage in docker, but it doesn't.\n" ] } ], "source": [ "# Setting chunksize doesn't help reduce memory requirements in the stoqs or stoqs-postgis containers\n", "# See: https://stackoverflow.com/a/31843091/1281657\n", "# https://github.com/pandas-dev/pandas/issues/12265#issuecomment-181809005\n", "# https://github.com/pandas-dev/pandas/issues/35689\n", "df3c = pd.DataFrame()\n", "# Still takes about 2.5 minutes to read 13.5 rows (chunking happens only on client in Pandas)\n", "##%time chunker = pd.read_sql_query(sql, connections[db], index_col=context, chunksize=1000000)\n", "##for chunk in chunker:\n", " ##print(chunk.shape)\n", " ##df3c.add(chunk.pivot_table(index=context, columns='name', values='datavalue'))\n", "print(\"It would be nice if chunksize helped with memory usage in docker, but it doesn't.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The syntax of crosstab() is arcane and it will take some work to figure out a way to preserve datetime objects as they are read into a DataFrame. It's likely that performing a pivot on the data closer to the database will be more performant than say doing it after reading records into a DataFrame.\n", "\n", "---\n", "#### Approach 4. Use Pandas do a pivot on data read into a DataFrame\n", "Similar to Approach 2, but this may be more efficient as conversion to and from CSV text format is avoided." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.9 s, sys: 53.1 ms, total: 4.96 s\n", "Wall time: 5.02 s\n", "(312470, 51)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameO2_conc (umol/kg)OXYGOXYG (ml/L)PAR (umol/s/m2)PSAL (0.001)TEMP (Celsius)air_temp (C)altitudeatm_press (mbar)avg_wind_spd (m/s)...salinitysalinity (psu)sepCountList (count)sigmatspicetemperature (Celsius)temperature (degC)water_temp (C)wind_dir (deg (True))yaw (degree)
platformtimevaluedepthlatitudelongitude
NPS_Glider_292020-10-06 19:07:40.8000020.00000036.803799-121.853996NaNNaN6.002NaN33.46200213.956NaN280.567732NaNNaN...NaNNaNNaN25.0078210.682620NaNNaNNaNNaNNaN
2020-10-06 19:07:57.9310360.79409736.803795-121.854050NaNNaN5.918NaN33.46699913.845NaN280.170345NaNNaN...NaNNaNNaN25.0346450.663049NaNNaNNaNNaNNaN
2020-10-06 19:08:15.0620711.88597536.803787-121.854103NaNNaN5.815NaN33.46799913.731NaN279.502588NaNNaN...NaNNaNNaN25.0588880.639810NaNNaNNaNNaNNaN
2020-10-06 19:08:32.1931052.87858636.803780-121.854149NaNNaN5.716NaN33.47200013.626NaN278.858728NaNNaN...NaNNaNNaN25.0834880.620954NaNNaNNaNNaNNaN
2020-10-06 19:08:49.3241403.97045336.803772-121.854202NaNNaN5.624NaN33.46900213.590NaN278.152412NaNNaN...NaNNaNNaN25.0885410.611061NaNNaNNaNNaNNaN
\n", "

5 rows × 51 columns

\n", "
" ], "text/plain": [ "name O2_conc (umol/kg) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name OXYG \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name OXYG (ml/L) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 6.002 \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 5.918 \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 5.815 \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 5.716 \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 5.624 \n", "\n", "name PAR (umol/s/m2) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name PSAL (0.001) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 33.462002 \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 33.466999 \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 33.467999 \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 33.472000 \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 33.469002 \n", "\n", "name TEMP (Celsius) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 13.956 \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 13.845 \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 13.731 \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 13.626 \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 13.590 \n", "\n", "name air_temp (C) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name altitude \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 280.567732 \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 280.170345 \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 279.502588 \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 278.858728 \n", " 
2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 278.152412 \n", "\n", "name atm_press (mbar) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name avg_wind_spd (m/s) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name ... \\\n", "platform timevalue depth latitude longitude ... \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 ... \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 ... \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 ... \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 ... \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 ... \n", "\n", "name salinity \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name salinity (psu) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name sepCountList (count) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name sigmat \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 25.007821 \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 25.034645 \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 25.058888 \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 25.083488 \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 25.088541 \n", "\n", "name spice \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 0.682620 \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 0.663049 \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 0.639810 \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 
-121.854149 0.620954 \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 0.611061 \n", "\n", "name temperature (Celsius) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name temperature (degC) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name water_temp (C) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name wind_dir (deg (True)) \\\n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "name yaw (degree) \n", "platform timevalue depth latitude longitude \n", "NPS_Glider_29 2020-10-06 19:07:40.800002 0.000000 36.803799 -121.853996 NaN \n", " 2020-10-06 19:07:57.931036 0.794097 36.803795 -121.854050 NaN \n", " 2020-10-06 19:08:15.062071 1.885975 36.803787 -121.854103 NaN \n", " 2020-10-06 19:08:32.193105 2.878586 36.803780 -121.854149 NaN \n", " 2020-10-06 19:08:49.324140 3.970453 36.803772 -121.854202 NaN \n", "\n", "[5 rows x 51 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Identify the columns used as the index for the pivot\n", "context = ['platform', 'timevalue', 'depth', 'latitude', 'longitude']\n", "\n", "%time df4 = df3a.pivot_table(index=context, columns='name', values='datavalue')\n", "print(df4.shape)\n", "df4.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This approach looks promising. Some advantages:\n", "\n", "1. The SQL query is a simple inner join of the tables - similar to that used for MeasuredParameter Data Access\n", "2. No complicated self joins are needed\n", "3. It will work for any Parameter names from any platforms\n", "4. Missing values are preserved as None or NaN\n", "5. Pandas pivot_table() method is efficient, taking only about 5 seconds\n", " \n", "Some disadvantages:\n", "\n", "1. This uses the stoqs Docker image and the Django api - it could be closer to the database\n", "2. More than 16 GB needs to be resourced to Docker Desktop to read the entire stoqs_canon_october2020 db\n", "3. 
When memory is exhausted there is no error message provided; the process annoyingly quits silently\n", "\n", "---\n", "\n", "These experiments led to the script stoqs/contrib/parquet/extract_columns.py, which implements this capability at the command line. This notebook can still serve as a \"playground\" for testing out various ways to get STOQS data into modern data science tools. " ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "if do_plots:\n", "    # See: https://datashader.org/getting_started/Pipeline.html\n", "    import colorcet\n", "    import holoviews as hv\n", "    from holoviews.operation.datashader import datashade\n", "    hv.extension(\"bokeh\")\n", "    pts1 = hv.Points(df1, kdims=['salt', 'temp'])\n", "    pts2 = hv.Points(df2, kdims=['PSAL (0.001)', 'TEMP (Celsius)'])\n", "    pts4a = hv.Points(df4, kdims=['salinity', 'temperature (Celsius)'])\n", "    pts4b = hv.Points(df4, kdims=['PSAL (0.001)', 'TEMP (Celsius)'])\n", "    ts_points = (datashade(pts1, cmap=colorcet.fire).opts(title='df1')\n", "                 + datashade(pts2, cmap=colorcet.fire).opts(title='df2')\n", "                 + datashade(pts4a, cmap=colorcet.fire).opts(title='df4a')\n", "                 + datashade(pts4b, cmap=colorcet.fire).opts(title='df4b'))\n", "ts_points" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "if do_plots:\n", "    # See: http://holoviews.org/user_guide/Large_Data.html\n", "    from holoviews.operation.datashader import rasterize\n", "    ##ropts = dict(tools=[\"pan,wheel_zoom,box_zoom\"], height=380, width=330, colorbar=True, colorbar_position=\"bottom\")\n", "    ropts = dict(height=380, width=330, colorbar=True, colorbar_position=\"bottom\")\n", "\n", "    ts_points = hv.Layout([rasterize(hv.Points(df1.iloc[df1.index.get_level_values('name') == p], kdims=['temp', 'salt']))\n", "                           .opts(**ropts).relabel(p) for p in plat_list])\n", "\n", "ts_points" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time to execute this notebook: 176.5 seconds\n" ] } ], "source": [ "print(f\"Time to execute this notebook: {(time.time() - t_start):.1f} seconds\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Monitoring with `docker stats` shows that the stoqs container required **7.5 GB** of memory to execute this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Django Shell-Plus", "language": "python", "name": "django_extensions" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, "nbformat_minor": 2 }