{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Accessing DC2 data in PostgreSQL at NERSC part 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Owner: **Joanne Bogart [@jrbogart](https://github.com/LSSTDESC/DC2-analysis/issues/new?body=@jrbogart)**  \n",
    "Last Verified to Run: \n",
    "\n",
    "This notebook introduces some additional features of the PostgreSQL database at NERSC.   \n",
    "\n",
    "__Learning objectives__:\n",
    "\n",
    "After going through this notebook, you should be able to:\n",
    "  1. Discover which object catalogs are available\n",
    "  2. Query on native quantities in those catalogs\n",
    "  3. Make use of custom functions, in particular for area searches\n",
    "\n",
    "__Logistics__: This notebook is intended to be run through the JupyterHub NERSC interface available here: https://jupyter-dev.nersc.gov. To setup your NERSC environment, please follow the instructions available here: https://confluence.slac.stanford.edu/display/LSSTDESC/Using+Jupyter-dev+at+NERSC\n",
    "\n",
    "### Prerequisites\n",
    "Please see the first notebook in this series for instructions on how to gain access to the database.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import psycopg2\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "%matplotlib inline \n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make the db connection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dbname = 'desc_dc2_drp'\n",
    "dbuser = 'desc_dc2_drp_user'\n",
    "dbhost = 'nerscdb03.nersc.gov'\n",
    "dbconfig = {'dbname' : dbname, 'user' : dbuser, 'host' : dbhost}\n",
    "dbconn = psycopg2.connect(**dbconfig)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finding Data\n",
    "Tables for the Run1.2i data as well as a view to make dpdd quantities more easily accessible are in the `schema` (acts like a namespace) `run12i`.  To reference, say, a table called `position` for Run1.2i use `run12i.position`. \n",
    "\n",
    "### Finding Datasets\n",
    "To find out which datasets are available and by what schema names, query the table `run_provenance`. It's in a special schema known as `public` which does not normally need to be specified."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cols = ['schema_name', 'run_designation','simulation_program', 'db_ingest', 'remarks']\n",
    "# Additional columns in run_provenance store software and input data versions\n",
    "prov_query = 'SELECT '  + ','.join(cols) + ' from run_provenance'\n",
    "with dbconn.cursor() as cursor:\n",
    "    cursor.execute(prov_query)\n",
    "    fmt = '{0!s:13} {1!s:16} {2!s:18} {3!s:12} {4!s}'\n",
    "    print(fmt.format(cols[0], cols[1], cols[2], cols[3], cols[4]))\n",
    "    for record in cursor:\n",
    "        print(fmt.format(record[0], record[1], record[2], record[3], record[4]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Normally only data sets with `db_ingest` == 'complete' are of interest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pick one of the supported datasets\n",
    "schema = 'run12p_v4'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Querying on Native Quantities\n",
    "Unlike DPDD quantities (all in a single view), native quantities are split across several tables. The first notebook in the PostgreSQL collection shows how to find out which tables belong to a schema and what columns a table has. Alternatively, if you know the column names you want, you can query for the table name.  The following looks for the table containing `ext_shapeHSM_HsmShapeRegauss_resolution`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "column = 'ext_shapehsm_hsmshaperegauss_flag'\n",
    "find_table_query = \"select table_name from information_schema.columns where table_schema='{}' and column_name='{}'\"\n",
    "find_table_query = find_table_query.format(schema, column)\n",
    "print(find_table_query)\n",
    "with dbconn.cursor() as cursor:\n",
    "    cursor.execute(find_table_query)\n",
    "    for record in cursor:\n",
    "        print(record[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is some necessary fussiness here:\n",
    "* Note `ext_shapeHSM_HsmShapeRegauss_flag` has been transformed to all lower-case in the query. This is required when querying information_schema, where this string is a __value__ in the database (not a column name). \n",
    "* In the query single quotes are used around literals like `run12p_v4`. Double quotes won't work."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now suppose we wanted to combine a cut on this quantity, in table dpdd_ref, with cuts on DPDD quantities like `clean`.  Then the query has to be made on a join of these two tables, where we specify that the value of dpdd_ref.object_id = dpdd.objectid. This causes the corresponding rows from each table (or view) to be treated as if they were assembled into one long row. Here is a simple query showing how this is done. A more realistic one would have more conditions in the `where` clause and might join more than two tables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "schema = 'run12i'\n",
    "join_query = 'select count(object_id) from {schema}.dpdd_ref join {schema}.dpdd '\n",
    "join_query += 'on {schema}.dpdd_ref.object_id = {schema}.dpdd.objectid'\n",
    "join_query = join_query.format(**locals())\n",
    "where = \" where (ext_shapeHSM_HSmShapeRegauss_flag = 'f') and clean\"\n",
    "join_query += where\n",
    "print(join_query)                # confirm the query looks reasonable\n",
    "with dbconn.cursor() as cursor:\n",
    "    cursor.execute(join_query)\n",
    "    for record in cursor:\n",
    "        print(record[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## User-defined Functions (UDFs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many math functions from the c library have been wrapped and incorporated in an extension module installed in the database. They have their normal c library names with the prefix c_. Functions with a floating point argument or return value usually have two versions, such as c_log (double precision natural logarithm) and c_logf (single precision). They can be incorporated in queries as in this example using the command-line interface program psql:\n",
    "```\n",
    "desc_dc2_drp=> select c_asin(1.0);\n",
    "\n",
    "     c_asin\n",
    "-----------------\n",
    " 1.5707963267949\n",
    "```\n",
    "There are also functions specially crafted for HSC or LSST catalogs with suggestive names like `patch_contains`, `tract_from_object_id` (used in q3 in the first notebook of this series), `sky_to_pixel`,..\n",
    "```\n",
    "desc_dc2_drp=> select count(*) from run12i.dpdd where tractsearch(objectId, 5063);\n",
    " count\n",
    "--------\n",
    " 233982\n",
    "(1 row)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Area Searches\n",
    "The  **dpdd** view has one extra column, `coord`, which is not formally a DPDD quantity. `coord` is an alternate way (other than `ra` and `dec`) to express location.  A `coord` value is a triple of doubles representing a position on a sphere in units of arcseconds. This column is indexed, which can make certain calculations faster. In particular, using the functions `conesearch` and `boxsearch` (which take a `coord` as input) rather than starting with `ra` and `dec` makes queries much faster.  There are also functions to translate between `coord` and `(ra, dec)`.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Cone search\n",
    "Find all stars satisfying quality cuts within a fixed radius of a particular coordinate.  The function `coneSearch` returns true if `coord` is within the cone centered at (ra, dec) of the specified radius, measured in arcseconds. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "schema = 'run12p_v4'\n",
    "band = 'i'\n",
    "mag_col = 'mag_' + band\n",
    "min_SNR = 25   \n",
    "max_err = 1/min_SNR\n",
    "pop = 'Stars'\n",
    "ra = 54.5\n",
    "decl = -31.4\n",
    "radius = 240.0\n",
    "where = ' where (magerr_{band} < {max_err}) and clean and (extendedness < 1.0) and coneSearch(coord, {ra}, {decl}, {radius})'\n",
    "qcone = ('SELECT ra, dec, mag_{band} from {schema}.dpdd ' + where).format(**locals())\n",
    "print(qcone)\n",
    "with dbconn.cursor() as cursor:\n",
    "    %time cursor.execute(qcone)\n",
    "    records = cursor.fetchall()\n",
    "    nObj = len(records)\n",
    "    print('{} objects found '.format(nObj))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cmags = pd.DataFrame(records, columns=['ra', 'dec', mag_col])\n",
    "\n",
    "plt.figure(figsize=(8, 8))\n",
    "plt.xlabel('ra')\n",
    "plt.ylabel('dec')\n",
    "plt.suptitle(pop + ' Cone search', size='xx-large', y=0.92)\n",
    "p = plt.scatter(cmags['ra'], cmags['dec'], color='y')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice how fast the query is.   Compare time, # objects found and scatter plot after increasing radius, e.g. by a factor of 10 to 2400.0. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Box search\n",
    "Find all stars, subject to quality cuts, with the specified ra and dec bounds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ra1 = 54.4\n",
    "ra2 = 54.8\n",
    "decl1 = -31.6\n",
    "decl2 = -31.3\n",
    "\n",
    "where = ' where (magerr_{band} < {max_err}) and clean and (extendedness < 1.0) and boxSearch(coord, {ra1}, {ra2},{decl1}, {decl2})'\n",
    "qbox = ('SELECT ra, dec, mag_{band} from {schema}.dpdd ' + where).format(**locals())\n",
    "print(qbox)\n",
    "with dbconn.cursor() as cursor:\n",
    "    %time cursor.execute(qbox)\n",
    "    records = cursor.fetchall()\n",
    "    nObj = len(records)\n",
    "    print('{} objects found '.format(nObj))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bmags = pd.DataFrame(records, columns=['ra', 'dec', mag_col])\n",
    "\n",
    "plt.figure(figsize=(8, 8))\n",
    "plt.xlabel('ra')\n",
    "plt.ylabel('dec')\n",
    "plt.suptitle(pop + ' Box search', size='xx-large', y=0.92)\n",
    "p = plt.scatter(bmags['ra'], bmags['dec'], color='y')\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "desc-python",
   "language": "python",
   "name": "desc-python"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}