{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Accessing DC2 data in PostgreSQL at NERSC part 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Owner: **Joanne Bogart [@jrbogart](https://github.com/LSSTDESC/DC2-analysis/issues/new?body=@jrbogart)** \n", "Last Verified to Run: \n", "\n", "This notebook introduces some additional features of the PostgreSQL database at NERSC. \n", "\n", "__Learning objectives__:\n", "\n", "After going through this notebook, you should be able to:\n", " 1. Discover which object catalogs are available\n", " 2. Query on native quantities in those catalogs\n", " 3. Make use of custom functions, in particular for area searches\n", "\n", "__Logistics__: This notebook is intended to be run through the JupyterHub NERSC interface available here: https://jupyter-dev.nersc.gov. To setup your NERSC environment, please follow the instructions available here: https://confluence.slac.stanford.edu/display/LSSTDESC/Using+Jupyter-dev+at+NERSC\n", "\n", "### Prerequisites\n", "Please see the first notebook in this series for instructions on how to gain access to the database.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import psycopg2\n", "\n", "import numpy as np\n", "\n", "%matplotlib inline \n", "import matplotlib.pyplot as plt\n", "\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make the db connection" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dbname = 'desc_dc2_drp'\n", "dbuser = 'desc_dc2_drp_user'\n", "dbhost = 'nerscdb03.nersc.gov'\n", "dbconfig = {'dbname' : dbname, 'user' : dbuser, 'host' : dbhost}\n", "dbconn = psycopg2.connect(**dbconfig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finding Data\n", "Tables for the Run1.2i data as well as a view to make dpdd quantities more easily accessible are in the `schema` (acts like a namespace) `run12i`. To reference, say, a table called `position` for Run1.2i use `run12i.position`. \n", "\n", "### Finding Datasets\n", "To find out which datasets are available and by what schema names, query the table `run_provenance`. It's in a special schema known as `public` which does not normally need to be specified." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cols = ['schema_name', 'run_designation','simulation_program', 'db_ingest', 'remarks']\n", "# Additional columns in run_provenance store software and input data versions\n", "prov_query = 'SELECT ' + ','.join(cols) + ' from run_provenance'\n", "with dbconn.cursor() as cursor:\n", " cursor.execute(prov_query)\n", " fmt = '{0!s:13} {1!s:16} {2!s:18} {3!s:12} {4!s}'\n", " print(fmt.format(cols[0], cols[1], cols[2], cols[3], cols[4]))\n", " for record in cursor:\n", " print(fmt.format(record[0], record[1], record[2], record[3], record[4]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Normally only data sets with `db_ingest` == 'complete' are of interest" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Pick one of the supported datasets\n", "schema = 'run12p_v4'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Querying on Native Quantities\n", "Unlike DPDD quantities (all in a single view), native quantities are split across several tables. The first notebook in the PostgreSQL collection shows how to find out which tables belong to a schema and what columns a table has. Alternatively, if you know the column names you want, you can query for the table name. The following looks for the table containing `ext_shapeHSM_HsmShapeRegauss_resolution`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "column = 'ext_shapehsm_hsmshaperegauss_flag'\n", "find_table_query = \"select table_name from information_schema.columns where table_schema='{}' and column_name='{}'\"\n", "find_table_query = find_table_query.format(schema, column)\n", "print(find_table_query)\n", "with dbconn.cursor() as cursor:\n", " cursor.execute(find_table_query)\n", " for record in cursor:\n", " print(record[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is some necessary fussiness here:\n", "* Note `ext_shapeHSM_HsmShapeRegauss_flag` has been transformed to all lower-case in the query. This is required when querying information_schema, where this string is a __value__ in the database (not a column name). \n", "* In the query single quotes are used around literals like `run12p_v4`. Double quotes won't work." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now suppose we wanted to combine a cut on this quantity, in table dpdd_ref, with cuts on DPDD quantities like `clean`. Then the query has to be made on a join of these two tables, where we specify that the value of dpdd_ref.object_id = dpdd.objectid. This causes the corresponding rows from each table (or view) to be treated as if they were assembled into one long row. Here is a simple query showing how this is done. A more realistic one would have more conditions in the `where` clause and might join more than two tables." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = 'run12i'\n", "join_query = 'select count(object_id) from {schema}.dpdd_ref join {schema}.dpdd '\n", "join_query += 'on {schema}.dpdd_ref.object_id = {schema}.dpdd.objectid'\n", "join_query = join_query.format(**locals())\n", "where = \" where (ext_shapeHSM_HSmShapeRegauss_flag = 'f') and clean\"\n", "join_query += where\n", "print(join_query) # confirm the query looks reasonable\n", "with dbconn.cursor() as cursor:\n", " cursor.execute(join_query)\n", " for record in cursor:\n", " print(record[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## User-defined Functions (UDFs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many math functions from the c library have been wrapped and incorporated in an extension module installed in the database. They have their normal c library names with the prefix c_. Functions with a floating point argument or return value usually have two versions, such as c_log (double precision natural logarithm) and c_logf (single precision). They can be incorporated in queries as in this example using the command-line interface program psql:\n", "```\n", "desc_dc2_drp=> select c_asin(1.0);\n", "\n", " c_asin\n", "-----------------\n", " 1.5707963267949\n", "```\n", "There are also functions specially crafted for HSC or LSST catalogs with suggestive names like `patch_contains`, `tract_from_object_id` (used in q3 in the first notebook of this series), `sky_to_pixel`,..\n", "```\n", "desc_dc2_drp=> select count(*) from run12i.dpdd where tractsearch(objectId, 5063);\n", " count\n", "--------\n", " 233982\n", "(1 row)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Area Searches\n", "The **dpdd** view has one extra column, `coord`, which is not formally a DPDD quantity. `coord` is an alternate way (other than `ra` and `dec`) to express location. A `coord` value is a triple of doubles representing a position on a sphere in units of arcseconds. This column is indexed, which can make certain calculations faster. In particular, using the functions `conesearch` and `boxsearch` (which take a `coord` as input) rather than starting with `ra` and `dec` makes queries much faster. There are also functions to translate between `coord` and `(ra, dec)`.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Cone search\n", "Find all stars satisfying quality cuts within a fixed radius of a particular coordinate. The function `coneSearch` returns true if `coord` is within the cone centered at (ra, dec) of the specified radius, measured in arcseconds. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = 'run12p_v4'\n", "band = 'i'\n", "mag_col = 'mag_' + band\n", "min_SNR = 25 \n", "max_err = 1/min_SNR\n", "pop = 'Stars'\n", "ra = 54.5\n", "decl = -31.4\n", "radius = 240.0\n", "where = ' where (magerr_{band} < {max_err}) and clean and (extendedness < 1.0) and coneSearch(coord, {ra}, {decl}, {radius})'\n", "qcone = ('SELECT ra, dec, mag_{band} from {schema}.dpdd ' + where).format(**locals())\n", "print(qcone)\n", "with dbconn.cursor() as cursor:\n", " %time cursor.execute(qcone)\n", " records = cursor.fetchall()\n", " nObj = len(records)\n", " print('{} objects found '.format(nObj))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cmags = pd.DataFrame(records, columns=['ra', 'dec', mag_col])\n", "\n", "plt.figure(figsize=(8, 8))\n", "plt.xlabel('ra')\n", "plt.ylabel('dec')\n", "plt.suptitle(pop + ' Cone search', size='xx-large', y=0.92)\n", "p = plt.scatter(cmags['ra'], cmags['dec'], color='y')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how fast the query is. Compare time, # objects found and scatter plot after increasing radius, e.g. by a factor of 10 to 2400.0. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Box search\n", "Find all stars, subject to quality cuts, with the specified ra and dec bounds" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ra1 = 54.4\n", "ra2 = 54.8\n", "decl1 = -31.6\n", "decl2 = -31.3\n", "\n", "where = ' where (magerr_{band} < {max_err}) and clean and (extendedness < 1.0) and boxSearch(coord, {ra1}, {ra2},{decl1}, {decl2})'\n", "qbox = ('SELECT ra, dec, mag_{band} from {schema}.dpdd ' + where).format(**locals())\n", "print(qbox)\n", "with dbconn.cursor() as cursor:\n", " %time cursor.execute(qbox)\n", " records = cursor.fetchall()\n", " nObj = len(records)\n", " print('{} objects found '.format(nObj))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bmags = pd.DataFrame(records, columns=['ra', 'dec', mag_col])\n", "\n", "plt.figure(figsize=(8, 8))\n", "plt.xlabel('ra')\n", "plt.ylabel('dec')\n", "plt.suptitle(pop + ' Box search', size='xx-large', y=0.92)\n", "p = plt.scatter(bmags['ra'], bmags['dec'], color='y')\n" ] } ], "metadata": { "kernelspec": { "display_name": "desc-python", "language": "python", "name": "desc-python" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }