{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Accessing DC2 forced source data in PostgreSQL at NERSC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Owner: **Joanne Bogart [@jrbogart](https://github.com/LSSTDESC/DC2-analysis/issues/new?body=@jrbogart)** \n", "Last Verified to Run: **2020-08-03**\n", "\n", "This notebook demonstrates access to forced source data via the PostgreSQL database at NERSC. Currently the only forced source dataset available is the one for Run1.2p v4. Because of the size of forced source (not many columns but a lot of rows; over 680 million just for 1.2p v4) database access is likely to perform better than file access. It is also possible to efficiently correlate the information in the forced source dataset and the object catalog. A view has been provided so that the most useful quantities from the object catalog may be fetched easily along with forced source fields.\n", "\n", "__Learning objectives__:\n", "\n", "After going through this notebook, you should be able to:\n", " 1. Find out what Forced source information is available and query it.\n", " 2. Find out what information is kept per visit and query it.\n", " 3. Use the forcedsource view to get forced source information and most commonly needed fields from the object catalog and visit table associated with the forced source entries.\n", " 4. Make use of standard tools to, e.g., plot light curves\n", "\n", "__Logistics__: This notebook is intended to be run through the JupyterHub NERSC interface available here: https://jupyter.nersc.gov. To setup your NERSC environment, please follow the instructions available here: \n", "https://confluence.slac.stanford.edu/display/LSSTDESC/Using+Jupyter+at+NERSC\n", "### Prerequisites\n", "* You should work through the first PostgreSQL notebook, \"Accessing DC2 Data in PostgreSQL at NERSC\", before tackling this one.\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import psycopg2\n", "import numpy as np\n", "%matplotlib inline \n", "import matplotlib.pyplot as plt\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make the db connection" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dbname = 'desc_dc2_drp'\n", "dbuser = 'desc_dc2_drp_user'\n", "dbhost = 'nerscdb03.nersc.gov'\n", "dbconfig = {'dbname' : dbname, 'user' : dbuser, 'host' : dbhost}\n", "dbconn = psycopg2.connect(**dbconfig)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = 'run12p_v4' # Currently (May 2019) only dataset with forced source " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display all tables and views belonging to the schema. Most of them are for the object catalog" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "q1 = \"SELECT DISTINCT table_name FROM information_schema.columns WHERE table_schema='{schema}' ORDER BY table_name\".format(**locals())\n", "with dbconn.cursor() as cursor:\n", " # Could have several queries interspersed with other code in this block\n", " cursor.execute(q1)\n", " for record in cursor:\n", " print(record[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Where Forced Source data can be found\n", "\n", "**\\_temp:forced\\_patch** and **\\_temp:forced_bit** are artifacts of the ingest process and are of no interest here.\n", "\n", "forcedsourcenative has columns from the forced source data most likely to be of interest. These include objectid (identical in meaning to its use in the object catalog) and ccdvisitid. ccdvisitid uniquely identifies a row in the ccdvisit table and is computed from visit, raft and sensor ids. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tbl = 'forcedsourcenative'\n", "q2 = \"SELECT column_name, data_type FROM information_schema.columns WHERE table_schema='{schema}' AND table_name='{tbl}' order by column_name \".format(**locals())\n", "print(q2)\n", "with dbconn.cursor() as cursor:\n", " cursor.execute(q2)\n", " records = cursor.fetchall()\n", " print(\"There are {} columns in table {}. They are:\\n\".format(len(records), tbl))\n", " print(\"Name Data Type\")\n", " for record in records:\n", " print(\"{0!s:55} {1!s:20}\".format(record[0], record[1]) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a similar query for the **ccdvisit** table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tbl = 'ccdvisit'\n", "q2_pos = \"SELECT column_name, data_type FROM information_schema.columns WHERE table_schema='{schema}' AND table_name='{tbl}'\".format(**locals())\n", "with dbconn.cursor() as cursor:\n", " cursor.execute(q2_pos)\n", " records = cursor.fetchall()\n", " print(\"There are {} columns in table {}. They are:\\n\".format(len(records), tbl))\n", " print(\"Name Data Type\")\n", " for record in records:\n", " print(\"{0!s:55} {1!s:20}\".format(record[0], record[1]) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of these columns two (`expmidpt`, `exptime`) are, in this data set, always null. For many purposes `ccdname` and `raftname` are of no interest and `visitid` is encompassed by `ccdvisitid`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a query which finds all visits. The example only prints out the total number." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "q3 = \"SELECT DISTINCT visitid FROM {schema}.ccdvisit\".format(**locals())\n", "with dbconn.cursor() as cursor:\n", " %time cursor.execute(q3)\n", " records = cursor.fetchall()\n", " print(\"{} visits found\".format(len(records)))\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The view `forcedsource` selects fields of interest from `forcedsourcenative`, `ccdvisit` and the object catalog. This query fetches all its fields." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tbl = 'forcedsource'\n", "q4 = \"SELECT column_name, data_type FROM information_schema.columns WHERE table_schema='{schema}' AND table_name='{tbl}'\".format(**locals())\n", "with dbconn.cursor() as cursor:\n", " cursor.execute(q4)\n", " records = cursor.fetchall()\n", " print(\"There are {} columns in view {}. They are:\\n\".format(len(records), tbl))\n", " print(\"Name Data Type\")\n", " for record in records:\n", " print(\"{0!s:55} {1!s:20}\".format(record[0], record[1]) ) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some explanation is in order. Most of these fields (e.g. `psflux_g`, `psflux_flag_g`, `psfluxerr_g`, `mag_g`, `magerr_g` and similar for the other bands) come from the corresponding object in the object catalog and have the same name they have in the object dpdd. `ra`, `dec` (as well as `coord`, which is a more convenient way to express location in some circumstances), `extendedness`, `blendedness`, `good`, and `clean` also come from the object catalog. (For information about dpdd quantities like clean and other fields mentioned above, see the [SCHEMA.md](https://github.com/LSSTDESC/gcr-catalogs/blob/master/GCRCatalogs/SCHEMA.md#schema-for-dc2-coadd-catalogs) file of the LSSTDESC/gcr-catalogs repository.)\n", "\n", "`filtername` and `obsstart` come from `ccdvisit`. The rest all come from `forcedsourcenative` one way or another. Some fields, such as `psflux`, come directly but have been renamed to match the names suggested for the forced source dpdd as defined in LSE-163. Others (e.g. `mag`, `magerr`) have been computed from one or more fields in `forcedsourcenative`. `forcedsourcevisit_good` is similar to `good` (meaning no flagged pixels) but uses flags from `forcedsourcenative` rather than the object catalog." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Light curves\n", "For this purpose first find objects with lots of visits. This query takes 5 or 6 minutes to execute, the only query in this notebook which is slow; the others all return essentially immediately. Then cut on various measures of goodness. For the remaining objects save object id, ra, dec and extendedness. Since the query is relatively long the function writes out the data. You can either just read in such a file or recreate it yourself by uncommenting the call to the function,\n", "substituting something reasonable for `'some_path'`, and then skip the step which reads from a file created previously by calling the function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def getGoodVisited(outpath, schema):\n", " '''\n", " Get and save information for good objects which show up in at least 400 visits\n", " \n", " Parameters\n", " ----------\n", " outpath : str Path to output csv file\n", " schema : str Database schema in which data are stored\n", " \n", " Returns\n", " -------\n", " Pandas data frame with certain properties for all objects making the cuts \n", " '''\n", " # First find visit count per object\n", " avisited_q = 'select objectid,count(objectid) as avisit_count from {}.forcedsourcenative ' \n", " avisited_q += 'group by (objectid) '\n", " avisited_q += 'order by avisit_count desc'\n", " avisited_qf = avisited_q.format(schema)\n", " print(avisited_qf) # this is the time-consuming query\n", " avisited_records = []\n", " with dbconn.cursor() as cursor:\n", " %time cursor.execute(avisited_qf)\n", " avisited_records = cursor.fetchall()\n", " avisited_df = pd.DataFrame(avisited_records, columns=['objectid', 'visit_count'])\n", " print(avisited_df.shape)\n", " \n", " # Keep the ones with at least 400 visits\n", " over400_df = avisited_df.query('visit_count > 400')\n", " print(over400_df.shape)\n", " \n", " # Now query on those object in the set which also satisfy some cuts\n", " i_list = list(over400_df['objectid'])\n", " s_list = []\n", " for i in i_list : s_list.append(str(i))\n", " objectcut = ' AND objectid in (' + ','.join(s_list) + ')'\n", "\n", " global_cuts = 'clean '\n", "\n", " min_SNR = 25 \n", " max_err = 1/min_SNR\n", " band_cuts = ' (magerr_g < {max_err}) AND (magerr_i < {max_err}) AND (magerr_r < {max_err}) '.format(**locals())\n", " where = ' WHERE ' + global_cuts + ' AND ' + band_cuts \n", "\n", " goodobjects_q = \"SELECT objectid, extendedness, ra, dec, mag_i, mag_g, mag_r from {schema}.dpdd \".format(**locals()) + where + objectcut\n", " # Don't normally print out this query because object_cut can be a very long string. \n", " records = []\n", " with dbconn.cursor() as cursor:\n", " %time cursor.execute(goodobjects_q)\n", " records = cursor.fetchall()\n", " nObj = len(records)\n", " \n", " df = pd.DataFrame(records, columns=['objectid', 'extendedness', 'ra', 'dec', 'mag_i', 'mag_g', 'mag_r'])\n", " print(\"Total: \", nObj)\n", " over400out = open(outpath,'w') \n", " df.to_csv(over400out)\n", " return df\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** For information about dpdd quantities like `clean` and other fields mentioned above, see the [SCHEMA.md](https://github.com/LSSTDESC/gcr-catalogs/blob/master/GCRCatalogs/SCHEMA.md#schema-for-dc2-coadd-catalogs) file of the LSSTDESC/gcr-catalogs repository " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# recreate good objects table\n", "# df = getGoodVisited('some_path', schema)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read in good objects table\n", "retrieve_path = '../tutorials/assets/{}-over400visits.csv'.format(schema)\n", "df = pd.read_csv(retrieve_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Get the data\n", "Query `forcedsource` for an entry and use the returned data to plot light curves. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This object happens to be a star\n", "star_ix = 1045\n", "star_id = df['objectid'][star_ix]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "def getlc(schema, objectid):\n", " q_template = 'select filtername as band,obsstart,mag,magerr from {schema}.forcedsource ' \n", " q_template += 'where objectid={objectid} and forcedsourcevisit_good and not psflux_flag order by filtername,obsstart'\n", " lc_q = q_template.format(**locals())\n", " print(lc_q)\n", " with dbconn.cursor() as cursor:\n", " %time cursor.execute(lc_q)\n", " records = cursor.fetchall()\n", " \n", " df = pd.DataFrame(records, \n", " columns=['filtername', 'obsstart', 'mag', 'magerr'])\n", " #print('Printing i-band data from getlc for object ', objectid)\n", " #iband = df[(df.filtername == 'i')]\n", " #for ival in list(iband['mag']): print(ival)\n", " \n", " return df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "star_data = getlc(schema, star_id)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "star_data.head(star_data.shape[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Plotting" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from astropy.time import Time\n", "def plot_band_lc(axes, times, mags, params):\n", " j_times = []\n", " for t in times:\n", " tobj = Time(str(t).replace(' ', 'T'))\n", " tobj.format = 'jd'\n", " j_times.append(tobj.value)\n", " out = axes.scatter(np.asarray(j_times), np.asarray(mags), **params)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_level(axes, yvalue, params):\n", " xmin, xmax = axes.get_xlim()\n", " out = axes.plot(np.asarray([xmin, xmax]), np.asarray([yvalue, yvalue]), **params)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def format_title(df, ix):\n", " ra = df['ra'][ix]\n", " dec = df['dec'][ix]\n", " oid = df['objectid'][ix]\n", " if df['extendedness'][ix] > 0.5:\n", " prefix = 'Extended object'\n", " else:\n", " prefix = 'Star '\n", " return '{prefix} light curve for object {oid} at ra={ra}, dec={dec}'.format(**locals())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_object(title, the_data, coadd_mag):\n", " '''\n", " Plot r, g and i light 'curves' (as scatter plot) for an object. Also plot coadd magnitudes for each band. \n", " Parameters\n", " -----------\n", " title : string\n", " the_data : data frame which must include columns filtername, obsstart, mag\n", " coadd_mag : dict associating magnitude from coadd with filtername\n", " '''\n", " good_d = the_data[(np.isnan(the_data.mag)==False)]\n", " red_d = good_d[(good_d.filtername==\"r\")]\n", " green_d = good_d[(good_d.filtername==\"g\")]\n", " i_d = good_d[(good_d.filtername==\"i\")]\n", " #print(\"red data shape: \", red_e.shape, \" green data shape: \", green_e.shape, \" i data shape: \", i_e.shape)\n", " fix, axes = plt.subplots(figsize=(12,8))\n", "\n", " plt.title(title)\n", " plt.xlabel('Julian date')\n", " plt.ylabel('Magnitude')\n", "\n", " params_r = {'marker' : 'o', 'label' : 'r band', 'color' : 'red'}\n", " #print('In plot_object printing i-band values')\n", " #for ival in list(i_d['mag']): print(ival)\n", " plot_band_lc(axes, list(red_d['obsstart']), list(red_d['mag']), params_r)\n", " params_g = {'marker' : 'o', 'label' : 'g band', 'color' : 'green'}\n", " plot_band_lc(axes, list(green_d['obsstart']), list(green_d['mag']), params_g)\n", " params_i = {'marker' : 'o', 'label' : 'i band', 'color' : 'orange'}\n", " plot_band_lc(axes, list(i_d['obsstart']), list(i_d['mag']), params_i)\n", " plot_level(axes, coadd_mag['r'], {'label' : 'r coadd mag', 'color' : 'red'})\n", " plot_level(axes, coadd_mag['g'], {'label' : 'g coadd mag', 'color' : 'green'})\n", " plot_level(axes, coadd_mag['i'], {'label' : 'i coadd mag', 'color' : 'orange'})\n", " plt.legend()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "oid = df['objectid'][star_ix]\n", "title = format_title(df,star_ix)\n", "coadd_mag = {'r' : df['mag_r'][star_ix], 'g' : df['mag_g'][star_ix], 'i' : df['mag_i'][star_ix]}\n", "the_data = getlc('run12p_v4', oid)\n", "plot_object(title, the_data, coadd_mag)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot light curves for other objects chosen at random. \n", "for ix in [2, 13, 14]:\n", " oid = df['objectid'][ix]\n", " title = format_title(df,ix)\n", " the_data = getlc('run12p_v4', oid)\n", " coadd_mag = {'r' : df['mag_r'][ix], 'g' : df['mag_g'][ix], 'i' : df['mag_i'][ix]}\n", " plot_object(title, the_data, coadd_mag)" ] } ], "metadata": { "kernelspec": { "display_name": "desc-python", "language": "python", "name": "desc-python" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }