{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Accessing DC2 truth and simulated observations data in PostgreSQL at NERSC"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Owner: **Joanne Bogart [@jrbogart](https://github.com/LSSTDESC/DC2-analysis/issues/new?body=@jrbogart)**  \n",
    "Last Verified to Run: **2020-11-24**\n",
    "\n",
    "This notebook demonstrates access to truth catalog data via the PostgreSQL database at NERSC.  Currently (May 1, 2020) two kinds of truth catalogs are available: star truth and supernova truth.  This notebook will concentrate on star truth. The minion observation database is also available, as well as the Run2.2i dr6 object catalog (dpdd columns approximating specification in [LSE-163](https://lse-163.lsst.io/) only).\n",
    "\n",
    "__Learning objectives__:\n",
    "\n",
    "After going through this notebook, you should be able to:\n",
    "  1. Find out what star truth and simulated observation information is available and query it.\n",
    "  4. Make use of standard tools to, e.g., plot light curves\n",
    "\n",
    "__Logistics__: This notebook is intended to be run through the JupyterHub NERSC interface available here: https://jupyter.nersc.gov. To setup your NERSC environment, please follow the instructions available here: \n",
    "https://confluence.slac.stanford.edu/display/LSSTDESC/Using+Jupyter+at+NERSC\n",
    "### Prerequisites\n",
    "* See [Getting Started with PostgreSQL at NERSC](https://confluence.slac.stanford.edu/x/s4joE), especially the \"Preliminaries\" section\n",
    "* Some minimal acquaintance with SQL is helpful. See the \"SQL Primer\" section of the above document\n",
    "\n",
    "### Conventions\n",
    "* SQL keywords have been written in ALL CAPS only to make them stand out in queries. (The database server ignores case in queries for keywords, column names and table names.)\n",
    "  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import psycopg2\n",
    "import numpy as np\n",
    "%matplotlib inline \n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make the db connection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dbname = 'desc_dc2_drp'\n",
    "dbuser = 'desc_dc2_drp_user'\n",
    "dbhost = 'nerscdb03.nersc.gov'\n",
    "dbconfig = {'dbname' : dbname, 'user' : dbuser, 'host' : dbhost}\n",
    "dbconn = psycopg2.connect(**dbconfig)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "obs_schema = 'minion_test'\n",
    "truth_schema = 'star_truth'\n",
    "object_schema = 'run22i_dr6_wfd_v1'\n",
    "#truth_schema = 'sne_truth'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initial steps in this notebook can be run for supernova truth instead of star truth\n",
    "just by changing the value for `truth_schema` above, but the section labeled **Sample Query** \n",
    "and the light curve queries need more adjustments."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Convenience utilities, defined here so as not to clutter\n",
    "up the main line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_schema_tables(conn, schema):\n",
    "    '''\n",
    "    Returns 1-d numpy array of table names\n",
    "    '''\n",
    "    q = f\"\"\"SELECT DISTINCT table_name FROM information_schema.columns \n",
    "         WHERE table_schema='{schema}' \n",
    "         ORDER BY table_name\"\"\"\n",
    "    with conn.cursor() as cursor:\n",
    "        cursor.execute(q)\n",
    "        records = cursor.fetchall()\n",
    "    tables = np.array([r[0] for r in records])\n",
    "    return tables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_table_columns(conn, schema, table):\n",
    "    '''\n",
    "    Returns a pandas dataframe with columns column_name, datatype\n",
    "    '''\n",
    "    q = f\"\"\"SELECT column_name, data_type FROM information_schema.columns \n",
    "        WHERE table_schema='{schema}' AND table_name='{table}' \n",
    "        ORDER BY column_name\"\"\"\n",
    "    with conn.cursor() as cursor:\n",
    "        cursor.execute(q)\n",
    "        records = cursor.fetchall()\n",
    "    print(len(records))\n",
    "    df = pd.DataFrame(records, columns=['column_name', 'data_type'])\n",
    "    return df\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another convenience routine.\n",
    "It makes use of the public function `conesearch` and provides a place to document its arguments.\n",
    "See also a [more comprehensive list](https://github.com/LSSTDESC/DC2-PostgreSQL/blob/master/postgres-objcatalog/README_functions.md) of such functions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def format_cone_search(coord_column, ra, dec, radius):\n",
    "    '''\n",
    "    Parameters\n",
    "    coord_column:  name of column of type earth in the table\n",
    "    ra:  ra value at center of cone (degrees)\n",
    "    dec:  dec value at center of cone (degrees)\n",
    "    radius: radius of cone (arcseconds)\n",
    "    \n",
    "    Returns\n",
    "    Condition to be inserted into WHERE clause for the query\n",
    "    '''\n",
    "    cond = f\"\"\"conesearch({coord_column},'{ra}','{dec}','{radius}')\"\"\"\n",
    "    return cond"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display all tables belonging to the schema. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "schemas = [truth_schema, obs_schema]\n",
    "for s in schemas:\n",
    "    tables = get_schema_tables(dbconn, s)\n",
    "    print(f\"\\nTables for schema {s}:\")\n",
    "    for t in tables: \n",
    "        print(t)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display all columns belonging to a couple tables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "truth_summary_info = get_table_columns(dbconn, truth_schema, 'truth_summary' )\n",
    "truth_summary_info.style.set_properties(**{'text-align': 'left'})\n",
    "truth_summary_info"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stellar_variability_info = get_table_columns(dbconn, truth_schema, 'stellar_variability_truth' )\n",
    "stellar_variability_info.style.set_properties(**{'text-align': 'left'})\n",
    "stellar_variability_info"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The most generally useful table in the obs database is `summary`.  Note ra and dec are stored in radians in this table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "obs_summary_info = get_table_columns(dbconn, obs_schema, 'summary' )\n",
    "obs_summary_info.style.set_properties(subset=[\"column_name\", \"data_type\"], **{'text-align': 'right'})\n",
    "obs_summary_info"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sample Query\n",
    "Find delta flux readings for stars which are expected to be in the field of view for a particular visit. This sort of query returns practically instantly. In the `stellar_variability_truth` both of the columns mentioned in the `WHERE` clause - `id` and `obshistid` - are indexed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "obshistid = 731791\n",
    "table = 'stellar_variability_truth'\n",
    "ids = [\"'835183'\", \"'31303590103'\",\"'31102013522'\",\"'31303588649'\", \"'30317268917'\",\n",
    "       \"'30825472052'\",\"'835279'\",\"'31102039372'\",\"'30825477672'\",\"'31102046245'\", \n",
    "       \"'30321363109'\",\"'31102051190'\",\"'31102061342'\",\"'30321363877'\",\"'31102061079'\",\n",
    "       \"'31411663457'\", \"'31107813412'\"]\n",
    "id_list = \",\".join(ids)\n",
    "\n",
    "flux_q = f\"\"\"SELECT id, delta_flux FROM {truth_schema}.{table} \n",
    "         WHERE (obshistid={obshistid})\n",
    "         AND id IN ({id_list});\"\"\"\n",
    "print(flux_q)\n",
    "with dbconn.cursor() as cursor:\n",
    "    cursor.execute(flux_q)\n",
    "    f_records = cursor.fetchall()\n",
    "df_flux = pd.DataFrame(f_records, columns=['id', 'delta_flux']) \n",
    "df_flux"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Area searches\n",
    "\n",
    "The column `coord` in table `truth_summary` and columns with \"coord\" in their names in the observation summary table are of a special type `earth`.  The value is a triple of double precision numbers describing the position on a unit sphere corresponding to `ra` and `dec`. Indexes have been defined on these columns to speed up area searches."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The observation with obshistid=14 has desc dithered ra, dec = (1.69831745204, -0.59856) *in radians*. Find all observations within a radius (expressed in arcseconds). \n",
    "\n",
    "**Warning:** All ra, dec in the observation summary table are in radians. ra and dec in `truth_summary` are in degrees. `format_cone_search` needs ra, dec in degrees."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "radius = 500\n",
    "ra_deg = np.degrees((1.69831745204,))[0]\n",
    "dec_deg = np.degrees((-0.59856,))[0]\n",
    "dec = -0.59856\n",
    "cond = format_cone_search('descditheredcoord', ra_deg, dec_deg, radius)\n",
    "obs_query = f\"\"\"SELECT obshistid,descditheredra,descdithereddec FROM {obs_schema}.summary\n",
    "            WHERE {cond}\"\"\"\n",
    "# Uncomment the following to confirm that the query looks reasonable\n",
    "#print(obs_query)\n",
    "with dbconn.cursor() as cursor:\n",
    "    %time cursor.execute(obs_query)\n",
    "    records = cursor.fetchall()\n",
    "    df_obs = pd.DataFrame(records, columns=['obshistid', 'ra_radians', 'dec_radians'])\n",
    "df_obs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Light curves\n",
    "Find length of light curves for variable stars near a particular location"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# pick a location that probably gets lots of visits\n",
    "# for (70.0, -30.0, 80)  get 3; for (60.0, -30.0, 80) get 1\n",
    "ra = 60.0       # 70.0     \n",
    "dec = -30.0\n",
    "radius = 150    \n",
    "tbl_spec = f\"\"\"SELECT S.id, S.ra, S.dec, max(abs(V.delta_flux)),count(V.bandpass) AS visit_count \n",
    "           FROM {truth_schema}.truth_summary AS S JOIN \n",
    "           {truth_schema}.stellar_variability_truth AS V ON S.id=V.id \"\"\"\n",
    "where = \"WHERE \" + format_cone_search('S.coord', ra, dec, radius) + \" AND S.is_variable=1 \"\n",
    "group_by = \" GROUP BY S.id,S.ra,S.dec\"\n",
    "q = tbl_spec + where + group_by\n",
    "\n",
    "# This takes a couple minutes to complete\n",
    "with dbconn.cursor() as cursor:\n",
    "    %time cursor.execute(q)\n",
    "    records = cursor.fetchall()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_lengths = pd.DataFrame(records, columns=['id', 'ra','dec', 'max_delta_flux','count'])\n",
    "df_lengths"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similar to above, but this time don't count visits. Get the delta_flux values instead"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ra = 70.0     \n",
    "dec = -30.0\n",
    "radius = 80 \n",
    "columns = ['S.id', 'ra', 'dec', 'bandpass', 'delta_flux']\n",
    "col_list = (',').join(columns)\n",
    "tbl_spec = f\"\"\"SELECT {col_list} \n",
    "           FROM {truth_schema}.truth_summary AS S JOIN \n",
    "           {truth_schema}.stellar_variability_truth AS V ON S.id=V.id \"\"\"\n",
    "where = \"WHERE \" + format_cone_search('S.coord', ra, dec, radius) + \" and S.is_variable=1 \"\n",
    "q = tbl_spec + where\n",
    "\n",
    "with dbconn.cursor() as cursor:\n",
    "    %time cursor.execute(q)\n",
    "    records_lc = cursor.fetchall()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_cone_lcs = pd.DataFrame(records_lc, columns=columns)\n",
    "df_cone_lcs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plot light curves for one star\n",
    "Pick the second object from the results of the first query in this section (id=1568931714) since max_delta_flux is large\n",
    "#### Get the data\n",
    "Get delta_flux and time values for the plot and some summary information about the star. Use `ORDER BY` clause so that data are presented conveniently for plotting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "id = 1568931714\n",
    "var_tbl = 'stellar_variability_truth'\n",
    "lc_q = f\"\"\"SELECT bandpass,mjd,delta_flux FROM {truth_schema}.{var_tbl}\n",
    "       WHERE id='{id}' ORDER BY bandpass, mjd;\"\"\"\n",
    "print(lc_q)\n",
    "with dbconn.cursor() as cursor:\n",
    "    %time cursor.execute(lc_q)\n",
    "    lc_records = cursor.fetchall()\n",
    "print(len(lc_records))\n",
    "df_single_lc = pd.DataFrame(lc_records, columns=['bandpass','mjd','delta_flux'])\n",
    "df_single_lc\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sum_tbl = 'truth_summary'\n",
    "sum_fluxes = ','.join([f\"flux_{b}\" for b in ['g', 'i', 'r', 'u', 'y', 'z']])\n",
    "sum_q = f\"\"\"SELECT ra,dec,{sum_fluxes} FROM {truth_schema}.{sum_tbl} \n",
    "         WHERE id='{id}';\"\"\"\n",
    "print(sum_q)\n",
    "with dbconn.cursor() as cursor:\n",
    "    %time cursor.execute(sum_q)\n",
    "    sum_record = cursor.fetchone()\n",
    "lc_ra = sum_record[0]\n",
    "lc_dec = sum_record[1]\n",
    "print(f'ra={lc_ra}, dec={lc_dec}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Plotting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from astropy.time import Time\n",
    "def plot_band_lc(axes, times, fluxes, params):\n",
    "    out = axes.scatter(np.asarray(times), np.asarray(fluxes), **params)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_level(axes, yvalue, params):\n",
    "    xmin, xmax = axes.get_xlim()\n",
    "    out = axes.plot(np.asarray([xmin, xmax]), np.asarray([yvalue, yvalue]), **params)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def format_title(id, ra, dec, band=None):  \n",
    "    if band is None:\n",
    "        return f'Per-band light curves for star {id} at (ra,dec)=({ra}, {dec})'\n",
    "    else:\n",
    "        return f'Light curve for star {id}, band={band} at (ra,dec)=({ra}, {dec})'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_object(title, the_data, band=None):\n",
    "    '''\n",
    "    Plot r, g and i light 'curves' (delta_flux as scatter plot) for an object\n",
    "    or plot only requested band\n",
    "    Parameters\n",
    "    -----------\n",
    "    title : string\n",
    "    the_data : data frame which must include columns filtername, obsstart, mag\n",
    "    '''\n",
    "    good_d = the_data[(np.isnan(the_data.delta_flux)==False)]\n",
    "    red_d = good_d[(good_d.bandpass==\"r\")]\n",
    "    green_d = good_d[(good_d.bandpass==\"g\")]\n",
    "    i_d = good_d[(good_d.bandpass==\"i\")]\n",
    "    #print(\"red data shape: \", red_e.shape, \"   green data shape: \", green_e.shape, \"  i data shape: \", i_e.shape)\n",
    "    fix, axes = plt.subplots(figsize=(12,8))\n",
    "\n",
    "    plt.title(title)\n",
    "    plt.xlabel('Julian date')\n",
    "    plt.ylabel('Delta flux')\n",
    "\n",
    "    params_r = {'marker' : 'o', 'label' : 'r band', 'color' : 'red'}\n",
    "    params_g = {'marker' : 'o', 'label' : 'g band', 'color' : 'green'}\n",
    "    params_i = {'marker' : 'o', 'label' : 'i band', 'color' : 'orange'}\n",
    "    #print('In plot_object printing i-band values')\n",
    "    #for ival in list(i_d['mag']): print(ival)\n",
    "    if band is None or band=='r':\n",
    "        plot_band_lc(axes, list(red_d['mjd']), list(red_d['delta_flux']), params_r)\n",
    "    if band is None or band=='g':\n",
    "        plot_band_lc(axes, list(green_d['mjd']), list(green_d['delta_flux']), params_g)\n",
    "    if band is None or band=='i':\n",
    "        plot_band_lc(axes, list(i_d['mjd']), list(i_d['delta_flux']), params_i)\n",
    "    #plot_level(axes, coadd_mag['r'], {'label' : 'r coadd mag', 'color' : 'red'})\n",
    "    #plot_level(axes, coadd_mag['g'], {'label' : 'g coadd mag', 'color' : 'green'})\n",
    "    #plot_level(axes, coadd_mag['i'], {'label' : 'i coadd mag', 'color' : 'orange'})\n",
    "    if band is None:\n",
    "        plt.legend()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for band in ('r','g','i'):\n",
    "    title = format_title(id, lc_ra, lc_dec, band)\n",
    "    plot_object(title, df_single_lc, band)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Is it in the object table?\n",
    "First get truth information for our chosen star, then try to find a match, restricting to point sources within a few arcseconds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "truth_q = f\"SELECT ra,dec,flux_g,flux_r,flux_i from {truth_schema}.truth_summary where id='{id}'\"\n",
    "print(truth_q)\n",
    "with dbconn.cursor() as cursor:\n",
    "    %time cursor.execute(truth_q)\n",
    "    truth_records = cursor.fetchall()\n",
    "#truth_records\n",
    "print(len(truth_records))\n",
    "truth_df = None\n",
    "truth_df = pd.DataFrame(truth_records, columns=['ra', 'dec', 'flux_g', 'flux_r', 'flux_i'])\n",
    "truth_df.shape\n",
    "truth_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cols = ['objectid','ra', 'dec', 'psflux_g','psflux_r', 'psflux_i']\n",
    "col_spec = ','.join(cols)\n",
    "obj_q = f\"\"\"SELECT {col_spec} from {object_schema}.dpdd_object  WHERE extendedness < 0.5 AND \"\"\"\n",
    "radius = 20   # in arcseconds\n",
    "obj_q += format_cone_search('coord', lc_ra, lc_dec, radius) \n",
    "print(obj_q)\n",
    "with dbconn.cursor() as cursor:\n",
    "    %time cursor.execute(obj_q)\n",
    "    obj_records = cursor.fetchall()\n",
    "obj_df = pd.DataFrame(obj_records, columns=cols)\n",
    "obj_df"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "desc-python",
   "language": "python",
   "name": "desc-python"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}