{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Plot N versus z Distributions in the extragalactic catalogs\n", "\n", "
Owner: **Eve Kovacs** ([@evevkovacs](https://github.com/LSSTDESC/DC2-analysis/issues/new?body=@evevkovacs))\n", "
Last Verified to Run: Nov 30, 2018 by @yymao\n", "\n", "This notebook demonstrates how to make number-density versus redshift plots from the extragalactic catalogs (protoDC2, cosmoDC2).\n", "\n", "### Learning Objectives:\n", "After working through and studying this notebook, you should be able to:\n", "1. Select data quantities to plot\n", "2. Define and apply filters to the data selection\n", "3. Use the GCRCatalogs iterator for large catalogs\n", "4. Aggregate histograms on the fly\n", "5. Loop over sub-plots [Advanced]\n", "6. Use a private version of GCRCatalogs [Advanced]\n", "\n", "The functions in this notebook have been selected from the [NumberDensityVersusRedshift](https://github.com/LSSTDESC/descqa/blob/master/descqa/NumberDensityVersusRedshift.py) DESCQA test." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from itertools import zip_longest\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from GCR import GCRQuery\n", "import GCRCatalogs\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the catalog using GCRCatalogs. Here we use the 'cosmoDC2_v1.1.4_small' configuration, a small-sky-area subset of cosmoDC2 that loads quickly and fits comfortably into memory, which makes it well suited to a tutorial. The techniques demonstrated below are designed to work on the full-size catalogs as well." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gc = GCRCatalogs.load_catalog('cosmoDC2_v1.1.4_small')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a function to fetch the catalog data for the desired redshift quantity and the magnitude in a specified band. The data will be filtered to lie between selected redshift bounds. The magnitude may come from any of several photometric systems; the first available one, in the given order of preference, is used.\n", "\n", "This function uses a GCR iterator so that data from a very large catalog can be processed. This is not a problem for the small catalog used here, but it matters for the full cosmoDC2, which is too large to fit into memory. The iterator returns the catalog data in chunks. Histogram arrays to store the results therefore have to be pre-allocated and accumulated on the fly, chunk by chunk.\n", "\n", "Note that because we are accumulating the histograms as we process the data, this function needs arguments that define the binning and the desired shape of the sub-plot array.\n", "\n", "We also accumulate sumz_array so that the mean of z can later be computed for each bin."
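, "\n", "Before diving into the full function, here is a minimal sketch of the accumulation pattern in isolation, assuming only that the catalog provides the 'redshift_true' quantity (the full function below follows the same pattern, but keeps one histogram per magnitude cut):\n", "\n", "```python\n", "# Minimal sketch of chunked histogram accumulation with the GCR iterator:\n", "# the accumulator is pre-allocated with a fixed binning, and each chunk\n", "# returned by the iterator only adds to it, so the full column is never\n", "# held in memory at once.\n", "zbins = np.linspace(0., 1.1, 21)              # fixed binning shared by all chunks\n", "counts = np.zeros(len(zbins) - 1, dtype=int)  # pre-allocated accumulator\n", "for chunk in gc.get_quantities(['redshift_true'], return_iterator=True):\n", "    counts += np.histogram(chunk['redshift_true'], bins=zbins)[0]\n", "```"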
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_mags_and_redshift(gc, mag_lo, mag_hi, band='r', z='redshift_true', zlo=0., zhi=1.1, \n", " nrows=3, ncolumns=2, Nzbins=20):\n", " # Set up quantities to fetch\n", " possible_mag_fields = ('mag_true_{}_lsst', 'mag_true_{}_sdss', 'mag_true_{}_des')\n", " mag_fields = [f.format(band) for f in possible_mag_fields]\n", " mag_field = gc.first_available(*mag_fields)\n", " required_quantities = [mag_field, z]\n", " # Set up binning \n", " zbins = np.linspace(zlo, zhi, Nzbins+1)\n", " filters = [(lambda z: (z > zlo) & (z < zhi), z)] # Filter on selected redshift range\n", " \n", " # Initialize arrays for storing histogram sums\n", " N_array = np.zeros((nrows, ncolumns, len(zbins)-1), dtype=np.int)\n", " sumz_array = np.zeros((nrows, ncolumns, len(zbins)-1))\n", " # Get catalog data by looping over data iterator (needed for large catalogs) and aggregate histograms\n", " for catalog_data in gc.get_quantities(required_quantities, filters=filters, return_iterator=True):\n", " catalog_data = GCRQuery(*((np.isfinite, col) for col in catalog_data)).filter(catalog_data)\n", " for n, (cut_lo, cut_hi, N, sumz) in enumerate(zip_longest(\n", " mag_lo,\n", " mag_hi,\n", " N_array.reshape(-1, N_array.shape[-1]), # Flatten all but last dimension of array\n", " sumz_array.reshape(-1, sumz_array.shape[-1]),\n", " )):\n", " if cut_lo is None or cut_hi is None:\n", " continue\n", " cuts = [\n", " '{} < {}'.format(mag_field, cut_lo),\n", " '{} >= {}'.format(mag_field, cut_hi),\n", " ]\n", " z_this = catalog_data[z][GCRQuery(*cuts).mask(catalog_data)]\n", "\n", " # Bin catalog_data and accumulate subplot histograms\n", " N += np.histogram(z_this, bins=zbins)[0]\n", " sumz += np.histogram(z_this, bins=zbins, weights=z_this)[0]\n", " \n", " return zbins, N_array, sumz_array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now select some magnitude cuts, choose a 2 column array for the subplots, and call the function to accumulate the histogram arrays." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Select some magnitude cuts and fill the histogram arrays for 2-column plots\n", "mlo = 25\n", "mhi = 19\n", "\n", "mag_lo = 1 + np.arange(mhi, mlo, dtype=np.float)\n", "mag_hi = mhi + np.zeros_like(mag_lo)\n", "\n", "ncolumns = 2\n", "nrows = (len(mag_lo) + ncolumns - 1)//ncolumns\n", "\n", "Nzbins = 10 \n", "zbins, N_array, sumz_array = get_mags_and_redshift(gc, mag_lo, mag_hi, Nzbins=Nzbins, nrows=nrows, ncolumns=ncolumns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a function to make the plots using the accumulated histogram arrays. Statistical errors only are being used. A more realistic estimate would be provided by jack-knife errors which take into account the sample variance. The code for calculating the jack-knife errors is available in the [DESCQA test](https://github.com/LSSTDESC/descqa/blob/master/descqa/NumberDensityVersusRedshift.py)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_N_vs_z(mag_lo, mag_hi, N_array, sumz_array, zbins, band='r', normed=True,\n", " nrows=3, ncolumns=2, figx_p=9, figy_p=11):\n", " fig, ax = plt.subplots(nrows, ncolumns, figsize=(figx_p, figy_p), sharex='col')\n", " for n, (ax_this, cut_lo, cut_hi, N, sumz) in \\\n", " enumerate(zip_longest(ax.flat,\n", " mag_lo, \n", " mag_hi,\n", " N_array.reshape(-1, N_array.shape[-1]),\n", " sumz_array.reshape(-1, sumz_array.shape[-1]),\n", " )):\n", " if cut_lo is None or cut_hi is None: # cut_lo is None if mag_lo is exhausted\n", " ax_this.set_visible(False)\n", " continue\n", "\n", " cut_label = '{} $\\leq$ {} $<$ {}'.format(cut_hi, band, cut_lo)\n", " meanz = sumz / N\n", " sumN = N.sum()\n", " covariance = np.diag(N)\n", "\n", " if normed:\n", " scale = sumN * (zbins[1:] - zbins[:-1])\n", " N = N/scale\n", " covariance = covariance/np.outer(scale, scale)\n", "\n", " Nerrors = np.sqrt(np.diag(covariance))\n", "\n", " ax_this.errorbar(meanz, N, yerr=Nerrors, label=cut_label, color='blue', fmt='o:', ms=4)\n", " decorate_subplot(ax_this, n, nrows, ncolumns, 'p(z|m)', 'z')\n", "\n", " plt.subplots_adjust(hspace=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def decorate_subplot(ax, nplot, nrows, ncolumns, ylabel, xlabel):\n", " # Add axes and legend \n", " if nplot % ncolumns == 0: # 1st column\n", " ax.set_ylabel('$'+ylabel+'$', size=16)\n", "\n", " if nplot+1 <= nplot - ncolumns: # x scales for last ncol plots only\n", " # Print \"noticks\",nplot \n", " for axlabel in ax.get_xticklabels(): \n", " axlabel.set_visible(False) \n", " # Prevent overlapping yaxis labels \n", " ax.yaxis.get_major_ticks()[0].label1.set_visible(False) \n", " else: \n", " ax.set_xlabel('$'+xlabel+'$', size=16) \n", " for axlabel in ax.get_xticklabels(): \n", " axlabel.set_visible(True) \n", " ax.legend(loc='best', fancybox=True, framealpha=0.5, fontsize=10, numpoints=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_N_vs_z(mag_lo, mag_hi, N_array, sumz_array, zbins, nrows=nrows, ncolumns=ncolumns, normed=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Appendix: Using a development version of GCRCatalogs\n", "In development, the extragalactic team often found it useful to refer to development versions of GCRCatalogs (gcr-catalogs). Here's how one would do this. If one had a private checkout of the GCRCatalogs repository, e.g., one had run\n", "```\n", "git clone https://github/com/LSSTDESC/gcr-catalogs /global/u1/k/kovacs/gcr-catalogs_v4x\n", "```\n", "\n", "Then you can explicitly manipulate your Python path to include this particular version as:\n", "\n", "```\n", "import sys\n", "sys.path.insert(0, '/global/u1/k/kovacs/gcr-catalogs_v4x')\n", "```\n", "\n", "If you had a specific custom configuration file that you had written, called, 'proto-dc2_v4.15_test', you could load that configuration instead with\n", "\n", "```\n", "gc = GCRCatalogs.load_catalog('proto-dc2_v4.15_test')\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "desc-python", "language": "python", "name": "desc-python" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }