{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# An Inventory of the Shared Datasets in the LSST Science Platform\n", "
Owner(s): **Phil Marshall** ([@drphilmarshall](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@drphilmarshall)), **Rob Morgan** ([@rmorgan10](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@rmorgan10))\n", "
Last Verified to Run: **2019-08-13**\n", "
Verified Stack Release: **18.1**\n", "\n", "In this notebook we'll take a look at some of the datasets available on the LSST Science Platform. \n", "\n", "### Learning Objectives:\n", "\n", "After working through this tutorial you should be able to: \n", "1. Start figuring out which of the available datasets is going to be of most use to you in any given project; \n", "\n", "When it is finished, you should be able to use the `stackclub.Taster` to:\n", "2. Report on the available data in a given dataset;\n", "3. Plot the patches and tracts in a given dataset on the sky.\n", "\n", "**Outstanding Issue:** The `Taster` augments the functionality of the Gen-2 butler, which provides limited capabilities for discovering what data *actually* exist. Specifically, the `Taster` relies heavily on the `queryMetadata` functionality of the Gen-2 butler, which is limited to a small number of datasets and does not actually guarantee that those datasets exist. The user should beware of over-interpreting the true *existence* of datasets queried by the `Taster`. This should be improved greatly with the Gen-3 butler.\n", "\n", "### Logistics\n", "This notebook is intended to be runnable on `lsst-lsp-stable.ncsa.illinois.edu` from a local git clone of https://github.com/LSSTScienceCollaborations/StackClub.\n", "\n", "## Set-up" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll need the `stackclub` package to be installed. If you are not developing this package, you can install it using `pip`, like this:\n", "```\n", "pip install git+git://github.com/LSSTScienceCollaborations/StackClub.git#egg=stackclub\n", "```\n", "If you are developing the `stackclub` package (e.g. by adding modules to it to support the Stack Club tutorial that you are writing), you'll need to make a local, editable installation. In the top level folder of the `StackClub` repo, do:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! cd .. 
&& python setup.py -q develop --user && cd -" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may need to restart the kernel after doing this. When editing the `stackclub` package files, we want the latest version to be imported when we re-run the import command. To enable this, we need the `%autoreload` magic command." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "To just get a taste of the data that the Butler will deliver for a chosen dataset, we have added a `Taster` class to the `stackclub` library. All needed imports are contained in that module, so we only need to import the `stackclub` library to work through this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "%matplotlib inline\n", "\n", "import stackclub" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can find the Stack version that this notebook is running on by using `eups list -s` on the terminal command line:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# What version of the Stack am I using?\n", "! echo $HOSTNAME\n", "! eups list -s lsst_distrib" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Listing the Available Datasets\n", "First, let's look at what is currently available. There are several shared data folders in the LSP: the read-only `/datasets` folder, the project-group-writeable folder `/project/shared/data`, and the Stack Club shared directory `/project/stack-club`. Let's take a look at what's in `/project/shared/data`. 
Specifically, we want to see butler-friendly data _repositories_, distinguished by their containing a file called `_mapper` or `repositoryCfg.yaml` in their top level." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**`/project/shared/data`:** These datasets are designed to be small test sets, ideal for tutorials." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "shared_repos_with_mappers = ! ls -d /project/shared/data/*/_mapper | grep -v README | cut -d'/' -f1-5 | sort | uniq\n", "shared_repos_with_yaml_files = ! ls -d /project/shared/data/*/repositoryCfg.yaml | grep -v README | cut -d'/' -f1-5 | sort | uniq\n", "shared_repos = np.unique(shared_repos_with_mappers + shared_repos_with_yaml_files)\n", "\n", "shared_repos" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for repo in shared_repos:\n", "    ! du -sh $repo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**`/datasets`:**\n", "These are typically much bigger: to measure the size, uncomment the second cell below and edit it to target the dataset you are interested in. Running `du` on all folders takes several minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "repos_with_mappers = ! ls -d /datasets/*/repo/_mapper |& grep -v \"No such\" | cut -d'/' -f1-4 | sort | uniq\n", "repos_with_yaml_files = ! ls -d /datasets/*/repo/repositoryCfg.yaml |& grep -v \"No such\" | cut -d'/' -f1-4 | sort | uniq\n", "repos = np.unique(repos_with_mappers + repos_with_yaml_files)\n", "\n", "repos" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "for repo in repos:\n", "    ! 
du -sh $repo\n", "\"\"\";" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring the Data Repo with the Stack Club `Taster`\n", "\n", "The `stackclub` library provides a `Taster` class to explore the datasets in a given repo. As an example, let's take a look at some HSC data using the `Taster`. When instantiating the `Taster`, if you plan to use it for visualizing sky coverage, you can provide it with a path to the tracts from the main repo.\n", "\n", "### Initializing the `Taster`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Parent repo\n", "repo = '/datasets/hsc/repo/'\n", "\n", "# Location of tracts for a particular rerun and depth relative to main repo\n", "rerun = 'DM-13666' # DM-13666, DM-10404 \n", "depth = 'WIDE' # WIDE, DEEP, UDEEP\n", "tract_location = 'rerun/' + rerun + '/' + depth" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Execute one of the following two cells. The second will make `tarquin` aware of the tracts for the dataset, while the first will just look at the repo as a whole and not visualize any sky area." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin = stackclub.Taster(repo, vb=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin = stackclub.Taster(repo, vb=True, path_to_tracts=tract_location)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Properties of the `Taster`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The taster, `tarquin`, carries a butler around with it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(tarquin.butler)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we ask the taster to investigate a folder that is not a repo, its butler will be `None`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "failed = stackclub.Taster('not-a-repo', vb=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(failed.butler)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The taster uses its butler to query the metadata of the repo for datasets, skymaps, etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.look_for_datasets_of_type(['raw', 'calexp', 'deepCoadd_calexp', 'deepCoadd_mergeDet'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **PROBLEM: these last two datatypes are not listed in the repo metadata. This is one of the issues with the Gen-2 butler, and the `Taster` is not smart enough to search the tract folders for catalog files. This should be updated with Gen-3.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.look_for_skymap()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `what_exists` method searches for everything \"interesting\". 
In the `taster.py` module, \"interesting\" currently consists of:\n", "* `'raw'`\n", "* `'calexp'` \n", "* `'src'`\n", "* `'deepCoadd_calexp'`\n", "* `'deepCoadd_meas'` \n", "\n", "but this method can easily be updated to include more dataset types." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.what_exists()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you wish to check the existence of all dataset types, you can use the `all` parameter of the `what_exists()` method to do exactly that. Checking all dataset types may take a minute or so (while the `Taster` does a lot of database queries)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.what_exists(all=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A dictionary with existence information is stored in the `exists` attribute:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.exists" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Taster` can report on the data available, counting the number of visits, sources, etc., according to what's in the repo. It uses methods like this one:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.estimate_sky_area()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and this one:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.count_things()\n", "\n", "print(tarquin.counts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When the `estimate_sky_area` method runs, `tarquin` collects all the tracts associated with the repo. A list of the tracts is stored in the attribute `tarquin.tracts`." 
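] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Aside: a pure-Python sketch of the solid-angle arithmetic behind a\n", "# sky-area estimate like `estimate_sky_area` (this is not the Taster's\n", "# actual implementation). For a region spanning [ra0, ra1] x [dec0, dec1]\n", "# in degrees, the area in square degrees is the RA width times the\n", "# difference in sin(dec), rescaled by 180/pi. The bounds below are\n", "# hypothetical, not taken from any real tract.\n", "import math\n", "\n", "ra0, ra1 = 30.0, 31.5      # hypothetical bounds, degrees\n", "dec0, dec1 = -5.0, -3.5\n", "area_sq_deg = (ra1 - ra0) * (math.sin(math.radians(dec1)) - math.sin(math.radians(dec0))) * math.degrees(1.0)\n", "print(area_sq_deg)"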
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.tracts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the tracts, we can get a rough estimate for what parts of the sky have been targeted in the dataset. The method for doing this is `tarquin.plot_sky_coverage`; it follows the example code given in [Exploring_A_Data_Repo.ipynb](Exploring_A_Data_Repo.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.plot_sky_coverage()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To have your `Taster` do all the above, and just report on what it finds, do:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.report()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are interested in learning which fields, filters, visits, etc. have been counted by `tarquin`, remember that `tarquin` carries an instance of the `Butler` with it, so you can run typical `Butler` methods. For example, if you find it odd that the number of filters is 13, you can list the filters like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.butler.queryMetadata('calexp', ['filter'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more on the `Taster`'s methods, do, for example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# help(tarquin)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example Tastings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's compare the WIDE, DEEP and UDEEP parts of the HSC dataset." 
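] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Aside: a sketch of building the per-depth rerun paths with os.path.join\n", "# rather than string concatenation. The rerun and depth names are the ones\n", "# used throughout this notebook; `tract_locations` is a hypothetical name\n", "# introduced here for illustration only.\n", "import os\n", "\n", "rerun = 'DM-13666'\n", "tract_locations = {depth: os.path.join('rerun', rerun, depth)\n", "                   for depth in ['WIDE', 'DEEP', 'UDEEP']}\n", "print(tract_locations)"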
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "repo = '/datasets/hsc/repo/'\n", "rerun = 'DM-13666'\n", "\n", "for depth in ['WIDE', 'DEEP', 'UDEEP']:\n", "    tract_location = 'rerun/' + rerun + '/' + depth\n", "\n", "    taster = stackclub.Taster(repo, path_to_tracts=tract_location)\n", "    taster.report()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may notice that all **Metadata Characteristics** beginning with \"Number of\" are the same for the three depths. This is a result of the `Taster`'s `Butler` getting this information from the repo as a whole, rather than from the specific depth we specified for the tracts. There is more information on why the `Butler` works in this way in the [Exploring_A_Data_Repo.ipynb](https://github.com/LSSTScienceCollaborations/StackClub/blob/project/data_inventory/drphilmarshall/Basics/Exploring_A_Data_Repo.ipynb) notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "In this notebook we took a first look at the datasets available to us in two shared directories in the LSST Science Platform filesystem, and used the `stackclub.Taster` class to report on their basic properties and their sky coverage. 
Details on the methods used by the `Taster` can be found in the [Exploring_A_Data_Repo.ipynb](https://github.com/LSSTScienceCollaborations/StackClub/blob/project/data_inventory/drphilmarshall/Basics/Exploring_A_Data_Repo.ipynb) notebook, or by executing the following cell:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help(tarquin)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# STILL TODO\n", "* Build defensiveness into the `Taster` so that it can handle a wider variety of datasets.\n", "* Update the `Taster` to use the Gen-3 butler.\n", "\n", "### Looking at other shared datasets and repos\n", "\n", "The following loops over all shared datasets fail in interesting ways: some folders don't seem to be `Butler`-friendly. We need to do a bit more work to identify the actual repos available to us, and then use the `Taster` to provide a guide to all of them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for repo in shared_repos:\n", "    try:\n", "        taster = stackclub.Taster(repo)\n", "        taster.report()\n", "    except Exception as e:\n", "        print(\"Taster failed to explore repo\", repo, \":\", e)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for repo in repos:\n", "    try:\n", "        taster = stackclub.Taster(repo)\n", "        taster.report()\n", "    except Exception as e:\n", "        print(\"Taster failed to explore repo\", repo, \":\", e)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "LSST", "language": "python", "name": "lsst" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" }, "livereveal": { "scroll": true, "start_slideshow_at": "selected" } }, "nbformat": 4, "nbformat_minor": 2 }