{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# An Inventory of the Shared Datasets in the LSST Science Platform\n", "
Owner(s): **Phil Marshall** ([@drphilmarshall](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@drphilmarshall)), **Rob Morgan** ([@rmorgan10](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@rmorgan10))\n", "
Last Verified to Run: **2019-08-13**\n", "
Verified Stack Release: **18.1**\n", "\n", "In this notebook we'll take a look at some of the datasets available on the LSST Science Platform. \n", "\n", "### Learning Objectives:\n", "\n", "After working through this tutorial you should be able to: \n", "1. Start figuring out which of the available datasets is going to be of most use to you in any given project; \n", "\n", "When it is finished, you should be able to use the `stackclub.Taster` to:\n", "2. Report on the available data in a given dataset;\n", "3. Plot the patches and tracts in a given dataset on the sky.\n", "\n", "**Outstanding Issue:** The `Taster` augments the functionality of the Gen-2 butler, which provides limited capabilities for discovering what data *actually* exist. Specifically, the `Taster` relies heavily on the `queryMetadata` functionality of the Gen-2 butler, which is limited to a small number of datasets and does not actually guarantee that those datasets exist. The user should beware of over-interpreting the true *existence* of datasets queried by the `Taster`. This should be improved greatly with the Gen-3 butler.\n", "\n", "### Logistics\n", "This notebook is intended to be runnable on `lsst-lsp-stable.ncsa.illinois.edu` from a local git clone of https://github.com/LSSTScienceCollaborations/StackClub.\n", "\n", "## Set-up" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll need the `stackclub` package to be installed. If you are not developing this package, you can install it using `pip`, like this:\n", "```\n", "pip install git+git://github.com/LSSTScienceCollaborations/StackClub.git#egg=stackclub\n", "```\n", "If you are developing the `stackclub` package (e.g. by adding modules to it to support the Stack Club tutorial that you are writing), you'll need to make a local, editable installation. In the top level folder of the `StackClub` repo, do:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! cd .. 
&& python setup.py -q develop --user && cd -" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may need to restart the kernel after doing this. When editing the `stackclub` package files, we want the latest version to be imported when we re-run the import command. To enable this, we need the `%autoreload` magic command." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "To just get a taste of the data that the Butler will deliver for a chosen dataset, we have added a `Taster` class to the `stackclub` library. All needed imports are contained in that module, so we only need to import the `stackclub` library to work through this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "%matplotlib inline\n", "\n", "import stackclub" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can find the Stack version that this notebook is running on by using `eups list -s` on the terminal command line:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# What version of the Stack am I using?\n", "! echo $HOSTNAME\n", "! eups list -s lsst_distrib" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Listing the Available Datasets\n", "First, let's look at what is currently available. There are several shared data folders in the LSP: the read-only `/datasets` folder, the project-group-writeable folder `/project/shared/data`, and the Stack Club shared directory `/project/stack-club`. Let's take a look at what's in `/project/shared/data`. 
Specifically, we want to see butler-friendly data _repositories_, distinguished by their containing a file called `_mapper` or `repositoryCfg.yaml` in their top level." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**`/project/shared/data`:** These datasets are designed to be small test sets, ideal for tutorials." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "shared_repos_with_mappers = ! ls -d /project/shared/data/*/_mapper | grep -v README | cut -d'/' -f1-5 | sort | uniq\n", "shared_repos_with_yaml_files = ! ls -d /project/shared/data/*/repositoryCfg.yaml | grep -v README | cut -d'/' -f1-5 | sort | uniq\n", "shared_repos = np.unique(shared_repos_with_mappers + shared_repos_with_yaml_files)\n", "\n", "shared_repos" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for repo in shared_repos:\n", "    ! du -sh $repo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**`/datasets`:**\n", "These are typically much bigger: to measure the size, uncomment the second cell below and edit it to target the dataset you are interested in. Running `du` on all folders takes several minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "repos_with_mappers = ! ls -d /datasets/*/repo/_mapper |& grep -v \"No such\" | cut -d'/' -f1-4 | sort | uniq\n", "repos_with_yaml_files = ! ls -d /datasets/*/repo/repositoryCfg.yaml |& grep -v \"No such\" | cut -d'/' -f1-4 | sort | uniq\n", "repos = np.unique(repos_with_mappers + repos_with_yaml_files)\n", "\n", "repos" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "for repo in repos:\n", "    ! 
du -sh $repo\n", "\"\"\";" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring the Data Repo with the Stack Club `Taster`\n", "\n", "The `stackclub` library provides a `Taster` class to explore the datasets in a given repo. As an example, let's take a look at some HSC data using the `Taster`. When instantiating the `Taster`, if you plan to use it for visualizing sky coverage, you can provide it with a path to the tracts from the main repo.\n", "\n", "### Initializing the `Taster`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Parent repo\n", "repo = '/datasets/hsc/repo/'\n", "\n", "# Location of tracts for a particular rerun and depth relative to main repo\n", "rerun = 'DM-13666' # DM-13666, DM-10404 \n", "depth = 'WIDE' # WIDE, DEEP, UDEEP\n", "tract_location = 'rerun/' + rerun + '/' + depth" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Execute one of the following two cells. The second will make `tarquin` aware of the tracts for the dataset, while the first will just look at the repo as a whole and not visualize any sky area." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin = stackclub.Taster(repo, vb=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin = stackclub.Taster(repo, vb=True, path_to_tracts=tract_location)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Properties of the `Taster`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The taster, `tarquin`, carries a butler around with it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(tarquin.butler)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we ask the taster to investigate a folder that is not a repo, its butler will be `None`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "failed = stackclub.Taster('not-a-repo', vb=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(failed.butler)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The taster uses its butler to query the metadata of the repo for datasets, skymaps, etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.look_for_datasets_of_type(['raw', 'calexp', 'deepCoadd_calexp', 'deepCoadd_mergeDet'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **PROBLEM: these last two datatypes are not listed in the repo metadata. This is one of the issues with the Gen-2 butler, and the `Taster` is not smart enough to search the tract folders for catalog files. This should be updated with Gen-3.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.look_for_skymap()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `what_exists` method searches for everything \"interesting\". 
In the `taster.py` module, \"interesting\" currently consists of:\n", "* `'raw'`\n", "* `'calexp'` \n", "* `'src'`\n", "* `'deepCoadd_calexp'`\n", "* `'deepCoadd_meas'` \n", "\n", "but this method can easily be updated to include more dataset types." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.what_exists()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you wish to check the existence of all dataset types, you can use the `all` parameter of the `what_exists()` method to do exactly that. Checking all dataset types may take a minute or so (while the `Taster` does a lot of database queries)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.what_exists(all=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A dictionary with existence information is stored in the `exists` attribute:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.exists" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Taster` can report on the data available, counting the number of visits, sources, etc., according to what's in the repo. It uses methods like this one:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.estimate_sky_area()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and this one:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.count_things()\n", "\n", "print(tarquin.counts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When the `estimate_sky_area` method runs, `tarquin` collects all the tracts associated with the repo. A list of the tracts is stored in the attribute `tarquin.tracts`." 
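] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Aside: a pure-Python sketch of the solid-angle arithmetic behind a\n", "# sky-area estimate like `estimate_sky_area` (this is not the Taster's\n", "# actual implementation). For a region spanning [ra0, ra1] x [dec0, dec1]\n", "# in degrees, the area in square degrees is the RA width times the\n", "# difference in sin(dec), rescaled by 180/pi. The bounds below are\n", "# hypothetical, not taken from any real tract.\n", "import math\n", "\n", "ra0, ra1 = 30.0, 31.5      # hypothetical bounds, degrees\n", "dec0, dec1 = -5.0, -3.5\n", "area_sq_deg = (ra1 - ra0) * (math.sin(math.radians(dec1)) - math.sin(math.radians(dec0))) * math.degrees(1.0)\n", "print(area_sq_deg)"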
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.tracts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the tracts, we can get a rough estimate for what parts of the sky have been targeted in the dataset. The method for doing this is `tarquin.plot_sky_coverage`; it follows the example code given in [Exploring_A_Data_Repo.ipynb](Exploring_A_Data_Repo.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.plot_sky_coverage()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To have your `Taster` do all the above, and just report on what it finds, do:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.report()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are interested in learning which fields, filters, visits, etc. have been counted by `tarquin`, remember that `tarquin` carries an instance of the `Butler` with it, so you can run typical `Butler` methods. For example, if you find it odd that the number of filters is 13, you can list the filters like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tarquin.butler.queryMetadata('calexp', ['filter'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more on the `Taster`'s methods, do, for example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# help(tarquin)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example Tastings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's compare the WIDE, DEEP and UDEEP parts of the HSC dataset." 
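] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Aside: a sketch of building the per-depth rerun paths with os.path.join\n", "# rather than string concatenation. The rerun and depth names are the ones\n", "# used throughout this notebook; `tract_locations` is a hypothetical name\n", "# introduced here for illustration only.\n", "import os\n", "\n", "rerun = 'DM-13666'\n", "tract_locations = {depth: os.path.join('rerun', rerun, depth)\n", "                   for depth in ['WIDE', 'DEEP', 'UDEEP']}\n", "print(tract_locations)"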
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "repo = '/datasets/hsc/repo/'\n", "rerun = 'DM-13666'\n", "\n", "for depth in ['WIDE', 'DEEP', 'UDEEP']:\n", "    tract_location = 'rerun/' + rerun + '/' + depth\n", "\n", "    taster = stackclub.Taster(repo, path_to_tracts=tract_location)\n", "    taster.report()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may notice that all **Metadata Characteristics** beginning with \"Number of\" are the same for the three depths. This is a result of the `Taster`'s `Butler` getting this information from the repo as a whole, rather than from the specific depth we specified for the tracts. There is more information on why the `Butler` works in this way in the [Exploring_A_Data_Repo.ipynb](https://github.com/LSSTScienceCollaborations/StackClub/blob/project/data_inventory/drphilmarshall/Basics/Exploring_A_Data_Repo.ipynb) notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "In this notebook we took a first look at the datasets available to us in two shared directories in the LSST Science Platform filesystem, and used the `stackclub.Taster` class to report on their basic properties and their sky coverage. 
Details on the methods used by the `Taster` can be found in the [Exploring_A_Data_Repo.ipynb](https://github.com/LSSTScienceCollaborations/StackClub/blob/project/data_inventory/drphilmarshall/Basics/Exploring_A_Data_Repo.ipynb) notebook, or by executing the following cell:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help(tarquin)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# STILL TODO\n", "* Build defensiveness into the `Taster` so that it can handle a wider variety of datasets.\n", "* Update the `Taster` to use the Gen-3 butler.\n", "\n", "### Looking at other shared datasets and repos\n", "\n", "The following loops over all shared datasets fail in interesting ways: some folders don't seem to be `Butler`-friendly. We need to do a bit more work to identify the actual repos available to us, and then use the `Taster` to provide a guide to all of them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for repo in shared_repos:\n", "    try:\n", "        taster = stackclub.Taster(repo)\n", "        taster.report()\n", "    except Exception as e:\n", "        print(\"Taster failed to explore repo\", repo, \":\", e)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for repo in repos:\n", "    try:\n", "        taster = stackclub.Taster(repo)\n", "        taster.report()\n", "    except Exception as e:\n", "        print(\"Taster failed to explore repo\", repo, \":\", e)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "LSST", "language": "python", "name": "lsst" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" }, "livereveal": { "scroll": true, "start_slideshow_at": "selected" } }, "nbformat": 4, "nbformat_minor": 2 }