{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# An Inventory of the Shared Datasets in the LSST Science Platform\n",
"\n",
"Owner(s): **Phil Marshall** ([@drphilmarshall](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@drphilmarshall)), **Rob Morgan** ([@rmorgan10](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@rmorgan10))\n",
"\n",
"Last Verified to Run: **2019-08-13**\n",
"\n",
"Verified Stack Release: **18.1**\n",
"\n",
"In this notebook we'll take a look at some of the datasets available on the LSST Science Platform. \n",
"\n",
"### Learning Objectives:\n",
"\n",
"After working through this tutorial you should be able to: \n",
"1. Start figuring out which of the available datasets is going to be of most use to you in any given project; \n",
"\n",
"When the `stackclub.Taster` is finished, you should also be able to use it to:\n",
"2. Report on the available data in a given dataset;\n",
"3. Plot the patches and tracts in a given dataset on the sky.\n",
"\n",
"**Outstanding Issue:** The `Taster` augments the functionality of the Gen-2 butler, which provides limited capabilities for discovering what data *actually* exist. Specifically, the `Taster` relies heavily on the Gen-2 butler's `queryMetadata` method, which covers only a small number of dataset types and does not actually guarantee that those datasets exist. The user should beware of over-interpreting the true *existence* of datasets queried by the `Taster`. This should be improved greatly with the Gen-3 butler.\n",
"\n",
"### Logistics\n",
"This notebook is intended to be runnable on `lsst-lsp-stable.ncsa.illinois.edu` from a local git clone of https://github.com/LSSTScienceCollaborations/StackClub.\n",
"\n",
"## Set-up"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll need the `stackclub` package to be installed. If you are not developing this package, you can install it using `pip`, like this:\n",
"```\n",
"pip install git+git://github.com/LSSTScienceCollaborations/StackClub.git#egg=stackclub\n",
"```\n",
"If you are developing the `stackclub` package (e.g., by adding modules to it to support the Stack Club tutorial that you are writing), you'll need to make a local, editable installation. In the top level folder of the `StackClub` repo, do:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! cd .. && python setup.py -q develop --user && cd -"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may need to restart the kernel after doing this. When editing the `stackclub` package files, we want the latest version to be imported when we re-run the import command. To enable this, we need the `%autoreload` magic command."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"To get just a taste of the data that the Butler will deliver for a chosen dataset, we have added a `Taster` class to the `stackclub` library. All needed imports are contained in that module, so we only need to import `stackclub` to work through this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"%matplotlib inline\n",
"\n",
"import stackclub"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can find out which version of the Stack this notebook is running by using `eups list -s` on the terminal command line:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# What version of the Stack am I using?\n",
"! echo $HOSTNAME\n",
"! eups list -s lsst_distrib"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Listing the Available Datasets\n",
"First, let's look at what is currently available. There are several shared data folders in the LSP: the read-only `/datasets` folder, the project-group-writable `/project/shared/data` folder, and the Stack Club shared directory `/project/stack-club`. Let's take a look at what's in `/project/shared/data`. Specifically, we want to see butler-friendly data _repositories_, distinguished by a file called `_mapper` or `repositoryCfg.yaml` in their top level."
]
},
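{
"cell_type": "markdown",
"metadata": {},
"source": [
"The repo-detection rule above can also be sketched in pure Python. Note that this is an illustrative sketch, not part of `stackclub`, and the helper name `find_repos` is made up:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"def find_repos(root):\n",
"    \"\"\"Return subdirectories of root that look like Gen-2 butler repos.\"\"\"\n",
"    repos = []\n",
"    for name in sorted(os.listdir(root)):\n",
"        path = os.path.join(root, name)\n",
"        if not os.path.isdir(path):\n",
"            continue\n",
"        # A Gen-2 butler repo carries one of these files at its top level\n",
"        if any(os.path.exists(os.path.join(path, f))\n",
"               for f in ('_mapper', 'repositoryCfg.yaml')):\n",
"            repos.append(path)\n",
"    return repos"
]
},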
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**`/project/shared/data`:** These datasets are designed to be small test sets, ideal for tutorials."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"shared_repos_with_mappers = ! ls -d /project/shared/data/*/_mapper | grep -v README | cut -d'/' -f1-5 | sort | uniq\n",
"shared_repos_with_yaml_files = ! ls -d /project/shared/data/*/repositoryCfg.yaml | grep -v README | cut -d'/' -f1-5 | sort | uniq\n",
"shared_repos = np.unique(shared_repos_with_mappers + shared_repos_with_yaml_files)\n",
"\n",
"shared_repos"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for repo in shared_repos:\n",
"    ! du -sh $repo"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**`/datasets`:**\n",
"These are typically much bigger. To measure a dataset's size, enable the second cell below (remove the triple quotes) and edit it to target the dataset you are interested in; running `du` on all the folders takes several minutes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"repos_with_mappers = ! ls -d /datasets/*/repo/_mapper |& grep -v \"No such\" | cut -d'/' -f1-4 | sort | uniq\n",
"repos_with_yaml_files = ! ls -d /datasets/*/repo/repositoryCfg.yaml |& grep -v \"No such\" | cut -d'/' -f1-4 | sort | uniq\n",
"repos = np.unique(repos_with_mappers + repos_with_yaml_files)\n",
"\n",
"repos"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"for repo in repos:\n",
"    ! du -sh $repo\n",
"\"\"\";"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exploring the Data Repo with the Stack Club `Taster`\n",
"\n",
"The `stackclub` library provides a `Taster` class for exploring the datasets in a given repo. As an example, let's take a look at some HSC data using the `Taster`. When instantiating the `Taster`, if you plan to use it for visualizing sky coverage, you can provide it with a path to the tracts, relative to the main repo.\n",
"\n",
"### Initializing the `Taster`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Parent repo\n",
"repo = '/datasets/hsc/repo/'\n",
"\n",
"# Location of tracts for a particular rerun and depth, relative to the main repo\n",
"rerun = 'DM-13666' # DM-13666, DM-10404 \n",
"depth = 'WIDE' # WIDE, DEEP, UDEEP\n",
"tract_location = 'rerun/' + rerun + '/' + depth"
]
},
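{
"cell_type": "markdown",
"metadata": {},
"source": [
"String concatenation works, but `os.path.join` handles the separators for you. A small equivalent sketch, using the same `rerun` and `depth` values as above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"rerun, depth = 'DM-13666', 'WIDE'\n",
"tract_location = os.path.join('rerun', rerun, depth)\n",
"tract_location"
]
},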
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Execute one of the following two cells. The first will just look at the repo as a whole, without visualizing any sky area; the second will also make `tarquin` aware of the tracts for the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tarquin = stackclub.Taster(repo, vb=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tarquin = stackclub.Taster(repo, vb=True, path_to_tracts=tract_location)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Properties of the `Taster`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The taster, `tarquin`, carries a butler around with it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type(tarquin.butler)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we ask the taster to investigate a folder that is not a repo, its butler will be `None`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"failed = stackclub.Taster('not-a-repo', vb=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(failed.butler)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The taster uses its butler to query the metadata of the repo for datasets, skymaps, etc."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tarquin.look_for_datasets_of_type(['raw', 'calexp', 'deepCoadd_calexp', 'deepCoadd_mergeDet'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **PROBLEM: these last two dataset types are not listed in the repo metadata. This is one of the issues with the Gen-2 butler, and the `Taster` is not smart enough to search the tract folders for catalog files. This should be fixed with Gen-3.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tarquin.look_for_skymap()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `what_exists` method searches for everything \"interesting\". In the `taster.py` module, \"interesting\" currently consists of\n",
"* `'raw'`\n",
"* `'calexp'` \n",
"* `'src'`\n",
"* `'deepCoadd_calexp'`\n",
"* `'deepCoadd_meas'` \n",
"\n",
"but this method can easily be updated to include more dataset types."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tarquin.what_exists()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you wish to check the existence of all dataset types, use the `all` parameter of the `what_exists()` method. Checking all dataset types may take a minute or so, while the `Taster` performs a large number of database queries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tarquin.what_exists(all=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A dictionary with existence information is stored in the `exists` attribute:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tarquin.exists"
]
},
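{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `exists` is a plain dictionary mapping dataset type to a boolean, you can filter it with a comprehension. The dictionary below is made up for illustration; with a live `Taster` you would use `tarquin.exists` instead:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical existence dictionary, for illustration only\n",
"exists = {'raw': True, 'calexp': True, 'src': True,\n",
"          'deepCoadd_calexp': False, 'deepCoadd_meas': False}\n",
"\n",
"# Keep only the dataset types that were reported to exist\n",
"available = sorted(k for k, v in exists.items() if v)\n",
"available"
]
},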
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `Taster` can report on the data available, counting the number of visits, sources, etc., according to what's in the repo. It uses methods like this one:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tarquin.estimate_sky_area()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and this one:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tarquin.count_things()\n",
"\n",
"print(tarquin.counts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When the `estimate_sky_area` method runs, `tarquin` collects all the tracts associated with the repo. A list of the tracts is stored in the attribute `tarquin.tracts`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tarquin.tracts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the tracts, we can get a rough estimate for what parts of the sky have been targeted in the dataset. The method for doing this is `tarquin.plot_sky_coverage`, and follows the example code given in [Exploring_A_Data_Repo.ipynb](Exploring_A_Data_Repo.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tarquin.plot_sky_coverage()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To have your `Taster` do all the above, and just report on what it finds, do:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tarquin.report()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are interested in learning which fields, filters, visits, etc. have been counted by `tarquin`, remember that `tarquin` carries an instance of the `Butler` with it, so you can run the usual `Butler` methods. For example, if you found it odd that there are 13 filters, you can list them like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tarquin.butler.queryMetadata('calexp', ['filter'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more on the `Taster`'s methods, uncomment and run the following cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# help(tarquin)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example Tastings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's compare the WIDE, DEEP and UDEEP parts of the HSC dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"repo = '/datasets/hsc/repo/'\n",
"rerun = 'DM-13666'\n",
"\n",
"for depth in ['WIDE', 'DEEP', 'UDEEP']:\n",
" tract_location = 'rerun/' + rerun + '/' + depth\n",
" \n",
" taster = stackclub.Taster(repo, path_to_tracts=tract_location)\n",
" taster.report()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may notice that all **Metadata Characteristics** beginning with \"Number of\" are the same for the three depths. This is a result of `tarquin`'s `Butler` getting this information from the repo as a whole, rather than the specific depth we specified for the tracts. There is more information on why the `Butler` works in this way in the [Exploring_A_Data_Repo.ipynb](https://github.com/LSSTScienceCollaborations/StackClub/blob/project/data_inventory/drphilmarshall/Basics/Exploring_A_Data_Repo.ipynb) notebook."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"In this notebook we took a first look at the datasets available to us in two shared directories in the LSST Science Platform filesystem, and used the `stackclub.Taster` class to report on their basic properties and their sky coverage. Details on the methods used by the `Taster` can be found in the [Exploring_A_Data_Repo.ipynb](https://github.com/LSSTScienceCollaborations/StackClub/blob/project/data_inventory/drphilmarshall/Basics/Exploring_A_Data_Repo.ipynb) notebook, or by executing the following cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"help(tarquin)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Still To Do\n",
"* Build defensiveness into the `Taster` so that it can handle a wider variety of datasets.\n",
"* Update the `Taster` to use the Gen-3 butler.\n",
"\n",
"### Looking at other shared datasets and repos\n",
"\n",
"The following loops over all the shared datasets fail in interesting ways: some folders don't seem to be `Butler`-friendly. We need to do a bit more work to identify the actual repos available to us, and then use the `Taster` to provide a guide to all of them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for repo in shared_repos:\n",
"    try:\n",
"        taster = stackclub.Taster(repo)\n",
"        taster.report()\n",
"    except Exception:\n",
"        print(\"Taster failed to explore repo\", repo)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for repo in repos:\n",
"    try:\n",
"        taster = stackclub.Taster(repo)\n",
"        taster.report()\n",
"    except Exception:\n",
"        print(\"Taster failed to explore repo\", repo)"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "LSST",
"language": "python",
"name": "lsst"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.2"
},
"livereveal": {
"scroll": true,
"start_slideshow_at": "selected"
}
},
"nbformat": 4,
"nbformat_minor": 2
}