{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Access with the Gen-2 and Gen-3 Butlers\n", "\n", "
Owner: **Daniel Perrefort** ([@djperrefort](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@djperrefort))\n", "
Updated for DC2 by: Douglas Tucker ([@douglasleetucker](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@douglasleetucker)) following in part work for DESC by Yao-Yuan Mao (@yymao) and Johann Cohen-Tanugi (@johannct)\n", "
Last Verified to Run: **2021-03-09**\n", "
Verified Stack Release: **w_2021_18**\n", "\n", "## Core Concepts\n", "\n", "This notebook provides a hands-on overview of how to interact with either the Gen-2 or the Gen-3 `Butler`. (A more detailed Gen-3-focused version of this tutorial can be found in `DC2Gen3ButlerTutorial.ipynb`.) The `Butler` provides a way to access information using a uniform interface without needing to keep track of how the information is internally stored or organized. Data access with `Butler` has three levels you need to be aware of:\n", "\n", "1. Each instantiated `Butler` object provides access to a collection of datasets called a **repository**. Each repository is defined by Butler using the local file directory where the data is stored.\n", "2. Each data set in a **repository** is assigned a unique name called a **type**. These types are strings that describe the data set and should not be confused with an \"object type\" as defined by Python.\n", "3. Individual entries in a data set are identified using a unique **data identifier**, which is a dictionary who's allowed keys and values depend on the data set you are working with.\n", "\n", "## Learning Objectives:\n", "\n", "This notebook demonstrates how to use the Gen-2 `Butler` object from the DM stack to access and manipulate data. After finishing this notebook, users will know how to:\n", "\n", "1. Load and access a data repository using `Butler`\n", "2. Select subsets of data and convert data into familiar data structures\n", "3. Use `Butler` to access coordinate information and cutout postage stamps\n", "4. Use `Butler` to access a skymap" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "import lsst.afw.display as afwDisplay\n", "from lsst.daf.persistence import Butler # gen2 butler\n", "import lsst.daf.butler as dafButler # gen3 butler\n", "import lsst.geom\n", "from lsst.geom import SpherePoint, Angle\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading Data\n", "\n", "To start we instantiate a `Butler` object by providing it with the directory of the **repository** we want to access. Next, we load a **type** of dataset and select data from a single **data identifier**. For this demonstration we consider the `deepCoadd_ref` dataset, which contains tables of information concerning coadded images used in the differencing image pipeline. The id values for this data set include two required values: `tract` and `patch` which denote sections of the sky.\n", "\n", "We have a choice of two data sets: a data set from the HyperSuprimeCam (HSC), and the DESC DC2 Run2.2i WFD data set (DC2). In either case, we choose an 'arbitrary' `tract` and `patch` (Want to figure out how we found this tract and patch? Check out the notebooks `Exploring_An_HSC_Data_Repo.ipynb` and `Exploring_A_DC2_Data_Repo.ipynb`.) For later, we also choose a filter. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#dataset='HSC'\n", "dataset='DC2'\n", "#genvers='gen2'\n", "genvers='gen3'\n", "\n", "# Temporary \"fix\" so one does not need to restart kernel \n", "# when switching from DC2 to HSC...\n", "# See also: https://lsstc.slack.com/archives/C3UCAEW3D/p1584386779038000\n", "#import lsst.afw.image as afwImage\n", "#print(afwImage.Filter.getNames())\n", "#afwImage.Filter.reset()\n", "import lsst.obs.base as obsBase\n", "obsBase.FilterDefinitionCollection.reset()\n", "#print(afwImage.Filter.getNames())\n", "\n", "if dataset == 'HSC':\n", " # Access HSC RC calexp gen2 repository\n", " repo = '/project/shared/data/DATA_ci_hsc/rerun/coaddForcedPhot'\n", " tract = 0\n", " patch = '1,1'\n", " filter = 'HSC-I'\n", " \n", " # Open the buter for this gen2 repo...\n", " butler = Butler(repo)\n", "\n", "elif dataset == 'DC2':\n", "\n", " if genvers == 'gen2':\n", "\n", " # Access DC2 calexp gen2 repository\n", " repo = '/datasets/DC2/DR6/Run2.2i/patched/2021-02-10/rerun/run2.2i-coadd-wfd-dr6-v1'\n", "\n", " #tract = 4851\n", " #patch = '1,4'\n", " # This DC2 tract and patch should contain a cluster at z=0.66 M=1.5e15:\n", " tract = 4024\n", " patch = '3,4'\n", " filter = 'i'\n", "\n", " # Open the butler for this gen2 repo...\n", " butler = Butler(repo)\n", "\n", " elif genvers == 'gen3':\n", "\n", " # Access DC2 gen3 repository\n", " repo='/repo/dc2'\n", " collection='2.2i/runs/DP0.1'\n", " \n", " #tract = 4851\n", " #patch = 29 #gen3 analog to gen2 patch id '1,4'\n", " # This DC2 tract and patch should contain a cluster at z=0.66 M=1.5e15:\n", " tract = 4024\n", " patch = 31 #gen3 analog to gen2 patch id '3,4'\n", " filter = 'i'\n", "\n", " # Open the butler for this gen3 repo...\n", " butler = dafButler.Butler(repo,collections=collection)\n", "\n", " else:\n", " msg = \"Unrecognized gen version: %s\"%genvers\n", " raise Exception(msg)\n", "\n", " \n", "else:\n", " msg = \"Unrecognized dataset: %s\"%dataset\n", " raise Exception(msg) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define the data id and data set type:\n", "data_id = {'tract': tract, 'patch': patch}\n", "dataset_type = 'deepCoadd_ref'\n", "\n", "# We can check that the data exists before we try to read it.\n", "# This works for both gen2 and gen3 bulters...\n", "data_exists = butler.datasetExists(dataset_type, dataId=data_id)\n", "print('Data exists for ID:', data_exists)\n", "\n", "data_entry = butler.get(dataset_type, dataId=data_id)\n", "data_entry" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data table returned above is formatted as a `SourceCatalog` object, which is essentially a collection of `numpy` arrays. We can see this when we index a particular column. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(type(data_entry['merge_measurement_i']))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`SourceCatalog` objects have their own set of methods for table manipulations (sorting, appending rows, etc.). However, we can also work with the data in a more familiar format, such as an astropy `Table` or a pandas `DataFrame`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_frame = data_entry.asAstropy().to_pandas()\n", "data_frame.head()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is important to note that `Butler` objects do not always return tabular data. We will see an example of this later when we load and parse image data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting Subsets of Data\n", "\n", "In practice, you may not know the format of the data identifier for a given data set. For the `Gen-2` butler, the `getKeys()` method can be used to determine the key values expected in a **data identifier**: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if genvers == 'gen2':\n", " data_id_format = butler.getKeys(dataset_type)\n", " print('Expected data id format:', data_id_format)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the `Gen-3` butler, however, you will want to use the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if genvers == 'gen3':\n", " data_id_format = butler.registry.getDatasetType(dataset_type).dimensions.required.names\n", " # You can also use:\n", " # data_id_format = butler.registry.getDatasetType(dataset_type).dimensions.names\n", " print('Expected data id format:', data_id_format)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is important to note that if you specify a key that is not part of the data **type**, the `Butler` will silently ignore it. This can be misleading. For example, in the previous example we read in a table that has a column of booleans named `merge_footprint_i`. If you specify `merge_footprint_i: True` in the dataID and rerun the example, `Butler` will ignore the extra key silently. As a result, you might expect the returned table to only include values where `merge_footprint_i` is `True`, but that isn't what happens. \n", "\n", "Here is an example of the correct way to select data from the returned table:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# An example of what not to do!!\n", "#\n", "# new_data_id = {'tract': tract, 'patch': patch, 'merge_footprint_i': True}\n", "# merged_i_data = butler.get(dataset_type, dataId=new_data_id)\n", "# assert merged_i_data['merge_measurement_i'].all()\n", "\n", "# Do this instead...\n", "new_data_id = {'tract': tract, 'patch': patch}\n", "merged_i_data = butler.get(dataset_type, dataId=new_data_id)\n", "merged_i_data = merged_i_data[merged_i_data['merge_measurement_i']].copy(True)\n", "\n", "# Check that the returned table does in fact have only entries where\n", "# merge_footprint_i is True\n", "print(merged_i_data['merge_measurement_i'].all())\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Important:** Note the use of `copy` above which is required to create an array that is contiguous in memory (yay!)\n", "\n", "You can also select all complete dataIds for a dataset type that match a partial (or empty) dataId. For example, for a `Gen-2` butler, the below cell iterates over all possible ids and checks if the corresponding file exists:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if genvers == 'gen2':\n", " subset = butler.subset(dataset_type, dataId=data_id)\n", " id_list = [dr.dataId for dr in subset if dr.datasetExists()]\n", " print(f'Available Ids:\\n {id_list}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, for a `Gen3` butler, one can use the `queryDatasets` command on the butler's `registry`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if genvers == 'gen3':\n", " registry = butler.registry\n", " datasetRefs = registry.queryDatasets(datasetType=dataset_type,dataId=data_id, collections=collection)\n", " print(f'Available Ids:\\n')\n", " for i,ref in enumerate(datasetRefs):\n", " print(ref.dataId.full)\n", " try: uri = butler.getURI(ref)\n", " except: print(\"File not found...\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Postage Stamps\n", "\n", "When dealing with image data, we can use `Butler` to generate postage stamps at a given set of coordinates. For this example, we consider the `deepCoadd` data set, which has one extra key value than the previous example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "coadd_type = 'deepCoadd'\n", "\n", "if genvers == 'gen2':\n", " butler.getKeys(coadd_type)\n", "elif genvers == 'gen3':\n", " print(butler.registry.getDatasetType(coadd_type).dimensions.required.names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to generate a postage stamp, we need to define the center and size of the cutout. First, we pick an RA and Dec from our previous example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(We note that the DESC version of this notebook has a very nice query to pick some large galaxies -- see https://github.com/LSSTDESC/StackClub/blob/master/Basics/ButlerTutorial.ipynb -- but this capability does not currently exist for the NCSA RSP, as it requires installation of the `easyquery` package. -- Douglas Tucker, 5 March 2021)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if dataset == 'HSC':\n", "\n", " nice_galaxy_indices = np.where((merged_i_data['base_ClassificationExtendedness_value'] == 1) & \n", " (merged_i_data['modelfit_CModel_instFlux'] > 5000) &\n", " (merged_i_data['modelfit_CModel_instFlux'] / merged_i_data['modelfit_CModel_instFluxErr'] > 10) &\n", " (merged_i_data['base_PixelFlags_flag'] == 0) &\n", " (merged_i_data['merge_footprint_r']) & \n", " (merged_i_data['detect_isPatchInner'])\n", " )[0]\n", "\n", " # Another possible option for HSC is to find indices of all targets with a flux between 100 and 500:\n", " # mask = np.where((merged_i_data['base_PsfFlux_flux'] > 100) & (merged_i_data['base_PsfFlux_flux'] < 500))\n", "\n", " \n", "elif dataset == 'DC2':\n", "\n", " nice_galaxy_indices = np.where((merged_i_data['base_ClassificationExtendedness_value'] == 1) & \n", " (merged_i_data['modelfit_CModel_instFlux'] > 5000) &\n", " (merged_i_data['modelfit_CModel_instFlux'] / merged_i_data['modelfit_CModel_instFluxErr'] > 10) &\n", " (merged_i_data['base_PixelFlags_flag'] == 0) &\n", " (merged_i_data['merge_footprint_g']) & \n", " (merged_i_data['merge_footprint_r']) & \n", " (merged_i_data['detect_isPatchInner'])\n", " )[0]\n", "\n", "else:\n", " \n", " nice_galaxy_indices = np.full((2),1000)\n", " \n", "\n", "print(nice_galaxy_indices)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Pick an RA and Dec\n", "i = nice_galaxy_indices[1]\n", "ra = np.degrees(merged_i_data['coord_ra'][i])\n", "dec = np.degrees(merged_i_data['coord_dec'][i])\n", "print(ra, dec)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we retrieve the image using the `Butler`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Retrieve the image using butler\n", "if genvers == 'gen2':\n", " coadd_id = {'tract': tract, 'patch': patch, 'filter': filter}\n", " image = butler.get(coadd_type, dataId=coadd_id)\n", "elif genvers == 'gen3':\n", " coadd_id = {'tract': tract, 'patch': patch, 'band': filter}\n", " image = butler.get(coadd_type, dataId=coadd_id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the postage stamp was generated using `Butler`, it is represented as an `afwImage` object. This is a data type from the DM stack that is used to represent images. Since it is a DM object, we choose to plot it using the DM `afwDisplay` module." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's take a look at the full image first\n", "fig = plt.figure(figsize=(10,10))\n", "display = afwDisplay.Display(frame=1, backend='matplotlib')\n", "display.scale(\"linear\", \"zscale\")\n", "display.mtv(image.getMaskedImage().getImage())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we plot our cutout from the full image." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define the center and size of our cutout\n", "radec = SpherePoint(ra, dec, lsst.geom.degrees)\n", "cutout_size = 300\n", "cutout_extent = lsst.geom.ExtentI(cutout_size, cutout_size)\n", "\n", "# Cutout and optionally save the postage stamp to file\n", "postage_stamp = image.getCutout(radec, cutout_extent)\n", "# postage_stamp.writeFits()\n", "\n", "# Convert RA,DEC on the sky to X,Y in the image\n", "xy = postage_stamp.getWcs().skyToPixel(radec)\n", "\n", "# Display image\n", "display = afwDisplay.Display(frame=1, backend='matplotlib')\n", "display.mtv(postage_stamp)\n", "display.scale(\"linear\", \"zscale\")\n", "display.dot('o', xy.getX(), xy.getY(), ctype='red')\n", "display.show_colorbar()\n", "\n", "plt.xlabel('x')\n", "plt.ylabel('y')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the cutout image is aware of the masks and pixel values of the original image. This is why the axis labels in the above cutout are so large. We also note that the orientation of the postage stamp is in the x, y orientation of the original coadded image.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating an RGB picture of a coadd patch¶ " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE: This section will not work for the HSC data set, since the HSC data set only includes 2 filter bands ('HSC-I' and 'HSC-R'). This section, however, does work for the DC2 data set. If you are not working with the DC data set, you may wish to skip down to the \"Selecting an Area on the Sky with a Sky Map\" section below.**\n", "\n", "A nice and simple interface is also available to create pretty pictures of patch areas (stolen from D. Boutigny). We are using the same patch as above, and define our three colors as bands r,i and g. Then we ask the deepCoadd type from the butler, which corresponds to coadded images. We finally make use of the `afw.display` interface to build the RGB image." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if dataset == 'DC2':\n", "\n", " import lsst.afw.display.rgb as rgb\n", "\n", " dataId = {'tract':tract, 'patch':patch}\n", "\n", " bandpass_color_map = {'green':'r', 'red':'i', 'blue':'g'}\n", "\n", " exposures = {}\n", " for bandpass in bandpass_color_map.values():\n", " if genvers == 'gen2':\n", " dataId['filter'] = bandpass\n", " elif genvers == 'gen3':\n", " dataId['band'] = bandpass\n", " exposures[bandpass] = butler.get(coadd_type, dataId=dataId)\n", "\n", " fig = plt.figure(figsize=(10,10))\n", " rgb_im = rgb.makeRGB(*(exposures[bandpass_color_map[color]].getMaskedImage().getImage()\n", " for color in ('red', 'green', 'blue')), Q=8, dataRange=1.0, \n", " xSize=None, ySize=None)\n", "\n", " rgb.displayRGB(rgb_im)\n", " \n", "else:\n", " \n", " print('Sorry, it looks as though this section only works for the DC2 data set.')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the RGB map the cluster appears very red!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can also create RGB cutout images!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if dataset == 'DC2':\n", "\n", " rgb_im = rgb.makeRGB(*(exposures[bandpass_color_map[color]].getCutout(radec, cutout_extent).getMaskedImage().getImage()\n", " for color in ('red', 'green', 'blue')), Q=8, dataRange=1.0, \n", " xSize=None, ySize=None)\n", "\n", " rgb.displayRGB(rgb_im)\n", "\n", "else:\n", " \n", " print('Sorry, it looks as though this section only works for the DC2 data set.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting an Area on the Sky with a Sky Map\n", "\n", "As a final example, we consider a third type of data that can be accessed via `Butler` called a `skyMap`. Sky maps allow you to look up information for a given `tract` and `patch`. You may notice from the below example that data set **types** tend to follow the naming convention of having a base name (e.g. `'deepCoadd'`) followed by a descriptor (e.g. `'_skyMap'`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if genvers == 'gen2':\n", " skymap = butler.get('deepCoadd_skyMap')\n", "elif genvers == 'gen3':\n", " skymap = butler.get(\"skyMap\", skymap='DC2')\n", "tract_info = skymap[tract]\n", "tract_info" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "patch_info = tract_info.getPatchInfo((1,1))\n", "patch_info" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tract_bbox = tract_info.getBBox()\n", "tract_pix_corners = lsst.geom.Box2D(tract_bbox).getCorners()\n", "print('Tract corners in pixels:\\n', tract_pix_corners)\n", "\n", "wcs = tract_info.getWcs()\n", "tract_deg_corners = wcs.pixelToSky(tract_pix_corners)\n", "tract_deg_corners = [[c.getRa().asDegrees(), c.getDec().asDegrees()] for c in tract_deg_corners]\n", "print('\\nTract corners in degrees:\\n', tract_deg_corners)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#You can also go in reverse to find the tract, patch that contains a coordinate (320.8,-0.4)\n", "coordList = [SpherePoint(Angle(np.radians(320.8)),Angle(np.radians(-0.4)))]\n", "tractInfo = skymap.findClosestTractPatchList(coordList)\n", "print(tractInfo[0][0])\n", "print(tractInfo[0][1])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "LSST", "language": "python", "name": "lsst" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 4 }