{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Proper Motion\n", "\n", "In the previous lesson, we wrote a query to select stars from the region of the sky where we expect GD-1 to be, and saved the results in a FITS file.\n", "\n", "Now we'll read that data back and implement the next step in the analysis, identifying stars with the proper motion we expect for GD-1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outline\n", "\n", "Here are the steps in this lesson:\n", "\n", "1. We'll read back the results from the previous lesson, which we saved in a FITS file.\n", "\n", "2. Then we'll transform the coordinates and proper motion data from ICRS back to the coordinate frame of GD-1.\n", "\n", "3. We'll put those results into a Pandas `DataFrame`, which we'll use to select stars near the centerline of GD-1.\n", "\n", "4. Plotting the proper motion of those stars, we'll identify a region of proper motion for stars that are likely to be in GD-1.\n", "\n", "5. Finally, we'll select and plot the stars whose proper motion is in that region.\n", "\n", "After completing this lesson, you should be able to\n", "\n", "* Select rows and columns from an Astropy `Table`.\n", "\n", "* Use Matplotlib to make a scatter plot.\n", "\n", "* Use Gala to transform coordinates.\n", "\n", "* Make a Pandas `DataFrame` and use a Boolean `Series` to select rows.\n", "\n", "* Save a `DataFrame` in an HDF5 file.\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Installing libraries\n", "\n", "If you are running this notebook on Colab, you can run the following cell to install the libraries we'll use.\n", "\n", "If you are running this notebook on your own computer, you might have to install these libraries yourself. See the instructions in the preface." ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "tags": [] }, "outputs": [], "source": [ "# If we're running on Colab, install libraries\n", "\n", "import sys\n", "IN_COLAB = 'google.colab' in sys.modules\n", "\n", "if IN_COLAB:\n", " !pip install astroquery astro-gala" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reload the data\n", "\n", "In the previous lesson, we ran a query on the Gaia server and downloaded data for roughly 140,000 stars. We saved the data in a FITS file so that now, picking up where we left off, we can read the data from a local file rather than running the query again.\n", "\n", "If you ran the previous lesson successfully, you should already have a file called `gd1_results.fits` that contains the data we downloaded.\n", "\n", "If not, you can [download the file](https://github.com/AllenDowney/AstronomicalData/raw/main/data/gd1_results.fits) or run the following cell." 
] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "from os.path import basename, exists\n", "\n", "def download(url):\n", " filename = basename(url)\n", " if not exists(filename):\n", " from urllib.request import urlretrieve\n", " local, _ = urlretrieve(url, filename)\n", " print('Downloaded ' + local)\n", "\n", "download('https://github.com/AllenDowney/AstronomicalData/raw/main/' +\n", " 'data/gd1_results.fits')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now here's how we can read the data from the file back into an Astropy `Table`:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "from astropy.table import Table\n", "\n", "filename = 'gd1_results.fits'\n", "results = Table.read(filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is an Astropy `Table`.\n", "\n", "We can use `info` to refresh our memory of the contents." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "results.info" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting rows and columns\n", "\n", "In this section we'll see operations for selecting columns and rows from an Astropy `Table`. You can find more information about these operations in the [Astropy documentation](https://docs.astropy.org/en/stable/table/access_table.html).\n", "\n", "We can get the names of the columns like this:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "results.colnames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And select an individual column like this:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "results['ra']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a `Column` object that contains the data, and also the data type, units, and name of the column." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "type(results['ra'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The rows in the `Table` are numbered from 0 to `n-1`, where `n` is the number of rows. We can select the first row like this:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "results[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you might have guessed, the result is a `Row` object." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "type(results[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the bracket operator selects both columns and rows. You might wonder how it knows which to select.\n", "If the expression in brackets is a string, it selects a column; if the expression is an integer, it selects a row.\n", "\n", "If you apply the bracket operator twice, you can select a column and then an element from the column." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "results['ra'][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you can select a row and then an element from the row." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "results[0]['ra']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You get the same result either way." 
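] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an aside (not needed for the rest of the lesson), the bracket operator also accepts a list of column names or a slice of rows; in either case the result is a new `Table`. Here is a small optional sketch that selects just `ra` and `dec` for the first three rows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sketch: a list of names selects a subset of columns,\n", "# and a slice selects a subset of rows; each step returns a new Table.\n", "results[['ra', 'dec']][:3]"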
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scatter plot\n", "\n", "To see what the results look like, we'll use a scatter plot. The library we'll use is [Matplotlib](https://matplotlib.org/), which is the most widely-used plotting library for Python.\n", "The Matplotlib interface is based on MATLAB (hence the name), so if you know MATLAB, some of it will be familiar.\n", "\n", "We'll import like this." ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pyplot is part of the Matplotlib library. It is conventional to import it using the shortened name `plt`.\n", "\n", "In recent versions of Jupyter, plots appear \"inline\"; that is, they are part of the notebook. In some older versions, plots appear in a new window.\n", "In that case, you might want to run the following Jupyter [magic command](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-matplotlib) in a notebook cell:\n", "\n", "```\n", "%matplotlib inline\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pyplot provides two functions that can make scatterplots, [plt.scatter](https://matplotlib.org/3.3.0/api/_as_gen/matplotlib.pyplot.scatter.html) and [plt.plot](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html).\n", "\n", "* `scatter` is more versatile; for example, you can make every point in a scatter plot a different color.\n", "\n", "* `plot` is more limited, but for simple cases, it can be substantially faster. \n", "\n", "Jake Vanderplas explains these differences in [The Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/04.02-simple-scatter-plots.html).\n", "\n", "Since we are plotting more than 100,000 points and they are all the same size and color, we'll use `plot`.\n", "\n", "Here's a scatter plot with right ascension on the x-axis and declination on the y-axis, both ICRS coordinates in degrees." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "x = results['ra']\n", "y = results['dec']\n", "plt.plot(x, y, 'ko')\n", "\n", "plt.xlabel('ra (degree ICRS)')\n", "plt.ylabel('dec (degree ICRS)');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The arguments to `plt.plot` are `x`, `y`, and a string that specifies the style. In this case, the letters `ko` indicate that we want a black, round marker (`k` is for black because `b` is for blue).\n", "The functions `xlabel` and `ylabel` put labels on the axes.\n", "\n", "Looking at this plot, we can see that the region we selected, which is a rectangle in GD-1 coordinates, is a non-rectangular region in ICRS coordinates.\n", "\n", "However, this scatter plot has a problem. It is \"[overplotted](https://python-graph-gallery.com/134-how-to-avoid-overplotting-with-python/)\", which means that there are so many overlapping points, we can't distinguish between high and low density areas.\n", "\n", "To fix this, we can provide optional arguments to control the size and transparency of the points."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "In the call to `plt.plot`, use the keyword argument `markersize` to make the markers smaller.\n", "\n", "Then add the keyword argument `alpha` to make the markers partly transparent.\n", "\n", "Adjust these arguments until you think the figure shows the data most clearly.\n", "\n", "Note: Once you have made these changes, you might notice that the figure shows stripes with lower density of stars. These stripes are caused by the way Gaia scans the sky, which [you can read about here](https://www.cosmos.esa.int/web/gaia/scanning-law). The dataset we are using, [Gaia Data Release 2](https://www.cosmos.esa.int/web/gaia/dr2), covers 22 months of observations; during this time, some parts of the sky were scanned more than others." ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transform back\n", "\n", "Remember that we selected data from a rectangle of coordinates in the `GD1Koposov10` frame, then transformed them to ICRS when we constructed the query.\n", "The coordinates in `results` are in ICRS.\n", "\n", "To plot them, we will transform them back to the `GD1Koposov10` frame; that way, the axes of the figure are aligned with the orbit of GD-1, which is useful for two reasons:\n", "\n", "* By transforming the coordinates, we can identify stars that are likely to be in GD-1 by selecting stars near the centerline of the stream, where $\\phi_2$ is close to 0.\n", "\n", "* By transforming the proper motions, we can identify stars with non-zero proper motion along the $\\phi_1$ axis.\n", "\n", "To do the transformation, we'll put the results into a `SkyCoord` object. In a previous lesson we created a `SkyCoord` object like this:" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "from astropy.coordinates import SkyCoord\n", "\n", "skycoord = SkyCoord(ra=results['ra'], dec=results['dec'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we're going to do something similar, but in addition to `ra` and `dec`, we'll also include:\n", "\n", "* `pmra` and `pmdec`, which are proper motion in the ICRS frame, and\n", "\n", "* `distance` and `radial_velocity`, which we explain below." ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "import astropy.units as u\n", "\n", "distance = 8 * u.kpc\n", "radial_velocity= 0 * u.km/u.s\n", "\n", "skycoord = SkyCoord(ra=results['ra'], \n", " dec=results['dec'],\n", " pm_ra_cosdec=results['pmra'],\n", " pm_dec=results['pmdec'], \n", " distance=distance, \n", " radial_velocity=radial_velocity)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the first four arguments, we use columns from `results`.\n", "\n", "For `distance` and `radial_velocity` we use constants, which we explain below.\n", "\n", "The result is an Astropy `SkyCoord` object, which we can transform to the GD-1 frame." ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "from gala.coordinates import GD1Koposov10\n", "\n", "gd1_frame = GD1Koposov10()\n", "transformed = skycoord.transform_to(gd1_frame)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is another `SkyCoord` object, now in the `GD1Koposov10` frame." 
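] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to spot-check the transformation (optional), you can peek at the first few values of the frame components `phi1` and `phi2`, which we will use again below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional spot check: the transformed SkyCoord exposes the GD-1 frame\n", "# components; here are the first few values of phi1 and phi2.\n", "transformed.phi1[:3], transformed.phi2[:3]"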
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reflex Correction\n", "\n", "The next step is to correct the proper motion measurements for the effect of the motion of our solar system around the Galactic center.\n", "\n", "When we created `skycoord`, we provided constant values for `distance` and `radial_velocity` rather than measurements from Gaia.\n", "\n", "That might seem like a strange thing to do, but here's the motivation:\n", "\n", "* Because the stars in GD-1 are so far away, the distance estimates we get from Gaia, which are based on parallax, are not very precise. So we replace them with our current best estimate of the mean distance to GD-1, about 8 kpc. See [Koposov, Rix, and Hogg, 2010](https://ui.adsabs.harvard.edu/abs/2010ApJ...712..260K/abstract).\n", "\n", "* For the other stars in the table, this distance estimate will be inaccurate, so reflex correction will not be correct. But that should have only a small effect on our ability to identify stars with the proper motion we expect for GD-1.\n", "\n", "* The measurement of radial velocity has no effect on the correction for proper motion, but we have to provide a value to avoid errors in the reflex correction calculation. So we provide `0` as an arbitrary place-keeper." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this preparation, we can use `reflex_correct` from Gala ([documentation here](https://gala-astro.readthedocs.io/en/latest/api/gala.coordinates.reflex_correct.html)) to correct for the motion of the solar system." ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "from gala.coordinates import reflex_correct\n", "\n", "skycoord_gd1 = reflex_correct(transformed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a `SkyCoord` object that contains \n", "\n", "* `phi1` and `phi2`, which represent the transformed coordinates in the `GD1Koposov10` frame.\n", "\n", "* `pm_phi1_cosphi2` and `pm_phi2`, which represent the transformed and corrected proper motions.\n", "\n", "We can select the coordinates and plot them like this:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "scrolled": true }, "outputs": [], "source": [ "x = skycoord_gd1.phi1\n", "y = skycoord_gd1.phi2\n", "plt.plot(x, y, 'ko', markersize=0.1, alpha=0.1)\n", "\n", "plt.xlabel('phi1 (degree GD1)')\n", "plt.ylabel('phi2 (degree GD1)');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember that we started with a rectangle in the GD-1 frame. When transformed to the ICRS frame, it's a non-rectangular region. Now, transformed back to the GD-1 frame, it's a rectangle again." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas DataFrame\n", "\n", "At this point we have two objects containing different subsets of the data. `results` is the Astropy `Table` we downloaded from Gaia." ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "type(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And `skycoord_gd1` is a `SkyCoord` object that contains the transformed coordinates and proper motions." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "type(skycoord_gd1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On one hand, this division of labor makes sense because each object provides different capabilities. 
But working with multiple object types can be awkward.\n", "\n", "It will be more convenient to choose one object and get all of the data into it. We'll choose a Pandas `DataFrame`, for two reasons:\n", "\n", "1. It provides capabilities that are (almost) a superset of the other data structures, so it's the all-in-one solution.\n", "\n", "2. Pandas is a general-purpose tool that is useful in many domains, especially data science. If you are going to develop expertise in one tool, Pandas is a good choice.\n", "\n", "However, compared to an Astropy `Table`, Pandas has one big drawback: it does not keep the metadata associated with the table, including the units for the columns.\n", "\n", "It's easy to convert a `Table` to a Pandas `DataFrame`." ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "results_df = results.to_pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`DataFrame` provides `shape`, which shows the number of rows and columns." ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "results_df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It also provides `head`, which displays the first few rows. `head` is useful for spot-checking large results as you go along." ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [], "source": [ "results_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Python detail:** `shape` is an attribute, so we display its value without calling it as a function; `head` is a function, so we need the parentheses." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can extract the columns we want from `skycoord_gd1` and add them as columns in the `DataFrame`. `phi1` and `phi2` contain the transformed coordinates." ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "results_df['phi1'] = skycoord_gd1.phi1\n", "results_df['phi2'] = skycoord_gd1.phi2\n", "results_df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pm_phi1_cosphi2` and `pm_phi2` contain the components of proper motion in the transformed frame." ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [], "source": [ "results_df['pm_phi1'] = skycoord_gd1.pm_phi1_cosphi2\n", "results_df['pm_phi2'] = skycoord_gd1.pm_phi2\n", "results_df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Detail:** If you notice that `SkyCoord` has an attribute called `proper_motion`, you might wonder why we are not using it.\n", "\n", "We could have: `proper_motion` contains the same data as `pm_phi1_cosphi2` and `pm_phi2`, but in a different format." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring data\n", "\n", "One benefit of using Pandas is that it provides functions for exploring the data and checking for problems.\n", "One of the most useful of these functions is `describe`, which computes summary statistics for each column." ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "results_df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Review the summary statistics in this table.\n", "\n", "* Do the values make sense based on what you know about the context?\n", "\n", "* Do you see any values that seem problematic, or evidence of other data issues?"
] }, { "cell_type": "code", "execution_count": 82, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot proper motion\n", "\n", "Now we are ready to replicate one of the panels in Figure 1 of the Price-Whelan and Bonaca paper, the one that shows components of proper motion as a scatter plot:\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this figure, the shaded area identifies stars that are likely to be in GD-1 because:\n", "\n", "* Due to the nature of tidal streams, we expect the proper motion for stars in GD-1 to be along the axis of the stream; that is, we expect motion in the direction of `phi2` to be near 0.\n", "\n", "* In the direction of `phi1`, we don't have a prior expectation for proper motion, except that it should form a cluster at a non-zero value.\n", "\n", "By plotting proper motion in the GD-1 frame, we hope to find this cluster.\n", "Then we will use the bounds of the cluster to select stars that are more likely to be in GD-1. \n", "\n", "The following figure is a scatter plot of proper motion, in the GD-1 frame, for the stars in `results_df`." ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "x = results_df['pm_phi1']\n", "y = results_df['pm_phi2']\n", "plt.plot(x, y, 'ko', markersize=0.1, alpha=0.1)\n", " \n", "plt.xlabel('Proper motion phi1 (mas/yr GD1 frame)')\n", "plt.ylabel('Proper motion phi2 (mas/yr GD1 frame)');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the proper motions are near the origin, but there are a few extreme values.\n", "Following the example in the paper, we'll use `xlim` and `ylim` to zoom in on the region near the origin." ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [], "source": [ "x = results_df['pm_phi1']\n", "y = results_df['pm_phi2']\n", "plt.plot(x, y, 'ko', markersize=0.1, alpha=0.1)\n", " \n", "plt.xlabel('Proper motion phi1 (mas/yr GD1 frame)')\n", "plt.ylabel('Proper motion phi2 (mas/yr GD1 frame)')\n", "\n", "plt.xlim(-12, 8)\n", "plt.ylim(-10, 10);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a hint of an overdense region near (-7.5, 0), but if you didn't know where to look, you would miss it.\n", "\n", "To see the cluster more clearly, we need a sample that contains a higher proportion of stars in GD-1.\n", "We'll do that by selecting stars close to the centerline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting the centerline\n", "\n", "As we can see in the following figure, many stars in GD-1 are less than 1 degree from the line `phi2=0`.\n", "\n", "\n", "\n", "Stars near this line have the highest probability of being in GD-1.\n", "\n", "To select them, we will use a \"Boolean mask\". We'll start by selecting the `phi2` column from the `DataFrame`:" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [], "source": [ "phi2 = results_df['phi2']\n", "type(phi2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a `Series`, which is the structure Pandas uses to represent columns.\n", "\n", "We can use a comparison operator, `>`, to compare the values in a `Series` to a constant." 
] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [], "source": [ "phi2_min = -1.0 * u.degree\n", "phi2_max = 1.0 * u.degree\n", "\n", "mask = (phi2 > phi2_min)\n", "type(mask)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a `Series` of Boolean values, that is, `True` and `False`. " ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [], "source": [ "mask.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select values that fall between `phi2_min` and `phi2_max`, we'll use the `&` operator, which computes \"logical AND\".\n", "The result is true where elements from both Boolean `Series` are true." ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "mask = (phi2 > phi2_min) & (phi2 < phi2_max)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Python detail:** Python's logical operators (`and`, `or`, and `not`) don't work with NumPy or Pandas. Both libraries use the bitwise operators (`&`, `|`, and `~`) to do elementwise logical operations ([explanation here](https://stackoverflow.com/questions/21415661/logical-operators-for-boolean-indexing-in-pandas)).\n", "\n", "Also, we need the parentheses around the conditions; otherwise the order of operations is incorrect." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sum of a Boolean `Series` is the number of `True` values, so we can use `sum` to see how many stars are in the selected region." ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "mask.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Boolean `Series` is sometimes called a \"mask\" because we can use it to mask out some of the rows in a `DataFrame` and select the rest, like this:" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "centerline_df = results_df[mask]\n", "type(centerline_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`centerline_df` is a `DataFrame` that contains only the rows from `results_df` that correspond to `True` values in `mask`.\n", "So it contains the stars near the centerline of GD-1.\n", "\n", "We can use `len` to see how many rows are in `centerline_df`:" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "len(centerline_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And what fraction of the rows we've selected." ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [], "source": [ "len(centerline_df) / len(results_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are about 25,000 stars in this region, about 18% of the total." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting proper motion\n", "\n", "Since we've plotted proper motion several times, let's put that code in a function." 
] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [], "source": [ "def plot_proper_motion(df):\n", " \"\"\"Plot proper motion.\n", " \n", " df: DataFrame with `pm_phi1` and `pm_phi2`\n", " \"\"\"\n", " x = df['pm_phi1']\n", " y = df['pm_phi2']\n", " plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)\n", "\n", " plt.xlabel('Proper motion phi1 (mas/yr GD1 frame)')\n", " plt.ylabel('Proper motion phi2 (mas/yr GD1 frame)')\n", "\n", " plt.xlim(-12, 8)\n", " plt.ylim(-10, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can call it like this:" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [], "source": [ "plot_proper_motion(centerline_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can see more clearly that there is a cluster near (-7.5, 0).\n", "\n", "You might notice that our figure is less dense than the one in the paper. That's because we started with a set of stars from a relatively small region. The figure in the paper is based on a region about 10 times bigger.\n", "\n", "In the next lesson we'll go back and select stars from a larger region. But first we'll use the proper motion data to identify stars likely to be in GD-1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering based on proper motion\n", "\n", "The next step is to select stars in the \"overdense\" region of proper motion, which are candidates to be in GD-1.\n", "\n", "In the original paper, Price-Whelan and Bonaca used a polygon to cover this region, as shown in this figure.\n", "\n", "\n", "\n", "We'll use a simple rectangle for now, but in a later lesson we'll see how to select a polygonal region as well.\n", "\n", "Here are bounds on proper motion we chose by eye:" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [], "source": [ "pm1_min = -8.9\n", "pm1_max = -6.9\n", "pm2_min = -2.2\n", "pm2_max = 1.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To draw these bounds, we'll use `make_rectangle` to make two lists containing the coordinates of the corners of the rectangle." ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [], "source": [ "def make_rectangle(x1, x2, y1, y2):\n", " \"\"\"Return the corners of a rectangle.\"\"\"\n", " xs = [x1, x1, x2, x2, x1]\n", " ys = [y1, y2, y2, y1, y1]\n", " return xs, ys" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [], "source": [ "pm1_rect, pm2_rect = make_rectangle(\n", " pm1_min, pm1_max, pm2_min, pm2_max)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's what the plot looks like with the bounds we chose." ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "plot_proper_motion(centerline_df)\n", "plt.plot(pm1_rect, pm2_rect, '-');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've identified the bounds of the cluster in proper motion, we'll use it to select rows from `results_df`.\n", "\n", "We'll use the following function, which uses Pandas operators to make a mask that selects rows where `series` falls between `low` and `high`." 
] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [], "source": [ "def between(series, low, high):\n", " \"\"\"Check whether values are between `low` and `high`.\"\"\"\n", " return (series > low) & (series < high)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following mask selects stars with proper motion in the region we chose." ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [], "source": [ "pm1 = results_df['pm_phi1']\n", "pm2 = results_df['pm_phi2']\n", "\n", "pm_mask = (between(pm1, pm1_min, pm1_max) & \n", " between(pm2, pm2_min, pm2_max))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, the sum of a Boolean series is the number of `True` values." ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [], "source": [ "pm_mask.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use this mask to select rows from `results_df`." ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [], "source": [ "selected_df = results_df[pm_mask]\n", "len(selected_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are the stars we think are likely to be in GD-1. Let's see what they look like, plotting their coordinates (not their proper motion)." ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [], "source": [ "x = selected_df['phi1']\n", "y = selected_df['phi2']\n", "plt.plot(x, y, 'ko', markersize=1, alpha=1)\n", "\n", "plt.xlabel('phi1 (degree GD1)')\n", "plt.ylabel('phi2 (degree GD1)');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that's starting to look like a tidal stream!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Saving the DataFrame\n", "\n", "At this point we have run a successful query and cleaned up the results; this is a good time to save the data.\n", "\n", "To save a Pandas `DataFrame`, one option is to convert it to an Astropy `Table`, like this:" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [], "source": [ "selected_table = Table.from_pandas(selected_df)\n", "type(selected_table)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we could write the `Table` to a FITS file, as we did in the previous lesson. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But Pandas provides functions to write DataFrames in other formats; to see what they are [find the functions here that begin with `to_`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).\n", "\n", "One of the best options is HDF5, which is Version 5 of [Hierarchical Data Format](https://en.wikipedia.org/wiki/Hierarchical_Data_Format).\n", "\n", "HDF5 is a binary format, so files are small and fast to read and write (like FITS, but unlike XML).\n", "\n", "An HDF5 file is similar to an SQL database in the sense that it can contain more than one table, although in HDF5 vocabulary, a table is called a Dataset. ([Multi-extension FITS files](https://www.stsci.edu/itt/review/dhb_2011/Intro/intro_ch23.html) can also contain more than one table.)\n", "\n", "And HDF5 stores the metadata associated with the table, including column names, row labels, and data types (like FITS).\n", "\n", "Finally, HDF5 is a cross-language standard, so if you write an HDF5 file with Pandas, you can read it back with many other software tools (more than FITS)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can write a Pandas `DataFrame` to an HDF5 file like this:" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [], "source": [ "filename = 'gd1_data.hdf'\n", "\n", "selected_df.to_hdf(filename, 'selected_df', mode='w')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because an HDF5 file can contain more than one Dataset, we have to provide a name, or \"key\", that identifies the Dataset in the file.\n", "\n", "We could use any string as the key, but it will be convenient to give the Dataset in the file the same name as the `DataFrame`.\n", "\n", "The argument `mode='w'` means that if the file already exists, we should overwrite it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise \n", "\n", "We're going to need `centerline_df` later as well. Write a line of code to add it as a second Dataset in the HDF5 file.\n", "\n", "Hint: Since the file already exists, you should *not* use `mode='w'`." ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use `getsize` to confirm that the file exists and check the size:" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [], "source": [ "from os.path import getsize\n", "\n", "MB = 1024 * 1024\n", "getsize(filename) / MB" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you forget what the names of the Datasets in the file are, you can read them back like this:" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [], "source": [ "with pd.HDFStore(filename) as hdf:\n", " print(hdf.keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Python note:** We use a `with` statement here to open the file before the print statement and (automatically) close it after. Read more about [context managers](https://book.pythontips.com/en/latest/context_managers.html). \n", "\n", "The keys are the names of the Datasets. Notice that they start with `/`, which indicates that they are at the top level of the Dataset hierarchy, and not in a named \"group\".\n", "\n", "In future lessons we will add a few more Datasets to this file, but not so many that we need to organize them into groups." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "In this lesson, we re-loaded the Gaia data we saved from a previous query.\n", "\n", "We transformed the coordinates and proper motion from ICRS to a frame aligned with the orbit of GD-1, and stored the results in a Pandas `DataFrame`.\n", "\n", "Then we replicated the selection process from the Price-Whelan and Bonaca paper:\n", "\n", "* We selected stars near the centerline of GD-1 and made a scatter plot of their proper motion.\n", "\n", "* We identified a region of proper motion that contains stars likely to be in GD-1.\n", "\n", "* We used a Boolean `Series` as a mask to select stars whose proper motion is in that region.\n", "\n", "So far, we have used data from a relatively small region of the sky. In the next lesson, we'll write a query that selects stars based on proper motion, which will allow us to explore a larger region." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Best practices\n", "\n", "* When you make a scatter plot, adjust the size of the markers and their transparency so the figure is not overplotted; otherwise it can misrepresent the data badly.\n", "\n", "* For simple scatter plots in Matplotlib, `plot` is faster than `scatter`.\n", "\n", "* An Astropy `Table` and a Pandas `DataFrame` are similar in many ways and they provide many of the same functions. They have pros and cons, but for many projects, either one would be a reasonable choice.\n", "\n", "* To store data from a Pandas `DataFrame`, a good option is an HDF file, which can contain multiple Datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 4 }