{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. Transformation and Selection\n", "\n", "In the previous lesson, we identified stars with the proper motion we expect for GD-1.\n", "\n", "Now we'll do the same selection in an ADQL query, which will make it possible to work with a larger region of the sky and still download less data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outline\n", "\n", "Here are the steps in this lesson:\n", "\n", "1. Using data from the previous lesson, we'll identify the values of proper motion for stars likely to be in GD-1.\n", "\n", "2. Then we'll compose an ADQL query that selects stars based on proper motion, so we can download only the data we need.\n", "\n", "That will make it possible to search a bigger region of the sky in a single query.\n", "We'll also see how to write the results to a CSV file.\n", "\n", "After completing this lesson, you should be able to\n", "\n", "* Transform proper motions from one frame to another.\n", "\n", "* Compute the convex hull of a set of points.\n", "\n", "* Write an ADQL query that selects based on proper motion.\n", "\n", "* Save data in CSV format." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Installing libraries\n", "\n", "If you are running this notebook on Colab, you can run the following cell to install the libraries we'll use.\n", "\n", "If you are running this notebook on your own computer, you might have to install these libraries yourself. See the instructions in the preface." ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "tags": [] }, "outputs": [], "source": [ "# If we're running on Colab, install libraries\n", "\n", "import sys\n", "IN_COLAB = 'google.colab' in sys.modules\n", "\n", "if IN_COLAB:\n", " !pip install astroquery astro-gala" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reload the data\n", "\n", "You can [download the data from the previous lesson](https://github.com/AllenDowney/AstronomicalData/raw/main/data/gd1_data.hdf) or run the following cell, which downloads it if necessary." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "from os.path import basename, exists\n", "\n", "def download(url):\n", " filename = basename(url)\n", " if not exists(filename):\n", " from urllib.request import urlretrieve\n", " local, _ = urlretrieve(url, filename)\n", " print('Downloaded ' + local)\n", "\n", "download('https://github.com/AllenDowney/AstronomicalData/raw/main/' +\n", " 'data/gd1_data.hdf')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can reload `centerline_df` and `selected_df`." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "filename = 'gd1_data.hdf'\n", "centerline_df = pd.read_hdf(filename, 'centerline_df')\n", "selected_df = pd.read_hdf(filename, 'selected_df')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selection by proper motion\n", "\n", "Let's review how we got to this point.\n", "\n", "1. We made an ADQL query to the Gaia server to get data for stars in the vicinity of GD-1.\n", "\n", "2. We transformed the coordinates to the `GD1Koposov10` frame so we could select stars along the centerline of GD-1.\n", "\n", "3. We plotted the proper motion of the centerline stars to identify the bounds of the overdense region.\n", "\n", "4. We made a mask that selects stars whose proper motion is in the overdense region.\n", "\n", "At this point we have downloaded data for a relatively large number of stars (more than 100,000) and selected a relatively small number (around 1000).\n", "\n", "It would be more efficient to use ADQL to select only the stars we need. That would also make it possible to download data covering a larger region of the sky.\n", "\n", "However, the selection we did was based on proper motion in the `GD1Koposov10` frame. In order to do the same selection in ADQL, we have to work with proper motions in ICRS.\n", "\n", "As a reminder, here's the rectangle we selected based on proper motion in the `GD1Koposov10` frame." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "pm1_min = -8.9\n", "pm1_max = -6.9\n", "pm2_min = -2.2\n", "pm2_max = 1.0" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "def make_rectangle(x1, x2, y1, y2):\n", " \"\"\"Return the corners of a rectangle.\"\"\"\n", " xs = [x1, x1, x2, x2, x1]\n", " ys = [y1, y2, y2, y1, y1]\n", " return xs, ys" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "pm1_rect, pm2_rect = make_rectangle(\n", " pm1_min, pm1_max, pm2_min, pm2_max)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we'll need to plot proper motion several times, we'll use the following function." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "def plot_proper_motion(df):\n", " \"\"\"Plot proper motion.\n", " \n", " df: DataFrame with `pm_phi1` and `pm_phi2`\n", " \"\"\"\n", " x = df['pm_phi1']\n", " y = df['pm_phi2']\n", " plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)\n", "\n", " plt.xlabel('Proper motion phi1 (GD1 frame)')\n", " plt.ylabel('Proper motion phi2 (GD1 frame)')\n", "\n", " plt.xlim(-12, 8)\n", " plt.ylim(-10, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following figure shows:\n", "\n", "* Proper motion for the stars we selected along the center line of GD-1,\n", "\n", "* The rectangle we selected, and\n", "\n", "* The stars inside the rectangle highlighted in green." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "plot_proper_motion(centerline_df)\n", "\n", "plt.plot(pm1_rect, pm2_rect)\n", "\n", "x = selected_df['pm_phi1']\n", "y = selected_df['pm_phi2']\n", "plt.plot(x, y, 'gx', markersize=0.3, alpha=0.3);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll make the same plot using proper motions in the ICRS frame, which are stored in columns `pmra` and `pmdec`." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "x = centerline_df['pmra']\n", "y = centerline_df['pmdec']\n", "plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)\n", "\n", "x = selected_df['pmra']\n", "y = selected_df['pmdec']\n", "plt.plot(x, y, 'gx', markersize=1, alpha=0.3)\n", " \n", "plt.xlabel('Proper motion ra (ICRS frame)')\n", "plt.ylabel('Proper motion dec (ICRS frame)')\n", "\n", "plt.xlim([-10, 5])\n", "plt.ylim([-20, 5]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The proper motions of the selected stars are more spread out in this frame, which is why it was preferable to do the selection in the GD-1 frame.\n", "\n", "But now we can define a polygon that encloses the proper motions of these stars in ICRS, and use that polygon as a selection criterion in an ADQL query." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convex Hull\n", "\n", "SciPy provides a function that computes the [convex hull](https://en.wikipedia.org/wiki/Convex_hull) of a set of points, which is the smallest convex polygon that contains all of the points.\n", "\n", "To use it, we'll select columns `pmra` and `pmdec` and convert them to a NumPy array." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "points = selected_df[['pmra','pmdec']].to_numpy()\n", "points.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NOTE: If you are using an older version of Pandas, you might not have `to_numpy()`; you can use `values` instead, like this:\n", "\n", "```\n", "points = selected_df[['pmra','pmdec']].values\n", "\n", "```\n", "\n", "We'll pass the points to `ConvexHull`, which returns an object that contains the results. " ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "from scipy.spatial import ConvexHull\n", "\n", "hull = ConvexHull(points)\n", "hull" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`hull.vertices` contains the indices of the points that fall on the perimeter of the hull." ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "hull.vertices" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use them as an index into the original array to select the corresponding rows." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "pm_vertices = points[hull.vertices]\n", "pm_vertices" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To plot the resulting polygon, we have to pull out the x and y coordinates." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "pmra_poly, pmdec_poly = np.transpose(pm_vertices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This use of `transpose` is a useful NumPy idiom. Because `pm_vertices` has two columns, its [matrix transpose](https://en.wikipedia.org/wiki/Transpose) has two rows, which are assigned to the two variables `pmra_poly` and `pmdec_poly`.\n", "\n", "The following figure shows proper motion in ICRS again, along with the convex hull we just computed." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "x = centerline_df['pmra']\n", "y = centerline_df['pmdec']\n", "plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)\n", "\n", "x = selected_df['pmra']\n", "y = selected_df['pmdec']\n", "plt.plot(x, y, 'gx', markersize=0.3, alpha=0.3)\n", "\n", "plt.plot(pmra_poly, pmdec_poly)\n", " \n", "plt.xlabel('Proper motion phi1 (ICRS frame)')\n", "plt.ylabel('Proper motion phi2 (ICRS frame)')\n", "\n", "plt.xlim([-10, 5])\n", "plt.ylim([-20, 5]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So `pm_vertices` represents the polygon we want to select.\n", "The next step is to use it as part of an ADQL query." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assembling the query\n", "\n", "In Lesson 2 we used the following query to select stars in a polygonal region." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "query5_base = \"\"\"SELECT\n", "{columns}\n", "FROM gaiadr2.gaia_source\n", "WHERE parallax < 1\n", " AND bp_rp BETWEEN -0.75 AND 2 \n", " AND 1 = CONTAINS(POINT(ra, dec), \n", " POLYGON({point_list}))\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this lesson we'll make two changes:\n", "\n", "1. We'll select stars with coordinates in a larger region.\n", "\n", "2. We'll add another clause to select stars whose proper motion is in the polygon we just computed, `pm_vertices`.\n", "\n", "Here are the coordinates of the larger rectangle in the GD-1 frame." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "import astropy.units as u\n", "\n", "phi1_min = -70 * u.degree\n", "phi1_max = -20 * u.degree\n", "phi2_min = -5 * u.degree\n", "phi2_max = 5 * u.degree" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We selected these bounds by trial and error, defining the largest region we can process in a single query." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "phi1_rect, phi2_rect = make_rectangle(\n", " phi1_min, phi1_max, phi2_min, phi2_max)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's how we transform it to ICRS, as we saw in Lesson 2." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "from gala.coordinates import GD1Koposov10\n", "from astropy.coordinates import SkyCoord\n", "\n", "gd1_frame = GD1Koposov10()\n", "corners = SkyCoord(phi1=phi1_rect, \n", " phi2=phi2_rect, \n", " frame=gd1_frame)\n", "\n", "corners_icrs = corners.transform_to('icrs')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use `corners_icrs` as part of an ADQL query, we have to convert it to a string. \n", "Here's the function from Lesson 2 we used to do that." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "def skycoord_to_string(skycoord):\n", " \"\"\"Convert SkyCoord to string.\"\"\"\n", " t = skycoord.to_string()\n", " s = ' '.join(t)\n", " return s.replace(' ', ', ')" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "point_list = skycoord_to_string(corners_icrs)\n", "point_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the columns we want to select." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "columns = 'source_id, ra, dec, pmra, pmdec'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have everything we need to assemble the query.\n" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "query5 = query5_base.format(columns=columns, \n", " point_list=point_list)\n", "print(query5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But don't try to run that query.\n", "Because it selects a larger region, there are too many stars to handle in a single query.\n", "Until we select by proper motion, that is." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting proper motion\n", "\n", "Now we're ready to add a `WHERE` clause to select stars whose proper motion falls in the polygon defined by `pm_vertices`.\n", "\n", "To use `pm_vertices` as part of an ADQL query, we have to convert it to a string.\n", "Using `flatten` and `array2string`, we can almost get the format we need." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "s = np.array2string(pm_vertices.flatten(), \n", " max_line_width=1000,\n", " separator=',')\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We just have to remove the brackets." ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "scrolled": true }, "outputs": [], "source": [ "pm_point_list = s.strip('[]')\n", "pm_point_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Define `query6_base`, starting with `query5_base` and adding a new clause to select stars whose coordinates of proper motion, `pmra` and `pmdec`, fall within the polygon defined by `pm_point_list`." ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Use `format` to format `query6_base` and define `query6`, filling in the values of `columns`, `point_list`, and `pm_point_list`." ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can run the query like this:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from astroquery.gaia import Gaia\n", "\n", "job = Gaia.launch_job_async(query6)\n", "print(job)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And get the results." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "candidate_table = job.get_results()\n", "len(candidate_table)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We call the results `candidate_table` because it contains stars that are good candidates for GD-1.\n", "\n", "For the next lesson, we'll need `point_list` and `pm_point_list` again, so we should save them in a file.\n", "There are several ways we could do that, but since we are already storing data in an HDF file, let's do the same with these variables.\n", "\n", "We've seen how to save a `DataFrame` in an HDF file.\n", "We can do the same thing with a Pandas `Series`.\n", "To make one, we'll start with a dictionary:" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "d = dict(point_list=point_list, pm_point_list=pm_point_list)\n", "d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And use it to initialize a `Series.`" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "point_series = pd.Series(d)\n", "point_series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can save it in the usual way." ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "filename = 'gd1_data.hdf'\n", "point_series.to_hdf(filename, 'point_series')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting one more time\n", "\n", "Let's see what the results look like." ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "x = candidate_table['ra']\n", "y = candidate_table['dec']\n", "plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)\n", "\n", "plt.xlabel('ra (degree ICRS)')\n", "plt.ylabel('dec (degree ICRS)');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we can see why it was useful to transform these coordinates. In ICRS, it is more difficult to identity the stars near the centerline of GD-1.\n", "\n", "So let's transform the results back to the GD-1 frame.\n", "Here's the code we used to transform the coordinates and make a Pandas `DataFrame`, wrapped in a function." ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "from gala.coordinates import reflex_correct\n", "\n", "def make_dataframe(table):\n", " \"\"\"Transform coordinates from ICRS to GD-1 frame.\n", " \n", " table: Astropy Table\n", " \n", " returns: Pandas DataFrame\n", " \"\"\"\n", " skycoord = SkyCoord(\n", " ra=table['ra'], \n", " dec=table['dec'],\n", " pm_ra_cosdec=table['pmra'],\n", " pm_dec=table['pmdec'], \n", " distance=8*u.kpc, \n", " radial_velocity=0*u.km/u.s)\n", "\n", " gd1_frame = GD1Koposov10()\n", " transformed = skycoord.transform_to(gd1_frame)\n", " skycoord_gd1 = reflex_correct(transformed)\n", "\n", " df = table.to_pandas()\n", " df['phi1'] = skycoord_gd1.phi1\n", " df['phi2'] = skycoord_gd1.phi2\n", " df['pm_phi1'] = skycoord_gd1.pm_phi1_cosphi2\n", " df['pm_phi2'] = skycoord_gd1.pm_phi2\n", " return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's how we use it:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "candidate_df = make_dataframe(candidate_table)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And let's see the results." ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "x = candidate_df['phi1']\n", "y = candidate_df['phi2']\n", "plt.plot(x, y, 'ko', markersize=0.5, alpha=0.5)\n", "\n", "plt.xlabel('phi1 (degree GD1)')\n", "plt.ylabel('phi2 (degree GD1)');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're starting to see GD-1 more clearly.\n", "We can compare this figure with this panel from Figure 1 from the original paper:\n", "\n", "\n", "\n", "This panel shows stars selected based on proper motion only, so it is comparable to our figure (although notice that it covers a wider region)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next lesson, we will use photometry data from Pan-STARRS to do a second round of filtering, and see if we can replicate this panel.\n", "\n", "\n", "\n", "Later we'll see how to add annotations like the ones in the figure and customize the style of the figure to present the results clearly and compellingly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "In the previous lesson we downloaded data for a large number of stars and then selected a small fraction of them based on proper motion.\n", "\n", "In this lesson, we improved this process by writing a more complex query that uses the database to select stars based on proper motion. This process requires more computation on the Gaia server, but then we're able to either:\n", "\n", "1. Search the same region and download less data, or\n", "\n", "2. Search a larger region while still downloading a manageable amount of data.\n", "\n", "In the next lesson, we'll learn about the database `JOIN` operation and use it to download photometry data from Pan-STARRS." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Best practices\n", "\n", "* When possible, \"move the computation to the data\"; that is, do as much of the work as possible on the database server before downloading the data.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 4 }