{ "cells": [ { "cell_type": "raw", "metadata": {}, "source": [ "---\n", "title: \"Join\"\n", "teaching: 3000\n", "exercises: 0\n", "questions:\n", "\n", "- \"How do we use `JOIN` to combine information from multiple tables?\"\n", "\n", "objectives:\n", "\n", "- \"Upload a table to the Gaia server.\"\n", "\n", "- \"Write ADQL queries involving `JOIN` operations.\"\n", "\n", "keypoints:\n", "\n", "- \"Use `JOIN` operations to combine data from multiple tables in a database, using some kind of identifier to match up records from one table with records from another.\"\n", "\n", "- \"This is another example of a practice we saw in the previous notebook, moving the computation to the data.\"\n", "\n", "---\n", "\n", "{% include links.md %}\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Joining Tables\n", "\n", "This is the fifth in a series of notebooks related to astronomy data.\n", "\n", "As a continuing example, we will replicate part of the analysis in a recent paper, \"[Off the beaten path: Gaia reveals GD-1 stars outside of the main stream](https://arxiv.org/abs/1805.00425)\" by Adrian M. Price-Whelan and Ana Bonaca.\n", "\n", "Picking up where we left off, the next step in the analysis is to select candidate stars based on photometry data.\n", "The following figure from the paper is a color-magnitude diagram for the stars selected based on proper motion:\n", "\n", "\n", "\n", "In red is a [stellar isochrone](https://en.wikipedia.org/wiki/Stellar_isochrone), showing where we expect the stars in GD-1 to fall based on the metallicity and age of their original globular cluster. \n", "\n", "By selecting stars in the shaded area, we can further distinguish the main sequence of GD-1 from younger background stars." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outline\n", "\n", "Here are the steps in this notebook:\n", "\n", "1. We'll reload the candidate stars we identified in the previous notebook.\n", "\n", "2. 
Then we'll run a query on the Gaia server that uploads the table of candidates and uses a `JOIN` operation to select photometry data for the candidate stars.\n", "\n", "3. We'll write the results to a file for use in the next notebook.\n", "\n", "After completing this lesson, you should be able to\n", "\n", "* Upload a table to the Gaia server.\n", "\n", "* Write ADQL queries involving `JOIN` operations." ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "## Installing libraries\n", "\n", "If you are running this notebook on Colab, you can run the following cell to install the libraries we'll use.\n", "\n", "If you are running this notebook on your own computer, you might have to install these libraries yourself. See the instructions in the preface." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# If we're running on Colab, install libraries\n", "\n", "import sys\n", "IN_COLAB = 'google.colab' in sys.modules\n", "\n", "if IN_COLAB:\n", " !pip install astroquery wget" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reloading the data\n", "\n", "The following cell downloads the data from the previous notebook." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os\n", "from wget import download\n", "\n", "filename = 'gd1_candidates.hdf5'\n", "path = 'https://github.com/AllenDowney/AstronomicalData/raw/main/data/'\n", "\n", "if not os.path.exists(filename):\n", " print(download(path+filename))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can read it back." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "candidate_df = pd.read_hdf(filename, 'candidate_df')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`candidate_df` is the Pandas `DataFrame` that contains results from the query in the previous notebook, which selects stars likely to be in GD-1 based on proper motion. It also includes position and proper motion transformed to the GD-1 frame." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "x = candidate_df['phi1']\n", "y = candidate_df['phi2']\n", "\n", "plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)\n", "\n", "plt.xlabel('ra (degree GD1)')\n", "plt.ylabel('dec (degree GD1)');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the same figure we saw at the end of the previous notebook. GD-1 is visible against the background stars, but we will be able to see it more clearly after selecting based on photometry data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting photometry data\n", "\n", "The Gaia dataset contains some photometry data, including the variable `bp_rp`, which we used in the original query to select stars with BP - RP color between -0.75 and 2.\n", "\n", "Selecting stars with `bp-rp` less than 2 excludes many class M dwarf stars, which are low temperature, low luminosity. A star like that at GD-1's distance would be hard to detect, so if it is detected, it it more likely to be in the foreground.\n", "\n", "Now, to select stars with the age and metal richness we expect in GD-1, we will use `g - i` color and apparent `g`-band magnitude, which are available from the Pan-STARRS survey." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conveniently, the Gaia server provides data from Pan-STARRS as a table in the same database we have been using, so we can access it by making ADQL queries.\n", "\n", "In general, choosing a star from the Gaia catalog and finding the corresponding star in the Pan-STARRS catalog is not easy. This kind of cross matching is not always possible, because a star might appear in one catalog and not the other. 
And even when both stars are present, there might not be a clear one-to-one relationship between stars in the two catalogs.\n", "\n", "Fortunately, smart people have worked on this problem, and the Gaia database includes cross-matching tables that suggest a best neighbor in the Pan-STARRS catalog for many stars in the Gaia catalog.\n", "\n", "[This document describes the cross matching process](https://gea.esac.esa.int/archive/documentation/GDR2/Catalogue_consolidation/chap_cu9val_cu9val/ssec_cu9xma/sssec_cu9xma_extcat.html). Briefly, it uses a cone search to find possible matches in approximately the right position, then uses attributes like color and magnitude to choose pairs of observations most likely to be the same star." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joining tables\n", "\n", "So the hard part of cross-matching has been done for us. Using the results is a little tricky, but it gives us a chance to learn about one of the most important tools for working with databases: \"joining\" tables.\n", "\n", "In general, a \"join\" is an operation where you match up records from one table with records from another table using as a \"key\" a piece of information that is common to both tables, usually some kind of ID code.\n", "\n", "In this example:\n", "\n", "* Stars in the Gaia dataset are identified by `source_id`.\n", "\n", "* Stars in the Pan-STARRS dataset are identified by `obj_id`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each candidate star we have selected so far, we have the `source_id`; the goal is to find the `obj_id` for the same star (we hope) in the Pan-STARRS catalog.\n", "\n", "To do that we will:\n", "\n", "1. Make a table that contains the `source_id` for each candidate star and upload the table to the Gaia server;\n", "\n", "2. 
Use the `JOIN` operator to look up each `source_id` in the `gaiadr2.panstarrs1_best_neighbour` table, which contains the `obj_id` of the best match for each star in the Gaia catalog; then\n", "\n", "3. Use the `JOIN` operator again to look up each `obj_id` in the `panstarrs1_original_valid` table, which contains the Pan-STARRS photometry data we want.\n", "\n", "Let's start with the first step, uploading a table." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing a table for uploading\n", "\n", "For each candidate star, we want to find the corresponding row in the `gaiadr2.panstarrs1_best_neighbour` table.\n", "\n", "In order to do that, we have to:\n", "\n", "1. Write the table in a local file as an XML VOTable, which is a format suitable for transmitting a table over a network.\n", "\n", "2. Write an ADQL query that refers to the uploaded table.\n", "\n", "3. Change the way we submit the job so it uploads the table before running the query." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first step is not too difficult because Astropy provides a function called `writeto` that can write a `Table` in `XML`.\n", "\n", "[The documentation of this process is here](https://docs.astropy.org/en/stable/io/votable/).\n", "\n", "First we have to convert our Pandas `DataFrame` to an Astropy `Table`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "astropy.table.table.Table" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from astropy.table import Table\n", "\n", "candidate_table = Table.from_pandas(candidate_df)\n", "type(candidate_table)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To write the file, we can use `Table.write` with `format='votable'`, [as described here](https://docs.astropy.org/en/stable/io/unified.html#vo-tables)." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "table_id = candidate_table[['source_id']]\n", "table_id.write('candidate_df.xml', format='votable', overwrite=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that we select a single column from the table, `source_id`.\n", "We could write the entire table to a file, but that would take longer to transmit over the network, and we really only need one column.\n", "\n", "This process, taking a structure like a `Table` and translating it into a form that can be transmitted over a network, is called [serialization](https://en.wikipedia.org/wiki/Serialization).\n", "\n", "XML is one of the most common serialization formats. One nice feature is that XML data is plain text, as opposed to binary digits, so you can read the file we just wrote:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "\r\n", "\r\n", " \r\n", " \r\n", " \r\n", " \r\n", " \r\n", " \r\n" ] } ], "source": [ "!head candidate_df.xml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "XML is a general format, so different XML files contain different kinds of data. In order to read an XML file, it's not enough to know that it's XML; you also have to know the data format, which is called a [schema](https://en.wikipedia.org/wiki/XML_schema).\n", "\n", "In this example, the schema is VOTable; notice that one of the first tags in the file specifies the schema, and even includes the URL where you can get its definition.\n", "\n", "So this is an example of a self-documenting format.\n", "\n", "A drawback of XML is that it tends to be big, which is why we wrote just the `source_id` column rather than the whole table.\n", "The size of the file is about 400 KB, so that's not too bad." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 downey downey 396K Dec 29 11:50 candidate_df.xml\r\n" ] } ], "source": [ "!ls -lh candidate_df.xml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are using Windows, `ls` might not work; in that case, try:\n", "\n", "```\n", "!dir candidate_df.xml\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "There's a gotcha here we want to warn you about. Why do you think we used double brackets to specify the column we wanted? What happens if you use single brackets?\n", "\n", "Run these code snippets to find out.\n", "\n", "```\n", "table_id = candidate_table[['source_id']]\n", "print(type(table_id))\n", "```\n", "\n", "```\n", "column = candidate_table['source_id']\n", "print(type(column))\n", "```\n", "\n", "```\n", "# This one should cause an error\n", "column.write('candidate_df.xml', \n", " format='votable', \n", " overwrite=True)\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "# table_id is a Table\n", "\n", "# column is a Column\n", "\n", "# Column does not provide `write`, so you get an AttributeError" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Uploading a table\n", "\n", "The next step is to upload this table to the Gaia server and use it as part of a query.\n", "\n", "[Here's the documentation that explains how to run a query with an uploaded table](https://astroquery.readthedocs.io/en/latest/gaia/gaia.html#synchronous-query-on-an-on-the-fly-uploaded-table).\n", "\n", "In the spirit of incremental development and testing, let's start with the simplest possible query." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"SELECT *\n", "FROM tap_upload.candidate_df\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This query downloads all rows and all columns from the uploaded table. The name of the table has two parts: `tap_upload` specifies a table that was uploaded using TAP+ (remember that's the name of the protocol we're using to talk to the Gaia server).\n", "\n", "And `candidate_df` is the name of the table, which we get to choose (unlike `tap_upload`, which we didn't get to choose).\n", "\n", "Here's how we run the query:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created TAP+ (v1.2.1) - Connection:\n", "\tHost: gea.esac.esa.int\n", "\tUse HTTPS: True\n", "\tPort: 443\n", "\tSSL Port: 443\n", "Created TAP+ (v1.2.1) - Connection:\n", "\tHost: geadata.esac.esa.int\n", "\tUse HTTPS: True\n", "\tPort: 443\n", "\tSSL Port: 443\n", "INFO: Query finished. [astroquery.utils.tap.core]\n" ] } ], "source": [ "from astroquery.gaia import Gaia\n", "\n", "job = Gaia.launch_job_async(query=query, \n", " upload_resource='candidate_df.xml', \n", " upload_table_name='candidate_df')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`upload_resource` specifies the name of the file we want to upload, which is the file we just wrote.\n", "\n", "`upload_table_name` is the name we assign to this table, which is the name we used in the query.\n", "\n", "And here are the results:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "Table length=7346\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
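<table>\n",
"<thead><tr><th>source_id</th></tr></thead>\n",
"<thead><tr><th>int64</th></tr></thead>\n",
"<tr><td>635559124339440000</td></tr>\n",
"<tr><td>635860218726658176</td></tr>\n",
"<tr><td>635674126383965568</td></tr>\n",
"<tr><td>635535454774983040</td></tr>\n",
"<tr><td>635497276810313600</td></tr>\n",
"<tr><td>635614168640132864</td></tr>\n",
"<tr><td>635821843194387840</td></tr>\n",
"<tr><td>635551706931167104</td></tr>\n",
"<tr><td>635518889086133376</td></tr>\n",
"<tr><td>635580294233854464</td></tr>\n",
"<tr><td>...</td></tr>\n",
"<tr><td>612282738058264960</td></tr>\n",
"<tr><td>612485911486166656</td></tr>\n",
"<tr><td>612386332668697600</td></tr>\n",
"<tr><td>612296172717818624</td></tr>\n",
"<tr><td>612250375480101760</td></tr>\n",
"<tr><td>612394926899159168</td></tr>\n",
"<tr><td>612288854091187712</td></tr>\n",
"<tr><td>612428870024913152</td></tr>\n",
"<tr><td>612256418500423168</td></tr>\n",
"<tr><td>612429144902815104</td></tr>\n",
"</table>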
" ], "text/plain": [ "\n", " source_id \n", " int64 \n", "------------------\n", "635559124339440000\n", "635860218726658176\n", "635674126383965568\n", "635535454774983040\n", "635497276810313600\n", "635614168640132864\n", "635821843194387840\n", "635551706931167104\n", "635518889086133376\n", "635580294233854464\n", " ...\n", "612282738058264960\n", "612485911486166656\n", "612386332668697600\n", "612296172717818624\n", "612250375480101760\n", "612394926899159168\n", "612288854091187712\n", "612428870024913152\n", "612256418500423168\n", "612429144902815104" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results = job.get_results()\n", "results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If things go according to plan, the result should contain the same rows and columns as the uploaded table." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(7346, 7346)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(table_id), len(results)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['source_id']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table_id.colnames" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['source_id']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results.colnames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we uploaded a table and then downloaded it again, so that's not too useful.\n", "\n", "But now that we can upload a table, we can join it with other tables on the Gaia server." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joining with an uploaded table\n", "\n", "Here's the first example of a query that contains a `JOIN` clause." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "query1 = \"\"\"SELECT *\n", "FROM gaiadr2.panstarrs1_best_neighbour as best\n", "JOIN tap_upload.candidate_df as candidate_df\n", " ON best.source_id = candidate_df.source_id\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's break that down one clause at a time:\n", "\n", "* `SELECT *` means we will download all columns from both tables.\n", "\n", "* `FROM gaiadr2.panstarrs1_best_neighbour as best` means that we'll get the columns from the Pan-STARRS best neighbor table, which we'll refer to using the short name `best`.\n", "\n", "* `JOIN tap_upload.candidate_df as candidate_df` means that we'll also get columns from the uploaded table, which we'll refer to using the short name `candidate_df`.\n", "\n", "* `ON best.source_id = candidate_df.source_id` specifies that we will use `source_id` to match up the rows from the two tables.\n", "\n", "Here's the [documentation of the best neighbor table](https://gea.esac.esa.int/archive/documentation/GDR2/Gaia_archive/chap_datamodel/sec_dm_crossmatches/ssec_dm_panstarrs1_best_neighbour.html).\n", "\n", "Let's run the query:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO: Query finished. [astroquery.utils.tap.core]\n" ] } ], "source": [ "job1 = Gaia.launch_job_async(query=query1, \n", " upload_resource='candidate_df.xml', \n", " upload_table_name='candidate_df')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And get the results." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "Table length=3724\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
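<table>\n",
"<thead><tr><th>source_id</th><th>original_ext_source_id</th><th>angular_distance</th><th>number_of_neighbours</th><th>number_of_mates</th><th>best_neighbour_multiplicity</th><th>gaia_astrometric_params</th><th>source_id_2</th></tr></thead>\n",
"<thead><tr><th></th><th></th><th>arcsec</th><th></th><th></th><th></th><th></th><th></th></tr></thead>\n",
"<thead><tr><th>int64</th><th>int64</th><th>float64</th><th>int32</th><th>int16</th><th>int16</th><th>int16</th><th>int64</th></tr></thead>\n",
"<tr><td>635860218726658176</td><td>130911385187671349</td><td>0.053667035895467084</td><td>1</td><td>0</td><td>1</td><td>5</td><td>635860218726658176</td></tr>\n",
"<tr><td>635674126383965568</td><td>130831388428488720</td><td>0.038810268141577516</td><td>1</td><td>0</td><td>1</td><td>5</td><td>635674126383965568</td></tr>\n",
"<tr><td>635535454774983040</td><td>130631378377657369</td><td>0.034323028828991076</td><td>1</td><td>0</td><td>1</td><td>5</td><td>635535454774983040</td></tr>\n",
"<tr><td>635497276810313600</td><td>130811380445631930</td><td>0.04720255413250006</td><td>1</td><td>0</td><td>1</td><td>5</td><td>635497276810313600</td></tr>\n",
"<tr><td>635614168640132864</td><td>130571395922140135</td><td>0.020304189709964143</td><td>1</td><td>0</td><td>1</td><td>5</td><td>635614168640132864</td></tr>\n",
"<tr><td>635598607974369792</td><td>130341392091279513</td><td>0.036524626853403054</td><td>1</td><td>0</td><td>1</td><td>5</td><td>635598607974369792</td></tr>\n",
"<tr><td>635737661835496576</td><td>131001399333502136</td><td>0.036626827820716606</td><td>1</td><td>0</td><td>1</td><td>5</td><td>635737661835496576</td></tr>\n",
"<tr><td>635850945892748672</td><td>132011398654934147</td><td>0.021178742393378396</td><td>1</td><td>0</td><td>1</td><td>5</td><td>635850945892748672</td></tr>\n",
"<tr><td>635600532119713664</td><td>130421392285893623</td><td>0.04518820915043015</td><td>1</td><td>0</td><td>1</td><td>5</td><td>635600532119713664</td></tr>\n",
"<tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><td>612241781249124608</td><td>129751343755995561</td><td>0.04235715830001815</td><td>1</td><td>0</td><td>1</td><td>5</td><td>612241781249124608</td></tr>\n",
"<tr><td>612332147361443072</td><td>130141341458538777</td><td>0.02265249859012977</td><td>1</td><td>0</td><td>1</td><td>5</td><td>612332147361443072</td></tr>\n",
"<tr><td>612426744016802432</td><td>130521346852465656</td><td>0.03247653009961843</td><td>1</td><td>0</td><td>1</td><td>5</td><td>612426744016802432</td></tr>\n",
"<tr><td>612331739340341760</td><td>130111341217793839</td><td>0.036064240818025735</td><td>1</td><td>0</td><td>1</td><td>5</td><td>612331739340341760</td></tr>\n",
"<tr><td>612282738058264960</td><td>129741340445933519</td><td>0.025293237353496898</td><td>1</td><td>0</td><td>1</td><td>5</td><td>612282738058264960</td></tr>\n",
"<tr><td>612386332668697600</td><td>130351354570219774</td><td>0.02010316001403086</td><td>1</td><td>0</td><td>1</td><td>5</td><td>612386332668697600</td></tr>\n",
"<tr><td>612296172717818624</td><td>129691338006168780</td><td>0.051264212025836205</td><td>1</td><td>0</td><td>1</td><td>5</td><td>612296172717818624</td></tr>\n",
"<tr><td>612250375480101760</td><td>129741346475897464</td><td>0.031783740347530905</td><td>1</td><td>0</td><td>1</td><td>5</td><td>612250375480101760</td></tr>\n",
"<tr><td>612394926899159168</td><td>130581355199751795</td><td>0.04019174830546698</td><td>1</td><td>0</td><td>1</td><td>5</td><td>612394926899159168</td></tr>\n",
"<tr><td>612256418500423168</td><td>129931349075297310</td><td>0.009242789669513156</td><td>1</td><td>0</td><td>1</td><td>5</td><td>612256418500423168</td></tr>\n",
"</table>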
" ], "text/plain": [ "\n", " source_id original_ext_source_id ... source_id_2 \n", " ... \n", " int64 int64 ... int64 \n", "------------------ ---------------------- ... ------------------\n", "635860218726658176 130911385187671349 ... 635860218726658176\n", "635674126383965568 130831388428488720 ... 635674126383965568\n", "635535454774983040 130631378377657369 ... 635535454774983040\n", "635497276810313600 130811380445631930 ... 635497276810313600\n", "635614168640132864 130571395922140135 ... 635614168640132864\n", "635598607974369792 130341392091279513 ... 635598607974369792\n", "635737661835496576 131001399333502136 ... 635737661835496576\n", "635850945892748672 132011398654934147 ... 635850945892748672\n", "635600532119713664 130421392285893623 ... 635600532119713664\n", " ... ... ... ...\n", "612241781249124608 129751343755995561 ... 612241781249124608\n", "612332147361443072 130141341458538777 ... 612332147361443072\n", "612426744016802432 130521346852465656 ... 612426744016802432\n", "612331739340341760 130111341217793839 ... 612331739340341760\n", "612282738058264960 129741340445933519 ... 612282738058264960\n", "612386332668697600 130351354570219774 ... 612386332668697600\n", "612296172717818624 129691338006168780 ... 612296172717818624\n", "612250375480101760 129741346475897464 ... 612250375480101760\n", "612394926899159168 130581355199751795 ... 612394926899159168\n", "612256418500423168 129931349075297310 ... 612256418500423168" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results1 = job1.get_results()\n", "results1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This table contains all of the columns from the best neighbor table, plus the single column from the uploaded table." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "['source_id',\n", " 'original_ext_source_id',\n", " 'angular_distance',\n", " 'number_of_neighbours',\n", " 'number_of_mates',\n", " 'best_neighbour_multiplicity',\n", " 'gaia_astrometric_params',\n", " 'source_id_2']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results1.colnames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because one of the column names appears in both tables, the second instance of `source_id` has had the suffix `_2` appended.\n", "\n", "The length of `results1` is 3724, about half of the 7346 candidates we uploaded, which means we were not able to find matches for all stars in the list of candidates." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3724" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(results1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get more information about the matching process, we can inspect `best_neighbour_multiplicity`, which indicates for each star in Gaia how many stars in Pan-STARRS are equally likely matches." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<MaskedColumn name='best_neighbour_multiplicity' dtype='int16' description='Number of neighbours with same probability as best neighbour' length=3724>\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
1
1
1
1
1
1
1
1
1
1
1
1
...
1
1
1
1
1
1
1
1
1
1
1
1
" ], "text/plain": [ "\n", " 1\n", " 1\n", " 1\n", " 1\n", " 1\n", " 1\n", " 1\n", " 1\n", " 1\n", " 1\n", " 1\n", " 1\n", "...\n", " 1\n", " 1\n", " 1\n", " 1\n", " 1\n", " 1\n", " 1\n", " 1\n", " 1\n", " 1\n", " 1\n", " 1" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results1['best_neighbour_multiplicity']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like most of the values are `1`, which is good; that means that for each candidate star we have identified exactly one source in Pan-STARRS that is likely to be the same star.\n", "\n", "To check whether there are any values other than `1`, we can convert this column to a Pandas `Series` and use `describe`, which we saw in in Lesson 3." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 3724.0\n", "mean 1.0\n", "std 0.0\n", "min 1.0\n", "25% 1.0\n", "50% 1.0\n", "75% 1.0\n", "max 1.0\n", "dtype: float64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "multiplicity = pd.Series(results1['best_neighbour_multiplicity'])\n", "multiplicity.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In fact, `1` is the only value in the `Series`, so every candidate star has a single best match.\n", "\n", "Similarly, `number_of_mates` indicates the number of *other* stars in Gaia that match with the same star in Pan-STARRS." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 3724.0\n", "mean 0.0\n", "std 0.0\n", "min 0.0\n", "25% 0.0\n", "50% 0.0\n", "75% 0.0\n", "max 0.0\n", "dtype: float64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mates = pd.Series(results1['number_of_mates'])\n", "mates.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All values in this column are `0`, which means that for each match we found in Pan-STARRS, there are no other stars in Gaia that also match.\n", "\n", "**Detail:** The table also contains `number_of_neighbours`, which is the number of stars in Pan-STARRS that match in terms of position, before using other criteria to choose the most likely match." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting the photometry data\n", "\n", "The most important column in `results1` is `original_ext_source_id`, which is the `obj_id` we will use to look up the likely matches in Pan-STARRS to get photometry data.\n", "\n", "The process is similar to what we just did to look up the matches. We will:\n", "\n", "1. Make a table that contains `source_id` and `original_ext_source_id`.\n", "\n", "2. Write the table to an XML VOTable file.\n", "\n", "3. Write a query that joins the uploaded table with `gaiadr2.panstarrs1_original_valid` and selects the photometry data we want.\n", "\n", "4. Run the query using the uploaded table.\n", "\n", "Since we've done everything here before, we'll do these steps as an exercise." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Select `source_id` and `original_ext_source_id` from `results1` and write the resulting table as a file named `external.xml`." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": { "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "table_ext = results1[['source_id', 'original_ext_source_id']]\n", "table_ext.write('external.xml', format='votable', overwrite=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `!head` to confirm that the file exists and contains an XML VOTable." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "\r\n", "\r\n", " \r\n", " \r\n", " \r\n", " \r\n", " Unique Gaia source identifier\r\n", " \r\n" ] } ], "source": [ "!head external.xml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Read [the documentation of the Pan-STARRS table](https://gea.esac.esa.int/archive/documentation/GDR2/Gaia_archive/chap_datamodel/sec_dm_external_catalogues/ssec_dm_panstarrs1_original_valid.html) and make note of `obj_id`, which contains the object IDs we'll use to find the rows we want.\n", "\n", "Write a query that uses each value of `original_ext_source_id` from the uploaded table to find a row in `gaiadr2.panstarrs1_original_valid` with the same value in `obj_id`, and select all columns from both tables.\n", "\n", "Suggestion: Develop and test your query incrementally. For example:\n", "\n", "1. Write a query that downloads all columns from the uploaded table. Test to make sure we can read the uploaded table.\n", "\n", "2. Write a query that downloads the first 10 rows from `gaiadr2.panstarrs1_original_valid`. Test to make sure we can access Pan-STARRS data.\n", "\n", "3. Write a query that joins the two tables and selects all columns. 
Test that the join works as expected.\n", "\n", "\n", "As a bonus exercise, write a query that joins the two tables and selects just the columns we need:\n", "\n", "* `source_id` from the uploaded table\n", "\n", "* `g_mean_psf_mag` from `gaiadr2.panstarrs1_original_valid`\n", "\n", "* `i_mean_psf_mag` from `gaiadr2.panstarrs1_original_valid`\n", "\n", "Hint: When you select a column from a join, you have to specify which table the column is in." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "# First test\n", "\n", "query2 = \"\"\"SELECT *\n", "FROM tap_upload.external as external\n", "\"\"\"\n", "\n", "# Second test\n", "\n", "query2 = \"\"\"SELECT TOP 10 *\n", "FROM gaiadr2.panstarrs1_original_valid\n", "\"\"\"\n", "\n", "# Third test\n", "\n", "query2 = \"\"\"SELECT *\n", "FROM gaiadr2.panstarrs1_original_valid as ps\n", "JOIN tap_upload.external as external\n", " ON ps.obj_id = external.original_ext_source_id\n", "\"\"\"\n", "\n", "# Complete query\n", "\n", "query2 = \"\"\"SELECT\n", "external.source_id, ps.g_mean_psf_mag, ps.i_mean_psf_mag\n", "FROM gaiadr2.panstarrs1_original_valid as ps\n", "JOIN tap_upload.external as external\n", " ON ps.obj_id = external.original_ext_source_id\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's how we launch the job and get the results." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO: Query finished. [astroquery.utils.tap.core]\n" ] } ], "source": [ "job2 = Gaia.launch_job_async(query=query2, \n", " upload_resource='external.xml', \n", " upload_table_name='external')" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "Table length=3724\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
source_id g_mean_psf_mag i_mean_psf_mag
mag
int64 float64 float64
635860218726658176 17.8978004455566 17.5174007415771
635674126383965568 19.2873001098633 17.6781005859375
635535454774983040 16.9237995147705 16.478099822998
635497276810313600 19.9242000579834 18.3339996337891
635614168640132864 16.1515998840332 14.6662998199463
635598607974369792 16.5223999023438 16.1375007629395
635737661835496576 14.5032997131348 13.9849004745483
635850945892748672 16.5174999237061 16.0450000762939
635600532119713664 20.4505996704102 19.5177001953125
... ... ...
612241781249124608 20.2343997955322 18.6518001556396
612332147361443072 21.3848991394043 20.3076000213623
612426744016802432 17.8281002044678 17.4281005859375
612331739340341760 21.8656997680664 19.5223007202148
612282738058264960 22.5151996612549 19.9743995666504
612386332668697600 19.3792991638184 17.9923000335693
612296172717818624 17.4944000244141 16.926700592041
612250375480101760 15.3330001831055 14.6280002593994
612394926899159168 16.4414005279541 15.8212003707886
612256418500423168 20.8715991973877 19.9612007141113
" ], "text/plain": [ "\n", " source_id g_mean_psf_mag i_mean_psf_mag \n", " mag \n", " int64 float64 float64 \n", "------------------ ---------------- ----------------\n", "635860218726658176 17.8978004455566 17.5174007415771\n", "635674126383965568 19.2873001098633 17.6781005859375\n", "635535454774983040 16.9237995147705 16.478099822998\n", "635497276810313600 19.9242000579834 18.3339996337891\n", "635614168640132864 16.1515998840332 14.6662998199463\n", "635598607974369792 16.5223999023438 16.1375007629395\n", "635737661835496576 14.5032997131348 13.9849004745483\n", "635850945892748672 16.5174999237061 16.0450000762939\n", "635600532119713664 20.4505996704102 19.5177001953125\n", " ... ... ...\n", "612241781249124608 20.2343997955322 18.6518001556396\n", "612332147361443072 21.3848991394043 20.3076000213623\n", "612426744016802432 17.8281002044678 17.4281005859375\n", "612331739340341760 21.8656997680664 19.5223007202148\n", "612282738058264960 22.5151996612549 19.9743995666504\n", "612386332668697600 19.3792991638184 17.9923000335693\n", "612296172717818624 17.4944000244141 16.926700592041\n", "612250375480101760 15.3330001831055 14.6280002593994\n", "612394926899159168 16.4414005279541 15.8212003707886\n", "612256418500423168 20.8715991973877 19.9612007141113" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results2 = job2.get_results()\n", "results2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Optional Challenge: Do both joins in one query.\n", "\n", "There's an [example here](https://github.com/smoh/Getting-started-with-Gaia/blob/master/gaia-adql-snippets.md) you could start with." 
] }, { "cell_type": "code", "execution_count": 30, "metadata": { "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "query3 = \"\"\"SELECT\n", "candidate_df.source_id, ps.g_mean_psf_mag, ps.i_mean_psf_mag\n", "FROM tap_upload.candidate_df as candidate_df\n", "JOIN gaiadr2.panstarrs1_best_neighbour as best\n", " ON best.source_id = candidate_df.source_id\n", "JOIN gaiadr2.panstarrs1_original_valid as ps\n", " ON ps.obj_id = best.original_ext_source_id\n", "\"\"\"\n", "\n", "# job3 = Gaia.launch_job_async(query=query3, \n", "# upload_resource='candidate_df.xml', \n", "# upload_table_name='candidate_df')\n", "\n", "# results3 = job3.get_results()\n", "# results3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Write the data\n", "\n", "Since we have the data in an Astropy `Table`, let's store it in a FITS file." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "filename = 'gd1_photo.fits'\n", "results2.write(filename, overwrite=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check that the file exists, and see how big it is." 
] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 downey downey 96K Dec 29 11:51 gd1_photo.fits\r\n" ] } ], "source": [ "!ls -lh gd1_photo.fits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At around 96 KB, it is smaller than some of the other files we've been working with.\n", "\n", "If you are using Windows, `ls` might not work; in that case, try:\n", "\n", "```\n", "!dir gd1_photo.fits\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "In this notebook, we used database `JOIN` operations to select photometry data for the stars we've identified as candidates to be in GD-1.\n", "\n", "In the next notebook, we'll use this data for a second round of selection, identifying stars that have photometry data consistent with GD-1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Best practice\n", "\n", "* Use `JOIN` operations to combine data from multiple tables in a database, using some kind of identifier to match up records from one table with records from another.\n", "\n", "* This is another example of a practice we saw in the previous notebook, moving the computation to the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 4 }