{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Queries\n", "\n", "This is the first in a series of lessons about working with astronomical data.\n", "\n", "As a running example, we will replicate parts of the analysis in a recent paper, \"[Off the beaten path: Gaia reveals GD-1 stars outside of the main stream](https://arxiv.org/abs/1805.00425)\" by Adrian Price-Whelan and Ana Bonaca." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outline\n", "\n", "This lesson demonstrates the steps for selecting and downloading data from the Gaia Database:\n", "\n", "1. First we'll make a connection to the Gaia server,\n", "\n", "2. We will explore information about the database and the tables it contains,\n", "\n", "3. We will write a query and send it to the server, and finally\n", "\n", "4. We will download the response from the server.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Query Language\n", "\n", "In order to select data from a database, you have to compose a query, which is a program written in a \"query language\".\n", "The query language we'll use is ADQL, which stands for \"Astronomical Data Query Language\".\n", "\n", "ADQL is a dialect of [SQL](https://en.wikipedia.org/wiki/SQL) (Structured Query Language), which is by far the most commonly used query language. Almost everything you will learn about ADQL also works in SQL.\n", "\n", "[The reference manual for ADQL is here](http://www.ivoa.net/documents/ADQL/20180112/PR-ADQL-2.1-20180112.html).\n", "But you might find it easier to learn from [this ADQL Cookbook](https://www.gaia.ac.uk/data/gaia-data-release-1/adql-cookbook)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Jupyter\n", "\n", "If you have not worked with Jupyter notebooks before, you might start with [the tutorial from Jupyter.org called \"Try Classic Notebook\"](https://jupyter.org/try), or [this tutorial from DataQuest](https://www.dataquest.io/blog/jupyter-notebook-tutorial/).\n", "\n", "There are two environments you can use to write and run notebooks: \n", "\n", "* \"Jupyter Notebook\" is the original, and\n", "\n", "* \"Jupyter Lab\" is a newer environment with more features.\n", "\n", "For these lessons, you can use either one." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are too impatient for the tutorials, here are the most important things to know:\n", "\n", "1. Notebooks are made up of code cells and text cells (and a few other less common kinds). Code cells contain code; text cells, like this one, contain explanatory text written in [Markdown](https://www.markdownguide.org/).\n", "\n", "2. To run a code cell, click the cell to select it and press Shift-Enter. The output of the code should appear below the cell.\n", "\n", "3. In general, notebooks only run correctly if you run every code cell in order from top to bottom. If you run cells out of order, you are likely to get errors.\n", "\n", "4. You can modify existing cells, but then you have to run them again to see the effect.\n", "\n", "5. You can add new cells, but again, you have to be careful about the order you run them in.\n", "\n", "6. 
If you have added or modified cells and the behavior of the notebook seems strange, you can restart the \"kernel\", which clears all of the variables and functions you have defined, and run the cells again from the beginning.\n", "\n", "* If you are using Jupyter notebook, open the `Kernel` menu and select \"Restart and Run All\".\n", "\n", "* In Jupyter Lab, open the `Kernel` menu and select \"Restart Kernel and Run All Cells\"\n", "\n", "* In Colab, open the `Runtime` menu and select \"Restart and run all\"\n", "\n", "Before you go on, you might want to explore the other menus and the toolbar to see what else you can do." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Installing libraries\n", "\n", "If you are running this notebook on Colab, you should run the following cell to install the libraries we'll need.\n", "\n", "If you are running this notebook on your own computer, you might have to install these libraries yourself." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [], "source": [ "# If we're running on Colab, install libraries\n", "\n", "import sys\n", "IN_COLAB = 'google.colab' in sys.modules\n", "\n", "if IN_COLAB:\n", " !pip install astroquery" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Connecting to Gaia\n", "\n", "The library we'll use to get Gaia data is [Astroquery](https://astroquery.readthedocs.io/en/latest/).\n", "Astroquery provides `Gaia`, which is an [object that represents a connection to the Gaia database](https://astroquery.readthedocs.io/en/latest/gaia/gaia.html).\n", "\n", "We can connect to the Gaia database like this:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from astroquery.gaia import Gaia" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This import statement creates a [TAP+](http://www.ivoa.net/documents/TAP/) connection; TAP stands for \"Table Access Protocol\", which is a 
network protocol for sending queries to the database and getting back the results. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Databases and Tables\n", "\n", "What is a database, anyway? Most generally, it can be any collection of data, but when we are talking about ADQL or SQL:\n", "\n", "* A database is a collection of one or more named tables.\n", "\n", "* Each table is a 2-D array with one or more named columns of data.\n", "\n", "We can use `Gaia.load_tables` to get the names of the tables in the Gaia database. With the option `only_names=True`, it loads information about the tables, called \"metadata\", not the data itself." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "tables = Gaia.load_tables(only_names=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following `for` loop prints the names of the tables." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [], "source": [ "for table in tables:\n", " print(table.name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So that's a lot of tables. The ones we'll use are:\n", "\n", "* `gaiadr2.gaia_source`, which contains Gaia data from [data release 2](https://www.cosmos.esa.int/web/gaia/data-release-2),\n", "\n", "* `gaiadr2.panstarrs1_original_valid`, which contains the photometry data we'll use from PanSTARRS, and\n", "\n", "* `gaiadr2.panstarrs1_best_neighbour`, which we'll use to cross-match each star observed by Gaia with the same star observed by PanSTARRS.\n", "\n", "We can use `load_table` (not `load_tables`) to get the metadata for a single table. The name of this function is misleading, because it only downloads metadata, not the contents of the table." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "meta = Gaia.load_table('gaiadr2.gaia_source')\n", "meta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Jupyter shows that the result is an object of type `TapTableMeta`, but it does not display the contents.\n", "\n", "To see the metadata, we have to print the object." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "print(meta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Columns\n", "\n", "The following loop prints the names of the columns in the table." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [], "source": [ "for column in meta.columns:\n", " print(column.name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can probably infer what many of these columns are by looking at the names, but you should resist the temptation to guess.\n", "To find out what the columns mean, [read the documentation](https://gea.esac.esa.int/archive/documentation/GDR2/Gaia_archive/chap_datamodel/sec_dm_main_tables/ssec_dm_gaia_source.html).\n", "\n", "If you want to know what can go wrong when you don't read the documentation, [you might like this article](https://www.vox.com/future-perfect/2019/6/4/18650969/married-women-miserable-fake-paul-dolan-happiness)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "One of the other tables we'll use is `gaiadr2.panstarrs1_original_valid`. Use `load_table` to get the metadata for this table. How many columns are there and what are their names?" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing queries\n", "\n", "By now you might be wondering how we download these tables. With tables this big, you generally don't. 
Instead, you use queries to select only the data you want.\n", "\n", "A query is a string written in a query language like SQL; for the Gaia database, the query language is a dialect of SQL called ADQL.\n", "\n", "Here's an example of an ADQL query." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "query1 = \"\"\"SELECT \n", "TOP 10\n", "source_id, ra, dec, parallax \n", "FROM gaiadr2.gaia_source\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Python note:** We use a [triple-quoted string](https://docs.python.org/3/tutorial/introduction.html#strings) here so we can include line breaks in the query, which makes it easier to read.\n", "\n", "The words in uppercase are ADQL keywords:\n", "\n", "* `SELECT` indicates that we are selecting data (as opposed to adding or modifying data).\n", "\n", "* `TOP` indicates that we only want the first 10 rows of the table, which is useful for testing a query before asking for all of the data.\n", "\n", "* `FROM` specifies which table we want data from.\n", "\n", "The third line is a list of column names, indicating which columns we want. \n", "\n", "In this example, the keywords are capitalized and the column names are lowercase. This is a common style, but it is not required. ADQL and SQL are not case-sensitive.\n", "\n", "Also, the query is broken into multiple lines to make it more readable. This is a common style, but not required. Line breaks don't affect the behavior of the query." 
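, "\n",
"\n",
"To see what that means in practice, the sketch below compares `query1` with a hypothetical one-line, lowercase version called `query1_compact`. The helper `normalize` is made up for this illustration and is not part of ADQL or Astroquery; it only collapses whitespace and lowercases the text, so it is a rough check that would not be safe for queries containing quoted strings.\n",
"\n",
"```python\n",
"# The query from the lesson, written in the usual multi-line, uppercase-keyword style\n",
"query1 = \"\"\"SELECT\n",
"TOP 10\n",
"source_id, ra, dec, parallax\n",
"FROM gaiadr2.gaia_source\n",
"\"\"\"\n",
"\n",
"# Hypothetical variant: same words, lowercase keywords, all on one line\n",
"query1_compact = \"select top 10 source_id, ra, dec, parallax from gaiadr2.gaia_source\"\n",
"\n",
"def normalize(query):\n",
"    \"\"\"Collapse runs of whitespace and lowercase the text (illustration only).\"\"\"\n",
"    return \" \".join(query.split()).lower()\n",
"\n",
"# The two strings differ only in case and whitespace\n",
"print(normalize(query1) == normalize(query1_compact))\n",
"```\n",
"\n",
"The readable version is still the one to keep in a notebook; the compact one is only here to show that the two spellings should be treated as the same query."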
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To run this query, we use the `Gaia` object, which represents our connection to the Gaia database, and invoke `launch_job`:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "job = Gaia.launch_job(query1)\n", "job" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is an object that represents the job running on a Gaia server.\n", "\n", "If you print it, it displays metadata for the forthcoming results." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "print(job)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Don't worry about `Results: None`. That does not actually mean there are no results.\n", "\n", "However, `Phase: COMPLETED` indicates that the job is complete, so we can get the results like this:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "results = job.get_results()\n", "type(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `type` function indicates that the result is an [Astropy Table](https://docs.astropy.org/en/stable/table/).\n", "\n", "**Optional detail:** Why is `table` repeated three times? The first is the name of the module, the second is the name of the submodule, and the third is the name of the class. Most of the time we only care about the last one. It's like the Linnean name for gorilla, which is *Gorilla gorilla gorilla*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An Astropy `Table` is similar to a table in an SQL database except:\n", "\n", "* SQL databases are stored on disk drives, so they are persistent; that is, they \"survive\" even if you turn off the computer. An Astropy `Table` is stored in memory; it disappears when you turn off the computer (or shut down this Jupyter notebook).\n", "\n", "* SQL databases are designed to process queries. 
An Astropy `Table` can perform some query-like operations, like selecting columns and rows. But these operations use Python syntax, not SQL.\n", "\n", "Jupyter knows how to display the contents of a `Table`." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each column has a name, units, and a data type.\n", "\n", "For example, the units of `ra` and `dec` are degrees, and their data type is `float64`, which is a 64-bit [floating-point number](https://en.wikipedia.org/wiki/Floating-point_arithmetic), used to store measurements with a fractional part.\n", "\n", "This information comes from the Gaia database, and has been stored in the Astropy `Table` by Astroquery." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Read [the documentation](https://gea.esac.esa.int/archive/documentation/GDR2/Gaia_archive/chap_datamodel/sec_dm_main_tables/ssec_dm_gaia_source.html) of this table and choose a column that looks interesting to you. Add the column name to the query and run it again. What are the units of the column you selected? What is its data type?" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Asynchronous queries\n", "\n", "`launch_job` asks the server to run the job \"synchronously\", which normally means it runs immediately. But synchronous jobs are limited to 2000 rows. For queries that return more rows, you should run them \"asynchronously\", which means they might take longer to get started.\n", "\n", "If you are not sure how many rows a query will return, you can use the SQL function `COUNT` to find out how many rows are in the result without actually returning them. 
We'll see an example in the next lesson.\n", "\n", "The results of an asynchronous query are stored in a file on the server, so you can start a query and come back later to get the results.\n", "For anonymous users, files are kept for three days.\n", "\n", "As an example, let's try a query that's similar to `query1`, with these changes:\n", "\n", "* It selects the first 3000 rows, so it is bigger than we should run synchronously.\n", "\n", "* It selects two additional columns, `pmra` and `pmdec`, which are proper motions along the axes of `ra` and `dec`.\n", "\n", "* It uses a new keyword, `WHERE`." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "query2 = \"\"\"SELECT \n", "TOP 3000\n", "source_id, ra, dec, pmra, pmdec, parallax\n", "FROM gaiadr2.gaia_source\n", "WHERE parallax < 1\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `WHERE` clause indicates which rows we want; in this case, the query selects only rows \"where\" `parallax` is less than 1. This has the effect of selecting stars with relatively low parallax, which are farther away.\n", "We'll use this clause to exclude nearby stars that are unlikely to be part of GD-1.\n", "\n", "`WHERE` is one of the most common clauses in ADQL/SQL, and one of the most useful, because it allows us to download only the rows we need from the database.\n", "\n", "We use `launch_job_async` to submit an asynchronous query." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "job = Gaia.launch_job_async(query2)\n", "job" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here are the results." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "results = job.get_results()\n", "results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might notice that some values of `parallax` are negative. 
As [this FAQ explains](https://www.cosmos.esa.int/web/gaia/archive-tips#negative%20parallax), \"Negative parallaxes are caused by errors in the observations.\" They have \"no physical meaning,\" but they can be a \"useful diagnostic on the quality of the astrometric solution.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "The clauses in a query have to be in the right order. Go back and change the order of the clauses in `query2` and run it again.\n", "The modified query should fail, but notice that you don't get much useful debugging information.\n", "\n", "For this reason, developing and debugging ADQL queries can be really hard. A few suggestions that might help:\n", "\n", "* Whenever possible, start with a working query, either an example you find online or a query you have used in the past.\n", "\n", "* Make small changes and test each change before you continue.\n", "\n", "* While you are debugging, use `TOP` to limit the number of rows in the result. That will make each test run faster, which reduces your development time. \n", "\n", "* Launching test queries synchronously might make them start faster, too." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Operators\n", "\n", "In a `WHERE` clause, you can use any of the [SQL comparison operators](https://www.w3schools.com/sql/sql_operators.asp); here are the most common ones:\n", "\n", "| Symbol | Operation\n", "|--------| :---\n", "| `>` | greater than\n", "| `<` | less than\n", "| `>=` | greater than or equal\n", "| `<=` | less than or equal\n", "| `=` | equal\n", "| `!=` or `<>` | not equal\n", "\n", "Most of these are the same as Python, but some are not. In particular, notice that the equality operator is `=`, not `==`.\n", "Be careful to keep your Python out of your ADQL!" 
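, "\n",
"\n",
"One way to catch a stray Python `==` before sending a query is a small check like the sketch below. The helper name `uses_python_equality` and the two example clauses are hypothetical; nothing like this is provided by Astroquery.\n",
"\n",
"```python\n",
"def uses_python_equality(query):\n",
"    \"\"\"Return True if the query text contains Python's `==` operator.\"\"\"\n",
"    return \"==\" in query\n",
"\n",
"# ADQL spelling of an equality test\n",
"good_clause = \"WHERE parallax = 1\"\n",
"\n",
"# Python spelling that has leaked into a query string\n",
"bad_clause = \"WHERE parallax == 1\"\n",
"\n",
"print(uses_python_equality(good_clause))\n",
"print(uses_python_equality(bad_clause))\n",
"```\n",
"\n",
"Because this is a plain substring test, it would also flag `==` inside a quoted string, so treat it as a quick sanity check rather than a validator."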
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can combine comparisons using the logical operators:\n", "\n", "* `AND`: true if both comparisons are true\n", "* `OR`: true if either or both comparisons are true\n", "\n", "Finally, you can use `NOT` to invert the result of a comparison. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "[Read about SQL operators here](https://www.w3schools.com/sql/sql_operators.asp) and then modify the previous query to select rows where `bp_rp` is between `-0.75` and `2`." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`bp_rp` contains BP-RP color, which is the difference between two other columns, `phot_bp_mean_mag` and `phot_rp_mean_mag`.\n", "You can [read about this variable here](https://gea.esac.esa.int/archive/documentation/GDR2/Gaia_archive/chap_datamodel/sec_dm_main_tables/ssec_dm_gaia_source.html).\n", "\n", "This [Hertzsprung-Russell diagram](https://sci.esa.int/web/gaia/-/60198-gaia-hertzsprung-russell-diagram) shows the BP-RP color and luminosity of stars in the Gaia catalog (Copyright: ESA/Gaia/DPAC, CC BY-SA 3.0 IGO).\n", "\n", "\n", "\n", "Selecting stars with `bp_rp` less than 2 excludes many [class M dwarf stars](https://xkcd.com/2360/), which are low in temperature and luminosity. A star like that at GD-1's distance would be hard to detect, so if it is detected, it is more likely to be in the foreground." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Formatting queries\n", "\n", "The queries we have written so far are string \"literals\", meaning that the entire string is part of the program.\n", "But writing queries yourself can be slow, repetitive, and error-prone.\n", "\n", "It is often better to write Python code that assembles a query for you. 
One useful tool for that is the [string `format` method](https://www.w3schools.com/python/ref_string_format.asp).\n", "\n", "As an example, we'll divide the previous query into two parts: a list of column names and a \"base\" for the query that contains everything except the column names.\n", "\n", "Here's the list of columns we'll select. " ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "columns = 'source_id, ra, dec, pmra, pmdec, parallax'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here's the base; it's a string that contains at least one format specifier in curly brackets (braces)." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "query3_base = \"\"\"SELECT \n", "TOP 10 \n", "{columns}\n", "FROM gaiadr2.gaia_source\n", "WHERE parallax < 1\n", " AND bp_rp BETWEEN -0.75 AND 2\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This base query contains one format specifier, `{columns}`, which is a placeholder for the list of column names we will provide.\n", "\n", "To assemble the query, we invoke `format` on the base string and provide a keyword argument that assigns a value to `columns`." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "query3 = query3_base.format(columns=columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, the variable that contains the column names and the name in the format specifier are the same.\n", "That's not required, but it is a common style.\n", "\n", "The result is a string with line breaks. If you display it, the line breaks appear as `\\n`." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "query3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But if you print it, the line breaks appear as... line breaks." 
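, "\n",
"\n",
"The difference comes from how Python shows strings: displaying a value uses its `repr`, which writes each line break as an escape sequence, while `print` writes the characters themselves. A minimal sketch with a short stand-in string (`sample` is a hypothetical name, not one of the lesson's queries):\n",
"\n",
"```python\n",
"# A short stand-in for a multi-line query (hypothetical, for illustration)\n",
"sample = \"SELECT\\nTOP 10\"\n",
"\n",
"# repr() gives the form Jupyter uses when a string is the last expression in a cell\n",
"print(repr(sample))\n",
"\n",
"# print() writes the line break itself, so the text spans two lines\n",
"print(sample)\n",
"```"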
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "print(query3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the format specifier has been replaced with the value of `columns`.\n", "\n", "Let's run it and see if it works:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "scrolled": true }, "outputs": [], "source": [ "job = Gaia.launch_job(query3)\n", "print(job)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "results = job.get_results()\n", "results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Good so far." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "This query always selects sources with `parallax` less than 1. But suppose you want to take that upper bound as an input.\n", "\n", "Modify `query3_base` to replace `1` with a format specifier like `{max_parallax}`. Now, when you call `format`, add a keyword argument that assigns a value to `max_parallax`, and confirm that the format specifier gets replaced with the value you provide." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "This notebook demonstrates the following steps:\n", "\n", "1. Making a connection to the Gaia server,\n", "\n", "2. Exploring information about the database and the tables it contains,\n", "\n", "3. Writing a query and sending it to the server, and finally\n", "\n", "4. Downloading the response from the server as an Astropy `Table`.\n", "\n", "In the next lesson we will extend these queries to select a particular region of the sky." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Best practices\n", "\n", "* If you can't download an entire dataset (or it's not practical) use queries to select the data you need.\n", "\n", "* Read the metadata and the documentation to make sure you understand the tables, their columns, and what they mean.\n", "\n", "* Develop queries incrementally: start with something simple, test it, and add a little bit at a time.\n", "\n", "* Use ADQL features like `TOP` and `COUNT` to test before you run a query that might return a lot of data.\n", "\n", "* If you know your query will return fewer than 2000 rows, you can run it synchronously, which might complete faster. If it might return more than 2000 rows, you should run it asynchronously.\n", "\n", "* ADQL and SQL are not case-sensitive, so you don't have to capitalize the keywords, but you should.\n", "\n", "* ADQL and SQL don't require you to break a query into multiple lines, but you should." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Jupyter notebooks can be good for developing and testing code, but they have some drawbacks. In particular, if you run the cells out of order, you might find that variables don't have the values you expect.\n", "\n", "To mitigate these problems:\n", "\n", "* Make each section of the notebook self-contained. Try not to use the same variable name in more than one section.\n", "\n", "* Keep notebooks short. Look for places where you can break your analysis into phases with one notebook per phase." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 2 }