# 5. Joining Tables

This is the fifth in a series of notebooks related to astronomy data.

As a continuing example, we will replicate part of the analysis in a recent paper, "[Off the beaten path: Gaia reveals GD-1 stars outside of the main stream](https://arxiv.org/abs/1805.00425)" by Adrian M. Price-Whelan and Ana Bonaca.

Picking up where we left off, the next step in the analysis is to select candidate stars based on photometry data.
The following figure from the paper is a color-magnitude diagram for the stars selected based on proper motion:



In red is a [stellar isochrone](https://en.wikipedia.org/wiki/Stellar_isochrone), showing where we expect the stars in GD-1 to fall based on the metallicity and age of their original globular cluster. 

By selecting stars in the shaded area, we can further distinguish the main sequence of GD-1 from younger background stars.

## Outline

Here are the steps in this notebook:

1. We'll reload the candidate stars we identified in the previous notebook.

2. Then we'll run a query on the Gaia server that uploads the table of candidates and uses a `JOIN` operation to select photometry data for the candidate stars.

3. We'll write the results to a file for use in the next notebook.

After completing this lesson, you should be able to

* Upload a table to the Gaia server.

* Write ADQL queries involving `JOIN` operations.

## Installing libraries

If you are running this notebook on Colab, you can run the following cell to install the libraries we'll use.

If you are running this notebook on your own computer, you might have to install these libraries yourself. See the instructions in the preface.

In [28]:
# If we're running on Colab, install libraries

import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install astroquery

## Getting photometry data

The Gaia dataset contains some photometry data, including the variable `bp_rp`, which contains BP-RP color (the difference in mean magnitude between the BP and RP bands).
We used this variable to select stars with `bp_rp` between -0.75 and 2, which excludes many class M dwarf stars.

Now, to select stars with the age and metal richness we expect in GD-1, we will use `g-i` color and apparent `g`-band magnitude, which are available from the Pan-STARRS survey.

Conveniently, the Gaia server provides data from Pan-STARRS as a table in the same database we have been using, so we can access it by making ADQL queries.

In general, choosing a star from the Gaia catalog and finding the corresponding star in the Pan-STARRS catalog is not easy. This kind of cross matching is not always possible, because a star might appear in one catalog and not the other. And even when both stars are present, there might not be a clear one-to-one relationship between stars in the two catalogs.

Fortunately, smart people have worked on this problem, and the Gaia database includes cross-matching tables that suggest a best neighbor in the Pan-STARRS catalog for many stars in the Gaia catalog.

[This document describes the cross matching process](https://gea.esac.esa.int/archive/documentation/GDR2/Catalogue_consolidation/chap_cu9val_cu9val/ssec_cu9xma/sssec_cu9xma_extcat.html). Briefly, it uses a cone search to find possible matches in approximately the right position, then uses attributes like color and magnitude to choose pairs of observations most likely to be the same star.
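To make the two stages concrete, here is a toy sketch of the idea. It is not the algorithm used to build the Gaia cross-match tables: the catalogs, the one-arcsecond radius, and the rule "pick the candidate closest in magnitude" are all assumptions made up for illustration, as are the function names.

```python
import math

def angular_separation(ra1, dec1, ra2, dec2):
    """Angular separation in degrees between two sky positions (degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, which behaves well for small angles
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))

def best_neighbour(source, catalog, radius_deg):
    """Return the obj_id of the best match for `source`, or None."""
    # Stage 1: keep only objects inside the search cone
    candidates = [obj for obj in catalog
                  if angular_separation(source['ra'], source['dec'],
                                        obj['ra'], obj['dec']) <= radius_deg]
    if not candidates:
        return None
    # Stage 2 (hypothetical rule): prefer the most similar magnitude
    best = min(candidates, key=lambda obj: abs(obj['mag'] - source['mag']))
    return best['obj_id']

# Made-up data: one source and three objects from another catalog
source = {'source_id': 1, 'ra': 88.8000, 'dec': 7.4000, 'mag': 17.2}
other_catalog = [
    {'obj_id': 'A', 'ra': 88.8001, 'dec': 7.4001, 'mag': 19.5},
    {'obj_id': 'B', 'ra': 88.8002, 'dec': 7.3999, 'mag': 17.3},
    {'obj_id': 'C', 'ra': 89.5000, 'dec': 7.4000, 'mag': 17.2},
]
radius = 1 / 3600  # one arcsecond, expressed in degrees

match = best_neighbour(source, other_catalog, radius)
print(match)
```

In this toy data, two objects fall inside the cone, so position alone is ambiguous; the magnitude comparison is what settles the match. The third object has the right magnitude but is far outside the cone, so it is never considered.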

## The best neighbor table

So the hard part of cross-matching has been done for us. Using the results is a little tricky, but it gives us a chance to learn about one of the most important tools for working with databases: "joining" tables.

In general, a "join" is an operation where you match up records from one table with records from another table using as a "key" a piece of information that is common to both tables, usually some kind of ID code.

In this example:

* Stars in the Gaia dataset are identified by `source_id`.

* Stars in the Pan-STARRS dataset are identified by `obj_id`.

For each candidate star we have selected so far, we have the `source_id`; the goal is to find the `obj_id` for the same star (we hope) in the Pan-STARRS catalog.

To do that we will:

1. Use the `JOIN` operator to look up each `source_id` in the `panstarrs1_best_neighbour` table, which contains the `obj_id` of the best match for each star in the Gaia catalog; then

2. Use the `JOIN` operator again to look up each `obj_id` in the `panstarrs1_original_valid` table, which contains the Pan-STARRS photometry data we want.
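Conceptually, each of these steps is a lookup by key. The following sketch imitates them with plain Python dictionaries rather than a database; every ID and magnitude in it is made up, and the variable names are only stand-ins for the real tables.

```python
# Hypothetical IDs and magnitudes, for illustration only
best_neighbour = {            # source_id -> original_ext_source_id
    101: 9001,
    102: 9002,
    # 103 has no best neighbor in Pan-STARRS
}
panstarrs = {                 # obj_id -> (g magnitude, i magnitude)
    9001: (17.9, 17.5),
    9002: (19.4, 18.8),
    9003: (20.1, 19.6),
}

candidates = [101, 102, 103]  # source_ids selected so far

photometry = {}
for source_id in candidates:
    # Step 1: look up the source_id in the best neighbor table
    obj_id = best_neighbour.get(source_id)
    if obj_id is None:
        continue              # no match, so this candidate is dropped
    # Step 2: look up the obj_id in the Pan-STARRS table
    if obj_id in panstarrs:
        photometry[source_id] = panstarrs[obj_id]

print(photometry)
```

Candidate 103 disappears from the result because the first lookup fails, which is the same behavior we should expect from the database when a star has no best neighbor.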

Before we get to the `JOIN` operation, let's explore these tables.
Here's the metadata for `panstarrs1_best_neighbour`.

In [29]:
from astroquery.gaia import Gaia

meta = Gaia.load_table('gaiadr2.panstarrs1_best_neighbour')

In [30]:
print(meta)

And here are the columns.

In [31]:
for column in meta.columns:
    print(column.name)

Here's the [documentation for these variables](https://gea.esac.esa.int/archive/documentation/GDR2/Gaia_archive/chap_datamodel/sec_dm_crossmatches/ssec_dm_panstarrs1_best_neighbour.html).

The ones we'll use are:

* `source_id`, which we will match up with `source_id` in the Gaia table.

* `number_of_neighbours`, which indicates how many sources in Pan-STARRS are matched with this source in Gaia.

* `number_of_mates`, which indicates the number of *other* sources in Gaia that are matched with the same source in Pan-STARRS.

* `original_ext_source_id`, which we will match up with `obj_id` in the Pan-STARRS table.

Ideally, `number_of_neighbours` should be 1 and `number_of_mates` should be 0; in that case, there is a one-to-one match between the source in Gaia and the corresponding source in Pan-STARRS.

Here's a query that selects these columns and returns the first 5 rows.

In [32]:
query = """SELECT 
TOP 5
source_id, number_of_neighbours, number_of_mates, original_ext_source_id
FROM gaiadr2.panstarrs1_best_neighbour
"""

In [33]:
job = Gaia.launch_job_async(query=query)

In [34]:
results = job.get_results()
results

## The Pan-STARRS table

Here's the metadata for the table that contains the Pan-STARRS data.

In [35]:
meta = Gaia.load_table('gaiadr2.panstarrs1_original_valid')

In [36]:
print(meta)

And here are the columns.

In [37]:
for column in meta.columns:
    print(column.name)

Here's the [documentation for these variables]() .

The ones we'll use are:

* `obj_id`, which we will match up with `original_ext_source_id` in the best neighbor table.

* `g_mean_psf_mag`, which contains mean magnitude from the `g` filter.

* `i_mean_psf_mag`, which contains mean magnitude from the `i` filter.

Here's a query that selects these variables and returns the first 5 rows.

In [38]:
query = """SELECT 
TOP 5
obj_id, g_mean_psf_mag, i_mean_psf_mag 
FROM gaiadr2.panstarrs1_original_valid
"""

In [39]:
job = Gaia.launch_job_async(query=query)

In [40]:
results = job.get_results()
results

The following figure shows how these tables are related.

* The orange circles and arrows represent the first `JOIN` operation, which takes each `source_id` in the Gaia table and finds the same value of `source_id` in the best neighbor table.

* The blue circles and arrows represent the second `JOIN` operation, which takes each `original_ext_source_id` in the best neighbor table and finds the same value of `obj_id` in the Pan-STARRS table.

There's no guarantee that the corresponding rows of these tables are in the same order, so the `JOIN` operation involves some searching.
However, ADQL/SQL databases are implemented in a way that makes this kind of search efficient.
If you are curious, you can [read more about it](https://chartio.com/learn/databases/how-does-indexing-work/).
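As a rough picture of why an index helps, the following sketch compares two ways of finding a row by its key. It uses a Python dictionary as a stand-in for a database index, which is an assumption for illustration: real databases often use tree-based indexes instead, and the rows here are made up.

```python
# Made-up best neighbor rows: (source_id, original_ext_source_id)
rows = [(source_id, 9000 + source_id) for source_id in range(100_000)]

def find_by_scan(rows, source_id):
    """Check rows one at a time until the key is found."""
    for key, value in rows:
        if key == source_id:
            return value
    return None

# Build the "index" once: a mapping from key to value
index = dict(rows)

def find_by_index(index, source_id):
    """Go straight to the row using the index."""
    return index.get(source_id)

scan_result = find_by_scan(rows, 99_999)
index_result = find_by_index(index, 99_999)
print(scan_result == index_result)
```

Both functions return the same answer, but the scan may have to look at every row, while the indexed lookup takes about the same time no matter how big the table is. The cost of building the index is paid once and then shared by every later lookup.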

## Joining tables

Now let's get to the details of performing a `JOIN` operation.
As a starting place, let's go all the way back to the cone search from Lesson 2.

In [41]:
query_cone = """SELECT 
TOP 10 
source_id
FROM gaiadr2.gaia_source
WHERE 1=CONTAINS(
 POINT(ra, dec),
 CIRCLE(88.8, 7.4, 0.08333333))
"""

And let's run it, to make sure we have a working query to build on.

In [42]:
from astroquery.gaia import Gaia

job = Gaia.launch_job_async(query=query_cone)

In [43]:
results = job.get_results()
results

Now we can start adding features.
First, let's replace `source_id` with a format specifier, `columns`: 

In [44]:
query_base = """SELECT 
{columns}
FROM gaiadr2.gaia_source
WHERE 1=CONTAINS(
 POINT(ra, dec),
 CIRCLE(88.8, 7.4, 0.08333333))
"""

Here are the columns we want from the Gaia table, again. 

In [45]:
columns = 'source_id, ra, dec, pmra, pmdec'

query = query_base.format(columns=columns)
print(query)

And let's run the query again.

In [46]:
job = Gaia.launch_job_async(query=query)

In [47]:
results = job.get_results()
results

## Adding the best neighbor table

Now we're ready for the first join.
The join operation requires two clauses:

* `JOIN` specifies the name of the table we want to join with, and

* `ON` specifies how we'll match up rows between the tables.

In this example, we join with `gaiadr2.panstarrs1_best_neighbour AS best`, which means we can refer to the best neighbor table with the abbreviated name `best`.

And the `ON` clause indicates that we'll match up the `source_id` column from the Gaia table with the `source_id` column from the best neighbor table. 

In [48]:
query_base = """SELECT 
{columns}
FROM gaiadr2.gaia_source AS gaia
JOIN gaiadr2.panstarrs1_best_neighbour AS best
 ON gaia.source_id = best.source_id
WHERE 1=CONTAINS(
 POINT(gaia.ra, gaia.dec),
 CIRCLE(88.8, 7.4, 0.08333333))
"""

**SQL detail:** In this example, the `ON` column has the same name in both tables, so we could replace the `ON` clause with a simpler [`USING` clause](https://docs.oracle.com/javadb/10.8.3.0/ref/rrefsqljusing.html):

```
USING(source_id)
```
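If you want to convince yourself that the two forms are equivalent, here is a small sketch that runs both against an in-memory SQLite database. It uses SQLite rather than the Gaia server, and the tables `gaia` and `best` and their contents are made up, so treat it as a demonstration of the SQL syntax only.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE gaia (source_id INTEGER, ra REAL);
CREATE TABLE best (source_id INTEGER, original_ext_source_id INTEGER);
INSERT INTO gaia VALUES (101, 88.80), (102, 88.81), (103, 88.82);
INSERT INTO best VALUES (101, 9001), (102, 9002);
""")

# Join with an explicit ON clause and qualified column names
query_on = """SELECT gaia.source_id, best.original_ext_source_id
FROM gaia
JOIN best
  ON gaia.source_id = best.source_id
ORDER BY gaia.source_id
"""

# Same join with USING; the shared column needs no qualifier here
query_using = """SELECT source_id, best.original_ext_source_id
FROM gaia
JOIN best
  USING (source_id)
ORDER BY source_id
"""

rows_on = conn.execute(query_on).fetchall()
rows_using = conn.execute(query_using).fetchall()
conn.close()

print(rows_on == rows_using)
```

In the rest of this notebook we stick with `ON`, because the second join matches columns with different names (`original_ext_source_id` and `obj_id`), where `USING` does not apply.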

Now that there's more than one table involved, we can't use simple column names any more; we have to use **qualified column names**.
In other words, we have to specify which table each column is in.
Here's the complete query, including the columns we want from the Gaia and best neighbor tables.

In [49]:
column_list = ['gaia.source_id',
 'gaia.ra',
 'gaia.dec',
 'gaia.pmra',
 'gaia.pmdec',
 'best.best_neighbour_multiplicity',
 'best.number_of_mates',
 ]
columns = ', '.join(column_list)

query = query_base.format(columns=columns)
print(query)

In [50]:
job = Gaia.launch_job_async(query=query)

In [51]:
results = job.get_results()
results

Notice that this result has fewer rows than the previous result.
That's because there are sources in the Gaia table with no corresponding source in the Pan-STARRS table.

By default, the result of the join only includes rows where the same `source_id` appears in both tables.
This default is called an "inner" join because the results include only the intersection of the two tables.
[You can read about the other kinds of join here](https://www.geeksforgeeks.org/sql-join-set-1-inner-left-right-and-full-joins/).
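To see the difference between an inner join and one of the alternatives, here is a minimal sketch using an in-memory SQLite database with made-up tables. It is run locally, not on the Gaia server, so it only illustrates how the two kinds of join treat rows without a match.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE gaia (source_id INTEGER);
CREATE TABLE best (source_id INTEGER, original_ext_source_id INTEGER);
INSERT INTO gaia VALUES (101), (102), (103);
INSERT INTO best VALUES (101, 9001), (102, 9002);
""")

# Inner join (the default): only source_ids present in both tables
inner = conn.execute("""SELECT gaia.source_id, best.original_ext_source_id
FROM gaia
JOIN best
  ON gaia.source_id = best.source_id
ORDER BY gaia.source_id
""").fetchall()

# Left join: every row of the first table, with NULL where there is no match
left = conn.execute("""SELECT gaia.source_id, best.original_ext_source_id
FROM gaia
LEFT JOIN best
  ON gaia.source_id = best.source_id
ORDER BY gaia.source_id
""").fetchall()
conn.close()

print(len(inner), len(left))
```

Source 103 has no row in `best`, so it is missing from the inner join and appears in the left join with `None` (SQL `NULL`) in place of the Pan-STARRS ID. For our purposes the inner join is what we want, since a candidate without photometry is of no use in the next step.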

## Adding the Pan-STARRS table

### Exercise

Now we're ready to bring in the Pan-STARRS table. Starting with the previous query, add a second `JOIN` clause that joins with `gaiadr2.panstarrs1_original_valid`, gives it the abbreviated name `ps`, and matches `original_ext_source_id` from the best neighbor table with `obj_id` from the Pan-STARRS table.

Add `g_mean_psf_mag` and `i_mean_psf_mag` to the column list, and run the query.
The result should contain 490 rows and 9 columns.

In [52]:
# Solution goes here

## Selecting by coordinates and proper motion

Now let's bring in the `WHERE` clause from the previous lesson, which selects sources based on parallax, BP-RP color, sky coordinates, and proper motion.

Here's `query6_base` from the previous lesson.

In [53]:
query6_base = """SELECT 
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < 1
 AND bp_rp BETWEEN -0.75 AND 2 
 AND 1 = CONTAINS(POINT(ra, dec), 
 POLYGON({point_list}))
 AND 1 = CONTAINS(POINT(pmra, pmdec),
 POLYGON({pm_point_list}))
"""

Let's reload the Pandas `Series` that contains `point_list` and `pm_point_list`.

In [54]:
import pandas as pd

filename = 'gd1_data.hdf'
point_series = pd.read_hdf(filename, 'point_series')
point_series

Now we can assemble the query.

In [55]:
columns = 'source_id, ra, dec, pmra, pmdec'

query6 = query6_base.format(columns=columns,
 point_list=point_series['point_list'],
 pm_point_list=point_series['pm_point_list'])

print(query6)

Again, let's run it to make sure we are starting with a working query.

In [56]:
job = Gaia.launch_job_async(query=query6)

In [57]:
results = job.get_results()
results

### Exercise

Create a new query base called `query7_base` that combines the `WHERE` clauses from the previous query with the `JOIN` clauses for the best neighbor and Pan-STARRS tables.
Format the query base using the column names in `column_list`, and call the result `query7`.

Hint: Make sure you use qualified column names everywhere!

Run your query and download the results. The table you get should have 3725 rows and 9 columns.

In [58]:
# Solution goes here

## Checking the match

To get more information about the matching process, we can inspect `best_neighbour_multiplicity`, which indicates for each star in Gaia how many stars in Pan-STARRS are equally likely matches.

In [59]:
results['best_neighbour_multiplicity']

It looks like most of the values are `1`, which is good; that means that for each candidate star we have identified exactly one source in Pan-STARRS that is likely to be the same star.

To check whether there are any values other than `1`, we can convert this column to a Pandas `Series` and use `describe`, which we saw in Lesson 3.

In [60]:
import pandas as pd

multiplicity = pd.Series(results['best_neighbour_multiplicity'])
multiplicity.describe()

In fact, `1` is the only value in the `Series`, so every candidate star has a single best match.

Similarly, `number_of_mates` indicates the number of *other* stars in Gaia that match with the same star in Pan-STARRS.

In [61]:
mates = pd.Series(results['number_of_mates'])
mates.describe()

All values in this column are `0`, which means that for each match we found in Pan-STARRS, there are no other stars in Gaia that also match. 
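If you prefer a direct yes-or-no answer to reading the output of `describe`, one possible approach is sketched below. The two `Series` are made-up stand-ins for the columns of `results`, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up values standing in for two columns of `results`
multiplicity = pd.Series([1, 1, 1, 1])
mates = pd.Series([0, 0, 0, 0])

# value_counts lists each distinct value and how often it occurs
print(multiplicity.value_counts())

# True only if every match is one-to-one
one_to_one = bool((multiplicity == 1).all() and (mates == 0).all())
print(one_to_one)
```

A check like this is easy to turn into an `assert` statement, which can be useful if you rerun the query later and want to be warned when the assumption stops holding.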

**Detail:** The table also contains `number_of_neighbours`, which is the number of stars in Pan-STARRS that match in terms of position, before using other criteria to choose the most likely match. But we are more interested in the final match, which takes both position and the other criteria into account.

## Transforming coordinates

Here's the function we've used to transform the results from ICRS to GD-1 coordinates.

In [62]:
import astropy.units as u
from astropy.coordinates import SkyCoord
from gala.coordinates import GD1Koposov10
from gala.coordinates import reflex_correct

def make_dataframe(table):
    """Transform coordinates from ICRS to GD-1 frame.

    table: Astropy Table

    returns: Pandas DataFrame
    """
    skycoord = SkyCoord(
        ra=table['ra'],
        dec=table['dec'],
        pm_ra_cosdec=table['pmra'],
        pm_dec=table['pmdec'],
        distance=8*u.kpc,
        radial_velocity=0*u.km/u.s)

    gd1_frame = GD1Koposov10()
    transformed = skycoord.transform_to(gd1_frame)
    skycoord_gd1 = reflex_correct(transformed)

    df = table.to_pandas()
    df['phi1'] = skycoord_gd1.phi1
    df['phi2'] = skycoord_gd1.phi2
    df['pm_phi1'] = skycoord_gd1.pm_phi1_cosphi2
    df['pm_phi2'] = skycoord_gd1.pm_phi2
    return df

Now we can transform the result from the last query.

In [63]:
candidate_df = make_dataframe(results)

And see how it looks.

In [64]:
import matplotlib.pyplot as plt

x = candidate_df['phi1']
y = candidate_df['phi2']
plt.plot(x, y, 'ko', markersize=0.5, alpha=0.5)

plt.xlabel('phi1 (degree GD1)')
plt.ylabel('phi2 (degree GD1)');

The result is similar to what we saw in the previous lesson, except that we have fewer stars now, because we did not find photometry data for all of the candidate sources.

## Saving the DataFrame

Let's save this `DataFrame` so we can pick up where we left off without running this query again.
The HDF file should already exist, so we'll add `candidate_df` to it.

In [65]:
filename = 'gd1_data.hdf'

candidate_df.to_hdf(filename, key='candidate_df')

We can use `getsize` to confirm that the file exists and check the size:

In [66]:
from os.path import getsize

MB = 1024 * 1024
getsize(filename) / MB

## Summary

In this notebook, we used database `JOIN` operations to select photometry data for the stars we've identified as candidates to be in GD-1.

In the next notebook, we'll use this data for a second round of selection, identifying stars that have photometry data consistent with GD-1.

But before you go on, you might be interested in another file format, CSV.

## CSV

Pandas can write a variety of other formats, [which you can read about here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).
We won't cover all of them, but one other important one is [CSV](https://en.wikipedia.org/wiki/Comma-separated_values), which stands for "comma-separated values".

CSV is a plain-text format that can be read and written by pretty much any tool that works with data. In that sense, it is the "least common denominator" of data formats.

However, it has an important limitation: some information about the data gets lost in translation, notably the data types. If you read a CSV file from someone else, you might need some additional information to make sure you are getting it right.

Also, CSV files tend to be big, and slow to read and write.

With those caveats, here's how to write one:

In [67]:
candidate_df.to_csv('gd1_data.csv')

We can check the file size like this:

In [68]:
getsize('gd1_data.csv') / MB

We can see the first few lines like this:

In [69]:
def head(filename, n=3):
    """Print the first `n` lines of a file."""
    with open(filename) as fp:
        for i in range(n):
            print(next(fp))

In [70]:
head('gd1_data.csv')

The CSV file contains the names of the columns, but not the data types.
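To see what "not the data types" means in practice, here is a small sketch that uses Python's built-in `csv` module and an in-memory buffer instead of a file. The rows are made up. Unlike Pandas, which tries to infer a type for each column when it reads a CSV file, the `csv` module returns exactly what is stored, which makes the loss easy to see.

```python
import csv
import io

# Made-up rows with an integer ID and a floating-point magnitude
rows = [
    {'source_id': 101, 'g_mean_psf_mag': 17.9},
    {'source_id': 102, 'g_mean_psf_mag': 19.4},
]

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['source_id', 'g_mean_psf_mag'])
writer.writeheader()
writer.writerows(rows)

# Read the same text back
buffer.seek(0)
read_back = list(csv.DictReader(buffer))

print(read_back[0])
print(type(read_back[0]['source_id']))
```

Every value comes back as a string, because text is all the file contains. Any tool that reads the file has to guess the types or be told what they are.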

We can read the CSV file back like this:

In [71]:
read_back_csv = pd.read_csv('gd1_data.csv')

Let's compare the first few rows of `candidate_df` and `read_back_csv`.

In [72]:
candidate_df.head(3)

In [73]:
read_back_csv.head(3)

Notice that the index in `candidate_df` has become an unnamed column in `read_back_csv`. The Pandas functions for writing and reading CSV files provide options to avoid that problem, but this is an example of the kind of thing that can go wrong with CSV files.
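As a sketch of two such options, the following example round-trips a small made-up `DataFrame` through an in-memory buffer. It uses the `index` parameter of `to_csv` and the `index_col` parameter of `read_csv`; the data and variable names are invented for illustration.

```python
import io
import pandas as pd

# Made-up data standing in for candidate_df
df = pd.DataFrame({'source_id': [101, 102], 'phi1': [-59.6, -59.2]})

# Option 1: don't write the index at all
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
no_index = pd.read_csv(buffer)

# Option 2: write the index, then tell read_csv which column holds it
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer, index_col=0)

print(list(no_index.columns))
print(list(with_index.columns))
```

Either way, the columns that come back are the ones we started with. The first option is simpler when the index is just a row counter; the second is the one to use when the index carries information you want to keep.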

## Best practices

* Use `JOIN` operations to combine data from multiple tables in a database, using some kind of identifier to match up records from one table with records from another.

* This is another example of a practice we saw in the previous notebook, moving the computation to the data.

* For most applications, saving data in FITS or HDF5 is better than CSV. FITS and HDF5 are binary formats, so the files are usually smaller, and they store metadata, so you don't lose anything when you read the file back.

* On the other hand, CSV is a "least common denominator" format; that is, it can be read by practically any application that works with data.