{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Accessing DC2 data in PostgreSQL at NERSC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Owner: **Joanne Bogart [@jrbogart](https://github.com/LSSTDESC/DC2-analysis/issues/new?body=@jrbogart)** \n", "Last Verified to Run: **2020-08-03**\n", "\n", "This notebook is an introduction to use of the PostgreSQL database at NERSC. Currently the fully supported datasets are the object catalogs for Run1.2i, Run1.2p v4, and Run 2.1i dr1b v1. Since object catalogs, as well as other kinds of catalogs, are also available via GCR, one might question the need for another form of access. The justification (for those applications only using object catalogs) is performance. Typical queries such as the one labeled `q5` below run significantly faster. Ingest also tends to be faster. The Run1.2p v4 data were available via PostgreSQL within a day of the completion of stack processing and transfer to NERSC.\n", "\n", "__Learning objectives__:\n", "\n", "After going through this notebook, you should be able to:\n", " 1. Connect to the DESC DC2 PostgreSQL database at NERSC.\n", " 2. Find out what tables are in the database, what their constituent columns are, and how they relate to DPDD and native object catalog quantities.\n", " 3. Make queries, selecting columns of interest subject to typical cuts. \n", " 4. Make use of standard tools, such as Pandas, for plotting or other analysis of query results.\n", "\n", "__Logistics__: This notebook is intended to be run through the JupyterHub NERSC interface available here: https://jupyter-dev.nersc.gov. To set up your NERSC environment, please follow the instructions available here: https://confluence.slac.stanford.edu/display/LSSTDESC/Using+Jupyter-dev+at+NERSC\n", "\n", "#### Tutorial organization\n", "The first part consists of a typical try-as-you-go interactive tutorial. This is followed by strictly expository sections, starting with [SQL Primer](#SQL-Primer). Those new to SQL should read them in order (more technical sections may be skipped), either before or after the interactive sections, depending on preferred learning style; others may find them useful for reference.\n", "\n", "### Prerequisites\n", "* A file ~/.pgpass containing a line like this:\n", "\n", "`nerscdb03.nersc.gov:54432:desc_dc2_drp:desc_dc2_drp_user:`_password_\n", "\n", "This line allows you to use the desc_dc2_drp_user account, which has *SELECT* privileges on the database, without entering a password in plain text. There is a separate account for adding to or modifying the database. The .pgpass file must be protected so that only the owner may read and write it.\n", " \n", "You can obtain the file by running the script `/global/common/software/lsst/dbaccess/postgres_reader.sh`. It will copy a suitable file to your home directory and set permissions. \n", "\n", "If you already have a `.pgpass` file in your home directory, the script will stop without doing anything so as not to clobber your file. In that case, see the file `reader.pgpass` in the same directory. You can merge it into your `.pgpass` file by hand. \n", "\n", "* Access to the psycopg2 package, which provides a Python interface to PostgreSQL. The recommended way to achieve this is to use the desc-python kernel on jupyter-dev or, for non-Notebook code, DESC-provided shifter images.\n", " \n", "This notebook uses psycopg2 directly for queries. It is also possible to use sqlalchemy, but you will still need a PostgreSQL driver. 
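For illustration only, a connection via sqlalchemy might be set up along the following lines (an untested sketch; it relies on the same `.pgpass` entry to supply the password):\n\n```\nfrom sqlalchemy import create_engine\nengine = create_engine('postgresql+psycopg2://desc_dc2_drp_user@nerscdb03.nersc.gov:54432/desc_dc2_drp')\n```\n\nSeveral such drivers exist for Python. 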
Of these psycopg2 is the most popular.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import psycopg2\n", "\n", "import numpy as np\n", "\n", "%matplotlib inline \n", "import matplotlib.pyplot as plt\n", "\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make the db connection" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dbname = 'desc_dc2_drp'\n", "dbuser = 'desc_dc2_drp_user'\n", "dbhost = 'nerscdb03.nersc.gov'\n", "dbconfig = {'dbname' : dbname, 'user' : dbuser, 'host' : dbhost}\n", "dbconn = psycopg2.connect(**dbconfig)\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = 'run12i' \n", "# schema = 'run12p_v4'\n", "# schema = 'run21i_dr1b_v1'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tables for the Run1.2i data as well as a view to make dpdd quantities more easily accessible are in the `schema` (acts like a namespace) run12i. The value for `schema` above will change for other datasets. \n", "\n", "There is a special system schema, **information_schema**, which contains tables describing the structure of user tables. Of these **information_schema.columns** is most likely to be useful. The following lists all tables and views belonging to schema run12i. (I will use the convention of writing SQL keywords in all caps in queries. It's not necessary; the SQL interpreter ignores case.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "q1 = \"SELECT DISTINCT table_name FROM information_schema.columns WHERE table_schema='{schema}' ORDER BY table_name\".format(**locals())\n", "with dbconn.cursor() as cursor:\n", " # Could have several queries interspersed with other code in this block\n", " cursor.execute(q1)\n", " for record in cursor:\n", " print(record[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**\\_temp:forced\\_patch** is an artifact of the ingest process which is of no interest here.\n", "All the remaining entries with the exception of **dpdd** are tables. Each table has rows of data, one per object in the catalog. The rows consist of \"native quantities\" as produced by running the dm stack on the simulated data. **dpdd** is a _view_* making the quantities identified in [GCRCatalogs/SCHEMA.md](https://github.com/LSSTDESC/gcr-catalogs/blob/master/GCRCatalogs/SCHEMA.md#schema-for-dc2-coadd-catalogs) available. Information is broken across several tables because there are too many columns for a single table. All tables have a field ```object_id```. In the **dpdd** view it's called ```objectId``` because that's the official name for it. The following code will list all quantities in the **dpdd** view. Note the database ignores case; all quantities appear in lower case only.\n", "\n", "*A _view_ definition consists of references to quantities stored in the tables or computable from them; the view has no data of its own. The view name is used in queries just like a table name." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tbl = 'dpdd'\n", "q2 = \"SELECT column_name, data_type FROM information_schema.columns WHERE table_schema='{schema}' AND table_name='{tbl}' order by column_name \".format(**locals())\n", "print(q2)\n", "with dbconn.cursor() as cursor:\n", " cursor.execute(q2)\n", " records = cursor.fetchall()\n", " print(\"There are {} columns in table {}. They are:\\n\".format(len(records), tbl))\n", " print(\"Name Data Type\")\n", " for record in records:\n", " print(\"{0!s:55} {1!s:20}\".format(record[0], record[1]) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a similar query for the **position** table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tbl = 'position'\n", "q2_pos = \"SELECT column_name, data_type FROM information_schema.columns WHERE table_schema='{schema}' AND table_name='{tbl}'\".format(**locals())\n", "with dbconn.cursor() as cursor:\n", " cursor.execute(q2_pos)\n", " records = cursor.fetchall()\n", " print(\"There are {} columns in table {}. They are:\\n\".format(len(records), tbl))\n", " print(\"Name Data Type\")\n", " for record in records:\n", " print(\"{0!s:55} {1!s:20}\".format(record[0], record[1]) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a query which counts up objects per tract and stores the results (queries return a list of tuples) in a pandas DataFrame. It makes use of a user-defined function (UDF*) ```tract_from_object_id```, which is by far the fastest way to determine the tract. \n", "\n", "*The UDF `tract_from_object_id` is one of several which minimize query time by making optimal use of the structure of the database. See the second tutorial in this series for a discussion of some of the others. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "q3 = \"SELECT tract_from_object_id(object_id), COUNT(object_id) FROM {schema}.position WHERE detect_isprimary GROUP BY tract_from_object_id(object_id)\".format(**locals())\n", "with dbconn.cursor() as cursor:\n", " %time cursor.execute(q3)\n", " df = pd.DataFrame(cursor.fetchall(), columns=['tract', 'count'])\n", " print(df)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the same query, but made on the dpdd view rather than the position table. There is no need to specify \"is primary\" because the dpdd view already has this constraint." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "q4 = \"SELECT tract_from_object_id(objectid), COUNT(objectid) FROM {schema}.dpdd GROUP BY tract_from_object_id(objectid)\".format(**locals())\n", "with dbconn.cursor() as cursor:\n", " cursor.execute(q4)\n", " df = pd.DataFrame(cursor.fetchall(), columns=['tract', 'count'])\n", " print(df)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following can be compared with a similar query in the GCR Intro notebook." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "q5 = \"SELECT ra,dec FROM {schema}.dpdd\".format(**locals())\n", "with dbconn.cursor() as cursor:\n", " %time cursor.execute(q5)\n", " %time records = cursor.fetchall()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Color-color\n", "Techniques are adapted from the notebook [object_pandas_stellar_locus](https://github.com/LSSTDESC/DC2-analysis/blob/master/tutorials/object_pandas_stellar_locus.ipynb).\n", "#### Using cuts\n", "Put some typical cuts in a WHERE clause: select `clean` objects (no flagged pixels; not skipped by deblender) for which signal to noise in bands of interest is acceptable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "global_cuts = 'clean '\n", "t = None\n", "\n", "min_SNR = 25 \n", "max_err = 1/min_SNR\n", "band_cuts = ' (magerr_g < {}) AND (magerr_i < {}) AND (magerr_r < {}) '.format(max_err,max_err,max_err)\n", "\n", "where = ' WHERE ' + global_cuts + ' AND ' + band_cuts \n", "q6 = \"SELECT mag_g,mag_r,mag_i FROM {schema}.dpdd \".format(**locals()) + where\n", "print(q6)\n", "records = []\n", "with dbconn.cursor() as cursor:\n", " %time cursor.execute(q6)\n", " records = cursor.fetchall()\n", " nObj = len(records)\n", " \n", "df = pd.DataFrame(records, columns=['mag_g', 'mag_r', 'mag_i'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Plotting" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_stellar_locus_davenport(color1='gmr', color2='rmi',\n", " datafile='../tutorials/assets/Davenport_2014_MNRAS_440_3430_table1.txt'):\n", " #datafile='assets/Davenport_2014_MNRAS_440_3430_table1.txt'):\n", " data = pd.read_csv(datafile, sep='\\s+', header=1)\n", " return data[color1], data[color2]\n", " \n", " \n", "def plot_stellar_locus(color1='gmr', color2='rmi',\n", " color='red', linestyle='--', linewidth=2.5):\n", " model_gmr, model_rmi = get_stellar_locus_davenport(color1, color2)\n", " plot_kwargs = {'linestyle': linestyle, 'linewidth': linewidth, 'color': color,\n", " 'scalex': False, 'scaley': False}\n", " plt.plot(model_gmr, model_rmi, **plot_kwargs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_color_color(z, color1, color2, range1=(-1, +2), range2=(-1, +2), bins=31, title=None):\n", " \"\"\"Plot a color-color diagram. Overlay stellar locus. 
\"\"\"\n", "    band1, band2 = color1[0], color1[-1]\n", "    band3, band4 = color2[0], color2[-1]\n", "    H, xedges, yedges = np.histogram2d(\n", "        z['mag_%s' % band1] - z['mag_%s' % band2],\n", "        z['mag_%s' % band3] - z['mag_%s' % band4],\n", "        range=(range1, range2), bins=bins)\n", " \n", "    zi = H.T\n", "    xi = (xedges[1:] + xedges[:-1])/2\n", "    yi = (yedges[1:] + yedges[:-1])/2\n", "\n", "    cmap = 'viridis_r'\n", "    plt.figure(figsize=(8, 8))\n", "    plt.pcolormesh(xi, yi, zi, cmap=cmap)\n", "    plt.contour(xi, yi, zi)\n", "    plt.xlabel('%s-%s' % (band1, band2))\n", "    plt.ylabel('%s-%s' % (band3, band4))\n", "    if title is not None:\n", "        plt.suptitle(title, size='xx-large', y=0.92)\n", "\n", "    plot_stellar_locus(color1, color2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_color_color(df, 'gmr', 'rmi')\n", "print('Using schema {}, cut on max err={}, found {} objects'.format(schema, max_err, nObj))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now make the same plot, but for Run 1.2p data. The query takes noticeably longer because the Run 1.2p catalog is about 5 times bigger than the one for Run 1.2i. For Run 2.1i dr1b v1 it takes over 30 minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = 'run12p_v4'\n", "\n", "global_cuts = 'clean '\n", "t = None\n", "\n", "min_SNR = 25 \n", "max_err = 1/min_SNR\n", "band_cuts = ' (magerr_g < {}) AND (magerr_i < {}) AND (magerr_r < {}) '.format(max_err,max_err,max_err)\n", "where = ' WHERE ' + global_cuts + ' AND ' + band_cuts \n", "q7 = \"SELECT mag_g,mag_r,mag_i FROM {schema}.dpdd \".format(**locals()) + where\n", "print(q7)\n", "records = []\n", "with dbconn.cursor() as cursor:\n", "    %time cursor.execute(q7)\n", "    records = cursor.fetchall()\n", "    nObj = len(records)\n", " \n", "df = pd.DataFrame(records, columns=['mag_g', 'mag_r', 'mag_i'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_color_color(df, 'gmr', 'rmi', title=t)\n", "print(f'Using schema {schema}, max err={max_err}, found {nObj} objects')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SQL Primer\n", "This optional section is intended for those with little or no prior experience with an SQL database, in particular with a server-mediated database like PostgreSQL. Most, but not all, of the material discussed here applies to SQLite as well; differences are pointed out. Since the PostgreSQL database is read-only for users, write operations like `INSERT` and `UPDATE` are not discussed.\n", "\n", "**NOTE:** The most up-to-date version of the Primer and following sections may be found [in Confluence](https://confluence.slac.stanford.edu/x/s4joE).\n", "\n", "For definitive answers to any questions concerning the Postgres implementation of SQL, consult the [manual](https://www.postgresql.org/docs/9.6/sql.html).\n", "### Connecting to the Database\n", "In order to use a database server (e.g. PostgreSQL, MySQL, Oracle), you have to supply connection information: host, database name, user name and password. That is precisely what the first block of code at the start of this tutorial does. Everything is there but the password. That comes from a file accessible only to the user, as described in the [Prerequisites section](#Prerequisites).\n", "\n", "None of this is necessary for SQLite; all you need is read access to the file. 
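For comparison, a minimal SQLite sketch (the filename here is only a placeholder):\n\n```\nimport sqlite3\nsqlite_conn = sqlite3.connect('some_catalog.sqlite3')   # hypothetical local file\n```\n\n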
Opening the file with the Python sqlite3 module is the analog of the line `dbconn = psycopg2.connect(**dbconfig)` above" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tables\n", "#### Table definitions\n", "A table definition contains a set of column definitions: a name, a datatype, and optional constraints. For example all values of the column can be required to be `NOT NULL` (i.e., every row must have a value for the column, which is not the default) or they can be required to be `UNIQUE`. Or both. A table definition may also include index definitions. An `INDEX` is an extra data structure internal to the database which makes it faster to look up the column or columns which are indexed.\n", "#### Table data\n", "The data for each column must conform to the datatype of the column. SQLite is pretty lax in this respect; PostgreSQL requires more. If a column is of type `TEXT` all literal values appearing in queries must be quoted, even if the value is a string of digits. A query like this\n", "\n", "`SELECT col1 FROM mytable WHERE col2=123;`\n", "\n", "will fail for PostgreSQL if col2 is of type `TEXT`. SQLite allows it.\n", "#### Views\n", "A view has the same apparent structure as a table (a collection of rows, each with the same columns) and is used the same way in `SELECT` statements, but the column definitions and data actually come from one or more tables. In the definition of the view, columns from the table can be renamed and the collection of rows included may be subject to conditions, just as in a `SELECT` statement. Views can simplify the interface for users by eliminating the need for `JOIN`s in their queries and regularizing column names for commonly-used collections of columns which may come from multiple tables, as is the case for `run12p_v4.dpdd`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fetching data: SELECT\n", "No matter what system you use to access a database, to fetch data someone (you or some service on your behalf) has to send a text string to the server that has this form : \n", "\n", "`SELECT` *things-to-fetch* `FROM` *table(s)-specification* *optional-selection-conditions* *other-optional-clauses*\n", "\n", "and returns a table whose columns correspond to the *things-to-fetch* list. Unlike parquet files or pandas dataframes, the returned data are physically ordered first by row, then column.\n", "\n", "As in the rest of this tutorial, I follow the convention of writing keywords in `ALL CAPS` only to make them stand out; it's not necessary.\n", "\n", "Examples come primarily from tables in the schema `star_truth`, in particular `truth_summary` and `stellar_variability_truth`. See [https://github.com/LSSTDESC/DC2-analysis/blob/master/tutorials/postgres_truth.ipynb] to learn more about these tables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### What to fetch\n", "*things-to-fetch* is a comma-separated list of expressions. The most common expression is simply a column name, but a wide variety of expressions are allowed, including functions applied to columns, for example\n", "\n", "`SELECT id, ra, dec ...\n", " SELECT DISTINCT id\n", " SELECT MAX(ABS(delta_flux)), AVG(delta_flux) ...\n", " SELECT CONCAT(raft, '_', ccd), one_col + another_col, ...`\n", " \n", "where, in the last example, the values of columns `raft` and `ccd` are text strings. `MAX` and `AVG` are **aggregate functions**, so-called because they apply to collections of column values rather than the value for a single row. 
`SELECT DISTINCT` eliminates duplicate rows from the output. The complete list of available functions and operators for PostgreSQL 9.6 (the version running at NERSC) can be found in their [comprehensive documentation](https://www.postgresql.org/docs/9.6/functions.html)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Table expressions\n", "A *table expression* may identify a single table (or view) or multiple. For our PostgreSQL database, identifying a table requires a *schema name* and a *table name* separated by `.`; e.g. `star_truth.truth_summary`.\n", "\n", "`SELECT MAX(ra), MIN(ra) from star_truth.truth_summary`\n", "\n", "A query may involve more than one table, especially when the tables are related by a column which appears in both or some other expression involving columns. The table expression says how to `JOIN` (i.e., match) the tables to make a new temporary table from which values are selected.\n", "\n", "`SELECT obshistid,delta_flux,S.id \n", " FROM star_truth.truth_summary AS S \n", " JOIN star_truth.stellar_variability_truth AS V \n", " ON V.id = S.id \n", " WHERE ABS(ra - 59.9) < 0.2 LIMIT 5`\n", " \n", "or equivalently\n", "\n", "`SELECT obshistid,delta_flux,id \n", " FROM star_truth.truth_summary \n", " JOIN star_truth.stellar_variability_truth\n", " USING (id)\n", " WHERE ABS(ra - 59.9) < 0.2 LIMIT 5`\n", "\n", "In this query `ra` appears only in `truth_summary` whereas `delta_flux`, being a (per-object, per-visit) quantity, can only be in `stellar_variability_truth`. The quantity to be fetched is in one table, but the quantity needed to select rows of interest is in a different table (though in a real query you might want to fetch `ra` as well). The second form is a shorthand available when, as is often the case, the join condition is just the equality of columns from each table with the same name. The column name or list of column names in a `USING` expression must be enclosed in `( )`. \n", "\n", "The first example also illustrates the use of the keyword `AS` to define an alias to avoid having to retype long table specifications to qualify a column name. The extra qualification is only necessary if \n", "1. the column name appears in more than one table \n", "2. it's not in the `USING` list (hence unnecessary in the second example)\n", "\n", "##### Technical details\n", "There are different kinds of joins. The simplest (`T1 CROSS JOIN T2`) has no condition. It's equivalent to `T1 INNER JOIN T2 ON TRUE`. The result is the Cartesian product of rows from T1 concatenated with rows from T2, not usually what you want. The syntax for the other kinds of joins all involve some kind of constraint or condition. They will return different results if one table or the other has rows which don't meet the condition; e.g., if a column to be matched has a NULL value for some row. For the example query it wouldn't matter which was used because every row in each of the two tables has a match with at least one row in the other table. `INNER` is the default so `JOIN` in the example is equivalent to `INNER JOIN`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Selection conditions\n", "This is the (optional, but usually present) `WHERE` clause of a query. It's a boolean expression normally involving column names which is applied to each row of the table expression in `FROM`. Rows are included in the query result if and only if they evaluate to `TRUE`. See the example in the preceding section. 
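For instance, a simple standalone condition on `star_truth.truth_summary` (the coordinate values here are arbitrary):\n\n`SELECT id, ra, dec FROM star_truth.truth_summary \n   WHERE ABS(ra - 59.9) < 0.2 AND dec < -29.0;`\n\n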
There are all kinds of complicated forms of `WHERE` clauses - for example they may include subqueries - but typical ones are just simple conditions involving columns and functions of columns, connected by `AND` or `OR`.\n", "\n", "The `IN` operator can be useful: \n", "\n", "`SELECT delta_flux, id, obshistid FROM star_truth.stellar_variability_truth \n", " WHERE obshistid IN (2205, 7371, 10853);`\n", "\n", "It's just shorthand for `(obshistid = 2205) OR (obshistid = 7371) OR (obshistid = 10853)`.\n", "\n", "**NOTE:** In the SQL standard the test for equality is `A = B`, *not* `A == B`. SQLite will let you get away with either; PostgreSQL will not." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Other optional clauses\n", "\n", "##### GROUP BY\n", "Groups returned rows together according to specified criteria. When there is a `GROUP BY` clause, each of the fetched items must either be something in the `GROUP BY` list or must involve an aggregate function (`MAX` and `COUNT` in the example below). \n", "`SELECT id, bandpass, MAX(ABS(delta_flux)), COUNT(obshistid) FROM star_truth.stellar_variability_truth \n", " WHERE id IN ('835183','31303590103','31102013522','31303588649', '30317268917', '30825472052','835279','31102039372','30825477672','31102046245', '30321363109','31102051190','31102061342') GROUP BY id,bandpass`\n", " \n", "This query returns up to six rows for each `id` in the list.\n", "##### ORDER BY\n", "Return results in sorted order as specified. The results from the previous query are easier to deal with if sorted.\n", "`SELECT id, bandpass, MAX(ABS(delta_flux)), COUNT(obshistid) FROM star_truth.stellar_variability_truth \n", " WHERE id IN ('835183','31303590103','31102013522','31303588649', '30317268917', '30825472052','835279','31102039372','30825477672','31102046245', '30321363109','31102051190','31102061342') GROUP BY id,bandpass ORDER BY id,bandpass`\n", " \n", "##### LIMIT\n", "Don't return more rows than the limit specified. Useful when you just want to see if a new query is working but don't want to wait for potentially large output.\n", "#### Still more\n", " There are more ways to concoct the table expression. There are more optional clauses. For the complete picture see the [section on SELECT](https://www.postgresql.org/docs/9.6/sql-select.html) in Postgres documentation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Python module psycopg2\n", "There are plenty of examples of use in the first part of this tutorial. This section extracts the basic patterns. It is specific to psycopg2 but other Python interfaces to Postgres or other databases (e.g. mysql.connector for MySQL; sqlalchemy if you stick to the core interface) are similarly structured. There are three things you need to do:\n", "1. connect (normally once for many queries, but see warning below)\n", "2. issue a query\n", "3. obtain results from the query\n", "\n", "### Connect\n", "This is already described above, but for ease of reference it is repeated here:\n", "1. Make a .pgpass file as in [Prerequisites](#Prerequisites) above. You only need to do this once. \n", "2. Import psycopg2\n", "3. Make a dict with other required connection information\n", "4. 
Make the connection using the dict\n", "\n", "The last three steps for our database at NERSC look like this:\n", "```\n", "import psycopg2\n", "\n", "dbname = 'desc_dc2_drp'\n", "dbuser = 'desc_dc2_drp_user'\n", "dbhost = 'nerscdb03.nersc.gov'\n", "dbconfig = {'dbname' : dbname, 'user' : dbuser, 'host' : dbhost}\n", "dbconn = psycopg2.connect(**dbconfig)\n", "```\n", "**WARNING:** If all your subsequent queries succeed you will only need to make the connection once per script or notebook execution, but certain kinds of failures will make the connection unusable. In that case you need to re-execute the last line above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Issue a query\n", "As appropriate for your application, you will have created a string, call it `q`, which starts out `SELECT...`. To execute it you get a cursor and call its `execute` method:\n", "```\n", "cursor = dbconn.cursor()\n", "cursor.execute(q)\n", "```\n", "You can reuse the same cursor for another query as long as you obtain results from the first query before issuing the second.\n", " \n", "**NOTE 1:** Unlike Python, SQL uses only single quotes `'` for string delimiters. A construction like this\n", "\n", "`q = 'SELECT col1 FROM t1 WHERE col2 = \"red\" ' `\n", "\n", "won't work because SQL doesn't recognize the `\"`.\n", "\n", "`q = 'SELECT col1 FROM t1 WHERE col2='red' ' `\n", "\n", "will fail because the internal single quotes aren't escaped. Fortunately Python is rich in quoting methods, so you can just use another one for the outer string:\n", "\n", "`q = \"SELECT col1 FROM t1 WHERE col2='red'\"`\n", "\n", "```\n", "q = '''SELECT col1 from t1 \n", "       WHERE col2='red'\n", "       AND col3='blue'\n", "    '''\n", "```\n", "\n", "**NOTE 2:** In the first part of this tutorial I've used standard Python string formatting to assemble the query string. Alternatively, one can use an SQL-specific way to substitute parameters in a template. That's the preferred method if the values for the variables come from an untrusted source, like a web page, and **definitely** should be used if such values may be used in writes to the database, but neither of those holds in our case. For a full explanation of how to use this method, see the section [Passing parameters to SQL queries](https://www.psycopg.org/docs/usage.html#passing-parameters-to-sql-queries) of the psycopg2 documentation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Retrieve results\n", "Results are retrieved using the cursor methods `fetchone()`, `fetchall()` and `fetchmany(size)`, where the things fetched are *rows*. (In essence, a row is a tuple whose components may be, and usually are, of different types.) The most appropriate method will depend on your application. Use `fetchmany()` when you would ideally process all the data at once, but there is so much that it could strain resources like memory. The following shows a typical pattern of use:\n", "\n", "```\n", "import psycopg2\n", "import pandas as pd\n", "\n", "dbconn = psycopg2.connect(**db_dict)\n", " .\n", " .\n", "q = ...\n", "max_rows = 10000\n", "\n", "# Cursor will automatically be closed at the end of the block\n", "with dbconn.cursor() as cur:\n", "    cur.execute(q)\n", "    while True:\n", "        rows = cur.fetchmany(max_rows)\n", "        if len(rows) == 0:\n", "            break\n", "        df = pd.DataFrame(rows)\n", "        process_some(df)\n", "```\n", "Putting the data in a DataFrame requires a copy, and available memory is limited. 
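As a rough, order-of-magnitude illustration (the numbers are not from this database): a row of ten 8-byte numeric columns is about 80 bytes of raw data, so 10,000 such rows come to roughly a megabyte before Python and pandas overhead. 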
Choosing an appropriate value for `max_rows` depends on knowing how much memory is available and how wide the rows are. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "### PostgreSQL 9.6\n", "The official documentation is generally clear and complete\n", "1. [Everything](https://www.postgresql.org/docs/9.6/index.html)\n", "2. [Postgres SQL syntax](https://www.postgresql.org/docs/9.6/sql-syntax.html)\n", "3. [SELECT](https://www.postgresql.org/docs/9.6/queries.html)\n", "4. [Functions and operators](https://www.postgresql.org/docs/9.6/functions.html)\n", "5. [Additional spatial functions](https://github.com/LSSTDESC/DC2-PostgreSQL/blob/master/postgres-objcatalog/README_functions.md)\n", "### psycopg2\n", "1. [Project documentation](https://www.psycopg.org/docs/index.html)\n", "2. [Wiki with simple examples](https://wiki.postgresql.org/wiki/Using_psycopg2_with_PostgreSQL#Connect_to_Postgres)" ] } ], "metadata": { "kernelspec": { "display_name": "desc-python", "language": "python", "name": "desc-python" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }