{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Databases\n", "\n", "\n", "## Starting Out: Introduction to SQL and Relational Databases\n", "\n", "SQL is a language designed for a very specific purpose: to interact with relational databases. \n", "\n", "- **Database**: A database is a structured collection of data. There are various different ways of structuring the database, and there may or may not be information about the relationship between entities in the database.\n", "- **Query**: A query is a request for data from the database. \n", "- **Database Management System (DBMS)**: A DBMS is a system of storing and managing databases, including querying the database.\n", "- **Relational Database Management System (RDBMS)**: In an RDBMS, data records are stored in *tables*, each of which has a predefined set of *columns*, the pieces of information captured for each record in a table, and *rows* in the table, where each row has a place to store a value for every column in the table.\n", "\n", "Tables, including their columns, column types and relationships with other tables, are defined in a database **schema**. Many times, tables will contain a **primary key**, one or more columns that uniquely define a row. You can think of the primary key as a kind of ID, in which each row is given a unique ID. Tables can also contain **foreign keys**, which are column(s) that comprise the primary key in another table and, thus, provides a way of matching between multiple tables.\n", "\n", "In this notebook, we will use SQL to:\n", "- Select data subsets\n", "- Sum over groups\n", "- Create new tables\n", "- Count distinct values of desired variables\n", "- Order data by chosen variables\n", "- Join tables together" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1: Establish a Connection to the Database\n", "\n", "This notebook uses an SQL kernel, meaning it is able to interpret the SQL code using SQLite. We do, however, still need to specify the database we are using. The cell below establishes this connection to the database. Note that this is specific to using SQL within a Jupyter notebook, and if you use SQL through a different interface, you will not need to use this method of connecting to a database." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "%defaultDatasource jdbc:sqlite:ncdoc.db" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Formulate Data Query\n", "\n", "Depending on what data we are interested in, we can use different queries to pull different data. In this example, we will first start by looking at the inmates data." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "abd5924e-32ec-4369-a194-7395923c7ebb", "version_major": 2, "version_minor": 0 }, "method": "display_data" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "SELECT *\n", "FROM inmate\n", "LIMIT 20;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should see 20 rows of the `inmate` dataset. Let's go over the basics of this SQL command.\n", "\n", "- **SELECT:** We start out with the `SELECT` statement. The `SELECT` statement specifies which variables (columns) you want. \n", " - Here, we used `SELECT *`. The \"`*`\" just says that we want all the variables. \n", " - If we wanted a few columns, we would use the column names separated by commas instead of \"`*`\".\n", "\n", "\n", "- **FROM:** Now, let's look at the next part of the query, `FROM inmate`. This part of the query specifies the table, `inmate`, from which we want to retrieve the data. Most of your queries will begin in this fashion, describing which columns you want and from which table.\n", "\n", "\n", "- **LIMIT:** We typically include a `LIMIT` statement at the end of our query so that we don't get overloaded with rows being output. Here, `LIMIT 20` means that we just want the first ten rows. Many times, the `LIMIT` that you want will be higher than 20 -- you might generally prefer to use 1000 or so. Having a `LIMIT` for all queries is highly recommended even if you know only a few rows will be shown, since it acts as a safety precaution against (for example) displaying millions of rows of data.\n", "\n", "In this case, we've put everything in one line, but that's not necessary. We could have split the code up into multiple lines, like so:\n", "\n", " SELECT *\n", " FROM inmate\n", " LIMIT 20;\n", "\n", "This gives the same output as our original query. Generally, once queries start getting longer, breaking up the code into multiple lines can be very helpful in organizing your code and making it easier to read.\n", "\n", "Along those lines, note that we used a semi-colon at the end of the query to mark the end of the query. That isn't absolutely necessary here, but it does help mark the end of a query and is required in other applications of SQL, so it's good practice to use it. \n", "\n", "> ### Side note about capitalization\n", "If you notice, we've been using all caps for SQL commands and all lowercase for data table and schema names. This is simply a convention, as SQL is not case sensitive. For example, we could have run `select * from inmate limit 20;` and it would have given us the exact same output as the first query. \n", "\n", "> This does mean you need to be careful when using column names. If your column name has capital letters in it, you need use double quotes (e.g. `\"INMATE_DOC_NUMBER\"`) to preserve the capitalization. For this reason, you might find that using all lowercase letters in column names is preferable, which is what we've done here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Note that the `LIMIT` provides one simple way to get a \"sample\" of data; however, using `LIMIT` does **not provide a _random_** sample. You may get different samples of data than others using just the `LIMIT` clause, but it is just based on what is fastest for the database to return." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Summaries of the Data\n", "\n", "One of the basic things you might be interested in doing is finding out how many rows there are in the dataset. You can do this using the `COUNT` statement.\n", "\n", "Let's find the total number of sentences that we have in our `sentences` table." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "461421" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "SELECT COUNT(*)\n", "FROM sentences\n", "LIMIT 20;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also count the number of non-NULL values there are in a given variable by including a column name instead of `*`." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "461421" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "SELECT COUNT(inmate_doc_number)\n", "FROM sentences\n", "LIMIT 20;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To count the number of unique cases, you can use the `DISTINCT` statement. This checks how many unique values of that variable there are. \n", "\n", "Let's find the total number of unique individuals who were in those sentences." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "461421" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "SELECT COUNT(DISTINCT inmate_doc_number)\n", "FROM sentences\n", "LIMIT 20;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also do other basic summaries, such as finding sum using `SUM` or average using `AVG`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conditional Statements\n", "\n", "Suppose we want to look at a subset of the data. We can use conditional statements to do this." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "ename": "org.sqlite.SQLiteException", "evalue": " [SQLITE_ERROR] SQL error or missing database (no such column", "output_type": "error", "text": "org.sqlite.SQLiteException: [SQLITE_ERROR] SQL error or missing database (no such column: release_year)", "traceback": [ "\u001b[1;31morg.sqlite.SQLiteException: [SQLITE_ERROR] SQL error or missing database (no such column: release_year)\u001b[0;0m" ] } ], "source": [ "SELECT *\n", "FROM inmate\n", "WHERE inmate_gender_code = 'MALE'\n", "LIMIT 10;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `WHERE` statement is used to return only data that meets certain conditions. Here, we used it to find only sentences that were between 1980 and 1990. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Common Comparison Operators\n", "Though there are some more complicated comparison operators (if you're curious, feel free to look up what `LIKE` and `IN` do), these should cover most of what you want to do. \n", "- **`=`**: equal to\n", "- **`!=`** or \"**`<>`**\": not equal to\n", "- **`<`**: less than\n", "- **`<=`**: less-than-or-equal-to\n", "- **`>`**: greater than\n", "- **`>=`**: greater-than-or-equal-to\n", "- **`IS NULL`** and **`IS NOT NULL`**: The signifier of a row in a column not having a value is a special keyword: `NULL`. To check for `NULL`, you use `IS NULL` or `IS NOT NULL`, rather than \"=\" or \"!=\". For example, to count the number of rows with `NULL` values for `inmate_gender_code` we might use the following:\n", "\n", " SELECT *\n", " FROM inmate\n", " WHERE inmate_gender_code IS NOT NULL\n", " LIMIT 10;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Aggregation Functions\n", "\n", "What if we wanted to get summaries of the data, such as counts, aggregated by a categorical variable? We can do this using the `GROUP BY` statement." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SELECT inmate_race_code, count(*)\n", "FROM inmate\n", "GROUP BY inmate_race_code\n", "LIMIT 20;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, the `GROUP BY` statement groups it into the categories of the variable. Since we've chosen to display the count, we can see the counts. We can also change the order in which the results are displayed so that it's in increasing order. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SELECT inmate_race_code, count(*)\n", "FROM inmate\n", "GROUP BY inmate_race_code\n", "ORDER BY count(*)\n", "LIMIT 20;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `ORDER BY` statement orders the rows that it displays according to whatever you put after it. In this case, we chose the count. \n", "\n", "### Using GROUP BY with Multiple Variables\n", "\n", "Suppose we wanted to look at counts by race and by gender. We could do this by adding the gender variable to the `GROUP BY` statement." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SELECT inmate_race_code, inmate_gender_code, count(*)\n", "FROM inmate\n", "GROUP BY inmate_race_code, inmate_gender_code\n", "ORDER BY count(*) DESC\n", "LIMIT 20;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This first groups by `inmate_race_code`, then it groups by `inmate_gender_code`, in that order. Further, notice that we used `DESC` after `ORDER BY`. This orders in descending order instead of ascending order, so that we can see the groups with the most people at the top.\n", "\n", "### Conditional Statements After Aggregation\n", "\n", "Suppose we wanted to display only certain counts. We can use `HAVING` to do this." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ae5012e3-d425-4f30-b12b-ca01f473240a", "version_major": 2, "version_minor": 0 }, "method": "display_data" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "SELECT inmate_race_code, inmate_gender_code, count(*)\n", "FROM inmate\n", "GROUP BY inmate_race_code, inmate_gender_code\n", "HAVING count(*) > 1000\n", "ORDER BY count(*) DESC\n", "LIMIT 20;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will only display the counts for which the count is greater than 1000. Note that this is different from using `WHERE`, since the conditional statement comes after the `GROUP BY` statement. Basically, `HAVING` gives us a way of using the same types of conditional statements after we do our aggregation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joins" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the nice things about relational databases is organization using multiple tables that are linked together in some way. For example, suppose we have one table with 6 rows called **Table A**:\n", "\n", "| id | var1|\n", "|---|---|\n", "|1|5|\n", "|2|10|\n", "|3|2|\n", "|4|6|\n", "|5|22|\n", "|6|9|\n", "\n", "And another table with 5 rows called **Table B**:\n", "\n", "| is | var2|\n", "|---|---|\n", "|2|2|\n", "|5|4|\n", "|6|1|\n", "|7|2|\n", "|8|0|\n", "\n", "Let's say we want to combine Table A and Table B so that we have one table that contains information about `id`, `var1`, and `var2`. We want to do this by matching the two tables by what they have in common, `id`. That is, we want a table that looks like this (let's call this **Table C**):\n", "\n", "| id | var1 | var2 |\n", "|---|---|---|\n", "|2|10|2|\n", "|5|22|4|\n", "|6|9|1|\n", "\n", "Table C has each `id` that was in both Table A and Table B. It also contains the appropriate values for `var1` and `var2` corresponding to each `id`. This kind of matching can be quite tricky to figure out manually, since there are different numbers of rows in each table, not all of the `id` values match for the two tables, and there are some `id` values that aren't in both. Fortunately for us, SQL is well-equipped to handle this task using the `JOIN` statement." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SELECT sentences.inmate_doc_number, inmate.inmate_race_code, sentences.actual_sentence_end_date \n", "FROM sentences\n", "JOIN inmate\n", "ON sentences.inmate_doc_number = inmate.inmate_doc_number\n", "LIMIT 1000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we're connected and have established a plan for how we're joining two tables together, let's take a look at the SQL code that performs this join and break it down.\n", "\n", " SELECT * FROM sentences\n", " JOIN inmate\n", " ON sentences.inmate_doc_number = inmate.inmate_doc_number\n", " LIMIT 1000\n", "\n", "Here, we want to `SELECT` each column from a data table that we get from joining the tables `inmate` and `sentences`. The second line takes the `inmate` table and joins the `sentences` table to it. \n", "\n", "We can't just mash two tables together though -- we need some way of making sure that the appropriate rows match. We do this with the third line:\n", "\n", " ON sentences.inmate_doc_number = inmate.inmate_doc_number\n", "\n", "This part specifies what we're joining on. That is, what is the ID variable that is in both tables that we want to match. They don't need to be named the same in both tables, though you do need to specify what they are in each table, even if they are the same, as well as which table they are from.\n", "\n", "If you run the full code below, you should see the first 1000 rows (because of the `LIMIT 1000`) of the joined table. You should be able to scroll through all of the variables and see that we've managed to merge the `inmate` and `sentences` tables together according to their IDs.\n", "\n", "> **Side note:** We're only going to be displaying a few of the columns instead of using `SELECT *` like we showed above. This is because we aren't able to display more than 50 columns here in this notebook format. Joining to get tables with greater than 50 columns is perfectly fine, but we'll only look at a few at a time to make it easier to follow in these exercises." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SELECT sentences.inmate_doc_number, inmate.inmate_race_code, sentences.actual_sentence_end_date \n", "FROM sentences\n", "JOIN inmate\n", "ON sentences.inmate_doc_number = inmate.inmate_doc_number\n", "LIMIT 1000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Different Types of Joins \n", "\n", "We've so far done only one type of join, an inner join. This is the default join (which is why we didn't need to specify anything more in the code). However, there are different types of joins.\n", "\n", "### Left and Right Joins in SQL\n", "\n", "\n", "Suppose we want to look at every single census block in one table, only filling in information from the second table if it exists. We'll illustrate this using Table A and Table B from before. Recall that our `JOIN` created Table C:\n", "\n", "| id | var1 | var2 |\n", "|---|---|---|\n", "|2|10|2|\n", "|5|22|4|\n", "|6|9|1|\n", "\n", "Instead, we want to create the following table:\n", "\n", "| id | var1 | var2 |\n", "|---|---|---|\n", "|1|5|*null*|\n", "|2|10|2|\n", "|3|2|*null*|\n", "|4|6|*null*|\n", "|5|22|4|\n", "|6|9|1|\n", "\n", "Here, we've kept every single row in Table A, and simply filled in the information from Table B if it existed for that `id`. This is called a **LEFT JOIN**, since we're taking the table on the left (that is, Table A) and adding the information from Table B onto that. We could have also done a **RIGHT JOIN**, which does the same thing, except flipping the tables, giving us something that looks like:\n", "\n", "| id | var1 | var2|\n", "|---|---|---|\n", "|2|10|2|\n", "|5|22|4|\n", "|6|9|1|\n", "|7|*null*|2|\n", "|8|*null*|0|" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SELECT sentences.inmate_doc_number, inmate.inmate_race_code, sentences.actual_sentence_end_date \n", "FROM sentences\n", "LEFT JOIN inmate\n", "ON sentences.inmate_doc_number = inmate.inmate_doc_number\n", "LIMIT 1000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Outer Join\n", "\n", "An outer join keeps all unique ids, then puts `NULL` if it isn't part of that table. This is similar to a `LEFT` or `RIGHT JOIN`, except instead of only keeping all IDs from one table, it keeps them from both tables. Consider our example with Table A and Table B. We want to join them such that we get a table that looks like:\n", "\n", "| id | var1 | var2 |\n", "|---|---|---|\n", "|1|5|*null*|\n", "|2|10|2|\n", "|3|2|*null*|\n", "|4|6|*null*|\n", "|5|22|4|\n", "|6|9|1|\n", "|7|*null*|2|\n", "|8|*null*|0|\n", "\n", "In a way, it's like combining the `LEFT` and `RIGHT JOIN`s so that we have all information from both tables.\n", "\n", "### Applying Outer Joins\n", "\n", "We use `OUTER JOIN` for that. Unfortunately, we aren't able to show the outer join here, as it isn't supported by SQLite. We've provided the code here, but it won't run, so just make sure to keep it in mind for the future." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SELECT sentences.inmate_doc_number, inmate.inmate_race_code, sentences.actual_sentence_end_date \n", "FROM sentences\n", "FULL OUTER JOIN inmate\n", "ON sentences.inmate_doc_number = inmate.inmate_doc_number\n", "LIMIT 1000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating New Tables for Future Use \n", "\n", "So far, we've mostly just been exploring the data without making any changes to the database. However, there might be times when we might want to create new tables. We can do this using `CREATE TABLE`. Let's use a previous example to create a new table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "CREATE TABLE joinedtable AS\n", "SELECT * \n", "FROM sentences\n", "JOIN inmate\n", "ON sentences.inmate_doc_number = inmate.inmate_doc_number\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This should look mostly familiar, since everything after the first line is stuff we've already done. The first line creates a new table called `joinedtable` from the output.\n", "\n", "This is a bit of a mess, though. We usually don't need everything from the tables that we do join, so we can choose what we keep. Let's create a new table that has just the information we need." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "CREATE TABLE joinedtable2 AS\n", "SELECT sentences.inmate_doc_number, inmate.inmate_race_code, sentences.actual_sentence_end_date \n", "FROM sentences\n", "JOIN inmate\n", "ON sentences.inmate_doc_number = inmate.inmate_doc_number\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, notice that we use aliasing to help make refering to tables easier. That is, in the third and fourth lines, we put \"`a`\" and \"`b`\" after each table to give it that alias. We can then use \"`a`\" and \"`b`\" whenever we refer to either table, which makes the `SELECT` statement easier. \n", "\n", "Along those lines, notice that we specify which table each variable was from. If the column name is unique between the two tables (i.e. both tables don't have a column with the same name), then you don't need to specify the table as we've done. However, if they aren't unique and both tables have a variable with that name, you need to specify which one you want.\n", "\n", "Lastly, we've made the table easier to read by changing the name of the variable in the new table, using `AS` in the `SELECT` part of the query. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dropping Tables\n", "\n", "Conversely, you can also drop, or delete, tables. We created a table in the previous section that we won't need, so let's drop it." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "ename": "java.lang.RuntimeException", "evalue": " No datasource", "output_type": "error", "text": "java.lang.RuntimeException: No datasource", "traceback": [ "\u001b[1;31mjava.lang.RuntimeException: No datasource\u001b[0;0m" ] } ], "source": [ "DROP TABLE joinedtable;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might be tempted to avoid dropping tables since it seems relatively harmless to simply not use the table anymore without dropping them. However, it is important to keep databases clean and consider the amount of space each table takes up. " ] } ], "metadata": { "kernelspec": { "display_name": "SQL", "language": "SQL", "name": "sql" }, "language_info": { "codemirror_mode": "sql", "file_extension": ".sql", "mimetype": "", "name": "SQL", "nbconverter_exporter": "", "version": "" } }, "nbformat": 4, "nbformat_minor": 4 }