{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SQL Murder Mystery\n", "\n", "The [*SQL Murder Mystery*](https://github.com/NUKnightLab/sql-mysteries/) is *a self-directed lesson to learn SQL concepts and commands and a fun game for experienced SQL users to solve an intriguing crime* published by the Northwestern University Knight Lab and based on Noah Veltman's [*Command Line Murder Mystery*](https://github.com/veltman/clmystery).\n", "\n", "To solve the mystery, you need to download a small database which contains all the clues you need, *if* you can work out what SQL commands to use to find them.\n", "\n", "Using a notebook makes sense for this sort of investigation because it can help you keep track of, and notes on, the current state of your investigation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This original database is distributed as a SQLite database, but this notebook shows how to load it into, and access it from, the PostgreSQL database distributed as part of the TM351 VM.\n", "\n", "*Instructions for downloading a Virtualbox image of the TM351 VM, or building the VM on a remote host, can be found [here](https://github.com/innovationOUtside/tm351vm/wiki).*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download and Install the Database\n", "\n", "We can download the original SQLite3 database directly from the original repository:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2019-10-08 11:33:31-- https://github.com/NUKnightLab/sql-mysteries/raw/master/sql-murder-mystery.db\n", "Resolving github.com (github.com)... 140.82.118.4\n", "Connecting to github.com (github.com)|140.82.118.4|:443... connected.\n", "HTTP request sent, awaiting response... 302 Found\n", "Location: https://raw.githubusercontent.com/NUKnightLab/sql-mysteries/master/sql-murder-mystery.db [following]\n", "--2019-10-08 11:33:31-- https://raw.githubusercontent.com/NUKnightLab/sql-mysteries/master/sql-murder-mystery.db\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 3002368 (2.9M) [application/octet-stream]\n", "Saving to: ‘sql-murder-mystery.db’\n", "\n", "sql-murder-mystery. 100%[===================>] 2.86M 3.91MB/s in 0.7s \n", "\n", "2019-10-08 11:33:33 (3.91 MB/s) - ‘sql-murder-mystery.db’ saved [3002368/3002368]\n", "\n", "sql-murder-mystery.db\n" ] } ], "source": [ "!wget https://github.com/NUKnightLab/sql-mysteries/raw/master/sql-murder-mystery.db\n", "\n", "!ls *.db" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To export the data tables from the SQLite database into Postgres, we need to do a couple of things:\n", " \n", "- install the [`pgloader`](https://github.com/dimitri/pgloader) utility which provides a wide range of tools for loading variously formatted datasets into Postgres;\n", "- create a Postgres user and database to store the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install the `pgloader` utility\n", "\n", "The `pgloader` utility is a Linux package that we can install using `apt-get`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "!sudo apt-get update && sudo apt-get install -y pgloader && sudo apt-get clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up the Postgres Database\n", "\n", "To keep things tidy, we can create a specific database to import the data into, along with a specific Postgres user with access permissions on that database.\n", "\n", "We'll also import some tools and extensions to make working with the database easier." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The sql extension is already loaded. To reload it, use:\n", " %reload_ext sql\n" ] } ], "source": [ "from tm351_utils.db import *\n", "%load_ext sql\n", "%load_ext schemadisplay_magic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make sure we don't contaminate any of the evidence by starting with a clean slate:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NOTICE: database \"mysterydb\" does not exist, skipping\r\n", "DROP DATABASE\r\n" ] } ], "source": [ "#clearConnections(\"mysterydb\")\n", "!echo \"DROP DATABASE IF EXISTS mysterydb\" | sudo -u postgres psql" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create an empty database to import the data into:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "! createdb mysterydb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's also create a user account for our investigating officer with permissions over the database:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CREATE ROLE\n", "GRANT\n" ] } ], "source": [ "# Create a new user...\n", "!echo \"CREATE USER detective WITH PASSWORD 'clouseau';\" | sudo -u postgres psql\n", "\n", "# ...with permissions over the new database\n", "!echo \"GRANT ALL PRIVILEGES ON DATABASE mysterydb TO detective;\" | sudo -u postgres psql" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a Database Connection String and Import the Tables\n", "\n", "Create a connection string for connecting to the database, set from identifiable parameter values." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'postgresql://detective:clouseau@localhost:5432/mysterydb'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DB_ENGINE='postgresql' # tells ipython-sql that we will use postgreSQL as our database engine\n", "DB_USER='detective' # id of the user who is logging in\n", "DB_PWD='clouseau' # the user's password\n", "DB_ADDR='localhost:5432' # the host and port on which the database engine is listening\n", "DB_NAME='mysterydb' # the name of the database\n", "\n", "DB_CONNECTION = '{engine}://{user}:{pwd}@{addr}/{name}'.format(engine=DB_ENGINE,\n", " user=DB_USER,\n", " pwd=DB_PWD,\n", " addr=DB_ADDR,\n", " name=DB_NAME)\n", "DB_CONNECTION" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now use this connection string as the target location for importing the data from the SQLite database into the Postgres database:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2019-10-08T11:33:45.030000Z LOG Main logs in '/tmp/pgloader/pgloader.log'\n", "2019-10-08T11:33:45.035000Z LOG Data errors in '/tmp/pgloader/'\n", "2019-10-08T11:33:45.237000Z WARNING PostgreSQL warning: table \"crime_scene_report\" does not exist, skipping\n", "2019-10-08T11:33:45.237000Z WARNING PostgreSQL warning: table \"drivers_license\" does not exist, skipping\n", "2019-10-08T11:33:45.237000Z WARNING PostgreSQL warning: table \"person\" does not exist, skipping\n", "2019-10-08T11:33:45.237000Z WARNING PostgreSQL warning: table \"facebook_event_checkin\" does not exist, skipping\n", "2019-10-08T11:33:45.237000Z WARNING PostgreSQL warning: table \"interview\" does not exist, skipping\n", "2019-10-08T11:33:45.237000Z WARNING PostgreSQL warning: table \"get_fit_now_member\" does not exist, skipping\n", "2019-10-08T11:33:45.237000Z WARNING PostgreSQL warning: table \"get_fit_now_check_in\" does not exist, skipping\n", "2019-10-08T11:33:45.237000Z WARNING PostgreSQL warning: table \"income\" does not exist, skipping\n", "2019-10-08T11:33:45.237000Z WARNING PostgreSQL warning: table \"solution\" does not exist, skipping\n", " table name read imported errors time\n", "---------------------- --------- --------- --------- --------------\n", " fetch 0 0 0 0.000s\n", " fetch meta data 9 9 0 0.026s\n", " create, truncate 0 0 0 0.033s\n", "---------------------- --------- --------- --------- --------------\n", " crime_scene_report 1228 1228 0 0.080s\n", " drivers_license 10007 10007 0 0.293s\n", " person 10011 10011 0 0.243s\n", "facebook_event_checkin 20011 20011 0 0.440s\n", " interview 4991 4991 0 0.107s\n", " get_fit_now_member 184 184 0 0.053s\n", " get_fit_now_check_in 2703 2703 0 0.067s\n", " income 7514 7514 0 0.074s\n", " solution 0 0 0 0.007s\n", "index build completion 0 0 0 0.000s\n", "---------------------- --------- --------- --------- --------------\n", " Create Indexes 0 0 0 0.000s\n", " Reset Sequences 0 0 0 0.010s\n", "---------------------- --------- --------- --------- --------------\n", " Total streaming time 56649 56649 0 1.433s\n" ] } ], "source": [ "!pgloader ./sql-murder-mystery.db $DB_CONNECTION" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Connect to the Database\n", "\n", "We can use the [`ipython-sql`](https://github.com/catherinedevlin/ipython-sql) magic, which is preinstalled in the TM351 VM, to connect to the database, which will simplify running queries over it:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Connected: detective@mysterydb'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql $DB_CONNECTION" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The magic returns results as `pandas` dataframes. The following setting ensures we can read the full content of any result." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "pd.set_option('display.max_colwidth', -1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To orientate ourselves to the contents of the database, we can display the table schema:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "G\n", "\n", "\n", "income\n", "income\n", "\n", "- ssn : BIGINT\n", "- annual_income : BIGINT\n", "\n", "\n", "\n", "get_fit_now_member\n", "get_fit_now_member\n", "\n", "- id : TEXT\n", "- person_id : BIGINT\n", "- name : TEXT\n", "- membership_start_date : BIGINT\n", "- membership_status : TEXT\n", "\n", "\n", "\n", "interview\n", "interview\n", "\n", "- person_id : BIGINT\n", "- transcript : TEXT\n", "\n", "\n", "\n", "crime_scene_report\n", "crime_scene_report\n", "\n", "- date : BIGINT\n", "- type : TEXT\n", "- description : TEXT\n", "- city : TEXT\n", "\n", "\n", "\n", "person\n", "person\n", "\n", "- id : BIGINT\n", "- name : TEXT\n", "- license_id : BIGINT\n", "- address_number : BIGINT\n", "- address_street_name : TEXT\n", "- ssn : BIGINT\n", "\n", "\n", "\n", "solution\n", "solution\n", "\n", "- user : BIGINT\n", "- value : TEXT\n", "\n", "\n", "\n", "get_fit_now_check_in\n", "get_fit_now_check_in\n", "\n", "- membership_id : TEXT\n", "- check_in_date : BIGINT\n", "- check_in_time : BIGINT\n", "- check_out_time : BIGINT\n", "\n", "\n", "\n", "drivers_license\n", "drivers_license\n", "\n", "- id : BIGINT\n", "- age : BIGINT\n", "- height : BIGINT\n", "- eye_color : TEXT\n", "- hair_color : TEXT\n", "- gender : TEXT\n", "- plate_number : TEXT\n", "- car_make : TEXT\n", "- car_model : TEXT\n", "\n", "\n", "\n", "facebook_event_checkin\n", "facebook_event_checkin\n", "\n", "- person_id : BIGINT\n", "- event_id : BIGINT\n", "- event_name : TEXT\n", "- date : BIGINT\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%schema --connection_string $DB_CONNECTION" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A helper function also lets us view the tables in the database:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tablenametableowner
0crime_scene_reportdetective
1drivers_licensedetective
2persondetective
3facebook_event_checkindetective
4interviewdetective
5get_fit_now_memberdetective
6get_fit_now_check_indetective
7incomedetective
8solutiondetective
\n", "
" ], "text/plain": [ " tablename tableowner\n", "0 crime_scene_report detective\n", "1 drivers_license detective\n", "2 person detective\n", "3 facebook_event_checkin detective\n", "4 interview detective\n", "5 get_fit_now_member detective\n", "6 get_fit_now_check_in detective\n", "7 income detective\n", "8 solution detective" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "showTables(DB_NAME)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also review the columns contained within a table (omit the table name to see columns listed for *all* tables in the specified database)." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column_nametable_name
0citycrime_scene_report
1datecrime_scene_report
2descriptioncrime_scene_report
3typecrime_scene_report
\n", "
" ], "text/plain": [ " column_name table_name\n", "0 city crime_scene_report\n", "1 date crime_scene_report\n", "2 description crime_scene_report\n", "3 type crime_scene_report" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "showColumns(DB_NAME, 'crime_scene_report', user=DB_USER, password=DB_PWD)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Let the Investigation Begin\n", "\n", "Check out the original guidance for the mystery [here](https://github.com/NUKnightLab/sql-mysteries/blob/master/prompt_beginner.pdf); additional hints are also available within the [original repository](https://github.com/NUKnightLab/sql-mysteries/).\n", "\n", "In brief, you're going to be investigating a __murder__ that took place on __January 15th, 2018__ in __SQL City__.\n", "\n", "To get you started, here's the original crime report:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * postgresql://detective:***@localhost:5432/mysterydb\n", "1 rows affected.\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datetypedescriptioncity
020180115murderSecurity footage shows that there were 2 witnesses. The first witness lives at the last house on \"Northwestern Dr\". The second witness, named Annabel, lives somewhere on \"Franklin Ave\".SQL City
\n", "
" ], "text/plain": [ " date type \\\n", "0 20180115 murder \n", "\n", " description \\\n", "0 Security footage shows that there were 2 witnesses. The first witness lives at the last house on \"Northwestern Dr\". The second witness, named Annabel, lives somewhere on \"Franklin Ave\". \n", "\n", " city \n", "0 SQL City " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT * FROM crime_scene_report WHERE date=20180115 AND city='SQL City' AND type='murder';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now get to it..." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 2 }