{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# *Storing Storm Data*\n", "\n", "\n", "**International Hurricane Watchgroup (IHW) has been asked to update their analysis tools. Because of the increase in public awareness of hurricanes, they are required to be more diligient with the analysis of historical hurricane data they share across the organization. They have been facing trouble sharing data across the teams and keeping it consistent.

From what we've been told, their method of sharing data with their analysts has been to save a CSV file on their local servers and have every analyst pull the data down. Each analyst then uses a local SQLite engine to store the CSV, run their queries, and send their results around. We've been given a CSV [file](https://dq-content.s3.amazonaws.com/251/storm_data.csv).

\n", "Data Dictionary:**\n", "\n", "> fid - ID for the row
\n", "> year - Recorded year
\n", "> month - Recorded month
\n", "> day - Recorded date
\n", "> ad_time - Recorded time in UTC
\n", "> btid - Hurricane ID
\n", "> name - Name of the hurricane
\n", "> lat - Latitude of the recorded location
\n", "> long - Longitude of the recorded location
\n", "> wind_kts - Wind speed in knots per second
\n", "> pressure - Atmospheric pressure of the hurricane
\n", "> cat - Hurricane category
\n", "> basin - The basin the hurricane is located
\n", "> shape_leng - Hurricane shape length
\n", "\n", "# Aim\n", "\n", "***We want to productionize their services and change the current inefficient way of sharing and querying data.
We will create a database that will accomplish the following:***\n", "\n", "> Have Database for the IHW to store their tables.
\n", "> Have Table in that Database that contains fields detailed in the CSV file (in a Storage Efficient Way)
\n", "> Have Users that can update, read, and insert into a table of the data.
\n", "> Have Users that can only read into a table.
\n", "> And Insert data into the table." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Introduction to Data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FIDYEARMONTHDAYAD_TIMEBTIDNAMELATLONGWIND_KTSPRESSURECATBASINShape_Leng
020011957881800Z63NOTNAMED22.5-140.0500TSEastern Pacific1.140175
1200219611031200Z116PAULINE22.1-140.2450TSEastern Pacific1.166190
2200319628290600Z124C18.0-140.0450TSEastern Pacific2.102380
3200419677140600Z168DENISE16.6-139.5450TSEastern Pacific2.121320
4200519728161200Z251DIANA18.5-139.8700H1Eastern Pacific1.702939
\n", "
" ], "text/plain": [ " FID YEAR MONTH DAY AD_TIME BTID NAME LAT LONG WIND_KTS \\\n", "0 2001 1957 8 8 1800Z 63 NOTNAMED 22.5 -140.0 50 \n", "1 2002 1961 10 3 1200Z 116 PAULINE 22.1 -140.2 45 \n", "2 2003 1962 8 29 0600Z 124 C 18.0 -140.0 45 \n", "3 2004 1967 7 14 0600Z 168 DENISE 16.6 -139.5 45 \n", "4 2005 1972 8 16 1200Z 251 DIANA 18.5 -139.8 70 \n", "\n", " PRESSURE CAT BASIN Shape_Leng \n", "0 0 TS Eastern Pacific 1.140175 \n", "1 0 TS Eastern Pacific 1.166190 \n", "2 0 TS Eastern Pacific 2.102380 \n", "3 0 TS Eastern Pacific 2.121320 \n", "4 0 H1 Eastern Pacific 1.702939 " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import io \n", "import csv\n", "from urllib import request\n", "import pandas as pd\n", "\n", "response = request.urlopen(\"https://dq-content.s3.amazonaws.com/251/storm_data.csv\")\n", "reader = csv.reader(io.TextIOWrapper(response))\n", "\n", "df = pd.read_csv(io.TextIOWrapper(response))\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exploring Various Columns and deciding their required Data Types\n", "\n", "Let's see datatypes of different columns. And how much storage each data type will need." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "FID int64\n", "YEAR int64\n", "MONTH int64\n", "DAY int64\n", "AD_TIME object\n", "BTID int64\n", "NAME object\n", "LAT float64\n", "LONG float64\n", "WIND_KTS int64\n", "PRESSURE int64\n", "CAT object\n", "BASIN object\n", "Shape_Leng float64\n", "dtype: object" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Getting Max length of Numerical Columns**" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['FID', 59228, 5]\n", "['YEAR', 2008, 4]\n", "['MONTH', 12, 2]\n", "['DAY', 31, 2]\n", "['AD_TIME', '1800Z', 5]\n", "['BTID', 1410, 4]\n", "['LAT', 69.0, 4]\n", "['LONG', 180.0, 5]\n", "['WIND_KTS', 165, 3]\n", "['PRESSURE', 1024, 4]\n", "['Shape_Leng', 11.18034, 8]\n" ] } ], "source": [ "columns = list(df.columns)\n", "num_columns = columns[:6] + columns[7:-3] + columns[-1:]\n", "\n", "for item in num_columns:\n", " print([item, max(df[item].value_counts().index), len(str(max(df[item].value_counts().index)))])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Numerical Columns Datatypes for PostgreSQL\n", "\n", "- For `FID`: We wil use INTEGER datatype. Since, it's largest value is 59228.\n", "- Columns `YEAR`, `MONTH`, `DAY` represent any particular date and `AD_TIME` represents a record of the time in [UTC (Coordinated Universal Time)](https://en.wikipedia.org/wiki/Coordinated_Universal_Time): Here, we will combine all of them into a single column and use TIMESTAMP datatype for this column.\n", "- For `BTID`, `WIND_KTS` and `PRESSURE`: We will use SMALLINT. Since, there maximum values are 1410, 165 and 1024 respectively.\n", "- For `LAT` and `LONG`: We will use DECIMAL datatype with precision 4 and scale 1. Since, they have max. 3 digits before decimal and 1 digit after decimal.\n", "- For `Shape_Leng`: We will use DECIMAL datatype with precision 8 and scale 6. Since, it has max. two digits before decimal and 6 digits after decimal." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Getting Maximum Length of String Columns**" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['NAME', 'CAT', 'BASIN']\n", "(9, 'SEBASTIEN')\n", "(2, 'TS')\n", "(15, 'Eastern Pacific')\n" ] } ], "source": [ "str_columns = [columns[x] for x in (6, 11, 12)]\n", "print(str_columns)\n", "\n", "for item in str_columns:\n", " print(max([(len(x), x) for x in df[item].unique()]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### String Columns Datatypes for PostgreSQL\n", "\n", "- For `NAME`:\n", " - We will use VARCHAR(10) since max length is 9\n", "- For `CAT`:\n", " - We will use VARCHAR(2) since max length is 2\n", "- For `BASIN`:\n", " - we will use VARCHAR(16) since max length is 15" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating the Table\n", "\n", "Now we will create `IHW Database` and `Hurricanes Table` to store the values from CSV into PostgreSQL" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Create Database First\n", "\n", "import psycopg2\n", "from datetime import datetime\n", "\n", "conn = psycopg2.connect(\"dbname=postgres user=postgres password=postgres host=localhost\")\n", "conn.autocommit = True\n", "cur = conn.cursor()\n", "\n", "cur.execute(\"DROP DATABASE IF EXISTS ihw\")\n", "cur.execute(\"CREATE DATABASE ihw\")\n", "conn.close()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "conn = psycopg2.connect(\"dbname=ihw user=postgres password=postgres host=localhost\")\n", "conn.autocommit = True\n", "cur = conn.cursor()\n", "cur.execute(\"DROP TABLE IF EXISTS hurricanes\")\n", "\n", "cur.execute(\"\"\"\n", " CREATE TABLE hurricanes (\n", " fid INTEGER PRIMARY KEY,\n", " date TIMESTAMP,\n", " btid SMALLINT,\n", " name VARCHAR(10),\n", " lat DECIMAL(4, 1), \n", " long DECIMAL(4, 1), \n", " wind_kts SMALLINT, \n", " pressure SMALLINT,\n", " category VARCHAR(2),\n", " basin VARCHAR(16),\n", " shape_length DECIMAL(8, 6)\n", " )\n", " \"\"\")\n", "\n", "conn.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`hurricanes table`" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fiddatebtidnamelatlongwind_ktspressurecategorybasinshape_length
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [fid, date, btid, name, lat, long, wind_kts, pressure, category, basin, shape_length]\n", "Index: []" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conn = psycopg2.connect(\"dbname=ihw user=postgres password=postgres host=localhost\")\n", "query = \"SELECT * FROM hurricanes limit 5;\"\n", "pd.read_sql_query(query, conn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating Users\n", "\n", "With the table set up, we will now create two users on the database:\n", "\n", "- One that can `insert`, `update`, and `read` the data but not `delete`. This is like creating a \"data production\" user whose job it is is to always write new and existing data to the table.\n", "\n", "- Second, for the IHW team's analysts to just run read queries on the data." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "conn.close()\n", "\n", "conn = psycopg2.connect(\"dbname=ihw user=postgres password=postgres host=localhost\")\n", "conn.autocommit = True\n", "cur = conn.cursor()\n", "\n", "cur.execute(\"Drop USER IF EXISTS ihw_production\")\n", "cur.execute(\"CREATE USER ihw_production WITH PASSWORD 'ihw.production.whi'\")\n", "\n", "cur.execute(\"Drop USER IF EXISTS ihw_analyst\")\n", "cur.execute(\"CREATE USER ihw_analyst WITH PASSWORD 'ihw.analyst.whi'\")\n", "\n", "cur.execute(\"REVOKE ALL ON hurricanes FROM ihw_production\")\n", "cur.execute(\"REVOKE ALL ON hurricanes FROM ihw_analyst\")\n", "cur.execute(\"GRANT INSERT, UPDATE, SELECT ON hurricanes TO ihw_production\")\n", "cur.execute(\"GRANT SELECT ON hurricanes TO ihw_analyst\")\n", "conn.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating Readonly Group" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "conn = psycopg2.connect(\"dbname=ihw user=postgres password=postgres host=localhost\")\n", "conn.autocommit = True\n", "cur = conn.cursor()\n", "cur.execute(\"DROP GROUP IF EXISTS analysts\")\n", "\n", "cur.execute(\"CREATE GROUP analysts NOLOGIN\")\n", "conn.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inserting Data\n", "\n", "Now we will insert data into the table using `Insert` and `mogrify` method." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "conn = psycopg2.connect(\"dbname=ihw user=ihw_production password=ihw.production.whi host=localhost\")\n", "conn.autocommit = True\n", "cur = conn.cursor()\n", "\n", "response = request.urlopen(\"https://dq-content.s3.amazonaws.com/251/storm_data.csv\")\n", "reader = csv.reader(io.TextIOWrapper(response))\n", "next(reader)\n", "\n", "mogrified_values = []\n", "\n", "for row in reader:\n", " date = datetime(int(row[1]), int(row[2]), int(row[3]), hour=int(row[4][:2]), minute=int(row[4][2:-1]))\n", " updated_row = [row[0], date] + row[5:]\n", "# print(updated_row)\n", "\n", " mogrified = cur.mogrify(\"(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)\", updated_row).decode('utf-8')\n", " mogrified_values.append(mogrified)\n", " \n", "cur.execute(\"INSERT INTO hurricanes VALUES \" + \",\".join(mogrified_values))\n", "conn.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Confirming Data is read into PostgreSQL\n", "\n", "Now, we will run a select query to check if data is loaded into the Database and whether it can be accessed by analysts. This data can now be queried efficiently." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`hurricanes table`" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fiddatebtidnamelatlongwind_ktspressurecategorybasinshape_length
020011957-08-08 18:00:0063NOTNAMED22.5-140.0500TSEastern Pacific1.140175
120021961-10-03 12:00:00116PAULINE22.1-140.2450TSEastern Pacific1.166190
220031962-08-29 06:00:00124C18.0-140.0450TSEastern Pacific2.102380
320041967-07-14 06:00:00168DENISE16.6-139.5450TSEastern Pacific2.121320
420051972-08-16 12:00:00251DIANA18.5-139.8700H1Eastern Pacific1.702939
\n", "
" ], "text/plain": [ " fid date btid name lat long wind_kts pressure \\\n", "0 2001 1957-08-08 18:00:00 63 NOTNAMED 22.5 -140.0 50 0 \n", "1 2002 1961-10-03 12:00:00 116 PAULINE 22.1 -140.2 45 0 \n", "2 2003 1962-08-29 06:00:00 124 C 18.0 -140.0 45 0 \n", "3 2004 1967-07-14 06:00:00 168 DENISE 16.6 -139.5 45 0 \n", "4 2005 1972-08-16 12:00:00 251 DIANA 18.5 -139.8 70 0 \n", "\n", " category basin shape_length \n", "0 TS Eastern Pacific 1.140175 \n", "1 TS Eastern Pacific 1.166190 \n", "2 TS Eastern Pacific 2.102380 \n", "3 TS Eastern Pacific 2.121320 \n", "4 H1 Eastern Pacific 1.702939 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conn = psycopg2.connect(\"dbname=ihw user=ihw_analyst password=ihw.analyst.whi host=localhost\")\n", "query = \"SELECT * FROM hurricanes limit 5;\"\n", "pd.read_sql_query(query, conn)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "conn.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Important:***
Run the following commands if you're unable to create or drop the database. This is usually caused by other open connections to it." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "import psycopg2\n", "conn = psycopg2.connect(\"dbname=ihw user=postgres password=postgres host=localhost\")\n", "conn.autocommit = True\n", "cur = conn.cursor()\n", "\n", "# Block new connections to the ihw database\n", "cur.execute(\"REVOKE CONNECT ON DATABASE ihw FROM public;\")\n", "conn.close()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "conn = psycopg2.connect(\"dbname=ihw user=postgres password=postgres host=localhost\")\n", "conn.autocommit = True\n", "cur = conn.cursor()\n", "\n", "# Terminate every other session connected to the ihw database\n", "cur.execute(\"\"\"SELECT pid, pg_terminate_backend(pid)\n", "FROM pg_stat_activity\n", "WHERE datname = 'ihw' AND pid <> pg_backend_pid();\"\"\")\n", "conn.close()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }