{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Postgres Tutorial\n", "\n", "## Table of contents\n", "\n", "1. [Data Storage](#Data-Storage)\n", "2. [Connecting to PostgreSQL](#Connecting-to-PostgreSQL)\n", "\n", "## Data Storage\n", "[[go back to the top]](#Table-of-contents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial is based on [this blog article about the same topic](https://www.dataquest.io/blog/loading-data-into-postgres/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data storage a crucial part of any data system. It is important to understand how your data can be stored and accessed. **Data engineers** build architectures for data platforms enabling data scientists to query their data. They are responsible for building data pipelines that connect the pieces of the data ecosystem. They are familiar with:\n", "- Working with large datasets\n", "- Automation of intensive queries\n", "- The architecture of robust data platforms\n", "\n", "Each part of the pipeline below is built/maintained by a data engineer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook I will describe one example **relational databases**, [`PostgreSQL`](https://www.postgresql.org/) more specifically. `PostgreSQL` is one of the main open source relational databases. It has advantages such as:\n", "- It is easily accessible in cloud providers such as AWS\n", "- It is highly stable\n", "- It is open source\n", "\n", "The `Python` library `psycopg2` one can create tables and load data into a local running `PostgreSQL` servers." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: psycopg2 in /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages (2.7.6.1)\r\n" ] } ], "source": [ "!pip install psycopg2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`SQLite` is a `SQL` engine where all the data is saved onto a *one* file. Since `SQLite` only allows a single process to write to the database and therefore it is not built for multiple connections, it has limited usefulness in a data production systems. `PostgreSQL` is **implemented as a server rather than a single file** hence it accepts connections from any clients who can request e.g. `SELECT` or any other query. `PostgreSQL` has a client-server model design (clients can interact with the server)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to [this blog](https://www.dataquest.io/blog/loading-data-into-postgres/):\n", "\n", "> Anytime you are accessing a website, your browser (the client) will continually be requesting website data from the server \n", "\n", "When using `PostgreSQL`, the connecting client will use `database=specific` requests following a defined protocol\"\n", "\n", "> A protocol is the language that both the client and server will use when the client requests data and the server responds with data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Connecting to PostgreSQL\n", "[[go back to the top]](#Table-of-contents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`psycopg2` is type of client that \"speaks the database protocol\" described earlier. In the code below, the database name `dbname` and user are chosen. \n", "> Because of the multiple connections, Postgres uses multiple users and databases as a way to improve security and division of data. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import psycopg2\n", "conn = psycopg2.connect(dbname=\"marcotavora\", user=\"marcotavora\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The return value of the `connect()` method is a `Connection` object." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "print(conn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Interacting with the database\n", "[[go back to the top]](#Table-of-contents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The connection object `conn` \n", "> creates a client session with the database server that instantiates a persistant client to speak with. To issue commands against the database, you will also need to create another object called the `Cursor` object. `Cursor` is created by the `Connection` object and using the `Cursor` object we will be able to execute our commands. To execute commands on the Postgres database, you call the execute method on the Cursor object with a stringified SQL command. \n", "\n", "To get the returned values from the query, we need `fetchone()` or `fetchall()`. The former returns the first row result and the latter returns a list of each row in the table." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import psycopg2\n", "conn = psycopg2.connect(dbname=\"marcotavora\", \n", " user=\"marcotavora\")\n", "cur = conn.cursor()\n", "cur.execute('SELECT * FROM users')\n", "one = cur.fetchone()\n", "all = cur.fetchall()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " The first element of the list all is:\n", "\n", " (0, 'cbenjamin@yahoo.com', 'Joseph Kirby', '3594 Fox Ford Apt. 192 West Kristen GA 22838-8977') \n", "\n", " The second element of the list all is:\n", "\n", " (1, 'morganlopez@matthews-hickman.com', 'Erin Figueroa', '64763 Li Meadows Apt. 554 New Marcoton MA 99016-6159') \n", "\n", " The third element of the list all is:\n", "\n", " (2, 'ypark@russo.biz', 'Leon Matthews', '91144 Hamilton Manors Suite 421 Ronaldland WA 98705')\n" ] } ], "source": [ "print(' The first element of the list all is:\\n\\n',one,'\\n')\n", "print(' The second element of the list all is:\\n\\n',all[0],'\\n')\n", "print(' The third element of the list all is:\\n\\n',all[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a table\n", "[[go back to the top]](#Table-of-contents)\n", "\n", "We create a table as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " CREATE TABLE tableName(\n", " column1 dataType1 PRIMARY KEY,\n", " column2 dataType2,\n", " column3 dataType3,\n", " ...\n", " );" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Each column is a placeholder for the column name, dataType is the data type you want to store for that column, and PRIMARY KEY is an example of an optional parameter to add on to the table. In Postgres, every table requires at least one PRIMARY KEY column that contains a unique set of values. Let's now take a look at the CSV file we wish to load into the database (note that the CSV does not contain real users but are randomly generated users using a Python library called faker)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import psycopg2\n", "conn = psycopg2.connect(dbname=\"marcotavora\", \n", " user=\"marcotavora\")\n", "cur = conn.cursor()\n", "cur.execute(\"\"\"CREATE TABLE new_users_table(id integer PRIMARY KEY, email text, name text, address text)\"\"\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cur" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inserting the data\n", "[[go back to the top]](#Table-of-contents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us load the `csv` file into the database issuing an INSERT command on the table. Using the `INSERT` command, we can insert into the `new_users_table` table using `pyscopg2`. In more details:\n", "- Write a string `INSERT` SQL command for the `execute()` method. \n", "- Format the string with all the values" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'id,email,name,address\\n'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "First 2 rows (the first one was skipped):\n", "\n" ] }, { "data": { "text/plain": [ "[['0',\n", " 'cbenjamin@yahoo.com',\n", " 'Joseph Kirby',\n", " '3594 Fox Ford Apt. 192 West Kristen GA 22838-8977'],\n", " ['1',\n", " 'morganlopez@matthews-hickman.com',\n", " 'Erin Figueroa',\n", " '64763 Li Meadows Apt. 554 New Marcoton MA 99016-6159']]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import csv\n", "with open('user_accounts_new.csv', 'r') as f:\n", " next(f)\n", " reader = csv.reader(f)\n", " lst = [row for row in reader]\n", "\n", "n = 2\n", "print('First {} rows (the first one was skipped):\\n'.format(n))\n", "lst[0:n]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " import csv\n", " import psycopg2\n", "\n", " conn = psycopg2.connect(dbname=\"marcotavora\", user=\"marcotavora\")\n", " cur = conn.cursor()\n", "\n", " with open('user_accounts_new.csv', 'r') as f:\n", " reader = csv.reader(f)\n", " next(reader) \n", " for row in reader:\n", " cur.execute(\"INSERT INTO new_users_table VALUES (%s, %s, %s, %s)\",row)\n", "\n", " conn.commit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is inefficient since we had to loop through every row from the `csv`. There is a better way." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Copying the data\n", "[[go back to the top]](#Table-of-contents)\n", "\n", "The `copy_from` method loads a file directly into a table and like `execute()` method, it is attached to the `Cursor` object. The `copy_from` arguments requires:\n", "- a file to load (excluding the header)\n", "- the `tablename` it should load into\n", "- a delimiter (the key argument sep)\n", "- running `commit()` transfers the file" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'id,email,name,address\\n'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import psycopg2\n", "\n", "# conn = psycopg2.connect(dbname=\"marcotavora\", \n", "# user=\"marcotavora\")\n", "# cur = conn.cursor()\n", "with open('user_accounts_new.csv', 'r') as f:\n", " next(f) \n", " cur.copy_from(f, \n", " 'new_users_table', sep=',')\n", " \n", "conn.commit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We finished loading the `user_accounts_new.csv` file into our table. The summary to [Dataquest](https://www.dataquest.io/blog/loading-data-into-postgres/) follows:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " The first element of the list all is:\n", "\n", " (0, 'cbenjamin@yahoo.com', 'Joseph Kirby', '3594 Fox Ford Apt. 192 West Kristen GA 22838-8977') \n", "\n", " The second element of the list all is:\n", "\n", " (1, 'morganlopez@matthews-hickman.com', 'Erin Figueroa', '64763 Li Meadows Apt. 554 New Marcoton MA 99016-6159') \n", "\n", " The third element of the list all is:\n", "\n", " (2, 'ypark@russo.biz', 'Leon Matthews', '91144 Hamilton Manors Suite 421 Ronaldland WA 98705')\n" ] } ], "source": [ "import psycopg2\n", "# conn = psycopg2.connect(dbname=\"marcotavora\", \n", "# user=\"marcotavora\")\n", "# cur = conn.cursor()\n", "cur.execute('SELECT * FROM new_users_table')\n", "one = cur.fetchone()\n", "all = cur.fetchall()\n", "\n", "print(' The first element of the list all is:\\n\\n',one,'\\n')\n", "print(' The second element of the list all is:\\n\\n',all[0],'\\n')\n", "print(' The third element of the list all is:\\n\\n',all[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "[[go back to the top]](#Table-of-contents)\n", "\n", "> - Postgres uses the client-server model to enable multiple connections to the database.\n", "- Using the popular `psycopg2` library, we can use Python to connect to PostgreSQL.\n", "- Postgres is type sensitive so we have to declare types on each of our columns.\n", "- Postgres uses SQL transactions to save the state of the database.\n", "- The most efficient way to load files into Postgres tables is to use COPY, or the `psycopg2.copy_from()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }