{ "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" }, "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook will go over one of the easiest ways to graph data from your [Amazon Redshift data warehouse](http://aws.amazon.com/redshift/) using [Plotly's public platform](https://plot.ly/) for publishing beautiful, interactive graphs from Python to the web.\n", "\n", "[Plotly's Enterprise platform](https://plot.ly/product/enterprise/) allows for an easy way for your company to build and share graphs without the data leaving your servers." ] }, { "cell_type": "code", "collapsed": true, "input": [ "from __future__ import print_function #python 3 support\n", "\n", "import plotly.plotly as py\n", "from plotly.graph_objs import *\n", "import plotly.tools as tls\n", "import pandas as pd\n", "import os\n", "import requests\n", "requests.packages.urllib3.disable_warnings() # this squashes insecure SSL warnings - DO NOT DO THIS ON PRODUCTION!" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we'll be using [Amazon's Sample Redshift Data](http://docs.aws.amazon.com/redshift/latest/gsg/rs-gsg-create-sample-db.html) for this notebook. Although we won't be connecting through a JDBC/ODBC connection we'll be using the [psycopg2 package](http://initd.org/psycopg/docs/index.html) with [SQLAlchemy](http://www.sqlalchemy.org/) and [pandas](http://pandas.pydata.org/) to make it simple to query and analyze our data.\n", "\n", "###Packages\n", "\n", "- Pandas\n", "- psycopg2\n", "- SQLAlchemy\n", "\n", "###Information you need to get started\n", "\n", "You'll need your [Redshift Endpoint URL](http://docs.aws.amazon.com/redshift/latest/gsg/rs-gsg-connect-to-cluster.html) in order to access your Redshift instance. I've obscured mine below but yours will be in a format similar to `datawarehouse.some_chars_here.region_name.redshift.amazonaws.com`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Connecting to Redshift is made extremely simple once you've set your cluster configuration. This configuration needs to include the username, password, port, host and database name. I've opted to store mine as environmental variables on my machine.\n" ] }, { "cell_type": "code", "collapsed": true, "input": [ "redshift_endpoint = os.getenv(\"REDSHIFT_ENDPOINT\")\n", "redshift_user = os.getenv(\"REDSHIFT_USER\")\n", "redshift_pass = os.getenv(\"REDSHIFT_PASS\")\n", "port = 5439\n", "dbname = 'dev'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "As I mentioned there are numerous ways to connect to a Redshift databause and I've included two below. We can use either the SQLAlchemy package or we can use the psycopg2 package for a more direct access. \n", "\n", "Both will allow us to execute SQL queries and get results however the SQLAlchemy engine makes it a bit easier to directly return our data as a dataframe using pandas. Plotly has a tight integration with pandas as well, making it extremely easy to make interactive graphs to share with your company." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "SQLAlchemy" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sqlalchemy import create_engine\n", "engine_string = \"postgresql+psycopg2://%s:%s@%s:%d/%s\" \\\n", "% (redshift_user, redshift_pass, redshift_endpoint, port, dbname)\n", "engine = create_engine(engine_string)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Psycopg2" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import psycopg2\n", "conn = psycopg2.connect(\n", " host=\"datawarehouse.cm4z2iunjfsc.us-west-2.redshift.amazonaws.com\", \n", " user=redshift_user, \n", " port=port, \n", " password=redshift_pass, \n", " dbname=dbname)\n", "cur = conn.cursor() # create a cursor for executing queries" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Loading in Data\n", "\n", "This next section goes over loading in the sample data from Amazon's sample database. This is strictly for the purposes of the tutorial so feel free to skim this section if you're going to be working with your own data.\n", "\n", "-----------------START DATA LOADING-----------------" ] }, { "cell_type": "code", "collapsed": true, "input": [ "cur.execute(\"\"\"drop table users;\n", "\n", "drop table venue;\n", "\n", "drop table category;\n", "\n", "drop table date;\n", "\n", "drop table event;\n", "\n", "drop table listing;\n", "\n", "drop table sales;\"\"\")\n", "conn.commit()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": null }, { "cell_type": "code", "collapsed": false, "input": [ "aws_key = os.getenv(\"AWS_ACCESS_KEY_ID\") # needed to access S3 Sample Data\n", "aws_secret = os.getenv(\"AWS_SECRET_ACCESS_KEY\")\n", "\n", "base_copy_string = \"\"\"copy %s from 's3://awssampledbuswest2/tickit/%s.txt' \n", "credentials 'aws_access_key_id=%s;aws_secret_access_key=%s' \n", "delimiter '%s';\"\"\" # the base COPY string that we'll be using\n", "\n", "#easily generate each table that we'll need to COPY data from\n", "tables = [\"users\", \"venue\", \"category\", \"date\", \"event\", \"listing\"]\n", "data_files = [\"allusers_pipe\", \"venue_pipe\", \"category_pipe\", \"date2008_pipe\", \"allevents_pipe\", \"listings_pipe\"]\n", "delimiters = [\"|\", \"|\", \"|\", \"|\", \"|\", \"|\", \"|\"]\n", "\n", "#the generated COPY statements we'll be using to load data;\n", "copy_statements = []\n", "for tab, f, delim in zip(tables, data_files, delimiters):\n", " copy_statements.append(base_copy_string % (tab, f, aws_key, aws_secret, delim))\n", "\n", "# add in Sales data, delimited by '\\t'\n", "copy_statements.append(\"\"\"copy sales from 's3://awssampledbuswest2/tickit/sales_tab.txt' \n", "credentials 'aws_access_key_id=%s;aws_secret_access_key=%s' \n", "delimiter '\\t' timeformat 'MM/DD/YYYY HH:MI:SS';\"\"\" % (aws_key, aws_secret))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": null }, { "cell_type": "code", "collapsed": false, "input": [ "# Create Table Statements\n", "cur.execute(\"\"\"\n", "create table users(\n", "\tuserid integer not null distkey sortkey,\n", "\tusername char(8),\n", "\tfirstname varchar(30),\n", "\tlastname varchar(30),\n", "\tcity varchar(30),\n", "\tstate char(2),\n", "\temail varchar(100),\n", "\tphone char(14),\n", "\tlikesports boolean,\n", "\tliketheatre boolean,\n", "\tlikeconcerts boolean,\n", "\tlikejazz boolean,\n", "\tlikeclassical boolean,\n", "\tlikeopera boolean,\n", "\tlikerock boolean,\n", "\tlikevegas boolean,\n", "\tlikebroadway boolean,\n", "\tlikemusicals boolean);\n", "\n", "create table venue(\n", "\tvenueid smallint not null distkey sortkey,\n", "\tvenuename varchar(100),\n", "\tvenuecity varchar(30),\n", "\tvenuestate char(2),\n", "\tvenueseats integer);\n", "\n", "create table category(\n", "\tcatid smallint not null distkey sortkey,\n", "\tcatgroup varchar(10),\n", "\tcatname varchar(10),\n", "\tcatdesc varchar(50));\n", "\n", "create table date(\n", "\tdateid smallint not null distkey sortkey,\n", "\tcaldate date not null,\n", "\tday character(3) not null,\n", "\tweek smallint not null,\n", "\tmonth character(5) not null,\n", "\tqtr character(5) not null,\n", "\tyear smallint not null,\n", "\tholiday boolean default('N'));\n", "\n", "create table event(\n", "\teventid integer not null distkey,\n", "\tvenueid smallint not null,\n", "\tcatid smallint not null,\n", "\tdateid smallint not null sortkey,\n", "\teventname varchar(200),\n", "\tstarttime timestamp);\n", "\n", "create table listing(\n", "\tlistid integer not null distkey,\n", "\tsellerid integer not null,\n", "\teventid integer not null,\n", "\tdateid smallint not null sortkey,\n", "\tnumtickets smallint not null,\n", "\tpriceperticket decimal(8,2),\n", "\ttotalprice decimal(8,2),\n", "\tlisttime timestamp);\n", "\n", "create table sales(\n", "\tsalesid integer not null,\n", "\tlistid integer not null distkey,\n", "\tsellerid integer not null,\n", "\tbuyerid integer not null,\n", "\teventid integer not null,\n", "\tdateid smallint not null sortkey,\n", "\tqtysold smallint not null,\n", "\tpricepaid decimal(8,2),\n", "\tcommission decimal(8,2),\n", "\tsaletime timestamp);\"\"\")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": null }, { "cell_type": "code", "collapsed": false, "input": [ "for copy_statement in copy_statements: # execute each COPY statement\n", " cur.execute(copy_statement)\n", "conn.commit()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": null }, { "cell_type": "code", "collapsed": false, "input": [ "for table in tables + [\"sales\"]:\n", " cur.execute(\"select count(*) from %s;\" % (table,)) \n", " print(cur.fetchone())\n", "conn.commit() # make sure data went through and commit our statements permanently." ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----------------END DATA LOADING-----------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've loaded some data into our Redshift cluster, we can start running queries against it.\n", "\n", "We're going to start off by exploring and presenting some of our user's tastes and habits. Pandas makes it easy to query our data base and get back a dataframe in return. In this query, I'm simply getting the preferences of our users. What kinds of events do they like?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df = pd.read_sql_query(\"\"\"\n", "SELECT sum(likesports::int) as sports, sum(liketheatre::int) as theatre, \n", "sum(likeconcerts::int) as concerts, sum(likejazz::int) as jazz, \n", "sum(likeclassical::int) as classical, sum(likeopera::int) as opera, \n", "sum(likerock::int) as rock, sum(likevegas::int) as vegas, \n", "sum(likebroadway::int) as broadway, sum(likemusicals::int) as musical, \n", "state\n", "FROM users \n", "GROUP BY state\n", "ORDER BY state asc;\n", "\"\"\", engine)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that I've gotten a DataFrame back, let's make a quick heatmap using plotly." ] }, { "cell_type": "code", "collapsed": false, "input": [ "data = Data([\n", " Heatmap(\n", " z = df.drop('state', axis=1).values,\n", " x = df.drop('state', axis=1).columns,\n", " y = df.state,\n", " colorscale = 'Hot'\n", " )\n", " ])\n", "layout = Layout(title=\"State and Music Tastes\", yaxis=YAxis(autotick=False, dtick=1))\n", "py.iplot(Figure(data=data, layout=layout), filename='redshift/state and music taste heatmap', height=1000)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ "" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "*the above graph is interactive, click and drag to zoom, double click to return to initial layout, shift click to pan*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This graph is simple to produce and even more simple to explore. The interactivity makes it great for those that aren't completely familiar with heatmaps.\n", "\n", "Looking at this particular one we can easily get a sense of popularity. We can see here that sports events don't seem to be particularly popular among our users and that certain states have much higher preferences (and possibly users) than others.\n", "\n", "A common next step might be to create some box plots of these user preferences." ] }, { "cell_type": "code", "collapsed": false, "input": [ "layout = Layout(title=\"Declared User Preference Box Plots\", \n", " yaxis=YAxis())\n", "\n", "data = []\n", "for pref in df.drop('state', axis=1).columns:\n", " # for every preference type, make a box plot\n", " data.append(Box(y=df[pref], name=pref)) \n", " \n", "py.iplot(Figure(data=data, layout=layout), filename='redshift/user preference box plots')" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ "" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "*the above graph is interactive, click and drag to zoom, double click to return to initial layout, shift click to pan*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems to be that sports are just a bit more compressed than the rest. This may be because there's simply fewer people interested in sports or our company doesn't have many sporting events." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've explored a little bit about some of our customers we've stumbled upon this sports anomoly. Are we listing less sports events? Do we sell approximately the same amount of all event types and our users just aren't drawn to sports events? \n", "\n", "We've got to understand a bit more and to do so we'll be plotting a simple bar graph of our event information." ] }, { "cell_type": "code", "collapsed": false, "input": [ "df = pd.read_sql_query(\"\"\"\n", "SELECT sum(event.catid) as category_sum, catname as category_name\n", "FROM event, category\n", "where event.catid = category.catid\n", "GROUP BY category.catname\n", "\"\"\", engine)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "layout = Layout(title=\"Event Categories Sum\", yaxis=YAxis(title=\"Sum\"))\n", "data = [Bar(x=df.category_name, y=df.category_sum)]\n", "py.iplot(Figure(data=data, layout=layout))" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's a good thing we started exploring this data because we've got to rush to management and report the discrepancy between our users' preferences and the kinds of events that we're hosting! Luckily, sharing plotly's graphs is extremely easy using the `play with this data` link at the bottom right.\n", "\n", "However for our report, let's dive a bit deeper into the events that we're listing and when we're listing them. Maybe we're trending upwards with certain event types?" ] }, { "cell_type": "code", "collapsed": true, "input": [ "df = pd.read_sql_query(\"\"\"\n", "SELECT sum(sales.qtysold) as quantity_sold, date.caldate \n", "FROM sales, date\n", "WHERE sales.dateid = date.dateid \n", "GROUP BY date.caldate \n", "ORDER BY date.caldate asc;\n", "\"\"\", engine)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "layout = Layout(title=\"Event Sales Per Day\", yaxis=YAxis(title=\"Sales Quantity\"))\n", "data = [Scatter(x=df.caldate, y=df.quantity_sold)]\n", "py.iplot(Figure(data=data, layout=layout))" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ "" ] } ], "prompt_number": 11 }, { "cell_type": "raw", "metadata": {}, "source": [ "Overall it seems inconclusive except that our events seem to be seasonal. This aggregate graph doesn't show too much so it's likely worth exploring a bit more about each category." ] }, { "cell_type": "code", "collapsed": true, "input": [ "df = pd.read_sql_query(\"\"\"\n", "SELECT sum(sales.qtysold) as quantity_sold, date.caldate, category.catname as category_name \n", "FROM sales, date, event, category\n", "WHERE sales.dateid = date.dateid \n", "AND sales.eventid = event.eventid\n", "AND event.catid = category.catid\n", "GROUP BY date.caldate, category_name\n", "ORDER BY date.caldate asc;\n", "\"\"\", engine)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's always great to try and better understand which graph type conveys your message the best. Sometimes subplots do the best and other times it's best to put them all on one graph. Plotly makes it easy to do either one!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "data = []\n", "for count, (name, g) in enumerate(df.groupby(\"category_name\")):\n", " data.append(Scatter(\n", " name=name,\n", " x=g.caldate,\n", " y=g.quantity_sold,\n", " xaxis='x' + str(count + 1),\n", " yaxis='y' + str(count + 1)\n", " ))\n", "\n", "fig = tls.make_subplots(rows=2,cols=2)\n", "fig['layout'].update(title=\"Event Sales Per Day By Category\")\n", "fig['data'] += data\n", "py.iplot(fig)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "This is the format of your plot grid:\n", "[ (1,1) x1,y1 ] [ (1,2) x2,y2 ]\n", "[ (2,1) x3,y3 ] [ (2,2) x4,y4 ]\n", "\n" ] }, { "html": [ "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ "" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above subplots seem to tell an interesting story although it's important to note that with subplots the axes are not always aligned. So let's try plotting all of them together, with lines for each category." ] }, { "cell_type": "code", "collapsed": false, "input": [ "data = []\n", "for name, g in df.groupby(\"category_name\"):\n", " data.append(Scatter(\n", " name=name,\n", " x=g.caldate,\n", " y=g.quantity_sold\n", " ))\n", "\n", "fig = Figure()\n", "fig['layout'].update(title=\"Event Sales Per Day By Category\")\n", "fig['data'] += data\n", "py.iplot(fig, filename='redshift/Event Sales Per Day by Category')" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ "" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "This looks much better and explains the story perfectly. It seems that all of our events are fairly regular through the year except for a spike in musicals and plays around March. This might be of interest to so I'm going to mark up this graph and share it with some of the relevant sales representatives in my company. \n", "\n", "The rest of my team can edit the graph with me in a web app. Collaborating does not require coding, emailing, or downloading software. I can even fit a function to the data in the web app." ] }, { "cell_type": "code", "collapsed": true, "input": [ "from IPython.display import Image" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "Image(url=\"http://i.imgur.com/nUVihzx.png\")" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ "" ] } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "tls.embed(\"https://plot.ly/~bill_chambers/195\")" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ "" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotly makes it easier for data analysts and data scientists to share data in meaningful ways. By marking up drawings and embedding comments on the graph, I can make sure that I'm sharing everything within a context. Rather than having to send a static image, I can share an interactive plot a coworker can explore and understand as well. Plotly makes it easy for companies to make sure that information is conveyed in the right context.\n", "\n", "Learn more about:\n", "- [Amazon Redshift Data Warehouse](http://aws.amazon.com/redshift/)\n", "- [Plotly Enterprise - Plotly Hosted on your servers](https://plot.ly/product/enterprise/)\n", "- [Subplots in Plotly](https://plot.ly/python/subplots/)\n", "- [Creating a plot of best fit](https://plot.ly/online-graphing/tutorials/create-a-line-of-best-fit-online/)" ] } ], "metadata": {} } ] }