{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", " \n", " Website\n", " \n", "
\n", "\n", "Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, and Jonathan Morgan." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Visualization in Python\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Contents\n", "- [Introduction](#Introduction)\n", " - [Learning Objectives](#Learning-Objectives)\n", "- [Python Setup](#Python-Setup)\n", "- [Load the Data](#Load-the-Data)\n", "- [Our First Chart in `matplotlib`](#Our-First-Chart-in-matplotlib)\n", " - [A Note on Data Sourcing](#A-Note-on-Data-Sourcing)\n", " - [Layering in `matplotlib`](#Layering-in-matplotlib)\n", "- [Introducing seaborn](#Introducing-seaborn)\n", " - [Combining `seaborn` and `matplotlib`](#Combining-seaborn-and-matplotlib)\n", "- [Exploring cohort employment](#Exploring-cohort-employment)\n", " - [Saving Charts as a Variable](#Saving-Charts-as-a-Variable)\n", "- [Exporting Completed Graphs](#Exporting-Completed-Graphs)\n", "- [Choosing a Data Visualization Package](#Choosing-a-Data-Visualization-Package)\n", " - [An Important Note on Graph Titles](#An-Important-Note-on-Graph-Titles)\n", "- [Additional Resources](#Additional-Resources)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "- Back to [Table of Contents](#Table-of-Contents)\n", "\n", "In this module, you will learn to quickly and flexibly make a wide series of visualizations for exploratory data analysis and communicating to your audience. This module contains a practical introduction to data visualization in Python and covers important rules that any data visualizer should follow.\n", "\n", "### Learning Objectives\n", "\n", "* Become familiar with a core base of data visualization tools in Python - specifically matplotlib and seaborn\n", "\n", "* Begin exploring what visualizations are going to best reveal various types of patterns in your data\n", "\n", "* Learn more about our primary datasets data with exploratory analyses and visualizations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python Setup\n", "- Back to [Table of Contents](#Table-of-Contents)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# data manipulation in Python\n", "import pandas as pd\n", "\n", "# visualization packages\n", "import matplotlib.pyplot as plt \n", "import seaborn as sns\n", "\n", "# database connection\n", "from sqlalchemy import create_engine\n", "\n", "# see how long queries/etc take\n", "import time\n", "\n", "# so images get plotted in the notebook\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the Data\n", "- Back to [Table of Contents](#Table-of-Contents)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# set up sqlalchemy engine\n", "host = 'stuffed.adrf.info'\n", "DB = 'appliedda'\n", "\n", "connection_string = \"postgresql://{}/{}\".format(host, DB)\n", "conn = create_engine(connection_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will continue exploring a similar selection of data as we ended with in the [Dataset Exploration](02_2_Dataset_Exploration.ipynb) notebook.\n", "\n", "**SQL code to generate the tables we'll explore below**\n", "\n", "Table 1: `tanf_cohort_2006q4`: study cohort in this notebook, individuals who finished a TANF spell in Q4 of 2006\n", 
"\n", " CREATE TABLE ada_tanf.tanf_cohort_2006q4 AS\n", " SELECT DISTINCT ON (i.recptno) i.recptno, i.start_date, i.end_date, \n", " m.birth_date, m.ssn_hash, sex, rac, rootrace\n", " FROM il_dhs.ind_spells i\n", " LEFT JOIN il_dhs.member m\n", " ON i.recptno = m.recptno\n", " WHERE end_date >= '2006-10-01'::date AND \n", " end_date < '2007-01-01'::date\n", " AND benefit_type = 'tanf46';\n", " \n", " -- age at end/beginning of spell\n", " ALTER TABLE ada_tanf.tanf_cohort_2006q4 ADD COLUMN age_end numeric, ADD COLUMN age_start numeric;\n", " UPDATE ada_tanf.tanf_cohort_2006q4 SET (age_start, age_end) =\n", " (extract(epoch from age(start_date, birth_date))/(3600.*24*365), \n", " extract(epoch from age(end_date, birth_date))/(3600.*24*365));\n", " \n", " -- add duration of spell\n", " ALTER TABLE ada_tanf.tanf_cohort_2006q4 ADD COLUMN spell_dur int;\n", " UPDATE ada_tanf.tanf_cohort_2006q4 SET spell_dur = end_date - start_date;\n", " \n", " -- add indexes\n", " CREATE INDEX ON ada_tanf.tanf_cohort_2006q4 (recptno);\n", " CREATE INDEX ON ada_tanf.tanf_cohort_2006q4 (ssn_hash);\n", " CREATE INDEX ON ada_tanf.tanf_cohort_2006q4 (start_date);\n", " CREATE INDEX ON ada_tanf.tanf_cohort_2006q4 (end_date);\n", " \n", " -- change owner to schema's admin group\n", " ALTER TABLE ada_tanf.tanf_cohort_2006q4 OWNER TO ada_tanf_admin;\n", " \n", " -- good practice to VACUUM (although DB does it periodically)\n", " VACUUM FULL ada_tanf.tanf_cohort_2006q4;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Table 2: `tanf_cohort_2006q4_jobs`\n", "\n", " -- create job view for the 2006q4 cohort\n", " CREATE TABLE ada_tanf.tanf_cohort_2006q4_jobs AS\n", " SELECT \n", " -- job identifiers\n", " ssn, ein, seinunit, empr_no, year, quarter,\n", " -- individual's earnings at this job\n", " wage AS earnings\n", " FROM il_des_kcmo.il_wage\n", " WHERE year IN (2005, 2006, 2007)\n", " AND ssn IN \n", " (SELECT ssn_hash FROM ada_tanf.tanf_cohort_2006q4);\n", " \n", " -- add indexes\n", " CREATE INDEX ON ada_tanf.tanf_cohort_2006q4_jobs (ssn);\n", " CREATE INDEX ON ada_tanf.tanf_cohort_2006q4_jobs (ein);\n", " CREATE INDEX ON ada_tanf.tanf_cohort_2006q4_jobs (seinunit);\n", " CREATE INDEX ON ada_tanf.tanf_cohort_2006q4_jobs (empr_no);\n", " CREATE INDEX ON ada_tanf.tanf_cohort_2006q4_jobs (year);\n", " CREATE INDEX ON ada_tanf.tanf_cohort_2006q4_jobs (quarter);\n", " \n", " -- change owner to schema's admin group\n", " ALTER TABLE ada_tanf.tanf_cohort_2006q4_jobs OWNER TO ada_tanf_admin;\n", " \n", " -- good practice to VACUUM (although DB does it periodically)\n", " VACUUM FULL ada_tanf.tanf_cohort_2006q4_jobs;" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# get dataframe of study cohort\n", "start_time = time.time()\n", "query = \"\"\"\n", "SELECT * FROM ada_tanf.tanf_cohort_2006q4;\n", "\"\"\"\n", "\n", "# read the data, and parse the dates so we can use datetime functions\n", "df = pd.read_sql(query, conn, parse_dates=['start_date', 'end_date', 'birth_date'])\n", "\n", "# report how long reading this data frame took\n", "print('data read in {:.2f} seconds'.format(time.time()-start_time))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get DataFrame of cohort jobs\n", "\n", "start_time = time.time()\n", "\n", "query = \"\"\"\n", "SELECT * FROM 
ada_tanf.tanf_cohort_2006q4_jobs;\n", "\"\"\"\n", "\n", "df_jobs = pd.read_sql(query, conn)\n", "\n", "print('data read in {:.2f} seconds'.format(time.time()-start_time))" ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_jobs.info()" ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# how many of our cohort have a job in the `il_wage` dataset\n", "df_jobs['ssn'].nunique() # .nunique() returns the number of unique values" ] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Visual data exploration with `matplotlib`\n", "- Back to [Table of Contents](#Table-of-Contents)\n", "\n", "Under the hood, `pandas` uses `matplotlib` to produce visualizations. `matplotlib` is the most commonly used base visualization package and provides low-level access to different chart characteristics (e.g., tick mark labels)." ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# view a simple histogram of the age distribution of our cohort\n", "df.hist(column='age_end')" ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# one default we may want to change for histograms is the number of bins\n", "df.hist(column='age_end', bins=50)" ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.hist('spell_dur')" ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# spell duration is in days, so what is the duration in years?\n", "(df['spell_dur']/365.).hist()" ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# aside: the \".\" after 365 is a holdover from Python 2,\n", "# where dividing two integers performed integer division;\n", "# in Python 3 the / operator always returns floating point values\n", "print('type with \".\" is: {}'.format((df['spell_dur']/365.).dtypes))\n", "print('type without \".\" is: {}'.format((df['spell_dur']/365).dtypes))\n", "\n", "# to get integers in Python 3 you can use floor division:\n", "print('floor division type is: {}'.format((df['spell_dur']//365).dtypes))" ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# how many spells last over 5 years?\n", "((df['spell_dur']/365.)>5).value_counts()" ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the .value_counts() function can also report percentages by using the 'normalize' parameter:\n", "((df['spell_dur']/365.)>5).value_counts(normalize=True)" ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "# what is the overall distribution of earnings across all the jobs?\n", "df_jobs.hist(column='earnings', bins=50)" ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the simple histogram produced above shows a lot of small earnings values\n", "# what is the distribution of the higher values?\n", "df_jobs['earnings'].describe(percentiles = [.01, .1, .25, .5, .75, .9, .95, .99, .999])" ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## We can see a long tail in the earnings per job\n", "## let's subset to below the 99th percentile and make a histogram\n", "subset_values = df_jobs['earnings'] < df_jobs['earnings'].quantile(.99)\n", "df_jobs[subset_values].hist(column='earnings', bins=50)" ] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Note in the above cell we split subsetting the data into two steps:\n", "1. 
We created `subset_values`, which is simply a Series of True and False values\n", "2. Then we selected all rows in the `df_jobs` dataframe where `subset_values` was True" ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## We can change options within the hist function (e.g. number of bins, color, transparency):\n", "df_jobs[subset_values].hist(column='earnings', bins=20, facecolor=\"purple\", alpha=0.5, figsize=(10,6))\n", "\n", "## And we can change the plot options with `plt` (which is our alias for matplotlib.pyplot)\n", "plt.xlabel('Job earnings ($)')\n", "plt.ylabel('Number of jobs')\n", "plt.title('Distribution of jobs by earnings for 2006q4 cohort')\n", "\n", "## And add Data sourcing:\n", "### xy are measured in percent of axes length, from bottom left of graph:\n", "plt.annotate('Source: IL Depts of Employment Security and Human Services', \n", " xy=(0.5,-0.15), xycoords=\"axes fraction\")\n", "\n", "## We use plt.show() to display the graph once we are done setting options:\n", "plt.show()" ] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### A Note on Data Sourcing\n", "- Back to [Table of Contents](#Table-of-Contents)\n", "\n", "Data sourcing is a critical aspect of any data visualization. Although here we are simply referencing the agencies that created the data, it is ideal to provide as direct a path as possible for the viewer to find the data the graph is based on. When this is not possible (e.g. the data is sequestered), directing the viewer to documentation or methodology for the data is a good alternative. Regardless, providing clear sourcing for the underlying data is an **absolute requirement** of any respectable visualization; it builds trust and enables reproducibility." ] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Layering in `matplotlib`\n", "- Back to [Table of Contents](#Table-of-Contents)\n", "\n", "As briefly demonstrated above by changing the labels and adding the source, we can make consecutive changes to the same plot; that means we can also layer multiple plots on the same `figure`. By default, the first graph you create will be on the bottom, with following graphs layered on top." ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# to demonstrate simple layering\n", "# we will create a histogram of 2005 and 2007 earnings\n", "# similar to that already demonstrated above\n", "\n", "plt.hist(df_jobs[subset_values & (df_jobs['year']==2005)]['earnings'], facecolor=\"blue\", alpha=0.6)\n", "plt.hist(df_jobs[subset_values & (df_jobs['year']==2007)]['earnings'], facecolor=\"orange\", alpha=0.6)\n", "\n", "plt.annotate('Source: IL Depts of Employment Security and Human Services', \n", " xy=(0.5,-0.15), xycoords=\"axes fraction\")\n", "plt.show()" ] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Introducing `seaborn`\n", "- Back to [Table of Contents](#Table-of-Contents)\n", "\n", "`seaborn` is a popular visualization package built on top of `matplotlib` that makes some of the more cumbersome graphs easier to produce; however, it does not give direct access to the lower-level objects in a `figure` (more on that later)."
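] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Because `seaborn` draws with `matplotlib` under the hood, its plotting functions hand back the `matplotlib` `Axes` they drew on, which we can keep adjusting. The cell below is a minimal sketch of that idea (the boxplot chart type is our own illustration, not part of the original walkthrough):" ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sketch: seaborn returns the matplotlib Axes it drew on,\n", "# so matplotlib methods keep working on seaborn output\n", "ax = sns.boxplot(x='year', y='earnings', data=df_jobs[subset_values])\n", "ax.set_ylabel('Job earnings ($)')\n", "ax.set_title('Earnings by year for 99% of jobs held by 2006q4 cohort')\n", "plt.show()"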
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "## Barplot function in seaborn\n", "sns.barplot(x='year', y='earnings', data=df_jobs)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What values does the above plot actually show us? Let's use the `help()` function to check the details of the `seaborn.barplot()` function we called above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help(sns.barplot)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the documentation, we can see that there is an `estimator` function that by default is `mean`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Barplot using sum of earnings rather than the default mean\n", "sns.barplot(x='year', y='earnings', data=df_jobs, estimator=sum)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "## Seaborn has a great series of charts for showing different cuts of data\n", "sns.factorplot(x='quarter', y='earnings', hue='year', data=df_jobs, kind='bar')\n", "plt.show()\n", "\n", "## Other options for the 'kind' argument can be found in the documentation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining `seaborn` and `matplotlib` \n", "- Back to [Table of Contents](#Table-of-Contents)\n", "\n", "There are many excellent data visualiation modules available in Python, but for the tutorial we will stick to the tried and true combination of `matplotlib` and `seaborn`.\n", "\n", "Below, we use `seaborn` for setting an overall aesthetic style and then faceting (created small multiples). We then use `matplotlib` to set very specific adjustments - things like adding the title, adjusting the locations of the plots, and sizing th graph space. This is a pretty protoyptical use of the power of these two libraries together. \n", "\n", "More on [`seaborn`'s set_style function](https://seaborn.pydata.org/generated/seaborn.set_style.html).\n", "More on [`matplotlib`'s figure (fig) API](https://matplotlib.org/api/figure_api.html)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "## Seaborn offers a powerful tool called FacetGrid for making small multiples of matplotlib graphs:\n", "\n", "### Create an empty set of grids:\n", "facet_histograms = sns.FacetGrid(df_jobs[subset_values], row='year', col='quarter')\n", "\n", "## \"map' a histogram to each grid:\n", "facet_histograms = facet_histograms.map(plt.hist, 'earnings')\n", "\n", "## Data Sourcing:\n", "plt.annotate('Source: IL Depts of Employment Security and Human Services', \n", " xy=(0.5,-0.35), xycoords=\"axes fraction\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Seaborn's set_style function allows us to set many aesthetic parameters.\n", "sns.set_style(\"whitegrid\")\n", "\n", "facet_histograms = sns.FacetGrid(df_jobs[subset_values], row='year', col='quarter')\n", "facet_histograms = facet_histograms.map(plt.hist, 'earnings')\n", "\n", "## We can still change options with matplotlib, using facet_histograms.fig\n", "facet_histograms.fig.subplots_adjust(top=0.9)\n", "facet_histograms.fig.suptitle(\"Earnings for 99% of the jobs held by 2006q4 cohort\", fontsize=14)\n", "facet_histograms.fig.set_size_inches(12,8)\n", "\n", "## Data Sourcing:\n", "facet_histograms.fig.text(x=0.5, y=-0.05, s='Source: IL Depts of Employment Security and Human Services',\n", " fontsize=12)\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring cohort employment\n", "\n", "Question: what are employment patterns of our cohort?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# reminder of what columns we have in our two DataFrames\n", "print(df.columns.tolist())\n", "print('') # just to add a space\n", "print(df_jobs.columns.tolist())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# also check the total rows in the two datasets, and the number of unique individuals in our jobs data\n", "print(df.shape[0])\n", "print(df_jobs.shape[0], df_jobs['ssn'].nunique())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# how many in our cohort had any job during each quarter\n", "df_jobs.groupby(['year', 'quarter'])['ssn'].nunique().plot(kind='bar')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# did individuals have more than one job in a given quarter?\n", "df_jobs.groupby(['year', 'quarter', 'ssn'])['ein'].count().sort_values(ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many people were employed in the same pattern of quarters over our 3 year period?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# count the number of jobs each individual had in each quarter\n", "# where a \"job\" is simply that they had a record in the IDES data\n", "df_tmp = df_jobs.groupby(['year', 'quarter', 'ssn'])['ein'].count().unstack(['year', 'quarter'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_tmp.head(1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# flatten all columns to a single name with an '_' separator:\n", "df_tmp.columns = ['_'.join([str(c) for c in col]) for col in df_tmp.columns.values]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_tmp.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# replace NaN with 0\n", "df_tmp.fillna(0, inplace=True)\n", "\n", "# and set values >0 to 1\n", "df_tmp[df_tmp>0] = 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# make \"ssn\" a column instead of an index - then we can count it when we group by the 'year_q' columns\n", "df_tmp.reset_index(inplace=True)\n", "df_tmp.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# make a list of just the columns that start with '2005' or 2006\n", "cols = [c for c in df_tmp.columns.values if c.startswith('2005') | c.startswith('2006')]\n", "\n", "print(cols)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# aside on the above \"list comprehension\", here are the same steps one by one:\n", "\n", "# 1- get an array of our columns\n", "column_list = df_tmp.columns.values\n", "\n", "# 2 - loop through each value in the array\n", "for c in column_list:\n", " # 3 - check if the string starts with either '2005' or '2006'\n", " if c.startswith('2005') | c.startswith('2006'):\n", " # 4 - add the column to our new list (here we just print to demonstrate)\n", " print(c)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# group by all columns to count number of people with the same pattern\n", "df_tmp = df_tmp.groupby(cols)['ssn'].count().reset_index()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('There are {} different patterns of employment in 2005 and 2006'.format(df_tmp.shape[0]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# total possible patterns of employment\n", "poss_patterns = 2**len(cols)\n", "\n", "pct_of_patterns = 100 * df_tmp.shape[0] / poss_patterns\n", "\n", "print('With this definition of employment, our cohort shows {:.1f}% of the possible patterns'.format(pct_of_patterns))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Look at just the top 10:\n", "df_tmp.sort_values('ssn', ascending=False).head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# and how many people follow other patterns\n", "df_tmp.sort_values('ssn', ascending=False).tail(df_tmp.shape[0]-10)['ssn'].sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# grab the top 10 for a visualization\n", "df_tmp_top = df_tmp.sort_values('ssn', ascending=False).head(10).reset_index()" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# drop old index\n", "df_tmp_top.drop(columns='index', inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('percent of employed in top 10 patterns is {:.1f}%'.format(100*df_tmp_top['ssn'].sum()/df_tmp['ssn'].sum()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculate percentage of cohort in each group:\n", "df_tmp_top['pct_cohort'] = df_tmp_top['ssn'].astype(float) / df['ssn_hash'].nunique()\n", "df_tmp_top.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A heatmap using Seaborn" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# visualize with a simple heatmap\n", "sns.heatmap(df_tmp_top[cols])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default visualization leaves a lot to be desired. Now let's customize the same heatmap." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create the matplotlib object so we can tweak graph properties later\n", "fig, ax = plt.subplots(figsize = (14,8))\n", "\n", "# create the list of labels we want on our y-axis\n", "ylabs = ['{:.2f}%'.format(x*100) for x in df_tmp_top['pct_cohort']]\n", "\n", "# make the heatmap\n", "sns.heatmap(df_tmp_top[cols], linewidths=0.01, linecolor='grey', yticklabels=ylabs, cbar=False, cmap=\"Blues\")\n", "\n", "# make y-labels horizontal and change tickmark font size\n", "plt.yticks(rotation=360, fontsize=12)\n", "plt.xticks(fontsize=12)\n", "\n", "# add axis labels\n", "ax.set_ylabel('Percent of cohort', fontsize=16)\n", "ax.set_xlabel('Quarter', fontsize=16)\n", "\n", "## Data Sourcing:\n", "ax.annotate('Source: IL Depts of Employment Security and Human Services', \n", " xy=(0.5,-0.15), xycoords=\"axes fraction\", fontsize=12)\n", "\n", "## add a title\n", "fig.suptitle('Top 10 most common employment patterns of cohort', fontsize=18)\n", "ax.set_title('Blue is \"employed\" and white is \"not employed\"', fontsize=12)\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Decision trees\n", "\n", "Decision trees are a useful visualization when exploring how important your \"features\" (aka \"right-hand variables\", \"explanatory variables\", etc) are in predicting your \"label\" (aka \"outcome\") - we will revisit these concepts much more in the Machine Learning portion of the program. For now, we're going to use the data we have been exploring above to demonstrate creating and visualizing a Decision Tree." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# our \"label\" will just be if our cohort was present in the wage data after 2006\n", "\n", "# get the list of SSN's present after 2006:\n", "employed = df_jobs[df_jobs['year']>2006]['ssn'].unique()\n", "\n", "df['label'] = df['ssn_hash'].isin(employed)\n", "\n", "# how many of our cohort are \"employed\" after exiting TANF:\n", "df['label'].value_counts(normalize=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# set which columns to use as our \"features\"\n", "sel_features = ['sex', 'rac', 'rootrace', 'age_end', 'age_start', 'spell_dur']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# additional imports to create and visualize tree\n", "\n", "# we will revisit sklearn during the Machine Learning portions of the program\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "# packages to display a tree in Jupyter notebooks\n", "from sklearn.externals.six import StringIO\n", "from IPython.display import Image\n", "from sklearn.tree import export_graphviz\n", "import graphviz as gv\n", "import pydotplus" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create Tree to visualize, \n", "# here we'll set maximum tree depth to 3 but you should try different values\n", "dtree = DecisionTreeClassifier(max_depth=3)\n", "\n", "# fit our data\n", "dtree.fit(df[sel_features],df['label'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# visualize the tree\n", "\n", "# object to hold the graphviz data\n", "dot_data = StringIO()\n", "\n", "# create the visualization\n", "export_graphviz(dtree, out_file=dot_data, filled=True,\n", " rounded=True, special_characters=True,\n", " feature_names=df[sel_features].columns.values)\n", "\n", "# convert to a graph from the data\n", "graph = pydotplus.graph_from_dot_data(dot_data.getvalue())\n", "\n", "# display the graph in our notebook\n", "Image(graph.create_png())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> what does this tree tell us about our data?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exporting Completed Graphs\n", "- Back to [Table of Contents](#Table-of-Contents)\n", "\n", "When you are satisfied with your visualization, you may want to save a a copy outside of your notebook. You can do this with `matplotlib`'s savefig function. You simply need to run:\n", "\n", "plt.savefig(\"fileName.fileExtension\")\n", "\n", "The file extension is actually surprisingly important. Image formats like png and jpeg are actually **not ideal**. These file formats store your graph as a giant grid of pixels, which is space-efficient, but can't be edited later. Saving your visualizations instead as a PDF is strongly advised. PDFs are a type of vector image, which means all the components of the graph will be maintained.\n", "\n", "With PDFs, you can later open the image in a program like Adobe Illustrator and make changes like the size or typeface of your text, move your legends, or adjust the colors of your visual encodings. All of this would be impossible with a png or jpeg." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "## Let's save the employement patterns heatmap we created earlier\n", "## below just copied and pasted from above:\n", "\n", "# Create the matplotlib object so we can tweak graph properties later\n", "fig, ax = plt.subplots(figsize = (14,8))\n", "\n", "# create the list of labels we want on our y-axis\n", "ylabs = ['{:.2f}%'.format(x*100) for x in df_tmp_top['pct_cohort']]\n", "\n", "# make the heatmap\n", "sns.heatmap(df_tmp_top[cols], linewidths=0.01, linecolor='grey', yticklabels=ylabs, cbar=False, cmap=\"Blues\")\n", "\n", "# make y-labels horizontal and change tickmark font size\n", "plt.yticks(rotation=360, fontsize=12)\n", "plt.xticks(fontsize=12)\n", "\n", "# add axis labels\n", "ax.set_ylabel('Percent of cohort', fontsize=16)\n", "ax.set_xlabel('Quarter', fontsize=16)\n", "\n", "## Data Sourcing:\n", "ax.annotate('Source: IL Depts of Employment Security and Human Services', \n", " xy=(0.5,-0.15), xycoords=\"axes fraction\", fontsize=12)\n", "\n", "## add a title\n", "fig.suptitle('Top 10 most common employment patterns of cohort', fontsize=18)\n", "ax.set_title('Blue is \"employed\" and white is \"not employed\"', fontsize=12)\n", "\n", "fig.savefig('./output/cohort2006q4_empl_patterns.pdf')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# and the decision tree we made - note since we created the \"graph\" object we \n", "# do not need to completely reproduce the tree itself\n", "\n", "graph.write_pdf('./output/cohort2006q4_dtree.pdf')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Choosing a Data Visualization Package\n", "\n", "- Back to [Table of Contents](#Table-of-Contents)\n", "\n", "You can read more about different options for data visualization in Python in the [Additional Resources](#Additional-Resources) section at the bottom of this notebook. \n", "\n", "`matplotlib` is very expressive, meaning it has functionality that can easily account for fine-tuned graph creation and adjustment. However, this also means that `matplotlib` is somewhat more complex to code.\n", "\n", "`seaborn` is a higher-level visualization module, which means it is much less expressive and flexible than matplotlib, but far more concise and easier to code.\n", "\n", "It may seem like we need to choose between these two approaches, but this is not the case! Since `seaborn` is itself written in `matplotlib` (you will sometimes see `seaborn` be called a `matplotlib` 'wrapper'), we can use `seaborn` for making graphs quickly and then `matplotlib` for specific adjustments. When you see `plt` referenced in the code below, we are using `matplotlib`'s pyplot submodule.\n", "\n", "\n", "`seaborn` also improves on `matplotlib` in important ways, such as the ability to more easily visualize regression model results, creating small multiples, enabling better color palettes, and improve default aesthetics. From [`seaborn`'s documentation](https://seaborn.pydata.org/introduction.html):\n", "\n", "> If matplotlib 'tries to make easy things easy and hard things possible', seaborn tries to make a well-defined set of hard things easy too. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### An Important Note on Graph Titles\n", "- Back to [Table of Contents](#Table-of-Contents)\n", "\n", "The title of a visualization occupies the most valuable real estate on the page. 
If nothing else, you can be reasonably sure a viewer will at least read the title and glance at your visualization. This is why you want to put thought into making a clear and effective title that acts as a **narrative** for your chart. Many novice visualizers default to a merely **descriptive** title, something like: \"Average Wages Over Time (2006-2016)\". This title is correct - it just isn't very useful, since any good graph will already have explained what it shows through the axes and legends. Instead, use the title to reinforce and explain the core point of the visualization. It should answer the question \"Why is this graph important?\" and focus the viewer on the most critical take-away." ] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Additional Resources\n", "- Back to [Table of Contents](#Table-of-Contents)\n", "\n", "* [Data-Viz-Extras](../notebooks_additional/Data-Viz-extras.ipynb) notebook in the \"notebooks_additional\" folder\n", "\n", "* [A Thorough Comparison of Python's DataViz Modules](https://dsaber.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair)\n", "\n", "* [Seaborn Documentation](http://seaborn.pydata.org)\n", "\n", "* [Matplotlib Documentation](https://matplotlib.org)\n", "\n", "* [Advanced Functionality in Seaborn](https://blog.insightdatalabs.com/advanced-functionality-in-seaborn)\n", "\n", "* Other Python Visualization Libraries:\n", " * [`Bokeh`](http://bokeh.pydata.org)\n", " * [`Altair`](https://altair-viz.github.io)\n", " * [`ggplot`](http://ggplot.yhathq.com)\n", " * [`Plotly`](https://plot.ly)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" }, "toc": { "nav_menu": { "height": "272px", "width": "241px" }, "number_sections": false, "sideBar": true, "skip_h1_title": false, "toc_cell": false, "toc_position": { "height": "829px", "left": "0px", "right": "1548px", "top": "110px", "width": "301px" }, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }