{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "

Visualizing Change Using Time-Series Line Charts\n", "

by Nick Heitzman\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Contents \n", " \n", "#### Introduction\n", "[Introduction](#intro) \n", "\n", "#### Obtain Population Data\n", "[Import Basic Libraries](#libs) \n", "[Retrieve Population Data from Quandl](#quandl) \n", " \n", "#### Methods for Visualizing Change\n", "[Plot the Data](#plot) \n", "[Subplots](#subplot) \n", "[Dual Y-Axes](#dualaxes) \n", "[Periodic Change](#change) \n", "[Periodic Percent Change](#pctchange) \n", "[Indexing Data](#index) \n", "\n", "\n", "#### Conclusion\n", "[Conclusion](#conclusion) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Time-series data visualizations are everywhere. While these charts are understood by people in every profession, effectively communicating change over time can present unexpected challenges. When creating any type of visualization, it is important to first determine the message you would like to communicate. The increased popularity of exploratory data visualization tools such as Tableau and Microsoft Power BI makes it easy to forget this step. These tools let users connect to databases and click around until they find the prettiest visualization, which often leads to ineffective visualizations with no explicit purpose. \n", "\n", "When creating time-series line charts, it’s important to consider which of the following you would like to communicate:\n", "-\tActual value of units?\n", "-\tChange in absolute units? \n", "-\tPercent change?\n", "-\tChange from a specific point in time? \n", "\n", "Ultimately, no chart can communicate all of these effectively. It is important to recognize this, determine which message is most important, and then design your visual accordingly. 
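
To make the four candidate messages concrete, here is a minimal sketch in plain Python (no pandas, so the arithmetic stays visible). The helper names absolute_change, percent_change, and index_to_base are hypothetical and exist only for this illustration; the input values are the USA population figures for 2010-2014 that appear in the DataFrame output later in this notebook.

```python
# Hypothetical sketch: the four candidate messages, computed for one series.
# Values are the USA population figures for 2010-2014 shown later in this notebook.
usa = [309347057, 311721632, 314112078, 316497531, 318857056]


def absolute_change(values):
    # Change in absolute units between consecutive periods.
    return [curr - prev for prev, curr in zip(values, values[1:])]


def percent_change(values):
    # Percent change between consecutive periods.
    return [100.0 * (curr - prev) / prev for prev, curr in zip(values, values[1:])]


def index_to_base(values, base=100.0):
    # Change from a specific point in time: the first observation becomes `base`.
    return [base * v / values[0] for v in values]


print('actual values:  ', usa)
print('absolute change:', absolute_change(usa))
print('percent change: ', [round(p, 2) for p in percent_change(usa)])
print('indexed to 100: ', [round(i, 2) for i in index_to_base(usa)])
```

Each helper answers exactly one of the four questions in the list, which is the point: a single line on a chart can carry only one of these transformations at a time.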
\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "### Import Basic Libraries" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "%matplotlib inline\n", "import pandas as pd\n", "import Quandl as qd\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "#Nick's Quandl Auth token\n", "auth = '9zjPBpsaLGqS-KPGzvyn'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Retrieve Population Data from Quandl" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### What is Quandl?\n", "\n", "Quandl is an online data warehouse that hosts millions of public datasets. Quandl's API is set up to pull data directly into a Pandas DataFrame, and it automatically sets the date as the index. For more info on using Quandl with Python, visit: https://www.quandl.com/help/python \n", " \n", "Quandl houses the World Bank's public data. The north_america_codes.json file contains the Quandl dataset code for the total population series of each country in North America, including Central America and the Caribbean. \n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_codes = pd.read_json('north_america_codes.json')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>code</th><th>country</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>WORLDBANK/USA_SP_POP_TOTL</td><td>USA</td></tr>\n",
"<tr><th>1</th><td>WORLDBANK/CAN_SP_POP_TOTL</td><td>Canada</td></tr>\n",
"<tr><th>10</th><td>WORLDBANK/HTI_SP_POP_TOTL</td><td>Haiti</td></tr>\n",
"<tr><th>11</th><td>WORLDBANK/JAM_SP_POP_TOTL</td><td>Jamaica</td></tr>\n",
"<tr><th>12</th><td>WORLDBANK/KNA_SP_POP_TOTL</td><td>Saint Kitts and Nevis</td></tr>\n",
"</tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " code country\n", "0 WORLDBANK/USA_SP_POP_TOTL USA\n", "1 WORLDBANK/CAN_SP_POP_TOTL Canada\n", "10 WORLDBANK/HTI_SP_POP_TOTL Haiti\n", "11 WORLDBANK/JAM_SP_POP_TOTL Jamaica\n", "12 WORLDBANK/KNA_SP_POP_TOTL Saint Kitts and Nevis" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_codes.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Retrieve Data\n", "\n", "Using the Quandl API, I loop through each country to pull population data. As each country's data is pulled, it is concatenated into a single Pandas DataFrame (df)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "# Start with an empty DataFrame and add one column per country\n", "df = pd.DataFrame()\n", "\n", "for x in df_codes.code:\n", "    df_temp = qd.get(x, authtoken=auth)\n", "    # Characters 10-12 of the code are the three-letter country code, e.g. 'USA'\n", "    df_temp.rename(columns={'Value': x[10:13]}, inplace=True)\n", "\n", "    if df.empty:\n", "        df = df_temp\n", "    else:\n", "        df = pd.concat([df, df_temp], axis=1)\n", "\n", "df.columns = [x.lower() for x in df.columns]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Data Munging\n", "\n", "I then calculate the total for North America. For the purpose of this analysis, we are going to compare the USA, Mexico, and Canada in addition to the North American total. The DataFrame is then limited to just these four columns." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th>Date</th><th>north america</th><th>usa</th><th>mex</th><th>can</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>2010-12-31</th><td>540355247</td><td>309347057</td><td>117886404</td><td>34005274</td></tr>\n",
"<tr><th>2011-12-31</th><td>545608318</td><td>311721632</td><td>119361233</td><td>34342780</td></tr>\n",
"<tr><th>2012-12-31</th><td>550985148</td><td>314112078</td><td>120847477</td><td>34754312</td></tr>\n",
"<tr><th>2013-12-31</th><td>556361769</td><td>316497531</td><td>122332399</td><td>35158304</td></tr>\n",
"<tr><th>2014-12-31</th><td>561674093</td><td>318857056</td><td>123799215</td><td>35540419</td></tr>\n",
"</tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " north america usa mex can\n", "Date \n", "2010-12-31 540355247 309347057 117886404 34005274\n", "2011-12-31 545608318 311721632 119361233 34342780\n", "2012-12-31 550985148 314112078 120847477 34754312\n", "2013-12-31 556361769 316497531 122332399 35158304\n", "2014-12-31 561674093 318857056 123799215 35540419" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.insert(0,'north america',df.sum(axis=1))\n", "df = df[['north america', 'usa', 'mex', 'can']]\n", "df.tail(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Methods for Visualizing Change" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotly is a third-party library that allows users to develop interactive visualizations and share them online. The Plotly library cufflinks was created specifically to interact with Pandas DataFrames. Cufflinks allows users to make great visualizations in a single line of code." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import cufflinks as cf\n", "\n", "# Use these imports for offline development\n", "#import plotly.offline as py\n", "#py.init_notebook_mode() \n", "#cf.go_offline()\n", "\n", "# Use these imports for online publishing\n", "import plotly.plotly as py\n", "cf.go_online()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "colors = ['orange', 'blue', 'green', 'red']\n", "dims = (800,500)\n", "width = 2.5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "#### Plot the Data" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "title = \"\"\"North America Population\"\"\"\n", "fig1 = df.iplot(theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True )\n", "py.iplot(fig1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most basic method for visualizing change is to directly plot the data. The chart above shows the population of the United States, Mexico, Canada, and North America (including Central America and the Caribbean). While this affords readers the ability to see the absolute units, each series has a vastly different scale. These differences in scale make it difficult for your audience to quickly compare change. Looking at this chart, which country do you think grew at the fastest rate?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "#### Using Subplots" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "title = \"\"\"North America Population\"\"\"\n", "fig2 = df.iplot(subplots = True,theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True )\n", "py.iplot(fig2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The subplots method allows us to look at each series individually while also comparing the general trends. The subplots method can be helpful for comparing datasets with vastly different scales; however, it is not particularly useful for this analysis. Subplots are informative when there is large variation in your data. They are not effective for datasets that constantly increase over time. These four charts essentially just show ~45 degree angles. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "#### Dual Y-Axes" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "title = 'North America Population'\n", "fig3= df.iplot(theme='white',dimensions=dims,colors=colors,title=title, \\\n", " secondary_y =['mex','can'],legend = False, width=width, asFigure=True )\n", "py.iplot(fig3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It can be tempting to use a secondary y-axis to help solve the problem of scale. I strongly caution against this approach. In this chart, the populations of Canada and Mexico are plotted on the right axis. A dual-axis chart can cause a few different issues:\n", "-\tReaders have to fight the tendency to compare magnitude between lines.\n", "-\tOur brains are trained to look for periods in time in which lines intersect. We instinctively believe these are significant points in time. In a dual-axis chart, these intersections are meaningless.\n", "\n", "Stephen Few, one of the experts in the data visualization field, [wrote about how](https://www.perceptualedge.com/articles/visual_business_intelligence/dual-scaled_axes.pdf) he could not identify a scenario in which a dual y-axis is ever the best way of visualizing data. While I mostly agree, I believe there are circumstances where a dual y-axis can help provide context (such as how many observations took place in a specific location on a chart). 
For this analysis, a dual y-axis is not an effective way of communicating change amongst our datasets.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "#### Periodic Change" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_diff = df.diff()\n", "title = \"\"\"Annual Change in North American Population\n", "\"\"\"\n", "fig4 = df_diff.iplot(theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True)\n", "py.iplot(fig4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While plotting change in absolute units allows us to make comparisons within specific datasets, it is not particularly effective for comparing change across datasets with vastly different scales. If we examine 1990-1994, we can see that the population of the United States had much higher than normal growth. What this chart does not effectively communicate is the rapid growth in Mexico from 1960 to 1980." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "#### Periodic Percent Change" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_pct_change = df.pct_change() * 100\n", "title = \"\"\"Annual Percent Change in North American Population\"\"\"\n", "fig5 = df_pct_change.iplot(theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True)\n", "py.iplot(fig5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualizing percent change is a great way to establish growth relationships between datasets of different units and scales. 
Of all the charts I made when creating this post, this yielded the most surprising results. Two items particularly jumped out at me:\n", "-\tNone of the previous charts illustrated that Mexico has experienced more rapid population growth than the United States and Canada.\n", "-\tPopulation growth is slowing amongst the three major countries in North America. While this is a bit surprising, a closer look at the previous chart helps explain this. Absolute annual population growth (the numerator) has been relatively flat since 1960; however, the current population of each country (the denominator) continues to increase.\n", "\n", "While this type of chart demonstrates change, readers completely lose context of scale. This chart does not communicate how much larger the population of the United States is compared with Canada (the US has roughly 10x the population of Canada). Another drawback to the percent change method is the outlier effect. If the population of a country decreased one year, an increase in population the following year would be overstated. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "#### Indexing Data" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "north america 268076376\n", "usa 180671000\n", "mex 38676974\n", "can 17909009\n", "Name: 1960-12-31 00:00:00, dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = df[df.index == df.index.min()].squeeze()\n", "df_1960 = 100 + ((df - x) / x) * 100\n", "x" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th>Date</th><th>north america</th><th>usa</th><th>mex</th><th>can</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>1960-12-31</th><td>100.000000</td><td>100.000000</td><td>100.000000</td><td>100.000000</td></tr>\n",
"<tr><th>1961-12-31</th><td>102.029384</td><td>101.671547</td><td>103.263691</td><td>102.021279</td></tr>\n",
"<tr><th>1962-12-31</th><td>104.012274</td><td>103.247339</td><td>106.612141</td><td>103.936516</td></tr>\n",
"<tr><th>1963-12-31</th><td>105.966402</td><td>104.743982</td><td>110.050073</td><td>105.890840</td></tr>\n",
"<tr><th>1964-12-31</th><td>107.920963</td><td>106.209076</td><td>113.585406</td><td>107.906585</td></tr>\n",
"</tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " north america usa mex can\n", "Date \n", "1960-12-31 100.000000 100.000000 100.000000 100.000000\n", "1961-12-31 102.029384 101.671547 103.263691 102.021279\n", "1962-12-31 104.012274 103.247339 106.612141 103.936516\n", "1963-12-31 105.966402 104.743982 110.050073 105.890840\n", "1964-12-31 107.920963 106.209076 113.585406 107.906585" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_1960.head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "title = \"\"\"North American Population (Index 100 = December 31, 1960)\"\"\"\n", "fig6 = df_1960.iplot(theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True)\n", "py.iplot(fig6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indexing data is my absolute favorite way to compare change across datasets. This chart allows the reader to understand the rate at which change has occurred across datasets from a certain point in time (December 31, 1960). By using this fixed point in time as a reference, we reduce the impact of single outliers. This method allows us to compare not only datasets that have different scales, but also those that are measured in different units. What jumped out at me most was the fact that Mexico’s population has more than tripled since 1960! \n", "\n", "While I love index charts, there is no perfect time-series chart. Two specific areas of caution when using an index are:\n", "-\tIt is irresponsible to pick an outlier as the starting point. 
This misleads your audience, as the change since an outlier is rarely relevant.\n", "-\tSimilar to the percent change chart, an audience would be unable to understand the differences in magnitude across datasets.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of the previously discussed charts can be useful for communicating change across time. That being said, no time-series chart is perfect. As data visualizers, we must accept this and: \n", "\n", "1)\tDetermine the message we would like to communicate and \n", "2)\tChoose the method that most effectively delivers this message \n", "\n", "It is also important to remember that charts are free! There is no need to try to squeeze every bit of information into a single chart. I feel the entire story of North American population growth can be explained using the following three charts: \n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "py.iplot(fig1)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "py.iplot(fig5)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "py.iplot(fig6)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", 
"name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }