<h1 align="center">Visualizing Change Using Time-Series Line Charts</h1>
<h3 align="center">by Nick Heitzman</h3>


### Contents  
   
#### Introduction
[Introduction](#intro)  

#### Obtain Population Data
[Import Basic Libraries](#libs)  
[Retrieve Population Data from Quandl](#quandl)  
   
#### Methods for Visualizing Change
[Plot the Data](#plot)  
[Subplots](#subplot)  
[Dual Y-Axes](#dualaxes)  
[Periodic Change](#change)  
[Periodic Percent Change](#pctchange)  
[Indexing Data](#index)  


#### Conclusion
[Conclusion](#conclusion)  


<a id='intro'></a>
### Introduction

Time-series data visualizations are everywhere. While these charts are understood by people in all professions, effectively communicating change over time can present unexpected challenges. When creating any type of visualization, it is important to first determine the message you would like to communicate. The increased popularity of exploratory data visualization tools such as Tableau and Microsoft Power BI makes it easy to forget this step. These tools let users connect to databases and click around until they find the prettiest visualization, which often leads to ineffective visualizations with no explicit purpose.

When creating time-series line charts, it's important to consider which of the following you would like to communicate:
-	Actual value of units?
-	Change in absolute units? 
-	Percent change?
-	Change from a specific point in time?  

Ultimately, no chart can communicate all of these effectively. It is important to recognize this, determine which message is most important, and then design your visual accordingly. 


<a id='libs'></a>
### Import Basic Libraries

In [1]:
%matplotlib inline
import pandas as pd
import Quandl as qd
import warnings
warnings.filterwarnings('ignore')

#Nick's Quandl Auth token
auth = '9zjPBpsaLGqS-KPGzvyn'

<a id='quandl'></a>
### Retrieve Population Data from Quandl

#### What is Quandl?

Quandl is an online data warehouse which has millions of public datasets. Quandl's API is set up to pull data directly into a Pandas dataframe, and it automatically sets the date as the index.  For more info on using Quandl with Python, visit: https://www.quandl.com/help/python   
   
Quandl houses the World Bank's public data. The north_america_codes.json file contains the Quandl dataset code for the total population series of each country in North America, including Central America and the Caribbean.


In [2]:
df_codes = pd.read_json('north_america_codes.json')

In [3]:
df_codes.head()

|  | code | country |
|---|---|---|
| 0 | WORLDBANK/USA_SP_POP_TOTL | USA |
| 1 | WORLDBANK/CAN_SP_POP_TOTL | Canada |
| 10 | WORLDBANK/HTI_SP_POP_TOTL | Haiti |
| 11 | WORLDBANK/JAM_SP_POP_TOTL | Jamaica |
| 12 | WORLDBANK/KNA_SP_POP_TOTL | Saint Kitts and Nevis |


#### Retrieve Data

Using the Quandl API, I loop through each country to pull population data. As each country's data is pulled, it is concatenated into a single Pandas DataFrame (df).

In [4]:
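# Start from an empty DataFrame instance (note the parentheses)
df = pd.DataFrame()

for code in df_codes.code:
    # Pull one country's series; Quandl returns a DataFrame indexed by date
    df_temp = qd.get(code, authtoken=auth)
    # Characters 10-12 of the code hold the country abbreviation, e.g. 'USA'
    df_temp.rename(columns={'Value': code[10:13]}, inplace=True)

    if df.empty:
        df = df_temp
    else:
        # Align on the date index and add the new country as a column
        df = pd.concat([df, df_temp], axis=1)

df.columns = [col.lower() for col in df.columns]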
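
An alternative to growing the DataFrame inside the loop is to collect each country's frame in a list and concatenate once at the end. The sketch below shows that pattern only; `fetch_population` and its numbers are hypothetical stand-ins for the real `qd.get` call, so it runs without a Quandl connection:

```python
import pandas as pd

def fetch_population(code):
    """Hypothetical stand-in for qd.get(code); returns made-up values."""
    dates = pd.to_datetime(['2013-12-31', '2014-12-31'])
    fake_values = {'WORLDBANK/USA_SP_POP_TOTL': [316, 318],
                   'WORLDBANK/CAN_SP_POP_TOTL': [35, 36]}
    return pd.DataFrame({'Value': fake_values[code]}, index=dates)

codes = ['WORLDBANK/USA_SP_POP_TOTL', 'WORLDBANK/CAN_SP_POP_TOTL']

# Rename each 'Value' column to its lowercase country abbreviation,
# then join all frames on the shared date index in a single call
frames = [fetch_population(c).rename(columns={'Value': c[10:13].lower()})
          for c in codes]
df_demo = pd.concat(frames, axis=1)
print(df_demo)
```

Concatenating once avoids copying the accumulated data on every iteration, which matters little for a few dozen countries but keeps the loop body simpler.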

#### Data Munging

I then calculate the total for North America. For the purpose of this analysis, we are going to compare USA, Mexico, and Canada in addition to the North American total. The DataFrame is then limited to just these four columns.

In [5]:
df.insert(0,'north america',df.sum(axis=1))
df = df[['north america', 'usa', 'mex', 'can']]
df.tail(5)

| Date | north america | usa | mex | can |
|---|---|---|---|---|
| 2010-12-31 | 540355247 | 309347057 | 117886404 | 34005274 |
| 2011-12-31 | 545608318 | 311721632 | 119361233 | 34342780 |
| 2012-12-31 | 550985148 | 314112078 | 120847477 | 34754312 |
| 2013-12-31 | 556361769 | 316497531 | 122332399 | 35158304 |
| 2014-12-31 | 561674093 | 318857056 | 123799215 | 35540419 |


### Methods for Visualizing Change

Plotly is a third-party library that allows users to develop interactive visualizations and share them online. The cufflinks library was created specifically to connect Plotly with Pandas DataFrames, and it lets users produce a chart from a DataFrame in a single line of code.

In [6]:
import cufflinks as cf

# Use these imports for offline development
#import plotly.offline as py
#py.init_notebook_mode() 
#cf.go_offline()

# Use these imports for online publishing
import plotly.plotly as py
cf.go_online()

In [7]:
colors = ['orange', 'blue', 'green', 'red']
dims = (800,500)
width = 2.5

<a id='plot'></a>
#### Plot the Data

In [8]:
title = """North America Population"""
fig1 = df.iplot(theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True )
py.iplot(fig1)

The most basic method for visualizing change is to plot the data directly. The chart above shows the population of the United States, Mexico, Canada, and North America (including Central America and the Caribbean). While this lets readers see the absolute units, each series has a vastly different scale. These differences in scale make it difficult for your audience to quickly compare change. Looking at this chart, which country do you think grew at the fastest rate?

<a id='subplot'></a>
#### Using Subplots

In [9]:
title = """North America Population"""
fig2 = df.iplot(subplots = True,theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True )
py.iplot(fig2)

The subplots method allows us to look at each series individually while also comparing the general trends. The subplots method can be helpful for comparing datasets with vastly different scales; however, it is not particularly useful for this analysis. Subplots are informative when there is large variation in your data. They are not effective for datasets that constantly increase over time. These four charts essentially just show ~45 degree angles. 

<a id='dualaxes'></a>
#### Dual Y-Axes

In [10]:
title = 'North America Population'
fig3= df.iplot(theme='white',dimensions=dims,colors=colors,title=title, \
        secondary_y =['mex','can'],legend = False, width=width, asFigure=True )
py.iplot(fig3)

It can be tempting to use a secondary y-axis to help solve the problem of scale. I strongly caution against this approach. In this chart, the populations of Canada and Mexico are plotted on the right axis. A dual-axis chart can cause a few different issues:
-	Readers have to fight the tendency to compare magnitude between lines.
-	Our brains are trained to look for points in time at which lines intersect, and we instinctively believe these are significant. In a dual-axis chart, these intersections are meaningless.

Stephen Few, one of the experts in the data visualization field, [wrote about how](https://www.perceptualedge.com/articles/visual_business_intelligence/dual-scaled_axes.pdf)  he could not identify a scenario in which a dual y-axis is ever the best way of visualizing data. While I mostly agree, I believe there are circumstances where a dual y-axis can help provide context (such as how many observations took place in a specific location on a chart). For this analysis, a dual y-axis is not an effective way of communicating change amongst our datasets.


<a id='change'></a>
#### Periodic Change

In [11]:
df_diff = df.diff()
title = """Annual Change in North American Population"""
fig4 = df_diff.iplot(theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True)
py.iplot(fig4)

While plotting change in absolute units allows us to make comparisons within a specific dataset, it is not particularly effective for comparing change across datasets with vastly different scales. If we examine 1990-1994, we can see that the population of the United States had much higher than normal growth. What this chart does not effectively communicate is the rapid growth in Mexico from 1960-1980.

<a id='pctchange'></a>
#### Periodic Percent Change

In [12]:
df_pct_change = df.pct_change() * 100
title = """Annual Percent Change in North American Population"""
fig5 = df_pct_change.iplot(theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True)
py.iplot(fig5)

Visualizing percent change is a great way to establish growth relationships between data sets of different units and scales. Of all the charts I made when creating this post, this yielded the most surprising results. Two items particularly jumped out at me:
-	None of the previous charts illustrated that Mexico has experienced more rapid population growth than the United States and Canada.
-	Population growth is slowing amongst the three major countries in North America. While this is a bit surprising, a closer look at the previous chart helps explain this. Absolute annual population growth (the numerator) has been relatively flat since 1960; however, the current population of each country (the denominator) continues to increase.
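
The numerator and denominator point in the second bullet can be checked with a few made-up numbers. This is a minimal sketch, not the population data itself: a series that grows by the same absolute amount every period shows a steadily shrinking percent change.

```python
import pandas as pd

# Made-up series: grows by exactly 10 units every period
pop = pd.Series([100, 110, 120, 130, 140])

abs_change = pop.diff()              # numerator: constant at 10
pct_change = pop.pct_change() * 100  # 10/100, 10/110, 10/120, ... shrinks each period

print(abs_change.tolist())
print(pct_change.round(2).tolist())
```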

While this type of chart demonstrates change, readers completely lose context of scale. This chart does not communicate how much larger the population of the United States is compared with Canada (in 2014 the US had roughly nine times the population of Canada). Another drawback to the percent change method is the outlier effect. If the population of a country decreased one year, an increase in population the following year would be overstated.
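
One way to read the outlier effect, sketched with made-up numbers: after a one-year dip, a return to the original level registers as a larger percent gain than the percent loss that caused it, because the rebound is measured against the smaller base.

```python
import pandas as pd

# Made-up series: a one-year dip followed by a full recovery
pop = pd.Series([100, 90, 100], index=[2000, 2001, 2002])
pct = pop.pct_change() * 100

# The drop is measured against 100, the rebound against 90
print(pct.round(2))
```

The series ends exactly where it started, yet the chart would show a rebound spike taller than the dip.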


<a id='index'></a>
#### Indexing Data

In [13]:
x = df[df.index == df.index.min()].squeeze()
df_1960 = 100 + ((df - x) / x) * 100
x

north america    268076376
usa              180671000
mex               38676974
can               17909009
Name: 1960-12-31 00:00:00, dtype: float64

In [14]:
df_1960.head()

| Date | north america | usa | mex | can |
|---|---|---|---|---|
| 1960-12-31 | 100.0 | 100.0 | 100.0 | 100.0 |
| 1961-12-31 | 102.029384 | 101.671547 | 103.263691 | 102.021279 |
| 1962-12-31 | 104.012274 | 103.247339 | 106.612141 | 103.936516 |
| 1963-12-31 | 105.966402 | 104.743982 | 110.050073 | 105.89084 |
| 1964-12-31 | 107.920963 | 106.209076 | 113.585406 | 107.906585 |


In [15]:
title = """North American Population (Index 100 = December 31, 1960)"""
fig6 = df_1960.iplot(theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True)
py.iplot(fig6)

Indexing data is my absolute favorite way to compare change across datasets. This chart allows the reader to understand the rate at which change has occurred across datasets from a certain point in time (December 31, 1960). By using this fixed point in time as a reference, we reduce the impact of single outliers. This method allows us to compare not only datasets which have different scales, but also those which are measured in different units. What jumped out to me most was the fact that Mexico's population has more than tripled since 1960!

While I love index charts, there is no perfect time-series chart. Two specific areas of caution when using an index are:
-	It is irresponsible to pick an outlier as the starting point. This misleads your audience, as the change since an outlier is rarely relevant.
-	Similar to the percent change chart, an audience would be unable to understand the differences in magnitude across datasets.
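
Because the choice of starting point matters so much, it can help to make the base period an explicit parameter. The helper below is a hypothetical sketch (the name `index_to_base` and the sample values are made up for illustration); it uses the simpler form `value / base * 100`, which is algebraically the same as the `100 + ((df - x) / x) * 100` expression used above.

```python
import pandas as pd

def index_to_base(frame, base_label):
    """Rescale every column so its value at base_label equals 100."""
    return frame / frame.loc[base_label] * 100

# Made-up values for two series on very different scales
demo = pd.DataFrame({'big': [200.0, 220.0, 260.0],
                     'small': [10.0, 12.0, 15.0]},
                    index=[1960, 1961, 1962])

indexed = index_to_base(demo, 1960)

# Cross-check against the longer form of the same calculation
same = 100 + ((demo - demo.loc[1960]) / demo.loc[1960]) * 100
print(indexed)
```

Calling `index_to_base(demo, 1961)` instead rebases every series to the second period, which makes it easy to see how much the picture depends on the reference point.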


<a id='conclusion'></a>
### Conclusion

All of the previously discussed charts can be useful for communicating change across time. That being said, no time-series chart is perfect. As data visualizers, we must accept this and:  

1)	Determine the message we would like to communicate and  
2)	Choose the method which most effectively delivers this message  

It is also important to remember that charts are free! There is no need to try to squeeze every bit of information into a single chart. I feel the entire story of North American population growth can be explained using the following three charts: 


In [16]:
py.iplot(fig1)

In [17]:
py.iplot(fig5)

In [18]:
py.iplot(fig6)