# Appendix B - `pandas`

Unfamiliar with data analysis in Python? Here's a crash course in `pandas`,
[the preeminent Python library for data munging](http://pandas.pydata.org/).

This appendix was written in Pweave markdown. You can view the source [over in
`examples/`](examples/analyses/pandas.mdw), and the command for compiling to
markdown [in the Makefile](Makefile).

<< python >>=
import pweave

# Allow long lines in code chunks
pweave.rcParams["chunk"]["defaultoptions"].update({
    'wrap': False,
})

import pandas as pd  # pd is canonical shorthand for pandas
from tabulate import tabulate

from sqlalchemy import create_engine

# Connect to our example database (built in the Makefile)
DB_OPTS = dict(
    user='postgres',
    host='localhost',
    pw='',
    port=5432,
    name='libraries'
)

DB_FMT = 'postgresql://{user}:{pw}@{host}:{port}/{name}'
DB_CONN = DB_FMT.format(**DB_OPTS)

engine = create_engine(DB_CONN)
conn = engine.connect()
@

## DataFrames

`pandas` provides really useful convenienience functions for reading data
into their fundamental data type, the [`DataFrame`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

For CSV files, use [`pandas.read_csv`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).

<<>>=
csv = pd.read_csv('examples/data/cpl-wifi.csv')
@

If you've got a database, [`pandas.read_sql`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html)
accepts a table name or SQL query and a database connection, and returns the
table data or query result as a DataFrame.

<<>>=
table = pd.read_sql('wifi', conn)
@

Sneak a peek at your data with `head`...

<< results = 'markdown' >>=
def show(df, **kwargs):
    '''tabulate is a Python library that prints tabular data, like DataFrames,
    in the format of your choosing. This function will print to Markdown
    tables; but you can also choose LaTeX or HTML, depending on your output!
    '''
    print(tabulate(df, headers=df, tablefmt='pipe', **kwargs))

show(csv.head())
@

...and/or `tail`.

<< results = 'markdown' >>=
show(table.tail())
@

Inherited a file or database with horrid column names? Rename them by setting
`DataFrame.columns` equal to a list of new labels for each column...

<< evaluate = False >>=
# Given a DataFrame with columns ['This is a horrible col', 'Here is more BAD DATA!!!', 'FOIA AWAY']
df.columns = ['good_column', 'better_column', 'best_column']
@

...or change a few particularly egregious columns with [`DataFrame.rename`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html#pandas.DataFrame.rename).

<< evaluate = False >>=
df.rename({'The WORST COLUnnfdm OF ever time': 'sensible_column'}, inplace=True)
@

## Selecting data

Select a column or columns with a column name...

<<>>=
table['cumulative_number_of_sessions'].head()
@

...or array of column names.

<< results = 'markdown' >>=
show(table[['month', 'year', 'number_of_sessions']].head())
@

You can use `DataFrame.loc` to select rows meeting a given condition...

<< results = 'markdown' >>=
show(table.loc[table['number_of_sessions'] > 200000].head())
@

...or you can give your DataFrame a custom index...

<< results = 'markdown' >>=
# Note that indices need not be unique nor numeric!
indexed_table = table.set_index('month')
show(indexed_table.head(3))
@

...then use `DataFrame.loc` to select rows with the given index value...

<< results = 'markdown' >>=
show(indexed_table.loc['January'])
@

...or to select attributes with the given index value and a column name.

<<>>=
indexed_table.loc['February', 'cumulative_number_of_sessions']
@

For positional indexing, there's `DataFrame.iloc`. Pass in a single integer
or a slice to get the corresponding rows...

<< results = 'markdown' >>=
show(table.iloc[:5])
@

...and key notation to select attributes.

<<>>=
table.iloc[:5]['number_of_sessions']
@

## Summary statistics

<<>>=
table['number_of_sessions'].describe()
@

<<>>=
table['year'].value_counts()
@

## Transformations

Calculated columns are useful and easy! Operate on multiple columns...

<< results = 'markdown' >>=
working_table = table.copy()
working_table['pct_total_to_date'] = working_table['number_of_sessions'] / working_table['cumulative_number_of_sessions']
show(working_table.head())
@

...`apply` lambda functions...

<< results = 'markdown' >>=
average = working_table['number_of_sessions'].mean()
working_table['above_average'] = working_table['number_of_sessions'].apply(lambda x: x > average)
show(working_table.iloc[25:30])
@

...or `map` dictionaries for simple calculated or categorical fields.

<< results = 'markdown' >>=
seasons = {
    'January': 'Winter',
    'February': 'Winter',
    'March': 'Spring',
    'April': 'Spring',
    'May': 'Spring',
    'June': 'Summer',
    'July': 'Summer',
    'August': 'Summer',
    'September': 'Fall',
    'October': 'Fall',
    'November': 'Fall',
    'December': 'Winter'
}

working_table['season'] = working_table['month'].map(seasons)

show(working_table.head(12))
@

Have something a bit more involved? Iterate over your `DataFrame` with `iterrows`.
Note that each iterated row is a tuple containing an index value and the row attributes,
accessible via key notation.

<<>>=
previous_year = None
pct_change = []

for idx, row in working_table.iterrows():
    current_year = row['number_of_sessions']
    if previous_year:
        pct_change.append((current_year - previous_year) / previous_year)
    else:
        pct_change.append(None)
    previous_year = current_year

working_table['pct_change'] = pct_change
@

## Aggregate functions

http://wesmckinney.com/blog/groupby-fu-improvements-in-grouping-and-aggregating-data-in-pandas/

## Plotting

<<>>=
ax = working_table.plot(
    title='Wifi Sessions at Chicago Public Libraries',
    y=['number_of_sessions', 'cumulative_number_of_sessions']
)
@