# Appendix B - `pandas` Unfamiliar with data analysis in Python? Here's a crash course in `pandas`, [the preeminent Python library for data munging](http://pandas.pydata.org/). This appendix was written in Pweave markdown. You can view the source [over in `examples/`](examples/analyses/pandas.mdw), and the command for compiling to markdown [in the Makefile](Makefile). << python >>= import pweave # Allow long lines in code chunks pweave.rcParams["chunk"]["defaultoptions"].update({ 'wrap': False, }) import pandas as pd # pd is canonical shorthand for pandas from tabulate import tabulate from sqlalchemy import create_engine # Connect to our example database (built in the Makefile) DB_OPTS = dict( user='postgres', host='localhost', pw='', port=5432, name='libraries' ) DB_FMT = 'postgresql://{user}:{pw}@{host}:{port}/{name}' DB_CONN = DB_FMT.format(**DB_OPTS) engine = create_engine(DB_CONN) conn = engine.connect() @ ## DataFrames `pandas` provides really useful convenienience functions for reading data into their fundamental data type, the [`DataFrame`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). For CSV files, use [`pandas.read_csv`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). <<>>= csv = pd.read_csv('examples/data/cpl-wifi.csv') @ If you've got a database, [`pandas.read_sql`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html) accepts a table name or SQL query and a database connection, and returns the table data or query result as a DataFrame. <<>>= table = pd.read_sql('wifi', conn) @ Sneak a peek at your data with `head`... << results = 'markdown' >>= def show(df, **kwargs): '''tabulate is a Python library that prints tabular data, like DataFrames, in the format of your choosing. This function will print to Markdown tables; but you can also choose LaTeX or HTML, depending on your output! ''' print(tabulate(df, headers=df, tablefmt='pipe', **kwargs)) show(csv.head()) @ ...and/or `tail`. << results = 'markdown' >>= show(table.tail()) @ Inherited a file or database with horrid column names? Rename them by setting `DataFrame.columns` equal to a list of new labels for each column... << evaluate = False >>= # Given a DataFrame with columns ['This is a horrible col', 'Here is more BAD DATA!!!', 'FOIA AWAY'] df.columns = ['good_column', 'better_column', 'best_column'] @ ...or change a few particularly egregious columns with [`DataFrame.rename`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html#pandas.DataFrame.rename). << evaluate = False >>= df.rename({'The WORST COLUnnfdm OF ever time': 'sensible_column'}, inplace=True) @ ## Selecting data Select a column or columns with a column name... <<>>= table['cumulative_number_of_sessions'].head() @ ...or array of column names. << results = 'markdown' >>= show(table[['month', 'year', 'number_of_sessions']].head()) @ You can use `DataFrame.loc` to select rows meeting a given condition... << results = 'markdown' >>= show(table.loc[table['number_of_sessions'] > 200000].head()) @ ...or you can give your DataFrame a custom index... << results = 'markdown' >>= # Note that indices need not be unique nor numeric! indexed_table = table.set_index('month') show(indexed_table.head(3)) @ ...then use `DataFrame.loc` to select rows with the given index value... << results = 'markdown' >>= show(indexed_table.loc['January']) @ ...or to select attributes with the given index value and a column name. <<>>= indexed_table.loc['February', 'cumulative_number_of_sessions'] @ For positional indexing, there's `DataFrame.iloc`. Pass in a single integer or a slice to get the corresponding rows... << results = 'markdown' >>= show(table.iloc[:5]) @ ...and key notation to select attributes. <<>>= table.iloc[:5]['number_of_sessions'] @ ## Summary statistics <<>>= table['number_of_sessions'].describe() @ <<>>= table['year'].value_counts() @ ## Transformations Calculated columns are useful and easy! Operate on multiple columns... << results = 'markdown' >>= working_table = table.copy() working_table['pct_total_to_date'] = working_table['number_of_sessions'] / working_table['cumulative_number_of_sessions'] show(working_table.head()) @ ...`apply` lambda functions... << results = 'markdown' >>= average = working_table['number_of_sessions'].mean() working_table['above_average'] = working_table['number_of_sessions'].apply(lambda x: x > average) show(working_table.iloc[25:30]) @ ...or `map` dictionaries for simple calculated or categorical fields. << results = 'markdown' >>= seasons = { 'January': 'Winter', 'February': 'Winter', 'March': 'Spring', 'April': 'Spring', 'May': 'Spring', 'June': 'Summer', 'July': 'Summer', 'August': 'Summer', 'September': 'Fall', 'October': 'Fall', 'November': 'Fall', 'December': 'Winter' } working_table['season'] = working_table['month'].map(seasons) show(working_table.head(12)) @ Have something a bit more involved? Iterate over your `DataFrame` with `iterrows`. Note that each iterated row is a tuple containing an index value and the row attributes, accessible via key notation. <<>>= previous_year = None pct_change = [] for idx, row in working_table.iterrows(): current_year = row['number_of_sessions'] if previous_year: pct_change.append((current_year - previous_year) / previous_year) else: pct_change.append(None) previous_year = current_year working_table['pct_change'] = pct_change @ ## Aggregate functions http://wesmckinney.com/blog/groupby-fu-improvements-in-grouping-and-aggregating-data-in-pandas/ ## Plotting <<>>= ax = working_table.plot( title='Wifi Sessions at Chicago Public Libraries', y=['number_of_sessions', 'cumulative_number_of_sessions'] ) @