{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Python Data Science with Pandas\n",
    "\n",
    "Agenda:\n",
    "1. Selecting and slicing data with loc and iloc\n",
    "2. Selecting data by value\n",
    "3. Selecting data from a time series by date and time values\n",
    "4. Applying and mapping functions to pandas dataframes and series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# magic command to display matplotlib plots inline within the ipython notebook webpage\n",
    "%matplotlib inline\n",
    "\n",
    "# import necessary modules\n",
    "import pandas as pd, numpy as np, matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1 <br /> Basics of Selecting and Slicing Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# Operation                       Syntax         Result\n",
    "#----------------------------------------------------------\n",
    "# Select column                   df[col]        Series\n",
    "# Select row by label             df.loc[label]  Series\n",
    "# Select row by integer location  df.iloc[loc]   Series\n",
    "# Slice rows                      df[5:10]       DataFrame\n",
    "# Select rows by boolean vector   df[bool_vec]   DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# create a pandas dataframe from the location data set\n",
    "df = pd.read_csv('data/summer-travel-gps-full.csv')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With Python strings and lists, you can access individual elements inside the object by indexing or slicing with square brackets like my_list[0]. The same syntax works on pandas dataframes and series."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# get a column from the dataframe by its label with [] indexing - returns column as a pandas series with the dataframe's index\n",
    "df['city'].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can slice a pandas dataframe to grab multiple columns or rows at once, by their index, just like you can slice a list or a string. Here, we just pass a list of 2 column names in:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# get multiple columns by their labels by passing a list of column names within the [] operator - returns a dataframe\n",
    "df[['city', 'country']].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get a single \"cell's\" value out of a dataframe, pass a column name, then a row label. This is equivalent to slicing the dataframe down to a single series, then slicing a single value out of that series using [ ] indexing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# get a value using the [] operator for a column label then a row label\n",
    "df['city'][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using .loc[ ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# use .loc to select by row label - returns the row as a series whose index is the dataframe column labels\n",
    "df.loc[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# use .loc to select by row label and column label\n",
    "df.loc[0, 'country']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# 0:3 is a slice of rows with label 0 to label 3\n",
    "# ['city', 'date'] is a list of column labels\n",
    "df.loc[0:3, ['city', 'date']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# slice by rows and columns\n",
    "df.loc[0:3, 'date':'country']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# you can set values with .loc as well\n",
    "print(df.loc[0, 'city'])\n",
    "df.loc[0, 'city'] = 'London'\n",
    "print(df.loc[0, 'city'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using .iloc[ ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# use .iloc for integer position based indexing\n",
    "# get the value from the row in position 3 and the column in position 2\n",
    "df.iloc[3, 2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# you can use .iloc with slices too\n",
    "# slice rows from position 112 to 115 and columns from position 2 to 4\n",
    "# iloc is not inclusive, so for example \"from 2 to 4\" will return positions 2 and 3 (but not 4)\n",
    "df.iloc[112:115, 2:4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# you can use iloc to select every nth row from a data set\n",
    "n = 300\n",
    "df.iloc[range(0, len(df), n)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# it's easier to tell the difference between .loc and .iloc if the index labels aren't the same as their positions\n",
    "df.index = [label**2 for label in df.index]\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# this returns the rows with labels between 4 and 9 (.loc is inclusive)\n",
    "df.loc[4:9]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# this returns rows with labels in positions 4 through 8 (not through 9, because .iloc is not inclusive)\n",
    "df.iloc[4:9]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 2 <br /> How to select rows by some value(s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# load a reduced set of gps data\n",
    "df = pd.read_csv('data/summer-travel-gps-simplified.csv')\n",
    "df.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# create a Series of true/false, indicating if each row in the column is equal to some value\n",
    "df['city']=='Munich'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# now, select only those rows in the df that match that condition\n",
    "df[df['city']=='Munich']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# pandas logical operators are: | for or, & for and, ~ for not\n",
    "# these must be grouped by using parentheses\n",
    "# what cities were visited in spain that were not barcelona?\n",
    "not_bcn = df[(df['country']=='Spain') & ~(df['city']=='Barcelona')]\n",
    "not_bcn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "not_bcn['city'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# select rows where either the city is munich, or the country is serbia\n",
    "df[(df['city']=='Munich') | (df['country']=='Serbia')]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# how many observations are west of the prime meridian?\n",
    "len(df[df['lon'] < 0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# get all rows that contain a city that starts with the letter G\n",
    "row_mask = df['city'].str.startswith('G')\n",
    "df[row_mask]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# select all rows with certain city names by using .isin([list])\n",
    "row_mask = df['city'].isin(['Munich', 'Berat', 'Maia', 'Sarajevo'])\n",
    "df[row_mask]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 3 <br /> How to select based on a date-time values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# load the location data set, indexed by the date field\n",
    "# and, parse the dates so they're no longer strings but now rather Python datetime objects\n",
    "# this lets us do date and time based operations on the data set\n",
    "dt = pd.read_csv('data/summer-travel-gps-full.csv', index_col='date', parse_dates=True)\n",
    "dt.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "len(dt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# 1759 rows - but is the timestamp index unique?\n",
    "dt.index.is_unique"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# the index is not unique - drop duplicates\n",
    "dt.drop_duplicates(inplace=True)\n",
    "len(dt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# the .weekday attribute determines which day of the week a date is\n",
    "# 0 is sunday and 6 is saturday, M-F are 1-5\n",
    "# what day of the week is each datetime in our dataframe's index?\n",
    "str(list(dt.index.weekday))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# use bitwise OR to create a boolean vector of which rows are a weekend date\n",
    "weekend_mask = (dt.index.weekday==6) | (dt.index.weekday==0)\n",
    "weekend_mask"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "weekends = dt[weekend_mask]\n",
    "weekdays = dt[~weekend_mask]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "hourly_share = pd.DataFrame()\n",
    "\n",
    "# calculate what share of the weekday observations each hour has\n",
    "weekday_hourly = weekdays.groupby(weekdays.index.hour).size()\n",
    "hourly_share['weekday'] = weekday_hourly / sum(weekday_hourly)\n",
    "\n",
    "# calculate what share of the weekend observations each hour has\n",
    "weekend_hourly = weekends.groupby(weekends.index.hour).size()\n",
    "hourly_share['weekend'] = weekend_hourly / sum(weekend_hourly)\n",
    "\n",
    "# format the x-axis ticks like 0:00 times and plot the data\n",
    "hourly_share.index = [s + ':00' for s in hourly_share.index.astype(str)]\n",
    "hourly_share.plot(figsize=[9, 4], kind='bar', stacked=False, alpha=0.7, title='Share of observations, by hour')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# calculate and plot the number of observations each day of the week has\n",
    "daily_count = dt.groupby(dt.index.weekday).size()\n",
    "daily_count.index = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']\n",
    "daily_count.plot(figsize=[8, 5], kind='bar', title='Number of observations, by day of week')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 4 <br /> map(), apply(), and applymap() in pandas\n",
    "\n",
    "These methods are useful for mapping/applying a function across elements, rows, and columns of a pandas DataFrame or Series. But they have some important and often confusing differences.\n",
    "\n",
    "1. map() applies a function element-wise on a Series\n",
    "2. apply() works on a row or column basis on a DataFrame (specify the axis!), or on a row basis on a Series\n",
    "3. applymap() works element-wise on an entire DataFrame\n",
    "\n",
    "Let's see what that means in practice with some examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# create a new DataFrame with fake year data\n",
    "df = pd.DataFrame({'start_year':[2001, 2002, 2005, 2005, 2006], \n",
    "                   'end_year':[2002, 2010, 2008, 2006, 2014]})\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# you can iterate through a DataFrame using the .iterrows() method\n",
    "for _, row in df.iterrows():\n",
    "    start_year = row['start_year']\n",
    "    if start_year > 2004:\n",
    "        print(start_year + 10)\n",
    "    else:\n",
    "        print(start_year)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### alternatively, .map() applies a function element-wise on a Series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# create a function\n",
    "def get_new_year(x):\n",
    "    if x > 2004:\n",
    "        return x + 10\n",
    "    else:\n",
    "        return x\n",
    "\n",
    "# then map it to the series\n",
    "df['start_year'].map(get_new_year)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# or do the same thing all in one line, using a lambda function as .map()'s argument... you commonly see this in pandas\n",
    "df['start_year'].map(lambda x: x + 10 if x > 2004 else x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A [lambda function](http://www.python-course.eu/lambda.php) is a simple, one-off, anonymous function. You can't call it again later because it doesn't have a name. It just lets you repeatedly perform some operation across a series of values (in our case, a column in our dataframe) using a minimal amount of code. Also notice that the if-else statement is all on one line: this is called a [ternary operator](http://pythoncentral.io/one-line-if-statement-in-python-ternary-conditional-operator/) or an inline-if."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# you can easily create a new column to contain the results of the function mapping\n",
    "df['new_year'] = df['start_year'].map(get_new_year)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### .apply() is like .map(), but it works on a row or column basis on an entire DataFrame (specify the axis)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# applies a function to calculate the difference between the min and max values in each column (ie, row-wise)\n",
    "def get_difference(vector):\n",
    "    difference = vector.max() - vector.min()\n",
    "    return difference\n",
    "\n",
    "df.apply(get_difference, axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# same thing again, using a lambda function\n",
    "df.apply(lambda x: x.max() - x.min(), axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# here .apply() finds the difference between the min and max values in each row (ie, column-wise) and saves to a new column\n",
    "df['difference'] = df.apply(get_difference, axis=1)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### .applymap() works element-wise on an entire DataFrame\n",
    "This is like doing a .map() to each column in the DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# divide every value in the dataframe by 2 (use a float so you don't do rounded integer division)\n",
    "df.applymap(lambda x: x / 2.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}