{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**0. Import & parse CSV data**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%pylab inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "#from matplotlib.pylab import plot, legend\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# read trends.csv with pandas. The first line is the header: \n", "## week_start,week_end,textbooks,spring_break,kayak,skiing,global_warming\n", "## 2004-01-04,2004-01-10,65,36,17,85,13\n", "## 2004-01-11,2004-01-17,79,40,18,84,18\n", "\n", "\n", "# Giving it index_col=0 uses the week_start as the index \n", "# and parse_dates=True parses them as a date.\n", "df = pd.read_csv('trends.csv',index_col=0,parse_dates=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we read in the csv file, specifying that the first column (which was week_start) should be the index column. By setting parse_dates=True, we told pandas to parse the date and store each index as a Timestamp object. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first line in the csv file was interpreted as the header, and these strings were stored as the column names. We can see the names and data types of each column with df.dtypes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the first few entries of the pandas dataframe to see how it looks:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# del df[\"week_end\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NB: In the table above it looks like the first row is \"week_start\" followed by a bunch of blank entries, but in fact this is just showing that the label for the indices is called \"week_start\". We can grab the first row of the data frame to verify what it contains:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(df.iloc[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that iloc indexes by *integer position* whereas loc will index by the actual index key, e.g." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(df.loc['2004-01-04'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I. Plot trends in single frame, with legend" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can plot each column of interest against the index (which in this case is week_start) one at a time:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.plot(figsize=(10,10))\n", "# if we want to add a legend, must import legend from matplotlib\n", "#legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we could slice the columns of interest with df.loc and plot them all in one line. Pandas will even do the legend for you!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.loc[:,['textbooks','spring_break','kayak','skiing']].plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we could slice the columns of interest by integer position using df.iloc" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.iloc[:,1:5].plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Let’s look at some summary statistics:\n", "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**II. Determine maxima and minima of each trend term**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Determine in which week of each year (for all five search trends including \"global warming\") that search reached its peak and its minimum." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can query the pandas data frame to return to us a slice of the data frame where a year is a particular value" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sub_df = df[df.index.year == 2015]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# look at the first few values of the data frame to see that it contains just 2006 values\n", "print(sub_df.head())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# we could find where there was a maximum value in a series with idxmax\n", "sub_df.kayak.idxmax()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# or we could use boolean indexing\n", "sub_df.loc[sub_df.kayak == sub_df.kayak.max()].index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find the week in each year that each trend was maximum and minimum. Let's loop through each key of interest and each year, and print out the max and min values of each column in that year." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "keylist = ['textbooks','spring_break','kayak','skiing','global_warming']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for key in keylist:\n", " print('Searches for \"' + key + '\":')\n", " for yr in range(2004,2016):\n", " # our indices are timestamp objects which have nice properties such as being able to grab the year, month, day, week, etc\n", " sub_df = df.loc[df.index.year == yr]\n", " trend = sub_df[key]\n", " peak_start_timestamp = sub_df.loc[trend == max(trend)].index[0] # taking the first element in case there are multiple with the max value\n", " peak_start = str(peak_start_timestamp.month) + '-' + str(peak_start_timestamp.day)\n", " peak_week = str(peak_start_timestamp.week) \n", " peak_end = sub_df.loc[trend == max(trend)].week_end.iloc[-1] # taking the first element in case there are multiple with the max value\n", " peak_end = peak_end[5:].lstrip('0') # take only the date, not the year\n", " \n", " # similar process for minimum values\n", " min_start_timestamp = sub_df.loc[trend == min(trend)].index[0]\n", " min_start = str(min_start_timestamp.month) + '-' + str(min_start_timestamp.day)\n", " min_week = str(min_start_timestamp.week)\n", " min_end = sub_df.loc[trend == min(trend)].week_end.iloc[0]\n", " min_end = min_end[5:].lstrip('0') \n", " \n", " print(' in ' + str(yr) + ' were highest from ' + peak_start + ' to ' + peak_end + \" (Week \" + peak_week + \")\")\n", " print(' were lowest from ' + min_start + ' to ' + min_end + \" (Week \" + min_week + \")\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each year, list week numbers corresponding to maximum and minimum search values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**III. Study scatter about median values**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's find which search value has the largest scatter over time. \n", "Q: Is it true that $\\sigma(A) > \\sigma(B) \\Rightarrow \\sigma_\\textrm{median}(A) > \\sigma_\\textrm{median}(B)$?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def std_median(datums):\n", " return np.sqrt( np.sum( (datums - datums.median())**2 ) * 1.0/ len(datums) )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(df.spring_break.std(), std_median(df.spring_break))\n", "print(df.textbooks.std(), std_median(df.textbooks))\n", "print(df.skiing.std(), std_median(df.skiing))\n", "print(df.kayak.std(), std_median(df.kayak))\n", "print(df.global_warming.std(), std_median(df.global_warming))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that searches for kayak have the highest scatter, whereas those for textbooks remain more stable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**IV. Correlation between trends**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use cross-correlation to find the time delay between two time-series. The maximum of the cross-correlation function gives the time when the two signals are best aligned (in other words, the lag between the two signals)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "result = np.correlate(df['spring_break'],df['skiing'], mode='full')\n", "gap = np.arange(result.size) - result.size/2\n", "plot(gap,result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the numpy function argmax to get the index at which this value is maximum\n", "(note for pandas expert - result appears still to be a pandas object and maintains attributes such as idxmax, but they don't work?)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(gap[result.argmax()]) # Find the x value corresponding to the maximum y value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's do the same exercise to find the lag between global warming and skiing searches" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "result2 = np.correlate(df.global_warming,df.skiing, mode='full')\n", "gap2 = np.arange(result2.size) - result2.size/2\n", "plot(gap2,result2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(gap2[result2.argmax()]) # Find the x value corresponding to the maximum y value" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "mid_peak_min = len(gap2)/2 - 26\n", "mid_peak_max = len(gap2)/2 + 26\n", "print(gap2[mid_peak_min:mid_peak_max][result2[mid_peak_min:mid_peak_max].argmax()])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }