{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Data Analysis \n", "\n", " **In God we trust, all others bring data.** - *The Elements of Statistical Learning*\n", "\n", "Big Data, Data Analytics, Data Science etc are the common buzzwords of the data world. So much so that data is considered to be the \"new oil\". There are excellent data-specific programming tools like SAS, R, Hadoop. Using a more generic scripting language like Python for data analysis is helpful as it allows for combination of data tasks with scientific programming.\n", "\n", "\n", "One major issue for statistical programmers using Python, in the past has been the lack of libraries implementing standard models and a cohesive framework for specifying models. **Pandas**, the data analysis library which has been in development since 2008, aims to bridge this gap.\n", "\n", "Pandas derives its name from **pan**el **da**tasets, which is a commonly used term for multi-dimensional datasets encountered in statistics and econometrics.\n", "\n", "\n", "\n", "\n", "\n", "Data analysis is only as good as its visualization. Today we will use a number of datasets in combination with the plotting library in Python; **matplotlib** to demonstrate our learnings. The notebook is structured as follows:\n", "\n", "## Contents\n", "- [Data Analysis](#intro)\n", "- [Matplotlib](#mpl)\n", "- [Data Analysis: pandas](#pandas)\n", " - [Series](#series)\n", " - [String methods](#smethods)\n", " - [Reading from a csv](#csv)\n", "- [DataFrames](#df)\n", " - [Exercise 1: DataFrames](#ex1)\n", " - [Data Manipulation](#dm)\n", " - [Exercise 2: Data Extraction](#ex2)\n", " - [Plotting data](#plot)\n", " - [Missing Data](#missing)\n", " - [Excercise 3: DataFrame Methods](#ex3)\n", " - [More Manipulations](#mm)\n", "- [Statistical Tests](#stats)\n", " - [Regression](#regression)\n", " - [T-Test](#ttest)\n", " - [Time Series](#ts)\n", "- [Data Problem](#dp)\n", " - [Data Cleaning](#dc)\n", " - [Data Analysis](#da)\n", "- [Miscellaneous plots](#oplot)\n", "- [References](#refs)\n", "- [Credits](#credits)\n", "\n", "\n", "\n", "\n", "\n", " \n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from __future__ import division\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "\n", "pd.set_option('display.mpl_style', 'default')\n", "#IPython magic command for inline plotting\n", "%matplotlib inline\n", "#a better plot shape for IPython\n", "mpl.rcParams['figure.figsize']=[15,3]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Quick Overview of matplotlib\n", "\n", "Matplotlib is the primary plotting library in Python. We will have a separate notebook dedicated to its features in a subsequent session. For the purpose of plotting with **pandas** today, we will touch upon the very basic plotting in **matplotlib**." ] }, { "cell_type": "code", "collapsed": false, "input": [ "x = np.linspace(0, 1, 10001)\n", "y = np.cos(np.pi/x) * np.exp(-x**2)\n", "\n", "plt.plot(x, y)\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Plot the following equations over the domain $x \\in \\left[-1, 2\\right]$.\n", " * $y = f(x) = x^2 \\exp(-x)$\n", " * $y = f(x) = \\log x$\n", " * $y = f(x) = 1 + x^x + 3 x^4$" ] }, { "cell_type": "code", "collapsed": false, "input": [ "x=np.linspace(-1, 2, 10001)\n", "y = x**2*np.exp(-x)\n", "\n", "plt.plot(x, y)\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Data analysis: pandas\n", "\n", "The pandas data analysis module provides data structures and tools for data analysis. It focuses on data handling and manipulation as well as linear and panel regression. It is designed to let you carry out your entire data workflow in Python without having to switch to a domain-specific language such as R. Although largely compatible with NumPy/SciPy, there are some important differences in indexing, data organization, and features. The basic Pandas data type is not `ndarray`, but Series and DataFrame. These allow you to index data and align axes efficiently.\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Series \n", "\n", "A `Series` object is a one-dimensional array which can hold any data type. Like a dictionary, it has a set of indices for access (like keys); unlike a dictionary, it is ordered. Data alignment is intrinsic and will not be broken unless you do it explicitly. It is very similar to ndarray from NumPy.\n", "\n", "An arbitrary list of values can be used as the index, or a list of axis labels (so it can act something like a `dict`)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "s = pd.Series([1,5,float('NaN'),7.5,2.1,3])\n", "print(s)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "dates = pd.date_range('20140201', periods=s.size)\n", "s.index = dates\n", "print(s)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "letters = ['A', 'B', 'Ch', '#', '#', '---']\n", "s.index = letters\n", "print(s)\n", "print('\\nAccess is like a dictionary key:\\ns[\\'---\\'] = '+str(s['---']))\n", "print('\\nRepeat labels are possible:\\ns[\\'#\\']=\\n'+str(s['#']))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NumPy functions expecting an ndarray often do just fine with Series as well." ] }, { "cell_type": "code", "collapsed": false, "input": [ "t = np.exp(s)\n", "print(t)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## String Methods\n", "\n", "Series is equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the Series\u2019s str attribute and generally have names matching the equivalent (scalar) built-in string methods:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ " s.str.upper()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ " s.str.lower()\n", " " ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "s.str.len()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])\n", "print s2\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "s2.str.split('_')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
MethodDescription
catConcatenate strings
splitSplit strings on delimiter
getIndex into each element (retrieve i-th element
joinJoin strings in each element of the Series with passed separator
containsReturn boolean array if each string contains pattern/regex
replaceReplace occurrences of pattern/regex with some other string
repeatDuplicate values (s.str.repeat(3) equivalent to x * 3)
padAdd whitespace to left, right, or both sides of strings
centerEquivalent to pad(side='both')
wrapSplit long strings into lines with length less than a given width
sliceSlice each string in the Series
slice_replaceReplace slice in each string with passed value
countCount occurrences of pattern
startswithEquivalent to str.startswith(pat) for each element
endswithEquivalent to str.endswith(pat) for each element
findall Compute list of all occurrences of pattern/regex for each string
matchCall re.match on each element, returning matched groups as list
extractCall re.match on each element, as match does, but return matched groups as strings for convenience.
lenCompute string lengths
stripEquivalent to str.strip
rstripEquivalent to str.rstrip
lstripEquivalent to str.lstrip
lowerEquivalent to str.lower
upperEquivalent to str.upper
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Reading from a csv\n", "\n", "In most data scenarios, you will receive a comma separated file, on which you will need to perform your analysis. Reading a `csv` file into Python can be achieved by using the `read_csv` function. We will use the date from [this website](http://donnees.ville.montreal.qc.ca/dataset/velos-comptage), about how many people were on 7 different bike paths in Montreal, each day. Let's use the data from 2012." ] }, { "cell_type": "code", "collapsed": false, "input": [ "broken_df = pd.read_csv('2012.csv')\n", "\n", "#Look at the first 4 rows\n", "broken_df[:4]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "fixed_df = pd.read_csv('2012.csv', index_col='Date')\n", "fixed_df[:3]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "##DataFrame\n", "\n", "What we did when we read the `csv` file into `broken_df`, we created a 2 Dimensional data structure called a `DataFrame`. The `DataFrame` object is similar to a table or a spreadsheet in Excel, i.e. a 2D Matrix-like object. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "s = pd.Series([1,5,float('NaN'),7.5,2.1,3])\n", "df = pd.DataFrame(s, columns=['x'])\n", "print(df)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "t=np.exp(s)\n", "df['exp(x)'] = t\n", "df['exp(exp(x))'] = np.exp(t)\n", "print(df)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a number of ways to access the elements of a `DataFrame`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(df['x'], '\\n') #column\n", "#letters = ['A', 'B', 'Ch', '#', '#', '---']\n", "#df.index=letters\n", "#print(df.loc['#'], '\\n') #row by label\n", "#print(df.iloc[3], '\\n') #row by number (note the transposition in output!)\n", "print(df[1:4]) #row by slice" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Exercise 1 : DataFrames" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df1=pd.DataFrame(np.random.randn(dates.size,4),index=dates,columns=list('ABCD'))\n", "print df1" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the `DataFrame df1` created above, perform the following operations:\n", "\n", " 1. df1.head() and df1.tail()\n", " 2. df1.describe()\n", " 3. df1.T\n", " 4. df1.sort(columns='B')\n", " 5. df1.columns, df1.index, df1.values" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df1.sort(columns=list('B'))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## DataFrames: Data Manipulation\n", "\n", "Now let us look at the cyclist DataFrame we created. To extract a column from the DataFrame, " ] }, { "cell_type": "code", "collapsed": false, "input": [ "fixed_df['Berri 1']" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use **Boolean indexing** on columns to extract information satisfying our desired conditions. For example, if I wished to extract all data from the cyclist data set where the value in the column `Berri 1` is greater than 1000," ] }, { "cell_type": "code", "collapsed": false, "input": [ "fixed_df[fixed_df['Berri 1'] > 1000]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from pandas.util.testing import rands\n", "df=pd.DataFrame(np.random.randn(dates.size,4),index=dates,columns=list('ABCD'))\n", "print df" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df[df.B>0]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df[df > 0]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df2 = df.copy()\n", "df2['E']=['one', 'one','two','three','four','three']\n", "print df2" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df2[df2['E'].isin(['one'])]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df.at[dates[0],'A'] = 0\n", "print df" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df.iat[0,1] = 0\n", "print df" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Exercise 2 : Conditional data extraction" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from random import randint\n", "df = pd.DataFrame({'A': [randint(1, 9) for x in xrange(10)],\n", " 'B': [randint(1, 9)*10 for x in xrange(10)],\n", " 'C': [randint(1, 9)*100 for x in xrange(10)]})\n", "print df" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find the entries from *A* for which corresponding values for *B* will be greater than 50, and those in *C* equal to 900" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Plotting Data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "fixed_df['Berri 1'].plot()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Missing Data\n", "\n", "Many data sets simply arrive with missing data, either because it exists and was not collected or it never existed. For example, in a collection of financial time series, some of the time series might start on different dates. Thus, values prior to the start date would generally be marked as missing.\n", "\n", "As data comes in many shapes and forms, `pandas` aims to be flexible with regard to handling missing data. While NaN is the default missing value marker for reasons of computational speed and convenience, we need to be able to easily detect this value with data of different types: floating point, integer, boolean, and general object. In many cases, however, the Python None will arise and we wish to also consider that \u201cmissing\u201d or \u201cnull\u201d." ] }, { "cell_type": "code", "collapsed": false, "input": [ "df= pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))\n", "print df" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])\n", "print df1" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df1.loc[dates[0]:dates[1],'E'] = 1\n", "print df1" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df1.dropna(how='all') #any" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df1.fillna(value=15)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.isnull(df1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Missing values propogate through arithmetic operations." ] }, { "cell_type": "code", "collapsed": false, "input": [ "df2=pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))\n", "print df2\n", "df2.loc[dates[0]:dates[2],'B']=float('NaN')\n", "print df2\n", "print df1+df2" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But this can be avoided by using built-in methods, that exclude missing values." ] }, { "cell_type": "code", "collapsed": false, "input": [ "df1['A'].sum()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df1.mean(1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df2.cumsum()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#Gaussian numbers histogram\n", "from numpy.random import normal\n", "n = 1000\n", "x = pd.Series(normal(size=n))\n", "#print x\n", "avg = x.mean()\n", "std = x.std()\n", "\n", "x_avg = pd.Series(np.ones(n)* avg)\n", "x_stdl = pd.Series(np.ones(n)*(avg-std))\n", "x_stdh = pd.Series(np.ones(n)*(avg+std))\n", "\n", "df_gauss=pd.DataFrame({'A':x_stdl,'B':x_stdh,'x':x})\n", "\n", "df_gauss.plot(style=['rx','rx','bx'])\n", "plt.figure()\n", "df_gauss['x'].diff().hist(color='g', bins=50)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Exercise 3: DataFrame methods" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df=pd.DataFrame(np.random.randn(5,5), columns=list('ABCDE'))\n", "print df" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try the following with `df` as defined above:\n", "\n", " 1. df.mean()\n", " 2. df.apply(np.cumsum)\n", " 3. df.apply(lambda x: x.max() - x.min())\n", " 4. Plot a histogram" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df.apply(lambda x: x.max() - x.min())\n", "#What does lambda do?" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "def f(x):\n", "... return x*2\n", "g = lambda x: x*2 \n", "\n", "print g(3)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### More manipulations" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pandas import read_csv\n", "from urllib import urlopen\n", "page = urlopen(\"http://econpy.pythonanywhere.com/ex/NFL_1979.csv\")\n", "df = read_csv(page)\n", "print df[:3]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df1=df[0:10]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print df1" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "A=df1[:3]\n", "B=df1[3:7]\n", "C=df1[7:10]\n", "print A,B,C" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "parts=[A,B,C]\n", "df2=pd.concat(parts)\n", "print df2" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})\n", "right= pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})\n", "print left\n", "print right " ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.merge(left, right, on='key')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "rowadd=df.iloc[3]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print rowadd,df" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df.append(rowadd,ignore_index=True)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',\n", " ....: 'foo', 'bar', 'foo', 'foo'],\n", " ....: 'B' : ['one', 'one', 'two', 'three',\n", " ....: 'two', 'two', 'one', 'three'],\n", " ....: 'C' : np.random.randn(8),\n", " ....: 'D' : np.random.randn(8)})\n", "print df" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df.groupby('A').sum()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "df.groupby(['A','B']).sum()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Statistical Tests\n", "\n", "`pandas` allows for using some built-in statistical methods to compare, fit or interpolate data. \n", "\n", "\n", "### Regression\n", "\n", "Regression analysis refers to the process of estimating relationships between variables. Linear regression is equivalent to fitting a line between to sets of data points (x,y)\n", "\n", "$$y_i(x) = a_0 + a_1x_i $$" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\n", "import statsmodels.formula.api as sm\n", "import matplotlib.pyplot as plt\n", "url = \"http://vincentarelbundock.github.com/Rdatasets/csv/HistData/Guerry.csv\"\n", "df = pd.read_csv(url)\n", "#print df\n", "df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()\n", "df.head()\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "\n", "mod = sm.ols(formula='Lottery ~ Literacy ', data=df)\n", "res = mod.fit()\n", "print res.summary()\n", "intercept, slope =res.params" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "xtest=np.linspace(1,100,100)\n", "ytest=intercept+slope*xtest" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "plt.plot(df['Literacy'],df['Lottery'],'kx')\n", "plt.plot(xtest,ytest,'r')\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### T-Test\n", "\n", "The t-test assesses whether the means of two groups are statistically different from each other." ] }, { "cell_type": "code", "collapsed": false, "input": [ "\n", "town1_heights = pd.Series([5, 6, 7, 6, 7.1, 6, 4])\n", "town2_heights = pd.Series([5.5, 6.5, 7, 6, 7.1, 6])\n", "\n", "town1_mean = town1_heights.mean()\n", "town2_mean = town2_heights.mean()\n", "\n", "print \"Town 1 avg. height\", town1_mean\n", "print \"Town 2 avg. height\", town2_mean\n", "\n", "print \"Effect size: \", abs(town1_mean - town2_mean)\n", "\n", "df=pd.DataFrame({'T1':town1_heights,'T2':town2_heights})\n", "b=df.boxplot()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from scipy import stats\n", "\n", "print \"Town 1 Shapiro-Wilks p-value\", stats.shapiro(town1_heights)[1]\n", "\n", "print \" T-Test p-value:\", stats.ttest_ind(town1_heights, town2_heights,equal_var = False)[1]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Time Series\n", "\n", "A time series is a sequence of data points, measured typically at successive points in time spaced at uniform time intervals. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "rng = pd.date_range('1/1/2012', periods=100, freq='S')\n", "print rng" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)\n", "ts.plot()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))\n", "ts=ts.cumsum()\n", "ts.plot()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#Bar plot\n", "\n", "ts = pd.DataFrame(np.random.randn(1000,5), index=pd.date_range('1/1/2000', periods=1000))\n", "ts=ts.cumsum()\n", "print ts.ix[5]\n", "ts.ix[5].plot(kind='bar'); plt.axhline(0, color='k')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Data Problem\n", "\n", "Imagine yourself to be a sales analyst at an apparel company. Your boss asks you to look at weather data from the past year to understand the weather data over the months, so that you can have the right apparel on display at the appropriate time.\n", "\n", "You can get the data from [here](http://climate.weather.gc.ca/index_e.html) (so your company is Canadian). The template for downloading the data is:\n", "\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "url_template = \"http://climate.weather.gc.ca/climateData/bulkdata_e.html?format=csv&stationID=5415&Year={year}&Month={month}&timeframe=1&submit=Download+Data\"" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Usually in data tasks, there are no specified objectives. One needs to play around with the data in order to derive inferences. While this might seem like a vague and daunting task, it simply requires a start and once you get familiar with the data, you will eventually find some patterns and will be able to make an initial set of conclusions.\n", "\n", "Here let's start with the data for March 2012 (there seems to be less data for the more recent years).\n", "\n", "\n", "\n", "### Data Cleaning" ] }, { "cell_type": "code", "collapsed": false, "input": [ "url = url_template.format(month=3, year=2012)\n", "weather_mar2012 = pd.read_csv(url, skiprows=16, index_col='Date/Time', parse_dates=True, encoding='latin1')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "weather_mar2012" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are only interested in the temperatures, so let's go ahead and plot the column for the month of March,2012. But we also see that the Temp column has some special characters, which might be painful to reuse. So, let's fix that first!\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "weather_mar2012.columns = [\n", " u'Year', u'Month', u'Day', u'Time', u'Data Quality', u'Temp (C)', \n", " u'Temp Flag', u'Dew Point Temp (C)', u'Dew Point Temp Flag', \n", " u'Rel Hum (%)', u'Rel Hum Flag', u'Wind Dir (10s deg)', u'Wind Dir Flag', \n", " u'Wind Spd (km/h)', u'Wind Spd Flag', u'Visibility (km)', u'Visibility Flag',\n", " u'Stn Press (kPa)', u'Stn Press Flag', u'Hmdx', u'Hmdx Flag', u'Wind Chill', \n", " u'Wind Chill Flag', u'Weather']" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "weather_mar2012[u'Temp (C)'].plot(figsize=(15, 5))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are also many columns with NA values. We cannot use them in any of our analyses, so we can go ahead and drop them." ] }, { "cell_type": "code", "collapsed": true, "input": [ "weather_mar2012 = weather_mar2012.dropna(axis=1, how='any')\n", "weather_mar2012" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We managed to clean up some of the data for March 2012. That's great, but we are also interested in the entire year's data. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "##Pandas cookbook\n", "def download_weather_month(year, month):\n", " if month == 1:\n", " year += 1\n", " url = url_template.format(year=year, month=month)\n", " weather_data = pd.read_csv(url, skiprows=16, index_col='Date/Time', parse_dates=True)\n", " weather_data = weather_data.dropna(axis=1)\n", " weather_data.columns = [col.replace('\\xb0', '') for col in weather_data.columns]\n", " weather_data = weather_data.drop(['Year', 'Day', 'Month', 'Time', 'Data Quality'], axis=1)\n", " return weather_data" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "data_by_month = [download_weather_month(2012, i) for i in range(1, 13)]\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#Saving to a csv\n", "weather_2012 = pd.concat(data_by_month)\n", "weather_2012.to_csv('weather_2012.csv')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Data Analysis\n", "\n", "`pandas` provides vectorized string functions, to make it easy to operate on columns containing text." ] }, { "cell_type": "code", "collapsed": false, "input": [ "weather_description = weather_2012['Weather']\n", "is_snowing = weather_description.str.contains('Snow')\n", "is_snowing.plot()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now try to find the month where it snowed the most, so that your company can have extra stock of those down jackets for this month. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "weather_2012['Temp (C)'].resample('M', how=np.median).plot(kind='bar')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "is_snowing.astype(float).resample('M', how=np.mean)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "is_snowing.astype(float).resample('M', how=np.mean).plot(kind='bar')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Other kinds of Plotting \n", "### Scatter plots" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])\n", "print df\n", "#df.plot(kind='bar')\n", "#df.plot(kind='bar', stacked=True)\n", "#df.plot(kind='barh', stacked=True)\n", "#print pd.__version__\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from pandas.tools.plotting import scatter_matrix\n", "df = pd.DataFrame(np.random.randn(100, 4), columns=['a', 'b', 'c', 'd'])\n", "scatter_matrix(df, figsize=(7, 7), diagonal='kde')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parallel Coordinates" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pandas import read_csv\n", "from urllib import urlopen\n", "from pandas.tools.plotting import andrews_curves \n", "\n", "page = urlopen(\"https://raw.githubusercontent.com/pydata/pandas/master/pandas/tests/data/iris.csv\")\n", "df = read_csv(page)\n", "andrews_curves(df, 'Name')\n", "\n", "from pandas.tools.plotting import parallel_coordinates\n", "\n", "#parallel_coordinates(df,'Name')\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lag plots" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pandas.tools.plotting import lag_plot\n", "data = pd.Series(0.1 * np.random.rand(1000) + 0.9 * np.sin(np.linspace(-99 * np.pi, 99 * np.pi, num=1000)))\n", "lag_plot(data)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## References\n", "- [Pandas cookbook](https://github.com/jvns/pandas-cookbook)\n", "- [Pandas Data Visualization](http://pandas.pydata.org/pandas-docs/stable/visualization.html/). \n", "- [Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/index.html/).\n", "- [On Andrew's curves](http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/tutorials/mvahtmlnode9.html)\n", "- [Parallel coordinates](http://en.wikipedia.org/wiki/Parallel_coordinates)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Credits\n", "\n", "Neal Davis and Lakshmi Rao developed these materials for [Computational Science and Engineering](http://cse.illinois.edu/) at the University of Illinois at Urbana\u2013Champaign.\n", "\n", "\n", "This content is available under a [Creative Commons Attribution 3.0 Unported License](https://creativecommons.org/licenses/by/3.0/).\n", "\n", "[![](https://bytebucket.org/davis68/resources/raw/f7c98d2b95e961fae257707e22a58fa1a2c36bec/logos/baseline_cse_wdmk.png?token=be4cc41d4b2afe594f5b1570a3c5aad96a65f0d6)](http://cse.illinois.edu/)" ] } ], "metadata": {} } ] }