{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Data Analysis \n", "\n", " **In God we trust, all others bring data.** - *The Elements of Statistical Learning*\n", "\n", "Big Data, Data Analytics, and Data Science are common buzzwords of the data world, so much so that data is considered to be the \"new oil\". There are excellent data-specific programming tools such as SAS, R, and Hadoop. Using a more general-purpose scripting language like Python for data analysis is helpful because it lets you combine data tasks with scientific programming.\n", "\n", "\n", "One major issue for statistical programmers using Python has been the lack of libraries implementing standard models and of a cohesive framework for specifying models. **Pandas**, the data analysis library that has been in development since 2008, aims to bridge this gap.\n", "\n", "Pandas derives its name from **pan**el **da**tasets, a commonly used term for multi-dimensional datasets encountered in statistics and econometrics.\n", "\n", "\n", "\n", "\n", "\n", "Data analysis is only as good as its visualization. Today we will use a number of datasets together with Python's plotting library, **matplotlib**, to demonstrate what we have learned. 
The notebook is structured as follows:\n", "\n", "## Contents\n", "- [Data Analysis](#intro)\n", "- [Matplotlib](#mpl)\n", "- [Data Analysis: pandas](#pandas)\n", " - [Series](#series)\n", " - [String methods](#smethods)\n", " - [Reading from a csv](#csv)\n", "- [DataFrames](#df)\n", " - [Exercise 1: DataFrames](#ex1)\n", " - [Data Manipulation](#dm)\n", " - [Exercise 2: Data Extraction](#ex2)\n", " - [Plotting data](#plot)\n", " - [Missing Data](#missing)\n", " - [Exercise 3: DataFrame Methods](#ex3)\n", " - [More Manipulations](#mm)\n", "- [Statistical Tests](#stats)\n", " - [Regression](#regression)\n", " - [T-Test](#ttest)\n", " - [Time Series](#ts)\n", "- [Data Problem](#dp)\n", " - [Data Cleaning](#dc)\n", " - [Data Analysis](#da)\n", "- [Miscellaneous plots](#oplot)\n", "- [References](#refs)\n", "- [Credits](#credits)\n", "\n", "\n", "\n", "\n", "\n", " \n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from __future__ import division\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "\n", "#IPython magic command for inline plotting\n", "%matplotlib inline\n", "#ggplot-like plot style via matplotlib's style sheets\n", "plt.style.use('ggplot')\n", "#a better plot shape for IPython\n", "mpl.rcParams['figure.figsize'] = [15, 3]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Quick Overview of matplotlib\n", "\n", "Matplotlib is the primary plotting library in Python. A separate notebook in a subsequent session is dedicated to its features. For plotting with **pandas** today, we will touch only on the basics of **matplotlib**." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "# start just above 0: np.pi/x is undefined at x = 0\n", "x = np.linspace(1e-3, 1, 10001)\n", "y = np.cos(np.pi/x) * np.exp(-x**2)\n", "\n", "plt.plot(x, y)\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Plot the following equations over the domain $x \\in \\left[-1, 2\\right]$.\n", " * $y = f(x) = x^2 \\exp(-x)$\n", " * $y = f(x) = \\log x$\n", " * $y = f(x) = 1 + x^x + 3 x^4$\n", "\n", "Note that $\\log x$ and $x^x$ are real-valued only for $x > 0$, so plot those two over the positive part of the domain." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# first exercise function: x^2 * exp(-x)\n", "x = np.linspace(-1, 2, 10001)\n", "y = x**2 * np.exp(-x)\n", "\n", "plt.plot(x, y)\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Data analysis: pandas\n", "\n", "The pandas module provides data structures and tools for data analysis. It focuses on data handling and manipulation as well as linear and panel regression. It is designed to let you carry out your entire data workflow in Python without having to switch to a domain-specific language such as R. Although pandas is largely compatible with NumPy/SciPy, there are some important differences in indexing, data organization, and features. The basic pandas data types are not `ndarray` but `Series` and `DataFrame`, which allow you to index data and align axes efficiently.\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Series \n", "\n", "A `Series` object is a one-dimensional array which can hold any data type. Like a dictionary, it has a set of indices for access (like keys); unlike a dictionary, it is ordered. Data alignment is intrinsic and will not be broken unless you do it explicitly. It is very similar to NumPy's `ndarray`.\n", "\n", "The index can be an arbitrary list of values or a list of axis labels, so a `Series` can act something like a `dict`." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "s = pd.Series([1,5,float('NaN'),7.5,2.1,3])\n", "print(s)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "dates = pd.date_range('20140201', periods=s.size)\n", "s.index = dates\n", "print(s)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "letters = ['A', 'B', 'Ch', '#', '#', '---']\n", "s.index = letters\n", "print(s)\n", "print('\\nAccess is like a dictionary key:\\ns[\\'---\\'] = '+str(s['---']))\n", "print('\\nRepeat labels are possible:\\ns[\\'#\\']=\\n'+str(s['#']))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NumPy functions expecting an ndarray often do just fine with Series as well." ] }, { "cell_type": "code", "collapsed": false, "input": [ "t = np.exp(s)\n", "print(t)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## String Methods\n", "\n", "Series is equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. 
These are accessed via the Series\u2019s `str` attribute and generally have names matching the equivalent (scalar) built-in string methods:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "s.str.upper()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "s.str.lower()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "s.str.len()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])\n", "print(s2)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "s2.str.split('_')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
Method | Description\n",
"---|---\n",
"cat | Concatenate strings\n",
"split | Split strings on delimiter\n",
"get | Index into each element (retrieve i-th element)\n",
"join | Join strings in each element of the Series with passed separator\n",
"contains | Return boolean array indicating whether each string contains pattern/regex\n",
"replace | Replace occurrences of pattern/regex with some other string\n",
"repeat | Duplicate values (s.str.repeat(3) is equivalent to x * 3 for each string x)\n",
"pad | Add whitespace to left, right, or both sides of strings\n",
"center | Equivalent to pad(side='both')\n",
"wrap | Split long strings into lines with length less than a given width\n",
"slice | Slice each string in the Series\n",
"slice_replace | Replace slice in each string with passed value\n",
"count | Count occurrences of pattern\n",
"startswith | Equivalent to str.startswith(pat) for each element\n",
"endswith | Equivalent to str.endswith(pat) for each element\n",
"findall | Compute list of all occurrences of pattern/regex for each string\n",
"match | Call re.match on each element, returning matched groups as list\n",
"extract | Call re.match on each element, as match does, but return matched groups as strings for convenience\n",
"len | Compute string lengths\n",
"strip | Equivalent to str.strip\n",
"rstrip | Equivalent to str.rstrip\n",
"lstrip | Equivalent to str.lstrip\n",
"lower | Equivalent to str.lower\n",
"upper | Equivalent to str.upper\n", "