{ "metadata": { "name": "", "signature": "sha256:2ef8aafef08f084c8e4dbdbaa5cca5b6cff30d403b744ada3defbc6b97855e5b" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Introduction to Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a personal preference, I believe it is better not to import all the functions into the current namespace." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "import pandas as pd" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 3 types of data structures.\n", "\n", "| Data Structure |Dimensions |\n", "| ------------- | ------------- |\n", "| **Series** | 1-Dim |\n", "| **DataFrame** | 2-Dim |\n", "| ~~Panel~~ | 3-Dim |\n", "\n", "We will be dealing with Series and DataFrames. We will not be handing Panel here.\n", "\n", "All datastructures have both List-like and Dict-like properties.\n" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Series at it's simplest form can be created from a dict." ] }, { "cell_type": "code", "collapsed": false, "input": [ "data = {'Mon':'Monday',\n", " 'Tues':'Tuesday',\n", " 'Wed':'Wednesday',\n", " 'Thurs':'Thursday',\n", " }\n", "s = pd.Series(data)\n", "s" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ "Mon Monday\n", "Thurs Thursday\n", "Tues Tuesday\n", "Wed Wednesday\n", "dtype: object" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "s.index" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "Index([u'Mon', u'Thurs', u'Tues', u'Wed'], dtype='object')" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Series can also be created from a sequence of values and a sequence of index." ] }, { "cell_type": "code", "collapsed": false, "input": [ "s = pd.Series(np.random.randint(5, 15, 7), ('Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat',\n", " 'Sun'), name='Temperature')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "s.index.name = \"Day of the Week\"" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "s" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "Day of the Week\n", "Mon 12\n", "Tues 12\n", "Wed 14\n", "Thur 5\n", "Fri 5\n", "Sat 14\n", "Sun 12\n", "Name: Temperature, dtype: int64" ] } ], "prompt_number": 21 }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Series as a Dict" ] }, { "cell_type": "code", "collapsed": false, "input": [ "s['Tues']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 22, "text": [ "12" ] } ], "prompt_number": 22 }, { "cell_type": "code", "collapsed": false, "input": [ "'Mon' in s" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ "True" ] } ], "prompt_number": 23 }, { "cell_type": "code", "collapsed": false, "input": [ "'Son' in s" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 24, "text": [ "False" ] } ], "prompt_number": 24 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Series can also be sliced using index." ] }, { "cell_type": "code", "collapsed": false, "input": [ "s['Thur':'Sun']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 25, "text": [ "Day of the Week\n", "Thur 5\n", "Fri 5\n", "Sat 14\n", "Sun 12\n", "Name: Temperature, dtype: int64" ] } ], "prompt_number": 25 }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Series : as an ndarray" ] }, { "cell_type": "code", "collapsed": false, "input": [ "s.max()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 26, "text": [ "14" ] } ], "prompt_number": 26 }, { "cell_type": "code", "collapsed": false, "input": [ "s + 2*s #Vectorized operation\n" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 27, "text": [ "Day of the Week\n", "Mon 36\n", "Tues 36\n", "Wed 42\n", "Thur 15\n", "Fri 15\n", "Sat 42\n", "Sun 36\n", "Name: Temperature, dtype: int64" ] } ], "prompt_number": 27 }, { "cell_type": "code", "collapsed": false, "input": [ "s[1] #Accessing a value by position" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 28, "text": [ "12" ] } ], "prompt_number": 28 }, { "cell_type": "code", "collapsed": false, "input": [ "s[2:5] #Slicing the Series by position" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 29, "text": [ "Day of the Week\n", "Wed 14\n", "Thur 5\n", "Fri 5\n", "Name: Temperature, dtype: int64" ] } ], "prompt_number": 29 }, { "cell_type": "code", "collapsed": false, "input": [ "s[:1]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 33, "text": [ "Day of the Week\n", "Mon 12\n", "Name: Temperature, dtype: int64" ] } ], "prompt_number": 33 }, { "cell_type": "code", "collapsed": false, "input": [ "s - np.random.randint(5, 15, 7)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 39, "text": [ "Day of the Week\n", "Mon 7\n", "Tues 0\n", "Wed 9\n", "Thur -1\n", "Fri 0\n", "Sat 0\n", "Sun 5\n", "Name: Temperature, dtype: int64" ] } ], "prompt_number": 39 }, { "cell_type": "code", "collapsed": false, "input": [ "for x in s: print x #iterating over values" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "12\n", "12\n", "14\n", "5\n", "5\n", "14\n", "12\n" ] } ], "prompt_number": 42 }, { "cell_type": "code", "collapsed": false, "input": [ "for pos, value in enumerate(s): print pos, ':', value" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0 : 12\n", "1 : 12\n", "2 : 14\n", "3 : 5\n", "4 : 5\n", "5 : 14\n", "6 : 12\n" ] } ], "prompt_number": 43 }, { "cell_type": "code", "collapsed": false, "input": [ "for key, value in s.iteritems(): print key, ':', value" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Mon : 12\n", "Tues : 12\n", "Wed : 14\n", "Thur : 5\n", "Fri : 5\n", "Sat : 14\n", "Sun : 12\n" ] } ], "prompt_number": 44 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dataframe is a two dimensional array, and probably the most used data structure in Pandas. The columns themselves can have different data types but all the values within each column should be of the same datatype.\n", "\n", "A dataframe can be created from\n", " * python dict\n", " * csv\n", " * xls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-Now let us look at the obligatory Day-Temperature example." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import datetime" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "base = datetime.datetime.today()\n", "days = 20\n", "date_list = [base - datetime.timedelta(days=x) for x in range(0, days)]\n", "date_list = [datetime.date(x.year, x.month, x.day) for x in date_list]\n", "date_list.reverse()\n", "data = {'date':date_list, \n", " 'Chennai':np.random.randint(25,35,days), \n", " 'Mumbai':np.random.randint(15,25,days), \n", " 'Delhi':np.random.randint(5,15,days)}\n", "df = pd.DataFrame(data)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "type(df)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ "pandas.core.frame.DataFrame" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ChennaiDelhiMumbaidate
0 29 5 19 2014-11-02
1 33 11 24 2014-11-03
2 27 14 19 2014-11-04
3 30 9 20 2014-11-05
4 27 5 15 2014-11-06
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ " Chennai Delhi Mumbai date\n", "0 29 5 19 2014-11-02\n", "1 33 11 24 2014-11-03\n", "2 27 14 19 2014-11-04\n", "3 30 9 20 2014-11-05\n", "4 27 5 15 2014-11-06" ] } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "df = df.set_index('date')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ChennaiDelhiMumbai
date
2014-11-02 29 5 19
2014-11-03 33 11 24
2014-11-04 27 14 19
2014-11-05 30 9 20
2014-11-06 27 5 15
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ " Chennai Delhi Mumbai\n", "date \n", "2014-11-02 29 5 19\n", "2014-11-03 33 11 24\n", "2014-11-04 27 14 19\n", "2014-11-05 30 9 20\n", "2014-11-06 27 5 15" ] } ], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "df.median()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 20, "text": [ "Chennai 30.0\n", "Delhi 9.5\n", "Mumbai 19.0\n", "dtype: float64" ] } ], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "df.mean()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "Chennai 29.60\n", "Delhi 9.15\n", "Mumbai 19.20\n", "dtype: float64" ] } ], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "df.diff().head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ChennaiDelhiMumbai
date
2014-11-02NaNNaNNaN
2014-11-03 4 6 5
2014-11-04 -6 3 -5
2014-11-05 3 -5 1
2014-11-06 -3 -4 -5
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 24, "text": [ " Chennai Delhi Mumbai\n", "date \n", "2014-11-02 NaN NaN NaN\n", "2014-11-03 4 6 5\n", "2014-11-04 -6 3 -5\n", "2014-11-05 3 -5 1\n", "2014-11-06 -3 -4 -5" ] } ], "prompt_number": 24 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Obligatory CSV Example ;)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "titanic = pd.read_csv('data/titanic.csv')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 25 }, { "cell_type": "code", "collapsed": false, "input": [ "titanic = titanic.set_index('PassengerId')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 33 }, { "cell_type": "code", "collapsed": false, "input": [ "titanic.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 34, "text": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "3 1 3 \n", "4 1 1 \n", "5 0 3 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund, Mr. Owen Harris male 22 \n", "2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 \n", "3 Heikkinen, Miss. Laina female 26 \n", "4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 \n", "5 Allen, Mr. William Henry male 35 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S \n", "2 1 0 PC 17599 71.2833 C85 C \n", "3 0 0 STON/O2. 3101282 7.9250 NaN S \n", "4 1 0 113803 53.1000 C123 S \n", "5 0 0 373450 8.0500 NaN S " ] } ], "prompt_number": 34 }, { "cell_type": "code", "collapsed": false, "input": [ "len(titanic)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 29, "text": [ "891" ] } ], "prompt_number": 29 }, { "cell_type": "code", "collapsed": false, "input": [ "titanic.Fare.sum()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 30, "text": [ "28693.9493" ] } ], "prompt_number": 30 }, { "cell_type": "code", "collapsed": false, "input": [ "titanic.Survived.value_counts()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 31, "text": [ "0 549\n", "1 342\n", "dtype: int64" ] } ], "prompt_number": 31 }, { "cell_type": "code", "collapsed": false, "input": [ "titanic.Pclass.value_counts()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 35, "text": [ "3 491\n", "1 216\n", "2 184\n", "dtype: int64" ] } ], "prompt_number": 35 }, { "cell_type": "heading", "level": 5, "metadata": {}, "source": [ "Lets dive in!" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }