{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "try:\n", " import seaborn\n", "except ImportError:\n", " pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Tabular data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.read_csv(\"data/titanic.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Starting from reading this dataset, to answering questions about this data in a few lines of code:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**What is the age distribution of the passengers?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df['Age'].hist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**How does the survival rate of the passengers differ between sexes?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.groupby('Sex')[['Survived']].aggregate(lambda x: x.sum() / len(x))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Or how does it differ between the different classes?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.groupby('Pclass')['Survived'].aggregate(lambda x: x.sum() / len(x)).plot(kind='bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Are young people more likely to survive?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df['Survived'].sum() / df['Survived'].count()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df25 = df[df['Age'] <= 25]\n", "df25['Survived'].sum() / len(df25['Survived'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All the needed functionality for the above examples will be explained throughout this tutorial." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data structures\n", "\n", "Pandas provides two fundamental data objects, for 1D (``Series``) and 2D data (``DataFrame``)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Series\n", "\n", "A Series is a basic holder for **one-dimensional labeled data**. It can be created much as a NumPy array is created:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s = pd.Series([0.1, 0.2, 0.3, 0.4])\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Attributes of a Series: `index` and `values`\n", "\n", "The series has a built-in concept of an **index**, which by default is the numbers *0* through *N - 1*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can access the underlying numpy array representation with the `.values` attribute:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can access series values via the index, just like for NumPy arrays:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike the NumPy array, though, this index can be something other than integers:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])\n", "s2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s2['c']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this way, a ``Series`` object can be thought of as similar to an ordered dictionary mapping one typed value to another typed value.\n", "\n", "In fact, it's possible to construct a series directly from a Python dictionary:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pop_dict = {'Germany': 81.3, \n", " 'Belgium': 11.3, \n", " 'France': 64.3, \n", " 'United Kingdom': 64.9, \n", " 'Netherlands': 16.9}\n", "population = pd.Series(pop_dict)\n", "population" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can index the populations like a dict as expected:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "population['France']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "but with the power of numpy arrays:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "population * 1000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DataFrames: Multi-dimensional Data\n", "\n", "A DataFrame is a **tablular data structure** (multi-dimensional object to hold labeled data) comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series object which share the same index.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the most common ways of creating a dataframe is from a dictionary of arrays or lists.\n", "\n", "Note that in the IPython notebook, the dataframe will display in a rich HTML view:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],\n", " 'population': [11.3, 64.3, 81.3, 16.9, 64.9],\n", " 'area': [30510, 671308, 357050, 41526, 244820],\n", " 'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}\n", "countries = pd.DataFrame(data)\n", "countries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Attributes of the DataFrame\n", "\n", "A DataFrame has besides a `index` attribute, also a `columns` attribute:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries.index" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To check the data types of the different columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An overview of that information can be given with the `info()` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also a DataFrame has a `values` attribute, but attention: when you have heterogeneous data, all values will be upcasted:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we don't like what the index looks like, we can reset it and set one of our columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries = countries.set_index('country')\n", "countries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To access a Series representing a column in the data, use typical indexing syntax:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries['area']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Basic operations on Series/Dataframes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you play around with DataFrames, you'll notice that many operations which work on NumPy arrays will also work on dataframes.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# redefining the example objects\n", "\n", "population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, \n", " 'United Kingdom': 64.9, 'Netherlands': 16.9})\n", "\n", "countries = pd.DataFrame({'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],\n", " 'population': [11.3, 64.3, 81.3, 16.9, 64.9],\n", " 'area': [30510, 671308, 357050, 41526, 244820],\n", " 'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Elementwise-operations (like numpy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just like with numpy arrays, many operations are element-wise:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "population / 100" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries['population'] / countries['area']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Alignment! (unlike numpy)\n", "\n", "Only, pay attention to **alignment**: operations between series will align on the index: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s1 = population[['Belgium', 'France']]\n", "s2 = population[['France', 'Germany']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "s1 + s2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reductions (like numpy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The average population number:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "population.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The minimum area:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries['area'].min()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For dataframes, often only the numeric columns are included in the result:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries.median()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: Calculate the population numbers relative to Belgium\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false, "scrolled": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: Calculate the population density for each country and add this as a new column to the dataframe.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false, "scrolled": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false, "scrolled": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Some other useful methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sorting the rows of the DataFrame according to the values in a column:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries.sort_values('density', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One useful method to use is the ``describe`` method, which computes summary statistics for each column:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `plot` method can be used to quickly visualize the data in different ways:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, for this dataset, it does not say that much:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries['population'].plot(kind='bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can play with the `kind` keyword: 'line', 'bar', 'hist', 'density', 'area', 'pie', 'scatter', 'hexbin'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing and exporting data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A wide range of input/output formats are natively supported by pandas:\n", "\n", "* CSV, text\n", "* SQL database\n", "* Excel\n", "* HDF5\n", "* json\n", "* html\n", "* pickle\n", "* ..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pd.read" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "states.to" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other features\n", "\n", "* Working with missing data (`.dropna()`, `pd.isnull()`)\n", "* Merging and joining (`concat`, `join`)\n", "* Grouping: `groupby` functionality\n", "* Reshaping (`stack`, `pivot`)\n", "* Time series manipulation (resampling, timezones, ..)\n", "* Easy plotting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many, many more interesting operations that can be done on Series and DataFrame objects, but rather than continue using this toy data, we'll instead move to a real-world example, and illustrate some of the advanced concepts along the way.\n", "\n", "See the next notebooks!" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Acknowledgement\n", "\n", "> *© 2015, Stijn Van Hoey and Joris Van den Bossche (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*\n", "\n", "> This notebook is partly based on material of Jake Vanderplas (https://github.com/jakevdp/OsloWorkshop2014).\n", "\n", "---" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }