{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"try:\n",
" import seaborn\n",
"except ImportError:\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tabular data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = pd.read_csv(\"data/titanic.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Starting from reading this dataset, to answering questions about this data in a few lines of code:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**What is the age distribution of the passengers?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df['Age'].hist()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**How does the survival rate of the passengers differ between sexes?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.groupby('Sex')[['Survived']].aggregate(lambda x: x.sum() / len(x))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Or how does it differ between the different classes?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.groupby('Pclass')['Survived'].aggregate(lambda x: x.sum() / len(x)).plot(kind='bar')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Are young people more likely to survive?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df['Survived'].sum() / df['Survived'].count()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df25 = df[df['Age'] <= 25]\n",
"df25['Survived'].sum() / len(df25['Survived'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All the needed functionality for the above examples will be explained throughout this tutorial."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data structures\n",
"\n",
"Pandas provides two fundamental data objects, for 1D (``Series``) and 2D data (``DataFrame``)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Series\n",
"\n",
"A Series is a basic holder for **one-dimensional labeled data**. It can be created much as a NumPy array is created:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s = pd.Series([0.1, 0.2, 0.3, 0.4])\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Attributes of a Series: `index` and `values`\n",
"\n",
"The series has a built-in concept of an **index**, which by default is the numbers *0* through *N - 1*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s.index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can access the underlying numpy array representation with the `.values` attribute:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s.values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can access series values via the index, just like for NumPy arrays:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unlike the NumPy array, though, this index can be something other than integers:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])\n",
"s2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s2['c']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this way, a ``Series`` object can be thought of as similar to an ordered dictionary mapping one typed value to another typed value.\n",
"\n",
"In fact, it's possible to construct a series directly from a Python dictionary:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pop_dict = {'Germany': 81.3, \n",
" 'Belgium': 11.3, \n",
" 'France': 64.3, \n",
" 'United Kingdom': 64.9, \n",
" 'Netherlands': 16.9}\n",
"population = pd.Series(pop_dict)\n",
"population"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can index the populations like a dict as expected:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"population['France']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"but with the power of numpy arrays:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"population * 1000"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrames: Multi-dimensional Data\n",
"\n",
"A DataFrame is a **tablular data structure** (multi-dimensional object to hold labeled data) comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series object which share the same index.\n",
"\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the most common ways of creating a dataframe is from a dictionary of arrays or lists.\n",
"\n",
"Note that in the IPython notebook, the dataframe will display in a rich HTML view:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],\n",
" 'population': [11.3, 64.3, 81.3, 16.9, 64.9],\n",
" 'area': [30510, 671308, 357050, 41526, 244820],\n",
" 'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}\n",
"countries = pd.DataFrame(data)\n",
"countries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Attributes of the DataFrame\n",
"\n",
"A DataFrame has besides a `index` attribute, also a `columns` attribute:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries.index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To check the data types of the different columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An overview of that information can be given with the `info()` method:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Also a DataFrame has a `values` attribute, but attention: when you have heterogeneous data, all values will be upcasted:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries.values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we don't like what the index looks like, we can reset it and set one of our columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries = countries.set_index('country')\n",
"countries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To access a Series representing a column in the data, use typical indexing syntax:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries['area']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Basic operations on Series/Dataframes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you play around with DataFrames, you'll notice that many operations which work on NumPy arrays will also work on dataframes.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# redefining the example objects\n",
"\n",
"population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, \n",
" 'United Kingdom': 64.9, 'Netherlands': 16.9})\n",
"\n",
"countries = pd.DataFrame({'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],\n",
" 'population': [11.3, 64.3, 81.3, 16.9, 64.9],\n",
" 'area': [30510, 671308, 357050, 41526, 244820],\n",
" 'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Elementwise-operations (like numpy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just like with numpy arrays, many operations are element-wise:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"population / 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries['population'] / countries['area']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Alignment! (unlike numpy)\n",
"\n",
"Only, pay attention to **alignment**: operations between series will align on the index: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s1 = population[['Belgium', 'France']]\n",
"s2 = population[['France', 'Germany']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"s1 + s2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reductions (like numpy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The average population number:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"population.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The minimum area:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries['area'].min()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For dataframes, often only the numeric columns are included in the result:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries.median()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"