{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Indexing and selecting data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "try:\n", " import seaborn\n", "except ImportError:\n", " pass" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# redefining the example objects\n", "\n", "# series\n", "population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, \n", " 'United Kingdom': 64.9, 'Netherlands': 16.9})\n", "\n", "# dataframe\n", "data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],\n", " 'population': [11.3, 64.3, 81.3, 16.9, 64.9],\n", " 'area': [30510, 671308, 357050, 41526, 244820],\n", " 'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}\n", "countries = pd.DataFrame(data)\n", "countries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Setting the index to the country names:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries = countries.set_index('country')\n", "countries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some notes on selecting data\n", "\n", "One of pandas' basic features is the labeling of rows and columns, but this makes indexing also a bit more complex compared to numpy. We now have to distuinguish between:\n", "\n", "- selection by label\n", "- selection by position." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `data[]` provides some convenience shortcuts " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a DataFrame, basic indexing selects the columns.\n", "\n", "Selecting a single column:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries['area']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or multiple columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries[['area', 'population']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But, slicing accesses the rows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries['France':'Netherlands']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " NOTE: Unlike slicing in numpy, the end label is **included**.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So as a summary, `[]` provides the following convenience shortcuts:\n", "\n", "- Series: selecting a label: `s[label]`\n", "- DataFrame: selecting a single or multiple columns: `df['col']` or `df[['col1', 'col2']]`\n", "- DataFrame: slicing the rows: `df['row_label1':'row_label2']` or `df[mask]`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Systematic indexing with `loc` and `iloc`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When using `[]` like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:\n", " \n", "* `loc`: selection by label\n", "* `iloc`: selection by position\n", "\n", "These methods index the different dimensions of the frame:\n", "\n", "* `df.loc[row_indexer, column_indexer]`\n", "* `df.iloc[row_indexer, column_indexer]`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting a single element:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries.loc['Germany', 'area']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But the row or column indexer can also be a list, slice, boolean array, .." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries.loc['France':'Germany', ['area', 'population']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "Selecting by position with `iloc` works similar as indexing numpy arrays:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries.iloc[0:2,1:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The different indexing methods can also be used to assign data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "countries2 = countries.copy()\n", "countries2.loc['Belgium':'Germany', 'population'] = 10" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Boolean indexing (filtering)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often, you want to select rows based on a certain condition. This can be done with 'boolean indexing' (like a where clause in SQL). \n", "\n", "The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries['area'] > 100000" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries[countries['area'] > 100000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "
\n", " EXERCISE: Add a column `density` with the population density (note: population column is expressed in millions)\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: Select the capital and the population column of those countries where the density is larger than 300\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: Add a column 'density_ratio' with the ratio of the density to the mean density\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: Change the capital of the UK to Cambridge\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "
\n", " EXERCISE: Select all countries whose population density is between 100 and 300 people/kmĀ²\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Some other useful methods: `isin` and string methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `isin` method of Series is very useful to select rows that may contain certain values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "s = countries['capital']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "s.isin?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s.isin(['Berlin', 'London'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can then be used to filter the dataframe with boolean indexing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries[countries['capital'].isin(['Berlin', 'London'])]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say we want to select all data for which the capital starts with a 'B'. In Python, when having a string, we could use the `startswith` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "'Berlin'.startswith('B')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In pandas, these are available on a Series through the `str` namespace:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries['capital'].str.startswith('B')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For an overview of all string methods, see: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: Select all countries that have capital names with more than 7 characters\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: Select all countries that have capital names that contain the character sequence 'am'\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pitfall: chained indexing (and the 'SettingWithCopyWarning')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "countries.loc['Belgium', 'capital'] = 'Ghent' " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries['capital']['Belgium'] = 'Antwerp' " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries[countries['capital'] == 'Antwerp']['capital'] = 'Brussels' " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "countries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How to avoid this?\n", "\n", "* Use `loc` instead of chained indexing if possible!\n", "* Or `copy` explicitly if you don't want to change the original data." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## More exercises!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the quick ones among you, here are some more exercises with some larger dataframe with film data. These exercises are based on the [PyCon tutorial of Brandon Rhodes](https://github.com/brandon-rhodes/pycon-pandas-tutorial/) (so all credit to him!) and the datasets he prepared for that. You can download these data from here: [`titles.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKajNMa1pfSzN6Q3M) and [`cast.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKal9UYTJSR2ZhSW8) and put them in the `/data` folder." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "cast = pd.read_csv('data/cast.csv')\n", "cast.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "titles = pd.read_csv('data/titles.csv')\n", "titles.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: How many movies are listed in the titles dataframe?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: What are the earliest two films listed in the titles dataframe?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: How many movies have the title \"Hamlet\"?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: List all of the \"Treasure Island\" movies from earliest to most recent.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: How many movies were made from 1950 through 1959?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: How many roles in the movie \"Inception\" are NOT ranked by an \"n\" value?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: But how many roles in the movie \"Inception\" did receive an \"n\" value?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: Display the cast of \"North by Northwest\" in their correct \"n\"-value order, ignoring roles that did not earn a numeric \"n\" value.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: How many roles were credited in the silent 1921 version of Hamlet?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: List the supporting roles (having n=2) played by Cary Grant in the 1940s, in order by year.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "clear_cell": true, "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Nbtutor - export exercises", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }