{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Iteration, Indexing and Filters\n", "In this notebook we will learn indexing ```DataFrame``` with ```loc``` and ```iloc``` functionality and ```slicing``` operation. Some Data ```Filtering``` techniques will also be conducted.\n", "\n", "Read more about these functionality from [Pydata documentation for indexing](https://pandas.pydata.org/docs/user_guide/groupby.html)[1].\n", "Some parts of this notebook are taken from [EuroScipy 2016 Pandas Tutorial by Joris Van den Bossche and Nicholas Devenish](https://github.com/jorisvandenbossche/pandas-tutorial)[2]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "titanic = pd.read_csv('data/titanic.csv')\n", "titanic = titanic.set_index(\"Name\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>PassengerId</th>\n", " <th>Survived</th>\n", " <th>Pclass</th>\n", " <th>Sex</th>\n", " <th>Age</th>\n", " <th>SibSp</th>\n", " <th>Parch</th>\n", " <th>Ticket</th>\n", " <th>Fare</th>\n", " <th>Cabin</th>\n", " <th>Embarked</th>\n", " </tr>\n", " <tr>\n", " <th>Name</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Braund, Mr. Owen Harris</th>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>male</td>\n", " <td>22.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>A/5 21171</td>\n", " <td>7.2500</td>\n", " <td>NaN</td>\n", " <td>S</td>\n", " </tr>\n", " <tr>\n", " <th>Cumings, Mrs. John Bradley (Florence Briggs Thayer)</th>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>female</td>\n", " <td>38.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>PC 17599</td>\n", " <td>71.2833</td>\n", " <td>C85</td>\n", " <td>C</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " PassengerId Survived \\\n", "Name \n", "Braund, Mr. Owen Harris 1 0 \n", "Cumings, Mrs. John Bradley (Florence Briggs Tha... 2 1 \n", "\n", " Pclass Sex Age \\\n", "Name \n", "Braund, Mr. Owen Harris 3 male 22.0 \n", "Cumings, Mrs. John Bradley (Florence Briggs Tha... 1 female 38.0 \n", "\n", " SibSp Parch Ticket \\\n", "Name \n", "Braund, Mr. Owen Harris 1 0 A/5 21171 \n", "Cumings, Mrs. John Bradley (Florence Briggs Tha... 1 0 PC 17599 \n", "\n", " Fare Cabin Embarked \n", "Name \n", "Braund, Mr. Owen Harris 7.2500 NaN S \n", "Cumings, Mrs. John Bradley (Florence Briggs Tha... 71.2833 C85 C " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Lets create a small sample of data" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "sample = titanic[0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1. Iteration of rows implementing ```iterrows```" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Braund, Mr. Owen Harris [1, 0, 3, 'male', 22.0, 1, 0, 'A/5 21171', 7.25, nan, 'S']\n", "Cumings, Mrs. John Bradley (Florence Briggs Thayer) [2, 1, 1, 'female', 38.0, 1, 0, 'PC 17599', 71.2833, 'C85', 'C']\n", "Heikkinen, Miss. Laina [3, 1, 3, 'female', 26.0, 0, 0, 'STON/O2. 3101282', 7.925, nan, 'S']\n", "Futrelle, Mrs. Jacques Heath (Lily May Peel) [4, 1, 1, 'female', 35.0, 1, 0, '113803', 53.1, 'C123', 'S']\n", "Allen, Mr. William Henry [5, 0, 3, 'male', 35.0, 0, 0, '373450', 8.05, nan, 'S']\n" ] } ], "source": [ "for index,row in sample.iterrows():\n", " print(index,list(row))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- We can select specific columns by passing column names in ```row()``` input" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Braund, Mr. Owen Harris male 22.0\n", "Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0\n", "Heikkinen, Miss. Laina female 26.0\n", "Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0\n", "Allen, Mr. William Henry male 35.0\n" ] } ], "source": [ "for index,row in sample.iterrows():\n", " print(index,row['Sex'],row['Age'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. Iteration of rows with ```iteritems```" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Braund, Mr. Owen Harris [1, 0, 3, 'male', 22.0, 1, 0, 'A/5 21171', 7.25, nan, 'S']\n", "Cumings, Mrs. John Bradley (Florence Briggs Thayer) [2, 1, 1, 'female', 38.0, 1, 0, 'PC 17599', 71.2833, 'C85', 'C']\n", "Heikkinen, Miss. Laina [3, 1, 3, 'female', 26.0, 0, 0, 'STON/O2. 3101282', 7.925, nan, 'S']\n", "Futrelle, Mrs. Jacques Heath (Lily May Peel) [4, 1, 1, 'female', 35.0, 1, 0, '113803', 53.1, 'C123', 'S']\n", "Allen, Mr. William Henry [5, 0, 3, 'male', 35.0, 0, 0, '373450', 8.05, nan, 'S']\n" ] } ], "source": [ "for index,row in sample.T.iteritems():\n", " print(index,list(row))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PassengerId 1 2 3\n", "Survived 0 1 1\n", "Pclass 3 1 3\n", "Sex male female female\n", "Age 22.0 38.0 26.0\n", "SibSp 1 1 0\n", "Parch 0 0 0\n", "Ticket A/5 21171 PC 17599 STON/O2. 3101282\n", "Fare 7.25 71.2833 7.925\n", "Cabin nan C85 nan\n", "Embarked S C S\n" ] } ], "source": [ "for index,row in sample.iteritems():\n", " print(index,row[0],row[1],row[2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3. Indexing Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Source: [Using iloc, loc, & ix to select rows and columns in Pandas DataFrames](https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- ```loc``` and ```iloc``` :\n", "\n", "The iloc indexer for Pandas Dataframe is used for integer-location based indexing / selection by position.\n", "\n", "The Pandas loc indexer can be used with DataFrames for two different use cases:\n", "\n", " - a.) Selecting rows by label/index\n", " - b.) Selecting rows with a boolean / conditional lookup" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>PassengerId</th>\n", " <th>Survived</th>\n", " <th>Pclass</th>\n", " <th>Sex</th>\n", " <th>Age</th>\n", " <th>SibSp</th>\n", " <th>Parch</th>\n", " <th>Ticket</th>\n", " <th>Fare</th>\n", " <th>Cabin</th>\n", " <th>Embarked</th>\n", " </tr>\n", " <tr>\n", " <th>Name</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Braund, Mr. Owen Harris</th>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>male</td>\n", " <td>22.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>A/5 21171</td>\n", " <td>7.2500</td>\n", " <td>NaN</td>\n", " <td>S</td>\n", " </tr>\n", " <tr>\n", " <th>Cumings, Mrs. John Bradley (Florence Briggs Thayer)</th>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>female</td>\n", " <td>38.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>PC 17599</td>\n", " <td>71.2833</td>\n", " <td>C85</td>\n", " <td>C</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " PassengerId Survived \\\n", "Name \n", "Braund, Mr. Owen Harris 1 0 \n", "Cumings, Mrs. John Bradley (Florence Briggs Tha... 2 1 \n", "\n", " Pclass Sex Age \\\n", "Name \n", "Braund, Mr. Owen Harris 3 male 22.0 \n", "Cumings, Mrs. John Bradley (Florence Briggs Tha... 1 female 38.0 \n", "\n", " SibSp Parch Ticket \\\n", "Name \n", "Braund, Mr. Owen Harris 1 0 A/5 21171 \n", "Cumings, Mrs. John Bradley (Florence Briggs Tha... 1 0 PC 17599 \n", "\n", " Fare Cabin Embarked \n", "Name \n", "Braund, Mr. Owen Harris 7.2500 NaN S \n", "Cumings, Mrs. John Bradley (Florence Briggs Tha... 71.2833 C85 C " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample.iloc[0:2,:]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PassengerId 2\n", "Survived 1\n", "Pclass 1\n", "Name: Cumings, Mrs. John Bradley (Florence Briggs Thayer), dtype: object" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample.iloc[1,0:3]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Survived</th>\n", " <th>Pclass</th>\n", " <th>Sex</th>\n", " <th>Age</th>\n", " <th>SibSp</th>\n", " <th>Parch</th>\n", " <th>Ticket</th>\n", " </tr>\n", " <tr>\n", " <th>Name</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Braund, Mr. Owen Harris</th>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>male</td>\n", " <td>22.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>A/5 21171</td>\n", " </tr>\n", " <tr>\n", " <th>Cumings, Mrs. John Bradley (Florence Briggs Thayer)</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>female</td>\n", " <td>38.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>PC 17599</td>\n", " </tr>\n", " <tr>\n", " <th>Heikkinen, Miss. Laina</th>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>female</td>\n", " <td>26.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>STON/O2. 3101282</td>\n", " </tr>\n", " <tr>\n", " <th>Futrelle, Mrs. Jacques Heath (Lily May Peel)</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>female</td>\n", " <td>35.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>113803</td>\n", " </tr>\n", " <tr>\n", " <th>Allen, Mr. William Henry</th>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>male</td>\n", " <td>35.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>373450</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Survived Pclass Sex \\\n", "Name \n", "Braund, Mr. Owen Harris 0 3 male \n", "Cumings, Mrs. John Bradley (Florence Briggs Tha... 1 1 female \n", "Heikkinen, Miss. Laina 1 3 female \n", "Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 1 female \n", "Allen, Mr. William Henry 0 3 male \n", "\n", " Age SibSp Parch \\\n", "Name \n", "Braund, Mr. Owen Harris 22.0 1 0 \n", "Cumings, Mrs. John Bradley (Florence Briggs Tha... 38.0 1 0 \n", "Heikkinen, Miss. Laina 26.0 0 0 \n", "Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 \n", "Allen, Mr. William Henry 35.0 0 0 \n", "\n", " Ticket \n", "Name \n", "Braund, Mr. Owen Harris A/5 21171 \n", "Cumings, Mrs. John Bradley (Florence Briggs Tha... PC 17599 \n", "Heikkinen, Miss. Laina STON/O2. 3101282 \n", "Futrelle, Mrs. Jacques Heath (Lily May Peel) 113803 \n", "Allen, Mr. William Henry 373450 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample.loc[:,'Survived': 'Ticket']" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PassengerId 1\n", "Survived 0\n", "Pclass 3\n", "Sex male\n", "Age 22\n", "SibSp 1\n", "Parch 0\n", "Ticket A/5 21171\n", "Fare 7.25\n", "Cabin NaN\n", "Embarked S\n", "Name: Braund, Mr. Owen Harris, dtype: object" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample.loc['Braund, Mr. Owen Harris',:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4 - Data Filters:\n", "\n", "Data Filters can be performed by either selecting specific set of column names or by seting ```boolean filters```." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Dictionary to DataFrame" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>country</th>\n", " <th>population</th>\n", " <th>area</th>\n", " <th>capital</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Belgium</td>\n", " <td>11.3</td>\n", " <td>30510</td>\n", " <td>Brussels</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>France</td>\n", " <td>64.3</td>\n", " <td>671308</td>\n", " <td>Paris</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>Germany</td>\n", " <td>81.3</td>\n", " <td>357050</td>\n", " <td>Berlin</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Netherlands</td>\n", " <td>16.9</td>\n", " <td>41526</td>\n", " <td>Amsterdam</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>United Kingdom</td>\n", " <td>64.9</td>\n", " <td>244820</td>\n", " <td>London</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " country population area capital\n", "0 Belgium 11.3 30510 Brussels\n", "1 France 64.3 671308 Paris\n", "2 Germany 81.3 357050 Berlin\n", "3 Netherlands 16.9 41526 Amsterdam\n", "4 United Kingdom 64.9 244820 London" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],\n", " 'population': [11.3, 64.3, 81.3, 16.9, 64.9],\n", " 'area': [30510, 671308, 357050, 41526, 244820],\n", " 'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}\n", "\n", "\n", "\n", "countries = pd.DataFrame(data)\n", "countries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- To set the ```country``` as index:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>population</th>\n", " <th>area</th>\n", " <th>capital</th>\n", " </tr>\n", " <tr>\n", " <th>country</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Belgium</th>\n", " <td>11.3</td>\n", " <td>30510</td>\n", " <td>Brussels</td>\n", " </tr>\n", " <tr>\n", " <th>France</th>\n", " <td>64.3</td>\n", " <td>671308</td>\n", " <td>Paris</td>\n", " </tr>\n", " <tr>\n", " <th>Germany</th>\n", " <td>81.3</td>\n", " <td>357050</td>\n", " <td>Berlin</td>\n", " </tr>\n", " <tr>\n", " <th>Netherlands</th>\n", " <td>16.9</td>\n", " <td>41526</td>\n", " <td>Amsterdam</td>\n", " </tr>\n", " <tr>\n", " <th>United Kingdom</th>\n", " <td>64.9</td>\n", " <td>244820</td>\n", " <td>London</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " population area capital\n", "country \n", "Belgium 11.3 30510 Brussels\n", "France 64.3 671308 Paris\n", "Germany 81.3 357050 Berlin\n", "Netherlands 16.9 41526 Amsterdam\n", "United Kingdom 64.9 244820 London" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries = countries.set_index('country')\n", "countries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- To find the ```area``` of each country:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "country\n", "Belgium 30510\n", "France 671308\n", "Germany 357050\n", "Netherlands 41526\n", "United Kingdom 244820\n", "Name: area, dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries['area']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- To create a ```list``` instead of pandas ```Series``` object." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "Area = list(countries['area'])" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[30510, 671308, 357050, 41526, 244820]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Area" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- To create a ```sub-DataFrame``` by choosing ```area``` and ```population```" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>area</th>\n", " <th>population</th>\n", " </tr>\n", " <tr>\n", " <th>country</th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Belgium</th>\n", " <td>30510</td>\n", " <td>11.3</td>\n", " </tr>\n", " <tr>\n", " <th>France</th>\n", " <td>671308</td>\n", " <td>64.3</td>\n", " </tr>\n", " <tr>\n", " <th>Germany</th>\n", " <td>357050</td>\n", " <td>81.3</td>\n", " </tr>\n", " <tr>\n", " <th>Netherlands</th>\n", " <td>41526</td>\n", " <td>16.9</td>\n", " </tr>\n", " <tr>\n", " <th>United Kingdom</th>\n", " <td>244820</td>\n", " <td>64.9</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " area population\n", "country \n", "Belgium 30510 11.3\n", "France 671308 64.3\n", "Germany 357050 81.3\n", "Netherlands 41526 16.9\n", "United Kingdom 244820 64.9" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries[['area', 'population']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- To observe slice of data by ```index``` names." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>population</th>\n", " <th>area</th>\n", " <th>capital</th>\n", " </tr>\n", " <tr>\n", " <th>country</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>France</th>\n", " <td>64.3</td>\n", " <td>671308</td>\n", " <td>Paris</td>\n", " </tr>\n", " <tr>\n", " <th>Germany</th>\n", " <td>81.3</td>\n", " <td>357050</td>\n", " <td>Berlin</td>\n", " </tr>\n", " <tr>\n", " <th>Netherlands</th>\n", " <td>16.9</td>\n", " <td>41526</td>\n", " <td>Amsterdam</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " population area capital\n", "country \n", "France 64.3 671308 Paris\n", "Germany 81.3 357050 Berlin\n", "Netherlands 16.9 41526 Amsterdam" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries['France':'Netherlands']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- To observe the data at specific row identified by ```index``` name(s) and spefic column identified by ```column```name(s):" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "357050" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries.loc['Germany', 'area']" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>area</th>\n", " <th>population</th>\n", " </tr>\n", " <tr>\n", " <th>country</th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>France</th>\n", " <td>671308</td>\n", " <td>64.3</td>\n", " </tr>\n", " <tr>\n", " <th>Germany</th>\n", " <td>357050</td>\n", " <td>81.3</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " area population\n", "country \n", "France 671308 64.3\n", "Germany 357050 81.3" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries.loc['France':'Germany', ['area', 'population']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- To set ```boolean``` selection for ```subDataFrame```:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "country\n", "Belgium False\n", "France True\n", "Germany True\n", "Netherlands False\n", "United Kingdom True\n", "Name: area, dtype: bool" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries['area'] > 100000" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>population</th>\n", " <th>area</th>\n", " <th>capital</th>\n", " </tr>\n", " <tr>\n", " <th>country</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>France</th>\n", " <td>64.3</td>\n", " <td>671308</td>\n", " <td>Paris</td>\n", " </tr>\n", " <tr>\n", " <th>Germany</th>\n", " <td>81.3</td>\n", " <td>357050</td>\n", " <td>Berlin</td>\n", " </tr>\n", " <tr>\n", " <th>United Kingdom</th>\n", " <td>64.9</td>\n", " <td>244820</td>\n", " <td>London</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " population area capital\n", "country \n", "France 64.3 671308 Paris\n", "Germany 81.3 357050 Berlin\n", "United Kingdom 64.9 244820 London" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries[countries['area'] > 100000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5. Exercises:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<div class=\"alert alert-success\">\n", " <b>EXERCISE</b>: Add a column `density` with the population density (note: population column is expressed in millions)\n", "</div>" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>population</th>\n", " <th>area</th>\n", " <th>capital</th>\n", " <th>density</th>\n", " </tr>\n", " <tr>\n", " <th>country</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Belgium</th>\n", " <td>11.3</td>\n", " <td>30510</td>\n", " <td>Brussels</td>\n", " <td>370.370370</td>\n", " </tr>\n", " <tr>\n", " <th>France</th>\n", " <td>64.3</td>\n", " <td>671308</td>\n", " <td>Paris</td>\n", " <td>95.783158</td>\n", " </tr>\n", " <tr>\n", " <th>Germany</th>\n", " <td>81.3</td>\n", " <td>357050</td>\n", " <td>Berlin</td>\n", " <td>227.699202</td>\n", " </tr>\n", " <tr>\n", " <th>Netherlands</th>\n", " <td>16.9</td>\n", " <td>41526</td>\n", " <td>Amsterdam</td>\n", " <td>406.973944</td>\n", " </tr>\n", " <tr>\n", " <th>United Kingdom</th>\n", " <td>64.9</td>\n", " <td>244820</td>\n", " <td>London</td>\n", " <td>265.092721</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " population area capital density\n", "country \n", "Belgium 11.3 30510 Brussels 370.370370\n", "France 64.3 671308 Paris 95.783158\n", "Germany 81.3 357050 Berlin 227.699202\n", "Netherlands 16.9 41526 Amsterdam 406.973944\n", "United Kingdom 64.9 244820 London 265.092721" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries['density'] = countries['population']*1000000 / countries['area']\n", "countries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<div class=\"alert alert-success\">\n", " <b>EXERCISE</b>: Select the capital and the population column of those countries where the density is larger than 300\n", "</div>" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>capital</th>\n", " <th>population</th>\n", " </tr>\n", " <tr>\n", " <th>country</th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Belgium</th>\n", " <td>Brussels</td>\n", " <td>11.3</td>\n", " </tr>\n", " <tr>\n", " <th>Netherlands</th>\n", " <td>Amsterdam</td>\n", " <td>16.9</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " capital population\n", "country \n", "Belgium Brussels 11.3\n", "Netherlands Amsterdam 16.9" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries.loc[countries['density'] > 300, ['capital', 'population']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<div class=\"alert alert-success\">\n", " <b>EXERCISE</b>: Add a column 'density_ratio' with the ratio of the density to the mean density\n", "</div>" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>population</th>\n", " <th>area</th>\n", " <th>capital</th>\n", " <th>density</th>\n", " <th>density_ratio</th>\n", " </tr>\n", " <tr>\n", " <th>country</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Belgium</th>\n", " <td>11.3</td>\n", " <td>30510</td>\n", " <td>Brussels</td>\n", " <td>370.370370</td>\n", " <td>1.355755</td>\n", " </tr>\n", " <tr>\n", " <th>France</th>\n", " <td>64.3</td>\n", " <td>671308</td>\n", " <td>Paris</td>\n", " <td>95.783158</td>\n", " <td>0.350618</td>\n", " </tr>\n", " <tr>\n", " <th>Germany</th>\n", " <td>81.3</td>\n", " <td>357050</td>\n", " <td>Berlin</td>\n", " <td>227.699202</td>\n", " <td>0.833502</td>\n", " </tr>\n", " <tr>\n", " <th>Netherlands</th>\n", " <td>16.9</td>\n", " <td>41526</td>\n", " <td>Amsterdam</td>\n", " <td>406.973944</td>\n", " <td>1.489744</td>\n", " </tr>\n", " <tr>\n", " <th>United Kingdom</th>\n", " <td>64.9</td>\n", " <td>244820</td>\n", " <td>London</td>\n", " <td>265.092721</td>\n", " <td>0.970382</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " population area capital density density_ratio\n", "country \n", "Belgium 11.3 30510 Brussels 370.370370 1.355755\n", "France 64.3 671308 Paris 95.783158 0.350618\n", "Germany 81.3 357050 Berlin 227.699202 0.833502\n", "Netherlands 16.9 41526 Amsterdam 406.973944 1.489744\n", "United Kingdom 64.9 244820 London 265.092721 0.970382" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries['density_ratio'] = countries['density'] / countries['density'].mean()\n", "countries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<div class=\"alert alert-success\">\n", " <b>EXERCISE</b>: Change the capital of the UK to Cambridge\n", "</div>" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>population</th>\n", " <th>area</th>\n", " <th>capital</th>\n", " <th>density</th>\n", " <th>density_ratio</th>\n", " </tr>\n", " <tr>\n", " <th>country</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Belgium</th>\n", " <td>11.3</td>\n", " <td>30510</td>\n", " <td>Brussels</td>\n", " <td>370.370370</td>\n", " <td>1.355755</td>\n", " </tr>\n", " <tr>\n", " <th>France</th>\n", " <td>64.3</td>\n", " <td>671308</td>\n", " <td>Paris</td>\n", " <td>95.783158</td>\n", " <td>0.350618</td>\n", " </tr>\n", " <tr>\n", " <th>Germany</th>\n", " <td>81.3</td>\n", " <td>357050</td>\n", " <td>Berlin</td>\n", " <td>227.699202</td>\n", " <td>0.833502</td>\n", " </tr>\n", " <tr>\n", " <th>Netherlands</th>\n", " <td>16.9</td>\n", " <td>41526</td>\n", " <td>Amsterdam</td>\n", " <td>406.973944</td>\n", " <td>1.489744</td>\n", " </tr>\n", " <tr>\n", " <th>United Kingdom</th>\n", " <td>64.9</td>\n", " <td>244820</td>\n", " <td>Cambridge</td>\n", " <td>265.092721</td>\n", " <td>0.970382</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " population area capital density density_ratio\n", "country \n", "Belgium 11.3 30510 Brussels 370.370370 1.355755\n", "France 64.3 671308 Paris 95.783158 0.350618\n", "Germany 81.3 357050 Berlin 227.699202 0.833502\n", "Netherlands 16.9 41526 Amsterdam 406.973944 1.489744\n", "United Kingdom 64.9 244820 Cambridge 265.092721 0.970382" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries.loc['United Kingdom', 'capital'] = 'Cambridge'\n", "countries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<div class=\"alert alert-success\">\n", " <b>EXERCISE</b>: Select all countries whose population density is between 100 and 300 people/kmĀ²\n", "</div>" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>population</th>\n", " <th>area</th>\n", " <th>capital</th>\n", " <th>density</th>\n", " <th>density_ratio</th>\n", " </tr>\n", " <tr>\n", " <th>country</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Germany</th>\n", " <td>81.3</td>\n", " <td>357050</td>\n", " <td>Berlin</td>\n", " <td>227.699202</td>\n", " <td>0.833502</td>\n", " </tr>\n", " <tr>\n", " <th>United Kingdom</th>\n", " <td>64.9</td>\n", " <td>244820</td>\n", " <td>Cambridge</td>\n", " <td>265.092721</td>\n", " <td>0.970382</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " population area capital density density_ratio\n", "country \n", "Germany 81.3 357050 Berlin 227.699202 0.833502\n", "United Kingdom 64.9 244820 Cambridge 265.092721 0.970382" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries[(countries['density'] > 100) & (countries['density'] < 300)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### References:\n", "1. [Pydata documentation: Group by: split-apply-combine](https://pandas.pydata.org/docs/user_guide/groupby.html)\n", "2. [EuroScipy 2016 Pandas Tutorial by Joris Van den Bossche and Nicholas Devenish](https://github.com/jorisvandenbossche/pandas-tutorial)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }