{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data analysis in Python\n",
    "\n",
    "## Lesson preamble\n",
    "\n",
    "### Learning objectives\n",
    "\n",
    "- Describe what a data frame is.\n",
    "- Load external data from a .csv file into a data frame with pandas.\n",
    "- Summarize the contents of a data frame with pandas.\n",
    "- Learn to use data frame attributes `loc[]`, `head()`, `info()`, `describe()`, `shape`, `columns`, `index`.\n",
    "- Learn to clean dirty data.\n",
    "- Understand the split-apply-combine concept for data analysis.\n",
    "    - Use `groupby()`, `mean()`, `agg()` and `size()` to apply this technique.\n",
    "\n",
    "### Lesson outline\n",
    "\n",
    "- Manipulating and analyzing data with pandas\n",
    "    - Data set background (10 min)\n",
    "    - What are data frames (15 min)\n",
    "    - Data wrangling with pandas (40 min)\n",
    "- Cleaning data (20 min)\n",
    "- Split-apply-combine techniques in `pandas`\n",
    "    - Using `mean()` to summarize categorical data (20 min)\n",
    "    - Using `size()` to summarize categorical data (15 min)\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Manipulating and analyzing data with pandas\n",
    "\n",
    "### Dataset background\n",
    "\n",
    "Today, we will be working with real data about the world combined from multiple sources by the [Gapminder foundation](https://www.gapminder.org/about-gapminder/). Gapminder is and independent Swedish foundation that fights devastating misconceptions about global development and promotes as fact-based world view through the production of free teaching and data exploration resources. Insights from the combined Gapminder data sources have been popularized through the efforts of public health professor Hans Rosling, and it is highly recommended to check out his entertaining videos, for example this one.\n",
    "\n",
    "As a start, we recommend taking [this 5-10 min quiz](http://forms.gapminder.org/s3/test-2018) to see how ignorant you are about the world. Then we will learn how to dive into this data further using Python!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Overview of the gapminder world data\n",
    "\n",
    "We are studying the species and weight of animals caught in plots in our study\n",
    "area. The dataset is stored as a comma separated value (CSV) file. Each row\n",
    "holds information for a single animal, and the columns represent:\n",
    "\n",
    "| Column                | Description                        |\n",
    "|-----------------------|------------------------------------|\n",
    "| country               | Country name                       |\n",
    "| year                  | Year of observation                |\n",
    "| population            | Population in the country at each year |\n",
    "| region                | Continent the country belongs to   |\n",
    "| sub_region            | Sub regions as defined by          |\n",
    "| income_group          | Income group [as specified by the world bank](https://datahelpdesk.worldbank.org/knowledgebase/articles/378833-how-are-the-income-group-thresholds-determined)                  |\n",
    "| life_expectancy       | The average number of years a newborn child would <br>live if mortality patterns were to stay the same |\n",
    "| income                | GDP per capita (in USD) adjusted <br>for differences in purchasing power|\n",
    "| children_per_woman    | Number of children born to each woman|\n",
    "| child_mortality       | Deaths of children under 5 years <break>of age per 1000 live births|\n",
    "| pop_density           | Average number of people per km<sup>2</sup>|\n",
    "| co2_per_capita        | CO2 emissions from fossil fuels (tonnes per capita)|\n",
    "| years_in_school_men   | Average number of years attending primary, secondary,<br>and tertiary school for 25-36 years old men|\n",
    "| years_in_school_women | Average number of years attending primary, secondary,<br>and tertiary school for 25-36 years old women|"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To read the data into Python, we are going to use a function called `read_csv` from the Python-package [`pandas`](https://pandas.pydata.org/). As mentioned previously, Python-packages are a bit like browser extensions, they are not essential, but can provide nifty functionality. To use a package, it first needs to be imported."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# pandas is given the nickname `pd`\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pandas` can read CSV-files saved on the computer or directly from an URL."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "world_data = pd.read_csv('https://raw.githubusercontent.com/UofTCoders/2018-09-10-utoronto/gh-pages/data/world-data-gapminder.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To view the result, type `world_data` in a cell and run it, just as when viewing the content of any variable in Python."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "      <th>population</th>\n",
       "      <th>region</th>\n",
       "      <th>sub_region</th>\n",
       "      <th>income_group</th>\n",
       "      <th>life_expectancy</th>\n",
       "      <th>income</th>\n",
       "      <th>children_per_woman</th>\n",
       "      <th>child_mortality</th>\n",
       "      <th>pop_density</th>\n",
       "      <th>co2_per_capita</th>\n",
       "      <th>years_in_school_men</th>\n",
       "      <th>years_in_school_women</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1800</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.00</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1801</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.00</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1802</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.00</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1803</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.00</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1804</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.00</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1805</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.00</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1806</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.1</td>\n",
       "      <td>603</td>\n",
       "      <td>7.00</td>\n",
       "      <td>470.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1807</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.1</td>\n",
       "      <td>603</td>\n",
       "      <td>7.00</td>\n",
       "      <td>470.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1808</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.1</td>\n",
       "      <td>603</td>\n",
       "      <td>7.00</td>\n",
       "      <td>470.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1809</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.1</td>\n",
       "      <td>603</td>\n",
       "      <td>7.00</td>\n",
       "      <td>470.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1810</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.1</td>\n",
       "      <td>604</td>\n",
       "      <td>7.00</td>\n",
       "      <td>470.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1811</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.1</td>\n",
       "      <td>604</td>\n",
       "      <td>7.00</td>\n",
       "      <td>470.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1812</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.1</td>\n",
       "      <td>604</td>\n",
       "      <td>7.00</td>\n",
       "      <td>470.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1813</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.1</td>\n",
       "      <td>604</td>\n",
       "      <td>7.00</td>\n",
       "      <td>470.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1814</td>\n",
       "      <td>3290000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.1</td>\n",
       "      <td>604</td>\n",
       "      <td>7.00</td>\n",
       "      <td>470.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1815</td>\n",
       "      <td>3290000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.1</td>\n",
       "      <td>604</td>\n",
       "      <td>7.00</td>\n",
       "      <td>470.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1816</td>\n",
       "      <td>3300000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.1</td>\n",
       "      <td>604</td>\n",
       "      <td>7.00</td>\n",
       "      <td>471.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1817</td>\n",
       "      <td>3300000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.0</td>\n",
       "      <td>604</td>\n",
       "      <td>7.00</td>\n",
       "      <td>471.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1818</td>\n",
       "      <td>3310000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.0</td>\n",
       "      <td>604</td>\n",
       "      <td>7.00</td>\n",
       "      <td>471.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1819</td>\n",
       "      <td>3320000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.0</td>\n",
       "      <td>604</td>\n",
       "      <td>7.00</td>\n",
       "      <td>471.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1820</td>\n",
       "      <td>3320000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.0</td>\n",
       "      <td>604</td>\n",
       "      <td>7.00</td>\n",
       "      <td>471.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1821</td>\n",
       "      <td>3330000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.0</td>\n",
       "      <td>607</td>\n",
       "      <td>7.00</td>\n",
       "      <td>471.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1822</td>\n",
       "      <td>3340000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.0</td>\n",
       "      <td>609</td>\n",
       "      <td>7.00</td>\n",
       "      <td>471.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1823</td>\n",
       "      <td>3350000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.0</td>\n",
       "      <td>611</td>\n",
       "      <td>7.00</td>\n",
       "      <td>471.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1824</td>\n",
       "      <td>3360000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.0</td>\n",
       "      <td>613</td>\n",
       "      <td>7.00</td>\n",
       "      <td>471.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1825</td>\n",
       "      <td>3380000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>27.9</td>\n",
       "      <td>615</td>\n",
       "      <td>7.00</td>\n",
       "      <td>471.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1826</td>\n",
       "      <td>3390000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>27.9</td>\n",
       "      <td>617</td>\n",
       "      <td>7.00</td>\n",
       "      <td>473.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1827</td>\n",
       "      <td>3400000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>27.9</td>\n",
       "      <td>619</td>\n",
       "      <td>7.00</td>\n",
       "      <td>473.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1828</td>\n",
       "      <td>3420000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>27.9</td>\n",
       "      <td>621</td>\n",
       "      <td>7.00</td>\n",
       "      <td>473.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1829</td>\n",
       "      <td>3430000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>27.9</td>\n",
       "      <td>623</td>\n",
       "      <td>7.00</td>\n",
       "      <td>473.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38952</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>1989</td>\n",
       "      <td>9900000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>62.7</td>\n",
       "      <td>2490</td>\n",
       "      <td>5.37</td>\n",
       "      <td>73.9</td>\n",
       "      <td>25.6</td>\n",
       "      <td>1.630</td>\n",
       "      <td>7.61</td>\n",
       "      <td>6.01</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38953</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>1990</td>\n",
       "      <td>10200000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>61.7</td>\n",
       "      <td>2590</td>\n",
       "      <td>5.18</td>\n",
       "      <td>75.2</td>\n",
       "      <td>26.3</td>\n",
       "      <td>1.540</td>\n",
       "      <td>7.74</td>\n",
       "      <td>6.16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38954</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>1991</td>\n",
       "      <td>10400000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>61.0</td>\n",
       "      <td>2670</td>\n",
       "      <td>5.00</td>\n",
       "      <td>77.4</td>\n",
       "      <td>27.0</td>\n",
       "      <td>1.530</td>\n",
       "      <td>7.88</td>\n",
       "      <td>6.31</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38955</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>1992</td>\n",
       "      <td>10700000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>59.4</td>\n",
       "      <td>2370</td>\n",
       "      <td>4.84</td>\n",
       "      <td>80.2</td>\n",
       "      <td>27.6</td>\n",
       "      <td>1.590</td>\n",
       "      <td>8.01</td>\n",
       "      <td>6.46</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38956</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>1993</td>\n",
       "      <td>10900000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>57.6</td>\n",
       "      <td>2350</td>\n",
       "      <td>4.69</td>\n",
       "      <td>83.4</td>\n",
       "      <td>28.2</td>\n",
       "      <td>1.500</td>\n",
       "      <td>8.14</td>\n",
       "      <td>6.61</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38957</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>1994</td>\n",
       "      <td>11100000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>55.8</td>\n",
       "      <td>2520</td>\n",
       "      <td>4.56</td>\n",
       "      <td>86.8</td>\n",
       "      <td>28.7</td>\n",
       "      <td>1.600</td>\n",
       "      <td>8.28</td>\n",
       "      <td>6.76</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38958</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>1995</td>\n",
       "      <td>11300000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>53.7</td>\n",
       "      <td>2480</td>\n",
       "      <td>4.43</td>\n",
       "      <td>90.1</td>\n",
       "      <td>29.3</td>\n",
       "      <td>1.340</td>\n",
       "      <td>8.41</td>\n",
       "      <td>6.92</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38959</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>1996</td>\n",
       "      <td>11500000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>52.2</td>\n",
       "      <td>2690</td>\n",
       "      <td>4.33</td>\n",
       "      <td>92.8</td>\n",
       "      <td>29.8</td>\n",
       "      <td>1.300</td>\n",
       "      <td>8.54</td>\n",
       "      <td>7.07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38960</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>1997</td>\n",
       "      <td>11700000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>50.8</td>\n",
       "      <td>2710</td>\n",
       "      <td>4.24</td>\n",
       "      <td>94.7</td>\n",
       "      <td>30.3</td>\n",
       "      <td>1.230</td>\n",
       "      <td>8.67</td>\n",
       "      <td>7.23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38961</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>1998</td>\n",
       "      <td>11900000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>49.1</td>\n",
       "      <td>2750</td>\n",
       "      <td>4.16</td>\n",
       "      <td>95.9</td>\n",
       "      <td>30.7</td>\n",
       "      <td>1.200</td>\n",
       "      <td>8.80</td>\n",
       "      <td>7.39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38962</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>1999</td>\n",
       "      <td>12100000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>47.8</td>\n",
       "      <td>2690</td>\n",
       "      <td>4.10</td>\n",
       "      <td>96.4</td>\n",
       "      <td>31.2</td>\n",
       "      <td>1.310</td>\n",
       "      <td>8.93</td>\n",
       "      <td>7.55</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38963</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2000</td>\n",
       "      <td>12200000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>46.7</td>\n",
       "      <td>2570</td>\n",
       "      <td>4.06</td>\n",
       "      <td>96.8</td>\n",
       "      <td>31.6</td>\n",
       "      <td>1.140</td>\n",
       "      <td>9.07</td>\n",
       "      <td>7.71</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38964</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2001</td>\n",
       "      <td>12400000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>46.2</td>\n",
       "      <td>2580</td>\n",
       "      <td>4.02</td>\n",
       "      <td>97.1</td>\n",
       "      <td>32.0</td>\n",
       "      <td>1.020</td>\n",
       "      <td>9.20</td>\n",
       "      <td>7.87</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38965</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2002</td>\n",
       "      <td>12500000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>45.6</td>\n",
       "      <td>2320</td>\n",
       "      <td>4.00</td>\n",
       "      <td>97.7</td>\n",
       "      <td>32.3</td>\n",
       "      <td>0.957</td>\n",
       "      <td>9.33</td>\n",
       "      <td>8.03</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38966</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2003</td>\n",
       "      <td>12600000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>45.3</td>\n",
       "      <td>1910</td>\n",
       "      <td>3.99</td>\n",
       "      <td>98.2</td>\n",
       "      <td>32.7</td>\n",
       "      <td>0.843</td>\n",
       "      <td>9.47</td>\n",
       "      <td>8.20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38967</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2004</td>\n",
       "      <td>12800000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>45.1</td>\n",
       "      <td>1780</td>\n",
       "      <td>3.98</td>\n",
       "      <td>99.0</td>\n",
       "      <td>33.0</td>\n",
       "      <td>0.742</td>\n",
       "      <td>9.60</td>\n",
       "      <td>8.36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38968</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2005</td>\n",
       "      <td>12900000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>45.3</td>\n",
       "      <td>1650</td>\n",
       "      <td>3.99</td>\n",
       "      <td>99.7</td>\n",
       "      <td>33.4</td>\n",
       "      <td>0.832</td>\n",
       "      <td>9.73</td>\n",
       "      <td>8.53</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38969</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2006</td>\n",
       "      <td>13100000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>45.7</td>\n",
       "      <td>1580</td>\n",
       "      <td>3.99</td>\n",
       "      <td>100.0</td>\n",
       "      <td>33.9</td>\n",
       "      <td>0.796</td>\n",
       "      <td>9.87</td>\n",
       "      <td>8.69</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38970</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2007</td>\n",
       "      <td>13300000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>46.4</td>\n",
       "      <td>1490</td>\n",
       "      <td>4.00</td>\n",
       "      <td>100.0</td>\n",
       "      <td>34.5</td>\n",
       "      <td>0.742</td>\n",
       "      <td>10.00</td>\n",
       "      <td>8.86</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38971</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2008</td>\n",
       "      <td>13600000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>46.7</td>\n",
       "      <td>1210</td>\n",
       "      <td>4.01</td>\n",
       "      <td>98.0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>0.573</td>\n",
       "      <td>10.10</td>\n",
       "      <td>9.03</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38972</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2009</td>\n",
       "      <td>13800000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>47.5</td>\n",
       "      <td>1290</td>\n",
       "      <td>4.02</td>\n",
       "      <td>94.9</td>\n",
       "      <td>35.7</td>\n",
       "      <td>0.406</td>\n",
       "      <td>10.30</td>\n",
       "      <td>9.19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38973</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2010</td>\n",
       "      <td>14100000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>49.6</td>\n",
       "      <td>1460</td>\n",
       "      <td>4.03</td>\n",
       "      <td>89.9</td>\n",
       "      <td>36.4</td>\n",
       "      <td>0.552</td>\n",
       "      <td>10.40</td>\n",
       "      <td>9.36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38974</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2011</td>\n",
       "      <td>14400000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>51.9</td>\n",
       "      <td>1660</td>\n",
       "      <td>4.02</td>\n",
       "      <td>83.8</td>\n",
       "      <td>37.2</td>\n",
       "      <td>0.665</td>\n",
       "      <td>10.50</td>\n",
       "      <td>9.53</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38975</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2012</td>\n",
       "      <td>14700000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>54.1</td>\n",
       "      <td>1850</td>\n",
       "      <td>4.00</td>\n",
       "      <td>76.0</td>\n",
       "      <td>38.0</td>\n",
       "      <td>0.530</td>\n",
       "      <td>10.70</td>\n",
       "      <td>9.70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38976</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2013</td>\n",
       "      <td>15100000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>55.6</td>\n",
       "      <td>1900</td>\n",
       "      <td>3.96</td>\n",
       "      <td>70.0</td>\n",
       "      <td>38.9</td>\n",
       "      <td>0.776</td>\n",
       "      <td>10.80</td>\n",
       "      <td>9.86</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38977</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2014</td>\n",
       "      <td>15400000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>57.0</td>\n",
       "      <td>1910</td>\n",
       "      <td>3.90</td>\n",
       "      <td>64.3</td>\n",
       "      <td>39.8</td>\n",
       "      <td>0.780</td>\n",
       "      <td>10.90</td>\n",
       "      <td>10.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38978</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2015</td>\n",
       "      <td>15800000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>58.3</td>\n",
       "      <td>1890</td>\n",
       "      <td>3.84</td>\n",
       "      <td>59.9</td>\n",
       "      <td>40.8</td>\n",
       "      <td>NaN</td>\n",
       "      <td>11.10</td>\n",
       "      <td>10.20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38979</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2016</td>\n",
       "      <td>16200000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>59.3</td>\n",
       "      <td>1860</td>\n",
       "      <td>3.76</td>\n",
       "      <td>56.4</td>\n",
       "      <td>41.7</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38980</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2017</td>\n",
       "      <td>16500000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>59.8</td>\n",
       "      <td>1910</td>\n",
       "      <td>3.68</td>\n",
       "      <td>56.8</td>\n",
       "      <td>42.7</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38981</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>2018</td>\n",
       "      <td>16900000</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Sub-Saharan Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>60.2</td>\n",
       "      <td>1950</td>\n",
       "      <td>3.61</td>\n",
       "      <td>55.5</td>\n",
       "      <td>43.7</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>38982 rows × 14 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "           country  year  population  region          sub_region income_group  \\\n",
       "0      Afghanistan  1800     3280000    Asia       Southern Asia          Low   \n",
       "1      Afghanistan  1801     3280000    Asia       Southern Asia          Low   \n",
       "2      Afghanistan  1802     3280000    Asia       Southern Asia          Low   \n",
       "3      Afghanistan  1803     3280000    Asia       Southern Asia          Low   \n",
       "4      Afghanistan  1804     3280000    Asia       Southern Asia          Low   \n",
       "5      Afghanistan  1805     3280000    Asia       Southern Asia          Low   \n",
       "6      Afghanistan  1806     3280000    Asia       Southern Asia          Low   \n",
       "7      Afghanistan  1807     3280000    Asia       Southern Asia          Low   \n",
       "8      Afghanistan  1808     3280000    Asia       Southern Asia          Low   \n",
       "9      Afghanistan  1809     3280000    Asia       Southern Asia          Low   \n",
       "10     Afghanistan  1810     3280000    Asia       Southern Asia          Low   \n",
       "11     Afghanistan  1811     3280000    Asia       Southern Asia          Low   \n",
       "12     Afghanistan  1812     3280000    Asia       Southern Asia          Low   \n",
       "13     Afghanistan  1813     3280000    Asia       Southern Asia          Low   \n",
       "14     Afghanistan  1814     3290000    Asia       Southern Asia          Low   \n",
       "15     Afghanistan  1815     3290000    Asia       Southern Asia          Low   \n",
       "16     Afghanistan  1816     3300000    Asia       Southern Asia          Low   \n",
       "17     Afghanistan  1817     3300000    Asia       Southern Asia          Low   \n",
       "18     Afghanistan  1818     3310000    Asia       Southern Asia          Low   \n",
       "19     Afghanistan  1819     3320000    Asia       Southern Asia          Low   \n",
       "20     Afghanistan  1820     3320000    Asia       Southern Asia          Low   \n",
       "21     Afghanistan  1821     3330000    Asia       Southern Asia          Low   \n",
       "22     Afghanistan  1822     3340000    Asia       Southern Asia          Low   \n",
       "23     Afghanistan  1823     3350000    Asia       Southern Asia          Low   \n",
       "24     Afghanistan  1824     3360000    Asia       Southern Asia          Low   \n",
       "25     Afghanistan  1825     3380000    Asia       Southern Asia          Low   \n",
       "26     Afghanistan  1826     3390000    Asia       Southern Asia          Low   \n",
       "27     Afghanistan  1827     3400000    Asia       Southern Asia          Low   \n",
       "28     Afghanistan  1828     3420000    Asia       Southern Asia          Low   \n",
       "29     Afghanistan  1829     3430000    Asia       Southern Asia          Low   \n",
       "...            ...   ...         ...     ...                 ...          ...   \n",
       "38952     Zimbabwe  1989     9900000  Africa  Sub-Saharan Africa          Low   \n",
       "38953     Zimbabwe  1990    10200000  Africa  Sub-Saharan Africa          Low   \n",
       "38954     Zimbabwe  1991    10400000  Africa  Sub-Saharan Africa          Low   \n",
       "38955     Zimbabwe  1992    10700000  Africa  Sub-Saharan Africa          Low   \n",
       "38956     Zimbabwe  1993    10900000  Africa  Sub-Saharan Africa          Low   \n",
       "38957     Zimbabwe  1994    11100000  Africa  Sub-Saharan Africa          Low   \n",
       "38958     Zimbabwe  1995    11300000  Africa  Sub-Saharan Africa          Low   \n",
       "38959     Zimbabwe  1996    11500000  Africa  Sub-Saharan Africa          Low   \n",
       "38960     Zimbabwe  1997    11700000  Africa  Sub-Saharan Africa          Low   \n",
       "38961     Zimbabwe  1998    11900000  Africa  Sub-Saharan Africa          Low   \n",
       "38962     Zimbabwe  1999    12100000  Africa  Sub-Saharan Africa          Low   \n",
       "38963     Zimbabwe  2000    12200000  Africa  Sub-Saharan Africa          Low   \n",
       "38964     Zimbabwe  2001    12400000  Africa  Sub-Saharan Africa          Low   \n",
       "38965     Zimbabwe  2002    12500000  Africa  Sub-Saharan Africa          Low   \n",
       "38966     Zimbabwe  2003    12600000  Africa  Sub-Saharan Africa          Low   \n",
       "38967     Zimbabwe  2004    12800000  Africa  Sub-Saharan Africa          Low   \n",
       "38968     Zimbabwe  2005    12900000  Africa  Sub-Saharan Africa          Low   \n",
       "38969     Zimbabwe  2006    13100000  Africa  Sub-Saharan Africa          Low   \n",
       "38970     Zimbabwe  2007    13300000  Africa  Sub-Saharan Africa          Low   \n",
       "38971     Zimbabwe  2008    13600000  Africa  Sub-Saharan Africa          Low   \n",
       "38972     Zimbabwe  2009    13800000  Africa  Sub-Saharan Africa          Low   \n",
       "38973     Zimbabwe  2010    14100000  Africa  Sub-Saharan Africa          Low   \n",
       "38974     Zimbabwe  2011    14400000  Africa  Sub-Saharan Africa          Low   \n",
       "38975     Zimbabwe  2012    14700000  Africa  Sub-Saharan Africa          Low   \n",
       "38976     Zimbabwe  2013    15100000  Africa  Sub-Saharan Africa          Low   \n",
       "38977     Zimbabwe  2014    15400000  Africa  Sub-Saharan Africa          Low   \n",
       "38978     Zimbabwe  2015    15800000  Africa  Sub-Saharan Africa          Low   \n",
       "38979     Zimbabwe  2016    16200000  Africa  Sub-Saharan Africa          Low   \n",
       "38980     Zimbabwe  2017    16500000  Africa  Sub-Saharan Africa          Low   \n",
       "38981     Zimbabwe  2018    16900000  Africa  Sub-Saharan Africa          Low   \n",
       "\n",
       "       life_expectancy  income  children_per_woman  child_mortality  \\\n",
       "0                 28.2     603                7.00            469.0   \n",
       "1                 28.2     603                7.00            469.0   \n",
       "2                 28.2     603                7.00            469.0   \n",
       "3                 28.2     603                7.00            469.0   \n",
       "4                 28.2     603                7.00            469.0   \n",
       "5                 28.2     603                7.00            469.0   \n",
       "6                 28.1     603                7.00            470.0   \n",
       "7                 28.1     603                7.00            470.0   \n",
       "8                 28.1     603                7.00            470.0   \n",
       "9                 28.1     603                7.00            470.0   \n",
       "10                28.1     604                7.00            470.0   \n",
       "11                28.1     604                7.00            470.0   \n",
       "12                28.1     604                7.00            470.0   \n",
       "13                28.1     604                7.00            470.0   \n",
       "14                28.1     604                7.00            470.0   \n",
       "15                28.1     604                7.00            470.0   \n",
       "16                28.1     604                7.00            471.0   \n",
       "17                28.0     604                7.00            471.0   \n",
       "18                28.0     604                7.00            471.0   \n",
       "19                28.0     604                7.00            471.0   \n",
       "20                28.0     604                7.00            471.0   \n",
       "21                28.0     607                7.00            471.0   \n",
       "22                28.0     609                7.00            471.0   \n",
       "23                28.0     611                7.00            471.0   \n",
       "24                28.0     613                7.00            471.0   \n",
       "25                27.9     615                7.00            471.0   \n",
       "26                27.9     617                7.00            473.0   \n",
       "27                27.9     619                7.00            473.0   \n",
       "28                27.9     621                7.00            473.0   \n",
       "29                27.9     623                7.00            473.0   \n",
       "...                ...     ...                 ...              ...   \n",
       "38952             62.7    2490                5.37             73.9   \n",
       "38953             61.7    2590                5.18             75.2   \n",
       "38954             61.0    2670                5.00             77.4   \n",
       "38955             59.4    2370                4.84             80.2   \n",
       "38956             57.6    2350                4.69             83.4   \n",
       "38957             55.8    2520                4.56             86.8   \n",
       "38958             53.7    2480                4.43             90.1   \n",
       "38959             52.2    2690                4.33             92.8   \n",
       "38960             50.8    2710                4.24             94.7   \n",
       "38961             49.1    2750                4.16             95.9   \n",
       "38962             47.8    2690                4.10             96.4   \n",
       "38963             46.7    2570                4.06             96.8   \n",
       "38964             46.2    2580                4.02             97.1   \n",
       "38965             45.6    2320                4.00             97.7   \n",
       "38966             45.3    1910                3.99             98.2   \n",
       "38967             45.1    1780                3.98             99.0   \n",
       "38968             45.3    1650                3.99             99.7   \n",
       "38969             45.7    1580                3.99            100.0   \n",
       "38970             46.4    1490                4.00            100.0   \n",
       "38971             46.7    1210                4.01             98.0   \n",
       "38972             47.5    1290                4.02             94.9   \n",
       "38973             49.6    1460                4.03             89.9   \n",
       "38974             51.9    1660                4.02             83.8   \n",
       "38975             54.1    1850                4.00             76.0   \n",
       "38976             55.6    1900                3.96             70.0   \n",
       "38977             57.0    1910                3.90             64.3   \n",
       "38978             58.3    1890                3.84             59.9   \n",
       "38979             59.3    1860                3.76             56.4   \n",
       "38980             59.8    1910                3.68             56.8   \n",
       "38981             60.2    1950                3.61             55.5   \n",
       "\n",
       "       pop_density  co2_per_capita  years_in_school_men  years_in_school_women  \n",
       "0              NaN             NaN                  NaN                    NaN  \n",
       "1              NaN             NaN                  NaN                    NaN  \n",
       "2              NaN             NaN                  NaN                    NaN  \n",
       "3              NaN             NaN                  NaN                    NaN  \n",
       "4              NaN             NaN                  NaN                    NaN  \n",
       "5              NaN             NaN                  NaN                    NaN  \n",
       "6              NaN             NaN                  NaN                    NaN  \n",
       "7              NaN             NaN                  NaN                    NaN  \n",
       "8              NaN             NaN                  NaN                    NaN  \n",
       "9              NaN             NaN                  NaN                    NaN  \n",
       "10             NaN             NaN                  NaN                    NaN  \n",
       "11             NaN             NaN                  NaN                    NaN  \n",
       "12             NaN             NaN                  NaN                    NaN  \n",
       "13             NaN             NaN                  NaN                    NaN  \n",
       "14             NaN             NaN                  NaN                    NaN  \n",
       "15             NaN             NaN                  NaN                    NaN  \n",
       "16             NaN             NaN                  NaN                    NaN  \n",
       "17             NaN             NaN                  NaN                    NaN  \n",
       "18             NaN             NaN                  NaN                    NaN  \n",
       "19             NaN             NaN                  NaN                    NaN  \n",
       "20             NaN             NaN                  NaN                    NaN  \n",
       "21             NaN             NaN                  NaN                    NaN  \n",
       "22             NaN             NaN                  NaN                    NaN  \n",
       "23             NaN             NaN                  NaN                    NaN  \n",
       "24             NaN             NaN                  NaN                    NaN  \n",
       "25             NaN             NaN                  NaN                    NaN  \n",
       "26             NaN             NaN                  NaN                    NaN  \n",
       "27             NaN             NaN                  NaN                    NaN  \n",
       "28             NaN             NaN                  NaN                    NaN  \n",
       "29             NaN             NaN                  NaN                    NaN  \n",
       "...            ...             ...                  ...                    ...  \n",
       "38952         25.6           1.630                 7.61                   6.01  \n",
       "38953         26.3           1.540                 7.74                   6.16  \n",
       "38954         27.0           1.530                 7.88                   6.31  \n",
       "38955         27.6           1.590                 8.01                   6.46  \n",
       "38956         28.2           1.500                 8.14                   6.61  \n",
       "38957         28.7           1.600                 8.28                   6.76  \n",
       "38958         29.3           1.340                 8.41                   6.92  \n",
       "38959         29.8           1.300                 8.54                   7.07  \n",
       "38960         30.3           1.230                 8.67                   7.23  \n",
       "38961         30.7           1.200                 8.80                   7.39  \n",
       "38962         31.2           1.310                 8.93                   7.55  \n",
       "38963         31.6           1.140                 9.07                   7.71  \n",
       "38964         32.0           1.020                 9.20                   7.87  \n",
       "38965         32.3           0.957                 9.33                   8.03  \n",
       "38966         32.7           0.843                 9.47                   8.20  \n",
       "38967         33.0           0.742                 9.60                   8.36  \n",
       "38968         33.4           0.832                 9.73                   8.53  \n",
       "38969         33.9           0.796                 9.87                   8.69  \n",
       "38970         34.5           0.742                10.00                   8.86  \n",
       "38971         35.0           0.573                10.10                   9.03  \n",
       "38972         35.7           0.406                10.30                   9.19  \n",
       "38973         36.4           0.552                10.40                   9.36  \n",
       "38974         37.2           0.665                10.50                   9.53  \n",
       "38975         38.0           0.530                10.70                   9.70  \n",
       "38976         38.9           0.776                10.80                   9.86  \n",
       "38977         39.8           0.780                10.90                  10.00  \n",
       "38978         40.8             NaN                11.10                  10.20  \n",
       "38979         41.7             NaN                  NaN                    NaN  \n",
       "38980         42.7             NaN                  NaN                    NaN  \n",
       "38981         43.7             NaN                  NaN                    NaN  \n",
       "\n",
       "[38982 rows x 14 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is how a data frame is displayed in the JupyterLab Notebook. Although the data frame itself just consists of the values, the Notebook knows that this is a data frame and displays it in a nice tabular format (by adding HTML decorators), and adds some cosmetic conveniences such as the bold font type for the column and row names, the alternating grey and white zebra stripes for the rows and highlights the row the mouse pointer moves over. The increasing numbers on the far left is the data frame's index, which was added by `pandas` to easily distinguish between the rows.\n",
    "\n",
    "## What are data frames?\n",
    "\n",
    "A data frame is the representation of data in a tabular format, similar to how data is often arranged in spreadsheets. The data is rectangular, meaning that all rows have the same amount of columns and all columns have the same amount of rows. Data frames are the *de facto* data structure for most tabular data, and what we use for statistics and plotting. A data frame can be created by hand, but most commonly they are generated by an input function, such as `read_csv()`. In other words, when importing spreadsheets from your hard drive (or the web).\n",
    "\n",
    "As can be seen above, the default is to display the first and last 30 rows and truncate everything in between, as indicated by the ellipsis (`...`). Although it is truncated, this output is still quite space consuming. To glance at how the data frame looks, it is sufficient to display only the top (the first 5 lines) using the `head()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "      <th>population</th>\n",
       "      <th>region</th>\n",
       "      <th>sub_region</th>\n",
       "      <th>income_group</th>\n",
       "      <th>life_expectancy</th>\n",
       "      <th>income</th>\n",
       "      <th>children_per_woman</th>\n",
       "      <th>child_mortality</th>\n",
       "      <th>pop_density</th>\n",
       "      <th>co2_per_capita</th>\n",
       "      <th>years_in_school_men</th>\n",
       "      <th>years_in_school_women</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1800</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1801</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1802</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1803</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1804</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       country  year  population region     sub_region income_group  \\\n",
       "0  Afghanistan  1800     3280000   Asia  Southern Asia          Low   \n",
       "1  Afghanistan  1801     3280000   Asia  Southern Asia          Low   \n",
       "2  Afghanistan  1802     3280000   Asia  Southern Asia          Low   \n",
       "3  Afghanistan  1803     3280000   Asia  Southern Asia          Low   \n",
       "4  Afghanistan  1804     3280000   Asia  Southern Asia          Low   \n",
       "\n",
       "   life_expectancy  income  children_per_woman  child_mortality  pop_density  \\\n",
       "0             28.2     603                 7.0            469.0          NaN   \n",
       "1             28.2     603                 7.0            469.0          NaN   \n",
       "2             28.2     603                 7.0            469.0          NaN   \n",
       "3             28.2     603                 7.0            469.0          NaN   \n",
       "4             28.2     603                 7.0            469.0          NaN   \n",
       "\n",
       "   co2_per_capita  years_in_school_men  years_in_school_women  \n",
       "0             NaN                  NaN                    NaN  \n",
       "1             NaN                  NaN                    NaN  \n",
       "2             NaN                  NaN                    NaN  \n",
       "3             NaN                  NaN                    NaN  \n",
       "4             NaN                  NaN                    NaN  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Methods are very similar to functions, the main difference is that they belong to an object (above, the method `head()` belongs to the data frame `world_data`). Methods operate on the object they belong to, that's why we can call the method with an empty parenthesis without any arguments. Compare this with the function `type()` that was introduced previously."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.frame.DataFrame"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(world_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, the `world_data` variable is explicitly passed as an argument to `type()`. An immediately tangible advantage with methods is that they simplify tab completion. Just type the name of the dataframe, a period, and then hit tab to see all the relevant methods for that data frame instead of fumbling around with all the available functions in Python (there's quite a few!) and figuring out which ones operate on data frames and which do not. Methods also facilitates readability when chaining many operations together, which will be shown in detail later.\n",
    "\n",
    "The columns in a data frame can contain data of different types, e.g. integers, floats, and objects (which includes strings, lists, dictionaries, and more)). General information about the data frame (including the column data types) can be obtained with the `info()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 38982 entries, 0 to 38981\n",
      "Data columns (total 14 columns):\n",
      "country                  38982 non-null object\n",
      "year                     38982 non-null int64\n",
      "population               38982 non-null int64\n",
      "region                   38982 non-null object\n",
      "sub_region               38982 non-null object\n",
      "income_group             38982 non-null object\n",
      "life_expectancy          38982 non-null float64\n",
      "income                   38982 non-null int64\n",
      "children_per_woman       38982 non-null float64\n",
      "child_mortality          38980 non-null float64\n",
      "pop_density              12282 non-null float64\n",
      "co2_per_capita           16285 non-null float64\n",
      "years_in_school_men      8188 non-null float64\n",
      "years_in_school_women    8188 non-null float64\n",
      "dtypes: float64(7), int64(3), object(4)\n",
      "memory usage: 4.2+ MB\n"
     ]
    }
   ],
   "source": [
    "world_data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The information includes the total number of rows and columns, the number of non-null observations, the column data types, and the memory (RAM) usage. The number of non-null observation is not the same for all columns, which means that some columns contain null (or NA) values representing that there is missing information. The column data type is often indicative of which type of data is stored in that column, and approximately corresponds to the following\n",
    "\n",
    "- **Qualitative/Categorical**\n",
    "    - Nominal (labels, e.g. 'red', 'green', 'blue')\n",
    "        - `object`, `category`\n",
    "    - Ordinal (labels with order, e.g. 'Jan', 'Feb', 'Mar')\n",
    "        - `object`, `category`, `int`\n",
    "    - Binary (only two outcomes, e.g. True or False)\n",
    "        - `bool`\n",
    "- **Quantitative/Numerical**\n",
    "    - Discrete (whole numbers, often counting, e.g. number of children)\n",
    "        - `int`\n",
    "    - Continuous (measured values with decimals, e.g. weight)\n",
    "        - `float`\n",
    "    \n",
    "Note that an `object` could contain different types, e.g. `str` or `list`. Also note that there can be exceptions to the schema above, but it is still a useful rough guide.\n",
    "\n",
    "After reading in the data into a data frame, `head()` and `info()` are two of the most useful methods to get an idea of the structure of this data frame. There are many additional methods that can facilitate the understanding of what a data frame contains:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Size:\n",
    "    - `world_data.shape` - a tuple with the number of rows in the first element\n",
    "      and the number of columns as the second element\n",
    "    - `world_data.shape[0]` - the number of rows\n",
    "    - `world_data.shape[1]`- the number of columns\n",
    "\n",
    "- Content:\n",
    "    - `world_data.head()` - shows the first 5 rows\n",
    "    - `world_data.tail()` - shows the last 5 rows\n",
    "\n",
    "- Names:\n",
    "    - `world_data.columns` - returns the names of the columns (also called variable names) \n",
    "      objects)\n",
    "    - `world_data.index` - returns the names of the rows (referred to as the index in pandas)\n",
    "\n",
    "- Summary:\n",
    "    - `world_data.info()` - column names and data types, number of observations, memory consumptions\n",
    "      length, and content of  each column\n",
    "    - `world_data.describe()` - summary statistics for each column\n",
    "\n",
    "These belong to a data frame and are commonly referred to as *attributes* of the data frame. All attributes are accessed with the dot-syntax (`.`), which returns the attribute's value. If the attribute is a method, parentheses can be appended to the name to carry out the method's operation on the data frame. Attributes that are not methods often hold a value that has been precomputed because it is commonly accessed and it saves time store the value in an attribute instead of recomputing it every time it is needed. For example, every time `pandas` creates a data frame, the number of rows and columns is computed and stored in the `shape` attribute.\n",
    "\n",
    ">#### Challenge\n",
    ">\n",
    ">Based on the output of `world_data.info()`, can you answer the following questions?\n",
    ">\n",
    ">* What is the class of the object `world_data`?\n",
    ">* How many rows and how many columns are in this object?\n",
    ">* Why is there not the same number of rows (observations) for each column?"
   ]
  },
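  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The difference between attributes that hold a value and attributes that are methods can be tried out on a small table. The sketch below is only an illustration: the data frame `toy` and its values are made up and are not part of the Gapminder data.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data, made up for illustration only\n",
    "toy = pd.DataFrame({\n",
    "    'country': ['A', 'B', 'C'],\n",
    "    'year': [1800, 1801, 1802],\n",
    "    'income': [603.0, 610.0, None],  # None is stored as NaN (a missing value)\n",
    "})\n",
    "\n",
    "n_rows, n_cols = toy.shape        # attribute holding a value: no parentheses\n",
    "column_names = list(toy.columns)  # attribute holding the column labels\n",
    "first_rows = toy.head(2)          # method: parentheses, here with an argument\n",
    "non_null_counts = toy.count()     # method: non-null observations per column\n",
    "\n",
    "print(n_rows, n_cols)\n",
    "print(column_names)\n",
    "print(non_null_counts)\n",
    "```\n",
    "\n",
    "The `income` column has one fewer non-null observation than the others, which is the same kind of difference that `info()` reports for `world_data`."
   ]
  },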
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Saving data frames locally\n",
    "\n",
    "It is good practice to keep a copy of the data stored locally on your computer in case you want to do offline analyses,  the online version of the file changes, or the file is taken down. For this, the data could be downloaded manually or the current `world_data` data frame could be saved to disk as a CSV-file with `to_csv()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "world_data.to_csv('world-data.csv', index=False)\n",
    "# `index=False` because the index (the row names) was generated automatically when pandas opened\n",
    "# the file and this information is not needed to be saved"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the data is now saved locally, the next time this Notebook is opened, it could be loaded from the local path instead of downloading it from the URL."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "      <th>population</th>\n",
       "      <th>region</th>\n",
       "      <th>sub_region</th>\n",
       "      <th>income_group</th>\n",
       "      <th>life_expectancy</th>\n",
       "      <th>income</th>\n",
       "      <th>children_per_woman</th>\n",
       "      <th>child_mortality</th>\n",
       "      <th>pop_density</th>\n",
       "      <th>co2_per_capita</th>\n",
       "      <th>years_in_school_men</th>\n",
       "      <th>years_in_school_women</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1800</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1801</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1802</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1803</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1804</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       country  year  population region     sub_region income_group  \\\n",
       "0  Afghanistan  1800     3280000   Asia  Southern Asia          Low   \n",
       "1  Afghanistan  1801     3280000   Asia  Southern Asia          Low   \n",
       "2  Afghanistan  1802     3280000   Asia  Southern Asia          Low   \n",
       "3  Afghanistan  1803     3280000   Asia  Southern Asia          Low   \n",
       "4  Afghanistan  1804     3280000   Asia  Southern Asia          Low   \n",
       "\n",
       "   life_expectancy  income  children_per_woman  child_mortality  pop_density  \\\n",
       "0             28.2     603                 7.0            469.0          NaN   \n",
       "1             28.2     603                 7.0            469.0          NaN   \n",
       "2             28.2     603                 7.0            469.0          NaN   \n",
       "3             28.2     603                 7.0            469.0          NaN   \n",
       "4             28.2     603                 7.0            469.0          NaN   \n",
       "\n",
       "   co2_per_capita  years_in_school_men  years_in_school_women  \n",
       "0             NaN                  NaN                    NaN  \n",
       "1             NaN                  NaN                    NaN  \n",
       "2             NaN                  NaN                    NaN  \n",
       "3             NaN                  NaN                    NaN  \n",
       "4             NaN                  NaN                    NaN  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data = pd.read_csv('world-data.csv')\n",
    "world_data.head()"
   ]
  },
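  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what `index=False` changes, the following sketch writes a small data frame to CSV with and without the index and reads both versions back. It is an illustration under stated assumptions: the data frame `toy` is made up, and in-memory text buffers (`io.StringIO`) stand in for files on disk so that nothing is written to your computer.\n",
    "\n",
    "```python\n",
    "import io\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data, made up for illustration only\n",
    "toy = pd.DataFrame({'country': ['A', 'B'], 'year': [1800, 1801]})\n",
    "\n",
    "# In-memory text buffers are used here instead of files on disk\n",
    "with_index = io.StringIO()\n",
    "without_index = io.StringIO()\n",
    "toy.to_csv(with_index)                  # default: the index is written as a first, unnamed column\n",
    "toy.to_csv(without_index, index=False)  # the index is left out\n",
    "\n",
    "# Rewind the buffers and read them back\n",
    "with_index.seek(0)\n",
    "without_index.seek(0)\n",
    "reloaded_with_index = pd.read_csv(with_index)\n",
    "reloaded_without_index = pd.read_csv(without_index)\n",
    "\n",
    "print(list(reloaded_with_index.columns))\n",
    "print(list(reloaded_without_index.columns))\n",
    "```\n",
    "\n",
    "When the index is saved, reading the file back adds an extra column (named `Unnamed: 0` by `pandas`) that merely duplicates the automatically generated row numbers."
   ]
  },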
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Indexing and subsetting data frames\n",
    "\n",
    "The world data data frame has rows and columns (it has 2 dimensions). To extract specific data from it (also referred to as \"subsetting\"), columns can be selected by their name.The JupyterLab Notebook (technically, the underlying IPython interpreter) knows about the columns in the data frame, so tab autocompletion can be used to get the correct column name. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1800\n",
       "1    1801\n",
       "2    1802\n",
       "3    1803\n",
       "4    1804\n",
       "Name: year, dtype: int64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data['year'].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The name of the column is not shown, since there is only one. Remember that the numbers on the left is just the index of the data frame, which was added by `pandas` upon importing the data.\n",
    "\n",
    "Another syntax that is often used to specify column names is `.<column_name>`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1800\n",
       "1    1801\n",
       "2    1802\n",
       "3    1803\n",
       "4    1804\n",
       "Name: year, dtype: int64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data.year.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using brackets is clearer and also allows for passing multiple columns as a list, so this tutorial will stick to that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1800</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1801</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1802</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1803</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1804</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       country  year\n",
       "0  Afghanistan  1800\n",
       "1  Afghanistan  1801\n",
       "2  Afghanistan  1802\n",
       "3  Afghanistan  1803\n",
       "4  Afghanistan  1804"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data[['country', 'year']].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output is displayed a bit differently this time. The reason is that when there was only one column `pandas` technically returned a `Series`, not a `Dataframe`. This can be confirmed by using `type` as previously."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.series.Series"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(world_data['year'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.frame.DataFrame"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(world_data[['country', 'year']])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, every individual column is actually a `Series` and together they constitute a `Dataframe`. There can be performance benefits to work with `Series`, but `pandas` often takes care of conversions between these two object types under the hood, so this introductory tutorial will not make any further distinction between a `Series` and a `Dataframe`. Many of the analysis techniques used here will apply to both series and data frames."
   ]
  },
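  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Whether a `Series` or a `DataFrame` is returned depends on what is passed inside the brackets, not on how many columns end up being selected. The sketch below illustrates this on a made-up data frame called `toy` (an assumption for illustration, not part of the lesson data).\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data, made up for illustration only\n",
    "toy = pd.DataFrame({'country': ['A', 'B'], 'year': [1800, 1801]})\n",
    "\n",
    "single = toy['year']    # a single label returns a Series\n",
    "double = toy[['year']]  # a list with one label returns a one-column DataFrame\n",
    "\n",
    "print(type(single).__name__, single.shape)\n",
    "print(type(double).__name__, double.shape)\n",
    "```\n",
    "\n",
    "The `Series` has one dimension, while the one-column `DataFrame` still has two."
   ]
  },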
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting with single brackets (`[]`) as above is a shortcut to common operations, such as selecting columns by labels as above. For more flexible and robust row and column selection the more verbose `loc[<rows>, <columns>]` (location) syntax is used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1800</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1802</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1804</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       country  year\n",
       "0  Afghanistan  1800\n",
       "2  Afghanistan  1802\n",
       "4  Afghanistan  1804"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data.loc[[0, 2, 4], ['country', 'year']]\n",
    "# Although methods usually have trailing parenthesis, square brackets are used with `loc[]` to stay\n",
    "# consistent with the indexing with square brackets in general in Python (e.g. lists and Numpy arrays)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A single number can be selected, which returns that value (here, an integer) rather than a `Dataframe` or `Series` with one value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1804"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data.loc[4, 'year']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "numpy.int64"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(world_data.loc[4, 'year'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To select all rows, but only a subset of columns, the colon character (`:`) can be used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1800</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1801</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1802</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1803</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1804</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       country  year\n",
       "0  Afghanistan  1800\n",
       "1  Afghanistan  1801\n",
       "2  Afghanistan  1802\n",
       "3  Afghanistan  1803\n",
       "4  Afghanistan  1804"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data.loc[:, ['country', 'year']].head() # head() is used to limit the length of the output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same syntax can be used to select all columns but only a subset of rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "      <th>population</th>\n",
       "      <th>region</th>\n",
       "      <th>sub_region</th>\n",
       "      <th>income_group</th>\n",
       "      <th>life_expectancy</th>\n",
       "      <th>income</th>\n",
       "      <th>children_per_woman</th>\n",
       "      <th>child_mortality</th>\n",
       "      <th>pop_density</th>\n",
       "      <th>co2_per_capita</th>\n",
       "      <th>years_in_school_men</th>\n",
       "      <th>years_in_school_women</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1803</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1804</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       country  year  population region     sub_region income_group  \\\n",
       "3  Afghanistan  1803     3280000   Asia  Southern Asia          Low   \n",
       "4  Afghanistan  1804     3280000   Asia  Southern Asia          Low   \n",
       "\n",
       "   life_expectancy  income  children_per_woman  child_mortality  pop_density  \\\n",
       "3             28.2     603                 7.0            469.0          NaN   \n",
       "4             28.2     603                 7.0            469.0          NaN   \n",
       "\n",
       "   co2_per_capita  years_in_school_men  years_in_school_women  \n",
       "3             NaN                  NaN                    NaN  \n",
       "4             NaN                  NaN                    NaN  "
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data.loc[[3, 4], :]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When selecting all columns, the `:` could also be left out as a convenience."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "      <th>population</th>\n",
       "      <th>region</th>\n",
       "      <th>sub_region</th>\n",
       "      <th>income_group</th>\n",
       "      <th>life_expectancy</th>\n",
       "      <th>income</th>\n",
       "      <th>children_per_woman</th>\n",
       "      <th>child_mortality</th>\n",
       "      <th>pop_density</th>\n",
       "      <th>co2_per_capita</th>\n",
       "      <th>years_in_school_men</th>\n",
       "      <th>years_in_school_women</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1803</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1804</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       country  year  population region     sub_region income_group  \\\n",
       "3  Afghanistan  1803     3280000   Asia  Southern Asia          Low   \n",
       "4  Afghanistan  1804     3280000   Asia  Southern Asia          Low   \n",
       "\n",
       "   life_expectancy  income  children_per_woman  child_mortality  pop_density  \\\n",
       "3             28.2     603                 7.0            469.0          NaN   \n",
       "4             28.2     603                 7.0            469.0          NaN   \n",
       "\n",
       "   co2_per_capita  years_in_school_men  years_in_school_women  \n",
       "3             NaN                  NaN                    NaN  \n",
       "4             NaN                  NaN                    NaN  "
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data.loc[[3, 4]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is also possible to select slices of rows and column labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "      <th>population</th>\n",
       "      <th>region</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1802</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1803</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1804</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       country  year  population region\n",
       "2  Afghanistan  1802     3280000   Asia\n",
       "3  Afghanistan  1803     3280000   Asia\n",
       "4  Afghanistan  1804     3280000   Asia"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data.loc[2:4, 'country':'region']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is important to realize that `loc[]` selects rows and columns by their *labels*. To instead select by row or column *position*, use `iloc[]` (integer location)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "      <th>population</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1802</td>\n",
       "      <td>3280000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1803</td>\n",
       "      <td>3280000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1804</td>\n",
       "      <td>3280000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       country  year  population\n",
       "2  Afghanistan  1802     3280000\n",
       "3  Afghanistan  1803     3280000\n",
       "4  Afghanistan  1804     3280000"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data.iloc[[2, 3, 4], [0, 1, 2]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The index of `world_data` consists of consecutive integers so in this case selecting from the index by labels or position will look the same. As will be shown later, an index could also consist of text names just like the columns.\n",
    "\n",
    "While selecting slices by label is inclusive of both the start and end, selecting slices by position is inclusive of the start but exclusive of the end position, just like when slicing in lists."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "      <th>population</th>\n",
       "      <th>region</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1802</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1803</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1804</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       country  year  population region\n",
       "2  Afghanistan  1802     3280000   Asia\n",
       "3  Afghanistan  1803     3280000   Asia\n",
       "4  Afghanistan  1804     3280000   Asia"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data.iloc[2:5, :4] # `iloc[2:5]` gives the same result as `loc[2:4]` above"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting slices of row positions is a common operation, and has thus been given a shortcut syntax with single brackets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "      <th>population</th>\n",
       "      <th>region</th>\n",
       "      <th>sub_region</th>\n",
       "      <th>income_group</th>\n",
       "      <th>life_expectancy</th>\n",
       "      <th>income</th>\n",
       "      <th>children_per_woman</th>\n",
       "      <th>child_mortality</th>\n",
       "      <th>pop_density</th>\n",
       "      <th>co2_per_capita</th>\n",
       "      <th>years_in_school_men</th>\n",
       "      <th>years_in_school_women</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1802</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1803</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1804</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       country  year  population region     sub_region income_group  \\\n",
       "2  Afghanistan  1802     3280000   Asia  Southern Asia          Low   \n",
       "3  Afghanistan  1803     3280000   Asia  Southern Asia          Low   \n",
       "4  Afghanistan  1804     3280000   Asia  Southern Asia          Low   \n",
       "\n",
       "   life_expectancy  income  children_per_woman  child_mortality  pop_density  \\\n",
       "2             28.2     603                 7.0            469.0          NaN   \n",
       "3             28.2     603                 7.0            469.0          NaN   \n",
       "4             28.2     603                 7.0            469.0          NaN   \n",
       "\n",
       "   co2_per_capita  years_in_school_men  years_in_school_women  \n",
       "2             NaN                  NaN                    NaN  \n",
       "3             NaN                  NaN                    NaN  \n",
       "4             NaN                  NaN                    NaN  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data[2:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">#### Challenge\n",
    ">\n",
    ">1. Extract the 200th and 201st row of the `world_data` dataset and assign the resulting data frame to a new variable name (`world_data_200_201`). Remember that Python indexing starts at 0!\n",
    ">\n",
    ">2. How can you get the same result as from `world_data.head()` by using row slices instead of the `head()` method?\n",
    ">\n",
    ">3. There are at least three distinct ways to extract the last row of the data frame. Which can you find?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `describe()` method was mentioned above as a way of retrieving summary statistics of a data frame. Together with `info()` and `head()` this is often a good place to start exploratory data analysis as it gives a nice overview of the numeric valuables the data set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>population</th>\n",
       "      <th>life_expectancy</th>\n",
       "      <th>income</th>\n",
       "      <th>children_per_woman</th>\n",
       "      <th>child_mortality</th>\n",
       "      <th>pop_density</th>\n",
       "      <th>co2_per_capita</th>\n",
       "      <th>years_in_school_men</th>\n",
       "      <th>years_in_school_women</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>38982.000000</td>\n",
       "      <td>3.898200e+04</td>\n",
       "      <td>38982.000000</td>\n",
       "      <td>38982.000000</td>\n",
       "      <td>38982.000000</td>\n",
       "      <td>38980.000000</td>\n",
       "      <td>12282.000000</td>\n",
       "      <td>16285.000000</td>\n",
       "      <td>8188.000000</td>\n",
       "      <td>8188.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>1909.000000</td>\n",
       "      <td>1.422075e+07</td>\n",
       "      <td>43.073468</td>\n",
       "      <td>4527.128033</td>\n",
       "      <td>5.384391</td>\n",
       "      <td>292.050891</td>\n",
       "      <td>120.900572</td>\n",
       "      <td>3.236894</td>\n",
       "      <td>7.681019</td>\n",
       "      <td>6.948334</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>63.220006</td>\n",
       "      <td>6.722423e+07</td>\n",
       "      <td>16.219216</td>\n",
       "      <td>9753.116041</td>\n",
       "      <td>1.642597</td>\n",
       "      <td>161.562290</td>\n",
       "      <td>382.454242</td>\n",
       "      <td>6.079257</td>\n",
       "      <td>3.185983</td>\n",
       "      <td>3.876399</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1800.000000</td>\n",
       "      <td>1.250000e+04</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>247.000000</td>\n",
       "      <td>1.120000</td>\n",
       "      <td>1.950000</td>\n",
       "      <td>0.502000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.900000</td>\n",
       "      <td>0.210000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>1854.000000</td>\n",
       "      <td>5.060000e+05</td>\n",
       "      <td>31.200000</td>\n",
       "      <td>876.000000</td>\n",
       "      <td>4.550000</td>\n",
       "      <td>141.000000</td>\n",
       "      <td>14.800000</td>\n",
       "      <td>0.188000</td>\n",
       "      <td>5.160000</td>\n",
       "      <td>3.620000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>1909.000000</td>\n",
       "      <td>2.140000e+06</td>\n",
       "      <td>35.500000</td>\n",
       "      <td>1450.000000</td>\n",
       "      <td>5.910000</td>\n",
       "      <td>361.000000</td>\n",
       "      <td>46.000000</td>\n",
       "      <td>0.944000</td>\n",
       "      <td>7.650000</td>\n",
       "      <td>6.980000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>1964.000000</td>\n",
       "      <td>6.870000e+06</td>\n",
       "      <td>55.600000</td>\n",
       "      <td>3520.000000</td>\n",
       "      <td>6.630000</td>\n",
       "      <td>420.000000</td>\n",
       "      <td>110.000000</td>\n",
       "      <td>4.020000</td>\n",
       "      <td>10.100000</td>\n",
       "      <td>9.980000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>2018.000000</td>\n",
       "      <td>1.420000e+09</td>\n",
       "      <td>84.200000</td>\n",
       "      <td>178000.000000</td>\n",
       "      <td>8.870000</td>\n",
       "      <td>756.000000</td>\n",
       "      <td>8270.000000</td>\n",
       "      <td>101.000000</td>\n",
       "      <td>15.300000</td>\n",
       "      <td>15.700000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               year    population  life_expectancy         income  \\\n",
       "count  38982.000000  3.898200e+04     38982.000000   38982.000000   \n",
       "mean    1909.000000  1.422075e+07        43.073468    4527.128033   \n",
       "std       63.220006  6.722423e+07        16.219216    9753.116041   \n",
       "min     1800.000000  1.250000e+04         1.000000     247.000000   \n",
       "25%     1854.000000  5.060000e+05        31.200000     876.000000   \n",
       "50%     1909.000000  2.140000e+06        35.500000    1450.000000   \n",
       "75%     1964.000000  6.870000e+06        55.600000    3520.000000   \n",
       "max     2018.000000  1.420000e+09        84.200000  178000.000000   \n",
       "\n",
       "       children_per_woman  child_mortality   pop_density  co2_per_capita  \\\n",
       "count        38982.000000     38980.000000  12282.000000    16285.000000   \n",
       "mean             5.384391       292.050891    120.900572        3.236894   \n",
       "std              1.642597       161.562290    382.454242        6.079257   \n",
       "min              1.120000         1.950000      0.502000        0.000000   \n",
       "25%              4.550000       141.000000     14.800000        0.188000   \n",
       "50%              5.910000       361.000000     46.000000        0.944000   \n",
       "75%              6.630000       420.000000    110.000000        4.020000   \n",
       "max              8.870000       756.000000   8270.000000      101.000000   \n",
       "\n",
       "       years_in_school_men  years_in_school_women  \n",
       "count          8188.000000            8188.000000  \n",
       "mean              7.681019               6.948334  \n",
       "std               3.185983               3.876399  \n",
       "min               0.900000               0.210000  \n",
       "25%               5.160000               3.620000  \n",
       "50%               7.650000               6.980000  \n",
       "75%              10.100000               9.980000  \n",
       "max              15.300000              15.700000  "
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A common next step would be to plot the data to explore relationships between different variables, but before getting into plotting, it is beneficial to elaborate on the data frame object and several of its common operations.\n",
    "\n",
    "An often desired operation is to select a subset of rows matching a criteria, e.g. which observations have a life expectancy above 83 years. To do this, the \"less than\" comparison operator that was introduced previously can be used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0        False\n",
       "1        False\n",
       "2        False\n",
       "3        False\n",
       "4        False\n",
       "5        False\n",
       "6        False\n",
       "7        False\n",
       "8        False\n",
       "9        False\n",
       "10       False\n",
       "11       False\n",
       "12       False\n",
       "13       False\n",
       "14       False\n",
       "15       False\n",
       "16       False\n",
       "17       False\n",
       "18       False\n",
       "19       False\n",
       "20       False\n",
       "21       False\n",
       "22       False\n",
       "23       False\n",
       "24       False\n",
       "25       False\n",
       "26       False\n",
       "27       False\n",
       "28       False\n",
       "29       False\n",
       "         ...  \n",
       "38952    False\n",
       "38953    False\n",
       "38954    False\n",
       "38955    False\n",
       "38956    False\n",
       "38957    False\n",
       "38958    False\n",
       "38959    False\n",
       "38960    False\n",
       "38961    False\n",
       "38962    False\n",
       "38963    False\n",
       "38964    False\n",
       "38965    False\n",
       "38966    False\n",
       "38967    False\n",
       "38968    False\n",
       "38969    False\n",
       "38970    False\n",
       "38971    False\n",
       "38972    False\n",
       "38973    False\n",
       "38974    False\n",
       "38975    False\n",
       "38976    False\n",
       "38977    False\n",
       "38978    False\n",
       "38979    False\n",
       "38980    False\n",
       "38981    False\n",
       "Name: life_expectancy, Length: 38982, dtype: bool"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data['life_expectancy'] > 83"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result is a boolean array with one value for every row in the data frame indicating whether it is `True` or `False` that this row has a value above 83 in the column `life_expectancy`. To find out how many observations there are matching this condition, the `sum()` method can used since each `True` will be `1` and each `False` will be `0`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "20"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "above_83_bool = world_data['life_expectancy'] > 83\n",
    "above_83_bool.sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instead of assigning to an intermediate variable, it is possible to use methods directly on the resulting boolean series by surrounding it with parentheses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "20"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(world_data['life_expectancy'] > 83).sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The boolean array can be used to select only those rows from the data frame that meet the specified condition."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "      <th>population</th>\n",
       "      <th>region</th>\n",
       "      <th>sub_region</th>\n",
       "      <th>income_group</th>\n",
       "      <th>life_expectancy</th>\n",
       "      <th>income</th>\n",
       "      <th>children_per_woman</th>\n",
       "      <th>child_mortality</th>\n",
       "      <th>pop_density</th>\n",
       "      <th>co2_per_capita</th>\n",
       "      <th>years_in_school_men</th>\n",
       "      <th>years_in_school_women</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>17513</th>\n",
       "      <td>Japan</td>\n",
       "      <td>2012</td>\n",
       "      <td>128000000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Eastern Asia</td>\n",
       "      <td>High</td>\n",
       "      <td>83.2</td>\n",
       "      <td>36400</td>\n",
       "      <td>1.40</td>\n",
       "      <td>3.00</td>\n",
       "      <td>352.0</td>\n",
       "      <td>9.58</td>\n",
       "      <td>14.8</td>\n",
       "      <td>15.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17514</th>\n",
       "      <td>Japan</td>\n",
       "      <td>2013</td>\n",
       "      <td>128000000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Eastern Asia</td>\n",
       "      <td>High</td>\n",
       "      <td>83.4</td>\n",
       "      <td>37100</td>\n",
       "      <td>1.42</td>\n",
       "      <td>2.90</td>\n",
       "      <td>352.0</td>\n",
       "      <td>9.71</td>\n",
       "      <td>14.9</td>\n",
       "      <td>15.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17515</th>\n",
       "      <td>Japan</td>\n",
       "      <td>2014</td>\n",
       "      <td>128000000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Eastern Asia</td>\n",
       "      <td>High</td>\n",
       "      <td>83.6</td>\n",
       "      <td>37300</td>\n",
       "      <td>1.43</td>\n",
       "      <td>2.80</td>\n",
       "      <td>352.0</td>\n",
       "      <td>9.47</td>\n",
       "      <td>15.0</td>\n",
       "      <td>15.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17516</th>\n",
       "      <td>Japan</td>\n",
       "      <td>2015</td>\n",
       "      <td>128000000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Eastern Asia</td>\n",
       "      <td>High</td>\n",
       "      <td>83.8</td>\n",
       "      <td>37800</td>\n",
       "      <td>1.44</td>\n",
       "      <td>3.00</td>\n",
       "      <td>351.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>15.1</td>\n",
       "      <td>15.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17517</th>\n",
       "      <td>Japan</td>\n",
       "      <td>2016</td>\n",
       "      <td>128000000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Eastern Asia</td>\n",
       "      <td>High</td>\n",
       "      <td>83.9</td>\n",
       "      <td>38200</td>\n",
       "      <td>1.46</td>\n",
       "      <td>2.70</td>\n",
       "      <td>350.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17518</th>\n",
       "      <td>Japan</td>\n",
       "      <td>2017</td>\n",
       "      <td>127000000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Eastern Asia</td>\n",
       "      <td>High</td>\n",
       "      <td>84.0</td>\n",
       "      <td>38600</td>\n",
       "      <td>1.47</td>\n",
       "      <td>2.83</td>\n",
       "      <td>350.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17519</th>\n",
       "      <td>Japan</td>\n",
       "      <td>2018</td>\n",
       "      <td>127000000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Eastern Asia</td>\n",
       "      <td>High</td>\n",
       "      <td>84.2</td>\n",
       "      <td>39100</td>\n",
       "      <td>1.48</td>\n",
       "      <td>2.76</td>\n",
       "      <td>349.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30653</th>\n",
       "      <td>Singapore</td>\n",
       "      <td>2012</td>\n",
       "      <td>5270000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>South-eastern Asia</td>\n",
       "      <td>High</td>\n",
       "      <td>83.2</td>\n",
       "      <td>76000</td>\n",
       "      <td>1.26</td>\n",
       "      <td>2.80</td>\n",
       "      <td>7530.0</td>\n",
       "      <td>6.90</td>\n",
       "      <td>13.6</td>\n",
       "      <td>13.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30654</th>\n",
       "      <td>Singapore</td>\n",
       "      <td>2013</td>\n",
       "      <td>5360000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>South-eastern Asia</td>\n",
       "      <td>High</td>\n",
       "      <td>83.2</td>\n",
       "      <td>78500</td>\n",
       "      <td>1.25</td>\n",
       "      <td>2.70</td>\n",
       "      <td>7660.0</td>\n",
       "      <td>10.40</td>\n",
       "      <td>13.7</td>\n",
       "      <td>13.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30655</th>\n",
       "      <td>Singapore</td>\n",
       "      <td>2014</td>\n",
       "      <td>5450000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>South-eastern Asia</td>\n",
       "      <td>High</td>\n",
       "      <td>83.4</td>\n",
       "      <td>80300</td>\n",
       "      <td>1.25</td>\n",
       "      <td>2.70</td>\n",
       "      <td>7780.0</td>\n",
       "      <td>10.30</td>\n",
       "      <td>13.8</td>\n",
       "      <td>13.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30656</th>\n",
       "      <td>Singapore</td>\n",
       "      <td>2015</td>\n",
       "      <td>5540000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>South-eastern Asia</td>\n",
       "      <td>High</td>\n",
       "      <td>83.6</td>\n",
       "      <td>80900</td>\n",
       "      <td>1.24</td>\n",
       "      <td>2.70</td>\n",
       "      <td>7910.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>14.0</td>\n",
       "      <td>13.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30657</th>\n",
       "      <td>Singapore</td>\n",
       "      <td>2016</td>\n",
       "      <td>5620000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>South-eastern Asia</td>\n",
       "      <td>High</td>\n",
       "      <td>83.7</td>\n",
       "      <td>81400</td>\n",
       "      <td>1.25</td>\n",
       "      <td>2.80</td>\n",
       "      <td>8030.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30658</th>\n",
       "      <td>Singapore</td>\n",
       "      <td>2017</td>\n",
       "      <td>5710000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>South-eastern Asia</td>\n",
       "      <td>High</td>\n",
       "      <td>83.8</td>\n",
       "      <td>82600</td>\n",
       "      <td>1.25</td>\n",
       "      <td>2.58</td>\n",
       "      <td>8160.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30659</th>\n",
       "      <td>Singapore</td>\n",
       "      <td>2018</td>\n",
       "      <td>5790000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>South-eastern Asia</td>\n",
       "      <td>High</td>\n",
       "      <td>84.0</td>\n",
       "      <td>83900</td>\n",
       "      <td>1.26</td>\n",
       "      <td>2.52</td>\n",
       "      <td>8270.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32410</th>\n",
       "      <td>Spain</td>\n",
       "      <td>2017</td>\n",
       "      <td>46400000</td>\n",
       "      <td>Europe</td>\n",
       "      <td>Southern Europe</td>\n",
       "      <td>High</td>\n",
       "      <td>83.1</td>\n",
       "      <td>34000</td>\n",
       "      <td>1.38</td>\n",
       "      <td>3.14</td>\n",
       "      <td>92.9</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32411</th>\n",
       "      <td>Spain</td>\n",
       "      <td>2018</td>\n",
       "      <td>46400000</td>\n",
       "      <td>Europe</td>\n",
       "      <td>Southern Europe</td>\n",
       "      <td>High</td>\n",
       "      <td>83.2</td>\n",
       "      <td>34700</td>\n",
       "      <td>1.39</td>\n",
       "      <td>3.02</td>\n",
       "      <td>93.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33722</th>\n",
       "      <td>Switzerland</td>\n",
       "      <td>2015</td>\n",
       "      <td>8320000</td>\n",
       "      <td>Europe</td>\n",
       "      <td>Western Europe</td>\n",
       "      <td>High</td>\n",
       "      <td>83.1</td>\n",
       "      <td>56500</td>\n",
       "      <td>1.54</td>\n",
       "      <td>4.10</td>\n",
       "      <td>211.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>14.6</td>\n",
       "      <td>14.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33723</th>\n",
       "      <td>Switzerland</td>\n",
       "      <td>2016</td>\n",
       "      <td>8400000</td>\n",
       "      <td>Europe</td>\n",
       "      <td>Western Europe</td>\n",
       "      <td>High</td>\n",
       "      <td>83.1</td>\n",
       "      <td>56600</td>\n",
       "      <td>1.55</td>\n",
       "      <td>4.10</td>\n",
       "      <td>213.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33724</th>\n",
       "      <td>Switzerland</td>\n",
       "      <td>2017</td>\n",
       "      <td>8480000</td>\n",
       "      <td>Europe</td>\n",
       "      <td>Western Europe</td>\n",
       "      <td>High</td>\n",
       "      <td>83.3</td>\n",
       "      <td>56900</td>\n",
       "      <td>1.55</td>\n",
       "      <td>3.86</td>\n",
       "      <td>214.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33725</th>\n",
       "      <td>Switzerland</td>\n",
       "      <td>2018</td>\n",
       "      <td>8540000</td>\n",
       "      <td>Europe</td>\n",
       "      <td>Western Europe</td>\n",
       "      <td>High</td>\n",
       "      <td>83.5</td>\n",
       "      <td>57100</td>\n",
       "      <td>1.55</td>\n",
       "      <td>3.75</td>\n",
       "      <td>216.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           country  year  population  region          sub_region income_group  \\\n",
       "17513        Japan  2012   128000000    Asia        Eastern Asia         High   \n",
       "17514        Japan  2013   128000000    Asia        Eastern Asia         High   \n",
       "17515        Japan  2014   128000000    Asia        Eastern Asia         High   \n",
       "17516        Japan  2015   128000000    Asia        Eastern Asia         High   \n",
       "17517        Japan  2016   128000000    Asia        Eastern Asia         High   \n",
       "17518        Japan  2017   127000000    Asia        Eastern Asia         High   \n",
       "17519        Japan  2018   127000000    Asia        Eastern Asia         High   \n",
       "30653    Singapore  2012     5270000    Asia  South-eastern Asia         High   \n",
       "30654    Singapore  2013     5360000    Asia  South-eastern Asia         High   \n",
       "30655    Singapore  2014     5450000    Asia  South-eastern Asia         High   \n",
       "30656    Singapore  2015     5540000    Asia  South-eastern Asia         High   \n",
       "30657    Singapore  2016     5620000    Asia  South-eastern Asia         High   \n",
       "30658    Singapore  2017     5710000    Asia  South-eastern Asia         High   \n",
       "30659    Singapore  2018     5790000    Asia  South-eastern Asia         High   \n",
       "32410        Spain  2017    46400000  Europe     Southern Europe         High   \n",
       "32411        Spain  2018    46400000  Europe     Southern Europe         High   \n",
       "33722  Switzerland  2015     8320000  Europe      Western Europe         High   \n",
       "33723  Switzerland  2016     8400000  Europe      Western Europe         High   \n",
       "33724  Switzerland  2017     8480000  Europe      Western Europe         High   \n",
       "33725  Switzerland  2018     8540000  Europe      Western Europe         High   \n",
       "\n",
       "       life_expectancy  income  children_per_woman  child_mortality  \\\n",
       "17513             83.2   36400                1.40             3.00   \n",
       "17514             83.4   37100                1.42             2.90   \n",
       "17515             83.6   37300                1.43             2.80   \n",
       "17516             83.8   37800                1.44             3.00   \n",
       "17517             83.9   38200                1.46             2.70   \n",
       "17518             84.0   38600                1.47             2.83   \n",
       "17519             84.2   39100                1.48             2.76   \n",
       "30653             83.2   76000                1.26             2.80   \n",
       "30654             83.2   78500                1.25             2.70   \n",
       "30655             83.4   80300                1.25             2.70   \n",
       "30656             83.6   80900                1.24             2.70   \n",
       "30657             83.7   81400                1.25             2.80   \n",
       "30658             83.8   82600                1.25             2.58   \n",
       "30659             84.0   83900                1.26             2.52   \n",
       "32410             83.1   34000                1.38             3.14   \n",
       "32411             83.2   34700                1.39             3.02   \n",
       "33722             83.1   56500                1.54             4.10   \n",
       "33723             83.1   56600                1.55             4.10   \n",
       "33724             83.3   56900                1.55             3.86   \n",
       "33725             83.5   57100                1.55             3.75   \n",
       "\n",
       "       pop_density  co2_per_capita  years_in_school_men  years_in_school_women  \n",
       "17513        352.0            9.58                 14.8                   15.2  \n",
       "17514        352.0            9.71                 14.9                   15.3  \n",
       "17515        352.0            9.47                 15.0                   15.4  \n",
       "17516        351.0             NaN                 15.1                   15.5  \n",
       "17517        350.0             NaN                  NaN                    NaN  \n",
       "17518        350.0             NaN                  NaN                    NaN  \n",
       "17519        349.0             NaN                  NaN                    NaN  \n",
       "30653       7530.0            6.90                 13.6                   13.3  \n",
       "30654       7660.0           10.40                 13.7                   13.5  \n",
       "30655       7780.0           10.30                 13.8                   13.7  \n",
       "30656       7910.0             NaN                 14.0                   13.8  \n",
       "30657       8030.0             NaN                  NaN                    NaN  \n",
       "30658       8160.0             NaN                  NaN                    NaN  \n",
       "30659       8270.0             NaN                  NaN                    NaN  \n",
       "32410         92.9             NaN                  NaN                    NaN  \n",
       "32411         93.0             NaN                  NaN                    NaN  \n",
       "33722        211.0             NaN                 14.6                   14.4  \n",
       "33723        213.0             NaN                  NaN                    NaN  \n",
       "33724        214.0             NaN                  NaN                    NaN  \n",
       "33725        216.0             NaN                  NaN                    NaN  "
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data[world_data['life_expectancy'] > 83]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As before, this can be combined with selection of a particular set of columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "      <th>life_expectancy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>17513</th>\n",
       "      <td>Japan</td>\n",
       "      <td>2012</td>\n",
       "      <td>83.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17514</th>\n",
       "      <td>Japan</td>\n",
       "      <td>2013</td>\n",
       "      <td>83.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17515</th>\n",
       "      <td>Japan</td>\n",
       "      <td>2014</td>\n",
       "      <td>83.6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17516</th>\n",
       "      <td>Japan</td>\n",
       "      <td>2015</td>\n",
       "      <td>83.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17517</th>\n",
       "      <td>Japan</td>\n",
       "      <td>2016</td>\n",
       "      <td>83.9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17518</th>\n",
       "      <td>Japan</td>\n",
       "      <td>2017</td>\n",
       "      <td>84.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17519</th>\n",
       "      <td>Japan</td>\n",
       "      <td>2018</td>\n",
       "      <td>84.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30653</th>\n",
       "      <td>Singapore</td>\n",
       "      <td>2012</td>\n",
       "      <td>83.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30654</th>\n",
       "      <td>Singapore</td>\n",
       "      <td>2013</td>\n",
       "      <td>83.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30655</th>\n",
       "      <td>Singapore</td>\n",
       "      <td>2014</td>\n",
       "      <td>83.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30656</th>\n",
       "      <td>Singapore</td>\n",
       "      <td>2015</td>\n",
       "      <td>83.6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30657</th>\n",
       "      <td>Singapore</td>\n",
       "      <td>2016</td>\n",
       "      <td>83.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30658</th>\n",
       "      <td>Singapore</td>\n",
       "      <td>2017</td>\n",
       "      <td>83.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30659</th>\n",
       "      <td>Singapore</td>\n",
       "      <td>2018</td>\n",
       "      <td>84.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32410</th>\n",
       "      <td>Spain</td>\n",
       "      <td>2017</td>\n",
       "      <td>83.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32411</th>\n",
       "      <td>Spain</td>\n",
       "      <td>2018</td>\n",
       "      <td>83.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33722</th>\n",
       "      <td>Switzerland</td>\n",
       "      <td>2015</td>\n",
       "      <td>83.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33723</th>\n",
       "      <td>Switzerland</td>\n",
       "      <td>2016</td>\n",
       "      <td>83.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33724</th>\n",
       "      <td>Switzerland</td>\n",
       "      <td>2017</td>\n",
       "      <td>83.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33725</th>\n",
       "      <td>Switzerland</td>\n",
       "      <td>2018</td>\n",
       "      <td>83.5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           country  year  life_expectancy\n",
       "17513        Japan  2012             83.2\n",
       "17514        Japan  2013             83.4\n",
       "17515        Japan  2014             83.6\n",
       "17516        Japan  2015             83.8\n",
       "17517        Japan  2016             83.9\n",
       "17518        Japan  2017             84.0\n",
       "17519        Japan  2018             84.2\n",
       "30653    Singapore  2012             83.2\n",
       "30654    Singapore  2013             83.2\n",
       "30655    Singapore  2014             83.4\n",
       "30656    Singapore  2015             83.6\n",
       "30657    Singapore  2016             83.7\n",
       "30658    Singapore  2017             83.8\n",
       "30659    Singapore  2018             84.0\n",
       "32410        Spain  2017             83.1\n",
       "32411        Spain  2018             83.2\n",
       "33722  Switzerland  2015             83.1\n",
       "33723  Switzerland  2016             83.1\n",
       "33724  Switzerland  2017             83.3\n",
       "33725  Switzerland  2018             83.5"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data.loc[world_data['life_expectancy'] > 83, ['country', 'year', 'life_expectancy']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A single expression can also be used to filter for several criteria, either matching *all* criteria (`&`) or *any* criteria (`|`). These operators are used instead of `and` and `or` so that the comparison is evaluated for each row in the data frame. Each comparison is wrapped in parentheses because `&` and `|` take precedence over comparison operators such as `==` and `>`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sub_region</th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>9496</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>Denmark</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11248</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>Estonia</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11905</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>Finland</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15409</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>Iceland</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16504</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>Ireland</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19132</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>Latvia</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20227</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>Lithuania</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25921</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>Norway</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33367</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>Sweden</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36871</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>United Kingdom</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            sub_region         country  year\n",
       "9496   Northern Europe         Denmark  1879\n",
       "11248  Northern Europe         Estonia  1879\n",
       "11905  Northern Europe         Finland  1879\n",
       "15409  Northern Europe         Iceland  1879\n",
       "16504  Northern Europe         Ireland  1879\n",
       "19132  Northern Europe          Latvia  1879\n",
       "20227  Northern Europe       Lithuania  1879\n",
       "25921  Northern Europe          Norway  1879\n",
       "33367  Northern Europe          Sweden  1879\n",
       "36871  Northern Europe  United Kingdom  1879"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# AND = &\n",
    "world_data.loc[(world_data['sub_region'] == 'Northern Europe') & (world_data['year'] == 1879), ['sub_region', 'country', 'year']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To increase readability, these statements can be split over multiple lines. Anything within parentheses or brackets in Python can be continued on the next line. Inside parentheses or brackets, the indentation is not significant to the Python interpreter, but it is still recommended to include it in order to make the code more readable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sub_region</th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>9496</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>Denmark</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11248</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>Estonia</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11905</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>Finland</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15409</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>Iceland</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16504</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>Ireland</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19132</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>Latvia</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20227</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>Lithuania</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25921</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>Norway</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33367</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>Sweden</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36871</th>\n",
       "      <td>Northern Europe</td>\n",
       "      <td>United Kingdom</td>\n",
       "      <td>1879</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            sub_region         country  year\n",
       "9496   Northern Europe         Denmark  1879\n",
       "11248  Northern Europe         Estonia  1879\n",
       "11905  Northern Europe         Finland  1879\n",
       "15409  Northern Europe         Iceland  1879\n",
       "16504  Northern Europe         Ireland  1879\n",
       "19132  Northern Europe          Latvia  1879\n",
       "20227  Northern Europe       Lithuania  1879\n",
       "25921  Northern Europe          Norway  1879\n",
       "33367  Northern Europe          Sweden  1879\n",
       "36871  Northern Europe  United Kingdom  1879"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data.loc[(world_data['sub_region'] == 'Northern Europe') &\n",
    "               (world_data['year'] == 1879),\n",
    "               ['sub_region', 'country', 'year']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above, it was assumed that `'Northern Europe'` was a value within the `sub_region` column. When it is not known which values are available in a column, the `unique()` method can be used to find this out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['Southern Asia', 'Southern Europe', 'Northern Africa',\n",
       "       'Sub-Saharan Africa', 'Latin America and the Caribbean',\n",
       "       'Western Asia', 'Australia and New Zealand', 'Western Europe',\n",
       "       'Eastern Europe', 'South-eastern Asia', 'Northern America',\n",
       "       'Eastern Asia', 'Northern Europe', 'Melanesia', 'Central Asia',\n",
       "       'Micronesia', 'Polynesia'], dtype=object)"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data['sub_region'].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the `|` operator, rows matching either of the supplied criteria are returned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1800</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1801</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>219</th>\n",
       "      <td>Albania</td>\n",
       "      <td>1800</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>220</th>\n",
       "      <td>Albania</td>\n",
       "      <td>1801</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>438</th>\n",
       "      <td>Algeria</td>\n",
       "      <td>1800</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         country  year\n",
       "0    Afghanistan  1800\n",
       "1    Afghanistan  1801\n",
       "219      Albania  1800\n",
       "220      Albania  1801\n",
       "438      Algeria  1800"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
    ],
    "source": [
     "# OR = |\n",
     "world_data.loc[(world_data['year'] == 1800) |\n",
     "               (world_data['year'] == 1801),\n",
     "               ['country', 'year']].head()"
   ]
  },
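  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `&` and `|` filters above can also be traced by hand on a much smaller table. The following is a minimal sketch on a made-up data frame; the name `toy` and all of its values are assumptions for illustration and are not part of `world_data`.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Made-up toy data (hypothetical values, not taken from world_data)\n",
    "toy = pd.DataFrame({'country': ['A', 'B', 'C', 'D'],\n",
    "                    'year': [1800, 1801, 1800, 1802],\n",
    "                    'income': [500, 900, 1500, 700]})\n",
    "\n",
    "# AND: a row is kept only if both conditions are true\n",
    "both = toy.loc[(toy['year'] == 1800) & (toy['income'] > 1000)]\n",
    "\n",
    "# OR: a row is kept if at least one condition is true\n",
    "either = toy.loc[(toy['year'] == 1800) | (toy['income'] > 800)]\n",
    "\n",
    "print(both['country'].tolist())    # ['C']\n",
    "print(either['country'].tolist())  # ['A', 'B', 'C']\n",
    "```\n",
    "\n",
    "Without the parentheses, `toy['year'] == 1800 & toy['income'] > 1000` would be read as a chained comparison around `1800 & toy['income']`, which typically raises an error instead of filtering the rows."
   ]
  },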
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Additional useful ways of subsetting the data include `between()`, which checks if a numerical value is within a given range (both end points are included by default), and `isin()`, which checks if a value is contained in a given list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010,\n",
       "       2011, 2012, 2013, 2014, 2015])"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data.loc[world_data['year'].between(2000, 2015), 'year'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['Asia', 'Africa', 'Americas'], dtype=object)"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data.loc[world_data['region'].isin(['Africa', 'Asia', 'Americas']), 'region'].unique()"
   ]
  },
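  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see exactly which rows `between()` and `isin()` keep, it can help to try them on a table small enough to check by eye. This is a minimal sketch on a made-up data frame (the name `toy` and its values are assumptions for illustration). It also shows the `~` operator, which inverts a boolean mask and is one way to select everything *except* the listed values.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Made-up toy data (hypothetical values, not taken from world_data)\n",
    "toy = pd.DataFrame({'region': ['Asia', 'Europe', 'Africa', 'Oceania'],\n",
    "                    'year': [1999, 2000, 2015, 2016]})\n",
    "\n",
    "# between() keeps values from 2000 up to and including 2015\n",
    "in_range = toy.loc[toy['year'].between(2000, 2015), 'year'].tolist()\n",
    "\n",
    "# isin() keeps rows whose value appears in the list\n",
    "listed = toy.loc[toy['region'].isin(['Asia', 'Africa']), 'region'].tolist()\n",
    "\n",
    "# ~ inverts the mask, keeping rows whose value is NOT in the list\n",
    "not_listed = toy.loc[~toy['region'].isin(['Asia', 'Africa']), 'region'].tolist()\n",
    "\n",
    "print(in_range)    # [2000, 2015]\n",
    "print(listed)      # ['Asia', 'Africa']\n",
    "print(not_listed)  # ['Europe', 'Oceania']\n",
    "```"
   ]
  },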
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating new columns\n",
    "\n",
    "A frequent operation when working with data is to create new columns based on the values in existing columns, for example to do unit conversions or find the ratio of values in two columns. To create a new column containing the product of the `income` and `population` columns (the total income of a country, if `income` is measured per person):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>population</th>\n",
       "      <th>income</th>\n",
       "      <th>population_income</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3280000</td>\n",
       "      <td>603</td>\n",
       "      <td>1977840000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3280000</td>\n",
       "      <td>603</td>\n",
       "      <td>1977840000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3280000</td>\n",
       "      <td>603</td>\n",
       "      <td>1977840000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3280000</td>\n",
       "      <td>603</td>\n",
       "      <td>1977840000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3280000</td>\n",
       "      <td>603</td>\n",
       "      <td>1977840000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   population  income  population_income\n",
       "0     3280000     603         1977840000\n",
       "1     3280000     603         1977840000\n",
       "2     3280000     603         1977840000\n",
       "3     3280000     603         1977840000\n",
       "4     3280000     603         1977840000"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data['population_income'] = world_data['income'] * world_data['population']\n",
    "world_data[['population', 'income', 'population_income']].head()"
   ]
  },
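  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same pattern covers the other two cases mentioned above, unit conversions and ratios. This is a minimal sketch on a made-up data frame; the name `toy`, the column `area_km2`, and all values are assumptions for illustration and do not exist in `world_data`.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Made-up toy data (hypothetical columns and values)\n",
    "toy = pd.DataFrame({'population': [1000, 2000, 4000],\n",
    "                    'area_km2': [10, 50, 40]})\n",
    "\n",
    "# Unit conversion: population expressed in thousands\n",
    "toy['population_thousands'] = toy['population'] / 1000\n",
    "\n",
    "# Ratio of two columns: people per square kilometre\n",
    "toy['people_per_km2'] = toy['population'] / toy['area_km2']\n",
    "\n",
    "print(toy)\n",
    "```\n",
    "\n",
    "Arithmetic between columns is carried out row by row, so no explicit loop is needed."
   ]
  },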
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">#### Challenge\n",
    ">\n",
    ">1. Subset `world_data` to include observations from 1995 to 2001. Check that the dimensions of the resulting data frame are 1253 x 15.\n",
    ">\n",
    ">2. Subset the data to include only observations from year 2000 and onwards, from all regions except 'Asia', and retain only the columns `country`, `year`, and `sub_region`. The dimensions of the resulting data frame should be 2489 x 3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2489, 3)"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Challenge solutions\n",
    "\n",
    "# 1. \n",
    "world_data.loc[world_data['year'].between(1995, 2001)].shape\n",
    "\n",
    "# 2.\n",
    "world_data.loc[(world_data['year'] >= 2000) &\n",
    "               (world_data['region'] != 'Asia'),\n",
    "               ['country', 'year', 'sub_region']].shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Split-apply-combine techniques in pandas\n",
    "\n",
    "Many data analysis tasks can be approached using the *split-apply-combine* paradigm: split the data into groups, apply some analysis to each group, and then combine the results.\n",
    "\n",
    "`pandas` facilitates this workflow through the use of `groupby()` to split data and summary/aggregation functions such as `mean()`, which collapse each group into a single-row summary of that group. The arguments to `groupby()` are the names of the columns that contain the *categorical* variables by which summary statistics should be calculated. To start, compute the total `population` by region.\n",
    "\n",
    "![Image credit Jake VanderPlas](img/split-apply-combine.png)\n",
    "\n",
    "*Image credit Jake VanderPlas*\n",
    "\n",
    "### Summarizing categorical data \n",
    "\n",
    "Aggregation (or \"summary\") methods, such as `.sum()` and `.mean()` can be used to calculate their respective statistics on subsets (groups) in the data. When the mean is computed, the default behavior is to ignore NA values, so they only need to be dropped if they are to be excluded from the visual output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "region\n",
       "Africa       59192998600\n",
       "Americas     63837885500\n",
       "Asia        330133218800\n",
       "Europe       98766930400\n",
       "Oceania       2422277600\n",
       "Name: population, dtype: int64"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data.groupby('region')['population'].sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output here is a series that is indexed with the grouped variable (the region) as the index and the result of the aggregation (the total population) as the values (conceptually, the only column).\n",
    "\n",
    "These populations numbers are abnormally high because the summary is made for all the years instead of only one. To view only the data from this year, use the learnt methods to subset for only 2018. Compare these results to the picture in the survey that placed 4 million people in Asia and 1 million in each of the other regions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "region\n",
       "Africa      1286388200\n",
       "Americas    1010688000\n",
       "Asia        4514211000\n",
       "Europe       742109000\n",
       "Oceania       40212000\n",
       "Name: population, dtype: int64"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data_2018 = world_data.loc[world_data['year'] == 2018]\n",
    "world_data_2018.groupby('region')['population'].sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These numbers are closer to the survey we took earlier. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Individual countries can be selected from the resulting series using `loc[]`, just as previously."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "region\n",
       "Asia      4514211000\n",
       "Europe     742109000\n",
       "Name: population, dtype: int64"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "avg_density = world_data_2018.groupby('region')['population'].sum()\n",
    "avg_density.loc[['Asia', 'Europe']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a shortcut, `loc[]` can be omitted when indexing a series. This is similar to selecting columns from a data frame with just `[]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "region\n",
       "Asia      4514211000\n",
       "Europe     742109000\n",
       "Name: population, dtype: int64"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "avg_density[['Asia', 'Europe']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " This indexing can be used to normalize the population numbers to the region of interest. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "region\n",
       "Africa      1.733422\n",
       "Americas    1.361913\n",
       "Asia        6.082949\n",
       "Europe      1.000000\n",
       "Oceania     0.054186\n",
       "Name: population, dtype: float64"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "region_pop_2018 = world_data_2018.groupby('region')['population'].sum()\n",
    "region_pop_2018 / region_pop_2018['Europe']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are 6 times as many people living in Asia than in Europe."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Groups can also be created from multiple columns, e.g. it could be interesting to compare the how densely populated countries are on average in different income brackets around the world."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "region    income_group\n",
       "Africa    High             207.000000\n",
       "          Low              118.640741\n",
       "          Lower middle      69.331250\n",
       "          Upper middle      94.457500\n",
       "Americas  High             136.426000\n",
       "          Low              403.000000\n",
       "          Lower middle     113.950000\n",
       "          Upper middle      92.931875\n",
       "Asia      High            1121.654545\n",
       "          Low              115.866667\n",
       "          Lower middle     262.606471\n",
       "          Upper middle     235.447692\n",
       "Europe    High             176.563214\n",
       "          Lower middle      99.500000\n",
       "          Upper middle      67.832222\n",
       "Oceania   High              10.610000\n",
       "          Lower middle      52.500000\n",
       "          Upper middle      90.266667\n",
       "Name: pop_density, dtype: float64"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data_2018.groupby(['region', 'income_group'])['pop_density'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that `income_group` is an ordinal variable, i.e. a categorical variable with an inherent order to it. Here, `pandas` has not listed the values of that variable in the order we would expect (low, lower-middle, upper-middle, high). The order of a variable can be specified in the data frame itself, using the top level `pandas` function `Categorical()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "CategoricalDtype(categories=['Low', 'Lower middle', 'Upper middle', 'High'], ordered=True)"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Reassign in the main data frame since we will use more than just the 2018 data later\n",
    "world_data['income_group'] = (\n",
    "    pd.Categorical(world_data['income_group'], ordered=True,\n",
    "                   categories=['Low', 'Lower middle', 'Upper middle', 'High'])\n",
    ")\n",
    "\n",
    "# Need to recreate the 2018 data frame since the categorical was changed in the main frame\n",
    "world_data_2018 = world_data.loc[world_data['year'] == 2018]\n",
    "world_data_2018['income_group'].dtype"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "region    income_group\n",
       "Africa    Low              118.640741\n",
       "          Lower middle      69.331250\n",
       "          Upper middle      94.457500\n",
       "          High             207.000000\n",
       "Americas  Low              403.000000\n",
       "          Lower middle     113.950000\n",
       "          Upper middle      92.931875\n",
       "          High             136.426000\n",
       "Asia      Low              115.866667\n",
       "          Lower middle     262.606471\n",
       "          Upper middle     235.447692\n",
       "          High            1121.654545\n",
       "Europe    Lower middle      99.500000\n",
       "          Upper middle      67.832222\n",
       "          High             176.563214\n",
       "Oceania   Lower middle      52.500000\n",
       "          Upper middle      90.266667\n",
       "          High              10.610000\n",
       "Name: pop_density, dtype: float64"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data_2018.groupby(['region', 'income_group'])['pop_density'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the values appear in the order we would expect. The value for Asia in the high income bracket looks suspiciously high. It would be interesting to see which countries were averaged to that value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>pop_density</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2627</th>\n",
       "      <td>Bahrain</td>\n",
       "      <td>2060.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9197</th>\n",
       "      <td>Cyprus</td>\n",
       "      <td>129.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16862</th>\n",
       "      <td>Israel</td>\n",
       "      <td>391.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17519</th>\n",
       "      <td>Japan</td>\n",
       "      <td>349.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18614</th>\n",
       "      <td>Kuwait</td>\n",
       "      <td>236.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26279</th>\n",
       "      <td>Oman</td>\n",
       "      <td>15.6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28469</th>\n",
       "      <td>Qatar</td>\n",
       "      <td>232.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29564</th>\n",
       "      <td>Saudi Arabia</td>\n",
       "      <td>15.6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30659</th>\n",
       "      <td>Singapore</td>\n",
       "      <td>8270.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31973</th>\n",
       "      <td>South Korea</td>\n",
       "      <td>526.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36791</th>\n",
       "      <td>United Arab Emirates</td>\n",
       "      <td>114.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    country  pop_density\n",
       "2627                Bahrain       2060.0\n",
       "9197                 Cyprus        129.0\n",
       "16862                Israel        391.0\n",
       "17519                 Japan        349.0\n",
       "18614                Kuwait        236.0\n",
       "26279                  Oman         15.6\n",
       "28469                 Qatar        232.0\n",
       "29564          Saudi Arabia         15.6\n",
       "30659             Singapore       8270.0\n",
       "31973           South Korea        526.0\n",
       "36791  United Arab Emirates        114.0"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data_2018.loc[(world_data['region'] == 'Asia') &\n",
    "                    (world_data['income_group'] == 'High'),\n",
    "                    ['country', 'pop_density']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Extreme values, such as the city-state Singapore, can heavily skew averages and it could be a good idea to use a more robust statistics such as the median instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "region    income_group\n",
       "Africa    Low              66.70\n",
       "          Lower middle     74.75\n",
       "          Upper middle     12.81\n",
       "          High            207.00\n",
       "Americas  Low             403.00\n",
       "          Lower middle     68.20\n",
       "          Upper middle     55.95\n",
       "          High             37.80\n",
       "Asia      Low              82.35\n",
       "          Lower middle     92.00\n",
       "          Upper middle    106.00\n",
       "          High            236.00\n",
       "Europe    Lower middle     99.50\n",
       "          Upper middle     68.70\n",
       "          High            109.50\n",
       "Oceania   Lower middle     22.70\n",
       "          Upper middle     69.90\n",
       "          High             10.61\n",
       "Name: pop_density, dtype: float64"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data_2018.groupby(['region', 'income_group'])['pop_density'].median()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The returned series has an index that is a combination of the columns `region` and `sub_region`, and referred to as a `MultiIndex`. The same syntax as previously can be used to select rows on the species-level."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "region    income_group\n",
       "Africa    Low              66.70\n",
       "          Lower middle     74.75\n",
       "          Upper middle     12.81\n",
       "          High            207.00\n",
       "Americas  Low             403.00\n",
       "          Lower middle     68.20\n",
       "          Upper middle     55.95\n",
       "          High             37.80\n",
       "Name: pop_density, dtype: float64"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "med_density_2018 = world_data_2018.groupby(['region', 'income_group'])['pop_density'].median()\n",
    "med_density_2018[['Africa', 'Americas']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To select specific values from both levels of the `MultiIndex`, a list of tuples can be passed to `loc[]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "region    income_group\n",
       "Africa    High            207.0\n",
       "Americas  High             37.8\n",
       "Name: pop_density, dtype: float64"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "med_density_2018.loc[[('Africa', 'High'), ('Americas', 'High')]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To select only the low income values from all region, the `xs()` (cross section) method can be used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "region\n",
       "Africa       66.70\n",
       "Americas    403.00\n",
       "Asia         82.35\n",
       "Name: pop_density, dtype: float64"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "med_density_2018.xs('Low', level='income_group')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The names and values of the index levels can be seen by inspecting the index object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultiIndex(levels=[['Africa', 'Americas', 'Asia', 'Europe', 'Oceania'], ['Low', 'Lower middle', 'Upper middle', 'High']],\n",
       "           labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 1, 2, 3, 1, 2, 3]],\n",
       "           names=['region', 'income_group'])"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "med_density_2018.index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although `MultiIndexes` offer succinct and fast ways to access data, they also requires memorization of additional syntax and are strictly speaking not essential unless speed is of particular concern. It can therefore be easier to reset the index, so that all values are stored in columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>region</th>\n",
       "      <th>income_group</th>\n",
       "      <th>pop_density</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>66.70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Africa</td>\n",
       "      <td>Lower middle</td>\n",
       "      <td>74.75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Africa</td>\n",
       "      <td>Upper middle</td>\n",
       "      <td>12.81</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Africa</td>\n",
       "      <td>High</td>\n",
       "      <td>207.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Americas</td>\n",
       "      <td>Low</td>\n",
       "      <td>403.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Americas</td>\n",
       "      <td>Lower middle</td>\n",
       "      <td>68.20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Americas</td>\n",
       "      <td>Upper middle</td>\n",
       "      <td>55.95</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Americas</td>\n",
       "      <td>High</td>\n",
       "      <td>37.80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>82.35</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Asia</td>\n",
       "      <td>Lower middle</td>\n",
       "      <td>92.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Asia</td>\n",
       "      <td>Upper middle</td>\n",
       "      <td>106.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Asia</td>\n",
       "      <td>High</td>\n",
       "      <td>236.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Europe</td>\n",
       "      <td>Lower middle</td>\n",
       "      <td>99.50</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Europe</td>\n",
       "      <td>Upper middle</td>\n",
       "      <td>68.70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Europe</td>\n",
       "      <td>High</td>\n",
       "      <td>109.50</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Oceania</td>\n",
       "      <td>Lower middle</td>\n",
       "      <td>22.70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>Oceania</td>\n",
       "      <td>Upper middle</td>\n",
       "      <td>69.90</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>Oceania</td>\n",
       "      <td>High</td>\n",
       "      <td>10.61</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      region  income_group  pop_density\n",
       "0     Africa           Low        66.70\n",
       "1     Africa  Lower middle        74.75\n",
       "2     Africa  Upper middle        12.81\n",
       "3     Africa          High       207.00\n",
       "4   Americas           Low       403.00\n",
       "5   Americas  Lower middle        68.20\n",
       "6   Americas  Upper middle        55.95\n",
       "7   Americas          High        37.80\n",
       "8       Asia           Low        82.35\n",
       "9       Asia  Lower middle        92.00\n",
       "10      Asia  Upper middle       106.00\n",
       "11      Asia          High       236.00\n",
       "12    Europe  Lower middle        99.50\n",
       "13    Europe  Upper middle        68.70\n",
       "14    Europe          High       109.50\n",
       "15   Oceania  Lower middle        22.70\n",
       "16   Oceania  Upper middle        69.90\n",
       "17   Oceania          High        10.61"
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "med_density_2018_res = med_density_2018.reset_index()\n",
    "med_density_2018_res"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After resetting the index, the same comparison syntax introduced earlier can be used instead of `xs()` or passing lists of tuples to `loc[]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>region</th>\n",
       "      <th>income_group</th>\n",
       "      <th>pop_density</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Africa</td>\n",
       "      <td>Low</td>\n",
       "      <td>66.70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Americas</td>\n",
       "      <td>Low</td>\n",
       "      <td>403.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>82.35</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     region income_group  pop_density\n",
       "0    Africa          Low        66.70\n",
       "4  Americas          Low       403.00\n",
       "8      Asia          Low        82.35"
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "med_density_2018_asia = med_density_2018_res.loc[med_density_2018_res['income_group'] == 'Low']\n",
    "med_density_2018_asia"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`reset_index()` grants the freedom of not having to work with indexes, but it is still worth keeping in mind that selecting on an index level with `xs()` can be orders of magnitude faster than using boolean comparisons (on large data frames).\n",
    "\n",
    "The opposite operation (to create an index) can be performed with `set_index()` on any column (or combination of columns) that creates an index with unique values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>pop_density</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>region</th>\n",
       "      <th>income_group</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Africa</th>\n",
       "      <th>Low</th>\n",
       "      <td>66.70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Americas</th>\n",
       "      <th>Low</th>\n",
       "      <td>403.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Asia</th>\n",
       "      <th>Low</th>\n",
       "      <td>82.35</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                       pop_density\n",
       "region   income_group             \n",
       "Africa   Low                 66.70\n",
       "Americas Low                403.00\n",
       "Asia     Low                 82.35"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "med_density_2018_asia.set_index(['region', 'income_group'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Challenge\n",
    ">\n",
    "> 1. Which is the highest population density in each region?\n",
    ">\n",
    "> 2. The low income group for the Americas had the same population density for both the mean and the median. This could mean that there are few observations in this group. List all the low income countries in the Americas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "region\n",
       "Africa       625.0\n",
       "Americas     666.0\n",
       "Asia        8270.0\n",
       "Europe      1350.0\n",
       "Oceania      151.0\n",
       "Name: pop_density, dtype: float64"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Challenge solutions\n",
    "\n",
    "# 1.\n",
    "world_data_2018.groupby('region')['pop_density'].max()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>pop_density</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>14891</th>\n",
       "      <td>Haiti</td>\n",
       "      <td>403.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      country  pop_density\n",
       "14891   Haiti        403.0"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# This will be a challenge\n",
    "\n",
    "# 2.\n",
    "world_data_2018.loc[(world_data['region'] == 'Americas') & (world_data['income_group'] == 'Low'), ['country', 'pop_density']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Multiple aggregations on grouped data\n",
    "\n",
    "Since the same grouped data frame will be used in multiple code chunks below, this can be assigned to a new variable instead of typing out the grouping expression each time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "region    sub_region                     \n",
       "Africa    Northern Africa                    74.716667\n",
       "          Sub-Saharan Africa                 63.682609\n",
       "Americas  Latin America and the Caribbean    75.600000\n",
       "          Northern America                   80.650000\n",
       "Asia      Central Asia                       71.340000\n",
       "          Eastern Asia                       76.440000\n",
       "          South-eastern Asia                 73.630000\n",
       "          Southern Asia                      72.211111\n",
       "          Western Asia                       76.122222\n",
       "Europe    Eastern Europe                     75.110000\n",
       "          Northern Europe                    80.140000\n",
       "          Southern Europe                    79.466667\n",
       "          Western Europe                     82.100000\n",
       "Oceania   Australia and New Zealand          82.350000\n",
       "          Melanesia                          63.700000\n",
       "          Micronesia                         62.200000\n",
       "          Polynesia                          71.550000\n",
       "Name: life_expectancy, dtype: float64"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_world_data = world_data_2018.groupby(['region', 'sub_region'])\n",
    "grouped_world_data['life_expectancy'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instead of using the `mean()` or `sum()` methods directly, the more general `agg()` method could be called to aggregate by *any* existing aggregation functions. The equivalent to the `mean()` method would be to call `agg()` and specify `'mean'`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "region    sub_region                     \n",
       "Africa    Northern Africa                    74.716667\n",
       "          Sub-Saharan Africa                 63.682609\n",
       "Americas  Latin America and the Caribbean    75.600000\n",
       "          Northern America                   80.650000\n",
       "Asia      Central Asia                       71.340000\n",
       "          Eastern Asia                       76.440000\n",
       "          South-eastern Asia                 73.630000\n",
       "          Southern Asia                      72.211111\n",
       "          Western Asia                       76.122222\n",
       "Europe    Eastern Europe                     75.110000\n",
       "          Northern Europe                    80.140000\n",
       "          Southern Europe                    79.466667\n",
       "          Western Europe                     82.100000\n",
       "Oceania   Australia and New Zealand          82.350000\n",
       "          Melanesia                          63.700000\n",
       "          Micronesia                         62.200000\n",
       "          Polynesia                          71.550000\n",
       "Name: life_expectancy, dtype: float64"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_world_data['life_expectancy'].agg('mean')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This general approach is more flexible and powerful since multiple aggregation functions can be applied in the same line of code by passing them as a list to `agg()`. For instance, the standard deviation and mean could be computed in the same call by passing them in a list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>mean</th>\n",
       "      <th>std</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>region</th>\n",
       "      <th>sub_region</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">Africa</th>\n",
       "      <th>Northern Africa</th>\n",
       "      <td>74.716667</td>\n",
       "      <td>3.510793</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sub-Saharan Africa</th>\n",
       "      <td>63.682609</td>\n",
       "      <td>4.540108</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">Americas</th>\n",
       "      <th>Latin America and the Caribbean</th>\n",
       "      <td>75.600000</td>\n",
       "      <td>3.721559</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Northern America</th>\n",
       "      <td>80.650000</td>\n",
       "      <td>2.192031</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">Asia</th>\n",
       "      <th>Central Asia</th>\n",
       "      <td>71.340000</td>\n",
       "      <td>0.808084</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Eastern Asia</th>\n",
       "      <td>76.440000</td>\n",
       "      <td>6.566430</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>South-eastern Asia</th>\n",
       "      <td>73.630000</td>\n",
       "      <td>4.835298</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Southern Asia</th>\n",
       "      <td>72.211111</td>\n",
       "      <td>6.426983</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Western Asia</th>\n",
       "      <td>76.122222</td>\n",
       "      <td>4.585214</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">Europe</th>\n",
       "      <th>Eastern Europe</th>\n",
       "      <td>75.110000</td>\n",
       "      <td>2.711478</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Northern Europe</th>\n",
       "      <td>80.140000</td>\n",
       "      <td>2.958678</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Southern Europe</th>\n",
       "      <td>79.466667</td>\n",
       "      <td>2.694889</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Western Europe</th>\n",
       "      <td>82.100000</td>\n",
       "      <td>0.804156</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">Oceania</th>\n",
       "      <th>Australia and New Zealand</th>\n",
       "      <td>82.350000</td>\n",
       "      <td>0.777817</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Melanesia</th>\n",
       "      <td>63.700000</td>\n",
       "      <td>1.961292</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Micronesia</th>\n",
       "      <td>62.200000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Polynesia</th>\n",
       "      <td>71.550000</td>\n",
       "      <td>1.202082</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               mean       std\n",
       "region   sub_region                                          \n",
       "Africa   Northern Africa                  74.716667  3.510793\n",
       "         Sub-Saharan Africa               63.682609  4.540108\n",
       "Americas Latin America and the Caribbean  75.600000  3.721559\n",
       "         Northern America                 80.650000  2.192031\n",
       "Asia     Central Asia                     71.340000  0.808084\n",
       "         Eastern Asia                     76.440000  6.566430\n",
       "         South-eastern Asia               73.630000  4.835298\n",
       "         Southern Asia                    72.211111  6.426983\n",
       "         Western Asia                     76.122222  4.585214\n",
       "Europe   Eastern Europe                   75.110000  2.711478\n",
       "         Northern Europe                  80.140000  2.958678\n",
       "         Southern Europe                  79.466667  2.694889\n",
       "         Western Europe                   82.100000  0.804156\n",
       "Oceania  Australia and New Zealand        82.350000  0.777817\n",
       "         Melanesia                        63.700000  1.961292\n",
       "         Micronesia                       62.200000       NaN\n",
       "         Polynesia                        71.550000  1.202082"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_world_data['life_expectancy'].agg(['mean', 'std'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The returned output is in this case a data frame and the `MultiIndex` is indicated in bold font.\n",
    "\n",
    "By passing a dictionary to `.agg()` it is possible to apply different aggregations to the different columns. Long code statements can be broken down into multiple lines if they are enclosed by parentheses, brackets or braces, something that will be described in detail later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>population</th>\n",
       "      <th colspan=\"3\" halign=\"left\">income</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>sum</th>\n",
       "      <th>min</th>\n",
       "      <th>median</th>\n",
       "      <th>max</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>region</th>\n",
       "      <th>sub_region</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">Africa</th>\n",
       "      <th>Northern Africa</th>\n",
       "      <td>237270000</td>\n",
       "      <td>4440</td>\n",
       "      <td>11200</td>\n",
       "      <td>18300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sub-Saharan Africa</th>\n",
       "      <td>1049118200</td>\n",
       "      <td>629</td>\n",
       "      <td>1985</td>\n",
       "      <td>27500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">Americas</th>\n",
       "      <th>Latin America and the Caribbean</th>\n",
       "      <td>646688000</td>\n",
       "      <td>1710</td>\n",
       "      <td>13700</td>\n",
       "      <td>30300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Northern America</th>\n",
       "      <td>364000000</td>\n",
       "      <td>43800</td>\n",
       "      <td>49350</td>\n",
       "      <td>54900</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">Asia</th>\n",
       "      <th>Central Asia</th>\n",
       "      <td>71890000</td>\n",
       "      <td>2920</td>\n",
       "      <td>6690</td>\n",
       "      <td>24200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Eastern Asia</th>\n",
       "      <td>1626920000</td>\n",
       "      <td>1390</td>\n",
       "      <td>16000</td>\n",
       "      <td>39100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>South-eastern Asia</th>\n",
       "      <td>655870000</td>\n",
       "      <td>1490</td>\n",
       "      <td>7255</td>\n",
       "      <td>83900</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Southern Asia</th>\n",
       "      <td>1887261000</td>\n",
       "      <td>1870</td>\n",
       "      <td>6890</td>\n",
       "      <td>17400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Western Asia</th>\n",
       "      <td>272270000</td>\n",
       "      <td>2430</td>\n",
       "      <td>20750</td>\n",
       "      <td>121000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">Europe</th>\n",
       "      <th>Eastern Europe</th>\n",
       "      <td>291970000</td>\n",
       "      <td>5330</td>\n",
       "      <td>24100</td>\n",
       "      <td>32300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Northern Europe</th>\n",
       "      <td>104478000</td>\n",
       "      <td>25500</td>\n",
       "      <td>43450</td>\n",
       "      <td>65600</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Southern Europe</th>\n",
       "      <td>151681000</td>\n",
       "      <td>12100</td>\n",
       "      <td>24050</td>\n",
       "      <td>37900</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Western Europe</th>\n",
       "      <td>193980000</td>\n",
       "      <td>39000</td>\n",
       "      <td>45200</td>\n",
       "      <td>99000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">Oceania</th>\n",
       "      <th>Australia and New Zealand</th>\n",
       "      <td>29550000</td>\n",
       "      <td>36400</td>\n",
       "      <td>41100</td>\n",
       "      <td>45800</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Melanesia</th>\n",
       "      <td>10237000</td>\n",
       "      <td>2110</td>\n",
       "      <td>2850</td>\n",
       "      <td>9420</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Micronesia</th>\n",
       "      <td>118000</td>\n",
       "      <td>1890</td>\n",
       "      <td>1890</td>\n",
       "      <td>1890</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Polynesia</th>\n",
       "      <td>307000</td>\n",
       "      <td>5500</td>\n",
       "      <td>5725</td>\n",
       "      <td>5950</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                          population income               \n",
       "                                                 sum    min median     max\n",
       "region   sub_region                                                       \n",
       "Africa   Northern Africa                   237270000   4440  11200   18300\n",
       "         Sub-Saharan Africa               1049118200    629   1985   27500\n",
       "Americas Latin America and the Caribbean   646688000   1710  13700   30300\n",
       "         Northern America                  364000000  43800  49350   54900\n",
       "Asia     Central Asia                       71890000   2920   6690   24200\n",
       "         Eastern Asia                     1626920000   1390  16000   39100\n",
       "         South-eastern Asia                655870000   1490   7255   83900\n",
       "         Southern Asia                    1887261000   1870   6890   17400\n",
       "         Western Asia                      272270000   2430  20750  121000\n",
       "Europe   Eastern Europe                    291970000   5330  24100   32300\n",
       "         Northern Europe                   104478000  25500  43450   65600\n",
       "         Southern Europe                   151681000  12100  24050   37900\n",
       "         Western Europe                    193980000  39000  45200   99000\n",
       "Oceania  Australia and New Zealand          29550000  36400  41100   45800\n",
       "         Melanesia                          10237000   2110   2850    9420\n",
       "         Micronesia                           118000   1890   1890    1890\n",
       "         Polynesia                            307000   5500   5725    5950"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_world_data[['population', 'income']].agg(\n",
    "    {'population': 'sum',\n",
    "     'income': ['min', 'median', 'max']\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are plenty of aggregation methods available in pandas (e.g. `sem`, `mad`, `sum`, most of which can be seen at [the end of this section](https://pandas.pydata.org/pandas-docs/stable/groupby.html#aggregation) in the `pandas` documentation, or explored using tab-complete on the grouped data frame)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This is a side note, no need to bring up unless someone has issues\n",
    "# Tab completion might only work like this:\n",
    "# find_agg_methods = grouped_world_data['weight']\n",
    "# find_agg_methods.<tab>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Even if a function is not part of the `pandas` library, it can be passed to `agg()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "region    sub_region                     \n",
       "Africa    Northern Africa                     50.113333\n",
       "          Sub-Saharan Africa                 108.143043\n",
       "Americas  Latin America and the Caribbean    126.558966\n",
       "          Northern America                    19.880000\n",
       "Asia      Central Asia                        38.504000\n",
       "          Eastern Asia                       248.202000\n",
       "          South-eastern Asia                 961.110000\n",
       "          Southern Asia                      460.388889\n",
       "          Western Asia                       298.355556\n",
       "Europe    Eastern Europe                      88.629000\n",
       "          Northern Europe                     64.897000\n",
       "          Southern Europe                    202.166667\n",
       "          Western Europe                     256.000000\n",
       "Oceania   Australia and New Zealand           10.610000\n",
       "          Melanesia                           28.475000\n",
       "          Micronesia                         146.000000\n",
       "          Polynesia                          110.450000\n",
       "Name: pop_density, dtype: float64"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "grouped_world_data['pop_density'].agg(np.mean)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Any function can be passed like this, including user-created functions. "
   ]
  },
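  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of passing a user-defined function to `agg()`, the example below computes the range (largest minus smallest value) within each group. The data frame `toy_data` and the function `value_range` are hypothetical names made up for this illustration and are not part of the Gapminder data or of `pandas`.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data, not the Gapminder data set\n",
    "toy_data = pd.DataFrame({\n",
    "    'region': ['A', 'A', 'A', 'B', 'B'],\n",
    "    'life_expectancy': [60.0, 70.0, 80.0, 75.0, 79.0],\n",
    "})\n",
    "\n",
    "\n",
    "def value_range(values):\n",
    "    # Receives the values of one group as a Series and returns one number\n",
    "    return values.max() - values.min()\n",
    "\n",
    "\n",
    "# The function object itself is passed, without quotes or parentheses\n",
    "toy_ranges = toy_data.groupby('region')['life_expectancy'].agg(value_range)\n",
    "print(toy_ranges)\n",
    "```\n",
    "\n",
    "The same pattern should carry over to the lesson data, for example by replacing `toy_data` with `world_data_2018`, provided the custom function returns a single value per group."
   ]
  },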
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> #### Challenge\n",
    ">\n",
    "> 1. What's the mean life expectancy for each income group in 2018?\n",
    "> \n",
 2. What's the min, median">
    "> 2. What are the min, median, and max life expectancies for each income group within each region?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "income_group\n",
       "Low             63.744118\n",
       "Lower middle    69.053488\n",
       "Upper middle    74.283673\n",
       "High            79.919231\n",
       "Name: life_expectancy, dtype: float64"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Challenge solutions\n",
    "\n",
    "# 1.\n",
    "world_data_2018.groupby('income_group')['life_expectancy'].mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>min</th>\n",
       "      <th>median</th>\n",
       "      <th>max</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>region</th>\n",
       "      <th>income_group</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">Africa</th>\n",
       "      <th>Low</th>\n",
       "      <td>51.6</td>\n",
       "      <td>62.50</td>\n",
       "      <td>68.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Lower middle</th>\n",
       "      <td>51.1</td>\n",
       "      <td>66.35</td>\n",
       "      <td>78.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Upper middle</th>\n",
       "      <td>63.5</td>\n",
       "      <td>67.10</td>\n",
       "      <td>77.9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>High</th>\n",
       "      <td>74.2</td>\n",
       "      <td>74.20</td>\n",
       "      <td>74.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">Americas</th>\n",
       "      <th>Low</th>\n",
       "      <td>64.5</td>\n",
       "      <td>64.50</td>\n",
       "      <td>64.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Lower middle</th>\n",
       "      <td>73.1</td>\n",
       "      <td>74.90</td>\n",
       "      <td>78.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Upper middle</th>\n",
       "      <td>68.2</td>\n",
       "      <td>75.80</td>\n",
       "      <td>81.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>High</th>\n",
       "      <td>73.4</td>\n",
       "      <td>77.60</td>\n",
       "      <td>82.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">Asia</th>\n",
       "      <th>Low</th>\n",
       "      <td>58.7</td>\n",
       "      <td>70.45</td>\n",
       "      <td>72.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Lower middle</th>\n",
       "      <td>67.9</td>\n",
       "      <td>71.50</td>\n",
       "      <td>77.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Upper middle</th>\n",
       "      <td>68.0</td>\n",
       "      <td>76.50</td>\n",
       "      <td>80.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>High</th>\n",
       "      <td>76.9</td>\n",
       "      <td>80.70</td>\n",
       "      <td>84.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">Europe</th>\n",
       "      <th>Lower middle</th>\n",
       "      <td>72.3</td>\n",
       "      <td>72.35</td>\n",
       "      <td>72.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Upper middle</th>\n",
       "      <td>71.1</td>\n",
       "      <td>75.50</td>\n",
       "      <td>78.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>High</th>\n",
       "      <td>75.1</td>\n",
       "      <td>81.30</td>\n",
       "      <td>83.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">Oceania</th>\n",
       "      <th>Lower middle</th>\n",
       "      <td>61.1</td>\n",
       "      <td>62.90</td>\n",
       "      <td>64.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Upper middle</th>\n",
       "      <td>65.8</td>\n",
       "      <td>70.70</td>\n",
       "      <td>72.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>High</th>\n",
       "      <td>81.8</td>\n",
       "      <td>82.35</td>\n",
       "      <td>82.9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        min  median   max\n",
       "region   income_group                    \n",
       "Africa   Low           51.6   62.50  68.3\n",
       "         Lower middle  51.1   66.35  78.0\n",
       "         Upper middle  63.5   67.10  77.9\n",
       "         High          74.2   74.20  74.2\n",
       "Americas Low           64.5   64.50  64.5\n",
       "         Lower middle  73.1   74.90  78.7\n",
       "         Upper middle  68.2   75.80  81.4\n",
       "         High          73.4   77.60  82.2\n",
       "Asia     Low           58.7   70.45  72.2\n",
       "         Lower middle  67.9   71.50  77.8\n",
       "         Upper middle  68.0   76.50  80.5\n",
       "         High          76.9   80.70  84.2\n",
       "Europe   Lower middle  72.3   72.35  72.4\n",
       "         Upper middle  71.1   75.50  78.0\n",
       "         High          75.1   81.30  83.5\n",
       "Oceania  Lower middle  61.1   62.90  64.3\n",
       "         Upper middle  65.8   70.70  72.4\n",
       "         High          81.8   82.35  82.9"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 2.\n",
    "world_data_2018.groupby(['region', 'income_group'])['life_expectancy'].agg(['min', 'median', 'max'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Additional sections (time permitting)\n",
    "\n",
    "### Using `size()` to summarize categorical data \n",
    "\n",
    "When working with data, it is common to want to know the number of observations present for each categorical variable. For this, `pandas` provides the `size()` method. For example, to find the number of observations (in this case unique countries during year 2018) per region:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "region\n",
       "Africa      52\n",
       "Americas    31\n",
       "Asia        47\n",
       "Europe      39\n",
       "Oceania      9\n",
       "dtype: int64"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data_2018.groupby('region').size()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`size()` can also be used when grouping on multiple variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "region    income_group\n",
       "Africa    Low             27\n",
       "          Lower middle    16\n",
       "          Upper middle     8\n",
       "          High             1\n",
       "Americas  Low              1\n",
       "          Lower middle     4\n",
       "          Upper middle    16\n",
       "          High            10\n",
       "Asia      Low              6\n",
       "          Lower middle    17\n",
       "          Upper middle    13\n",
       "          High            11\n",
       "Europe    Lower middle     2\n",
       "          Upper middle     9\n",
       "          High            28\n",
       "Oceania   Lower middle     4\n",
       "          Upper middle     3\n",
       "          High             2\n",
       "dtype: int64"
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data_2018.groupby(['region', 'income_group']).size()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If there are many groups, `size()` is not that useful on its own. For example, it is difficult to quickly find the five most abundant species among the observations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sub_region\n",
       "Australia and New Zealand           2\n",
       "Central Asia                        5\n",
       "Eastern Asia                        5\n",
       "Eastern Europe                     10\n",
       "Latin America and the Caribbean    29\n",
       "Melanesia                           4\n",
       "Micronesia                          1\n",
       "Northern Africa                     6\n",
       "Northern America                    2\n",
       "Northern Europe                    10\n",
       "Polynesia                           2\n",
       "South-eastern Asia                 10\n",
       "Southern Asia                       9\n",
       "Southern Europe                    12\n",
       "Sub-Saharan Africa                 46\n",
       "Western Asia                       18\n",
       "Western Europe                      7\n",
       "dtype: int64"
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data_2018.groupby('sub_region').size()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since there are many rows in this output, it would be beneficial to sort the table values and display the most abundant species first. This is easy to do with the `sort_values()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sub_region\n",
       "Micronesia                          1\n",
       "Australia and New Zealand           2\n",
       "Polynesia                           2\n",
       "Northern America                    2\n",
       "Melanesia                           4\n",
       "Eastern Asia                        5\n",
       "Central Asia                        5\n",
       "Northern Africa                     6\n",
       "Western Europe                      7\n",
       "Southern Asia                       9\n",
       "Northern Europe                    10\n",
       "South-eastern Asia                 10\n",
       "Eastern Europe                     10\n",
       "Southern Europe                    12\n",
       "Western Asia                       18\n",
       "Latin America and the Caribbean    29\n",
       "Sub-Saharan Africa                 46\n",
       "dtype: int64"
      ]
     },
     "execution_count": 85,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data_2018.groupby('sub_region').size().sort_values()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's better, but it could be helpful to display the most abundant species on top. In other words, the output should be arranged in descending order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sub_region\n",
       "Sub-Saharan Africa                 46\n",
       "Latin America and the Caribbean    29\n",
       "Western Asia                       18\n",
       "Southern Europe                    12\n",
       "Eastern Europe                     10\n",
       "dtype: int64"
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data_2018.groupby('sub_region').size().sort_values(ascending=False).head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looks good! By now, the code statement has grown quite long because many methods have been *chained* together. It can be tricky to keep track of what is going on in long method chains. To make the code more readable, it can be broken up multiple lines by adding a surrounding parenthesis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sub_region\n",
       "Sub-Saharan Africa                 46\n",
       "Latin America and the Caribbean    29\n",
       "Western Asia                       18\n",
       "Southern Europe                    12\n",
       "Eastern Europe                     10\n",
       "dtype: int64"
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(world_data_2018\n",
    "     .groupby('sub_region')\n",
    "     .size()\n",
    "     .sort_values(ascending=False)\n",
    "     .head(5)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This looks neater and makes long method chains easier to reads. There is no absolute rule for when to break code into multiple line, but always try to write code that is easy for collaborators (your most common collaborator is a future version of yourself!) to understand.\n",
    "\n",
    "`pandas` actually has a convenience function for returning the top five results, so the values don't need to be sorted explicitly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sub_region\n",
       "Sub-Saharan Africa                 46\n",
       "Latin America and the Caribbean    29\n",
       "Western Asia                       18\n",
       "Southern Europe                    12\n",
       "Eastern Europe                     10\n",
       "dtype: int64"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(world_data_2018\n",
    "     .groupby(['sub_region'])\n",
    "     .size()\n",
    "     .nlargest() # the default is 5\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To include more attributes about these countries, add those columns to `groupby()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "region    sub_region                     \n",
       "Africa    Sub-Saharan Africa                 46\n",
       "Americas  Latin America and the Caribbean    29\n",
       "Asia      Western Asia                       18\n",
       "Europe    Southern Europe                    12\n",
       "Asia      South-eastern Asia                 10\n",
       "dtype: int64"
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(world_data_2018\n",
    "     .groupby(['region', 'sub_region'])\n",
    "     .size()\n",
    "     .nlargest() # the default is 5\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "      <th>population</th>\n",
       "      <th>region</th>\n",
       "      <th>sub_region</th>\n",
       "      <th>income_group</th>\n",
       "      <th>life_expectancy</th>\n",
       "      <th>income</th>\n",
       "      <th>children_per_woman</th>\n",
       "      <th>child_mortality</th>\n",
       "      <th>pop_density</th>\n",
       "      <th>co2_per_capita</th>\n",
       "      <th>years_in_school_men</th>\n",
       "      <th>years_in_school_women</th>\n",
       "      <th>population_income</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1800</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1977840000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1801</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1977840000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1802</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1977840000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1803</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1977840000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>1804</td>\n",
       "      <td>3280000</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Southern Asia</td>\n",
       "      <td>Low</td>\n",
       "      <td>28.2</td>\n",
       "      <td>603</td>\n",
       "      <td>7.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1977840000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       country  year  population region     sub_region income_group  \\\n",
       "0  Afghanistan  1800     3280000   Asia  Southern Asia          Low   \n",
       "1  Afghanistan  1801     3280000   Asia  Southern Asia          Low   \n",
       "2  Afghanistan  1802     3280000   Asia  Southern Asia          Low   \n",
       "3  Afghanistan  1803     3280000   Asia  Southern Asia          Low   \n",
       "4  Afghanistan  1804     3280000   Asia  Southern Asia          Low   \n",
       "\n",
       "   life_expectancy  income  children_per_woman  child_mortality  pop_density  \\\n",
       "0             28.2     603                 7.0            469.0          NaN   \n",
       "1             28.2     603                 7.0            469.0          NaN   \n",
       "2             28.2     603                 7.0            469.0          NaN   \n",
       "3             28.2     603                 7.0            469.0          NaN   \n",
       "4             28.2     603                 7.0            469.0          NaN   \n",
       "\n",
       "   co2_per_capita  years_in_school_men  years_in_school_women  \\\n",
       "0             NaN                  NaN                    NaN   \n",
       "1             NaN                  NaN                    NaN   \n",
       "2             NaN                  NaN                    NaN   \n",
       "3             NaN                  NaN                    NaN   \n",
       "4             NaN                  NaN                    NaN   \n",
       "\n",
       "   population_income  \n",
       "0         1977840000  \n",
       "1         1977840000  \n",
       "2         1977840000  \n",
       "3         1977840000  \n",
       "4         1977840000  "
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">#### Challenge\n",
    ">\n",
    "> 1. How many countries are there in each income group worldwide?\n",
 2. Assign the variable">
    "> 2. Assign the variable name `world_data_2015` to a data frame containing only the values from year 2015 (i.e. the same way as `world_data_2018` was created).\n",
    "> 3. \n",
    ">    1. For those countries where women went to school longer than men, how many are in each income group?\n",
    ">    2. Do the same as above but for countries where men went to school longer than women. What does this distribution tell you?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "income_group\n",
       "Low             34\n",
       "Lower middle    43\n",
       "Upper middle    49\n",
       "High            52\n",
       "dtype: int64"
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Challenge solutions\n",
    "# 1.\n",
    "world_data_2018.groupby('income_group').size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2\n",
    "world_data_2015 = world_data.loc[world_data['year'] == 2015]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "income_group\n",
       "Low              0\n",
       "Lower middle    14\n",
       "Upper middle    33\n",
       "High            47\n",
       "dtype: int64"
      ]
     },
     "execution_count": 93,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 3a\n",
    "world_data_2015.loc[world_data_2015['years_in_school_men'] < world_data_2015['years_in_school_women']].groupby('income_group').size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "income_group\n",
       "Low             34\n",
       "Lower middle    29\n",
       "Upper middle    11\n",
       "High             5\n",
       "dtype: int64"
      ]
     },
     "execution_count": 94,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 3b\n",
    "world_data_2015.loc[world_data_2015['years_in_school_men'] > world_data_2015['years_in_school_women']].groupby('income_group').size()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data cleaning tips"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`dropna()` removes both explicit `NaN` values and value that pandas are assumed to be `NaN`, such as the non-numeric values in the life_expectancy column. Non-numeric values can also be coerced into explicit `NaN` values via the `to_numeric()` top level function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     71.5\n",
       "1      NaN\n",
       "2     71.6\n",
       "3      NaN\n",
       "4     71.7\n",
       "5     72.0\n",
       "6     70.0\n",
       "7     70.0\n",
       "8     70.1\n",
       "9     70.2\n",
       "10     NaN\n",
       "11    70.4\n",
       "Name: life_expectancy, dtype: float64"
      ]
     },
     "execution_count": 95,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.to_numeric(clean_df['life_expectancy'], errors='coerce')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}