{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data analysis in Python\n", "\n", "## Lesson preamble\n", "\n", "### Learning objectives\n", "\n", "- Describe what a data frame is.\n", "- Load external data from a .csv file into a data frame with pandas.\n", "- Summarize the contents of a data frame with pandas.\n", "- Learn to use data frame attributes `loc[]`, `head()`, `info()`, `describe()`, `shape`, `columns`, `index`.\n", "- Learn to clean dirty data.\n", "- Understand the split-apply-combine concept for data analysis.\n", " - Use `groupby()`, `mean()`, `agg()` and `size()` to apply this technique.\n", "\n", "### Lesson outline\n", "\n", "- Manipulating and analyzing data with pandas\n", " - Data set background (10 min)\n", " - What are data frames (15 min)\n", " - Data wrangling with pandas (40 min)\n", "- Cleaning data (20 min)\n", "- Split-apply-combine techniques in `pandas`\n", " - Using `mean()` to summarize categorical data (20 min)\n", " - Using `size()` to summarize categorical data (15 min)\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Manipulating and analyzing data with pandas\n", "\n", "### Dataset background\n", "\n", "Today, we will be working with real data about the world combined from multiple sources by the [Gapminder foundation](https://www.gapminder.org/about-gapminder/). Gapminder is and independent Swedish foundation that fights devastating misconceptions about global development and promotes as fact-based world view through the production of free teaching and data exploration resources. Insights from the combined Gapminder data sources have been popularized through the efforts of public health professor Hans Rosling, and it is highly recommended to check out his entertaining videos, for example this one.\n", "\n", "As a start, we recommend taking [this 5-10 min quiz](http://forms.gapminder.org/s3/test-2018) to see how ignorant you are about the world. Then we will learn how to dive into this data further using Python!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Overview of the gapminder world data\n", "\n", "We are studying the species and weight of animals caught in plots in our study\n", "area. The dataset is stored as a comma separated value (CSV) file. Each row\n", "holds information for a single animal, and the columns represent:\n", "\n", "| Column | Description |\n", "|-----------------------|------------------------------------|\n", "| country | Country name |\n", "| year | Year of observation |\n", "| population | Population in the country at each year |\n", "| region | Continent the country belongs to |\n", "| sub_region | Sub regions as defined by |\n", "| income_group | Income group [as specified by the world bank](https://datahelpdesk.worldbank.org/knowledgebase/articles/378833-how-are-the-income-group-thresholds-determined) |\n", "| life_expectancy | The average number of years a newborn child would
live if mortality patterns were to stay the same |\n", "| income | GDP per capita (in USD) adjusted
for differences in purchasing power|\n", "| children_per_woman | Number of children born to each woman|\n", "| child_mortality | Deaths of children under 5 years of age per 1000 live births|\n", "| pop_density | Average number of people per km2|\n", "| co2_per_capita | CO2 emissions from fossil fuels (tonnes per capita)|\n", "| years_in_school_men | Average number of years attending primary, secondary,
and tertiary school for 25-36 years old men|\n", "| years_in_school_women | Average number of years attending primary, secondary,
and tertiary school for 25-36 years old women|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To read the data into Python, we are going to use a function called `read_csv` from the Python-package [`pandas`](https://pandas.pydata.org/). As mentioned previously, Python-packages are a bit like browser extensions, they are not essential, but can provide nifty functionality. To use a package, it first needs to be imported." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# pandas is given the nickname `pd`\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas` can read CSV-files saved on the computer or directly from an URL." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "world_data = pd.read_csv('https://raw.githubusercontent.com/UofTCoders/2018-09-10-utoronto/gh-pages/data/world-data-gapminder.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To view the result, type `world_data` in a cell and run it, just as when viewing the content of any variable in Python." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearpopulationregionsub_regionincome_grouplife_expectancyincomechildren_per_womanchild_mortalitypop_densityco2_per_capitayears_in_school_menyears_in_school_women
0Afghanistan18003280000AsiaSouthern AsiaLow28.26037.00469.0NaNNaNNaNNaN
1Afghanistan18013280000AsiaSouthern AsiaLow28.26037.00469.0NaNNaNNaNNaN
2Afghanistan18023280000AsiaSouthern AsiaLow28.26037.00469.0NaNNaNNaNNaN
3Afghanistan18033280000AsiaSouthern AsiaLow28.26037.00469.0NaNNaNNaNNaN
4Afghanistan18043280000AsiaSouthern AsiaLow28.26037.00469.0NaNNaNNaNNaN
5Afghanistan18053280000AsiaSouthern AsiaLow28.26037.00469.0NaNNaNNaNNaN
6Afghanistan18063280000AsiaSouthern AsiaLow28.16037.00470.0NaNNaNNaNNaN
7Afghanistan18073280000AsiaSouthern AsiaLow28.16037.00470.0NaNNaNNaNNaN
8Afghanistan18083280000AsiaSouthern AsiaLow28.16037.00470.0NaNNaNNaNNaN
9Afghanistan18093280000AsiaSouthern AsiaLow28.16037.00470.0NaNNaNNaNNaN
10Afghanistan18103280000AsiaSouthern AsiaLow28.16047.00470.0NaNNaNNaNNaN
11Afghanistan18113280000AsiaSouthern AsiaLow28.16047.00470.0NaNNaNNaNNaN
12Afghanistan18123280000AsiaSouthern AsiaLow28.16047.00470.0NaNNaNNaNNaN
13Afghanistan18133280000AsiaSouthern AsiaLow28.16047.00470.0NaNNaNNaNNaN
14Afghanistan18143290000AsiaSouthern AsiaLow28.16047.00470.0NaNNaNNaNNaN
15Afghanistan18153290000AsiaSouthern AsiaLow28.16047.00470.0NaNNaNNaNNaN
16Afghanistan18163300000AsiaSouthern AsiaLow28.16047.00471.0NaNNaNNaNNaN
17Afghanistan18173300000AsiaSouthern AsiaLow28.06047.00471.0NaNNaNNaNNaN
18Afghanistan18183310000AsiaSouthern AsiaLow28.06047.00471.0NaNNaNNaNNaN
19Afghanistan18193320000AsiaSouthern AsiaLow28.06047.00471.0NaNNaNNaNNaN
20Afghanistan18203320000AsiaSouthern AsiaLow28.06047.00471.0NaNNaNNaNNaN
21Afghanistan18213330000AsiaSouthern AsiaLow28.06077.00471.0NaNNaNNaNNaN
22Afghanistan18223340000AsiaSouthern AsiaLow28.06097.00471.0NaNNaNNaNNaN
23Afghanistan18233350000AsiaSouthern AsiaLow28.06117.00471.0NaNNaNNaNNaN
24Afghanistan18243360000AsiaSouthern AsiaLow28.06137.00471.0NaNNaNNaNNaN
25Afghanistan18253380000AsiaSouthern AsiaLow27.96157.00471.0NaNNaNNaNNaN
26Afghanistan18263390000AsiaSouthern AsiaLow27.96177.00473.0NaNNaNNaNNaN
27Afghanistan18273400000AsiaSouthern AsiaLow27.96197.00473.0NaNNaNNaNNaN
28Afghanistan18283420000AsiaSouthern AsiaLow27.96217.00473.0NaNNaNNaNNaN
29Afghanistan18293430000AsiaSouthern AsiaLow27.96237.00473.0NaNNaNNaNNaN
.............................................
38952Zimbabwe19899900000AfricaSub-Saharan AfricaLow62.724905.3773.925.61.6307.616.01
38953Zimbabwe199010200000AfricaSub-Saharan AfricaLow61.725905.1875.226.31.5407.746.16
38954Zimbabwe199110400000AfricaSub-Saharan AfricaLow61.026705.0077.427.01.5307.886.31
38955Zimbabwe199210700000AfricaSub-Saharan AfricaLow59.423704.8480.227.61.5908.016.46
38956Zimbabwe199310900000AfricaSub-Saharan AfricaLow57.623504.6983.428.21.5008.146.61
38957Zimbabwe199411100000AfricaSub-Saharan AfricaLow55.825204.5686.828.71.6008.286.76
38958Zimbabwe199511300000AfricaSub-Saharan AfricaLow53.724804.4390.129.31.3408.416.92
38959Zimbabwe199611500000AfricaSub-Saharan AfricaLow52.226904.3392.829.81.3008.547.07
38960Zimbabwe199711700000AfricaSub-Saharan AfricaLow50.827104.2494.730.31.2308.677.23
38961Zimbabwe199811900000AfricaSub-Saharan AfricaLow49.127504.1695.930.71.2008.807.39
38962Zimbabwe199912100000AfricaSub-Saharan AfricaLow47.826904.1096.431.21.3108.937.55
38963Zimbabwe200012200000AfricaSub-Saharan AfricaLow46.725704.0696.831.61.1409.077.71
38964Zimbabwe200112400000AfricaSub-Saharan AfricaLow46.225804.0297.132.01.0209.207.87
38965Zimbabwe200212500000AfricaSub-Saharan AfricaLow45.623204.0097.732.30.9579.338.03
38966Zimbabwe200312600000AfricaSub-Saharan AfricaLow45.319103.9998.232.70.8439.478.20
38967Zimbabwe200412800000AfricaSub-Saharan AfricaLow45.117803.9899.033.00.7429.608.36
38968Zimbabwe200512900000AfricaSub-Saharan AfricaLow45.316503.9999.733.40.8329.738.53
38969Zimbabwe200613100000AfricaSub-Saharan AfricaLow45.715803.99100.033.90.7969.878.69
38970Zimbabwe200713300000AfricaSub-Saharan AfricaLow46.414904.00100.034.50.74210.008.86
38971Zimbabwe200813600000AfricaSub-Saharan AfricaLow46.712104.0198.035.00.57310.109.03
38972Zimbabwe200913800000AfricaSub-Saharan AfricaLow47.512904.0294.935.70.40610.309.19
38973Zimbabwe201014100000AfricaSub-Saharan AfricaLow49.614604.0389.936.40.55210.409.36
38974Zimbabwe201114400000AfricaSub-Saharan AfricaLow51.916604.0283.837.20.66510.509.53
38975Zimbabwe201214700000AfricaSub-Saharan AfricaLow54.118504.0076.038.00.53010.709.70
38976Zimbabwe201315100000AfricaSub-Saharan AfricaLow55.619003.9670.038.90.77610.809.86
38977Zimbabwe201415400000AfricaSub-Saharan AfricaLow57.019103.9064.339.80.78010.9010.00
38978Zimbabwe201515800000AfricaSub-Saharan AfricaLow58.318903.8459.940.8NaN11.1010.20
38979Zimbabwe201616200000AfricaSub-Saharan AfricaLow59.318603.7656.441.7NaNNaNNaN
38980Zimbabwe201716500000AfricaSub-Saharan AfricaLow59.819103.6856.842.7NaNNaNNaN
38981Zimbabwe201816900000AfricaSub-Saharan AfricaLow60.219503.6155.543.7NaNNaNNaN
\n", "

38982 rows × 14 columns

\n", "
" ], "text/plain": [ " country year population region sub_region income_group \\\n", "0 Afghanistan 1800 3280000 Asia Southern Asia Low \n", "1 Afghanistan 1801 3280000 Asia Southern Asia Low \n", "2 Afghanistan 1802 3280000 Asia Southern Asia Low \n", "3 Afghanistan 1803 3280000 Asia Southern Asia Low \n", "4 Afghanistan 1804 3280000 Asia Southern Asia Low \n", "5 Afghanistan 1805 3280000 Asia Southern Asia Low \n", "6 Afghanistan 1806 3280000 Asia Southern Asia Low \n", "7 Afghanistan 1807 3280000 Asia Southern Asia Low \n", "8 Afghanistan 1808 3280000 Asia Southern Asia Low \n", "9 Afghanistan 1809 3280000 Asia Southern Asia Low \n", "10 Afghanistan 1810 3280000 Asia Southern Asia Low \n", "11 Afghanistan 1811 3280000 Asia Southern Asia Low \n", "12 Afghanistan 1812 3280000 Asia Southern Asia Low \n", "13 Afghanistan 1813 3280000 Asia Southern Asia Low \n", "14 Afghanistan 1814 3290000 Asia Southern Asia Low \n", "15 Afghanistan 1815 3290000 Asia Southern Asia Low \n", "16 Afghanistan 1816 3300000 Asia Southern Asia Low \n", "17 Afghanistan 1817 3300000 Asia Southern Asia Low \n", "18 Afghanistan 1818 3310000 Asia Southern Asia Low \n", "19 Afghanistan 1819 3320000 Asia Southern Asia Low \n", "20 Afghanistan 1820 3320000 Asia Southern Asia Low \n", "21 Afghanistan 1821 3330000 Asia Southern Asia Low \n", "22 Afghanistan 1822 3340000 Asia Southern Asia Low \n", "23 Afghanistan 1823 3350000 Asia Southern Asia Low \n", "24 Afghanistan 1824 3360000 Asia Southern Asia Low \n", "25 Afghanistan 1825 3380000 Asia Southern Asia Low \n", "26 Afghanistan 1826 3390000 Asia Southern Asia Low \n", "27 Afghanistan 1827 3400000 Asia Southern Asia Low \n", "28 Afghanistan 1828 3420000 Asia Southern Asia Low \n", "29 Afghanistan 1829 3430000 Asia Southern Asia Low \n", "... ... ... ... ... ... ... \n", "38952 Zimbabwe 1989 9900000 Africa Sub-Saharan Africa Low \n", "38953 Zimbabwe 1990 10200000 Africa Sub-Saharan Africa Low \n", "38954 Zimbabwe 1991 10400000 Africa Sub-Saharan Africa Low \n", "38955 Zimbabwe 1992 10700000 Africa Sub-Saharan Africa Low \n", "38956 Zimbabwe 1993 10900000 Africa Sub-Saharan Africa Low \n", "38957 Zimbabwe 1994 11100000 Africa Sub-Saharan Africa Low \n", "38958 Zimbabwe 1995 11300000 Africa Sub-Saharan Africa Low \n", "38959 Zimbabwe 1996 11500000 Africa Sub-Saharan Africa Low \n", "38960 Zimbabwe 1997 11700000 Africa Sub-Saharan Africa Low \n", "38961 Zimbabwe 1998 11900000 Africa Sub-Saharan Africa Low \n", "38962 Zimbabwe 1999 12100000 Africa Sub-Saharan Africa Low \n", "38963 Zimbabwe 2000 12200000 Africa Sub-Saharan Africa Low \n", "38964 Zimbabwe 2001 12400000 Africa Sub-Saharan Africa Low \n", "38965 Zimbabwe 2002 12500000 Africa Sub-Saharan Africa Low \n", "38966 Zimbabwe 2003 12600000 Africa Sub-Saharan Africa Low \n", "38967 Zimbabwe 2004 12800000 Africa Sub-Saharan Africa Low \n", "38968 Zimbabwe 2005 12900000 Africa Sub-Saharan Africa Low \n", "38969 Zimbabwe 2006 13100000 Africa Sub-Saharan Africa Low \n", "38970 Zimbabwe 2007 13300000 Africa Sub-Saharan Africa Low \n", "38971 Zimbabwe 2008 13600000 Africa Sub-Saharan Africa Low \n", "38972 Zimbabwe 2009 13800000 Africa Sub-Saharan Africa Low \n", "38973 Zimbabwe 2010 14100000 Africa Sub-Saharan Africa Low \n", "38974 Zimbabwe 2011 14400000 Africa Sub-Saharan Africa Low \n", "38975 Zimbabwe 2012 14700000 Africa Sub-Saharan Africa Low \n", "38976 Zimbabwe 2013 15100000 Africa Sub-Saharan Africa Low \n", "38977 Zimbabwe 2014 15400000 Africa Sub-Saharan Africa Low \n", "38978 Zimbabwe 2015 15800000 Africa Sub-Saharan Africa Low \n", "38979 Zimbabwe 2016 16200000 Africa Sub-Saharan Africa Low \n", "38980 Zimbabwe 2017 16500000 Africa Sub-Saharan Africa Low \n", "38981 Zimbabwe 2018 16900000 Africa Sub-Saharan Africa Low \n", "\n", " life_expectancy income children_per_woman child_mortality \\\n", "0 28.2 603 7.00 469.0 \n", "1 28.2 603 7.00 469.0 \n", "2 28.2 603 7.00 469.0 \n", "3 28.2 603 7.00 469.0 \n", "4 28.2 603 7.00 469.0 \n", "5 28.2 603 7.00 469.0 \n", "6 28.1 603 7.00 470.0 \n", "7 28.1 603 7.00 470.0 \n", "8 28.1 603 7.00 470.0 \n", "9 28.1 603 7.00 470.0 \n", "10 28.1 604 7.00 470.0 \n", "11 28.1 604 7.00 470.0 \n", "12 28.1 604 7.00 470.0 \n", "13 28.1 604 7.00 470.0 \n", "14 28.1 604 7.00 470.0 \n", "15 28.1 604 7.00 470.0 \n", "16 28.1 604 7.00 471.0 \n", "17 28.0 604 7.00 471.0 \n", "18 28.0 604 7.00 471.0 \n", "19 28.0 604 7.00 471.0 \n", "20 28.0 604 7.00 471.0 \n", "21 28.0 607 7.00 471.0 \n", "22 28.0 609 7.00 471.0 \n", "23 28.0 611 7.00 471.0 \n", "24 28.0 613 7.00 471.0 \n", "25 27.9 615 7.00 471.0 \n", "26 27.9 617 7.00 473.0 \n", "27 27.9 619 7.00 473.0 \n", "28 27.9 621 7.00 473.0 \n", "29 27.9 623 7.00 473.0 \n", "... ... ... ... ... \n", "38952 62.7 2490 5.37 73.9 \n", "38953 61.7 2590 5.18 75.2 \n", "38954 61.0 2670 5.00 77.4 \n", "38955 59.4 2370 4.84 80.2 \n", "38956 57.6 2350 4.69 83.4 \n", "38957 55.8 2520 4.56 86.8 \n", "38958 53.7 2480 4.43 90.1 \n", "38959 52.2 2690 4.33 92.8 \n", "38960 50.8 2710 4.24 94.7 \n", "38961 49.1 2750 4.16 95.9 \n", "38962 47.8 2690 4.10 96.4 \n", "38963 46.7 2570 4.06 96.8 \n", "38964 46.2 2580 4.02 97.1 \n", "38965 45.6 2320 4.00 97.7 \n", "38966 45.3 1910 3.99 98.2 \n", "38967 45.1 1780 3.98 99.0 \n", "38968 45.3 1650 3.99 99.7 \n", "38969 45.7 1580 3.99 100.0 \n", "38970 46.4 1490 4.00 100.0 \n", "38971 46.7 1210 4.01 98.0 \n", "38972 47.5 1290 4.02 94.9 \n", "38973 49.6 1460 4.03 89.9 \n", "38974 51.9 1660 4.02 83.8 \n", "38975 54.1 1850 4.00 76.0 \n", "38976 55.6 1900 3.96 70.0 \n", "38977 57.0 1910 3.90 64.3 \n", "38978 58.3 1890 3.84 59.9 \n", "38979 59.3 1860 3.76 56.4 \n", "38980 59.8 1910 3.68 56.8 \n", "38981 60.2 1950 3.61 55.5 \n", "\n", " pop_density co2_per_capita years_in_school_men years_in_school_women \n", "0 NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN \n", "4 NaN NaN NaN NaN \n", "5 NaN NaN NaN NaN \n", "6 NaN NaN NaN NaN \n", "7 NaN NaN NaN NaN \n", "8 NaN NaN NaN NaN \n", "9 NaN NaN NaN NaN \n", "10 NaN NaN NaN NaN \n", "11 NaN NaN NaN NaN \n", "12 NaN NaN NaN NaN \n", "13 NaN NaN NaN NaN \n", "14 NaN NaN NaN NaN \n", "15 NaN NaN NaN NaN \n", "16 NaN NaN NaN NaN \n", "17 NaN NaN NaN NaN \n", "18 NaN NaN NaN NaN \n", "19 NaN NaN NaN NaN \n", "20 NaN NaN NaN NaN \n", "21 NaN NaN NaN NaN \n", "22 NaN NaN NaN NaN \n", "23 NaN NaN NaN NaN \n", "24 NaN NaN NaN NaN \n", "25 NaN NaN NaN NaN \n", "26 NaN NaN NaN NaN \n", "27 NaN NaN NaN NaN \n", "28 NaN NaN NaN NaN \n", "29 NaN NaN NaN NaN \n", "... ... ... ... ... \n", "38952 25.6 1.630 7.61 6.01 \n", "38953 26.3 1.540 7.74 6.16 \n", "38954 27.0 1.530 7.88 6.31 \n", "38955 27.6 1.590 8.01 6.46 \n", "38956 28.2 1.500 8.14 6.61 \n", "38957 28.7 1.600 8.28 6.76 \n", "38958 29.3 1.340 8.41 6.92 \n", "38959 29.8 1.300 8.54 7.07 \n", "38960 30.3 1.230 8.67 7.23 \n", "38961 30.7 1.200 8.80 7.39 \n", "38962 31.2 1.310 8.93 7.55 \n", "38963 31.6 1.140 9.07 7.71 \n", "38964 32.0 1.020 9.20 7.87 \n", "38965 32.3 0.957 9.33 8.03 \n", "38966 32.7 0.843 9.47 8.20 \n", "38967 33.0 0.742 9.60 8.36 \n", "38968 33.4 0.832 9.73 8.53 \n", "38969 33.9 0.796 9.87 8.69 \n", "38970 34.5 0.742 10.00 8.86 \n", "38971 35.0 0.573 10.10 9.03 \n", "38972 35.7 0.406 10.30 9.19 \n", "38973 36.4 0.552 10.40 9.36 \n", "38974 37.2 0.665 10.50 9.53 \n", "38975 38.0 0.530 10.70 9.70 \n", "38976 38.9 0.776 10.80 9.86 \n", "38977 39.8 0.780 10.90 10.00 \n", "38978 40.8 NaN 11.10 10.20 \n", "38979 41.7 NaN NaN NaN \n", "38980 42.7 NaN NaN NaN \n", "38981 43.7 NaN NaN NaN \n", "\n", "[38982 rows x 14 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is how a data frame is displayed in the JupyterLab Notebook. Although the data frame itself just consists of the values, the Notebook knows that this is a data frame and displays it in a nice tabular format (by adding HTML decorators), and adds some cosmetic conveniences such as the bold font type for the column and row names, the alternating grey and white zebra stripes for the rows and highlights the row the mouse pointer moves over. The increasing numbers on the far left is the data frame's index, which was added by `pandas` to easily distinguish between the rows.\n", "\n", "## What are data frames?\n", "\n", "A data frame is the representation of data in a tabular format, similar to how data is often arranged in spreadsheets. The data is rectangular, meaning that all rows have the same amount of columns and all columns have the same amount of rows. Data frames are the *de facto* data structure for most tabular data, and what we use for statistics and plotting. A data frame can be created by hand, but most commonly they are generated by an input function, such as `read_csv()`. In other words, when importing spreadsheets from your hard drive (or the web).\n", "\n", "As can be seen above, the default is to display the first and last 30 rows and truncate everything in between, as indicated by the ellipsis (`...`). Although it is truncated, this output is still quite space consuming. To glance at how the data frame looks, it is sufficient to display only the top (the first 5 lines) using the `head()` method." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearpopulationregionsub_regionincome_grouplife_expectancyincomechildren_per_womanchild_mortalitypop_densityco2_per_capitayears_in_school_menyears_in_school_women
0Afghanistan18003280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN
1Afghanistan18013280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN
2Afghanistan18023280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN
3Afghanistan18033280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN
4Afghanistan18043280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN
\n", "
" ], "text/plain": [ " country year population region sub_region income_group \\\n", "0 Afghanistan 1800 3280000 Asia Southern Asia Low \n", "1 Afghanistan 1801 3280000 Asia Southern Asia Low \n", "2 Afghanistan 1802 3280000 Asia Southern Asia Low \n", "3 Afghanistan 1803 3280000 Asia Southern Asia Low \n", "4 Afghanistan 1804 3280000 Asia Southern Asia Low \n", "\n", " life_expectancy income children_per_woman child_mortality pop_density \\\n", "0 28.2 603 7.0 469.0 NaN \n", "1 28.2 603 7.0 469.0 NaN \n", "2 28.2 603 7.0 469.0 NaN \n", "3 28.2 603 7.0 469.0 NaN \n", "4 28.2 603 7.0 469.0 NaN \n", "\n", " co2_per_capita years_in_school_men years_in_school_women \n", "0 NaN NaN NaN \n", "1 NaN NaN NaN \n", "2 NaN NaN NaN \n", "3 NaN NaN NaN \n", "4 NaN NaN NaN " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Methods are very similar to functions, the main difference is that they belong to an object (above, the method `head()` belongs to the data frame `world_data`). Methods operate on the object they belong to, that's why we can call the method with an empty parenthesis without any arguments. Compare this with the function `type()` that was introduced previously." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(world_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, the `world_data` variable is explicitly passed as an argument to `type()`. An immediately tangible advantage with methods is that they simplify tab completion. Just type the name of the dataframe, a period, and then hit tab to see all the relevant methods for that data frame instead of fumbling around with all the available functions in Python (there's quite a few!) and figuring out which ones operate on data frames and which do not. Methods also facilitates readability when chaining many operations together, which will be shown in detail later.\n", "\n", "The columns in a data frame can contain data of different types, e.g. integers, floats, and objects (which includes strings, lists, dictionaries, and more)). General information about the data frame (including the column data types) can be obtained with the `info()` method." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 38982 entries, 0 to 38981\n", "Data columns (total 14 columns):\n", "country 38982 non-null object\n", "year 38982 non-null int64\n", "population 38982 non-null int64\n", "region 38982 non-null object\n", "sub_region 38982 non-null object\n", "income_group 38982 non-null object\n", "life_expectancy 38982 non-null float64\n", "income 38982 non-null int64\n", "children_per_woman 38982 non-null float64\n", "child_mortality 38980 non-null float64\n", "pop_density 12282 non-null float64\n", "co2_per_capita 16285 non-null float64\n", "years_in_school_men 8188 non-null float64\n", "years_in_school_women 8188 non-null float64\n", "dtypes: float64(7), int64(3), object(4)\n", "memory usage: 4.2+ MB\n" ] } ], "source": [ "world_data.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The information includes the total number of rows and columns, the number of non-null observations, the column data types, and the memory (RAM) usage. The number of non-null observation is not the same for all columns, which means that some columns contain null (or NA) values representing that there is missing information. The column data type is often indicative of which type of data is stored in that column, and approximately corresponds to the following\n", "\n", "- **Qualitative/Categorical**\n", " - Nominal (labels, e.g. 'red', 'green', 'blue')\n", " - `object`, `category`\n", " - Ordinal (labels with order, e.g. 'Jan', 'Feb', 'Mar')\n", " - `object`, `category`, `int`\n", " - Binary (only two outcomes, e.g. True or False)\n", " - `bool`\n", "- **Quantitative/Numerical**\n", " - Discrete (whole numbers, often counting, e.g. number of children)\n", " - `int`\n", " - Continuous (measured values with decimals, e.g. weight)\n", " - `float`\n", " \n", "Note that an `object` could contain different types, e.g. `str` or `list`. Also note that there can be exceptions to the schema above, but it is still a useful rough guide.\n", "\n", "After reading in the data into a data frame, `head()` and `info()` are two of the most useful methods to get an idea of the structure of this data frame. There are many additional methods that can facilitate the understanding of what a data frame contains:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Size:\n", " - `world_data.shape` - a tuple with the number of rows in the first element\n", " and the number of columns as the second element\n", " - `world_data.shape[0]` - the number of rows\n", " - `world_data.shape[1]`- the number of columns\n", "\n", "- Content:\n", " - `world_data.head()` - shows the first 5 rows\n", " - `world_data.tail()` - shows the last 5 rows\n", "\n", "- Names:\n", " - `world_data.columns` - returns the names of the columns (also called variable names) \n", " objects)\n", " - `world_data.index` - returns the names of the rows (referred to as the index in pandas)\n", "\n", "- Summary:\n", " - `world_data.info()` - column names and data types, number of observations, memory consumptions\n", " length, and content of each column\n", " - `world_data.describe()` - summary statistics for each column\n", "\n", "These belong to a data frame and are commonly referred to as *attributes* of the data frame. All attributes are accessed with the dot-syntax (`.`), which returns the attribute's value. If the attribute is a method, parentheses can be appended to the name to carry out the method's operation on the data frame. Attributes that are not methods often hold a value that has been precomputed because it is commonly accessed and it saves time store the value in an attribute instead of recomputing it every time it is needed. For example, every time `pandas` creates a data frame, the number of rows and columns is computed and stored in the `shape` attribute.\n", "\n", ">#### Challenge\n", ">\n", ">Based on the output of `world_data.info()`, can you answer the following questions?\n", ">\n", ">* What is the class of the object `world_data`?\n", ">* How many rows and how many columns are in this object?\n", ">* Why is there not the same number of rows (observations) for each column?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving data frames locally\n", "\n", "It is good practice to keep a copy of the data stored locally on your computer in case you want to do offline analyses, the online version of the file changes, or the file is taken down. For this, the data could be downloaded manually or the current `world_data` data frame could be saved to disk as a CSV-file with `to_csv()`." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "world_data.to_csv('world-data.csv', index=False)\n", "# `index=False` because the index (the row names) was generated automatically when pandas opened\n", "# the file and this information is not needed to be saved" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the data is now saved locally, the next time this Notebook is opened, it could be loaded from the local path instead of downloading it from the URL." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearpopulationregionsub_regionincome_grouplife_expectancyincomechildren_per_womanchild_mortalitypop_densityco2_per_capitayears_in_school_menyears_in_school_women
0Afghanistan18003280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN
1Afghanistan18013280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN
2Afghanistan18023280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN
3Afghanistan18033280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN
4Afghanistan18043280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN
\n", "
" ], "text/plain": [ " country year population region sub_region income_group \\\n", "0 Afghanistan 1800 3280000 Asia Southern Asia Low \n", "1 Afghanistan 1801 3280000 Asia Southern Asia Low \n", "2 Afghanistan 1802 3280000 Asia Southern Asia Low \n", "3 Afghanistan 1803 3280000 Asia Southern Asia Low \n", "4 Afghanistan 1804 3280000 Asia Southern Asia Low \n", "\n", " life_expectancy income children_per_woman child_mortality pop_density \\\n", "0 28.2 603 7.0 469.0 NaN \n", "1 28.2 603 7.0 469.0 NaN \n", "2 28.2 603 7.0 469.0 NaN \n", "3 28.2 603 7.0 469.0 NaN \n", "4 28.2 603 7.0 469.0 NaN \n", "\n", " co2_per_capita years_in_school_men years_in_school_women \n", "0 NaN NaN NaN \n", "1 NaN NaN NaN \n", "2 NaN NaN NaN \n", "3 NaN NaN NaN \n", "4 NaN NaN NaN " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data = pd.read_csv('world-data.csv')\n", "world_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Indexing and subsetting data frames\n", "\n", "The world data data frame has rows and columns (it has 2 dimensions). To extract specific data from it (also referred to as \"subsetting\"), columns can be selected by their name.The JupyterLab Notebook (technically, the underlying IPython interpreter) knows about the columns in the data frame, so tab autocompletion can be used to get the correct column name. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1800\n", "1 1801\n", "2 1802\n", "3 1803\n", "4 1804\n", "Name: year, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data['year'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The name of the column is not shown, since there is only one. Remember that the numbers on the left is just the index of the data frame, which was added by `pandas` upon importing the data.\n", "\n", "Another syntax that is often used to specify column names is `.`." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1800\n", "1 1801\n", "2 1802\n", "3 1803\n", "4 1804\n", "Name: year, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data.year.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using brackets is clearer and also allows for passing multiple columns as a list, so this tutorial will stick to that." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyear
0Afghanistan1800
1Afghanistan1801
2Afghanistan1802
3Afghanistan1803
4Afghanistan1804
\n", "
" ], "text/plain": [ " country year\n", "0 Afghanistan 1800\n", "1 Afghanistan 1801\n", "2 Afghanistan 1802\n", "3 Afghanistan 1803\n", "4 Afghanistan 1804" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data[['country', 'year']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output is displayed a bit differently this time. The reason is that when there was only one column `pandas` technically returned a `Series`, not a `Dataframe`. This can be confirmed by using `type` as previously." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(world_data['year'])" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(world_data[['country', 'year']])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, every individual column is actually a `Series` and together they constitute a `Dataframe`. There can be performance benefits to work with `Series`, but `pandas` often takes care of conversions between these two object types under the hood, so this introductory tutorial will not make any further distinction between a `Series` and a `Dataframe`. Many of the analysis techniques used here will apply to both series and data frames." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting with single brackets (`[]`) as above is a shortcut to common operations, such as selecting columns by labels as above. For more flexible and robust row and column selection the more verbose `loc[, ]` (location) syntax is used." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyear
0Afghanistan1800
2Afghanistan1802
4Afghanistan1804
\n", "
" ], "text/plain": [ " country year\n", "0 Afghanistan 1800\n", "2 Afghanistan 1802\n", "4 Afghanistan 1804" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data.loc[[0, 2, 4], ['country', 'year']]\n", "# Although methods usually have trailing parenthesis, square brackets are used with `loc[]` to stay\n", "# consistent with the indexing with square brackets in general in Python (e.g. lists and Numpy arrays)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A single number can be selected, which returns that value (here, an integer) rather than a `Dataframe` or `Series` with one value." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1804" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data.loc[4, 'year']" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "numpy.int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(world_data.loc[4, 'year'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select all rows, but only a subset of columns, the colon character (`:`) can be used." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyear
0Afghanistan1800
1Afghanistan1801
2Afghanistan1802
3Afghanistan1803
4Afghanistan1804
\n", "
" ], "text/plain": [ " country year\n", "0 Afghanistan 1800\n", "1 Afghanistan 1801\n", "2 Afghanistan 1802\n", "3 Afghanistan 1803\n", "4 Afghanistan 1804" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data.loc[:, ['country', 'year']].head() # head() is used to limit the length of the output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same syntax can be used to select all columns but only a subset of rows." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearpopulationregionsub_regionincome_grouplife_expectancyincomechildren_per_womanchild_mortalitypop_densityco2_per_capitayears_in_school_menyears_in_school_women
3Afghanistan18033280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN
4Afghanistan18043280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN
\n", "
" ], "text/plain": [ " country year population region sub_region income_group \\\n", "3 Afghanistan 1803 3280000 Asia Southern Asia Low \n", "4 Afghanistan 1804 3280000 Asia Southern Asia Low \n", "\n", " life_expectancy income children_per_woman child_mortality pop_density \\\n", "3 28.2 603 7.0 469.0 NaN \n", "4 28.2 603 7.0 469.0 NaN \n", "\n", " co2_per_capita years_in_school_men years_in_school_women \n", "3 NaN NaN NaN \n", "4 NaN NaN NaN " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data.loc[[3, 4], :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When selecting all columns, the `:` could also be left out as a convenience." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearpopulationregionsub_regionincome_grouplife_expectancyincomechildren_per_womanchild_mortalitypop_densityco2_per_capitayears_in_school_menyears_in_school_women
3Afghanistan18033280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN
4Afghanistan18043280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN
\n", "
" ], "text/plain": [ " country year population region sub_region income_group \\\n", "3 Afghanistan 1803 3280000 Asia Southern Asia Low \n", "4 Afghanistan 1804 3280000 Asia Southern Asia Low \n", "\n", " life_expectancy income children_per_woman child_mortality pop_density \\\n", "3 28.2 603 7.0 469.0 NaN \n", "4 28.2 603 7.0 469.0 NaN \n", "\n", " co2_per_capita years_in_school_men years_in_school_women \n", "3 NaN NaN NaN \n", "4 NaN NaN NaN " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data.loc[[3, 4]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possible to select slices of rows and column labels." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearpopulationregion
2Afghanistan18023280000Asia
3Afghanistan18033280000Asia
4Afghanistan18043280000Asia
\n", "
" ], "text/plain": [ " country year population region\n", "2 Afghanistan 1802 3280000 Asia\n", "3 Afghanistan 1803 3280000 Asia\n", "4 Afghanistan 1804 3280000 Asia" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data.loc[2:4, 'country':'region']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is important to realize that `loc[]` selects rows and columns by their *labels*. To instead select by row or column *position*, use `iloc[]` (integer location)." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearpopulation
2Afghanistan18023280000
3Afghanistan18033280000
4Afghanistan18043280000
\n", "
" ], "text/plain": [ " country year population\n", "2 Afghanistan 1802 3280000\n", "3 Afghanistan 1803 3280000\n", "4 Afghanistan 1804 3280000" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data.iloc[[2, 3, 4], [0, 1, 2]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The index of `world_data` consists of consecutive integers so in this case selecting from the index by labels or position will look the same. As will be shown later, an index could also consist of text names just like the columns.\n", "\n", "While selecting slices by label is inclusive of both the start and end, selecting slices by position is inclusive of the start but exclusive of the end position, just like when slicing in lists." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearpopulationregion
2Afghanistan18023280000Asia
3Afghanistan18033280000Asia
4Afghanistan18043280000Asia
\n", "
" ], "text/plain": [ " country year population region\n", "2 Afghanistan 1802 3280000 Asia\n", "3 Afghanistan 1803 3280000 Asia\n", "4 Afghanistan 1804 3280000 Asia" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data.iloc[2:5, :4] # `iloc[2:5]` gives the same result as `loc[2:4]` above" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting slices of row positions is a common operation, and has thus been given a shortcut syntax with single brackets." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearpopulationregionsub_regionincome_grouplife_expectancyincomechildren_per_womanchild_mortalitypop_densityco2_per_capitayears_in_school_menyears_in_school_women
2Afghanistan18023280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN
3Afghanistan18033280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN
4Afghanistan18043280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN
\n", "
" ], "text/plain": [ " country year population region sub_region income_group \\\n", "2 Afghanistan 1802 3280000 Asia Southern Asia Low \n", "3 Afghanistan 1803 3280000 Asia Southern Asia Low \n", "4 Afghanistan 1804 3280000 Asia Southern Asia Low \n", "\n", " life_expectancy income children_per_woman child_mortality pop_density \\\n", "2 28.2 603 7.0 469.0 NaN \n", "3 28.2 603 7.0 469.0 NaN \n", "4 28.2 603 7.0 469.0 NaN \n", "\n", " co2_per_capita years_in_school_men years_in_school_women \n", "2 NaN NaN NaN \n", "3 NaN NaN NaN \n", "4 NaN NaN NaN " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data[2:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">#### Challenge\n", ">\n", ">1. Extract the 200th and 201st row of the `world_data` dataset and assign the resulting data frame to a new variable name (`world_data_200_201`). Remember that Python indexing starts at 0!\n", ">\n", ">2. How can you get the same result as from `world_data.head()` by using row slices instead of the `head()` method?\n", ">\n", ">3. There are at least three distinct ways to extract the last row of the data frame. Which can you find?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `describe()` method was mentioned above as a way of retrieving summary statistics of a data frame. Together with `info()` and `head()` this is often a good place to start exploratory data analysis as it gives a nice overview of the numeric valuables the data set." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearpopulationlife_expectancyincomechildren_per_womanchild_mortalitypop_densityco2_per_capitayears_in_school_menyears_in_school_women
count38982.0000003.898200e+0438982.00000038982.00000038982.00000038980.00000012282.00000016285.0000008188.0000008188.000000
mean1909.0000001.422075e+0743.0734684527.1280335.384391292.050891120.9005723.2368947.6810196.948334
std63.2200066.722423e+0716.2192169753.1160411.642597161.562290382.4542426.0792573.1859833.876399
min1800.0000001.250000e+041.000000247.0000001.1200001.9500000.5020000.0000000.9000000.210000
25%1854.0000005.060000e+0531.200000876.0000004.550000141.00000014.8000000.1880005.1600003.620000
50%1909.0000002.140000e+0635.5000001450.0000005.910000361.00000046.0000000.9440007.6500006.980000
75%1964.0000006.870000e+0655.6000003520.0000006.630000420.000000110.0000004.02000010.1000009.980000
max2018.0000001.420000e+0984.200000178000.0000008.870000756.0000008270.000000101.00000015.30000015.700000
\n", "
" ], "text/plain": [ " year population life_expectancy income \\\n", "count 38982.000000 3.898200e+04 38982.000000 38982.000000 \n", "mean 1909.000000 1.422075e+07 43.073468 4527.128033 \n", "std 63.220006 6.722423e+07 16.219216 9753.116041 \n", "min 1800.000000 1.250000e+04 1.000000 247.000000 \n", "25% 1854.000000 5.060000e+05 31.200000 876.000000 \n", "50% 1909.000000 2.140000e+06 35.500000 1450.000000 \n", "75% 1964.000000 6.870000e+06 55.600000 3520.000000 \n", "max 2018.000000 1.420000e+09 84.200000 178000.000000 \n", "\n", " children_per_woman child_mortality pop_density co2_per_capita \\\n", "count 38982.000000 38980.000000 12282.000000 16285.000000 \n", "mean 5.384391 292.050891 120.900572 3.236894 \n", "std 1.642597 161.562290 382.454242 6.079257 \n", "min 1.120000 1.950000 0.502000 0.000000 \n", "25% 4.550000 141.000000 14.800000 0.188000 \n", "50% 5.910000 361.000000 46.000000 0.944000 \n", "75% 6.630000 420.000000 110.000000 4.020000 \n", "max 8.870000 756.000000 8270.000000 101.000000 \n", "\n", " years_in_school_men years_in_school_women \n", "count 8188.000000 8188.000000 \n", "mean 7.681019 6.948334 \n", "std 3.185983 3.876399 \n", "min 0.900000 0.210000 \n", "25% 5.160000 3.620000 \n", "50% 7.650000 6.980000 \n", "75% 10.100000 9.980000 \n", "max 15.300000 15.700000 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A common next step would be to plot the data to explore relationships between different variables, but before getting into plotting, it is beneficial to elaborate on the data frame object and several of its common operations.\n", "\n", "An often desired operation is to select a subset of rows matching a criteria, e.g. which observations have a life expectancy above 83 years. To do this, the \"less than\" comparison operator that was introduced previously can be used." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", "6 False\n", "7 False\n", "8 False\n", "9 False\n", "10 False\n", "11 False\n", "12 False\n", "13 False\n", "14 False\n", "15 False\n", "16 False\n", "17 False\n", "18 False\n", "19 False\n", "20 False\n", "21 False\n", "22 False\n", "23 False\n", "24 False\n", "25 False\n", "26 False\n", "27 False\n", "28 False\n", "29 False\n", " ... \n", "38952 False\n", "38953 False\n", "38954 False\n", "38955 False\n", "38956 False\n", "38957 False\n", "38958 False\n", "38959 False\n", "38960 False\n", "38961 False\n", "38962 False\n", "38963 False\n", "38964 False\n", "38965 False\n", "38966 False\n", "38967 False\n", "38968 False\n", "38969 False\n", "38970 False\n", "38971 False\n", "38972 False\n", "38973 False\n", "38974 False\n", "38975 False\n", "38976 False\n", "38977 False\n", "38978 False\n", "38979 False\n", "38980 False\n", "38981 False\n", "Name: life_expectancy, Length: 38982, dtype: bool" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data['life_expectancy'] > 83" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a boolean array with one value for every row in the data frame indicating whether it is `True` or `False` that this row has a value above 83 in the column `life_expectancy`. To find out how many observations there are matching this condition, the `sum()` method can used since each `True` will be `1` and each `False` will be `0`." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "20" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "above_83_bool = world_data['life_expectancy'] > 83\n", "above_83_bool.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of assigning to an intermediate variable, it is possible to use methods directly on the resulting boolean series by surrounding it with parentheses." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "20" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(world_data['life_expectancy'] > 83).sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The boolean array can be used to select only those rows from the data frame that meet the specified condition." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearpopulationregionsub_regionincome_grouplife_expectancyincomechildren_per_womanchild_mortalitypop_densityco2_per_capitayears_in_school_menyears_in_school_women
17513Japan2012128000000AsiaEastern AsiaHigh83.2364001.403.00352.09.5814.815.2
17514Japan2013128000000AsiaEastern AsiaHigh83.4371001.422.90352.09.7114.915.3
17515Japan2014128000000AsiaEastern AsiaHigh83.6373001.432.80352.09.4715.015.4
17516Japan2015128000000AsiaEastern AsiaHigh83.8378001.443.00351.0NaN15.115.5
17517Japan2016128000000AsiaEastern AsiaHigh83.9382001.462.70350.0NaNNaNNaN
17518Japan2017127000000AsiaEastern AsiaHigh84.0386001.472.83350.0NaNNaNNaN
17519Japan2018127000000AsiaEastern AsiaHigh84.2391001.482.76349.0NaNNaNNaN
30653Singapore20125270000AsiaSouth-eastern AsiaHigh83.2760001.262.807530.06.9013.613.3
30654Singapore20135360000AsiaSouth-eastern AsiaHigh83.2785001.252.707660.010.4013.713.5
30655Singapore20145450000AsiaSouth-eastern AsiaHigh83.4803001.252.707780.010.3013.813.7
30656Singapore20155540000AsiaSouth-eastern AsiaHigh83.6809001.242.707910.0NaN14.013.8
30657Singapore20165620000AsiaSouth-eastern AsiaHigh83.7814001.252.808030.0NaNNaNNaN
30658Singapore20175710000AsiaSouth-eastern AsiaHigh83.8826001.252.588160.0NaNNaNNaN
30659Singapore20185790000AsiaSouth-eastern AsiaHigh84.0839001.262.528270.0NaNNaNNaN
32410Spain201746400000EuropeSouthern EuropeHigh83.1340001.383.1492.9NaNNaNNaN
32411Spain201846400000EuropeSouthern EuropeHigh83.2347001.393.0293.0NaNNaNNaN
33722Switzerland20158320000EuropeWestern EuropeHigh83.1565001.544.10211.0NaN14.614.4
33723Switzerland20168400000EuropeWestern EuropeHigh83.1566001.554.10213.0NaNNaNNaN
33724Switzerland20178480000EuropeWestern EuropeHigh83.3569001.553.86214.0NaNNaNNaN
33725Switzerland20188540000EuropeWestern EuropeHigh83.5571001.553.75216.0NaNNaNNaN
\n", "
" ], "text/plain": [ " country year population region sub_region income_group \\\n", "17513 Japan 2012 128000000 Asia Eastern Asia High \n", "17514 Japan 2013 128000000 Asia Eastern Asia High \n", "17515 Japan 2014 128000000 Asia Eastern Asia High \n", "17516 Japan 2015 128000000 Asia Eastern Asia High \n", "17517 Japan 2016 128000000 Asia Eastern Asia High \n", "17518 Japan 2017 127000000 Asia Eastern Asia High \n", "17519 Japan 2018 127000000 Asia Eastern Asia High \n", "30653 Singapore 2012 5270000 Asia South-eastern Asia High \n", "30654 Singapore 2013 5360000 Asia South-eastern Asia High \n", "30655 Singapore 2014 5450000 Asia South-eastern Asia High \n", "30656 Singapore 2015 5540000 Asia South-eastern Asia High \n", "30657 Singapore 2016 5620000 Asia South-eastern Asia High \n", "30658 Singapore 2017 5710000 Asia South-eastern Asia High \n", "30659 Singapore 2018 5790000 Asia South-eastern Asia High \n", "32410 Spain 2017 46400000 Europe Southern Europe High \n", "32411 Spain 2018 46400000 Europe Southern Europe High \n", "33722 Switzerland 2015 8320000 Europe Western Europe High \n", "33723 Switzerland 2016 8400000 Europe Western Europe High \n", "33724 Switzerland 2017 8480000 Europe Western Europe High \n", "33725 Switzerland 2018 8540000 Europe Western Europe High \n", "\n", " life_expectancy income children_per_woman child_mortality \\\n", "17513 83.2 36400 1.40 3.00 \n", "17514 83.4 37100 1.42 2.90 \n", "17515 83.6 37300 1.43 2.80 \n", "17516 83.8 37800 1.44 3.00 \n", "17517 83.9 38200 1.46 2.70 \n", "17518 84.0 38600 1.47 2.83 \n", "17519 84.2 39100 1.48 2.76 \n", "30653 83.2 76000 1.26 2.80 \n", "30654 83.2 78500 1.25 2.70 \n", "30655 83.4 80300 1.25 2.70 \n", "30656 83.6 80900 1.24 2.70 \n", "30657 83.7 81400 1.25 2.80 \n", "30658 83.8 82600 1.25 2.58 \n", "30659 84.0 83900 1.26 2.52 \n", "32410 83.1 34000 1.38 3.14 \n", "32411 83.2 34700 1.39 3.02 \n", "33722 83.1 56500 1.54 4.10 \n", "33723 83.1 56600 1.55 4.10 \n", "33724 83.3 56900 1.55 3.86 \n", "33725 83.5 57100 1.55 3.75 \n", "\n", " pop_density co2_per_capita years_in_school_men years_in_school_women \n", "17513 352.0 9.58 14.8 15.2 \n", "17514 352.0 9.71 14.9 15.3 \n", "17515 352.0 9.47 15.0 15.4 \n", "17516 351.0 NaN 15.1 15.5 \n", "17517 350.0 NaN NaN NaN \n", "17518 350.0 NaN NaN NaN \n", "17519 349.0 NaN NaN NaN \n", "30653 7530.0 6.90 13.6 13.3 \n", "30654 7660.0 10.40 13.7 13.5 \n", "30655 7780.0 10.30 13.8 13.7 \n", "30656 7910.0 NaN 14.0 13.8 \n", "30657 8030.0 NaN NaN NaN \n", "30658 8160.0 NaN NaN NaN \n", "30659 8270.0 NaN NaN NaN \n", "32410 92.9 NaN NaN NaN \n", "32411 93.0 NaN NaN NaN \n", "33722 211.0 NaN 14.6 14.4 \n", "33723 213.0 NaN NaN NaN \n", "33724 214.0 NaN NaN NaN \n", "33725 216.0 NaN NaN NaN " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data[world_data['life_expectancy'] > 83]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, this can be combined with selection of a particular set of columns." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearlife_expectancy
17513Japan201283.2
17514Japan201383.4
17515Japan201483.6
17516Japan201583.8
17517Japan201683.9
17518Japan201784.0
17519Japan201884.2
30653Singapore201283.2
30654Singapore201383.2
30655Singapore201483.4
30656Singapore201583.6
30657Singapore201683.7
30658Singapore201783.8
30659Singapore201884.0
32410Spain201783.1
32411Spain201883.2
33722Switzerland201583.1
33723Switzerland201683.1
33724Switzerland201783.3
33725Switzerland201883.5
\n", "
" ], "text/plain": [ " country year life_expectancy\n", "17513 Japan 2012 83.2\n", "17514 Japan 2013 83.4\n", "17515 Japan 2014 83.6\n", "17516 Japan 2015 83.8\n", "17517 Japan 2016 83.9\n", "17518 Japan 2017 84.0\n", "17519 Japan 2018 84.2\n", "30653 Singapore 2012 83.2\n", "30654 Singapore 2013 83.2\n", "30655 Singapore 2014 83.4\n", "30656 Singapore 2015 83.6\n", "30657 Singapore 2016 83.7\n", "30658 Singapore 2017 83.8\n", "30659 Singapore 2018 84.0\n", "32410 Spain 2017 83.1\n", "32411 Spain 2018 83.2\n", "33722 Switzerland 2015 83.1\n", "33723 Switzerland 2016 83.1\n", "33724 Switzerland 2017 83.3\n", "33725 Switzerland 2018 83.5" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data.loc[world_data['life_expectancy'] > 83, ['country', 'year', 'life_expectancy']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A single expression can also be used to filter for several criteria, either matching *all* criteria (`&`) or *any* criteria (`|`). These special operators are used instead of `and` and `or` to make sure that the comparison occurs for each row in the data frame. Parentheses are added to indicate the priority of the comparisons." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sub_regioncountryyear
9496Northern EuropeDenmark1879
11248Northern EuropeEstonia1879
11905Northern EuropeFinland1879
15409Northern EuropeIceland1879
16504Northern EuropeIreland1879
19132Northern EuropeLatvia1879
20227Northern EuropeLithuania1879
25921Northern EuropeNorway1879
33367Northern EuropeSweden1879
36871Northern EuropeUnited Kingdom1879
\n", "
" ], "text/plain": [ " sub_region country year\n", "9496 Northern Europe Denmark 1879\n", "11248 Northern Europe Estonia 1879\n", "11905 Northern Europe Finland 1879\n", "15409 Northern Europe Iceland 1879\n", "16504 Northern Europe Ireland 1879\n", "19132 Northern Europe Latvia 1879\n", "20227 Northern Europe Lithuania 1879\n", "25921 Northern Europe Norway 1879\n", "33367 Northern Europe Sweden 1879\n", "36871 Northern Europe United Kingdom 1879" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# AND = &\n", "world_data.loc[(world_data['sub_region'] == 'Northern Europe') & (world_data['year'] == 1879), ['sub_region', 'country', 'year']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To increase readability, these statements can be put on multiple rows. Anything that is within a parameter or bracket in Python can be continued on the next row. When inside a bracket or parenthesis, the indentation is not significant to the Python interpreter, but it is still recommended to include it in order to make the code more readable." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sub_regioncountryyear
9496Northern EuropeDenmark1879
11248Northern EuropeEstonia1879
11905Northern EuropeFinland1879
15409Northern EuropeIceland1879
16504Northern EuropeIreland1879
19132Northern EuropeLatvia1879
20227Northern EuropeLithuania1879
25921Northern EuropeNorway1879
33367Northern EuropeSweden1879
36871Northern EuropeUnited Kingdom1879
\n", "
" ], "text/plain": [ " sub_region country year\n", "9496 Northern Europe Denmark 1879\n", "11248 Northern Europe Estonia 1879\n", "11905 Northern Europe Finland 1879\n", "15409 Northern Europe Iceland 1879\n", "16504 Northern Europe Ireland 1879\n", "19132 Northern Europe Latvia 1879\n", "20227 Northern Europe Lithuania 1879\n", "25921 Northern Europe Norway 1879\n", "33367 Northern Europe Sweden 1879\n", "36871 Northern Europe United Kingdom 1879" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data.loc[(world_data['sub_region'] == 'Northern Europe') &\n", " (world_data['year'] == 1879),\n", " ['sub_region', 'country', 'year']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above it was assumed that `'Northern Europe'` was a vaue within the `sub_region` column. When it is not known which values are available in a column, the `unique()` method can be used to find this out." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Southern Asia', 'Southern Europe', 'Northern Africa',\n", " 'Sub-Saharan Africa', 'Latin America and the Caribbean',\n", " 'Western Asia', 'Australia and New Zealand', 'Western Europe',\n", " 'Eastern Europe', 'South-eastern Asia', 'Northern America',\n", " 'Eastern Asia', 'Northern Europe', 'Melanesia', 'Central Asia',\n", " 'Micronesia', 'Polynesia'], dtype=object)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data['sub_region'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the `|` operator, rows matching either of the supplied criteria are returned." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyear
0Afghanistan1800
1Afghanistan1801
219Albania1800
220Albania1801
438Algeria1800
\n", "
" ], "text/plain": [ " country year\n", "0 Afghanistan 1800\n", "1 Afghanistan 1801\n", "219 Albania 1800\n", "220 Albania 1801\n", "438 Algeria 1800" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# OR = |\n", "world_data.loc[(world_data['year'] == 1800) |\n", " (world_data['year'] == 1801) ,\n", " ['country', 'year']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Additional useful ways of subsetting the data includes `between()` which checks if a numerical valule is within a given range, and `isin()` which checks if a value is contained in a given list." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010,\n", " 2011, 2012, 2013, 2014, 2015])" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data.loc[world_data['year'].between(2000, 2015), 'year'].unique()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Asia', 'Africa', 'Americas'], dtype=object)" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data.loc[world_data['region'].isin(['Africa', 'Asia', 'Americas']), 'region'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating new columns\n", "\n", "A frequent operation when working with data, is to create new columns based on the values in existing columns, for example to do unit conversions or find the ratio of values in two columns. To create a new column of the weight in kg instead of in grams:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
populationincomepopulation_income
032800006031977840000
132800006031977840000
232800006031977840000
332800006031977840000
432800006031977840000
\n", "
" ], "text/plain": [ " population income population_income\n", "0 3280000 603 1977840000\n", "1 3280000 603 1977840000\n", "2 3280000 603 1977840000\n", "3 3280000 603 1977840000\n", "4 3280000 603 1977840000" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data['population_income'] = world_data['income'] * world_data['population']\n", "world_data[['population', 'income', 'population_income']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">#### Challenge\n", ">\n", ">1. Subset `world_data` to include observations from 1995 to 2001. Check that the dimensions of the resulting data frame is 1253 x 15.\n", "> \n", ">2. Subset the data to include only observation from year 2000 and onwards, from all regions except 'Asia', and retain only the columns `country`, `year`, and `sub_region`. The dimensions of the resulting data frame should be 2508 x 3." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2489, 3)" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Challenge solutions\n", "\n", "# 1. \n", "world_data.loc[world_data['year'].between(1995, 2001)].shape\n", "\n", "# 2.\n", "world_data.loc[(world_data['year'] >= 2000) &\n", " (world_data['region'] != 'Asia'),\n", " ['country', 'year', 'sub_region']].shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Split-apply-combine techniques in pandas\n", "\n", "Many data analysis tasks can be approached using the *split-apply-combine* paradigm: split the data into groups, apply some analysis to each group, and then combine the results.\n", "\n", "`pandas` facilitates this workflow through the use of `groupby()` to split data and summary/aggregation functions such as `mean()`, which collapses each group into a single-row summary of that group. The arguments to `groupby()` are the column names that contain the *categorical* variables by which summary statistics should be calculated. To start, compute the mean `weight` by sex.\n", "\n", "![Image credit Jake VanderPlas](img/split-apply-combine.png)\n", "\n", "*Image credit Jake VanderPlas*\n", "\n", "### Summarizing categorical data \n", "\n", "Aggregation (or \"summary\") methods, such as `.sum()` and `.mean()` can be used to calculate their respective statistics on subsets (groups) in the data. When the mean is computed, the default behavior is to ignore NA values, so they only need to be dropped if they are to be excluded from the visual output." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "region\n", "Africa 59192998600\n", "Americas 63837885500\n", "Asia 330133218800\n", "Europe 98766930400\n", "Oceania 2422277600\n", "Name: population, dtype: int64" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data.groupby('region')['population'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output here is a series that is indexed with the grouped variable (the region) as the index and the result of the aggregation (the total population) as the values (conceptually, the only column).\n", "\n", "These populations numbers are abnormally high because the summary is made for all the years instead of only one. To view only the data from this year, use the learnt methods to subset for only 2018. Compare these results to the picture in the survey that placed 4 million people in Asia and 1 million in each of the other regions." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "region\n", "Africa 1286388200\n", "Americas 1010688000\n", "Asia 4514211000\n", "Europe 742109000\n", "Oceania 40212000\n", "Name: population, dtype: int64" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data_2018 = world_data.loc[world_data['year'] == 2018]\n", "world_data_2018.groupby('region')['population'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These numbers are closer to the survey we took earlier. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Individual countries can be selected from the resulting series using `loc[]`, just as previously." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "region\n", "Asia 4514211000\n", "Europe 742109000\n", "Name: population, dtype: int64" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "avg_density = world_data_2018.groupby('region')['population'].sum()\n", "avg_density.loc[['Asia', 'Europe']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a shortcut, `loc[]` can be omitted when indexing a series. This is similar to selecting columns from a data frame with just `[]`." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "region\n", "Asia 4514211000\n", "Europe 742109000\n", "Name: population, dtype: int64" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "avg_density[['Asia', 'Europe']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " This indexing can be used to normalize the population numbers to the region of interest. " ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "region\n", "Africa 1.733422\n", "Americas 1.361913\n", "Asia 6.082949\n", "Europe 1.000000\n", "Oceania 0.054186\n", "Name: population, dtype: float64" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "region_pop_2018 = world_data_2018.groupby('region')['population'].sum()\n", "region_pop_2018 / region_pop_2018['Europe']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 6 times as many people living in Asia than in Europe." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Groups can also be created from multiple columns, e.g. it could be interesting to compare the how densely populated countries are on average in different income brackets around the world." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "region income_group\n", "Africa High 207.000000\n", " Low 118.640741\n", " Lower middle 69.331250\n", " Upper middle 94.457500\n", "Americas High 136.426000\n", " Low 403.000000\n", " Lower middle 113.950000\n", " Upper middle 92.931875\n", "Asia High 1121.654545\n", " Low 115.866667\n", " Lower middle 262.606471\n", " Upper middle 235.447692\n", "Europe High 176.563214\n", " Lower middle 99.500000\n", " Upper middle 67.832222\n", "Oceania High 10.610000\n", " Lower middle 52.500000\n", " Upper middle 90.266667\n", "Name: pop_density, dtype: float64" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data_2018.groupby(['region', 'income_group'])['pop_density'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that `income_group` is an ordinal variable, i.e. a categorical variable with an inherent order to it. Here, `pandas` has not listed the values of that variable in the order we would expect (low, lower-middle, upper-middle, high). The order of a variable can be specified in the data frame itself, using the top level `pandas` function `Categorical()`." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CategoricalDtype(categories=['Low', 'Lower middle', 'Upper middle', 'High'], ordered=True)" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reassign in the main data frame since we will use more than just the 2018 data later\n", "world_data['income_group'] = (\n", " pd.Categorical(world_data['income_group'], ordered=True,\n", " categories=['Low', 'Lower middle', 'Upper middle', 'High'])\n", ")\n", "\n", "# Need to recreate the 2018 data frame since the categorical was changed in the main frame\n", "world_data_2018 = world_data.loc[world_data['year'] == 2018]\n", "world_data_2018['income_group'].dtype" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "region income_group\n", "Africa Low 118.640741\n", " Lower middle 69.331250\n", " Upper middle 94.457500\n", " High 207.000000\n", "Americas Low 403.000000\n", " Lower middle 113.950000\n", " Upper middle 92.931875\n", " High 136.426000\n", "Asia Low 115.866667\n", " Lower middle 262.606471\n", " Upper middle 235.447692\n", " High 1121.654545\n", "Europe Lower middle 99.500000\n", " Upper middle 67.832222\n", " High 176.563214\n", "Oceania Lower middle 52.500000\n", " Upper middle 90.266667\n", " High 10.610000\n", "Name: pop_density, dtype: float64" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data_2018.groupby(['region', 'income_group'])['pop_density'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the values appear in the order we would expect. The value for Asia in the high income bracket looks suspiciously high. It would be interesting to see which countries were averaged to that value." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrypop_density
2627Bahrain2060.0
9197Cyprus129.0
16862Israel391.0
17519Japan349.0
18614Kuwait236.0
26279Oman15.6
28469Qatar232.0
29564Saudi Arabia15.6
30659Singapore8270.0
31973South Korea526.0
36791United Arab Emirates114.0
\n", "
" ], "text/plain": [ " country pop_density\n", "2627 Bahrain 2060.0\n", "9197 Cyprus 129.0\n", "16862 Israel 391.0\n", "17519 Japan 349.0\n", "18614 Kuwait 236.0\n", "26279 Oman 15.6\n", "28469 Qatar 232.0\n", "29564 Saudi Arabia 15.6\n", "30659 Singapore 8270.0\n", "31973 South Korea 526.0\n", "36791 United Arab Emirates 114.0" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data_2018.loc[(world_data['region'] == 'Asia') &\n", " (world_data['income_group'] == 'High'),\n", " ['country', 'pop_density']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extreme values, such as the city-state Singapore, can heavily skew averages and it could be a good idea to use a more robust statistics such as the median instead." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "region income_group\n", "Africa Low 66.70\n", " Lower middle 74.75\n", " Upper middle 12.81\n", " High 207.00\n", "Americas Low 403.00\n", " Lower middle 68.20\n", " Upper middle 55.95\n", " High 37.80\n", "Asia Low 82.35\n", " Lower middle 92.00\n", " Upper middle 106.00\n", " High 236.00\n", "Europe Lower middle 99.50\n", " Upper middle 68.70\n", " High 109.50\n", "Oceania Lower middle 22.70\n", " Upper middle 69.90\n", " High 10.61\n", "Name: pop_density, dtype: float64" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data_2018.groupby(['region', 'income_group'])['pop_density'].median()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The returned series has an index that is a combination of the columns `region` and `sub_region`, and referred to as a `MultiIndex`. The same syntax as previously can be used to select rows on the species-level." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "region income_group\n", "Africa Low 66.70\n", " Lower middle 74.75\n", " Upper middle 12.81\n", " High 207.00\n", "Americas Low 403.00\n", " Lower middle 68.20\n", " Upper middle 55.95\n", " High 37.80\n", "Name: pop_density, dtype: float64" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "med_density_2018 = world_data_2018.groupby(['region', 'income_group'])['pop_density'].median()\n", "med_density_2018[['Africa', 'Americas']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select specific values from both levels of the `MultiIndex`, a list of tuples can be passed to `loc[]`." ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "region income_group\n", "Africa High 207.0\n", "Americas High 37.8\n", "Name: pop_density, dtype: float64" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "med_density_2018.loc[[('Africa', 'High'), ('Americas', 'High')]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select only the low income values from all region, the `xs()` (cross section) method can be used." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "region\n", "Africa 66.70\n", "Americas 403.00\n", "Asia 82.35\n", "Name: pop_density, dtype: float64" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "med_density_2018.xs('Low', level='income_group')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The names and values of the index levels can be seen by inspecting the index object." ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MultiIndex(levels=[['Africa', 'Americas', 'Asia', 'Europe', 'Oceania'], ['Low', 'Lower middle', 'Upper middle', 'High']],\n", " labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 1, 2, 3, 1, 2, 3]],\n", " names=['region', 'income_group'])" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "med_density_2018.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although `MultiIndexes` offer succinct and fast ways to access data, they also requires memorization of additional syntax and are strictly speaking not essential unless speed is of particular concern. It can therefore be easier to reset the index, so that all values are stored in columns." ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
regionincome_grouppop_density
0AfricaLow66.70
1AfricaLower middle74.75
2AfricaUpper middle12.81
3AfricaHigh207.00
4AmericasLow403.00
5AmericasLower middle68.20
6AmericasUpper middle55.95
7AmericasHigh37.80
8AsiaLow82.35
9AsiaLower middle92.00
10AsiaUpper middle106.00
11AsiaHigh236.00
12EuropeLower middle99.50
13EuropeUpper middle68.70
14EuropeHigh109.50
15OceaniaLower middle22.70
16OceaniaUpper middle69.90
17OceaniaHigh10.61
\n", "
" ], "text/plain": [ " region income_group pop_density\n", "0 Africa Low 66.70\n", "1 Africa Lower middle 74.75\n", "2 Africa Upper middle 12.81\n", "3 Africa High 207.00\n", "4 Americas Low 403.00\n", "5 Americas Lower middle 68.20\n", "6 Americas Upper middle 55.95\n", "7 Americas High 37.80\n", "8 Asia Low 82.35\n", "9 Asia Lower middle 92.00\n", "10 Asia Upper middle 106.00\n", "11 Asia High 236.00\n", "12 Europe Lower middle 99.50\n", "13 Europe Upper middle 68.70\n", "14 Europe High 109.50\n", "15 Oceania Lower middle 22.70\n", "16 Oceania Upper middle 69.90\n", "17 Oceania High 10.61" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "med_density_2018_res = med_density_2018.reset_index()\n", "med_density_2018_res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After resetting the index, the same comparison syntax introduced earlier can be used instead of `xs()` or passing lists of tuples to `loc[]`." ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
regionincome_grouppop_density
0AfricaLow66.70
4AmericasLow403.00
8AsiaLow82.35
\n", "
" ], "text/plain": [ " region income_group pop_density\n", "0 Africa Low 66.70\n", "4 Americas Low 403.00\n", "8 Asia Low 82.35" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "med_density_2018_asia = med_density_2018_res.loc[med_density_2018_res['income_group'] == 'Low']\n", "med_density_2018_asia" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`reset_index()` grants the freedom of not having to work with indexes, but it is still worth keeping in mind that selecting on an index level with `xs()` can be orders of magnitude faster than using boolean comparisons (on large data frames).\n", "\n", "The opposite operation (to create an index) can be performed with `set_index()` on any column (or combination of columns) that creates an index with unique values." ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pop_density
regionincome_group
AfricaLow66.70
AmericasLow403.00
AsiaLow82.35
\n", "
" ], "text/plain": [ " pop_density\n", "region income_group \n", "Africa Low 66.70\n", "Americas Low 403.00\n", "Asia Low 82.35" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "med_density_2018_asia.set_index(['region', 'income_group'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Challenge\n", ">\n", "> 1. Which is the highest population density in each region?\n", ">\n", "> 2. The low income group for the Americas had the same population density for both the mean and the median. This could mean that there are few observations in this group. List all the low income countries in the Americas." ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "region\n", "Africa 625.0\n", "Americas 666.0\n", "Asia 8270.0\n", "Europe 1350.0\n", "Oceania 151.0\n", "Name: pop_density, dtype: float64" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Challenge solutions\n", "\n", "# 1.\n", "world_data_2018.groupby('region')['pop_density'].max()" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrypop_density
14891Haiti403.0
\n", "
" ], "text/plain": [ " country pop_density\n", "14891 Haiti 403.0" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This will be a challenge\n", "\n", "# 2.\n", "world_data_2018.loc[(world_data['region'] == 'Americas') & (world_data['income_group'] == 'Low'), ['country', 'pop_density']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Multiple aggregations on grouped data\n", "\n", "Since the same grouped data frame will be used in multiple code chunks below, this can be assigned to a new variable instead of typing out the grouping expression each time." ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "region sub_region \n", "Africa Northern Africa 74.716667\n", " Sub-Saharan Africa 63.682609\n", "Americas Latin America and the Caribbean 75.600000\n", " Northern America 80.650000\n", "Asia Central Asia 71.340000\n", " Eastern Asia 76.440000\n", " South-eastern Asia 73.630000\n", " Southern Asia 72.211111\n", " Western Asia 76.122222\n", "Europe Eastern Europe 75.110000\n", " Northern Europe 80.140000\n", " Southern Europe 79.466667\n", " Western Europe 82.100000\n", "Oceania Australia and New Zealand 82.350000\n", " Melanesia 63.700000\n", " Micronesia 62.200000\n", " Polynesia 71.550000\n", "Name: life_expectancy, dtype: float64" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_world_data = world_data_2018.groupby(['region', 'sub_region'])\n", "grouped_world_data['life_expectancy'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of using the `mean()` or `sum()` methods directly, the more general `agg()` method could be called to aggregate by *any* existing aggregation functions. The equivalent to the `mean()` method would be to call `agg()` and specify `'mean'`." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "region sub_region \n", "Africa Northern Africa 74.716667\n", " Sub-Saharan Africa 63.682609\n", "Americas Latin America and the Caribbean 75.600000\n", " Northern America 80.650000\n", "Asia Central Asia 71.340000\n", " Eastern Asia 76.440000\n", " South-eastern Asia 73.630000\n", " Southern Asia 72.211111\n", " Western Asia 76.122222\n", "Europe Eastern Europe 75.110000\n", " Northern Europe 80.140000\n", " Southern Europe 79.466667\n", " Western Europe 82.100000\n", "Oceania Australia and New Zealand 82.350000\n", " Melanesia 63.700000\n", " Micronesia 62.200000\n", " Polynesia 71.550000\n", "Name: life_expectancy, dtype: float64" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_world_data['life_expectancy'].agg('mean')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This general approach is more flexible and powerful since multiple aggregation functions can be applied in the same line of code by passing them as a list to `agg()`. For instance, the standard deviation and mean could be computed in the same call by passing them in a list." ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
meanstd
regionsub_region
AfricaNorthern Africa74.7166673.510793
Sub-Saharan Africa63.6826094.540108
AmericasLatin America and the Caribbean75.6000003.721559
Northern America80.6500002.192031
AsiaCentral Asia71.3400000.808084
Eastern Asia76.4400006.566430
South-eastern Asia73.6300004.835298
Southern Asia72.2111116.426983
Western Asia76.1222224.585214
EuropeEastern Europe75.1100002.711478
Northern Europe80.1400002.958678
Southern Europe79.4666672.694889
Western Europe82.1000000.804156
OceaniaAustralia and New Zealand82.3500000.777817
Melanesia63.7000001.961292
Micronesia62.200000NaN
Polynesia71.5500001.202082
\n", "
" ], "text/plain": [ " mean std\n", "region sub_region \n", "Africa Northern Africa 74.716667 3.510793\n", " Sub-Saharan Africa 63.682609 4.540108\n", "Americas Latin America and the Caribbean 75.600000 3.721559\n", " Northern America 80.650000 2.192031\n", "Asia Central Asia 71.340000 0.808084\n", " Eastern Asia 76.440000 6.566430\n", " South-eastern Asia 73.630000 4.835298\n", " Southern Asia 72.211111 6.426983\n", " Western Asia 76.122222 4.585214\n", "Europe Eastern Europe 75.110000 2.711478\n", " Northern Europe 80.140000 2.958678\n", " Southern Europe 79.466667 2.694889\n", " Western Europe 82.100000 0.804156\n", "Oceania Australia and New Zealand 82.350000 0.777817\n", " Melanesia 63.700000 1.961292\n", " Micronesia 62.200000 NaN\n", " Polynesia 71.550000 1.202082" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_world_data['life_expectancy'].agg(['mean', 'std'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The returned output is in this case a data frame and the `MultiIndex` is indicated in bold font.\n", "\n", "By passing a dictionary to `.agg()` it is possible to apply different aggregations to the different columns. Long code statements can be broken down into multiple lines if they are enclosed by parentheses, brackets or braces, something that will be described in detail later." ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
populationincome
summinmedianmax
regionsub_region
AfricaNorthern Africa23727000044401120018300
Sub-Saharan Africa1049118200629198527500
AmericasLatin America and the Caribbean64668800017101370030300
Northern America364000000438004935054900
AsiaCentral Asia718900002920669024200
Eastern Asia162692000013901600039100
South-eastern Asia6558700001490725583900
Southern Asia18872610001870689017400
Western Asia272270000243020750121000
EuropeEastern Europe29197000053302410032300
Northern Europe104478000255004345065600
Southern Europe151681000121002405037900
Western Europe193980000390004520099000
OceaniaAustralia and New Zealand29550000364004110045800
Melanesia10237000211028509420
Micronesia118000189018901890
Polynesia307000550057255950
\n", "
" ], "text/plain": [ " population income \n", " sum min median max\n", "region sub_region \n", "Africa Northern Africa 237270000 4440 11200 18300\n", " Sub-Saharan Africa 1049118200 629 1985 27500\n", "Americas Latin America and the Caribbean 646688000 1710 13700 30300\n", " Northern America 364000000 43800 49350 54900\n", "Asia Central Asia 71890000 2920 6690 24200\n", " Eastern Asia 1626920000 1390 16000 39100\n", " South-eastern Asia 655870000 1490 7255 83900\n", " Southern Asia 1887261000 1870 6890 17400\n", " Western Asia 272270000 2430 20750 121000\n", "Europe Eastern Europe 291970000 5330 24100 32300\n", " Northern Europe 104478000 25500 43450 65600\n", " Southern Europe 151681000 12100 24050 37900\n", " Western Europe 193980000 39000 45200 99000\n", "Oceania Australia and New Zealand 29550000 36400 41100 45800\n", " Melanesia 10237000 2110 2850 9420\n", " Micronesia 118000 1890 1890 1890\n", " Polynesia 307000 5500 5725 5950" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_world_data[['population', 'income']].agg(\n", " {'population': 'sum',\n", " 'income': ['min', 'median', 'max']\n", " }\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are plenty of aggregation methods available in pandas (e.g. `sem`, `mad`, `sum`, most of which can be seen at [the end of this section](https://pandas.pydata.org/pandas-docs/stable/groupby.html#aggregation) in the `pandas` documentation, or explored using tab-complete on the grouped data frame)." ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [], "source": [ "# This is a side note, no need to bring up unless someone has issues\n", "# Tab completion might only work like this:\n", "# find_agg_methods = grouped_world_data['weight']\n", "# find_agg_methods." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even if a function is not part of the `pandas` library, it can be passed to `agg()`." ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "region sub_region \n", "Africa Northern Africa 50.113333\n", " Sub-Saharan Africa 108.143043\n", "Americas Latin America and the Caribbean 126.558966\n", " Northern America 19.880000\n", "Asia Central Asia 38.504000\n", " Eastern Asia 248.202000\n", " South-eastern Asia 961.110000\n", " Southern Asia 460.388889\n", " Western Asia 298.355556\n", "Europe Eastern Europe 88.629000\n", " Northern Europe 64.897000\n", " Southern Europe 202.166667\n", " Western Europe 256.000000\n", "Oceania Australia and New Zealand 10.610000\n", " Melanesia 28.475000\n", " Micronesia 146.000000\n", " Polynesia 110.450000\n", "Name: pop_density, dtype: float64" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "grouped_world_data['pop_density'].agg(np.mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Any function can be passed like this, including user-created functions. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> #### Challenge\n", ">\n", "> 1. What's the mean life expectancy for each income group in 2018?\n", "> \n", "> 2. What's the min, median, and max life expectancies for each income group within each region?" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "income_group\n", "Low 63.744118\n", "Lower middle 69.053488\n", "Upper middle 74.283673\n", "High 79.919231\n", "Name: life_expectancy, dtype: float64" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Challenge solutions\n", "\n", "# 1.\n", "world_data_2018.groupby('income_group')['life_expectancy'].mean()" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
minmedianmax
regionincome_group
AfricaLow51.662.5068.3
Lower middle51.166.3578.0
Upper middle63.567.1077.9
High74.274.2074.2
AmericasLow64.564.5064.5
Lower middle73.174.9078.7
Upper middle68.275.8081.4
High73.477.6082.2
AsiaLow58.770.4572.2
Lower middle67.971.5077.8
Upper middle68.076.5080.5
High76.980.7084.2
EuropeLower middle72.372.3572.4
Upper middle71.175.5078.0
High75.181.3083.5
OceaniaLower middle61.162.9064.3
Upper middle65.870.7072.4
High81.882.3582.9
\n", "
" ], "text/plain": [ " min median max\n", "region income_group \n", "Africa Low 51.6 62.50 68.3\n", " Lower middle 51.1 66.35 78.0\n", " Upper middle 63.5 67.10 77.9\n", " High 74.2 74.20 74.2\n", "Americas Low 64.5 64.50 64.5\n", " Lower middle 73.1 74.90 78.7\n", " Upper middle 68.2 75.80 81.4\n", " High 73.4 77.60 82.2\n", "Asia Low 58.7 70.45 72.2\n", " Lower middle 67.9 71.50 77.8\n", " Upper middle 68.0 76.50 80.5\n", " High 76.9 80.70 84.2\n", "Europe Lower middle 72.3 72.35 72.4\n", " Upper middle 71.1 75.50 78.0\n", " High 75.1 81.30 83.5\n", "Oceania Lower middle 61.1 62.90 64.3\n", " Upper middle 65.8 70.70 72.4\n", " High 81.8 82.35 82.9" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 2.\n", "world_data_2018.groupby(['region', 'income_group'])['life_expectancy'].agg(['min', 'median', 'max'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional sections (time permitting)\n", "\n", "### Using `size()` to summarize categorical data \n", "\n", "When working with data, it is common to want to know the number of observations present for each categorical variable. For this, `pandas` provides the `size()` method. For example, to find the number of observations (in this case unique countries during year 2018) per region:" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "region\n", "Africa 52\n", "Americas 31\n", "Asia 47\n", "Europe 39\n", "Oceania 9\n", "dtype: int64" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data_2018.groupby('region').size()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`size()` can also be used when grouping on multiple variables." ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "region income_group\n", "Africa Low 27\n", " Lower middle 16\n", " Upper middle 8\n", " High 1\n", "Americas Low 1\n", " Lower middle 4\n", " Upper middle 16\n", " High 10\n", "Asia Low 6\n", " Lower middle 17\n", " Upper middle 13\n", " High 11\n", "Europe Lower middle 2\n", " Upper middle 9\n", " High 28\n", "Oceania Lower middle 4\n", " Upper middle 3\n", " High 2\n", "dtype: int64" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data_2018.groupby(['region', 'income_group']).size()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If there are many groups, `size()` is not that useful on its own. For example, it is difficult to quickly find the five most abundant species among the observations." ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sub_region\n", "Australia and New Zealand 2\n", "Central Asia 5\n", "Eastern Asia 5\n", "Eastern Europe 10\n", "Latin America and the Caribbean 29\n", "Melanesia 4\n", "Micronesia 1\n", "Northern Africa 6\n", "Northern America 2\n", "Northern Europe 10\n", "Polynesia 2\n", "South-eastern Asia 10\n", "Southern Asia 9\n", "Southern Europe 12\n", "Sub-Saharan Africa 46\n", "Western Asia 18\n", "Western Europe 7\n", "dtype: int64" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data_2018.groupby('sub_region').size()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since there are many rows in this output, it would be beneficial to sort the table values and display the most abundant species first. This is easy to do with the `sort_values()` method." ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sub_region\n", "Micronesia 1\n", "Australia and New Zealand 2\n", "Polynesia 2\n", "Northern America 2\n", "Melanesia 4\n", "Eastern Asia 5\n", "Central Asia 5\n", "Northern Africa 6\n", "Western Europe 7\n", "Southern Asia 9\n", "Northern Europe 10\n", "South-eastern Asia 10\n", "Eastern Europe 10\n", "Southern Europe 12\n", "Western Asia 18\n", "Latin America and the Caribbean 29\n", "Sub-Saharan Africa 46\n", "dtype: int64" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data_2018.groupby('sub_region').size().sort_values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's better, but it could be helpful to display the most abundant species on top. In other words, the output should be arranged in descending order." ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sub_region\n", "Sub-Saharan Africa 46\n", "Latin America and the Caribbean 29\n", "Western Asia 18\n", "Southern Europe 12\n", "Eastern Europe 10\n", "dtype: int64" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data_2018.groupby('sub_region').size().sort_values(ascending=False).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks good! By now, the code statement has grown quite long because many methods have been *chained* together. It can be tricky to keep track of what is going on in long method chains. To make the code more readable, it can be broken up multiple lines by adding a surrounding parenthesis." ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sub_region\n", "Sub-Saharan Africa 46\n", "Latin America and the Caribbean 29\n", "Western Asia 18\n", "Southern Europe 12\n", "Eastern Europe 10\n", "dtype: int64" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(world_data_2018\n", " .groupby('sub_region')\n", " .size()\n", " .sort_values(ascending=False)\n", " .head(5)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This looks neater and makes long method chains easier to reads. There is no absolute rule for when to break code into multiple line, but always try to write code that is easy for collaborators (your most common collaborator is a future version of yourself!) to understand.\n", "\n", "`pandas` actually has a convenience function for returning the top five results, so the values don't need to be sorted explicitly." ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sub_region\n", "Sub-Saharan Africa 46\n", "Latin America and the Caribbean 29\n", "Western Asia 18\n", "Southern Europe 12\n", "Eastern Europe 10\n", "dtype: int64" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(world_data_2018\n", " .groupby(['sub_region'])\n", " .size()\n", " .nlargest() # the default is 5\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To include more attributes about these countries, add those columns to `groupby()`." ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "region sub_region \n", "Africa Sub-Saharan Africa 46\n", "Americas Latin America and the Caribbean 29\n", "Asia Western Asia 18\n", "Europe Southern Europe 12\n", "Asia South-eastern Asia 10\n", "dtype: int64" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(world_data_2018\n", " .groupby(['region', 'sub_region'])\n", " .size()\n", " .nlargest() # the default is 5\n", ")" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearpopulationregionsub_regionincome_grouplife_expectancyincomechildren_per_womanchild_mortalitypop_densityco2_per_capitayears_in_school_menyears_in_school_womenpopulation_income
0Afghanistan18003280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN1977840000
1Afghanistan18013280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN1977840000
2Afghanistan18023280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN1977840000
3Afghanistan18033280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN1977840000
4Afghanistan18043280000AsiaSouthern AsiaLow28.26037.0469.0NaNNaNNaNNaN1977840000
\n", "
" ], "text/plain": [ " country year population region sub_region income_group \\\n", "0 Afghanistan 1800 3280000 Asia Southern Asia Low \n", "1 Afghanistan 1801 3280000 Asia Southern Asia Low \n", "2 Afghanistan 1802 3280000 Asia Southern Asia Low \n", "3 Afghanistan 1803 3280000 Asia Southern Asia Low \n", "4 Afghanistan 1804 3280000 Asia Southern Asia Low \n", "\n", " life_expectancy income children_per_woman child_mortality pop_density \\\n", "0 28.2 603 7.0 469.0 NaN \n", "1 28.2 603 7.0 469.0 NaN \n", "2 28.2 603 7.0 469.0 NaN \n", "3 28.2 603 7.0 469.0 NaN \n", "4 28.2 603 7.0 469.0 NaN \n", "\n", " co2_per_capita years_in_school_men years_in_school_women \\\n", "0 NaN NaN NaN \n", "1 NaN NaN NaN \n", "2 NaN NaN NaN \n", "3 NaN NaN NaN \n", "4 NaN NaN NaN \n", "\n", " population_income \n", "0 1977840000 \n", "1 1977840000 \n", "2 1977840000 \n", "3 1977840000 \n", "4 1977840000 " ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">#### Challenge\n", ">\n", "> 1. How many countries are there in each income group worldwide?\n", "> 2. Assign the variable name `world_data_2015` to a data frame containing only the values from year 2015 (e.g. the same way as `world_data_2018` was created)\n", "> 3. \n", "> 1. For those countries where women went to school longer than men, how many are in each income group.\n", "> 2. Do the same as above but for countries where men went to school longer than women. What does this distribution tell you?" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "income_group\n", "Low 34\n", "Lower middle 43\n", "Upper middle 49\n", "High 52\n", "dtype: int64" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Challenge solutions\n", "# 1.\n", "world_data_2018.groupby('income_group').size()" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [], "source": [ "# 2\n", "world_data_2015 = world_data.loc[world_data['year'] == 2015]" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "income_group\n", "Low 0\n", "Lower middle 14\n", "Upper middle 33\n", "High 47\n", "dtype: int64" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 3a\n", "world_data_2015.loc[world_data_2015['years_in_school_men'] < world_data_2015['years_in_school_women']].groupby('income_group').size()" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "income_group\n", "Low 34\n", "Lower middle 29\n", "Upper middle 11\n", "High 5\n", "dtype: int64" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 3b\n", "world_data_2015.loc[world_data_2015['years_in_school_men'] > world_data_2015['years_in_school_women']].groupby('income_group').size()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data cleaning tips" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`dropna()` removes both explicit `NaN` values and value that pandas are assumed to be `NaN`, such as the non-numeric values in the life_expectancy column. Non-numeric values can also be coerced into explicit `NaN` values via the `to_numeric()` top level function." ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 71.5\n", "1 NaN\n", "2 71.6\n", "3 NaN\n", "4 71.7\n", "5 72.0\n", "6 70.0\n", "7 70.0\n", "8 70.1\n", "9 70.2\n", "10 NaN\n", "11 70.4\n", "Name: life_expectancy, dtype: float64" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.to_numeric(clean_df['life_expectancy'], errors='coerce')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }