One notices that the "mountains" are plotted on a log-base-2 x-axis and without a y-axis annotation. Why? Furthermore, world income data usually involve mean income per country, so I got curious how/if these plots were made without access to finer granularity level data? Aim of this blog post is to answer these questions by using Gapminder data freely available from their webpage. The answer ended up as a nice `tidyverse` exercise and could serve as motivating application for basic probability course content.
## Data Munging Gapminder
Data on income, population and Gini coefficient were needed to analyse
the above formulated questions. I have done this previously in order to visualize the [Olympic Medal Table Gapminder Style](http://staff.math.su.se/hoehle/blog/2016/08/21/gapMedal.html). We start by downloading the GDP data, which is the annual gross domestic product per capita by Purchasing Power Parities (PPP) measured in
[international dollars](https://en.wikipedia.org/wiki/Geary–Khamis_dollar), fixed 2011 prices. Hence, the inflation over
the years and differences in the cost of living between countries is
accounted for and can thus be compared - see the [Gapminder documentation](https://www.gapminder.org/data/documentation/gd001/) for further details. We download the
[data from Gapminder](https://github.com/Gapminder-Indicators/gdppc_cppp/raw/master/gdppc_cppp-by-gapminder.xlsx) where they are available in *wide format* as Excel-file. For tidyverse handling we reshape them into *long format*.
```{r DATA_GDPLOAD, echo=TRUE, message=FALSE}
##Download gdp data from gapminder - available under a CC BY-4 license.
if (!file.exists(file.path(fullFigPath, "gapminder-gdp.xlsx"))) {
download.file("https://github.com/Gapminder-Indicators/gdppc_cppp/raw/master/gdppc_cppp-by-gapminder.xlsx", destfile=file.path(fullFigPath,"gapminder-gdp.xlsx"))
}
gdp_long <- readxl::read_xlsx(file.path(fullFigPath, "gapminder-gdp.xlsx"), sheet=2) %>%
rename(country=`geo.name`) %>%
select(-geo,-indicator,-indicator.name) %>%
gather(key="year", value="gdp", -country,) %>%
filter(!is.na(gdp))
```
Furthermore, we rescale GDP per year to daily income, because this is the unit used in the book.
```{r, echo=TRUE}
gdp_long %<>% mutate(gdp = gdp / 365.25)
```
Similar code segments are written for (see the [code](`r paste0("https://raw.githubusercontent.com/hoehleatsu/hoehleatsu.github.io/master/_source/",current_input())`) on github for details)
* the gini (`gini_long`) and population (`pop_long`) data
* the regional group (=continent) each country belongs two (`group`)
```{r DATA_LOAD, echo=FALSE, results='hide', message=FALSE}
if (!file.exists(file.path(fullFigPath, "gapminder-gini-v2.xlsx"))) {
download.file("https://docs.google.com/spreadsheets/d/1V9ueokiba2KKO0Un8UwJ73rBPr2Zub8j7bfT6Fi222E/export?format=xlsx", destfile=file.path(fullFigPath,"gapminder-gini-v2.xlsx"))
}
gini <- readxl::read_xlsx(file.path(fullFigPath, "gapminder-gini-v2.xlsx"), sheet=2) %>%
rename(country=`Row Labels`) %>% select(-`Geo code`, -indicator) %>% as.tbl()
gini_long <- gini %>%
gather(key="year", value="gini", -country) %>%
mutate(year=as.character(as.numeric(year)))
if (!file.exists(file.path(fullFigPath,"gapminder-pop.csv"))) {
download.file("https://docs.google.com/spreadsheet/pub?key=phAwcNAVuyj0XOoBL_n5tAQ&output=csv", destfile=file.path(fullFigPath, "gapminder-pop.csv"))
}
pop <- readr::read_csv(file=file.path(fullFigPath, "gapminder-pop.csv")) %>%
rename(country=`Total population`)
pop_long <- pop %>% gather(key="year", value="population", -country) %>%
filter(!is.na(population))
##Regions, see https://www.gapminder.org/gsdev
if (!file.exists(file.path(fullFigPath,"gapminder-countrygroups.xls"))) {
download.file("https://www.gapminder.org/gsdev/files/popReg/en/list_country_groups_en.xls", destfile=file.path(fullFigPath, "gapminder-countrygroups.xls"))
}
groups <- readxl::read_xls(file.path(fullFigPath, "gapminder-countrygroups.xls"), sheet=1) %>%
rename(country=Entity, region=Region, code=ID)
```
The four data sources are then joined into one long tibble `gm`. For each year we also compute the fraction a country's population makes up of the world population that year (column `w`) as well as the fraction within the year and region the population makes up (column `w_region`) :
```{r DATA_MERGE}
##Merge data together
gm <- inner_join(gini_long, gdp_long, by=c("country","year")) %>%
inner_join(pop_long, by=c("country","year")) %>%
inner_join(groups, by=c("country")) %>%
select(country, region, code, everything())
##Add weights (by year) column and by year and region
gm %<>% group_by(year) %>%
mutate(w=population/sum(population)) %>%
group_by(year, region) %>%
mutate(w_region = population/sum(population)) %>%
ungroup
gm
```
## Income Mountain Plots
The construction of the income mountain plots is thoroughly described
on the
[Gapminder webpage](https://www.gapminder.org/data/documentation/income-mountains-dataset/), but without mathematical detail. With
respect to the math it says: *"Bas van Leeuwen shared his formulas
with us and explained how to the math from ginis and mean income, to
accumulated distribution shapes on a logarithmic scale."*
Unfortunately, the formulas are not shared with the reader. It's not
black magic though: The income distribution of a country is assumed to
be [log-normal](https://en.wikipedia.org/wiki/Log-normal_distribution) with a given mean $\mu$ and standard deviation $\sigma$
on the log-scale, i.e. $X \sim \operatorname{LogN}(\mu,\sigma^2)$.
From knowing the mean income
$\overline{x}$ of the distribution as well as the Gini index $G$ of the distribution, one can show that it's possible to directly infer $(\mu, \sigma)$ of the log-normal distribution.
Because the
[Gini index](https://en.wikipedia.org/wiki/Gini_coefficient) of the log-normal distribution is given by
$$
G = 2\Phi\left(\frac{\sigma}{\sqrt{2}}\right)-1,
$$
where $\Phi$ denotes the CDF of the
standard normal distribution, and by knowing that the expectation of the log-normal is $E(X) = \exp(\mu + \frac{1}{2}\sigma^2)$, it is possible to determine $(\mu,\sigma)$ as:
$$
\sigma = \sqrt{2}\> \Phi^{-1}\left(\frac{G+1}{2}\right)
\quad\text{and}\quad
\mu = \log(\overline{x}) - \frac{1}{2} \sigma^2.
$$
We can use this to determine the parameters of the log-normal for every country in each year.
```{r XBARGINI2MUSIGMA}
## Compute the parameters of the log-normal distribution from a specified
## mean and gini-index.
xbarg_2_musigma <- function(xbar,g) {
sigma <- sqrt(2)*qnorm((g+1)/2)
mu <- log(xbar) - 1/2*sigma^2
return(c(mu=mu, sigma=sigma))
}
##Determine parameters of the log-normal modelling the gdp per country and year
gm <- gm %>% group_by(country,year) %>% do({
##Determine parameters of the log-normal distrib
theta <- xbarg_2_musigma(xbar=.$gdp, g=.$gini)
data.frame(., meanlog=theta[1], sdlog=theta[2], median=exp(theta[1]))
}) %>% ungroup
```
```{r}
n_country <- length(unique(gm$country))
```
### Mixture distribution
The income distribution of a **set of countries** is now given as a
[Mixture distribution](https://en.wikipedia.org/wiki/Mixture_distribution) of log-normals, i.e. one component for each of the countries in the set with a weight proportional to the population of the country. As an example, the world income distribution would be a mixture of the `r n_country` countries in the Gapminder dataset, i.e.
$$
f_{\text{mix}}(x) = \sum_{i=1}^{`r n_country`} w_i \>\cdot
\>f_{\operatorname{LogN}}(x; \mu_i, \sigma_i^2), \quad\text{where}
\quad w_i = \frac{\text{population}_i}{\sum_{j=1}^{`r n_country`} \text{population}_j},
$$
and $f_{\operatorname{LogN}}(x; \mu_i, \sigma_i^2)$ is the density of the log-normal distribution with country specific parameters. Note that we could have equally used the mixture approach to define the income of, e.g., a continent region. With the above definition we define standard R-functions for computing the PDF
(`dmix`), CDF (`pmix`), quantile function (`qmix`) and a function for sampling from the distribution (`rmix`) - see the [github code](`r paste0("https://raw.githubusercontent.com/hoehleatsu/hoehleatsu.github.io/master/_source/",current_input())`) for details.
```{r MIXFUNCS, echo=FALSE}
#' Generate a sample of size n from the mixture distribution.
#'
#' @param x Value where to evaluate the density (can be a vector)
#' @param meanLog Vector containing the individual meanlog parameters of each component
#' @param sdLog Vector containing the individual sdlog parameters of each component
#' @param Weight vector, should be the same length as meanlog and sum to one.
#' @return A vector of length n.
rmix <- function(n, meanlog, sdlog, w) {
##Sanity checks
stopifnot( (length(meanlog) == length(sdlog)) & (length(meanlog) == length(w)) )
##Sample component and then from its density
i <- sample(seq_len(length(meanlog)), size=n, prob=w, replace=TRUE)
rlnorm(n, meanlog=meanlog[i], sdlog=sdlog[i])
}
#' Density of the log-normal mixture
dmix <- function(x, meanlog, sdlog, w) {
##Sanity check
stopifnot( (length(meanlog) == length(sdlog)) & (length(meanlog) == length(w)) )
one_x <- function(x) sum(w*dlnorm(x, meanlog=meanlog, sdlog=sdlog))
sapply(x, one_x)
}
#' Cumulative density of the log-normal mixture
pmix <- function(q, meanlog, sdlog, w) {
##Sanity check
stopifnot( (length(meanlog) == length(sdlog)) & (length(meanlog) == length(w)) )
one_q <- function(q) sum(w*plnorm(q, meanlog=meanlog, sdlog=sdlog))
sapply(q, one_q)
}
##Quantile function of the log-normal mixture
qmix <- function(p, meanlog, sdlog, w) {
##Sanity check
stopifnot( (length(meanlog) == length(sdlog)) & (length(meanlog) == length(w)) )
##Find one quantile numerically
one_p <- function(p) {
target <- function(x) {
pmix(x, meanlog=meanlog, sdlog=sdlog, w=w) - p
}
uniroot(target,lower=0, upper=1e99)$root
}
sapply(p, one_p)
}
```
```{r GMRECENT, results='hide'}
##Restrict to year 2015
gm_recent <- gm %>% filter(year == 2015) %>% ungroup
```
We use the mixture approach to compute the density of the world income distribution obtained by "mixing" all `r n_country`
log-normal distributions. This is shown below for the World income distribution of
the year `r mean(as.numeric(gm_recent$year), na.rm=TRUE)`. Note the $\log_2$ x-axis. This presentation
is *Factfulness*' preferred way of illustrating the skew income distribution.
```{r DMIXWORLD, echo=TRUE}
<

**Edit 2018-07-05:** I checked this and read the [documentation](https://www.gapminder.org/data/documentation/income-mountains-dataset/) again more carefully: Apparently, Gapminder uses *mean household income (or consumption) per person per day (measured in PPP$ 2011)* opposite to the *GDP/capita* used in the scientific literature they quote and for which you can download the data from their website. To me this was not clear when reading Factfulness and, unfortunately, there is no documentation for how exactly their mean household income value per individual is computed from the GDP/capita.^[The Gapminder [Household Income v1](http://www.gapm.io/ioihhinc) description is currently (as of 2018-07-05) blank and so is the link specified as reference [Gapminder [3]](gapm.io/elev) in *Factfulness*. The detailed data just contain the column `household_income` without further explanation. Altogether, a somewhat disappointing number of links in the book are currently still under construction. Reading the document [Data Sources used in Don’t Panic — End Poverty](https://www.gapminder.org/news/data-sources-dont-panic-end-poverty/) it appears to me that the GDP/capita are converted to household incomes by scaling all countries GDPs per capita until the global income log-normal mixture distribution is such that *11.3% of world population are below the extreme poverty line of 1.85$/day (in PPP 2011) in year 2015*. When I tried this rather ad-hoc approach I got a scale parameter of approximately 0.379 for the GDP, which corresponds to a shift of 1.402 to the left on the log-base-2 scale. This worked ok for the single benchmark of Sweden in 1970 that I tested. Furthermore, Gapminder uses something they call **log-normal-topping** per country in order to get better tail behaviour. [Adventurous Excel-files](https://drive.google.com/drive/folders/11_k8_sTa7ycuprJjaORotbGVQyX1b_tx) not directly linked to in any explanation are used for the calculation and can be consulted for further details. The authors note themselves that they hope to convert these computations to python soon...`r emo::ji("smile")`] For the sake of illustrating the dynamics in the world income the difference in scale is not that important, though.
We end the post by animating the dynamics of the income mountains since 1950 using `gganimate`. To put it in possibilistic terms:
Let the [world move forward](https://youtu.be/hVimVzgtD6w?t=8m12s)! It is not as bad as it seems. Facts matter.
```{r ANIMATE, cache=TRUE, results='hide', message=FALSE}
##Make mountain plot with smaller intervals than in previous plot.
people_all <- gm %>% group_by(year) %>% do({
make_mountain_df(., log2x=seq(-2,9,by=0.01))
})
##Helper function to do the animation for a set of years.
animation <- function(years) {
##Largest bin to show
total <- people_all %>%
group_by(year,as.factor(mid_log2)) %>%
summarise(people=sum(people))
ymax <- max(total %$% people)
##Show all on the same y-scale
for (theYear in years) {
people_all %>% filter(year == theYear) %>%
ggplot_oneyear_mountain(ymax=ymax)
}
invisible()
}
##Years to show
years <- gm %>% filter(year >= 1950) %>% distinct(year) %$% year
ani.options(interval=0.2)
##Make an animate
animation::saveGIF(animation(years),
movie.name=file.path(fullFigPath, "moving-mountains.gif"),
ani.width=600,ani.height=300)
```