---
title: "Tibble Join Demonstration"
author: "Albina Gibadullina"
date: "October 18, 2021"
output: html_document
---

# Demonstration with `gapminder`

Load necessary packages

```{r setup, message=FALSE, warning=FALSE}
library(tidyverse)
library(gapminder)
library(gsheet) # make sure to install this package if you don't have it
```

Get an overview of `gapminder` data

```{r}
glimpse(gapminder)
```

## Part 1

Obtain additional information on countries from other open data sources

```{r}
country_data <- read.csv(file = "https://raw.githubusercontent.com/open-numbers/ddf--gapminder--geo_entity_domain/master/ddf--entities--geo--country.csv")

glimpse(country_data)
```

Narrow down information to income groups, OECD status, and religion

```{r}
country_data <- country_data %>%
  select(name, income_groups, g77_and_oecd_countries, main_religion_2008)

# Check data structure
glimpse(country_data)
```

Count how many unique country names are in `gapminder` and `country_data`

```{r}
nlevels(gapminder$country)
nlevels(as.factor(country_data$name))
```

Merge `gapminder` and `country_data` using `left_join()`

```{r}
gapminder_extended <- left_join(gapminder, country_data, by = c("country" = "name"))

head(gapminder_extended)
```

**Note:** `left_join()` is probably the most useful and the most used join. It is often used when you want to expand your existing dataset with new variables from other sources.

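
Because `left_join()` keeps every row of the left table, keys with no match in the right table simply get `NA` in the new columns. The sketch below illustrates this on two small made-up tables and uses `anti_join()` to list the unmatched keys. The table names, country names, and values are hypothetical and are not taken from `gapminder` or `country_data`.

```{r}
library(dplyr)

# Hypothetical toy tables (invented names and values, for illustration only)
left_tbl <- tibble(
  country = c("Alphaland", "Betaland (Rep.)", "Gammaland"),
  value   = c(1, 2, 3)
)
right_tbl <- tibble(
  name          = c("Alphaland", "Betaland", "Gammaland"),
  income_groups = c("high", "middle", "low")
)

# left_join() keeps all three rows of left_tbl;
# the row whose key is spelled differently gets NA for income_groups
joined <- left_join(left_tbl, right_tbl, by = c("country" = "name"))
joined

# anti_join() returns the rows of left_tbl that found no match in right_tbl
unmatched <- anti_join(left_tbl, right_tbl, by = c("country" = "name"))
unmatched
```

Running the same kind of `anti_join()` check on `gapminder` and `country_data` would show which country names, if any, are spelled differently in the two sources.
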
Compare `lifeExp` for OECD, G77, and other countries

```{r}
gapminder_extended %>%
  ggplot(aes(x = g77_and_oecd_countries, y = lifeExp)) +
  geom_boxplot() +
  geom_jitter(aes(color = continent), alpha = 0.3) +
  labs(x = "Country group")
```

Compare `lifeExp` for OECD, G77, and other countries by most common religion

```{r}
gapminder_extended %>%
  filter(main_religion_2008 %in% c("christian", "eastern_religions", "muslim")) %>%
  ggplot(aes(x = g77_and_oecd_countries, y = lifeExp)) +
  geom_boxplot() +
  geom_jitter(aes(color = continent), alpha = 0.3) +
  labs(x = "Country group") +
  facet_wrap(~main_religion_2008)
```

## Part 2

Gapminder data is only available from 1952 to 2007. What if we wanted to examine data after 2007 as well as population projections?

Download population size estimates by country from 1800 to 2100

```{r}
population <- gsheet2tbl("https://docs.google.com/spreadsheets/d/14_suWY8fCPEXV0MH7ZQMZ-KndzMVsSsA5HdR-7WqAC0/edit#gid=176703676")
```

See what population data looks like

```{r}
glimpse(population)
```

Only retain population estimates after 2007 and rename variables to match `gapminder` variable names

```{r}
population <- population %>%
  filter(time > 2007) %>%
  rename(year = time, country = name, pop = Population) %>%
  select(-geo)
```

Add continent data to `population` from `gapminder`

```{r}
# create a data frame listing continent for every country
continent <- gapminder %>%
  select(country, continent) %>%
  distinct()

# add continent data to population data frame
population <- left_join(population, continent, by = "country")

# see how many countries are missing continent data in each year
population %>%
  group_by(year) %>%
  summarise(missing_continent = sum(is.na(continent)))
```

Use `bind_rows()` to stack `population` below `gapminder`

```{r}
gapminder_pop <- bind_rows(gapminder, population) %>%
  arrange(country, year)
```

Visualize trends in population growth by continent

```{r}
gapminder_pop %>%
  filter(!is.na(continent)) %>%
  group_by(continent, year) %>%
  summarise(pop = sum(pop) / 1000000) %>%
  ggplot(aes(x = year, y = pop, fill = continent)) +
  geom_area() +
  labs(title = "Population projections by continent", y = "Population (millions)")
```
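
The `bind_rows()` step used earlier matches columns by name, so columns that exist in only one of the tables are filled with `NA` for the rows coming from the other. The sketch below shows this on two small made-up tables, assuming a recent version of `dplyr`. The names `observed` and `projected` and all values are hypothetical stand-ins for `gapminder` and `population`.

```{r}
library(dplyr)

# Hypothetical toy tables (invented values, for illustration only)
observed <- tibble(
  country = factor(c("A", "B")),  # factor, like gapminder$country
  year    = c(2007L, 2007L),
  pop     = c(10, 20),
  lifeExp = c(70, 75)
)
projected <- tibble(
  country = c("A", "B"),          # character, and no lifeExp column
  year    = c(2010L, 2010L),
  pop     = c(11, 22)
)

# Columns are matched by name; lifeExp is NA for the rows from `projected`
stacked <- bind_rows(observed, projected) %>%
  arrange(country, year)
stacked
```

In this sketch the stacked `country` column ends up as character, because one input stores it as a factor and the other as character. The same kind of type change is worth checking in `gapminder_pop` before relying on factor levels.
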