3.6 An Application to the Gender Gap of Earnings
This section discusses how to reproduce the results presented in the box "The Gender Gap of Earnings of College Graduates in the United States" in the book.
To reproduce Table 3.1 of the book, you need the replication data, which are hosted by Pearson and can be downloaded here. The file contains data ranging from \(1992\) to \(2008\), and earnings are reported in \(2008\) prices.
There are several ways to import the .xlsx-files into R. Our suggestion is the function read_excel() from the readxl package (Wickham and Bryan 2023). The package is not a part of R’s base version and has to be installed manually.
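A minimal sketch of the installation step could look as follows. The check via requireNamespace() is our own addition so that the package is only downloaded if it is not yet available.

```r
# install readxl only if it is not yet available on this machine
if (!requireNamespace("readxl", quietly = TRUE)) {
  install.packages("readxl")
}

# attach the package so that read_excel() can be called directly
library(readxl)
```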
You are now ready to import the dataset. Make sure you use the correct path to import the downloaded file! In our example, the file is saved in a subfolder of the working directory named data. If you are not sure what your current working directory is, use getwd(), see also ?getwd. This will give you the path that points to the place R is currently looking for files to work with.
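The import might then be sketched as below. Note that the file name cps_ch3.xlsx is an assumption on our part; replace it with the name of the file you actually downloaded. The check with file.exists() is also our own addition and merely avoids an error if the path is wrong.

```r
library(readxl)

# path to the downloaded file, relative to the working directory
# (the file name "cps_ch3.xlsx" is assumed, adjust it if necessary)
cps_path <- file.path("data", "cps_ch3.xlsx")

# import the data only if the file can be found at the given path
if (file.exists(cps_path)) {
  cps <- read_excel(path = cps_path)
} else {
  message("File not found, check the path: ", cps_path)
}
```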
Next, install and load the package dplyr (Wickham et al. 2023). This package provides some handy functions that simplify data wrangling a lot. It makes use of the %>% operator.
First, get an overview of the dataset. Next, use %>% and some functions from the dplyr package to group the observations by gender and year and compute descriptive statistics for both groups.
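As with readxl, a sketch of this step is given below. The requireNamespace() check is again our own addition.

```r
# install dplyr only if it is not yet available on this machine
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# attach the package; this also makes the %>% operator available
library(dplyr)
```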
# get an overview of the data structure
head(cps)
#> # A tibble: 6 × 3
#> a_sex year ahe08
#> <dbl> <dbl> <dbl>
#> 1 1 1992 17.2
#> 2 1 1992 15.3
#> 3 1 1992 22.9
#> 4 2 1992 13.3
#> 5 1 1992 22.1
#> 6 2 1992 12.2
# group data by gender and year and compute the mean, standard deviation
# and number of observations for each group
avgs <- cps %>%
  group_by(a_sex, year) %>%
  summarise(mean(ahe08),
            sd(ahe08),
            n())
# print the results to the console
print(avgs)
#> # A tibble: 10 × 5
#> # Groups: a_sex [2]
#> a_sex year `mean(ahe08)` `sd(ahe08)` `n()`
#> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 1 1992 23.3 10.2 1594
#> 2 1 1996 22.5 10.1 1379
#> 3 1 2000 24.9 11.6 1303
#> 4 1 2004 25.1 12.0 1894
#> 5 1 2008 25.0 11.8 1838
#> 6 2 1992 20.0 7.87 1368
#> 7 2 1996 19.0 7.95 1230
#> 8 2 2000 20.7 9.36 1181
#> 9 2 2004 21.0 9.36 1735
#> 10 2 2008 20.9 9.66 1871
With the pipe operator %>%, we simply chain different R functions that produce compatible input and output. In the code above, we take the dataset cps and use it as an input for the function group_by(). The output of group_by() is subsequently used as an input for summarise().
Now that we have computed the statistics of interest for both genders, we can investigate how the gap in earnings between the two groups evolves over time.
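To illustrate the chaining, the following sketch compares a piped call with the equivalent nested call. The data frame toy and its columns group and value are hypothetical and serve only as an example; they are not part of the CPS data.

```r
library(dplyr)

# hypothetical toy data, not taken from the CPS file
toy <- data.frame(group = c("a", "a", "b", "b"),
                  value = c(1, 3, 10, 20))

# piped version: the steps are read from top to bottom
piped <- toy %>%
  group_by(group) %>%
  summarise(avg = mean(value))

# equivalent nested version: the steps are read from the inside out
nested <- summarise(group_by(toy, group), avg = mean(value))

# both versions produce the same group means
identical(piped$avg, nested$avg)
```

The piped version tends to be easier to read once more than two functions are involved, which is why it is used throughout this section.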
# split the dataset by gender
male <- avgs %>% dplyr::filter(a_sex == 1)
female <- avgs %>% dplyr::filter(a_sex == 2)
# rename columns of both splits
colnames(male) <- c("Sex", "Year", "Y_bar_m", "s_m", "n_m")
colnames(female) <- c("Sex", "Year", "Y_bar_f", "s_f", "n_f")
# estimate gender gaps
gap <- male$Y_bar_m - female$Y_bar_f
# compute standard errors
gap_se <- sqrt(male$s_m^2 / male$n_m + female$s_f^2 / female$n_f)
# compute confidence intervals for all dates
gap_ci_l <- gap - 1.96 * gap_se
gap_ci_u <- gap + 1.96 * gap_se
result <- cbind(male[,-1], female[,-(1:2)], gap, gap_se, gap_ci_l, gap_ci_u)
# print the results to the console
print(result, digits = 3)
#> Year Y_bar_m s_m n_m Y_bar_f s_f n_f gap gap_se gap_ci_l gap_ci_u
#> 1 1992 23.3 10.2 1594 20.0 7.87 1368 3.23 0.332 2.58 3.88
#> 2 1996 22.5 10.1 1379 19.0 7.95 1230 3.49 0.354 2.80 4.19
#> 3 2000 24.9 11.6 1303 20.7 9.36 1181 4.14 0.421 3.32 4.97
#> 4 2004 25.1 12.0 1894 21.0 9.36 1735 4.10 0.356 3.40 4.80
#> 5 2008 25.0 11.8 1838 20.9 9.66 1871 4.10 0.354 3.41 4.80
We obtain virtually the same results as those presented in the book. The computed statistics suggest that there is a gender gap in earnings. Note that, for each of the periods considered, we can reject the null hypothesis that the gap is zero at the \(5\%\) level, since none of the \(95\%\) confidence intervals contains zero. Further, the estimates of the gap and the bounds of the \(95\%\) confidence intervals indicate that the gap has been quite stable in the recent past.
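As a rough check of the testing claim, the sketch below recomputes \(t\)-statistics and two-sided \(p\)-values under the normal approximation. It uses the rounded estimates printed in the table above, so the results are approximate; the object names t_stat and p_value are our own.

```r
# rounded gap estimates and standard errors as printed in the table above
gap    <- c(3.23, 3.49, 4.14, 4.10, 4.10)
gap_se <- c(0.332, 0.354, 0.421, 0.356, 0.354)

# t-statistics for the null hypothesis that the gap is zero
t_stat <- gap / gap_se

# two-sided p-values based on the standard normal distribution
p_value <- 2 * pnorm(-abs(t_stat))

# check whether the null is rejected at the 5% level in every year
all(abs(t_stat) > 1.96)
#> [1] TRUE
```

With the objects from the code above at hand, the same computation can be done without rounding by using gap and gap_se directly.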