--- title: "Reproducible data collection" author: Ian D. Gow date: 2026-01-05 date-format: "D MMMM YYYY" number-sections: true format: html: colorlinks: true pdf: colorlinks: true geometry: - left=2cm - right=2cm papersize: a4 mainfont: TeX Gyre Pagella mathfont: TeX Gyre Pagella Math bibliography: papers.bib csl: jfe.csl --- # Introduction An exercise I have assigned to students in the past is to go to the [online supplements and datasheets](https://www.chicagobooth.edu/research/chookaszian/journal-of-accounting-research/online-supplements-and-datasheets) page for the *Journal of Accounting Research*, pick an issue of the journal and evaluate whether one could reproduce the analysis in the associated paper using the materials made available there. Generally, the answer has been negative.^[Not always, as several of the papers covered in [*Empirical Research in Accounting: Tools and Methods*](https://iangow.github.io/far_book/) are replicated there using code from the *Journal of Accounting Research* page.] That said, it seems that the *Journal of Accounting Research* is still the (relative) leader among accounting-focused academic journals with regard to requiring authors to supply materials. :::{.callout-tip text-align="left"} In writing this note, I use several packages including those listed below.^[Execute `install.packages(c("tidyverse", "DBI", "duckdb", "remotes", "arrow", "dbplyr"))` within R to install all the packages you need to run the code in this note.] This note was written using [Quarto](https://quarto.org) and compiled with [RStudio](https://posit.co/products/open-source/rstudio/), an integrated development environment (IDE) for working with R. The source code for this note is available [here](https://raw.githubusercontent.com/iangow/notes/main/repro_data.qmd) and the latest version of this PDF is [here](https://raw.githubusercontent.com/iangow/notes/main/repro_data.pdf). Note that the `duckdb_to_parquet()` function used below is currently only available in the development version of `farr`. Use `remotes::install_github("iangow/farr")` to install this version. ```{r} #| message: false library(tidyverse) library(DBI) library(farr) ``` ::: My reason for visiting the datasheets page this week was to get a sense for how people are doing data analysis these days. Are people moving from R to Python, as the latter gets stronger for data science? Or are they preferring the domain-specific strengths of R? The answer is: neither. Based on the six papers in [Volume 63, Issue 1](https://www.chicagobooth.edu/research/chookaszian/journal-of-accounting-research/online-supplements-and-datasheets/volume-63) that are not listed as "datasheet and code forthcoming" and that provide something in terms of data, all used some combination of SAS and Stata. This is probably not much different from what you would have seen around fifteen years ago. The persistence of SAS and Stata surprises me given how easy modern tools make a lot of data analysis steps.^[It seems that the typical SAS user dumps data to Excel to make plots and the typical Stata user exports tables to Excel for copy-pasting to Word. Does the typical Python or R user do that too?] But looking at the [first paper](https://onlinelibrary.wiley.com/doi/10.1111/1475-679X.12591) in Volume 63, Issue 2, I see a mix of Stata and R. So, moving to the assignment I have given to students in the past, how reproducible is the analysis contained therein? I made some progress on this, but I didn't get far (to be fair, I didn't try very hard). 
# Employment data

The first data set in the R code file is described as "Employment and Investment" in the PDF data sheet (`DLR datasheet.pdf`).
The source is "the [U.S. Bureau of Labor Statistics (BLS) website](https://www.bls.gov/audience/economists.htm)" and the authors say "we downloaded Zip files. We then merged these data with our sample dataset. There was no manual entry at any point in the process."
However, when I went to the website, I struggled to figure out which data file covers "employment and investment".

## Getting the raw data

Based on the file names listed in the code (e.g., `2001_annual_singlefile.zip`), ChatGPT suggested that I should go to the BLS's [QCEW data files page](https://www.bls.gov/cew/downloadable-data-files.htm), where QCEW stands for "Quarterly Census of Employment and Wages" (not sure what happened to "investment" ... I guess it's a typo of sorts).
If the authors had provided that link, it would have been very helpful for anyone trying to reproduce their analysis.

Getting to the QCEW site, I guess the authors clicked the link for each year's data---the paper used data from 2001 to 2019---and saved the `.zip` data file somewhere on one of their computers.
Again, one can do better.
A small function can get a single year's data and `lapply()` can do this for all years.
I tend to use an environment variable (`RAW_DATA_DIR`) to indicate where I put "raw" data files like these and, in this case, I put them in a directory named `qcew` under that directory.
Using environment variables in this way yields code that is much more reproducible than hard-coded paths like `D:\Users\me\my_projects\this_project` that seem very common on the JAR datasheets site.
Note that the code below does not download a file if I already have it:

```{r}
#| label: years
#| cache: true
years <- 2001:2019L
```

```{r}
get_qcew_zip_data <- function(year, raw_data_dir = NULL, schema = "qcew") {
  if (is.null(raw_data_dir)) {
    raw_data_dir <- file.path(Sys.getenv("RAW_DATA_DIR"), schema)
  }
  if (!dir.exists(raw_data_dir)) dir.create(raw_data_dir, recursive = TRUE)
  url <- stringr::str_glue("https://data.bls.gov/cew/data/files/{year}",
                           "/csv/{year}_annual_singlefile.zip")
  t <- file.path(raw_data_dir, basename(url))
  if (!file.exists(t)) download.file(url, t)
  invisible(t)
}

res <- lapply(years, get_qcew_zip_data)
```

## Processing the data: The authors' approach

The original R code supplied by the authors had code more or less like this (I omitted the code for `bls03` through `bls17` for reasons of space):

```{r}
#| eval: false
bls01_in <- read.csv(unzip("Data/BLS/2001_annual_singlefile.zip")); bls01 <- bls_process(bls01_in); rm(bls01_in)
bls02_in <- read.csv(unzip("Data/BLS/2002_annual_singlefile.zip")); bls02 <- bls_process(bls02_in); rm(bls02_in)
...
bls18_in <- read.csv(unzip("Data/BLS/2018_annual_singlefile.zip")); bls18 <- bls_process(bls18_in); rm(bls18_in)
bls19_in <- read.csv(unzip("Data/BLS/2019_annual_singlefile.zip")); bls19 <- bls_process(bls19_in); rm(bls19_in)

bls_all <- rbind(bls01, bls02, bls03, bls04, bls05, bls06, bls07, bls08, bls09,
                 bls10, bls11, bls12, bls13, bls14, bls15, bls16, bls17,
                 bls18, bls19)
```

I tweaked this code a little bit without changing its basic function.
First, I replaced `Data/BLS/` with something based on where I downloaded the files (`{raw_data_dir}`).
Second, I put the `unzip()` and `read.csv()` calls *inside* the `bls_process()` function.
Third, I replaced hard-coded file names such as `2001_annual_singlefile.zip` with something like `{year}_annual_singlefile.zip`.
Finally, I replaced the code creating `bls_all` with `bls_all <- do.call(rbind, lapply(2001:2019L, bls_process))`.
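To give a sense of the shape of the modified code, here is a minimal sketch of a per-year function combined with `lapply()`.
This is *not* the authors' `bls_process()`---their filtering and aggregation steps live in the linked R file---so that part is left as a placeholder comment.

```{r}
#| eval: false
# Sketch only: the real bls_process() applies the authors' filters and
# aggregation; here that step is left as a placeholder comment.
bls_process <- function(year, raw_data_dir = NULL, schema = "qcew") {
  if (is.null(raw_data_dir)) {
    raw_data_dir <- file.path(Sys.getenv("RAW_DATA_DIR"), schema)
  }
  zip_file <- file.path(raw_data_dir,
                        stringr::str_glue("{year}_annual_singlefile.zip"))
  csv_file <- unzip(zip_file, exdir = tempdir())
  df <- read.csv(csv_file)
  unlink(csv_file)
  # ... the authors' filtering and aggregation steps would go here ...
  df
}

bls_all <- do.call(rbind, lapply(2001:2019L, bls_process))
```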
I put the modified code in a small R file that I call using `source()`.
The modified "original" code I used is available [here](https://raw.githubusercontent.com/iangow/notes/main/get_bls_data_orig.R).

```{r}
#| cache: true
#| warning: false
system.time(source("get_bls_data_orig.R"))
```

So this takes about six minutes.
This is not a big deal for the researchers, who probably worked on this project for years.
But it's an additional cost for anyone attempting to replicate the analysis in the paper.
Note that the original "original" code left a lot of unzipped text files lying about on my hard drive; my modified code removes these (using `unlink()`).
Also, quite a lot of memory is used up by having all the data frames in memory before calling `rbind()`.

## Processing the data: My first attempt

Can we do better?
I think one approach is to put the processed data into a **parquet data repository** of the kind I describe [here](https://iangow.github.io/far_book/parquet-wrds.html).
I use the environment variable `DATA_DIR` to keep track of the location of my repository (note that this could be different for different projects) and put data files in different "schemas" within `DATA_DIR`.
In this case, I will put the data under `qcew`.
For my first attempt, I will use `read_csv()` from the `readr` package to read in the unzipped files and then `write_parquet()` from the `arrow` package to save the data to a parquet file in my data repository.

```{r}
#| label: get_qcew_data
#| eval: true
#| cache: true
get_qcew_data <- function(year, schema = "qcew",
                          data_dir = NULL, raw_data_dir = NULL) {
  if (is.null(raw_data_dir)) {
    raw_data_dir <- file.path(Sys.getenv("RAW_DATA_DIR"), schema)
  }
  if (is.null(data_dir)) data_dir <- Sys.getenv("DATA_DIR")
  data_dir <- file.path(data_dir, schema)
  if (!dir.exists(data_dir)) dir.create(data_dir, recursive = TRUE)

  filename <- stringr::str_glue("{year}_annual_singlefile.zip")
  t <- path.expand(file.path(raw_data_dir, filename))
  pq_path <- stringr::str_c("annual_", year, ".parquet")

  readr::read_csv(t, show_col_types = FALSE, guess_max = 100000) |>
    arrow::write_parquet(sink = file.path(data_dir, pq_path))
  return(TRUE)
}
```

Running this code, I see a bit of a performance pick-up, though it's not (yet) an apples-to-apples comparison because my code is simply transforming the zipped CSV files into parquet files, while the "original" code was creating CSV files based on subsets of the data.
While it will turn out (once the remaining steps are added below) that the comparison is still a fair one, I wonder if I can do better.
For one thing, there is still about 3 GB of RAM used in the process of loading data into R and then saving to parquet files one at a time.

```{r}
#| label: get_qcew_data_apply
#| cache: true
#| dependson: years, get_qcew_data
system.time(lapply(years, get_qcew_data))
```

## Processing the data: My second attempt

For my second approach, I will do all the data processing using DuckDB.
I unzip the data, then use DuckDB's `read_csv()` piped directly into parquet files, again saved in the same location.^[So the files created in the previous step will be overwritten here.]
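The core of this approach is a single DuckDB `COPY` statement that reads the CSV file and writes a parquet file without ever materializing the data in R.
A minimal sketch (with `csv_file` and `pq_path` as placeholder paths) looks like this; the full function below constructs those paths from the environment variables and cleans up after itself.

```{r}
#| eval: false
# Minimal sketch of the DuckDB CSV-to-parquet step.
# `csv_file` and `pq_path` are placeholders for the unzipped CSV file and
# the parquet destination; the function below constructs these properly.
db <- DBI::dbConnect(duckdb::duckdb())
sql <- stringr::str_glue("COPY (SELECT * FROM read_csv('{csv_file}')) ",
                         "TO '{pq_path}' (FORMAT parquet)")
DBI::dbExecute(db, sql)
DBI::dbDisconnect(db)
```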
```{r}
#| label: get_qcew_data_duckdb
#| cache: true
get_qcew_data_duckdb <- function(year, schema = "qcew",
                                 data_dir = NULL, raw_data_dir = NULL) {
  if (is.null(raw_data_dir)) {
    raw_data_dir <- file.path(Sys.getenv("RAW_DATA_DIR"), schema)
  }
  if (is.null(data_dir)) data_dir <- Sys.getenv("DATA_DIR")
  data_dir <- file.path(data_dir, schema)
  if (!dir.exists(data_dir)) dir.create(data_dir, recursive = TRUE)

  filename <- stringr::str_glue("{year}_annual_singlefile.zip")
  t <- path.expand(file.path(raw_data_dir, filename))
  csv_file <- unzip(t, exdir = tempdir())
  pq_file <- stringr::str_glue("annual_{year}.parquet")
  pq_path <- path.expand(file.path(data_dir, pq_file))

  db <- DBI::dbConnect(duckdb::duckdb())
  args <- ", types = {'year': 'INTEGER'}"
  sql <- stringr::str_glue("COPY (SELECT * FROM read_csv('{csv_file}'{args})) ",
                           "TO '{pq_path}' (FORMAT parquet)")
  res <- DBI::dbExecute(db, sql)
  DBI::dbDisconnect(db)
  unlink(csv_file)
  res
}
```

How does this code do?
Well, much faster!
And much less RAM is used (more like hundreds of megabytes than gigabytes).

```{r}
#| label: get_qcew_data_duckdb_apply
#| cache: true
#| dependson: years, get_qcew_data_duckdb
system.time(lapply(2001:2019L, get_qcew_data_duckdb))
```

### Creating a code repository for the raw data

Given the general-purpose nature of the QCEW data, it seems to be a good candidate for a public repository that facilitates sharing of the data.
I made such a repository [here](https://github.com/iangow/qcew).

### Performing the final steps

I said before that the comparison was a bit apples-to-oranges because I had skipped some steps from the "original" code.
For one thing, the original code applied some filters (e.g., `industry_code < 100` and `own_code %in% c(0, 5)`).
I apply these filters in the code below.
First, I connect to a fresh DuckDB database (a single line of code).
Second, I use `load_parquet()` from the `farr` package with a wildcard (`"annual_*"`) to load *all* the QCEW data into a single table.
I then apply the filters.
As you can see, this step is *fast*.
One reason it is so fast is that the `dbplyr` package I am using to connect to DuckDB uses **lazy evaluation**, so it doesn't actually realize any data in this step (I illustrate this below).

```{r}
#| include: false
# Prevent conflict regarding `mutate()`.
if ("package:plyr" %in% search()) {
  detach("package:plyr", unload = TRUE)
}
```

```{r}
db <- DBI::dbConnect(duckdb::duckdb())

qcew_data <-
  load_parquet(db, "annual_*", schema = "qcew") |>
  filter(year %in% years) |>
  mutate(industry_code = case_when(industry_code == "31-33" ~ "31",
                                   industry_code == "44-45" ~ "44",
                                   industry_code == "48-49" ~ "48",
                                   .default = industry_code),
         industry_code = as.integer(industry_code)) |>
  filter(industry_code < 100) |>
  # and for now just keep the ownership codes for
  #  - (a) total employment levels -- own_code = 0; and
  #  - (b) total private sector employment -- own_code = 5
  filter(own_code %in% c(0, 5)) |>
  system_time()
```

I can take a quick peek at a sample from this data set:

```{r}
qcew_data
```
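To see that `qcew_data` is still a lazy query---no data have been pulled into R at this point---one can ask `dbplyr` to show the SQL it has generated:

```{r}
#| eval: false
# `qcew_data` is a remote (lazy) table: dbplyr has only built a SQL query.
# show_query() prints that SQL without executing it; collect() would bring
# the results into R as a local tibble.
qcew_data |>
  show_query()
```

DuckDB only does the heavy lifting when the results are written out with `duckdb_to_parquet()` below (or brought into R with `collect()`).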
The original code created three data sets (all saved as CSV files): `bls_state`, `bls_county`, and `bls_national`.^[I use the same names for these files as were used by the original authors.]
The following code chunks create each of these in turn and save them in the `dlr` schema (this might be considered to be the "project directory" for the paper, with `qcew` being a schema shared across projects).
Note that the `duckdb_to_parquet()` function returns a DuckDB table based on the created parquet file, so it can be examined and used in subsequent analysis.
As can be seen, none of these steps takes much time.
Hence the apples-to-apples aspect of the performance comparison can be restored by adding up the time taken for all the steps (under a minute, given that the zipped CSV files are already on my hard drive).
Note that the underlying CSV files represented about 10 gigabytes of data, so the fact that creating data sets from the parquet versions of these can take a second or less is quite impressive.

```{r}
bls_county <-
  qcew_data |>
  filter(agglvl_code > 69, agglvl_code < 80) |>
  duckdb_to_parquet(name = "bls_county", schema = "dlr") |>
  system_time()

bls_county
```

```{r}
bls_state <-
  qcew_data |>
  filter(agglvl_code > 49, agglvl_code < 60) |>
  duckdb_to_parquet(name = "bls_state", schema = "dlr") |>
  system_time()
```

```{r}
bls_national <-
  qcew_data |>
  filter(agglvl_code == 10) |>
  duckdb_to_parquet(name = "bls_national", schema = "dlr") |>
  system_time()
```

# Revising the datasheet

With all the pieces above in place, we can reimagine the datasheet entry for this data set.
Because everything has been done using code and because each element has been done in a way that is easy to share, preparing the datasheet requires very little incremental effort.
In many ways, research teams should maintain datasheets like this for their own purposes anyway, which would make preparing the final datasheet quite trivial.
Here's my attempt:

:::{.callout text-align="left"}

**2. Employment**

We sourced raw data from the [website](https://www.bls.gov/cew/downloadable-data-files.htm) for the Quarterly Census of Employment and Wages (QCEW) program of the United States Bureau of Labor Statistics (BLS).
We downloaded and processed the raw data using code available at <https://github.com/iangow/qcew>.
The downloaded Zip files can be found [here](https://www.dropbox.com/scl/fo/rb1298e5oyio1oqujk4w3/AH12yMReDIe86N-jarXuKyI?rlkey=lipfmv05c6nqwlinkokql9w89&dl=0) and the processed parquet data files can be found [here](https://www.dropbox.com/scl/fo/g5yat3ugd6n95fl0y71kp/AHX6QGGxWA6-Bid69a-Ys1Q?rlkey=dgzcgfdq7h7by0b1qpxwgedce&dl=0).

```{r}
#| message: false
library(DBI)
library(tidyverse)
library(farr)

years <- 2001:2019L

db <- DBI::dbConnect(duckdb::duckdb())

qcew_data <-
  load_parquet(db, "annual_*", schema = "qcew") |>
  filter(year %in% years) |>
  mutate(industry_code = case_when(industry_code == "31-33" ~ "31",
                                   industry_code == "44-45" ~ "44",
                                   industry_code == "48-49" ~ "48",
                                   .default = industry_code),
         industry_code = as.integer(industry_code)) |>
  filter(industry_code < 100) |>
  # and for now just keep the ownership codes for
  #  - (a) total employment levels -- own_code = 0; and
  #  - (b) total private sector employment -- own_code = 5
  filter(own_code %in% c(0, 5))
```

We used data from years `r min(years)` through `r max(years)` and applied filters based on industry (`industry_code` less than 100) and ownership (`own_code %in% c(0, 5)`).
We then created three data sets at different levels of aggregation: county, state, and national.
```{r}
bls_county <-
  qcew_data |>
  filter(agglvl_code > 69, agglvl_code < 80) |>
  duckdb_to_parquet(name = "bls_county", schema = "dlr")

bls_state <-
  qcew_data |>
  filter(agglvl_code > 49, agglvl_code < 60) |>
  duckdb_to_parquet(name = "bls_state", schema = "dlr")

bls_national <-
  qcew_data |>
  filter(agglvl_code == 10) |>
  duckdb_to_parquet(name = "bls_national", schema = "dlr")
```

These data sets are available [here](https://www.dropbox.com/scl/fo/lb3r9vi2gxouavm5i1pn2/AIR3AbhBa26Rr09B967_MXM?rlkey=8vwxts0zcirc8gnf77vjvrp4l&dl=0).
:::

While this entry is much longer than the original entry, I think the benefits of greater transparency for any reader make the extra length more than worthwhile.

# Conclusion

The upshot of all this is that, I think, it is possible to have completely reproducible data steps that are also researcher-friendly and require very little in the way of computing resources.
For example, all the QCEW parquet data files created using my version of the code can be found [here](https://www.dropbox.com/scl/fo/g5yat3ugd6n95fl0y71kp/AHX6QGGxWA6-Bid69a-Ys1Q?rlkey=dgzcgfdq7h7by0b1qpxwgedce&dl=0).^[Note that this is as far as I got in looking at the reproducibility of the steps in the original paper.]

```{r}
#| include: false
dbDisconnect(db)
```