---
title: 'Data Steps'
output:
  html_document:
    theme: readable
    df_print: paged
    highlight: tango
    toc: yes
    toc_float: no
---

# Tutorial on cleaning the LTDB

## Background

As is mentioned in the [Week 02 lab instructions](lab-02-instructions.html), this course will be using the Longitudinal Tracts Data Base (LTDB) for over-time analysis with census data. This is a great resource for studying communities because the researchers harmonized 40 years of census data by apportioning older data to the 2010 census tract boundaries, which allows for analysis of consistent geographic units over time.

## Challenge

Unfortunately, the data are not ready to use right away, so we need to clean them first. The challenge is that we need to restructure the input census datasets before we can fully utilize the over-time aspects of the data.

## Goal

The following chunks depend on you having clean data in your `data/rodeo` folder. Those files are generated by following along with this tutorial.

## Run this tutorial locally

This HTML file is meant for you to view the data cleaning steps nicely from your laptop. However, you need the files produced in this tutorial for future labs, so **you need to run this tutorial locally**.

To re-run the tutorial locally, follow these steps:

1. Open a blank text file in RStudio by going to `File` --> `New File` --> `Text File`.
2. [Copy the raw `.rmd` file](https://raw.githubusercontent.com/DS4PS/cpp-528-fall-2020/master/labs/PROJECT-DATA-STEPS.rmd).
3. Paste the `.rmd` code into the blank text file.
4. Save the text file as `labs/wk03/data_steps.rmd`. You do not **have** to store it in the `wk03` sub-directory, but you **must** save the file with the `.rmd` extension.
5. Click on the `Knit` button to knit the results to an HTML file and to produce all of the clean data files within `data/rodeo`.

```{r setup, include=FALSE}
knitr::opts_chunk$set( echo=TRUE, message=F, warning=F, eval=T )
```

With all that said, let's get started on cleaning the LTDB!

```{r load necessary packages, message=FALSE}
# load necessary packages ----
library( dplyr )
library( here )
library( knitr )
library( pander )
```

# Inspect Data

First, let's inspect the raw data.

*Note: please do not import files using static file paths. Notice the use of `here::here()` down below.*

## Check 2010 Data

```{r check 2010 data}
# load all data as character vectors
d.2010.samp <- read.csv( here::here( "data/raw/ltdb_std_2010_sample.csv" ),
                         colClasses="character" )
str( d.2010.samp[1:10] )
```

Check 2010 summary stats:

```{r check 2010 summary stats}
head( d.2010.samp$p65hsp12 )

# missing values coded as -999
sum( d.2010.samp$p65hsp12 == "-999" )

summary( as.numeric( d.2010.samp$p65hsp12 ) )
```

We have problems with missing data coded as -999, which will cause issues with any analysis.
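Before fixing the problem, it helps to see how widespread the placeholder is. The chunk below is a quick diagnostic sketch (optional; it is not one of the required cleaning steps): it simply counts the `-999` codes in every column of the 2010 sample file.

```{r count missing value placeholders}
# count the -999 missing value codes in each column
# and show the most affected variables
n.999 <- sapply( d.2010.samp, function(x){ sum( x == "-999", na.rm=T ) } )
sort( n.999, decreasing=TRUE ) %>% head() %>% pander()
```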
## Remove Missing Value Codes

Remove the "-999" missing value codes and replace them with the variable mean (the function below first recodes them to `NA`, then imputes the column mean).

```{r remove missing value codes}
# convert variables to numeric,
# remove the missing value placeholders,
# and impute missing values with the column mean
clean_x <- function( x )
{
  x <- as.numeric( x )
  x[ x == -999 ] <- NA
  mean.x <- mean( x, na.rm=T )
  x[ is.na(x) ] <- mean.x
  return(x)
}

# apply the clean_x() function to all columns
clean_d <- function( d, start.column )
{
  # d <- fix_names( d )
  these <- start.column:ncol(d)
  d[ these ] <- lapply( d[ these ], clean_x )
  return( d )
}
```

Test the code:

```{r test the clean_d() function}
# first four columns are unique IDs - leave them as character vectors
d.2010.samp <- clean_d( d.2010.samp, start.column=5 )
str( d.2010.samp[1:10] )
summary( d.2010.samp$p65hsp12 ) %>% pander()
```

That works!

# Tidy Up Dataframes

We want to standardize the datasets across all of the years so that they are all clean, have the same structure, use the same variable name conventions, etc.

```{r create standardization functions}
# FIX VARIABLE NAMES
# input: data frame
# standardize variable names
# output: data frame with fixed names
fix_names <- function( d )
{
  nm <- names( d )
  nm <- tolower( nm )

  nm[ nm == "statea"  ] <- "state"
  nm[ nm == "countya" ] <- "county"
  nm[ nm == "tracta"  ] <- "tract"
  nm[ nm == "trtid10" ] <- "tractid"
  nm[ nm == "mar-70"  ] <- "mar70"
  nm[ nm == "mar-80"  ] <- "mar80"
  nm[ nm == "mar-90"  ] <- "mar90"
  nm[ nm == "mar.00"  ] <- "mar00"
  nm[ nm == "x12.mar" ] <- "mar12"
  nm <- gsub( "sp1$", "", nm )
  nm <- gsub( "sp2$", "", nm )
  nm <- gsub( "sf3$", "", nm )
  nm <- gsub( "sf4$", "", nm )

  # nm <- gsub( "[0-9]{2}$", "", nm )

  names( d ) <- nm
  return( d )
}

# FIX TRACT IDS
# put into format: SS-CCC-TTTTTT
fix_ids <- function( x )
{
  x <- stringr::str_pad( x, 11, pad = "0" )
  state  <- substr( x, 1, 2 )
  county <- substr( x, 3, 5 )
  tract  <- substr( x, 6, 11 )
  x <- paste( "fips", state, county, tract, sep="-" )
  return(x)
}

tidy_up_data <- function( file.name )
{
  # store the file path as a character vector
  path <- paste0( "data/raw/", file.name )
  # read in the file path using here::here()
  d <- read.csv( here::here( path ), colClasses="character" )
  type <- ifelse( grepl( "sample", file.name ), "sample", "full" )
  year <- substr( file.name, 10, 13 )

  # fix names
  d <- fix_names( d )

  # fix leading zero problem in tract ids
  d$tractid <- fix_ids( d$tractid )

  # drop meta-vars
  drop.these <- c("state", "county", "tract", "placefp10",
                  "cbsa10", "metdiv10", "ccflag10",
                  "globd10", "globg10", "globd00", "globg00",
                  "globd90", "globg90", "globd80", "globg80")
  d <- d[ ! names(d) %in% drop.these ]

  # column position where variables start after IDs
  d <- clean_d( d, start.column=2 )

  # add year and type (sample/full)
  d <- data.frame( year, type, d, stringsAsFactors=F )

  return( d )
}
```
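As a quick sanity check, `fix_ids()` can be tried on a couple of made-up tract IDs before running the full pipeline. The IDs below are hypothetical and only illustrate the leading-zero fix (the first ID lost its leading zero when it was stored as a number).

```{r quick check of fix_ids()}
# hypothetical tract IDs: the first is 10 digits because it lost its leading zero,
# the second is already a complete 11-digit ID
fix_ids( c( "1073000100", "48201311500" ) )
# expected: "fips-01-073-000100" "fips-48-201-311500"
```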
Test the code. The following chunk is set to `eval=FALSE` because it is not required for you to generate the final outputs.

*Note: `tidy_up_data()` is able to read in the data because it does not import files using static file paths. Notice the use of `here::here()` in the `tidy_up_data()` source code above.*

```{r test tidy_up_data() function, eval=F}
file.name <- "ltdb_std_2010_sample.csv"
d.2010.s <- tidy_up_data( file.name )
head( d.2010.s[1:20] ) %>% pander()

file.name <- "LTDB_Std_2010_fullcount.csv"
d.2010.f <- tidy_up_data( file.name )
head( d.2010.f[1:20] ) %>% pander()

d2 <- bind_rows( d.2010.s, d.2010.f )

file.name <- "ltdb_std_2000_sample.csv"
d.2010.s <- tidy_up_data( file.name )
head( d.2010.s[1:20] ) %>% pander()

file.name <- "LTDB_Std_2000_fullcount.csv"
d.2010.f <- tidy_up_data( file.name )
head( d.2010.f[1:20] ) %>% pander()

d2 <- bind_rows( d.2010.s, d.2010.f )
```

Clean and tidy all data from the same year, then combine the sample and full-count data frames into a single table.

*Notice that `here::here()` can also be used when telling R where to save a file, as shown below.*

```{r create build_year() function to tidy data across the years}
build_year <- function( fn1, fn2, year )
{
  d1 <- tidy_up_data( fn1 )
  d1 <- select( d1, - type )

  d2 <- tidy_up_data( fn2 )
  d2 <- select( d2, - type )

  d3 <- merge( d1, d2, by=c("year","tractid"), all=T )

  # store the file path as a character vector
  file.name <- paste0( "data/rodeo/LTDB-", year, ".rds" )
  # export the object to the file path from above using here::here()
  saveRDS( d3, here::here( file.name ) )
}

year <- 1970
f1 <- "LTDB_Std_1970_fullcount.csv"
f2 <- "ltdb_std_1970_sample.csv"

build_year( fn1=f1, fn2=f2, year=year )

year <- 1980
f1 <- "LTDB_Std_1980_fullcount.csv"
f2 <- "ltdb_std_1980_sample.csv"

build_year( fn1=f1, fn2=f2, year=year )

year <- 1990
f1 <- "LTDB_Std_1990_fullcount.csv"
f2 <- "ltdb_std_1990_sample.csv"

build_year( fn1=f1, fn2=f2, year=year )

year <- 2000
f1 <- "LTDB_Std_2000_fullcount.csv"
f2 <- "ltdb_std_2000_sample.csv"

build_year( fn1=f1, fn2=f2, year=year )

year <- 2010
f1 <- "LTDB_Std_2010_fullcount.csv"
f2 <- "ltdb_std_2010_sample.csv"

build_year( fn1=f1, fn2=f2, year=year )
```

Check a file:

*Note: Notice the use of `here::here()` below when importing data.*

```{r check one of the output files}
# import the clean file
d <- readRDS( here::here( "data/rodeo/LTDB-2000.rds" ) )
head( d ) %>% pander()
```

# Metro Area Metadata

Metro areas are designated by the US Census as Core-Based Statistical Areas (CBSA).

"A core-based statistical area (CBSA) is a U.S. geographic area defined by the Office of Management and Budget (OMB) that consists of one or more counties (or equivalents) anchored by an urban center of at least 10,000 people plus adjacent counties that are socioeconomically tied to the urban center by commuting. Areas defined on the basis of these standards applied to Census 2000 data were announced by OMB in June 2003. These standards are used to replace the definitions of metropolitan areas that were defined in 1990. The OMB released new standards based on the 2010 Census." [cite](https://en.wikipedia.org/wiki/Core-based_statistical_area)

Note that these are defined as sets of counties, so the definition files are organized with one county per row, with attributes associated with that county.

Census data files do not always have info about metro areas. If we need this information for our analysis, we can get a crosswalk file from the National Bureau of Economic Research: https://data.nber.org/data/cbsa-msa-fips-ssa-county-crosswalk.html
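The crosswalk identifies counties by their five-digit FIPS code (`fipscounty` in the data dictionary further below), while our cleaned tract IDs follow the `fips-SS-CCC-TTTTTT` pattern created by `fix_ids()`. The sketch below shows how a county FIPS can be recovered from a tract ID (the example IDs are made up); this is the same substring trick used in the meta-data step later in this tutorial.

```{r derive county fips from a tract id}
# hypothetical tract IDs in the fips-SS-CCC-TTTTTT format
x <- c( "fips-01-001-020100", "fips-48-201-311500" )

# paste the state (characters 6-7) and county (characters 9-11) pieces back together
paste0( substr( x, 6, 7 ), substr( x, 9, 11 ) )
# expected: "01001" "48201"
```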
*Note: Notice the absence of `here::here()`. It is not necessary here because the file lives outside of our directory.*

```{r import FIPS-CBSA crosswalk}
URL <- "https://data.nber.org/cbsa-msa-fips-ssa-county-crosswalk/cbsatocountycrosswalk.csv"
cw <- read.csv( URL, colClasses="character" )

# all metro areas in the country
sort( unique( cw$cbsaname ) ) %>% head() %>% pander()
```

There are 3,292 counties in 2010. Of these, 35% are urban and 65% are rural.

```{r count rural v. urban}
# note in the data dictionary for CBSA Name (copied below): "blanks are rural"
cw$urban <- ifelse( cw$cbsaname == "", "rural", "urban" )
table( cw$urban ) %>% pander()
```

```{r filter the crosswalk}
keep.these <- c( "countyname", "state", "fipscounty",
                 "msa", "msaname",
                 "cbsa", "cbsaname",
                 "urban" )
cw <- dplyr::select( cw, keep.these )
head( cw ) %>% pander()
```

Save for easy load:

*Note: Notice the use of `here::here()` below when exporting data.*

```{r save the crosswalk for easy referencing later}
saveRDS( cw, here::here( "data/raw/cbsa-crosswalk.rds" ) )
```

Technically this is not strictly raw data because we created a new variable and dropped some columns, but it is input data that we are grabbing from an external site as meta-data, and it will not be a final research dataset used for analysis, so we can put it into the raw folder.

```
# DATA DICTIONARY FOR CROSSWALK

1. cbsatocountycrosswalk2005 set up by Jean Roth, jroth@nber.org, 20 Dec 2016
2. Source: fr05_cbsa_msa_xwalk_pub.txt
3. NBER URL: http://www.nber.org/data/cbsa-msa-fips-ssa-county-crosswalk.html
4. Source Page: http://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/AcuteInpatientPPS/Acute-Inpatient-Files-for-Download-Items/CMS022637.html
5. Source File URL: http://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/AcuteInpatientPPS/Downloads/fr05_cbsa_msa_xwalk_pub.zip
6. by Jean Roth, jroth@nber.org, 28 Nov 2016

ssacounty:
1. Los Angeles FIPS 06037 can have two SSA county codes: 05210 and 05200

 obs:   3,293
vars:      21                                        20 Dec 2016 11:41
size: 757,390                                        (_dta has notes)

--------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------------------
countyname      str26   %26s                  County Name
state           str2    %9s                   State
ssacounty       str5    %9s               *   SSA County Code
fipscounty      str5    %9s                   FIPS County Code
msa             str6    %9s                   Old MSA
l               str1    %9s                   Lugar
msaname         str48   %48s                  Old MSA Name
cbsa            str5    %9s                   CBSA - if blank then rural area (set equal
                                                to first 2 digits of ssa code)
cbsaname        str50   %50s                  CBSA Name
cbsaold         long    %12.0g                (Blanks are Rural)
cbsanameold     str42   %42s                  (Blanks are Rural)
ssast           str2    %9s                   SSA State code
fipst           str2    %9s                   FIPS State code
y2005           float   %9.0g                 Present in 2005 source file
y2011           float   %9.0g                 Present in 2011 source file
y2012           float   %9.0g                 Present in 2012 source file
y2013           float   %9.0g                 Present in 2013 source file
y2014           float   %9.0g                 Present in 2014 source file
y2015           float   %9.0g                 Present in 2015 source file
y2016           float   %9.0g                 Present in 2016 source file
y2017           float   %9.0g                 Present in 2017 source file
--------------------------------------------------------------------------------------------
* indicated variables have notes

Sorted by: fipscounty ssacounty
```
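As an optional spot-check, the saved crosswalk can be read back in and a single county looked up. The sketch below uses FIPS `04013` (Maricopa County, AZ) purely as an example; it is set to `eval=FALSE` because it is not required for the cleaning steps.

```{r spot check the saved crosswalk, eval=F}
# read the saved crosswalk back in and look up one county
cw.check <- readRDS( here::here( "data/raw/cbsa-crosswalk.rds" ) )
cw.check[ cw.check$fipscounty == "04013",
          c("countyname","state","cbsaname","urban") ] %>% pander()
```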
-----
# Create Meta-Data Table

Each of the files contains redundant meta-data. We can remove it to make merges easier and consolidate all of the meta-data (attributes of counties and census tracts) into a single file for ease of use.

We need one file per year from 1980 to 2000 to grab all of the unique meta-data in the files.

*Note: Notice the use of `here::here()` below when importing data.*

```{r create function to extract metadata}
extract_metadata <- function( file.name )
{
  # store the file path as a character vector
  path <- paste0( "data/raw/", file.name )
  # import the file using the file path inside of here::here()
  d <- read.csv( here::here( path ), colClasses="character" )
  type <- ifelse( grepl( "sample", file.name ), "sample", "full" )
  year <- substr( file.name, 10, 13 )

  # fix names
  d <- fix_names( d )

  # fix leading zero problem in tract ids
  d$tractid <- fix_ids( d$tractid )

  # keep only the meta-vars
  keep.these <- c("tractid", "state", "county", "tract", "placefp10",
                  "cbsa10", "metdiv10", "ccflag10",
                  "globd10", "globg10", "globd00", "globg00",
                  "globd90", "globg90", "globd80", "globg80")
  d <- d[ names(d) %in% keep.these ]

  return( d )
}

f.1970 <- "LTDB_Std_1970_fullcount.csv"
f.1980 <- "LTDB_Std_1980_fullcount.csv"
f.1990 <- "LTDB_Std_1990_fullcount.csv"
f.2000 <- "LTDB_Std_2000_fullcount.csv"

meta.d.2000 <- extract_metadata( file.name=f.2000 )

meta.d.1990 <- extract_metadata( file.name=f.1990 )
meta.d.1990 <- select( meta.d.1990, tractid, globd90, globg90 )

meta.d.1980 <- extract_metadata( file.name=f.1980 )
meta.d.1980 <- select( meta.d.1980, tractid, globd80, globg80 )

meta.d <- merge( meta.d.2000, meta.d.1990, all=T )
meta.d <- merge( meta.d, meta.d.1980, all=T )

meta.d$fipscounty <- paste0( substr( meta.d$tractid, 6, 7 ),
                             substr( meta.d$tractid, 9, 11 ) )

head( meta.d ) %>% pander()
```

Load the CBSA crosswalk:

*Note: Notice the use of `here::here()` below when importing data.*

```{r load the crosswalk from earlier}
cw <- readRDS( here::here( "data/raw/cbsa-crosswalk.rds" ) )
head( cw ) %>% pander()
```

Now let's do some analysis to gain a deeper sense of what is inside the data.

```{r data analysis}
cw <- select( cw, -countyname, -state )

# new counties since 2010?
setdiff( cw$fipscounty, meta.d$fipscounty )

# drop duplicate county
nrow( cw )
cw <- cw[ ! duplicated(cw$fipscounty) , ]
nrow( cw )

nrow( meta.d )
meta.d <- merge( meta.d, cw, by="fipscounty", all.x=T )
nrow( meta.d )

head( meta.d ) %>% pander()
```

Save for easy load:

*Note: Notice the use of `here::here()` below when exporting data.*

```{r save the metadata}
saveRDS( meta.d, here::here( "data/rodeo/LTDB-META-DATA.rds" ) )
```
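The reason for consolidating the meta-data into one file is that it can be merged back onto any of the year files whenever those attributes are needed. A minimal sketch of that merge is shown below (assuming the rodeo files created above exist); it is not part of the required steps, so it is set to `eval=FALSE`.

```{r example of joining the meta-data to a year file, eval=F}
# attach the consolidated meta-data to one of the cleaned year files
d.2000 <- readRDS( here::here( "data/rodeo/LTDB-2000.rds" ) )
md     <- readRDS( here::here( "data/rodeo/LTDB-META-DATA.rds" ) )

d.2000 <- merge( d.2000, md, by="tractid", all.x=T )

# each tract now carries its county and metro area attributes
head( d.2000[ c("tractid","cbsaname","urban") ] ) %>% pander()
```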





# Alternative Approach

An alternative is to build one large stacked dataset. This is hard to use, however, because you don't know which panel years exist for each variable.

```{r, eval=F}
# vector of all raw LTDB file names (full count and sample, 1970-2010)
these <- c( paste0( "LTDB_Std_", seq( 1970, 2010, 10 ), "_fullcount.csv" ),
            paste0( "ltdb_std_", seq( 1970, 2010, 10 ), "_sample.csv" ) )

d.list <- NULL
loop.count <- 1

for( i in these )
{
  file.name <- i
  d.i <- tidy_up_data( file.name )
  d.list[[ loop.count ]] <- d.i
  loop.count <- loop.count + 1
}

d <- bind_rows( d.list )
```

Then you can reshape the dataset as needed:

```{r, eval=F}
# keep only the panel years you need
d <- filter( d, year %in% c(2000,2010) )

# illustrative snippet: `world` is the toy country-year dataset shown in the tables below
library( data.table )   # CRAN version 1.10.4
setDT( world )          # coerce to data.table
data_wide <- dcast( world, Country ~ Year,
                    value.var = c("Growth", "Unemployment", "Population") )

# base R version of the same reshape
reshape( world, direction = "wide", timevar = "Year", idvar = "Country" )

# the same idea applied to the stacked LTDB data (first 20 columns only)
d2 <- d[1:20]
reshape( d2, direction="wide", timevar="year", idvar="tractid" )
```

```
+---------+------+--------+--------------+------------+
| Country | Year | Growth | Unemployment | Population |
+---------+------+--------+--------------+------------+
| A       | 2015 | 2      | 8.3          | 40         |
| B       | 2015 | 3      | 9.2          | 32         |
| C       | 2015 | 2.5    | 9.1          | 30         |
| D       | 2015 | 1.5    | 6.1          | 27         |
| A       | 2016 | 4      | 8.1          | 42         |
| B       | 2016 | 3.5    | 9            | 32.5       |
| C       | 2016 | 3.7    | 9            | 31         |
| D       | 2016 | 3.1    | 5.3          | 29         |
| A       | 2017 | 4.5    | 8.1          | 42.5       |
| B       | 2017 | 4.4    | 8.4          | 33         |
| C       | 2017 | 4.3    | 8.5          | 30         |
| D       | 2017 | 4.2    | 5.2          | 30         |
+---------+------+--------+--------------+------------+

+---------+-------------+-------------------+-----------------+-------------+-------------------+-----------------+
| Country | Growth_2015 | Unemployment_2015 | Population_2015 | Growth_2016 | Unemployment_2016 | Population_2016 |
+---------+-------------+-------------------+-----------------+-------------+-------------------+-----------------+
| A       | 2           | 8.3               | 40              | 4           | 8.1               | 42              |
| B       | 3           | 9.2               | 32              | 3.5         | 9                 | 32.5            |
| C       | 2.5         | 9.1               | 30              | 3.7         | 9                 | 31              |
| D       | 1.5         | 6.1               | 27              | 3.1         | 5.3               | 29              |
+---------+-------------+-------------------+-----------------+-------------+-------------------+-----------------+
```
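The `world` object above is never actually created in this tutorial, so the reshaping snippet is illustration only. For anyone who wants to run it, below is a self-contained sketch with the toy values typed in from the long table above; the `dcast()` call produces one row per country and one column per variable-year combination, as in the wide table.

```{r self-contained reshape example, eval=F}
library( data.table )

# toy country-year data copied from the long table above
world <- data.frame(
  Country      = rep( c("A","B","C","D"), times=3 ),
  Year         = rep( c(2015,2016,2017), each=4 ),
  Growth       = c( 2,   3,   2.5, 1.5,   4,   3.5,  3.7, 3.1,   4.5,  4.4, 4.3, 4.2 ),
  Unemployment = c( 8.3, 9.2, 9.1, 6.1,   8.1, 9,    9,   5.3,   8.1,  8.4, 8.5, 5.2 ),
  Population   = c( 40,  32,  30,  27,    42,  32.5, 31,  29,    42.5, 33,  30,  30  )
)

setDT( world )

# pivot from long to wide: one row per country
dcast( world, Country ~ Year,
       value.var = c("Growth", "Unemployment", "Population") )
```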