## Data management and cleaning in R ## Clinic on the Meaningful Modeling of Epidemiological Data ## International Clinics on Infectious Disease Dynamics and Data (ICI3D) Program ## African Institute for Mathematical Sciences, Muizenberg, RSA ## ## (C) Juliet Pulliam, 2011, 2012, 2017 ## ## The goal of this tutorial is to acquaint you with ways of ## manipulating and cleaning infectious disease data in R. By ## the end of the tutorial you should be able to: ## ## * import CSV files into R ## * ensure imported data is interpreted correctly by R ## * identify errors and inconsistencies in a data frame ## * correct identified data errors ## ## NOTE: The comments will guide you through the tutorial but you ## should make sure you understand what the code is doing. Some ## function arguments are assigned to ?. This will give you an ## error. You should try out values for these arguments as ## suggested in the comments or find them yourselves in a help file. ###################################################################### ## First, you need to import the data you will be working with. ## You should make sure you are in the same ## working directory as your data. ## You presumably downloaded the data and this script to the same place ## You can make sure R studio is "there" by clicking on Session/Set Working ## Directory/To Source File Location ## If you have files in a different place than the script, you can use ##Session/Set Working Directory (or a keyboard shortcut) to navigate there. ## We generally don't recommend using setwd() in scripts, because it causes ## difficulties in sharing and collaboration. ## For this tutorial, we will be using an example dataset that ## resembles (but is not) real data from a measles epidemic in ## Democratic Republic of Congo. The "data" are made available only ## to ICI3D participants and are for use only during the MMED and ## DAIDD clinics, and only for the specified purposes. ## We will be using the 'tidyverse' family of R functions to read in ## and clean the data. Let's get started. library(tidyverse) # Load the tidyverse package dat <- read_csv("tutorial5.csv") # read in the "data" ## The read_csv() function reads data in as a tibble, a data ## structure that is very useful because it can contain different ## data types in different columns (similar to a data frame, but unlike ## a matrix or array). ###################################################################### ## Let's look at the data to get a feeling for the size of the ## data set and what variables we have to work with. dim(dat) # Determine the number of rows and columns dat # Look at the beginning of the dataset ## For each individual in our data set, we have the following ## information: ## ## * Province - the province where the case occurred ## * ZS - the health zone where the case occurred ## * Age_annes - the person's age in years ## * Age_Mois - the person's age in months ## * Sexe - the person's sex ## * Date_dern_vacci - the date when the person was vaccinated against ## measles ## * DDS - the date of symptom onset ## * test - the result of a diagnostic test for IgM antibodies against ## measles ## The read_csv() function has read in most of our data as characters. ## Note that this is a different behavior from what would have happened ## using the read.csv() function demonstrated in Thumbi's lecture, and ## this fixes the problem he demonstrated where factors are sometimes ## treated as numbers when they shouldn't be. It also gives us some ## useful information about problems we're likely to discover with ## the dataset as we explore further - for example, the fact that the ## Age_annes column has been read in as characters indicates there are ## likely to be some problematic entries. ## We should always go through our data sets to ensure that ## R treats each variable as the appropriate data type. To do this, ## it will help to know the different ways in which R can store and ## interpret data. To see a list of the options for data types: ?typeof ######### ## Hands off the keyboard! Pick up a writing pad and... ######### ## ## Decide what data type each variable in the data set should be. ## Refer to the lectures on study design or data cleaning, if necessary. ## ######### ## Before we start working with the data itself, let's rename the columns ## so they're easier to interpret. dat <- rename(dat , province = Province , healthZone = ZS , ageYears = Age_annes , ageMonths = Age_Mois , sex = Sexe , dateVaccinated = Date_dern_vacci , dateOnset = DDS , testResultIgM = test ) ## Here is a cleaned up version of the data key, with the new ## column names and information about the data types we will end up ## with after cleaning the data: ## ## * province - the province where the case occurred (character) ## * healthZone - the health zone where the case occurred (character) ## * ageYears - the person's age in years (integer) ## * ageMonths - the person's age in months (integer) ## * sex - the person's sex (character) ## * dateVaccinated - the date when the person was vaccinated against ## measles (character) ## * dateOnset - the date of symptom onset (character) ## * testResult - the result of a diagnostic test for IgM antibodies against ## measles ## ## We will also create some new variables as we clean the data: ## ## * ageYearsContinuous - the person's (approximate) age in years, as a ## continuous ## variable (double) ## * vaccinationStatus - whether the person had been previously ## vaccinated (logical) ## * testStatus - additional information about the person's diagnostic ## status (character) ###################################################################### ## ## Let's begin by looking at ageYears, which should be stored as a ## numeric value. mode(dat$ageYears) # determine the storage mode of a variable ## As noted above, ageYears is being treated as a character, which ## likely indicates a problem with some of the entries; let's take ## a closer look: distinct(dat,ageYears) ## The distinct() function lists the unique values for a variable or ## or set of variables. So far, everything looks fine, but the ## function only shows us as much of the data as will fit in the ## console window. We need to look at all of the values to indentify ## what's going on: ## CONSOLE (you can try View() instead of print()) print( dat %>% distinct(ageYears) ) ## Scrolling down, we can see the problem: some of the entries have a ## value of 'ND' which of course is not a number, so read_csv() ## decided to treat the entire column as characters. In this dataset, ## 'ND' is a way of denoting missing values, and we can correct ## this easily by telling R that everything in this column should be ## a number. To see what this will do, let's look at the unique values ## again after we cast ageYears as an integer: ## CONSOLE View( dat %>% distinct(as.integer(ageYears)) ) ## We see that 'ND' is no longer in the list of values and has been ## replaced by NA, which stands for 'not applicable' and is R's way ## of treating missing information. We can see that R is treating NA ## differently from 'ND' because the NA row is now greyed out. Let's ## update the dataset to reflect this correction: dat <- ( dat %>% mutate(ageYears = as.integer(ageYears)) # replace ) ## Now let's look more closely at the data in this column to verify ## that the values make sense: ## CONSOLE print( dat %>% distinct(ageYears) %>% arrange(ageYears) ) ## This time, instead of just looking at all the values, we've arranged ## them in ascending order, so we can easily see whether all of the ## numeric values are plausible. We can also look explicitly at the range ## of values given: range(dat$ageYears,na.rm = T) ## Note that the na.rm argument tells the function whether to drop the ## missing values before showing us the range. We have set it to TRUE, ## so we see the range of values after the NAs have been removed. ## Now let's look at the column giving age in months. mode(dat$ageMonths) # determine the storage mode of a variable ## The variable is stored as a numeric value, as seems appropriate. ## Let's also verify that this numeric value is not representing a ## factor: class(dat$ageMonths) # determine the object class of a variable ## Great! The ageMonths column seems to be in better shape that the ## ageYears column was. Fill in the missing information in the command ## below to take a closer look: ## CONSOLE print( dat %>% distinct(???) # View distinct values for month ## FIXME %>% arrange(???) # Arrange values in ascending order ## FIXME ) ## All of the values are numeric, but at least one row has a value of 25, ## which is hard to interpret. In particualr, we need ## to be careful that we understand how the two age columns relate ## to each other, or we could run into trouble at the data analysis phase. ## Let's start by looking at the data with missing values for ageMonths. We ## can also look at the rows that have missing values for ageYears at the ## same time: print( dat %>% filter(is.na(ageMonths)|is.na(ageYears)) # subset the rows that have NA for either age variable ) ## There is one individual with missing values, and they are missing both year ## and month information. While it's not ideal to have missing data, this is ## a reality in most datasets, and in this case it's not that bad. ## Now let's look at the ageMonths values for the subset of the ## data that has an ageYears value of 0: print( dat %>% filter(ageYears==0) # subset the rows that have 0 for ageYears %>% group_by(ageMonths) # group by ageMonths %>% summarize(count = n()) # count occurrences of each ageYears value ) ## All ageMonths values for children with ageYears of 0 are between 0 ## and 11 (inclusive). This seems like a good start. Now let's look the ## ageYears values for the rest of the data: print( dat %>% filter(???) # subset the rows for ages of at least 1 year ## FIXME %>% group_by(ageMonths) # group by ageMonths %>% summarize(count = n()) # count occurrences of each ageMonths value ) ## There is only one row in the dataset where ageMonths is greater than 11, ## so let's take a closer look at that particular row: print( dat %>% filter(ageMonths==25) ) ## How confusing! Both the ageYears and ageMonths values are 25. ## At this stage, we would like to go back to the source of the data ## and double-check the recorded age for this individual. Let's say we ## go back to the data source and learn that the individual had just ## turned 25, meaning that the months entry should be 0 and there was a data ## entry error where the age in years was accidentally entered in both ## columns. We can easily correct this. ## Let's first set up a test case on a simplified version of the dataset: print( dat %>% select(ageYears,ageMonths) # reduce the number of columns for test case %>% filter(ageYears==25) # subset to look only at the data for 25 year olds %>% mutate( ageMonthsNEW = ifelse(ageYears == 25 & ageMonths == 25, 0, ageMonths)) # correct the typo ) ## It looks like the mutate command is appropriately correcting the typo, so we ## can now be comfortable replacing the values in the actual data frame: dat <- ( dat %>% mutate(ageMonths = ifelse(ageYears == 25 & ageMonths == 25, 0, ageMonths)) ) ## Note that it is very difficult to pick up errors in the age variables if they ## don't fall outside the range expected for age or years. To reduce errors that ## are difficult to spot, data are often entered twice by different people ## and then values are compared to look for inconsistencies. ## Now that we have cleaned versions of both ageYears and ageMonths, let's put ## them together to create a new variable ageYearsContinuous that treats age as continuous. ## We can do this using the mutate() function that we saw above. Again, we'll ## first look at the output for a test case and the update the actual data frame: ## CONSOLE print( dat %>% select(ageYears,ageMonths) # reduce the number of columns for test case %>% mutate( ageYearsContinuous = ageYears + ageMonths/12) ) ## Scan the values of the columns to make sure the conversion has been done right. ## If you were doing a more complicated transformation, you would probably want to ## build in additional checks to identify potential errors. ## The transformation looks good, so let's update our data frame to add the new variable. ## Now that we have a new variable that contains all of the age information, we can also ## remove the original age variables, which are difficult to interpret. dat <- ( dat %>% mutate( ageYearsContinuous = ageYears + ageMonths/12) # create new variable %>% select(-ageYears,-ageMonths) # remove original age variables ) summary(dat) ggplot(dat, aes(x=ageYearsContinuous)) + geom_histogram() ###################################################################### ## PROBLEM ###################################################################### ## ## So far we've been correcting errors in a somewhat adhoc manner. In ## the lecture, we saw how to do this is a more robust way ## using a correction table. Follow the examples from the lecture to ## correct the province names for Kasai Oriental and Kasai Occidental ## in the using the correction table kasaiCorrectionTable.csv ## ###################################################################### corTab <- read_csv('kasaiCorrectionTable.csv') dat <- ... ## FIXME