--- title: "Chapter 5 Data Pre-processing" date: "`r Sys.Date()`" output: html_document: toc: true toc_depth: 3 toc_float: collapsed: true smooth_scroll: true --- This notebook illustrates how to perform standard data pre-processing, an essential step for any data science projects. # Load R Packages and Data ```{r, message = FALSE, warning=FALSE, results='hide'} # install packages p_needed <- c('imputeMissings','caret','e1071','psych','car','corrplot','RANN') packages <- rownames(installed.packages()) p_to_install <- p_needed[!(p_needed %in% packages)] if (length(p_to_install) > 0) { install.packages(p_to_install) } lapply(p_needed, require, character.only = TRUE) ``` ```{r} # load the simulated dataset and return summary statistics for each column sim.dat <- read.csv("http://bit.ly/2P5gTw4") summary(sim.dat) ``` # Deal with Problematic Data Set the problematic values as missing and impute them later. ```{r} # set problematic values as missings sim.dat$age[which(sim.dat$age > 100)] <- NA sim.dat$store_exp[which(sim.dat$store_exp < 0)] <- NA # see the results summary(subset(sim.dat, select = c("age", "store_exp"))) ``` # Deal with Missing Values Missing values are common in the raw data set. Based on the mechanism behind missing values, we have a few different ways to impute missing values. First, let's use the `impute()` function from `imputeMissing` package with `method = "median/mode"`. ```{r} # save the result as another object # !!! has to add imputeMissings:: demo_imp <- imputeMissings::impute(sim.dat, method = "median/mode") # check the first five columns. # There are no missing values in other columns summary(demo_imp[, 1:5]) ``` We can also use `preProcess()` function from `caret` package with `method = "medianImpute"`. ```{r} imp <- preProcess(sim.dat, method = "medianImpute") demo_imp2 <- predict(imp, sim.dat) summary(demo_imp2[, 1:5]) ``` Use preProcess() to conduct KNN: ```{r} ## Please note, to use knnImpute you have to install.packages('RANN') # !!! have to install RANN package imp <- preProcess(sim.dat, method = "knnImpute", k = 5) # need to use predict() to get KNN result demo_imp <- predict(imp, sim.dat) # only show the first three elements # !!! below line changed !!! lapply(demo_imp, summary)[1:3] ``` The `preProcess()` will automatically ignore non-numeric columns. When all the columns for a row are missing, then KNN method will fail. For example, the following codes will return an error. In this case, we can identify rows with all columns missing and remove them from the dataset. ```{r} temp <- rbind(sim.dat, rep(NA, ncol(sim.dat))) imp <- preProcess(sim.dat, method = "knnImpute", k = 5) ``` ```r demo_imp <- predict(imp, temp) ``` ```html Error in FUN(newX[, i], ...) : cannot impute when all predictors are missing in the new data point ``` There is an error saying “`cannot impute when all predictors are missing in the new data point`”. It is easy to fix by finding and removing the problematic row(s): ```{r} idx <- apply(temp, 1, function(x) sum(is.na(x))) as.vector(which(idx == ncol(temp))) ``` It shows that row 1001 is problematic. We can go ahead to delete it. Finally, let us try the "`bagImpute`" method, which is more time-consuming. ```{r} imp <- preProcess(sim.dat, method = "bagImpute") demo_imp <- predict(imp, sim.dat) summary(demo_imp[, 1:5]) ``` # Centering and Scaling Centering and scaling are the most common data transformation, and they are easy to apply. 
# Centering and Scaling

Centering and scaling are the most common data transformations, and they are easy to apply. For example, one way to center and scale is to use the mean and standard deviation of the data, as shown below.

```{r}
income <- sim.dat$income
# calculate the mean of income
mux <- mean(income, na.rm = T)
# calculate the standard deviation of income
sdx <- sd(income, na.rm = T)
# centering
tr1 <- income - mux
# scaling
tr2 <- tr1/sdx
```

But we can use the `preProcess()` function in `caret` directly, as illustrated below for `age` and `income`:

```{r}
sdat <- subset(sim.dat, select = c("age", "income"))
# set the 'method' option
trans <- preProcess(sdat, method = c("center", "scale"))
# use predict() to get the final result
transformed <- predict(trans, sdat)
```

# Resolve Skewness

We first show what left and right skewness look like, and then describe how to use a Box-Cox procedure to identify a transformation that reduces skewness in the data.

```{r}
# need the skewness() function from the e1071 package
set.seed(1000)
par(mfrow = c(1, 2), oma = c(2, 2, 2, 2))
# random sample of 1000 from a chi-square distribution with df = 2 (right skewed)
x1 <- rchisq(1000, 2, ncp = 0)
# get a left-skewed variable x2 from x1
x2 <- max(x1) - x1
plot(density(x2), main = paste("left skew, skewness =", round(skewness(x2), 2)),
     xlab = "X2")
plot(density(x1), main = paste("right skew, skewness =", round(skewness(x1), 2)),
     xlab = "X1")
```

In the cell below, we use the `preProcess()` function in the `caret` package to find the best Box-Cox transformation for `store_trans` and `online_trans` in our simulated dataset.

```{r}
describe(sim.dat)
# select the two columns and save them as dat_bc
dat_bc <- subset(sim.dat, select = c("store_trans", "online_trans"))
trans <- preProcess(dat_bc, method = c("BoxCox"))
```

Compare the histogram of the `store_trans` variable before and after the Box-Cox transformation.

```{r}
transformed <- predict(trans, dat_bc)
par(mfrow = c(1, 2), oma = c(2, 2, 2, 2))
hist(dat_bc$store_trans, main = "Before Transformation", xlab = "store_trans")
hist(transformed$store_trans, main = "After Transformation", xlab = "store_trans")
```

```{r}
skewness(transformed$store_trans)
```

We can also use the `BoxCoxTrans()` function directly to perform the Box-Cox transformation.

```{r}
trans <- BoxCoxTrans(dat_bc$store_trans)
transformed <- predict(trans, dat_bc$store_trans)
skewness(transformed)
```

# Resolve Outliers

There are formal definitions of outliers, but it is essential to visualize the data to gain some intuition.

```{r, message=FALSE}
# select numerical non-survey data
sdat <- subset(sim.dat, select = c("age", "income", "store_exp",
                                   "online_exp", "store_trans", "online_trans"))
# use scatterplotMatrix() from the car package
par(oma = c(2, 2, 1, 2))
car::scatterplotMatrix(sdat, diagonal = TRUE)
```

In addition to visualization, we can calculate a modified Z-score for each data point using the mean and the MAD (median absolute deviation), and then flag points with a modified Z-score greater than 3.5 as outliers.

```{r}
# calculate the median absolute deviation of income
ymad <- mad(na.omit(sdat$income))
# calculate the modified z-score
zs <- (sdat$income - mean(na.omit(sdat$income)))/ymad
# count the number of outliers
sum(na.omit(zs > 3.5))
```
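Beyond counting, it is often useful to look at the flagged observations themselves. A minimal sketch, reusing `zs` and `sdat` from the chunk above (`out_idx` is an illustrative name):

```r
# indices of income values flagged by the modified Z-score
# (which() drops the NA comparisons automatically)
out_idx <- which(zs > 3.5)
head(sdat[out_idx, ])
```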
For models that are sensitive to outliers, we can use the spatial sign transformation to minimize the outliers' impact, as shown below.

```{r}
# KNN imputation
sdat <- sim.dat[, c("income", "age")]
imp <- preProcess(sdat, method = c("knnImpute"), k = 5)
sdat <- predict(imp, sdat)
transformed <- spatialSign(sdat)
transformed <- as.data.frame(transformed)
par(mfrow = c(1, 2), oma = c(2, 2, 2, 2))
plot(income ~ age, data = sdat, col = "blue", main = "Before")
plot(income ~ age, data = transformed, col = "blue", main = "After")
```

# Deal with Collinearity

Collinearity exists when two variables are highly correlated. We can visualize the correlations between variables.

```{r}
# select non-survey numerical variables
sdat <- subset(sim.dat, select = c("age", "income", "store_exp",
                                   "online_exp", "store_trans", "online_trans"))
# use bagging imputation here
imp <- preProcess(sdat, method = "bagImpute")
sdat <- predict(imp, sdat)
# get the correlation matrix
correlation <- cor(sdat)
# plot
par(oma = c(2, 2, 2, 2))
corrplot.mixed(correlation, order = "hclust", tl.pos = "lt", upper = "ellipse")
```

Once we have the correlations between variables, we can flag highly correlated variables using a cutoff threshold and remove them to reduce potential collinearity.

```{r}
highCorr <- findCorrelation(cor(sdat), cutoff = 0.7)
# delete highly correlated columns
sdat <- sdat[-highCorr]
# check the new correlation matrix
cor(sdat)
```

# Deal with Sparse Variables

Besides highly correlated predictors, predictors with degenerate distributions can also cause problems, and removing them can significantly improve the performance and stability of some models. One extreme example is a variable with a single unique value, called a zero-variance variable. Variables whose unique values occur with very low frequency are near-zero variance predictors.

```{r}
# make a copy
zero_demo <- sim.dat
# add two sparse variables:
# zero1 has only one unique value;
# zero2 is a vector whose first element is 1 and the rest are 0s
zero_demo$zero1 <- rep(1, nrow(zero_demo))
zero_demo$zero2 <- c(1, rep(0, nrow(zero_demo) - 1))
```

We can use the `nearZeroVar()` function in the `caret` package to identify these sparse variables.

```{r}
nearZeroVar(zero_demo, freqCut = 95/5, uniqueCut = 10)
```

# Encode Dummy Variables

For categorical variables, dummy encoding is a necessary step before fitting many models. It converts one column of a categorical variable into multiple columns containing 0s and 1s. For a single categorical variable, we can use the `class.ind()` function in the `nnet` package. But it is more convenient to use the `dummyVars()` function in the `caret` package, which can be applied to a data frame.

```{r}
dumVar <- nnet::class.ind(sim.dat$gender)
head(dumVar)
```

```{r}
# use "original variable name + level" as the new column names
dumMod <- dummyVars(~gender + house + income, data = sim.dat, levelsOnly = F)
head(predict(dumMod, sim.dat))
```

The function can also create interaction terms:

```{r}
dumMod <- dummyVars(~gender + house + income + income:gender,
                    data = sim.dat, levelsOnly = F)
head(predict(dumMod, sim.dat))
```
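One detail worth noting: by default `dummyVars()` keeps every factor level, which makes the encoded columns linearly dependent. For models such as linear regression it is common to drop one level per factor; a minimal sketch using the function's `fullRank` argument (`dumModFR` is an illustrative object name):

```r
# full-rank encoding: drop one level per factor to avoid perfect collinearity
dumModFR <- dummyVars(~ gender + house + income, data = sim.dat, fullRank = TRUE)
head(predict(dumModFR, sim.dat))
```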