--- title: 'Data Set Profile: Bike Sharing Demand' author: "Daniel Dittenhafer" date: "Saturday, September 27, 2014" output: pdf_document --- ```{r, echo=FALSE} require(ggplot2, quietly=TRUE) require(lubridate, quietly=TRUE) projRoot <- "C:/SourceCode/R/DataAcqMgmt" ``` ---------------------------------------------------------------------------------------------------- Data Set Number of Attribute Number of Characteristics Observations Area Characteristics Attributes Missing Values? --------------- ------------- --------- ---------------------- ---------- --------------- Multivariate 10,886 Business Categorical, Integer, 12 No Date/time, Decimal --------------------------------------- ---------------------- ---------- --------------- ### Source ### [Kaggle.com Competition: Bike Sharing Demand](https://www.kaggle.com/c/bike-sharing-demand): https://www.kaggle.com/c/bike-sharing-demand The goal of the competition, from the Kaggle website, is to "predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period." The data set profiled herein is the training data set. ### Attribute Information ### -------------------------------------------------------------------- Field Data Type Description --------- ------------- -------------------------------------------- datetime date & hour First 19 days of each month Min: 01/01/2011 00:00; Max: 12/19/2012 23:00 season integer Categorical; 1 = spring; 2 = summer; 3 = fall; 4 = winter; holiday boolean 1 = a holiday; 0 = not a holiday; workingday boolean 1 = a work day; 0 = weekend or holiday; weather integer Categorical; 1) Clear, Few clouds, Partly cloudy, Partly cloudy; 2) Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3) Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4) Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog; temp decimal temperature in degrees Celsius atemp decimal apparent temperature in degrees Celsius humidity integer relative humidity percentage windspeed decimal the speed that air is moving in unknown units casual integer the number of non-registered bike shares for the hour registered integer the number of registered bike shares for the hour count integer the total number of bike shares for the hour -------------------------------------------------------------------- ### Comments ### There were no inherent character columns and the data, where appropriate, was already converted to factor-like integer values. As such, in order to better map the data set to this exercise, I have added a `seasonName` column and reverted the `season` column into this new column to begin with. Given the goal of the Kaggle competition to predict future bike share count, shifting the `count` values by a fixed period aids in this analysis. Additionally, for analysis purposes the datetime field would be better broken up into components including a simple integer hour of day, individual month value, day of week, segment of the day, etc. Some of these transformations are applied in the code segment that follows including the inclusion of a `nextHourCount` attribute which reflects the following hour's total bike rentals for a given hour. ```{r} # Load the data into a data.frame csv_file <- file.path(projRoot, "Week5", "BikeSharingDemand.csv") csv <- read.table(csv_file, header=TRUE, sep=",") # Revert season to character data. bikes <- data.frame(csv, seasonName=NA, hourOfDay=NA, dayOfWeek=NA, dayOfWeekInt=NA, monthOfYear=NA, segmentOfDay=NA, nextHourDateTime=NA, nextHourCount=NA) bikes[bikes$season == 1,]$seasonName <- "spring" bikes[bikes$season == 2,]$seasonName <- "summer" bikes[bikes$season == 3,]$seasonName <- "fall" bikes[bikes$season == 4,]$seasonName <- "winter" # 14 Add an integer column for hour of day bikes$hourOfDay <- lubridate::hour(bikes$datetime) # 15 Add a factor and integer column for day of week bikes$dayOfWeek <- as.factor(weekdays(strptime(as.character(bikes$datetime), format="%Y-%m-%d %H:%M:%S"))) # 16 bikes$dayOfWeekInt <- as.numeric(bikes$dayOfWeek) # 17 Add an integer column for month of year bikes$monthOfYear <- lubridate::month(bikes$datetime) # 18 Add an integer column for segment of day bikes$segmentOfDay <- ifelse(bikes$hourOfDay >= 5 & bikes$hourOfDay < 12, 1, # Morning ifelse(bikes$hourOfDay >= 12 & bikes$hourOfDay < 17, 2, # Afternoon ifelse(bikes$hourOfDay >= 17 & bikes$hourOfDay < 22, 3, # Evening 4 ))) # Night # Seq vector to help shift 'count' rows by one hour to analysis predicability ind <- seq(2, nrow(bikes) + 1, 1) ind[length(ind)] <- NA # 19, 20 Add column showing the 'next hour" datetime and count bikes$nextHourDateTime <- bikes[ind,"datetime"] bikes$nextHourCount <- bikes[ind, "count"] ``` ### Summary Statistics ### Using the `summary()` R function, the basic statistics about each attribute are summarized through the raw R output that follows. ```{r} # Summary summary(bikes) ``` ### Outlier Analysis ### Using the Data Mining with R (DMwR) package's `lofactor()` function, an outlier analysis was performed. Although a complete analysis was performed across all categorical and numeric attributes, generally no extreme outliers were detected, with a notable exception. The following R code illustrates the approach used in the outlier analysis with the top 5 outlying points highlighted in the following charts via a red plus symbol. ```{r} outlier.scores <- DMwR::lofactor(bikes[,c(2:9,12)], k=5) outliers <- order(outlier.scores, decreasing=T)[1:5] ``` One might reasonably expect temperature and apparent temperature to follow one another more or less linearly. For each unit increase in temperature, apparent temperature would increase by approximately one unit. As shown in the following chart, this is mostly true, except for the August 17, 2012 values. On this day, the connection between temperature and apparent temperature appear broken with `atemp` stuck at 12.12 $^\circ$C. Without further knowledge of the data's origins, one can only speculate as to why this is the case. ```{r, echo=FALSE} n <- nrow(bikes) pch <- rep(".", n) pch[outliers] <- "+" col <- rep("black", n) col[outliers] <- "red" # Plot the outliers along with the other data #pairs(bikes[,c(5:9, 14, 20)], pch=pch, col=col, main="Outliers in Context") pairs(bikes[,c(6:7,17)], pch=pch, col=col, main="Outliers in Context") # Show the raw data (well, lets hide it actually) #bikes[lubridate::day(bikes$datetime) == 17 & lubridate::month(bikes$datetime) == 8 & lubridate::year(bikes$datetime) == 2012,c(1, 6:7)] ``` Using an alternative approach to outlier detection, the mean and standard deviation were calculated for the numeric attributes. A distribution analysis was not performed, and as such normal distribution is only assumed here. With a 3 standard deviation width on either side of the mean, values that appear outside these bounds could be considered outliers. The following R code performs this analysis and shows the results. ```{r} # Calculate mean/standard deviation msd <- sapply(bikes[,c(6:12)], function(cl) c(mean=mean(cl,na.rm=TRUE), stdev=sd(cl,na.rm=TRUE))) # Melt into long form msdDf <- as.data.frame(t(msd)) msdDf <- data.frame(attribute=rownames(msdDf), msdDf) msdDf <- subset(msdDf, !is.na(msdDf$mean) & !is.na(msdDf$stdev)) # Add lower/upper bounds at 3 stdevs xTimes <- 3 lowers <- msdDf$mean - (xTimes * msdDf$stdev) uppers <- msdDf$mean + (xTimes * msdDf$stdev) msdDf <- data.frame(msdDf, lower=lowers, upper = uppers) msdDf ``` ```{r echo=FALSE} # Using for loop, though probably a plyr method would be better here. bikeOut <- data.frame(id=seq(1, nrow(bikes),1)) for(a in msdDf$attribute) { cOut <- bikes[, c(a)] < msdDf[msdDf$attribute == a,"lower"] | msdDf[msdDf$attribute == a,"upper"] < bikes[, c(a)] bikeOut <- cbind(bikeOut, a=cOut) } # Beautify and show results bikeOut <- bikeOut[,c(2:ncol(bikeOut))] colnames(bikeOut) <- msdDf$attribute bikeOutSums <- sapply(bikeOut, sum) bikeOutSums <- as.data.frame(t(bikeOutSums)) bikeOutSums <- cbind(bikeOutSums, id=1) bikeOutSums <- reshape2::melt(bikeOutSums, na.rm=TRUE, variable.name="attribute", value.name="numOutliers", id.vars="id") ``` ```{r, echo=FALSE} g5 <- ggplot(data=bikeOutSums, aes(x=attribute, y=numOutliers)) g5 <- g5 + geom_bar(stat="identity") g5 <- g5 + theme(axis.text.x = element_text(angle=30, vjust=1)) g5 <- g5 + labs(title="Number of Outliers/Attribute at 3 Standard Deviations") g5 ``` ### Correlation Analysis ### Using R's `cor()` function, as shown in the following code, an analysis of correlation between the numeric attributes was performed. ```{r} bikeCor <- cor(bikes[, c(20,6:12,14,16,17,18)], use="complete.obs") bikesCorMelt <- reshape2::melt(bikeCor, varnames=c("x", "y"), value.name="Correlation") bikesCorMelt <- bikesCorMelt[order(bikesCorMelt$Correlation),] ``` ```{r, echo=FALSE} g1 <- ggplot(data=bikesCorMelt, aes(x=x, y=y)) g1 <- g1 + geom_tile(aes(fill=Correlation)) g1 <- g1 + scale_fill_gradient2(low="red", mid="white", high="steelblue", guide=guide_colorbar(ticks=FALSE, barheight=10), limits=c(-1,1)) g1 <- g1 + theme_minimal() g1 <- g1 + theme(axis.text.x = element_text(angle=30, vjust=1)) g1 <- g1 + labs(title="Heatmap of Attribute Correlation") g1 ``` As can be seen in the heat map above and values below, humidity is negatively correlated with bike sharing demand across all three measures (casual, registered and count). Likewise, temperature is positively correlated with bike sharing demand in this data set. Interestingly, the day of the week `dayOfWeekInt` attribute shows virtually no correlation with bike sharing demand, but `segmentOfDay`, the indicator of morning, afternoon, etc, shows a moderate negative correlation whereby apparently more rentals occur in the morning (`segmentOfDay` = 1). ```{r, echo=FALSE} bikeCor ``` ### Entropy Analysis ### Using Entropy and Information Gain functions developed in a prior exercise, an entropy analysis was performed. Raw entropy of the bike shares per hour was calculated initially. ```{r} source(file.path(projRoot, "EntropyFunctions.R"), chdir=TRUE) # Raw Entropy: Total Bike Sharing entropy(bikes$nextHourCount) ``` The `decide()` function from the EntropyFunctions script was used to calculate information gain across all attributes versus the `nextHourCount` bike sharing measure which was added to aid with prediction analysis. The results were then melted into a long format and sorted for better visualization. The R code is shown below. None of the attributes produced particularly staggering information gain, but the `humidity` attribute was found to be the most meaningful, followed by the `hourOfDay` calculated attribute and `atemp`/`temp` attributes. ```{r} # Calculate information gain across all categorical and numeric attributes. nextHrEnt <- decide(bikes[,c(2,3,4,5,6,7,8,9,14,15,17,18,20)], 13) nextHrEntMelt <- reshape2::melt(nextHrEnt$gains, value.name="info.gain") nextHrEntMelt <- cbind(nextHrEntMelt, attribute=rownames(nextHrEntMelt)) nextHrEntMelt <- nextHrEntMelt[order(-nextHrEntMelt$info.gain),] ``` ```{r, echo=FALSE} require(ggplot2) g2 <- ggplot(data=nextHrEntMelt, aes(x=factor(attribute), y=info.gain)) g2 <- g2 + geom_bar(stat="identity") g2 <- g2 + theme_minimal() g2 <- g2 + scale_x_discrete(labels=as.character(nextHrEntMelt$attribute), limits=as.character(nextHrEntMelt$attribute)) g2 <- g2 + theme(axis.text.x = element_text(angle=30, vjust=1)) g2 <- g2 + labs(title="Information Gain - Next Hour Bike Shares by Partition", x="Paritioning Attribute", y="Information Gain") g2 nextHrEntMelt ``` ### Source Code ### The raw R markdown code used to produce this data set profile can be found [on GitHub, in my DataAcqMgmt repository](https://raw.githubusercontent.com/dwdii/DataAcqMgmt/master/Week5/Project2-ProfilingADataSet-Bsd.Rmd). ### References ### Fanaee-T, Hadi, and Gama, Joao, Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg. Lander, Jared P. "Correlation and Covariance." R for Everyone: Advanced Analytics and Graphics. New York: Addison-Wesley, 2014. N. pag. Print. "[Means and SD for Columns in a Dataframe with NA Values.](http://stackoverflow.com/a/20797794/2604144)" R. StackOverflow, 27 Dec. 2013. Web. 28 Sept. 2014. Zhao, Yanchang. "[Outlier Detection - RDataMining.com: R and Data Mining](http://www.rdatamining.com/examples/outlier-detection)" RDataMining.com: R and Data Mining. N.p., 2014. Web. 27 Sept. 2014.