---
title: 'Data Set Profile: Bike Sharing Demand'
author: "Daniel Dittenhafer"
date: "Saturday, September 27, 2014"
output: pdf_document
---
```{r, echo=FALSE} 
require(ggplot2, quietly=TRUE)
require(lubridate, quietly=TRUE)
projRoot <- "C:/SourceCode/R/DataAcqMgmt"
```
----------------------------------------------------------------------------------------------------
Data Set         Number of              Attribute              Number of 
Characteristics  Observations Area      Characteristics        Attributes Missing Values?
--------------- ------------- --------- ---------------------- ---------- ---------------
Multivariate    10,886        Business  Categorical, Integer,  12         No     
                                        Date/time, Decimal  
--------------------------------------- ---------------------- ---------- ---------------

### Source ###

[Kaggle.com Competition: Bike Sharing Demand](https://www.kaggle.com/c/bike-sharing-demand): https://www.kaggle.com/c/bike-sharing-demand

The goal of the competition, from the Kaggle website, is to "predict the total 
count of bikes rented during each hour covered by the test set, using only 
information available prior to the rental period." The data set profiled herein is
the training data set.


### Attribute Information ###

--------------------------------------------------------------------
Field      Data Type     Description
---------  ------------- --------------------------------------------
datetime   date & hour   First 19 days of each month
                         Min: 01/01/2011 00:00;    
                         Max: 12/19/2012 23:00

season     integer       Categorical;
                         1 = spring;
                         2 = summer;
                         3 = fall;
                         4 = winter;
                         
holiday    boolean       1 = a holiday;
                         0 = not a holiday;

workingday boolean       1 = a work day;
                         0 = weekend or holiday;
                         
weather    integer       Categorical;
                         1) Clear, Few clouds, Partly cloudy, Partly cloudy;
                         2) Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 
                         3) Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds;
                         4) Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog; 
                         
temp       decimal       temperature in degrees Celsius

atemp      decimal       apparent temperature in degrees Celsius 

humidity   integer       relative humidity percentage

windspeed  decimal       the speed that air is moving in unknown units

casual     integer       the number of non-registered bike shares for the hour

registered integer       the number of registered bike shares for the hour

count      integer       the total number of bike shares for the hour
--------------------------------------------------------------------

### Comments ###

There were no inherent character columns and the data, where appropriate, was 
already converted to factor-like integer values. As such, in order to better 
map the data set to this exercise, I have added a `seasonName` column and 
reverted the `season` column into this new column to begin with. Given the goal
of the Kaggle competition to predict future bike share count, shifting the 
`count` values by a fixed period aids in this analysis. 
Additionally, for analysis purposes the datetime field would be better broken up into 
components including a simple integer hour of day, individual month value, 
day of week, segment of the day, etc. Some of these transformations are applied
in the code segment that follows including the inclusion of a `nextHourCount`
attribute which reflects the following hour's total bike rentals for a given hour.

```{r}
# Load the data into a data.frame
csv_file <- file.path(projRoot, "Week5", "BikeSharingDemand.csv")
csv <- read.table(csv_file, header=TRUE, sep=",")

# Revert season to character data.
bikes <- data.frame(csv, 
                    seasonName=NA, hourOfDay=NA, 
                    dayOfWeek=NA, dayOfWeekInt=NA, 
                    monthOfYear=NA, segmentOfDay=NA,
                    nextHourDateTime=NA, nextHourCount=NA)
bikes[bikes$season == 1,]$seasonName <- "spring"
bikes[bikes$season == 2,]$seasonName <- "summer"
bikes[bikes$season == 3,]$seasonName <- "fall"
bikes[bikes$season == 4,]$seasonName <- "winter"

# 14 Add an integer column for hour of day
bikes$hourOfDay <- lubridate::hour(bikes$datetime)

# 15 Add a factor and integer column for day of week
bikes$dayOfWeek <- as.factor(weekdays(strptime(as.character(bikes$datetime), 
                                               format="%Y-%m-%d %H:%M:%S")))
# 16
bikes$dayOfWeekInt <- as.numeric(bikes$dayOfWeek)

# 17 Add an integer column for month of year
bikes$monthOfYear <- lubridate::month(bikes$datetime)

# 18 Add an integer column for segment of day
bikes$segmentOfDay <- ifelse(bikes$hourOfDay >= 5 & bikes$hourOfDay < 12, 
                             1, # Morning
                             ifelse(bikes$hourOfDay >= 12 & bikes$hourOfDay < 17, 
                                    2, # Afternoon
                                    ifelse(bikes$hourOfDay >= 17 & bikes$hourOfDay < 22, 
                                           3, # Evening
                                              4 ))) # Night

# Seq vector to help shift 'count' rows by one hour to analysis predicability
ind <- seq(2, nrow(bikes) + 1, 1)
ind[length(ind)] <- NA

# 19, 20 Add column showing the 'next hour" datetime and count
bikes$nextHourDateTime <- bikes[ind,"datetime"]
bikes$nextHourCount <- bikes[ind, "count"]

```

### Summary Statistics ###
Using the `summary()` R function, the basic statistics about each attribute are
summarized through the raw R output that follows.

```{r}
# Summary
summary(bikes)
```

### Outlier Analysis ###
Using the Data Mining with R (DMwR) package's `lofactor()` function, an outlier
analysis was performed. Although a complete analysis was performed across all 
categorical and numeric attributes, generally no extreme outliers were detected, 
with a notable exception. The following R code illustrates the approach used in
the outlier analysis with the top 5 outlying points highlighted in the following
charts via a red plus symbol.

```{r}
outlier.scores <- DMwR::lofactor(bikes[,c(2:9,12)], k=5)

outliers <- order(outlier.scores, decreasing=T)[1:5]

```

One might reasonably expect temperature and apparent temperature to follow one
another more or less linearly. For each unit increase in temperature, apparent
temperature would increase by approximately one unit. As shown in the following 
chart, this is mostly true, except for the August 17, 2012 values. On this day, 
the connection between temperature and apparent temperature appear broken with 
`atemp` stuck at 12.12 $^\circ$C. Without further knowledge of the data's 
origins, one can only speculate as to why this is the case.

```{r, echo=FALSE}
n <- nrow(bikes)
pch <- rep(".", n)
pch[outliers] <- "+"
col <- rep("black", n)
col[outliers] <- "red"

# Plot the outliers along with the other data
#pairs(bikes[,c(5:9, 14, 20)], pch=pch, col=col, main="Outliers in Context")
pairs(bikes[,c(6:7,17)], pch=pch, col=col, main="Outliers in Context")

# Show the raw data (well, lets hide it actually)
#bikes[lubridate::day(bikes$datetime) == 17 & lubridate::month(bikes$datetime) == 8 & lubridate::year(bikes$datetime) == 2012,c(1, 6:7)]
```

Using an alternative approach to outlier detection, the mean and standard 
deviation were calculated for the numeric attributes. A distribution analysis
was not performed, and as such normal distribution is only assumed here. 
With a 3 standard deviation width on either side of the mean, values that 
appear outside these bounds could be considered outliers. The following 
R code performs this analysis and shows the results. 

```{r}
# Calculate mean/standard deviation
msd <- sapply(bikes[,c(6:12)], 
              function(cl) c(mean=mean(cl,na.rm=TRUE), 
                             stdev=sd(cl,na.rm=TRUE)))

# Melt into long form 
msdDf <- as.data.frame(t(msd))
msdDf <- data.frame(attribute=rownames(msdDf), msdDf)
msdDf <- subset(msdDf, !is.na(msdDf$mean) & !is.na(msdDf$stdev))

# Add lower/upper bounds at 3 stdevs
xTimes <- 3
lowers <- msdDf$mean - (xTimes * msdDf$stdev)
uppers <- msdDf$mean + (xTimes * msdDf$stdev)
msdDf <- data.frame(msdDf, lower=lowers, upper = uppers)
msdDf
```

```{r echo=FALSE}
# Using for loop, though probably a plyr method would be better here.
bikeOut <- data.frame(id=seq(1, nrow(bikes),1))
for(a in msdDf$attribute)
{
  cOut <- bikes[, c(a)] < msdDf[msdDf$attribute == a,"lower"] | 
          msdDf[msdDf$attribute == a,"upper"] < bikes[, c(a)]
  bikeOut <- cbind(bikeOut, a=cOut)
}

# Beautify and show results
bikeOut <- bikeOut[,c(2:ncol(bikeOut))]
colnames(bikeOut) <- msdDf$attribute

bikeOutSums <- sapply(bikeOut, sum)
bikeOutSums <- as.data.frame(t(bikeOutSums))
bikeOutSums <- cbind(bikeOutSums, id=1)
bikeOutSums <- reshape2::melt(bikeOutSums, 
                              na.rm=TRUE, 
                              variable.name="attribute", 
                              value.name="numOutliers",
                              id.vars="id")

```

```{r, echo=FALSE}
g5 <- ggplot(data=bikeOutSums, aes(x=attribute, y=numOutliers))
g5 <- g5 + geom_bar(stat="identity")
g5 <- g5 + theme(axis.text.x = element_text(angle=30, vjust=1))
g5 <- g5 + labs(title="Number of Outliers/Attribute at 3 Standard Deviations")
g5

```

### Correlation Analysis ###

Using R's `cor()` function, as shown in the following code, an analysis of 
correlation between the numeric attributes was performed. 
```{r}

bikeCor <- cor(bikes[, c(20,6:12,14,16,17,18)], use="complete.obs")
bikesCorMelt <- reshape2::melt(bikeCor, varnames=c("x", "y"), value.name="Correlation")
bikesCorMelt <- bikesCorMelt[order(bikesCorMelt$Correlation),]

```

```{r, echo=FALSE}
g1 <- ggplot(data=bikesCorMelt, aes(x=x, y=y))
g1 <- g1 + geom_tile(aes(fill=Correlation))
g1 <- g1 + scale_fill_gradient2(low="red", mid="white", high="steelblue", 
                                guide=guide_colorbar(ticks=FALSE, barheight=10),
                                limits=c(-1,1))
g1 <- g1 + theme_minimal()
g1 <- g1 + theme(axis.text.x = element_text(angle=30, vjust=1))
g1 <- g1 + labs(title="Heatmap of Attribute Correlation")
g1

```

As can be seen in the heat map above and values below, humidity is negatively
correlated with bike sharing demand across all three measures (casual, 
registered and count). Likewise, temperature is positively correlated with bike
sharing demand in this data set. Interestingly, the day of the week `dayOfWeekInt`
attribute shows virtually no correlation with bike sharing demand, but `segmentOfDay`,
the indicator of morning, afternoon, etc, shows a moderate negative correlation whereby
apparently more rentals occur in the morning (`segmentOfDay` = 1).

```{r, echo=FALSE}
bikeCor
```

### Entropy Analysis ###

Using Entropy and Information Gain functions developed in a prior exercise, 
an entropy analysis was performed. Raw entropy of the bike shares per hour was
calculated initially.

```{r}
source(file.path(projRoot, "EntropyFunctions.R"), chdir=TRUE)

# Raw Entropy: Total Bike Sharing
entropy(bikes$nextHourCount)
```

The `decide()` function from the EntropyFunctions script was used to calculate 
information gain across all attributes versus the `nextHourCount` bike sharing measure 
which was added to aid with prediction analysis. 

The results were then melted into a long format and sorted for better visualization. 
The R code is shown below. None of the attributes produced particularly staggering
information gain, but the `humidity` attribute was found to be the most meaningful, 
followed by the `hourOfDay` calculated attribute and `atemp`/`temp` attributes.

```{r}
# Calculate information gain across all categorical and numeric attributes. 
nextHrEnt <- decide(bikes[,c(2,3,4,5,6,7,8,9,14,15,17,18,20)], 13)

nextHrEntMelt <- reshape2::melt(nextHrEnt$gains, value.name="info.gain")
nextHrEntMelt <- cbind(nextHrEntMelt, attribute=rownames(nextHrEntMelt))
nextHrEntMelt <- nextHrEntMelt[order(-nextHrEntMelt$info.gain),]
```
```{r, echo=FALSE}
require(ggplot2)
g2 <- ggplot(data=nextHrEntMelt, aes(x=factor(attribute), y=info.gain))
g2 <- g2 + geom_bar(stat="identity")
g2 <- g2 + theme_minimal()
g2 <- g2 + scale_x_discrete(labels=as.character(nextHrEntMelt$attribute), limits=as.character(nextHrEntMelt$attribute))
g2 <- g2 + theme(axis.text.x = element_text(angle=30, vjust=1))
g2 <- g2 + labs(title="Information Gain - Next Hour Bike Shares by Partition", x="Paritioning Attribute", y="Information Gain")
g2
nextHrEntMelt
```

### Source Code ###
The raw R markdown code used to produce this data set profile can be found 
[on GitHub, in my DataAcqMgmt repository](https://raw.githubusercontent.com/dwdii/DataAcqMgmt/master/Week5/Project2-ProfilingADataSet-Bsd.Rmd).

### References ###

Fanaee-T, Hadi, and Gama, Joao, Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.

Lander, Jared P. "Correlation and Covariance." R for Everyone: Advanced Analytics and Graphics. New York: Addison-Wesley, 2014. N. pag. Print.

"[Means and SD for Columns in a Dataframe with NA Values.](http://stackoverflow.com/a/20797794/2604144)" R. StackOverflow, 27 Dec. 2013. Web. 28 Sept. 2014.

Zhao, Yanchang. "[Outlier Detection - RDataMining.com: R and Data Mining](http://www.rdatamining.com/examples/outlier-detection)" RDataMining.com: R and Data Mining. N.p., 2014. Web. 27 Sept. 2014.