2.1 Histograms, boxplots, and density curves

Part of learning statistics is learning to correctly use the terminology, some of which is used colloquially differently than it is used in formal statistical settings. The most commonly “misused” statistical term is data. In statistical parlance, we want to note the plurality of data. Specifically, datum is a single measurement, possibly on multiple random variables, and so it is appropriate to say that “a datum is…”. Once we move to discussing data, we are now referring to more than one observation, again on one, or possibly more than one, random variable, and so we need to use “data are…” when talking about our observations. We want to distinguish our use of the term “data” from its more colloquial12 usage that often involves treating it as singular. In a statistical setting “data” refers to measurements of our cases or units. When we summarize the results of a study (say providing the mean and SD), that information is not “data”. We used our data to generate that information. Sometimes we also use the term “data set” to refer to all our observations and this is a singular term to refer to the group of observations and this makes it really easy to make mistakes on the usage of “data”13.

It is also really important to note that variables have to vary – if you measure the level of education of your subjects but all are high school graduates, then you do not have a “variable”. You may not know if you have real variability in a “variable” until you explore the results you obtained.

The last, but probably most important, aspect of data is the context of the measurement. The “who, what, when, and where” of the collection of the observations is critical to the sort of conclusions we can make based on the results. The information on the study design provides information required to assess the scope of inference of the study. Generally, remember to think about the research questions the researchers were trying to answer and whether their study actually would answer those questions. There are no formulas to help us sort some of these things out, just critical thinking about the context of the measurements.

To make this concrete, consider the data collected from a study (Walker, Garrard, and Jowitt 2014) to investigate whether clothing worn by a bicyclist might impact the passing distance of cars. One of the author’s wore seven different outfits (outfit for the day was chosen randomly by shuffling seven playing cards) on his regular 26 km commute near London in the United Kingdom. Using a specially instrumented bicycle, they measured how close the vehicles passed to the widest point on the handlebars. The seven outfits (“conditions”) that you can view at https://www.sciencedirect.com/science/article/pii/S0001457513004636 were:

  • COMMUTE: Plain cycling jersey and pants, reflective cycle clips, commuting helmet, and bike gloves.

  • CASUAL: Rugby shirt with pants tucked into socks, wool hat or baseball cap, plain gloves, and small backpack.

  • HIVIZ: Bright yellow reflective cycle commuting jacket, plain pants, reflective cycle clips, commuting helmet, and bike gloves.

  • RACER: Colorful, skin-tight, Tour de France cycle jersey with sponsor logos, Lycra bike shorts or tights, race helmet, and bike gloves.

  • NOVICE: Yellow reflective vest with “Novice Cyclist, Pass Slowly” and plain pants, reflective cycle clips, commuting helmet, and bike gloves.

  • POLICE: Yellow reflective vest with “POLICEwitness.com – Move Over – Camera Cyclist” and plain pants, reflective cycle clips, commuting helmet, and bike gloves.

  • POLITE: Yellow reflective vest with blue and white checked banding and the words “POLITE notice, Pass Slowly” looking similar to a police jacket and plain pants, reflective cycle clips, commuting helmet, and bike gloves.

They collected data (distance to the vehicle in cm for each car “overtake”) on between 8 and 11 rides in each outfit and between 737 and 868 “overtakings” across these rides. The outfit is a categorical predictor or explanatory variable) that has seven different levels here. The distance is the response variable and is a quantitative variable here14. Note that we do not have the information on which overtake came from which ride in the data provided or the conditions related to individual overtake observations other than the distance to the vehicle (they only included overtakings that had consistent conditions for the road and riding).

The data are posted on my website15 at http://www.math.montana.edu/courses/s217/documents/Walker2014_mod.csv if you want to download the file to a local directory and then import the data into R using “Import Dataset”. Or you can use the code in the following codechunk to directly read the data set into R using the URL.

It is always good to review the data you have read by running the code and printing the tibble by typing the tibble name (here > dd) at the command prompt in the console, using the View function, (here View(dd)), to open a spreadsheet-like view, or using the head and tail functions have been show the first and last ten observations:

## # A tibble: 6 x 8
##   Condition Distance Shirt Helmet Pants Gloves ReflectClips Backpack
##   <chr>        <dbl> <chr> <chr>  <chr> <chr>  <chr>        <chr>   
## 1 casual         132 Rugby hat    plain plain  no           yes     
## 2 casual         137 Rugby hat    plain plain  no           yes     
## 3 casual         174 Rugby hat    plain plain  no           yes     
## 4 casual          82 Rugby hat    plain plain  no           yes     
## 5 casual         106 Rugby hat    plain plain  no           yes     
## 6 casual          48 Rugby hat    plain plain  no           yes
## # A tibble: 6 x 8
##   Condition Distance Shirt      Helmet Pants Gloves ReflectClips Backpack
##   <chr>        <dbl> <chr>      <chr>  <chr> <chr>  <chr>        <chr>   
## 1 racer          122 TourJersey race   lycra bike   yes          no      
## 2 racer          204 TourJersey race   lycra bike   yes          no      
## 3 racer          116 TourJersey race   lycra bike   yes          no      
## 4 racer          132 TourJersey race   lycra bike   yes          no      
## 5 racer          224 TourJersey race   lycra bike   yes          no      
## 6 racer           72 TourJersey race   lycra bike   yes          no

Another option is to directly access specific rows and/or columns of the tibble, especially for larger data sets. In objects containing data, we can select certain rows and columns using the brackets, [..., ...], to specify the row (first element) and column (second element). For example, we can extract the datum in the fourth row and second column using dd[4,2]:

## # A tibble: 1 x 1
##   Distance
##      <dbl>
## 1       82

This provides the distance (in cm) of a pass at 82 cm. To get all of either the rows or columns, a space is used instead of specifying a particular number. For example, the information in all the columns on the fourth observation can be obtained using dd[4, ]:

## # A tibble: 1 x 8
##   Condition Distance Shirt Helmet Pants Gloves ReflectClips Backpack
##   <chr>        <dbl> <chr> <chr>  <chr> <chr>  <chr>        <chr>   
## 1 casual          82 Rugby hat    plain plain  no           yes

So this was an observation from the casual condition that had a passing distance of 82 cm. The other columns describe some other specific aspects of the condition. To get a more complete sense of the data set, we can extract a suite of observations from each condition using their row numbers concatenated, c(), together, extracting all columns for two observations from each of the conditions based on their rows.

## # A tibble: 14 x 8
##    Condition Distance Shirt      Helmet  Pants Gloves ReflectClips Backpack
##    <chr>        <dbl> <chr>      <chr>   <chr> <chr>  <chr>        <chr>   
##  1 casual         132 Rugby      hat     plain plain  no           yes     
##  2 casual         137 Rugby      hat     plain plain  no           yes     
##  3 commute         70 PlainJers~ commut~ plain bike   yes          no      
##  4 commute        151 PlainJers~ commut~ plain bike   yes          no      
##  5 hiviz           94 Jacket     commut~ plain bike   yes          no      
##  6 hiviz          145 Jacket     commut~ plain bike   yes          no      
##  7 novice          12 Vest_Novi~ commut~ plain bike   yes          no      
##  8 novice         122 Vest_Novi~ commut~ plain bike   yes          no      
##  9 police         113 Vest_Poli~ commut~ plain bike   yes          no      
## 10 police         174 Vest_Poli~ commut~ plain bike   yes          no      
## 11 polite         156 Vest_Poli~ commut~ plain bike   yes          no      
## 12 polite          14 Vest_Poli~ commut~ plain bike   yes          no      
## 13 racer          104 TourJersey race    lycra bike   yes          no      
## 14 racer          141 TourJersey race    lycra bike   yes          no

Now we can see the Condition variable seems to have seven different levels, the Distance variable contains the overtake distance, and then a suite of columns that describe aspects of each outfit, such as the type of shirt or whether reflective cycling clips were used or not. We will only use the “Distance” and “Condition” variables to start with.

When working with data, we should always start with summarizing the sample size. We will use n for the number of subjects in the sample and denote the population size (if available) with N. Here, the sample size is n=5690. In this situation, we do not have a random sample from a population (these were all of the overtakes that met the criteria during the rides) so we cannot make inferences from our sample to a larger group (other rides or for other situations like different places, times, or riders). But we can assess whether there is a causal effect16: if sufficient evidence is found to conclude that there is some difference in the responses across the conditions, we can attribute those differences to the treatments applied, since the overtake events should be same otherwise due to the outfit being randomly assigned to the rides. The story of the data set – that it was collected on a particular route for a particular rider in the UK – becomes pretty important in thinking about the ramifications of any results. Are drivers in Montana or South Dakota different from drivers near London? Are the road and traffic conditions likely to be different? If so, then we should not assume that the detected differences, if detected, would also exist in some other location for a different rider. The lack of a random sample here from all the overtakes in the area (or more generally all that happen around the world) makes it impossible to assume that this set of overtakes might be like others. So there are definite limitations to the inferences in the following results. But it is still interesting to see if the outfits worn caused a difference in the mean overtake distances, even though the inferences are limited to the conditions in this individual’s commute. If this had been an observational study (suppose that the researcher could select their outfit), then we would have to avoid any of the “causal” language that we can consider here because the outfits were not randomly assigned to the rides. Without random assignment, the explanatory variable of outfit choice could be confounded with another characteristic of rides that might be related to the passing distances, such as wearing a particular outfit because of an expectation of heavy traffic or poor light conditions. Confounding is not the only reason to avoid causal statements with non-random assignment but the inability to separate the effect of other variables (measured or unmeasured) from the differences we are observing means that our inferences in these situations need to be carefully stated to avoid implying causal effects.

In order to get some summary statistics, we will rely on the R package called mosaic (Pruim, Kaplan, and Horton 2019) as introduced previously. First (but only once), you need to install the package, which can be done either using the Packages tab in the lower right panel of RStudio or using the install.packages function with quotes around the package name:

If you open a .Rmd file that contains code that incorporates packages and they are not installed, the bar at the top of the markdown document will prompt you to install those missing packages. This is the easiest way to get packages you might need installed. After making sure that any required packages are installed, use the library function around the package name (no quotes now!) to load the package, something that you need to do any time you want to use features of a package.

When you are loading a package, R might mention a need to install other packages. If the output says that it needs a package that is unavailable, then follow the same process noted above to install that package and then repeat trying to load the package you wanted. These are called package “dependencies” and are due to one package developer relying on functions that already exist in another package.

With tibbles, you have to declare categorical variables as “factors” to have R correctly handle the variables using the factor function. This can be a bit time repetitive but provides some utility for data wrangling in more complex situations to read in the data and then declare their type. For quantitative variables, this is not required and they are stored as numeric variables. The following code declares the categorical variables in the data set as factors and saves them back into the variables of the same names:

With many variables in a data set, it is often useful to get some quick information about all of them; the summary function provides useful information whether the variables are categorical or quantitative and notes if any values were missing.

##    Condition      Distance             Shirt          Helmet    
##  casual :779   Min.   :  2.0   Jacket     :737   commuter:4059  
##  commute:857   1st Qu.: 99.0   PlainJersey:857   hat     : 779  
##  hiviz  :737   Median :117.0   Rugby      :779   race    : 852  
##  novice :807   Mean   :117.1   TourJersey :852                  
##  police :790   3rd Qu.:134.0   Vest_Novice:807                  
##  polite :868   Max.   :274.0   Vest_Police:790                  
##  racer  :852                   Vest_Polite:868                  
##    Pants        Gloves     ReflectClips Backpack  
##  lycra: 852   bike :4911   no : 779     no :4911  
##  plain:4838   plain: 779   yes:4911     yes: 779  
##                                                   
##                                                   
##                                                   
##                                                   
## 

The output is organized by variable, providing summary information based on the type of variable, either counts by category for categorical variables or the 5-number summary plus the mean for the quantitative variable Distance. If present, you would also get a count of missing values that are called “NAs” in R. For the first variable, called Condition and that we might more explicitly name Outfit, we find counts of the number of overtakes for each outfit: \(779\) out of \(5,690\) were when wearing the casual outfit, \(857\) for “commute”, and the other observations from the other five outfits, with the most observations when wearing the “polite” vest. We can also see that overtake distances (variable Distance) ranged from 2 cm to 274 cm with a median of 117 cm.

To accompany the numerical summaries, histograms and boxplots can provide some initial information on the shape of the distribution of the responses for the different Outfits. Figure 2.1 contains the histogram and boxplot of Distance, ignoring any information on which outfit was being worn. The calls to the two plotting functions are enhanced slightly to add better labels using xlab, ylab, and main.

Histogram and boxplot of passing distances in cm.

Figure 2.1: Histogram and boxplot of passing distances in cm.

The distribution appears to be relatively symmetric with many observations in both tails flagged as potential outliers. Despite being flagged as potential outliers, they seem to be part of a common distribution. In real data sets, outliers are commonly encountered and the first step is to verify that they were not errors in recording (if so, fixing or removing them is easily justified). If they cannot be easily dismissed or fixed, the next step is to study their impact on the statistical analyses performed, potentially considering reporting results with and without the influential observation(s) in the results (if there are just handful). If the analysis is unaffected by the “unusual” observations, then it matters little whether they are dropped or not. If they do affect the results, then reporting both versions of results allows the reader to judge the impacts for themselves. It is important to remember that sometimes the outliers are the most interesting part of the data set. For example, those observations that were the closest would be of great interest, whether they are outliers or not.

Often when statisticians think of distributions of data, we think of the smooth underlying shape that led to the data set that is being displayed in the histogram. Instead of binning up observations and making bars in the histogram, we can estimate what is called a density curve as a smooth curve that represents the observed distribution of the responses. Density curves can sometimes help us see features of the data sets more clearly.

To understand the density curve, it is useful to initially see the histogram and density curve together. The height of the density curve is scaled so that the total area under the curve17 is 1. To make a comparable histogram, the y-axis needs to be scaled so that the histogram is also on the “density” scale which makes the bar heights adjust so that the proportion of the total data set in each bar is represented by the area in each bar (remember that area is height times width). So the height depends on the width of the bars and the total area across all the bars has to be 1. In the hist function, the freq=F option does this required re-scaling to get density-scaled histogram bars. The density curve is added to the histogram using the R code of lines(density()), producing the result in Figure 2.2 with added modifications of options for lwd (line width) and col (color) to make the plot more visually appealing. You can see how the density curve somewhat matches the histogram bars but deals with the bumps up and down and edges a little differently. We can pick out the relatively symmetric distribution using either display and will rarely make both together.

Histogram and density curve of Distance responses.

Figure 2.2: Histogram and density curve of Distance responses.

Histograms can be sensitive to the choice of the number of bars and even the cut-offs used to define the bins for a given number of bars. Small changes in the definition of cut-offs for the bins can have noticeable impacts on the shapes observed but this does not impact density curves. We are not going to tinker with the default choices for bars in histogram, as they are reasonably selected in R, but we can add information on the original observations being included in each bar to better understand the choices that hist is making. In the previous display, we can add what is called a rug to the plot, where a tick mark is made on the x-axis for each observation. Because the responses appear to be rounded to the nearest cm, there is some discreteness in the responses and we need to use a graphical technique called jittering to add a little noise18 to each observation so all the observations at each distance value do not plot as a single line. In Figure 2.3, the added tick marks on the x-axis show the approximate locations of the original observations. We can (barely) see how there are 2 observations at 2 cm (the noise added generates a wider line than for an individual observation so it is possible to see that it is more than one observation there). A limitation of the histogram arises at the center of the distribution where the bar that goes from 100 to 120 cm suggests that the mode (peak) is in this range (but it is unclear where) but the density curve suggests that the peak is closer to 120 than 100. The density curve also shows some small bumps in the tails of the distributions tied to individual observations that are not really displayed in the histogram. Density curves are, however, not perfect and this one shows a tiny bit of area for distances less than 0 cm which is not possible here. When we make density curves below, we will cut off the curves at the most extreme values to avoid this issue.

Histogram with density curve and rug plot of the jittered distance responses.

Figure 2.3: Histogram with density curve and rug plot of the jittered distance responses.

The graphical tools we’ve just discussed are going to help us move to comparing the distribution of responses across more than one group. We will have two displays that will help us make these comparisons. The simplest is the side-by-side boxplot, where a boxplot is displayed for each group of interest using the same y-axis scaling. In R, we can use its formula notation to see if the response (Distance) differs based on the group (Condition) by using something like Y~X or, here, Distance~Condition. We also need to tell R where to find the variables – use the last option in the command, data=DATASETNAME , to inform R of the tibble to look in to find the variables. In this example, data=dd. We will use the formula and data=... options in almost every function we use from here forward. Figure 2.4 contains the side-by-side boxplots showing similar distributions for all the groups, with a slightly higher median in the “police” group and some outliers identified in all groups.

Side-by-side boxplot of distances based on outfits.

Figure 2.4: Side-by-side boxplot of distances based on outfits.

The “~” (which is read as the tilde symbol19, which you can find in the upper left corner of your keyboard) notation will be used in two ways this semester. The formula use in R employed previously declares that the response variable here is Distance and the explanatory variable is Condition. The other use for “~” is as shorthand for “is distributed as” and is used in the context of \(Y \sim N(0,1)\), which translates (in statistics) to defining the random variable Y as following a Normal distribution20 with mean 0 and standard deviation of 1. In the current situation, we could ask whether the Distance variable seems like it may follow a normal distribution in each group, in other words, is \(\text{Years}\sim N(\mu,\sigma^2)\)? Since the responses are relatively symmetric, it is not clear that we have a violation of the assumption of the normality assumption for the Distance variable for any of the seven groups (more later on how we can assess this and the issues that occur when we have a violation of this assumption). Remember that \(\mu\) and \(\sigma\) are parameters where \(\mu\) (“mu”) is our standard symbol for the population mean and that \(\sigma\) (“sigma”) is the symbol of the population standard deviation.

References

Pruim, Randall, Daniel T. Kaplan, and Nicholas J. Horton. 2019. Mosaic: Project Mosaic Statistics and Mathematics Teaching Utilities. https://CRAN.R-project.org/package=mosaic.

Walker, Ian, Ian Garrard, and Felicity Jowitt. 2014. “The Influence of a Bicycle Commuter’s Appearance on Drivers’ Overtaking Proximities: An on-Road Test of Bicyclist Stereotypes, High-Visibility Clothing and Safety Aids in the United Kingdom.” Accident Analysis & Prevention 64: 69–77. https://doi.org/https://doi.org/10.1016/j.aap.2013.11.007.


  1. You will more typically hear “data is” but that more often refers to information, sometimes even statistical summaries of data sets, than to observations made on subjects collected as part of a study, suggesting the confusion of this term in the general public. We will explore a data set in Chapter ?? related to perceptions of this issue collected by researchers at http://fivethirtyeight.com/.

  2. Either try to remember “data is a plural word” or replace “data” with “things” in your sentence and consider whether it sounds right.

  3. Of particular interest to the bicycle rider might be the “close” passes and we will revisit this as a categorical response with “close” and “not close” as its two categories later.

  4. Thanks to Ian Walker for allowing me to use and post these data.

  5. As noted previously, we reserve the term “effect” for situations where random assignment allows us to consider causality as the reason for the differences in the response variable among levels of the explanatory variable, but this is only the case if we find evidence against the null hypothesis of no difference in the groups.

  6. If you’ve taken calculus, you will know that the curve is being constructed so that the integral from \(-\infty\) to \(\infty\) is 1. If you don’t know calculus, think of a rectangle with area of 1 based on its height and width. These cover the same area but the top of the region wiggles.

  7. Jittering typically involves adding random variability to each observation that is uniformly distributed in a range determined based on the spacing of the observation. The idea is to jitter just enough to see all the points but not too much. This means that if you re-run the jitter function, the results will change if you do not set the random number seed using set.seed that is discussed more below. For more details, type help(jitter) in the console in RStudio.

  8. If you want to type this character in Rmarkdown, try “\(\sim\)” outside of codechunks.

  9. Remember the bell-shaped curve you encountered in introductory statistics? If not, you can see some at https://en.wikipedia.org/wiki/Normal_distribution.