Part of learning statistics is learning to use its terminology correctly, and some terms are used differently in everyday speech than in formal statistical settings. The most commonly “misused” statistical term is data. In statistical parlance, we want to note the plurality of data. Specifically, a datum is a single measurement, possibly on multiple random variables, and so it is appropriate to say that “a datum is…”. Once we move to discussing data, we are referring to more than one observation, again on one or possibly more than one random variable, and so we need to use “data are…” when talking about our observations. We want to distinguish our use of the term “data” from its more colloquial[12] usage, which often treats it as singular. In a statistical setting, “data” refers to measurements of our cases or units. When we summarize the results of a study (say, by providing the mean and SD), that information is not “data”; we used our data to generate that information. Sometimes we also use the term “data set” to refer to all our observations. This is a singular term for the group of observations, which makes it really easy to make mistakes in the usage of “data”[13].
It is also really important to note that variables have to vary – if you measure the level of education of your subjects but all are high school graduates, then you do not have a “variable”. You may not know if you have real variability in a “variable” until you explore the results you obtained.
The last, but probably most important, aspect of data is the context of the measurement. The “who, what, when, and where” of the collection of the observations is critical to the sort of conclusions we can make based on the results. The information on the study design provides information required to assess the scope of inference of the study. Generally, remember to think about the research questions the researchers were trying to answer and whether their study actually would answer those questions. There are no formulas to help us sort some of these things out, just critical thinking about the context of the measurements.
To make this concrete, consider the data collected from a study (Walker, Garrard, and Jowitt 2014) to investigate whether clothing worn by a bicyclist might impact the passing distance of cars. One of the authors wore seven different outfits (the outfit for the day was chosen randomly by shuffling seven playing cards) on his regular 26 km commute near London in the United Kingdom. Using a specially instrumented bicycle, they measured how close the vehicles passed to the widest point on the handlebars. The seven outfits (“conditions”), which you can view at https://www.sciencedirect.com/science/article/pii/S0001457513004636, were:
COMMUTE: Plain cycling jersey and pants, reflective cycle clips, commuting helmet, and bike gloves.
CASUAL: Rugby shirt with pants tucked into socks, wool hat or baseball cap, plain gloves, and small backpack.
HIVIZ: Bright yellow reflective cycle commuting jacket, plain pants, reflective cycle clips, commuting helmet, and bike gloves.
RACER: Colorful, skin-tight, Tour de France cycle jersey with sponsor logos, Lycra bike shorts or tights, race helmet, and bike gloves.
NOVICE: Yellow reflective vest with “Novice Cyclist, Pass Slowly” and plain pants, reflective cycle clips, commuting helmet, and bike gloves.
POLICE: Yellow reflective vest with “POLICEwitness.com – Move Over – Camera Cyclist” and plain pants, reflective cycle clips, commuting helmet, and bike gloves.
POLITE: Yellow reflective vest with blue and white checked banding and the words “POLITE notice, Pass Slowly” looking similar to a police jacket and plain pants, reflective cycle clips, commuting helmet, and bike gloves.
They collected data (distance to the vehicle in cm for each car “overtake”) on between 8 and 11 rides in each outfit and between 737 and 868 “overtakings” across these rides. The outfit is a categorical predictor (or explanatory variable) that has seven different levels here. The distance is the response variable and is a quantitative variable here[14]. Note that the data provided do not identify which overtake came from which ride, and they contain no information on the conditions of individual overtakes other than the distance to the vehicle (the researchers only included overtakings that had consistent conditions for the road and riding).
The data are posted on my website[15] at http://www.math.montana.edu/courses/s217/documents/Walker2014_mod.csv if you want to download the file to a local directory and then import the data into R using “Import Dataset”. Or you can use the code in the following code chunk to directly read the data set into R using the URL.
suppressMessages(library(readr))
dd <- read_csv("http://www.math.montana.edu/courses/s217/documents/Walker2014_mod.csv")
It is always good to review the data you have read in by printing the tibble, either by typing the tibble name (here dd) at the command prompt in the console, by using the View function (here View(dd)) to open a spreadsheet-like view, or by using the head and tail functions (here head(dd) and tail(dd)) to show the first and last six observations:
## # A tibble: 6 x 8
## Condition Distance Shirt Helmet Pants Gloves ReflectClips Backpack
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 casual 132 Rugby hat plain plain no yes
## 2 casual 137 Rugby hat plain plain no yes
## 3 casual 174 Rugby hat plain plain no yes
## 4 casual 82 Rugby hat plain plain no yes
## 5 casual 106 Rugby hat plain plain no yes
## 6 casual 48 Rugby hat plain plain no yes
## # A tibble: 6 x 8
## Condition Distance Shirt Helmet Pants Gloves ReflectClips Backpack
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 racer 122 TourJersey race lycra bike yes no
## 2 racer 204 TourJersey race lycra bike yes no
## 3 racer 116 TourJersey race lycra bike yes no
## 4 racer 132 TourJersey race lycra bike yes no
## 5 racer 224 TourJersey race lycra bike yes no
## 6 racer 72 TourJersey race lycra bike yes no
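As a minimal sketch of these inspection tools that runs without downloading the file, the block below rebuilds a small data frame from the twelve rows printed above. The object name toy and the choice to keep only two columns are assumptions made for illustration; it is a base R data frame, not the tibble dd.

```r
# Toy subset (NOT the full data set): the first six and last six rows printed above,
# keeping only the two columns used at the start of this chapter.
toy <- data.frame(
  Condition = rep(c("casual", "racer"), each = 6),
  Distance  = c(132, 137, 174, 82, 106, 48,    # casual rows
                122, 204, 116, 132, 224, 72)   # racer rows
)

dim(toy)                        # number of rows, then number of columns
first_six  <- head(toy)         # head() returns six rows unless n is changed
last_three <- tail(toy, n = 3)  # n controls how many rows are returned
first_six
last_three
```

The n argument is worth remembering for larger data sets, where six rows may not be enough to see more than one group.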
Another option is to directly access specific rows and/or columns of the tibble, especially for larger data sets. In objects containing data, we can select certain rows and columns using the brackets, [..., ...], to specify the row (first element) and column (second element). For example, we can extract the datum in the fourth row and second column using dd[4,2]:
## # A tibble: 1 x 1
## Distance
## <dbl>
## 1 82
This shows that the passing distance for this observation was 82 cm. To get all of either the rows or columns, the corresponding position is left blank instead of specifying a particular number. For example, the information in all the columns on the fourth observation can be obtained using dd[4, ]:
## # A tibble: 1 x 8
## Condition Distance Shirt Helmet Pants Gloves ReflectClips Backpack
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 casual 82 Rugby hat plain plain no yes
So this was an observation from the casual condition that had a passing distance of 82 cm. The other columns describe some other specific aspects of the condition. To get a more complete sense of the data set, we can extract a suite of observations from each condition by concatenating their row numbers with c(), extracting all columns for two observations from each of the conditions based on their rows.
## # A tibble: 14 x 8
## Condition Distance Shirt Helmet Pants Gloves ReflectClips Backpack
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 casual 132 Rugby hat plain plain no yes
## 2 casual 137 Rugby hat plain plain no yes
## 3 commute 70 PlainJers~ commut~ plain bike yes no
## 4 commute 151 PlainJers~ commut~ plain bike yes no
## 5 hiviz 94 Jacket commut~ plain bike yes no
## 6 hiviz 145 Jacket commut~ plain bike yes no
## 7 novice 12 Vest_Novi~ commut~ plain bike yes no
## 8 novice 122 Vest_Novi~ commut~ plain bike yes no
## 9 police 113 Vest_Poli~ commut~ plain bike yes no
## 10 police 174 Vest_Poli~ commut~ plain bike yes no
## 11 polite 156 Vest_Poli~ commut~ plain bike yes no
## 12 polite 14 Vest_Poli~ commut~ plain bike yes no
## 13 racer 104 TourJersey race lycra bike yes no
## 14 racer 141 TourJersey race lycra bike yes no
Now we can see that the Condition variable seems to have seven different levels, the Distance variable contains the overtake distance, and a suite of other columns describe aspects of each outfit, such as the type of shirt or whether reflective cycling clips were used or not. We will only use the “Distance” and “Condition” variables to start with.
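The bracket indexing described above can be sketched on a small stand-in for dd. The data frame toy below is an assumption built from the six casual rows printed earlier. One difference to keep in mind: a base R data frame returns a bare value for a single cell, whereas a tibble returns a 1 x 1 tibble as shown in the output above.

```r
# Toy subset: the six casual rows printed by head(dd) above (hypothetical stand-in for dd)
toy <- data.frame(
  Condition = rep("casual", 6),
  Distance  = c(132, 137, 174, 82, 106, 48),
  Backpack  = rep("yes", 6)
)

one_value <- toy[4, 2]                 # row 4, column 2: a single datum
one_row   <- toy[4, ]                  # blank column position: all columns for row 4
some_rows <- toy[c(1, 2, 6), ]         # c() concatenates row numbers: rows 1, 2, and 6
by_name   <- toy[c(1, 4), "Distance"]  # columns can also be selected by name

one_value
some_rows
```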
When working with data, we should always start with summarizing the sample size. We will use n for the number of subjects in the sample and denote the population size (if available) with N. Here, the sample size is n=5690. In this situation, we do not have a random sample from a population (these were all of the overtakes that met the criteria during the rides) so we cannot make inferences from our sample to a larger group (other rides or other situations like different places, times, or riders). But we can assess whether there is a causal effect[16]: if sufficient evidence is found to conclude that there is some difference in the responses across the conditions, we can attribute those differences to the treatments applied, since the overtake events should otherwise be the same due to the outfit being randomly assigned to the rides. The story of the data set (that it was collected on a particular route for a particular rider in the UK) becomes pretty important in thinking about the ramifications of any results. Are drivers in Montana or South Dakota different from drivers near London? Are the road and traffic conditions likely to be different? If so, then we should not assume that any detected differences would also exist in some other location for a different rider. The lack of a random sample here from all the overtakes in the area (or more generally all that happen around the world) makes it impossible to assume that this set of overtakes might be like others. So there are definite limitations to the inferences in the following results. But it is still interesting to see if the outfits worn caused a difference in the mean overtake distances, even though the inferences are limited to the conditions in this individual’s commute.
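The random assignment that supports the causal language here can be mimicked in R. This is only a sketch: exactly how the researchers used the shuffled cards is not described in this chapter, so the block assumes that the top card after a shuffle picks the outfit for a ride. The seed value is arbitrary.

```r
outfits <- c("casual", "commute", "hiviz", "novice", "police", "polite", "racer")

set.seed(7)  # arbitrary seed, only so the shuffle can be reproduced

# sample() without a size argument returns a random permutation,
# which plays the role of shuffling seven playing cards
shuffled <- sample(outfits)
todays_outfit <- shuffled[1]  # assumption: the "top card" picks the outfit for one ride

shuffled
todays_outfit
```

Because the outfit is chosen by chance and not by the rider, it cannot be systematically tied to traffic, weather, or light conditions on a given ride.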
If this had been an observational study (suppose that the researcher could select their outfit), then we would have to avoid any of the “causal” language that we can consider here because the outfits were not randomly assigned to the rides. Without random assignment, the explanatory variable of outfit choice could be confounded with another characteristic of rides that might be related to the passing distances, such as wearing a particular outfit because of an expectation of heavy traffic or poor light conditions. Confounding is not the only reason to avoid causal statements with non-random assignment but the inability to separate the effect of other variables (measured or unmeasured) from the differences we are observing means that our inferences in these situations need to be carefully stated to avoid implying causal effects.
In order to get some summary statistics, we will rely on the R package called mosaic (Pruim, Kaplan, and Horton 2019) as introduced previously. First (but only once), you need to install the package, which can be done either using the Packages tab in the lower right panel of RStudio or using the install.packages function with quotes around the package name, as in install.packages("mosaic").
If you open a .Rmd file that contains code that incorporates packages and they are not installed, the bar at the top of the markdown document will prompt you to install those missing packages. This is the easiest way to get packages you might need installed. After making sure that any required packages are installed, use the library function around the package name (no quotes now!), as in library(mosaic), to load the package, something that you need to do any time you want to use features of a package.
When you are loading a package, R might mention a need to install other packages. If the output says that it needs a package that is unavailable, then follow the same process noted above to install that package and then repeat trying to load the package you wanted. These are called package “dependencies” and are due to one package developer relying on functions that already exist in another package.
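The install-once, load-every-time pattern can be wrapped in a small helper. The function name ensure_package is hypothetical (it is not part of R or mosaic), and the demonstration uses the tools package only because it ships with every R installation, so nothing is downloaded when the sketch is run.

```r
# Hypothetical helper: install a package only if it is missing, then load it
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # quotes are needed here; pkg holds the name as a string
  }
  # character.only = TRUE lets library() accept the name stored in a variable
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

ensure_package("tools")     # ships with R, so no installation is triggered
# ensure_package("mosaic")  # would install mosaic (and its dependencies) if missing
```

In an interactive session, typing library(mosaic) directly is simpler. A helper like this is mostly useful in scripts that are shared with others.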
With tibbles, you have to declare categorical variables as “factors” using the factor function to have R handle the variables correctly. This can be a bit repetitive, but reading in the data and then declaring the variable types provides some utility for data wrangling in more complex situations. For quantitative variables, this is not required and they are stored as numeric variables. The following code declares the categorical variables in the data set as factors and saves them back into the variables of the same names:
dd$Condition <- factor(dd$Condition)
dd$Shirt <- factor(dd$Shirt)
dd$Helmet <- factor(dd$Helmet)
dd$Pants <- factor(dd$Pants)
dd$Gloves <- factor(dd$Gloves)
dd$ReflectClips <- factor(dd$ReflectClips)
dd$Backpack <- factor(dd$Backpack)
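When every text column should become a factor, the repeated assignments can be replaced by one pass over the columns. This is a sketch on a made-up four-row data frame (toy is an assumption, with values borrowed from the rows printed earlier), and it would not be appropriate for a data set with free-text columns that are not categories.

```r
# Toy stand-in for dd with two text columns and one numeric column
toy <- data.frame(
  Condition = c("casual", "racer", "casual", "hiviz"),
  Distance  = c(132, 122, 137, 94),
  Backpack  = c("yes", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Find the character columns, then convert each of them with factor()
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)               # Condition and Backpack are now factors; Distance is still numeric
levels(toy$Condition)  # levels are sorted alphabetically by default
```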
With many variables in a data set, it is often useful to get some quick information about all of them; the summary function (here summary(dd)) provides useful information whether the variables are categorical or quantitative and notes if any values were missing.
## Condition Distance Shirt Helmet
## casual :779 Min. : 2.0 Jacket :737 commuter:4059
## commute:857 1st Qu.: 99.0 PlainJersey:857 hat : 779
## hiviz :737 Median :117.0 Rugby :779 race : 852
## novice :807 Mean :117.1 TourJersey :852
## police :790 3rd Qu.:134.0 Vest_Novice:807
## polite :868 Max. :274.0 Vest_Police:790
## racer :852 Vest_Polite:868
## Pants Gloves ReflectClips Backpack
## lycra: 852 bike :4911 no : 779 no :4911
## plain:4838 plain: 779 yes:4911 yes: 779
##
##
##
##
##
The output is organized by variable, providing summary information based on the type of variable, either counts by category for categorical variables or the 5-number summary plus the mean for the quantitative variable Distance. If present, you would also get a count of missing values, which are called “NAs” in R. For the first variable, called Condition, which we might more explicitly name Outfit, we find counts of the number of overtakes for each outfit: \(779\) out of \(5,690\) were when wearing the casual outfit, \(857\) for “commute”, and the other observations from the other five outfits, with the most observations when wearing the “polite” vest. We can also see that overtake distances (variable Distance) ranged from 2 cm to 274 cm with a median of 117 cm.
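To see how summary reports each type of variable, including missing values, here is a minimal sketch on six made-up rows. The data frame toy is an assumption: five distances are taken from rows printed earlier and the NA is inserted on purpose, since the real data set shows no missing values.

```r
toy <- data.frame(
  Condition = factor(c("casual", "casual", "commute", "commute", "hiviz", "hiviz")),
  Distance  = c(132, 137, 70, 151, 94, NA)  # the NA is hypothetical, added for illustration
)

summary(toy)  # counts per level for the factor; quartiles, mean, and NA count for Distance

outfit_counts <- table(toy$Condition)    # counts by category on their own
five_num      <- fivenum(toy$Distance)   # min, lower hinge, median, upper hinge, max (NAs dropped)
n_missing     <- sum(is.na(toy$Distance))

outfit_counts
five_num
n_missing
```

Note that summary computes quartiles with quantile while fivenum uses hinges, so the two can differ slightly in small samples.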
To accompany the numerical summaries, histograms and boxplots can provide some initial information on the shape of the distribution of the responses for the different Outfits. Figure 2.1 contains the histogram and boxplot of Distance, ignoring any information on which outfit was being worn. The calls to the two plotting functions are enhanced slightly to add better labels using xlab, ylab, and main.
# labels=TRUE prints the count above each histogram bar
hist(dd$Distance, xlab="Distance (cm)", labels=TRUE, main="Histogram of Distances")
boxplot(dd$Distance, ylab="Distance (cm)", main="Boxplot of Distances")
The distribution appears to be relatively symmetric with many observations in both tails flagged as potential outliers. Despite being flagged as potential outliers, they seem to be part of a common distribution. In real data sets, outliers are commonly encountered and the first step is to verify that they were not errors in recording (if so, fixing or removing them is easily justified). If they cannot be easily dismissed or fixed, the next step is to study their impact on the statistical analyses performed, potentially reporting results with and without the influential observation(s) (if there are just a handful). If the analysis is unaffected by the “unusual” observations, then it matters little whether they are dropped or not. If they do affect the results, then reporting both versions of results allows the reader to judge the impacts for themselves. It is important to remember that sometimes the outliers are the most interesting part of the data set. For example, those observations that were the closest would be of great interest, whether they are outliers or not.
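The points a boxplot flags can be pulled out directly with boxplot.stats, which makes the “with and without” comparison easy to sketch. The ten values below are hypothetical toy values loosely based on the printed summaries (they include the minimum of 2 and the maximum of 274), not a sample drawn from dd.

```r
# Hypothetical toy distances (cm), including the two most extreme values in the summary
x <- c(2, 99, 106, 116, 117, 122, 132, 134, 137, 274)

bs <- boxplot.stats(x)  # default coef = 1.5 gives the usual 1.5 * IQR rule (using hinges)
flagged <- bs$out       # observations beyond the whisker limits
flagged

# Compare a summary with and without the flagged observations
mean_all     <- mean(x)
mean_trimmed <- mean(x[!(x %in% flagged)])
c(all = mean_all, without_flagged = mean_trimmed)
```

Here the lower and upper hinges are 106 and 134, so the fences sit at 64 and 176 and both extreme values are flagged. Whether to drop them is a judgment about the study, not something the rule decides.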
Often when statisticians think of distributions of data, we think of the smooth underlying shape that led to the data set that is being displayed in the histogram. Instead of binning up observations and making bars in the histogram, we can estimate what is called a density curve as a smooth curve that represents the observed distribution of the responses. Density curves can sometimes help us see features of the data sets more clearly.
To understand the density curve, it is useful to initially see the histogram and density curve together. The height of the density curve is scaled so that the total area under the curve[17] is 1. To make a comparable histogram, the y-axis needs to be scaled so that the histogram is also on the “density” scale, which makes the bar heights adjust so that the proportion of the total data set in each bar is represented by the area of each bar (remember that area is height times width). So the height depends on the width of the bars and the total area across all the bars has to be 1. In the hist function, the freq=F option (F is shorthand for FALSE) does this required re-scaling to get density-scaled histogram bars. The density curve is added to the histogram using the R code of lines(density()), producing the result in Figure 2.2 with added modifications of options for lwd (line width) and col (color) to make the plot more visually appealing. You can see how the density curve somewhat matches the histogram bars but deals with the bumps up and down and edges a little differently. We can pick out the relatively symmetric distribution using either display and will rarely make both together.
# freq=FALSE puts the bars on the density scale so the curve and bars are comparable
hist(dd$Distance, freq=FALSE, xlab="Distance (cm)", labels=TRUE, main="Histogram of Distances")
lines(density(dd$Distance), lwd=3, col="purple")
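The claim that both displays have total area 1 can be checked numerically. The sketch below uses twelve distances from the rows printed earlier (an assumed toy subset, not dd), sums height times width for the density-scaled bars, and approximates the area under the density curve with the trapezoid rule.

```r
# Toy subset of distances (cm) taken from the rows printed earlier
x <- c(132, 137, 174, 82, 106, 48, 122, 204, 116, 132, 224, 72)

# Histogram on the density scale: area of each bar = height * width
h <- hist(x, plot = FALSE)
hist_area <- sum(h$density * diff(h$breaks))

# Density curve: trapezoid rule over the grid of points that density() returns
d <- density(x)
heights <- (head(d$y, -1) + tail(d$y, -1)) / 2  # average height of adjacent grid points
curve_area <- sum(diff(d$x) * heights)

hist_area   # 1 up to rounding error
curve_area  # close to 1; the curve is only evaluated over a finite range
```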
Histograms can be sensitive to the choice of the number of bars and even the cut-offs used to define the bins for a given number of bars. Small changes in the definition of cut-offs for the bins can have noticeable impacts on the shapes observed, but this does not impact density curves. We are not going to tinker with the default choices for bars in the histogram, as they are reasonably selected in R, but we can add information on the original observations being included in each bar to better understand the choices that hist is making. In the previous display, we can add what is called a rug to the plot, where a tick mark is made on the x-axis for each observation. Because the responses appear to be rounded to the nearest cm, there is some discreteness in the responses and we need to use a graphical technique called jittering to add a little noise[18] to each observation so all the observations at each distance value do not plot as a single line. In Figure 2.3, the added tick marks on the x-axis show the approximate locations of the original observations. We can (barely) see that there are 2 observations at 2 cm (the noise added generates a wider line than for an individual observation, so it is possible to see that there is more than one observation there). A limitation of the histogram arises at the center of the distribution, where the bar that goes from 100 to 120 cm suggests that the mode (peak) is in this range (but it is unclear where), whereas the density curve suggests that the peak is closer to 120 than 100. The density curve also shows some small bumps in the tails of the distribution tied to individual observations that are not really displayed in the histogram. Density curves are, however, not perfect and this one shows a tiny bit of area for distances less than 0 cm, which is not possible here. When we make density curves below, we will cut off the curves at the most extreme values to avoid this issue.
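# ylim sets the y-axis range for the density-scaled histogram
hist(dd$Distance, freq=FALSE, xlab="Distance (cm)", labels=TRUE, main="Histogram of Distances with density curve and rug", ylim=c(0,0.017))
lines(density(dd$Distance), lwd=3, col="purple")
set.seed(900)  # fixes the random noise from jitter so the plot is reproducible
rug(jitter(dd$Distance), col="red", lwd=1)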
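Two details from this discussion can be sketched on a few values: setting the seed makes the jittered values repeatable, and the from and to arguments of density restrict the curve to the observed range. The vector x is a hypothetical toy sample that includes two tied observations at 2 cm, and using from and to is one possible way to cut off the curve, not necessarily the approach used for later figures.

```r
# Hypothetical toy distances with a tie at 2 cm
x <- c(2, 2, 48, 82, 106, 132, 137, 174)

set.seed(900)
j1 <- jitter(x)
set.seed(900)
j2 <- jitter(x)
same_noise <- identical(j1, j2)  # same seed, same jittered values
j1[1:2]                          # the two tied values are now slightly different

# Restrict the density estimate to the observed range so no area is shown below the minimum
d_cut <- density(x, from = min(x), to = max(x))
range(d_cut$x)
```

A curve truncated this way no longer has area exactly 1 over the plotted range, which is a trade-off accepted to avoid showing impossible values.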
The graphical tools we’ve just discussed are going to help us move to comparing the distribution of responses across more than one group. We will have two displays that will help us make these comparisons. The simplest is the side-by-side boxplot, where a boxplot is displayed for each group of interest using the same y-axis scaling. In R, we can use its formula notation to see if the response (Distance) differs based on the group (Condition) by using something like Y~X or, here, Distance~Condition. We also need to tell R where to find the variables: use the last option in the command, data=DATASETNAME, to inform R of the tibble to look in to find the variables. In this example, data=dd. We will use the formula and data=... options in almost every function we use from here forward. Figure 2.4 contains the side-by-side boxplots showing similar distributions for all the groups, with a slightly higher median in the “police” group and some outliers identified in all groups.
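The formula and data= pattern can be sketched on a toy data frame. The object toy is an assumption that uses two of the rows printed earlier for each of three outfits, so the resulting boxplots are far too small to interpret. With the full data, the same call with data=dd would be the natural way to draw side-by-side boxplots.

```r
# Toy subset: two printed rows for each of three outfits (not the full data set)
toy <- data.frame(
  Condition = factor(rep(c("casual", "police", "racer"), each = 2)),
  Distance  = c(132, 137, 113, 174, 104, 141)
)

# Response ~ explanatory variable, with data= telling R where to find both
boxplot(Distance ~ Condition, data = toy,
        ylab = "Distance (cm)", main = "Side-by-side boxplots (toy subset)")

# The same formula works in other functions, here to get the median for each group
group_medians <- aggregate(Distance ~ Condition, data = toy, FUN = median)
group_medians
```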
The “~” (which is read as the tilde symbol[19], which you can find in the upper left corner of your keyboard) notation will be used in two ways this semester. The formula use in R employed previously declares that the response variable here is Distance and the explanatory variable is Condition. The other use for “~” is as shorthand for “is distributed as” and is used in the context of \(Y \sim N(0,1)\), which translates (in statistics) to defining the random variable Y as following a Normal distribution[20] with mean 0 and standard deviation of 1. In the current situation, we could ask whether the Distance variable seems like it may follow a normal distribution in each group, in other words, is \(\text{Distance}\sim N(\mu,\sigma^2)\)? Since the responses are relatively symmetric, it is not clear that we have a violation of the normality assumption for the Distance variable in any of the seven groups (more later on how we can assess this and the issues that occur when we have a violation of this assumption). Remember that \(\mu\) and \(\sigma\) are parameters, where \(\mu\) (“mu”) is our standard symbol for the population mean and \(\sigma\) (“sigma”) is the symbol for the population standard deviation.
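To connect the “is distributed as” notation to R, the sketch below simulates values with rnorm. The sample size, the seed, and the values \(\mu = 117\) and \(\sigma^2 = 900\) are arbitrary choices for illustration, not estimates from the overtake data. One point that is easy to miss: the notation \(N(\mu,\sigma^2)\) lists the variance, while rnorm expects the standard deviation.

```r
set.seed(217)  # arbitrary seed so the simulated values can be reproduced

# Y ~ N(0, 1): mean 0 and standard deviation 1 (the variance is also 1 here)
y <- rnorm(10000, mean = 0, sd = 1)
c(mean = mean(y), sd = sd(y))

# N(mu, sigma^2) with hypothetical mu = 117 and sigma^2 = 900:
# rnorm() takes the standard deviation, so pass sqrt(900) = 30
z <- rnorm(10000, mean = 117, sd = sqrt(900))
c(mean = mean(z), sd = sd(z))
```

The sample means and standard deviations will be close to, but not exactly equal to, the parameters used to generate the values.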
Pruim, Randall, Daniel T. Kaplan, and Nicholas J. Horton. 2019. Mosaic: Project Mosaic Statistics and Mathematics Teaching Utilities. https://CRAN.R-project.org/package=mosaic.
Walker, Ian, Ian Garrard, and Felicity Jowitt. 2014. “The Influence of a Bicycle Commuter’s Appearance on Drivers’ Overtaking Proximities: An on-Road Test of Bicyclist Stereotypes, High-Visibility Clothing and Safety Aids in the United Kingdom.” Accident Analysis & Prevention 64: 69–77. https://doi.org/10.1016/j.aap.2013.11.007.
[12] You will more typically hear “data is” but that more often refers to information, sometimes even statistical summaries of data sets, than to observations made on subjects collected as part of a study, suggesting the confusion of this term in the general public. We will explore a data set in Chapter ?? related to perceptions of this issue collected by researchers at http://fivethirtyeight.com/.
[13] Either try to remember “data is a plural word” or replace “data” with “things” in your sentence and consider whether it sounds right.
[14] Of particular interest to the bicycle rider might be the “close” passes and we will revisit this as a categorical response with “close” and “not close” as its two categories later.
[15] Thanks to Ian Walker for allowing me to use and post these data.
[16] As noted previously, we reserve the term “effect” for situations where random assignment allows us to consider causality as the reason for the differences in the response variable among levels of the explanatory variable, but this is only the case if we find evidence against the null hypothesis of no difference in the groups.
[17] If you’ve taken calculus, you will know that the curve is being constructed so that the integral from \(-\infty\) to \(\infty\) is 1. If you don’t know calculus, think of a rectangle with area of 1 based on its height and width. These cover the same area but the top of the region wiggles.
[18] Jittering typically involves adding random variability to each observation that is uniformly distributed in a range determined based on the spacing of the observations. The idea is to jitter just enough to see all the points but not too much. This means that if you re-run the jitter function, the results will change if you do not set the random number seed using set.seed, which is discussed more below. For more details, type help(jitter) in the console in RStudio.
[19] If you want to type this character in Rmarkdown, try “\(\sim\)” outside of code chunks.
[20] Remember the bell-shaped curve you encountered in introductory statistics? If not, you can see some at https://en.wikipedia.org/wiki/Normal_distribution.