You will need to download the statistical software package called R and an enhanced interface to R called RStudio (RStudio Team 2018). They are open source and free to download and use (and will always be that way). This means that the skills you learn now can follow you the rest of your life. R is becoming the primary language of statistics and is being adopted across academia, government, and businesses to help manage and learn from the growing volume of data being obtained. Hopefully you will get a sense of some of the power of R in this book.
The next pages will walk you through the process of getting the software downloaded and provide you with an initial experience using RStudio to do things that should look familiar even though the interface will be a new experience. Do not expect to master R quickly – it takes years (sorry!) even if you know the statistical methods being used. We will try to keep all your interactions with R code in a similar code format and that should help you in learning how to use R as we move through various methods. We will also often provide you with example code. Everyone that learns R starts with copying other people’s code and then making changes for specific applications – so expect to go back to examples from the text and focus on learning how to modify that code to work for your particular data set. Only really experienced R users “know” functions without having to check other resources. After we complete this basic introduction, Chapter 2 begins doing more sophisticated things with R, allowing us to compare quantitative responses from two groups, make some graphical displays, do hypothesis testing and create confidence intervals in a couple of different ways.
You will have two[3] downloading activities to complete before you can do anything more than read this book[4]. First, you need to download R. It is the engine that will do all the computing for us, but you will only interact with it once. Go to http://cran.rstudio.com and click on the “Download R for…” button that corresponds to your operating system. On the next page, click on “base” and then it will take you to a screen to download the most current version of R that is compiled for your operating system, something like “Download R 3.6.0 for Windows”. Click on that link and then open the file you downloaded. You will need to select your preferred language (choose English so your instructor can help you), then hit “Next” until it starts to unpack and install the program (all the base settings will be fine). After you hit “Finish” you will not do anything further with R directly.
Second, you need to download RStudio. It is an enhanced interface that will make interacting with R less frustrating and allow you to directly create reports that include the code and output. To download RStudio, go near the bottom of https://www.rstudio.com/products/rstudio/download/ and select the correct version under “Installers for Supported Platforms” for your operating system. Download and then install RStudio using the installer. From this point forward, you should only open RStudio; it provides your interface with R. Note that both R and RStudio are updated frequently (up to four times a year) and if you downloaded either more than a few months previously, you should download the up-to-date versions, especially if something you are trying to do is not working. Sometimes code will not work in older versions of R and sometimes old code won’t work in new versions of R.[5]
To get started, we can complete some basic tasks in R using the RStudio interface. When you open RStudio, you will see a screen like Figure 1.2. The added annotation in this and the following screen-grabs is there to help you get initially oriented to the software interface. R is command-line software – meaning that most of the time you have to create code and then enter and execute it at a command prompt to get any results. RStudio makes the management and execution of that code more efficient than the basic version of R. In RStudio, the lower left panel is called the “console” window and is where you can type R code directly into R or where you will see the code you run and (most importantly!) where the results of your executed commands will show up. The most basic interaction with R is available once you get the cursor active at the command prompt “>” by clicking in that panel (look for a blinking vertical line). The upper left panel is for writing, saving, and running your R code. Once you have code available in this window, the “Run” button will execute the code for the line that your cursor is on or for any text that you have highlighted with your mouse. The “data management” or environment panel is in the upper right, providing information on what data sets have been loaded. It also contains the “Import Dataset” button that provides the easiest way for you to read a data set into R so you can analyze it. The lower right panel contains information on the “Packages” (additional code we will download and install to add functionality to R) that are available and is where you will see plots that you make and requests for “Help” on specific functions.
As a first interaction with R we can use it as a calculator. To do this, click near the command prompt (>) in the lower left “console” panel, type 3+4, and then hit enter. It should look like this:
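A minimal sketch of that first exchange, written without the prompt so it can be pasted directly. The comment shows what R should print; the `[1]` is R's label for the first element of the result.

```r
# Typing this at the console prompt adds the two numbers
3 + 4
# R prints: [1] 7
```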
You can do more interesting calculations, like finding the mean of the numbers -3, 5, 7, and 8 by adding them up and dividing by 4:
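A sketch of that calculation as it could be typed at the prompt, with the sum grouped in parentheses:

```r
# Parentheses force the sum to be computed before the division
(-3 + 5 + 7 + 8) / 4
# R prints: [1] 4.25
```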
Note that the parentheses help R to figure out your desired order of operations. If you drop that grouping, you get a very different (and wrong!) result:
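For comparison, a sketch of the same line with the parentheses dropped. R follows the usual order of operations, so only the 8 is divided by 4:

```r
# Without parentheses, division happens before addition: -3 + 5 + 7 + (8 / 4)
-3 + 5 + 7 + 8 / 4
# R prints: [1] 11
```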
We could estimate the standard deviation similarly using the formula you might remember from introductory
statistics, but that will only work in very limited situations. To use the real
power of R this semester, we need to work with data sets that store the
observations for our subjects in variables.
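For reference, the formula in question is presumably the sample standard deviation: the square root of the sum of squared deviations from the mean, divided by n - 1. A sketch of applying it by hand to the same four numbers, whose mean is 17/4 = 4.25, shows how quickly this approach becomes tedious:

```r
# Sample standard deviation "by hand": squared deviations from the mean (4.25),
# summed, divided by n - 1 = 3, and then square-rooted
sqrt(((-3 - 4.25)^2 + (5 - 4.25)^2 + (7 - 4.25)^2 + (8 - 4.25)^2) / (4 - 1))
# R prints a value of about 4.99
```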
Basically, we need to store observations in named vectors (one dimensional arrays) that contain a list of the observations. To create a vector containing the four numbers and assign it to a variable named variable1, we first create the vector using the function c, which means “combine the items” that follow, provided they are inside parentheses and separated by commas, as follows:
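A sketch of that call, using the four numbers from the calculator example:

```r
# c() combines the comma-separated values into a single vector
c(-3, 5, 7, 8)
# R prints: [1] -3  5  7  8
```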
To get this vector stored in a variable called variable1 we need to use the assignment operator, <- (read as “is defined to contain”), which assigns the information on the right into the variable that you are creating on the left.
In R, the assignment operator, <-, is created by typing a “less than” symbol (<) followed by a “minus” sign (-) without a space between them. If you ever want to see what numbers are residing in an object in R, just type its name and hit enter. You can see how that variable contains the same information that was initially generated by c(-3, 5, 7, 8) but is easier to access since we just need the text for the variable name representing that vector.
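A short sketch of the assignment and of printing the stored vector by typing its name:

```r
# Store the vector in variable1 using the assignment operator
variable1 <- c(-3, 5, 7, 8)

# Typing the name alone prints what the object contains
variable1
# R prints: [1] -3  5  7  8
```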
With the data stored in a variable, we can use functions such as mean and sd to find the mean and standard deviation of the observations contained in variable1:
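A sketch of both calls. The vector is recreated on the first line only so that the block runs on its own.

```r
# Same vector as before, repeated so this block is self-contained
variable1 <- c(-3, 5, 7, 8)

mean(variable1)  # R prints: [1] 4.25
sd(variable1)    # R prints a value of about 4.99
```

The mean matches the calculator result from earlier, which is a quick check that the vector was entered correctly.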
When dealing with real data, we will often have information about more than one variable. We could enter all observations by hand for each variable but this is prone to error and onerous for all but the smallest data sets. If you are to ever utilize the power of statistics in the evolving data-centered world, data management has to be accomplished in a more sophisticated way. While you can manage data sets quite effectively in R, it is often easiest to start with your data set in something like Microsoft Excel or OpenOffice’s Calc. You want to make sure that observations are in the rows and the names of variables are in the columns and that there is no “extra stuff” in the spreadsheet. If you have missing observations, they should be represented with blank cells. The file should be saved as a “.csv” file (stands for comma-separated values although Excel calls it “CSV (Comma Delimited)”), which basically strips off some of the junk that Excel adds to the necessary information in the file. Excel will tell you that this is a bad idea, but it actually creates a more stable archival format and one that R can use directly.[6]
The following code to read in the data set relies on an R package called readr (Wickham, Hester, and Francois 2018). Packages in R provide additional functions and data sets that are not available in the initial download of R or RStudio. To get access to a package, first “install” (basically download) and then “load” it. To install an R package, go to the Packages tab in the lower right panel of RStudio. Click on the Install button and then type in the name of the package in the box (here type in readr). RStudio will try to auto-complete the package name you are typing, which should help you make sure you got it typed correctly. This will be the first of many times that we will mention that R is case sensitive – in other words, Readr is different from readr in R syntax and this sort of thing applies to everything you do in R. You should only need to install each R package once on a given computer. If you ever see a message that R can’t find a package, make sure it appears in the list in the Packages tab. If it doesn’t, repeat the previous steps to install it.
Important: R is case sensitive! Readr is not the same as readr!
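The Packages tab is the route described here, but the same check and installation can also be done from the console. A minimal sketch, assuming an internet connection for the installation step; the variable name `installed` is made up for this illustration, and the installation line is left commented out so that nothing is downloaded by accident:

```r
# Names of all packages currently installed on this computer
installed <- rownames(installed.packages())

# Case matters: these are two different names as far as R is concerned
"readr" %in% installed   # TRUE once readr has been installed
"Readr" %in% installed   # expected to be FALSE, since the package name is all lowercase

# Console equivalent of the Install button (run once per computer; needs internet)
# install.packages("readr")
```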
After installing the package, we need to load it to make it active in a given work session. Go to the command prompt and type (or copy and paste) require(readr) or library(readr):
> require(readr)
With a data set converted to a CSV file and readr installed and loaded, we need to read the data set into the active workspace. There are two ways to do this, either using the point-and-click GUI in RStudio (click the “Import Dataset” button in the upper right “Environment” panel as indicated in Figure 1.2) or modifying a call to the read_csv function so that it points to the file of interest. To practice this, you can download an Excel (.xls) file from http://www.math.montana.edu/courses/s217/documents/treadmill.xls that contains observations on 31 males who volunteered for a study on methods for measuring fitness (Westfall and Young 1993). In the spreadsheet, you will find a data set that starts and ends with the following information (only results for Subjects 1, 2, 30, and 31 shown here):
Subject | TreadMillOx | TreadMillMaxPulse | RunTime | RunPulse | RestPulse | BodyWeight | Age |
---|---|---|---|---|---|---|---|
1 | 60.05 | 186 | 8.63 | 170 | 48 | 81.87 | 38 |
2 | 59.57 | 172 | 8.17 | 166 | 40 | 68.15 | 42 |
… | … | … | … | … | … | … | … |
30 | 39.2 | 172 | 12.88 | 168 | 44 | 91.63 | 54 |
31 | 37.39 | 192 | 14.03 | 186 | 56 | 87.66 | 45 |
The variables contain information on the subject number (Subject), subjects’ maximum treadmill oxygen consumption (TreadMillOx, in ml per kg per minute, also called maximum VO2) and maximum pulse rate (TreadMillMaxPulse, in beats per minute), time to run 1.5 miles (RunTime, in minutes), maximum pulse during 1.5 mile run (RunPulse, in beats per minute), resting pulse rate (RestPulse, beats per minute), body weight (BodyWeight, in kg), and age (Age, in years). Open the file in Excel or equivalent software and then save it as a .csv file in a location you can find on your computer. Then go to RStudio and click on File, then Import Dataset, then From Text (readr)…[7] Find your file and click “Import”. R will store the data set as an object with the same name as the .csv file. You could use another name as well, but it is often easiest just to keep the data set name in R related to the original file name. You should see some text appear in the console (lower left panel) like in Figure 1.3. The text that is created will look something like the following – if you had stored the file in a drive labeled D:, it would be:
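A sketch of what those lines look like, assuming the file was saved as treadmill.csv at the top level of drive D: (the path on your computer will differ). The `file.exists` check and the name `csv_path` are additions for this illustration, not something RStudio generates; they let the block run harmlessly on a computer where that file is not present.

```r
library(readr)

# Assumed location for illustration; replace with wherever you saved the file
csv_path <- "D:/treadmill.csv"

if (file.exists(csv_path)) {
  treadmill <- read_csv(csv_path)  # read the file into an object called treadmill
  View(treadmill)                  # open a spreadsheet-like view in RStudio
}
```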
What is put inside the " " will depend on the location and name of your saved .csv file. A version of the data set in what looks like a spreadsheet will appear in the upper left window due to the second line of code (View(treadmill)).
Typing (or pasting) a line of code like this directly is actually the other way that we can read in files. If you choose to use the text-only interface, then you need to tell R where to look in your computer to find the data file. read_csv is a function that takes a path as an argument. To use it, specify the path to your data file, put quotes around it, and put it as the input to read_csv(...). For some examples later in the book, you will be able to copy a command like this from the text and read data sets and other code directly from the course folder, assuming you are connected to the internet.
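A self-contained sketch of calling read_csv with a quoted path, assuming readr is installed. So that it runs without any download, it first writes the four rows shown in the table above to a temporary file; the names `tmp_path` and `treadmill_small` are made up for this illustration.

```r
library(readr)

# Write a tiny CSV (Subjects 1, 2, 30, and 31 from the table above) to a temporary file
tmp_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Subject,TreadMillOx,TreadMillMaxPulse,RunTime,RunPulse,RestPulse,BodyWeight,Age",
  "1,60.05,186,8.63,170,48,81.87,38",
  "2,59.57,172,8.17,166,40,68.15,42",
  "30,39.2,172,12.88,168,44,91.63,54",
  "31,37.39,192,14.03,186,56,87.66,45"
), tmp_path)

# The path to the file is the first argument to read_csv
treadmill_small <- read_csv(tmp_path)

# Check the result: it should have four rows and eight columns
dim(treadmill_small)
head(treadmill_small)
```

With your own data, the only change is to replace `tmp_path` with the quoted path to your saved .csv file.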
To verify that you read the data set in correctly, it is always good to check its contents. We can view the first and last rows in the data set using the head and tail functions on the data set, which show the following results for the treadmill data. Note that you will sometimes need to resize the console window in RStudio to get all the columns to display in a single row, which can be done by dragging the gray bars that separate the panels.
> head(treadmill)
# A tibble: 6 x 8
Subject TreadMillOx TreadMillMaxPulse RunTime RunPulse RestPulse BodyWeight Age
<int> <dbl> <int> <dbl> <int> <int> <dbl> <int>
1 1 60.05 186 8.63 170 48 81.87 38
2 2 59.57 172 8.17 166 40 68.15 42
3 3 54.62 155 8.92 146 48 70.87 50
4 4 54.30 168 8.65 156 45 85.84 44
5 5 51.85 170 10.33 166 50 83.12 54
6 6 50.55 155 9.93 148 49 59.08 57
> tail(treadmill)
# A tibble: 6 x 8
Subject TreadMillOx TreadMillMaxPulse RunTime RunPulse RestPulse BodyWeight Age
<int> <dbl> <int> <dbl> <int> <int> <dbl> <int>
1 26 44.61 182 11.37 178 62 89.47 44
2 27 40.84 172 10.95 168 57 69.63 51
3 28 39.44 176 13.08 174 63 81.42 44
4 29 39.41 176 12.63 174 58 73.37 57
5 30 39.20 172 12.88 168 44 91.63 54
6 31 37.39 192 14.03 186 56 87.66 45
When you require a package, you may see a warning message about versions of the package and versions of R – this is usually something you can ignore. Other warning messages could be more ominous for proceeding but before getting too concerned, there are a couple of basic things to check. First, double check that the package is installed (see previous steps). Second, check for typographical errors in your code – especially for mis-spellings or unintended capitalization. If you are still having issues, try repeating the installation process. If that fails, find someone more used to using R to help you (for example in the Math Learning Center or by emailing your instructor).[8]
To help you go from basic to intermediate R usage and especially to help with more complicated problems, you will want to learn how to manage and save your R code. The best way to do this is using the upper left panel in RStudio using what are called R Scripts, which are files that have a file extension of “.R”. To start a new “.R” file to store your code, click on File, then New File, then R Script. This will create a blank page to enter and edit code – then save the file as something like “MyFileName.R” in your preferred location. Saving your code will mean that you can return to where you were working last by simply re-running the saved script file. With code in the script window, you can place the cursor on a line of code or highlight a chunk of code and hit the “Run” button[9] on the upper part of the panel. It will appear in the console with results just like what you would obtain if you typed it after the command prompt and hit enter for each line. Figure 1.4 shows the screen with the code used in this section in the upper left panel, saved in a file called “CH0.R”, with the results of highlighting and executing the first section of code using the “Run” button.
RStudio Team. 2018. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, Inc. http://www.rstudio.com/.
Westfall, Peter H., and S. Stanley Young. 1993. Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. New York: Wiley.
Wickham, Hadley, Jim Hester, and Romain Francois. 2018. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.
[3] There is a cloud version of RStudio available at https://rstudio.cloud/ if you want to avoid these steps. We still recommend following the steps to be able to work locally but try this option if you have any issues with the installation process.
[4] I recorded a video that walks through getting R and RStudio installed on a PC. If you want to see them installed on a Mac, you can try this video on YouTube. Or for either version, try searching YouTube for “How to install R and RStudio”.
[5] The need to keep the code up-to-date as R continues to evolve is one reason that this book is locally published and that this is the 6th time it has been revised in six years…
[6] There are ways to read “.xls” and “.xlsx” files directly into R that we will explore later.
[7] If you are having trouble getting the file converted and read into R, copy and run the following code:
treadmill <- read_csv("http://www.math.montana.edu/courses/s217/documents/treadmill.csv")
[8] Most computer lab computers at Montana State University have RStudio installed and so provide another venue to work.
[9] You can also use Ctrl+Enter if you like hot keys.