#' ---
#' title: Introduction to R
#' author: Dan McGlinn
#' date: '`r paste("First created on 2015-01-16. Updated on", Sys.Date())`'
#' ---
#'
#' Home Page - http://dmcglinn.github.io/quant_methods/
#' GitHub Repo - https://github.com/dmcglinn/quant_methods
#'
#' ## Source Code Link
#' https://raw.githubusercontent.com/dmcglinn/quant_methods/gh-pages/lessons/R_introduction.R
#'
#' ## Readings
#' * Chapters 1-7 of *The R Book* (1st ed) by Crawley
#' * Chapters 1-4 of *MASS* (4th ed) by Venables and Ripley
#'
#' ## Lesson Outline
#' * Arithmetic
#' * Logical operations
#' * Variable assignment
#' * Reading in data
#' * Using the help
#' * Examine data
#' * Subsetting the data
#' * Summary statistics
#' * Aggregate across rows or columns
#' * Plot data
#'
#+ echo=FALSE
library(knitr)
opts_knit$set(root.dir='../')
#' The purpose of this lesson is to introduce students to the R programming
#' environment for the first time. The lesson builds off the Software Carpentry
#' Lesson developed here:
#' http://swcarpentry.github.io/r-novice-inflammation/
#'
#' ## Arithmetic
3 + 4       # addition
3 * 4       # multiplication
3 / 4       # division
3^4         # exponentiation
log(3)      # log base e
log(3, 10)  # log base 10
log10(3)    # log base 10
exp(log(3)) # exp() is the inverse of log(), so this returns 3
#' ## Logical operations
3 > 4       # greater than
3 < 4       # less than
3 >= 4      # greater than or equal to
3 <= 4      # less than or equal to
3 != 4      # not equal to
3 == 4      # equal to
TRUE        # True
T           # True
TRUE == 1   # TRUE is coerced to 1 in comparisons and arithmetic
FALSE       # False
F           # False
FALSE == 0  # FALSE is coerced to 0 in comparisons and arithmetic
T + T + F   # what would this equal?
# useful functions
# any() and all()
any(c(T, F))
all(c(T, F))
#' ## Variable assignment
#' You can use `<-` or `=` to assign a value to a variable, but `<-` is recommended.
weight_kg <- 55
#' Print the value of the variable by simply calling its name:
weight_kg
#' weight in pounds:
2.2 * weight_kg
weight_kg <- 57.5
#' weight in kilograms is now
weight_kg
weight_lb <- 2.2 * weight_kg
#' weight in kg...
weight_kg
#' ...and in pounds
weight_lb
weight_kg <- 100.0
#' weight in kg now...
weight_kg
#' ...and weight in pounds still
weight_lb
#' Coming up with good object and file names can be difficult, but there
#' are two general rules that can help guide you:
#'
#' 1) be descriptive
#' 2) don't make names you must type a lot too long
#'
#' So for something like a file name, which you will probably type only once
#' when reading and once when writing, you should use a long descriptive name.
#' For objects in your R code you need to consider both typeability and
#' readability when designing the name. A long name like
#' `root_rhiz_prod_total_mm` is very clear but is a pain to read and worse to
#' type. R has a built-in name completion system, but this doesn't completely
#' remove the burden of using long object names.
#'
#' ## Reading in data
#' First check what your working directory is:
getwd()
#' I am using an RStudio Project that I called "quant_methods". Projects
#' help you to organize your R code for a specific project into a single directory.
#' To create your own project simply go to File -> New Project then click either
#' "New Directory" or "Existing Directory". By default the directory and the
#' project name will be identical. Diverging from that default is not
#' recommended because it can be very confusing.
#'
#' The working directory within a project is the main project directory so
#' for me it returns: `/home/mcglinndj/quant_methods`
#'
#' All file paths can be made relative to this directory.
#' Let's read in the data file `inflammation-01.csv`, which is located in the
#' directory `./quant_methods/data`, where the `.` indicates the directory
#' in which the directory `quant_methods` is stored. The usage of
#' `./` is a shorthand way to create relative paths.
#' Because my working directory is already set to: ``r normalizePath('.')``
#' I can shorten the path to `./data/inflammation-01.csv` where again `.`
#' indicates my current working directory path.
dat <- read.csv(file = "./data/inflammation-01.csv", header = FALSE)
#' Two other quick notes about path shorthand:
#'
#' 1. `../` is shorthand for the
#' parent directory of the working directory. These can be nested like `../../`,
#' but that is not recommended.
#'
#' 2. `~` is shorthand for the home directory on your
#' machine. On my machine `~` refers to `/home/mcglinndj`
#'
#' Rather than a relative path, I could have used an absolute path like:
#' `"/home/mcglinndj/quant_methods/data/inflammation-01.csv"` but that path
#' would only work on my specific machine. For that reason we generally prefer
#' relative paths over absolute paths, because they make your code more
#' reproducible and future proof.
#'
#' Another option is to simply put in the URL where the data is stored:
dat <- read.csv('https://raw.githubusercontent.com/dmcglinn/quant_methods/gh-pages/data/inflammation-01.csv',
                header = FALSE)
#' This is not always a great option, though, because remote data and URLs can break.
#'
#' A last note about the usage of the function `read.csv`: notice that the
#' argument `header` is set to `FALSE`. This is because the inflammation dataset
#' is a bit unusual in that it does not include column names. Most datasets will
#' include column names, so `header = TRUE` is the more common setting of this
#' argument; in fact it is the default setting for `read.csv`. So in most cases
#' (such as on the homework) you will simply use `read.csv('mycsvfile.csv')`
#' without specifying the `header` argument explicitly.
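#'
#' As a minimal sketch of what the `header` argument does, the two calls below
#' parse small in-memory CSV strings (via the `text` argument) rather than files.
#' The strings and the object names `csv_with_names`, `csv_no_names`,
#' `with_header`, and `no_header` are made up for illustration and are not part
#' of the inflammation dataset.
csv_with_names <- "day1,day2\n0,1\n2,3"  # hypothetical data, first line holds column names
csv_no_names <- "0,1\n2,3"               # hypothetical data, no column names
with_header <- read.csv(text = csv_with_names)              # header = TRUE is the default
no_header <- read.csv(text = csv_no_names, header = FALSE)  # R assigns the names V1, V2
names(with_header)
names(no_header)
#' Both objects have two rows. Reading `csv_no_names` with the default
#' `header = TRUE` would instead use up its first data row as column names.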
#'
#' ## Using the help
#' Above we used the function `read.csv`. To find out more about this function see
#+ eval = FALSE
?read.csv
#' or equivalently
#+ eval = FALSE
help(read.csv)
#' To do a fuzzy help search use
#+ eval = FALSE
help.search('read')
help.search('csv')
#' ## Examine data
#' visual summary of first 6 rows
head(dat)
#' visual summary of last 6 rows
tail(dat)
#' what kind of object is dat
class(dat)
#' what are the dimensions of dat
dim(dat)
#' You may notice that the data did not have column names and R automatically
#' assigned the columns the names V1, V2, V3, and so on. In this dataset, each
#' column represents a different day. We can assign column names using the
#' function `names`
names(dat)
names(dat) <- paste("day", 1:ncol(dat), sep='')
names(dat)
#' Above the function `paste` was used to construct text strings that combined the
#' word "day" with a given index, in this case from 1 to the total number of
#' columns in the object `dat`. By default the function `paste` inserts a space
#' between the strings that you wish to paste together; I've set the `sep`
#' argument to `''` to ensure that no space is inserted (see also `?paste0`).
#'
#' ## Subsetting the data
#' There are a variety of ways to subset data in `data.frames`. This section
#' demonstrates how to subset data using indices.
#' first value in dat
dat[1, 1]
#' middle value in dat
dat[30, 20]
#' chunk of data in dat
dat[1:4, 1:10]
#' select specific rows and columns
dat[c(3, 8, 37), c(10, 14, 29)]
#' The code above returns the 3 x 3 block of values where rows 3, 8, and 37
#' intersect columns 10, 14, and 29. Within the brackets the first entry gives
#' the row indices and the second entry gives the column indices.
#' all columns from row 5
dat[5, ]
#' all rows from column 16
dat[ , 16]
dat[1:nrow(dat), 16]
#' you can also exclude certain indices using the `-` sign
dat[1:5 , -16]      # gives every column but 16
dat[1:5 , -(1:10)]  # gives every column except the first 10
#' An alternative way to carry out subsetting is to reference specific column
#' names or to use the `subset` function.
#' Here, to avoid printing too much information to the screen, I'll just focus
#' on the first 5 rows of each subset
dat$day10[1:5]
dat[1:5 , 'day10']
dat[1:5 , c('day10', 'day15')]
#' notice that the following would give an error because `-` cannot be applied
#' to a column name
#+ error=TRUE
dat[ , -c('day3')]
#' but that the following would accomplish the intended goal of dropping day 3
dat[1:5 , -3]
#' let's try using the subset function
#' only data for day 3
subset(dat, select = day3)[1:5, ]
#' data on all days but 3
subset(dat, select = -day3)[1:5, ]
#' data only on day 3 when inflammation in day 1 is equal to 0
subset(dat, subset = day1 == 0, select = day3)[1:5, ]
#' ## Summary statistics
#' first row, all of the columns
patient_1 <- dat[1, ]
#' max inflammation for patient 1
max(patient_1)
#' max inflammation for patient 2
max(dat[2, ])
#' minimum inflammation on day 7
min(dat[ , 7])
#' mean inflammation on day 7
mean(dat[ , 7])
#' median inflammation on day 7
median(dat[ , 7])
#' standard deviation of inflammation on day 7
sd(dat[ , 7])
#' several summary statistics for day 7 at once
summary(dat[ , 7])
#' ## Aggregate across rows or columns
#' The function `apply` applies a function to every row (`MARGIN = 1`) or every
#' column (`MARGIN = 2`) of a data frame or matrix. Each row of `dat` is a
#' patient, so to obtain the average inflammation of each patient we will need to
#' calculate the mean of all of the rows (`MARGIN = 1`) of the data frame.
avg_patient_inflammation <- apply(dat, 1, mean)
#' And to obtain the average inflammation of each day we will need to calculate
#' the mean of all of the columns (`MARGIN = 2`) of the data frame.
avg_day_inflammation <- apply(dat, 2, mean)
#' standard deviation of inflammation on each day
sd_day_inflammation <- apply(dat, 2, sd)
#' standard deviation of inflammation for each patient
sd_patient_inflammation <- apply(dat, 1, sd)
#' ## Plot data
#' Use the function plot() to plot the data summaries
?plot
#' The help page provides a long list of potential arguments and examples.
#' At a minimum you must provide a single quantitative variable, for example:
plot(avg_day_inflammation)
#' Notice how R fills in lots of pieces of missing information automatically.
#' Specifically, it assumes that the independent variable is simply an index from
#' 1 to the length of the object, in this case `avg_day_inflammation`. A safer
#' and clearer way to produce the same plot is the following:
plot(1:length(avg_day_inflammation), avg_day_inflammation, xlab='day',
     ylab='inflammation')
#' This makes it clearer that the x-variable is simply an index from 1 to the
#' length of `avg_day_inflammation`, and it makes the x and y axis labels more
#' understandable.
#'
#' Note that in most cases in R you will have specific columns in your
#' `data.frame` that you wish to plot against one another. If, for example, my
#' `data.frame` was called `dat` and it had two columns, `growth` and
#' `temperature`, I could plot growth as a function of temperature using
#' `plot(dat$temperature, dat$growth)`
#' This will come up in the homework as well.
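#'
#' As a small sketch of the two-column case described above, the code below
#' builds a made-up `data.frame` and plots one column against the other. The
#' object name `growth_dat` and its values are invented for illustration; a
#' separate name is used so that the inflammation data in `dat` is not
#' overwritten.
growth_dat <- data.frame(temperature = c(10, 15, 20, 25, 30),  # hypothetical values
                         growth = c(1.2, 2.3, 3.1, 4.8, 5.9))  # hypothetical values
plot(growth_dat$temperature, growth_dat$growth,
     xlab = 'temperature', ylab = 'growth')
#' The same plot can also be requested with the formula interface, which reads
#' as "growth as a function of temperature":
plot(growth ~ temperature, data = growth_dat)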
#'
#' To output multi-panel plots use for example
par(mfrow=c(1,2))
#' which will create a single plotting row with two columns
plot(1:length(avg_day_inflammation), avg_day_inflammation, xlab='day',
     ylab='inflammation')
plot(1:length(avg_patient_inflammation), avg_patient_inflammation,
     xlab='patient identity', ylab='inflammation')
#' To output the figure to a file you can use RStudio's GUI features or you can
#' use the command line, which is what I recommend so that the code is fully
#' reproducible:
#+ fig.height = 8
# png('./inflammation_fig1.png') # to create a png file
par(mfrow = c(2,1))
plot(1:length(avg_day_inflammation), avg_day_inflammation, xlab='Day',
     ylab='Inflammation', frame.plot=F, col='magenta', pch=2, cex=2)
plot(1:length(avg_patient_inflammation), avg_patient_inflammation,
     xlab='patient identity', ylab='inflammation', col='dodgerblue')
# dev.off() # to stop writing to the png file.
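#'
#' As a sketch of the file-output workflow with the `png` and `dev.off` calls
#' actually run, the code below writes a simple plot to a temporary file. The
#' path from `tempfile` and the plotted values are assumptions made for
#' illustration; in practice you would supply your own path and data. It also
#' assumes that your R installation supports the png device.
fig_path <- tempfile(fileext = '.png')                # hypothetical output location
png(fig_path)                                         # open a png graphics device
plot(1:10, (1:10)^2, xlab = 'x', ylab = 'x squared')  # made-up values
dev.off()                                             # close the device so the file is written
file.exists(fig_path)
#' Until `dev.off()` is called the plot is sent to the file rather than to the
#' screen, so forgetting that call is a common reason a figure seems to vanish.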