--- title: "Lecture 1.2: Objects in R" subtitle: "DUC 263: Introduction to Programming and Data Management Using R" author: Ozan Jaquette date: classoption: dvipsnames # for colors fontsize: 8pt urlcolor: blue output: beamer_presentation: keep_tex: true toc: true slide_level: 3 theme: default # AnnArbor # push to header? #colortheme: "dolphin" # push to header? #fonttheme: "structurebold" highlight: default # Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", and "haddock" (specify null to prevent syntax highlighting); push to header df_print: tibble # push to header? latex_engine: xelatex # Available engines are pdflatex [default], xelatex, and lualatex; The main reasons you may want to use xelatex or lualatex are: (1) They support Unicode better; (2) It is easier to make use of system fonts. includes: in_header: ../beamer_header.tex #after_body: table-of-contents.txt --- ```{r, echo=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE) #knitr::opts_chunk$set(collapse = TRUE, highlight = TRUE) #knitr::opts_chunk$set(comment = "#>", collapse = TRUE) #comment = "#>" makes it so results from a code chunk start with "#>"; default is "##" ``` # R basics ### R as a calculator ```{r} 5 5+2 10*3 ``` ### Executing commands in R ```{r} 5 5+2 10*3 ``` Three ways to execute commands in R 1. Type/copy commands directly into the "console" 1. `code chunks' in RMarkdown (.Rmd files) - Can execute one command at a time, one chunk at a time, or "knit" the entire document 1. R scripts (.R files) - This is just a text file full of R commands - Can execute one command at a time, several commands at a time, or the entire script ### Shortcuts you should learn for executing commands ```{r} 5+2 10*3 ``` Three ways to execute commands in R 1. Type/copy commands directly into the "console" 1. `code chunks' in RMarkdown (.Rmd files) - __Cmd/Ctrl + Enter__: execute highlighted line(s) within chunk - __Cmd/Ctrl + Shift + k__: "knit" entire document 1. R scripts (.R files) - __Cmd/Ctrl + Enter__: execute highlighted line(s) - __Cmd/Ctrl + Shift + Enter__ (without highlighting any lines): run entire script ### Assignment __Assignment__ means creating a variable -- or more generally, an "object" -- and assigning values to it - `<-` is the assignment operator - in other languages `=` is the assignment operator - good practice to put a space before and after assignment operator ```{r} # Create an object and assign value a <- 5 a b <- "yay!" b ``` # Classification of objects ### Preview of lecture on objects - The remainder of lecture 1.2 is a conceptual and practical introduction to "objects" in R - This lecture is a broad overview that introduces some general programming and applies some of these concepts to data - **Important**: the goal is to begin to develop familiarity with concepts that we will introduce in more detail in later weeks - Therefore, I don't expect you to understand or retain all this information perfectly - So just focus on understanding as much as you can and please interrupt me to ask questions if you have them! ### Objects Most statistical software (e.g., SPSS, Stata) operates on datasets, which consist of rows of observations and columns of variables - Usually, these packages can open only one dataset at a time R is an "object-oriented" programming language (like Python, JavaScript). So, what is an "object"? - formal computer science definitions are confusing because they require knowledge of concepts we haven't introduced yet - More intuitively, I think objects as anything I assign values to - For example, below, "`a`" and "`b`" are objects I assigned values to ```{r} a <- 5 a b <- "yay!" b ``` - Ben Skinner (my R guru) says "Objects are like boxes in which we can put things: data, functions, and even other objects." ### Objects - Objects can be categorized by "__type__" (which we will discuss today) and by "__class__" (which we will discuss in later weeks) - e.g., a date is an object with a numeric _type_ and a date _class_ - a dataset is an object with a particular type and class - There is no limit to the number of objects R can hold (except memory) - R "functions" do different things to different types/classes of objects - e.g., date functions are meant to process objects with type=numeric and class=date; these functions don't work on objects with type=character (e.g., "yay!") ### Vectors The fundamental object in R is the "vector" - A vector is a collection of values - The individual values within a vector are called "elements" - Values in a vector can be numeric, character (e.g., "Apple"), or some other _type_ Below we use the combine function `c()` to create a numeric vector that contains three elements - Help file says that `c()` "combines values into a vector or list" ```{r} ?c x <- c(4, 7, 9) # create object called x, which is a vector with three elements # (each an integer) x # print object x ``` Vector where the elements are characters ```{r} animals <- c("lions", "tigers", "bears", "oh my") # create object called animals animals ``` ### Student task (do with the person next to you) Either in the R console or within the R markdown file, do the following: 1. Create a vector called `v1` with three elements, where all the elements are numbers. Then print the values. 1. Create a vector called `v2` with four elements, where all the elements are characters (i.e., enclosed in single '' or double "" quotes). Then print the values. 1. Create a vector called `v3` with five elements, where some elements are numeric and some elements are characters. Then print the values. ### Solution to student task ```{r} v1 <- c(1, 2, 3) # create a vector called v1 with three elements # all the elements are numbers v1 # print value ``` ```{r} v2 <- c("a", "b", "c", "d") # create a vector called v2 with four elements # all the elements are characters v2 # print value ``` ```{r} v3 <- c(1, 2, 3, "a", "b") # create a vector called v3 with five element # some elements are numeric and some elements are characters v3 # print value ``` Note: the data in a vector must be only one type or mode (numeric, character, or logical). You can’t mix modes in the same vector. ### Formal classification of vectors in R Here, I introduce the classification of vectors by Grolemund and Wickham There are two broad types of vectors 1. __Atomic vectors__. An object that contains elements. There are six types of atomic vectors: - __logical__, __integer__, __double__, __character__, __complex__, and __raw__. - __Integer__ and __double__ vectors are collectively known as __numeric__ vectors. 2. __Lists__. Like atomic vectors, lists are objects that contain elements - elements within a list may be atomic vectors - elements within a list may also be other lists; that is lists can contain other lists - This sounds vague and confusing; I'll explain and give examples below One difference between atomic vectors and lists: __homogeneous__ vs. __heterogeneous__ elements * atomic vectors are __homogeneous__: all elements within atomic vector must be of the same type * lists can be __heterogeneous__: e.g., one element can be an integer and another element can be character ### Formal classification of vectors in R Here is a link to a visual representation of the Grolemund and Wickham classification - [LINK HERE](https://github.com/ozanj/rclass/raw/master/lectures/lecture1/data-structures-overview.png) ### Developing an intuitive understanding of vector types __Grolemund and Wickham classification__: 1. __Atomic vectors__. six types: logical, integer, double, character, complex, raw. 2. __Lists__ Problem with this classification: - Not conceptually intutive - Technically, lists are a type of vector, but people often think of atomic vectors and lists as fundamentally different things __Classification used by my R Guru Ben Skinner__: - data __type__: logical, numeric (integer and double), character, etc. - data __structure__: vector, list, matrix, etc. I find Skinner's classification more intuitive conceptually. However, it isn't consistent with R functions or the way R thinks about objects. If you find this classification of data _type_ and data _structure_ helpful, totally fine to think of objects in this way while you start to learn R. # Atomic vectors ### "Length" of an atomic vector is the number of elements \medskip For remainder of lecture, I'll use the term __vector__ to refer to atomic vectors Use `length()` function to examine vector length ```{r} x <- c(4, 7, 9) x length(x) animals <- c("lions", "tigers", "bears", "oh my") animals length(animals) ``` A single number (or a single string/character) is a vector with `length==1` ```{r} z <- 5 length(z) length("Tommy") ``` ### Data type of a vector \medskip The "type" of an atomic vector refers to the elements within the vector. While there are six "types" of actomic vectors, we'll focus on the following types: - numeric: - "integer" (e.g., 5) - "double" (e.g., 5.5) - character (e.g., "ozan") - logical (e.g., `TRUE`, `FALSE`) Use `typeof()` function to examine vector type ```{r} x typeof(x) p <- c(1.5, 1.6) p typeof(p) animals typeof(animals) ``` ### Data type of a vector, numeric Numeric vectors can be "integer" (e.g., 5) or "double" (e.g., 5.5) ```{r} typeof(1.5) ``` R stores numbers as doubles by default. ```{r} x typeof(x) ``` To make an integer, place an `L` after the number: ```{r} typeof(5) typeof(5L) ``` ### Data type of a vector, character In contrast to "numeric" data types which are used to store numbers, the "character" data type is used to store __strings__ of text. - Strings may contain any combination of numbers, letters, symbols, etc. - Character vectors are sometimes referred to as string vectors When creating a vector where elements have `type==character` (or when referring to the value of a string), place single `` or double "" quotes around text - the text within quotes is the "string" ```{r} c1 <- c("cat",'cash','candy cane') c1 typeof(c1) length(c1) ``` Numeric values can also be stored as strings ```{r} c2 <- c("1","2","3") c2 typeof(c2) ``` ### Data type of a vector, logical Logical vectors can take three possible values: `TRUE`, `FALSE`, `NA` - `TRUE`, `FALSE`, `NA` are special keywords; they are different from the character strings `"TRUE"`, `"FALSE"`, `"NA"` - Don't worry about `"NA"` for now ```{r} typeof(TRUE) typeof("TRUE") typeof(c(TRUE,FALSE,NA)) typeof(c(TRUE,FALSE,NA,"FALSE")) log <- c(TRUE,TRUE,FALSE,NA,FALSE) typeof(log) length(log) ``` We'll learn more about logical vectors later ### All elements in (atomic) vector must have same data type. Atomic vectors are __homogenous__; - An atomic vector has one data type - all elements within an atomic vector must have the same data "type" If a vector contains elements of different type, the vector type will be type of the most "complex" element Atomic vector types from simplest to most complex: - logical < integer < double < character ```{r} typeof(c(TRUE,TRUE,NA)) # recall L after an integer forces type to be integer # rather than double typeof(c(TRUE,TRUE,NA,1L)) typeof(c(TRUE,TRUE,NA,1.5)) typeof(c(TRUE,TRUE,NA,1.5,"howdy!")) ``` ### Named vectors All vectors can be "named" (i.e., name individual elements within vector) Example of creating an unamed vector - the `str()` function "compactly display[s] the internal structure of an R object" [from help file]; very useful for describing objects ```{r} #?str x <- c(1,2,3,"hi!") x str(x) ``` Example of creating a named vector ```{r} y <- c(a=1,b=2,3,c="hi!") y str(y) ``` ### Sequences (Loose) definition: a sequence is a set of numbers in ascending or descending order A vector containing a "sequence" of numbers (e.g., 1, 2, 3) can be created using the colon operator `:` with the notation `start:end` ```{r} -5:5 5:-5 s<- 1:10 #same as this: s<- c(1:10) s length(s) ``` Creating sequences using `seq()` function - basic syntax [with default values]: ```{r, eval=FALSE} seq(from = 1, to = 1, by = 1) ``` ```{r} seq(10,15) seq(from=10,to=15,by=1) seq(from=100,to=150,by=10) ``` ### Vectorized math Most mathematical operations operate on each element of the vector - e.g., add a single value to a vector and that value will be added to each element of the vector ```{r} 1:3 1:3+.5 (1:3)*2 ``` Mathematical operations involving two vectors with the same length behave differently - e.g., for addition: add element 1 of vector 1 to element 1 of vector 2, add element 2 of vector 1 to element 2 of vector 2, etc. ```{r} c(1,1,1)+c(1,0,2) c(1,1,1)*c(1,0,2) ``` # Lists ### Lists What is a __list__? - Like (atomic) vectors, a list is an object that contains __elements__ - Unlike vectors, data types can differ across elements within a list - An element within a list can be another list - this characteristic makes lists more complicated than vectors - suitable for representing hierarchical data Lists are more complicated than vectors; today we'll just provide a basic introduction ### Create lists using `list()` function Create a vector (for comparison purposes) ```{r} a <- c(1,2,3) typeof(a) length(a) ``` Create a list ```{r} b <- list(1,2,3) typeof(b) length(b) b # print list is awkward ``` ### Investigate structure of lists using `str()` function When investigating lists, `str()` is better than printing the list ```{r} b <- list(1,2,3) typeof(b) length(b) str(b) # 3 elements, each element is a numeric vector w/ length=1 ``` Each element of a list can be a vector of different length (i.e., different number of elements) ```{r} c <- list(c(3,4),c(-5,1,3)) typeof(c) length(c) str(c) # 2 elements; element 1=vector w/ length=2; element 2=vector w/length=3 ``` ### Elements within lists can have different data types Lists are __heterogeneous__ - data types can differ across elements within a list ```{r} b <- list(1,2,"apple") typeof(b) length(b) str(b) ``` Vectors are __homogeneous__ ```{r} a <- c(1,2,"apple") typeof(a) str(a) ``` ### Lists can contain other lists ```{r} x1 <- list(c(1,2), list("apple", "orange"), list(1, 2, 3)) str(x1) ``` - first element of list is a numeric vector with length=2 - second element is a list with length=2 - first element is character vector with length=1 - second element is character vector with length=1 - third element is with length=3 - first element is numeric vector with length=1 - second element is numeric vector with length=1 - third element is numeric vector with length=1 ### You can name each element in the list ```{r} x2 <- list(a=c(1,2), b=list("apple", "orange"), c=list(1, 2, 3)) str(x2) ``` `names()` function shows names of elements in the list ```{r} names(x2) # has names names(x1) # no names ``` ### Access individual elements in a "named" list Syntax: `list_name$element_name` ```{r} x2 <- list(a=1, b=list("apple", "orange"), c=list(1, 2, 3)) x2$a typeof(x2$a) length(x2$a) typeof(x2$b) length(x2$b) typeof(x2$c) length(x2$c) ``` Note: We'll spend more time practicing "accessing elements of a list" in upcoming weeks ### Compare structure of list to structure of element within a list ```{r} str(x2) str(x2$c) ``` ### A DATASET IS JUST A LIST!!!!! \medskip A data frame is a list with the following characteristics: - Data type can differ across elements (like all lists) - Each __element__ in data frame must be a __vector__, not a __list__ - Each element (column) is a variable - Each element in a data frame must have the same length - The length of an element is the number of observations (rows) - so each variable in data frame must have same number of observations - Each element is named - these element names are the variable names ```{r, include=FALSE} library(tidyverse) df <- select(mtcars,mpg,cyl,hp) # ignore for now ``` ```{r} names(df) head(df, n=5) # print first 5 rows ``` Additionally, data frames have "attributes"; we'll discuss those in upcoming weeks ### A data frame is a named list ```{r} df typeof(df) names(df) length(df) # length=number of variables str(df) ``` Like any named list, we can examine the elements - Individual elements of a data frame are the variables - these variables are vectors with length equal to the number of rows/observations ```{r} typeof(df$mpg) length(df$mpg) # length=number of rows/obs str(df$mpg) ``` ### Main takeaways about atomic vectors and lists Basic data stuctures 1. __(Atomic) vectors__: __logical__, __integer__, __double__, __character__. - each element in vector must have same data type 2. __Lists__: - Data type can differ across elements Takeaways - These concepts are difficult; ok to feel confused - I will reinforce these concepts throughout the course - Good practice: run simple diagnostics on any new object - `length()` : how many __elements__ in the object - `typeof()` : what __type__ of data is the object - `str()` : hierarchical structure of the object ### Main takeaways about atomic vectors and lists Basic data stuctures 1. __(Atomic) vectors__: __logical__, __integer__, __double__, __character__. - each element in vector must have same data type 2. __Lists__: - Data type can differ across elements Takeaways - These data structures (vectors, lists) and data types (e.g., character, numeric, logical) are the basic building blocks of all object oriented programming languages - Application to statistical analysis - Datasets are just lists - The individual elements -- columns/variables -- within a dataset are just vectors - These structures and data types are foundational for all "data science" applications - e.g., mapping, webscraping, network analysis, etc.