---
title: R's Atomic Data Types and Vectorization
layout: default_with_disqus
author: Eric C. Anderson
output: bookdown::html_chapter
---

# Atomic Data Types and Coercion {#r-data-types-and-coercion}

## Basic Data "Modes" of R {#basic-modes}

There are four main "modes" of scalar data, in order from least to most general:

1. `logical` can take two values: `TRUE` and `FALSE`, which can be abbreviated as `T` and `F` when you type them.
1. The `numeric` mode comes in two flavors: "integer" and "double" (real numbers). Examples: `1`, `3.14`, `8.2`, `10`, etc.
1. `complex`: these are complex numbers of the form $a + bi$ where $a$ and $b$ are real numbers and $i=\sqrt{-1}$. Examples: `3.2+7.3i`, `4+0i`
1. `character`: these take values that are often called "strings" in other languages. Examples: `"fred"`, `"foo"`, `"bar"`, `"boing"`.

There is also a `raw` mode which refers to raw bytes of data, but we won't concern ourselves with that for now.

### Atomic Vectors

A fundamental data structure in R is the atomic vector: a vector in which every element is of the same mode. For example:

```{r}
x <- c(1,2,3,5,7)
x
```

Pretty basic stuff, until you start mixing modes, accidentally or intentionally.

```{r}
x <- c(1,2,3,5,7,"11")
x
```

Every element is _coerced_ to the most general mode present in the vector, and this can really bite you in the rear if you don't watch out!
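
One way to catch this kind of surprise is to ask R for the mode of a vector before doing arithmetic on it. The following is a minimal sketch; the names `x_num` and `x_mixed` are made up for illustration, and `as.numeric()` is one of the explicit coercion functions.

```r
# a purely numeric vector, and one that accidentally contains a string
x_num   <- c(1, 2, 3, 5, 7)
x_mixed <- c(1, 2, 3, 5, 7, "11")

mode(x_num)     # "numeric"
mode(x_mixed)   # "character": every element was coerced

# arithmetic works on the numeric vector
sum(x_num)      # 18

# the coerced vector holds strings, so convert back before doing math
sum(as.numeric(x_mixed))   # 29
```

Calling `sum(x_mixed)` directly would stop with an error, because `sum()` does not accept character input.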
## Coercion {#coercion}

* All the data in an atomic vector _must be of the same mode_
* If data are added so that modes are mixed, then _the whole vector gets changed so that everything is of the most general mode_
* Example:

```{r}
# simple atomic vector of mode numeric
x <- 1:6
x
# now change one to mode character and see what happens
x[1] <- "tweezer"
x
```

### Coercion Up One Step

* logical to numeric:
    * `TRUE` ==> `1`
    * `FALSE` ==> `0`
* numeric to complex:
    * `6.4` ==> `6.4+0i`
    * `5` ==> `5+0i`
* complex to character:
    * `6.4+0i` ==> `"6.4+0i"`

### Coercion Up Two Or More Steps

Note that the coercion sometimes "jumps over the intermediate steps":

* logical to complex
    * `TRUE` ==> `1+0i`
    * `FALSE` ==> `0+0i`
* logical to character (it _does not_ go `FALSE` ==> `0` ==> `"0"`)
    * `TRUE` ==> `"TRUE"`
    * `FALSE` ==> `"FALSE"`
* numeric to character
    * `7` ==> `"7"`
    * `3.1415` ==> `"3.1415"`

### Coercion Down One Step

Sometimes things get coerced "downwards" (i.e., toward less general data types). If the coercion doesn't make sense you end up with `NA`, which is how R denotes missing data.

* numeric to logical (`0` ==> `FALSE`, anything else ==> `TRUE`); _always "makes sense"_
    * `0` ==> `FALSE`
    * `1` ==> `TRUE`
    * `78.2` ==> `TRUE`
    * `0.0001` ==> `TRUE`
    * `-563.3` ==> `TRUE`
* complex to numeric (discards the imaginary part, with a warning when that part is nonzero)
    * `3.4+0i` ==> `3.4`
    * `5.6+7.6i` ==> `5.6` (+ a warning)

```{r}
# witness a warning:
as.numeric(7.4+5i)
```

* character to complex
    * `"3.4+4i"` ==> `3.4+4i`
    * `"a"` ==> `NA` (you can't reasonably coerce `"a"` to any number)

### Coercion Down More Than One Step

Important point: it doesn't _necessarily_ go through intermediate steps:

* complex to logical (`0+0i` ==> `FALSE`, anything else ==> `TRUE`)
    * `0+0i` ==> `FALSE`
    * `0+2i` ==> `TRUE`
    * `5+0i` ==> `TRUE`
    * `5+9i` ==> `TRUE`
* character to logical
    * `"TRUE"` ==> `TRUE`
    * `"FALSE"` ==> `FALSE`
    * `"1"` ==> `NA` (_yikes! if it went through numeric you'd get something different!_)
    * `"0"` ==> `NA`
* character to numeric
    * `"56.764"` ==> `56.764`
    * `"4+8i"` ==> `NA` (with a warning; it does _not_ go through complex and drop the imaginary part)
    * `"fred"` ==> `NA`

### Functions For Explicit Coercion

There is a whole family of functions for coercing objects between different modes (or different types). They take the form `as.something`:

* `as.logical(x)`
* `as.numeric(x)`
* `as.integer(x)` (integer is not a mode of its own; it is a storage type within the numeric mode)
* `as.complex(x)`
* `as.character(x)`

As expected, these are vectorized: they coerce every element of the vector to the desired mode.

## Missing Data and Special Values in R {#missing-data}

We saw `NA` up above. That means "Not Available" and it denotes missing data. There are also two more interesting values:

1. `Inf` (`-Inf`) means $\infty$ (or $-\infty$) and arises from things like `1/0` (which gives `Inf`) or `log(0)` (which gives `-Inf`).
1. `NaN` means "Not a Number" and it arises from situations where you can't evaluate something and it doesn't have an obvious limit, like `0/0` or `Inf/-Inf` or `0*Inf`.

* If you wish to test whether something is `NA` or `NaN` you have `is.na(x)` and `is.nan(x)`, which return logical vectors.
* The same goes for testing if things are finite or infinite:

```{r}
x <- c(NA, 2, Inf, 4, NaN, 6)
is.nan(x)       # only the NaN
is.na(x)        # both NA and NaN
is.infinite(x)  # only Inf or -Inf
```

### Modes of Missing Data

Here is something to be aware of: missing values, like non-missing values, carry around their mode. Try this:

```{r}
x <- c(1, 2, NA, 4, "5")
x
x[3]  # this extracts the third element of x
c(10,20,30,x[3])
c(10, 20, 30, NA)  # this is a "fresh" NA, no coercion
```

## Vectorization {#vectorization}

* In R, the term _vectorization_ refers to the fact that, in many cases, when you apply a function to a vector, it applies the function to every element of the vector.
* This is apparent in many of the _operators_ and we will see it in plenty of other functions, too.

### Most Operators are Vectorized

This is _incredibly important_!
All the mathematical operators, like `+`, `-`, `*`, the logical operators, like `&` (AND) and `|` (OR), and the comparison operators, like `<` and `>`, are hungry to operate _element-wise_ on every _element_ of a vector. Example:

```{r}
fish.lengths <- c(121, 95, 87, 142)
fish.weights <- c(1011, 505, 702, 900)
fish.fatness <- fish.weights / fish.lengths
fish.fatness
```

### Vectorization is so important...

...that we are going to open up a whole new lecture that starts with it.
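
Before moving on, here is a small sketch of the comparison and logical operators working element-wise on the same fish data. The thresholds (100 and 700) and the names `is.long` and `is.heavy` are made up for illustration.

```r
fish.lengths <- c(121, 95, 87, 142)
fish.weights <- c(1011, 505, 702, 900)

# comparison operators return one logical value per element
is.long  <- fish.lengths > 100   # TRUE FALSE FALSE TRUE
is.heavy <- fish.weights > 700   # TRUE FALSE TRUE TRUE

# logical operators combine those vectors element by element
is.long & is.heavy   # TRUE FALSE FALSE TRUE
is.long | is.heavy   # TRUE FALSE TRUE TRUE
```

Each result has the same length as the inputs, with one answer per fish.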