---
title: R's Atomic Data Types and Vectorization
layout: default_with_disqus
author: Eric C. Anderson
output: bookdown::html_chapter
---

# Atomic Data Types and Coercion {#r-data-types-and-coercion}

## Basic Data "Modes" of R {#basic-modes}

There are four main "modes" of scalar data, in order from least to most general:

1. `logical` can take two values: `TRUE` and `FALSE`, which can be abbreviated, when you type them, as `T` and `F`.
1. The `numeric` mode comes in two flavors: "integer" and "numeric" (real numbers, stored as doubles). Examples: `1`, `3.14`, `8.2`, `10`, etc.
1. `complex`: these are complex numbers of the form $a + bi$ where $a$ and $b$ are real numbers and $i=\sqrt{-1}$. Examples: `3.2+7.3i`, `4+0i`
1. `character`: these take values that are often called "strings" in other languages. Examples: `"fred"`, `"foo"`, `"bar"`, `"boing"`.

There is also a `raw` mode, which refers to raw bytes of data, but we won't concern ourselves with that for now.

### Atomic Vectors

A fundamental data structure in R: a vector in which every element is of the same mode. Like

```{r}
x <- c(1,2,3,5,7)
x
```

Pretty basic stuff, until you start accidentally, or intentionally, mixing modes.

```{r}
x <- c(1,2,3,5,7,"11")
x
```

The mode of everything is _coerced_ to the mode of the element with the most general mode, and this can really bite you in the rear if you don't watch out!
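One way to catch an accidental coercion like the one above is to ask R for the mode of a vector with `mode()`. Here is a minimal sketch; the variable names `nums` and `mixed` are made up for illustration and do not appear elsewhere in these notes.

```{r}
# hypothetical example vectors
nums  <- c(1, 2, 3, 5, 7)
mixed <- c(1, 2, 3, 5, 7, "11")

mode(nums)    # "numeric"
mode(mixed)   # "character": the single string dragged everything along

# the most general mode present always wins:
# logical < numeric < complex < character
mode(c(TRUE, 2))              # "numeric"
mode(c(TRUE, 2, 3+0i))        # "complex"
mode(c(TRUE, 2, 3+0i, "x"))   # "character"
```

Checking `mode()` right after building or reading in a vector is a cheap way to notice that numbers have silently turned into strings.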
## Coercion {#coercion}

* All the data in an atomic vector _must be of the same mode_
* If data are added so that modes are mixed, then _the whole vector gets changed so that everything is of the most general mode_
* Example:

```{r}
# simple atomic vector of mode numeric
x <- 1:6
x
# now change one to mode character and see what happens
x[1] <- "tweezer"
x
```

### Coercion Up One Step

* logical to numeric:
    * `TRUE` ==> `1`
    * `FALSE` ==> `0`
* numeric to complex:
    * `6.4` ==> `6.4+0i`
    * `5` ==> `5+0i`
* complex to character:
    * `6.4+0i` ==> `"6.4+0i"`

### Coercion Up Two Or More Steps

Note that the coercion sometimes "jumps over the intermediate steps":

* logical to complex
    * `TRUE` ==> `1+0i`
    * `FALSE` ==> `0+0i`
* logical to character (it _does not_ go `FALSE` ==> `0` ==> `"0"`)
    * `TRUE` ==> `"TRUE"`
    * `FALSE` ==> `"FALSE"`
* numeric to character
    * `7` ==> `"7"`
    * `3.1415` ==> `"3.1415"`

### Coercion Down One Step

Sometimes things get coerced "downwards" (i.e., toward less general data types). If the coercion doesn't make sense, you end up with `NA`, which is how R denotes missing data.

* numeric to logical (`0` ==> `FALSE`, anything else ==> `TRUE`); _Always "makes sense"_
    * `0` ==> `FALSE`
    * `1` ==> `TRUE`
    * `78.2` ==> `TRUE`
    * `0.0001` ==> `TRUE`
    * `-563.3` ==> `TRUE`
* complex to numeric (discards the imaginary part and, if it was nonzero, warns about it!)
    * `3.4+0i` ==> `3.4`
    * `5.6+7.6i` ==> `5.6` (+ a warning)

```{r}
# witness a warning:
as.numeric(7.4+5i)
```

* character to complex
    * `"3.4+4i"` ==> `3.4+4i`
    * `"a"` ==> `NA` (you can't coerce `"a"` to any number, reasonably)

### Coercion Down More Than One Step

Important point: it doesn't _necessarily_ go through intermediate steps:

* complex to logical (`0` ==> `FALSE`, anything else ==> `TRUE`)
    * `0+0i` ==> `FALSE`
    * `0+2i` ==> `TRUE`
    * `5+0i` ==> `TRUE`
    * `5+9i` ==> `TRUE`
* character to logical
    * `"TRUE"` ==> `TRUE`
    * `"FALSE"` ==> `FALSE`
    * `"1"` ==> `NA` (_yikes!
      if it went through numeric you'd get something different!_)
    * `"0"` ==> `NA`
* character to numeric
    * `"56.764"` ==> `56.764`
    * `"4+8i"` ==> `NA` (with a warning; the string is not first read as a complex number, so nothing is salvaged)
    * `"fred"` ==> `NA`

### Functions For Explicit Coercion

There is a whole family of functions for coercing objects between different modes (or different types) that take the form `as.something`:

* `as.logical(x)`
* `as.numeric(x)`
* `as.integer(x)` (integer is not a mode of its own; it is a flavor of the `numeric` mode)
* `as.complex(x)`
* `as.character(x)`

As expected, these are vectorized: they coerce every element of the vector to the desired mode.

## Missing Data and Special Values in R {#missing-data}

We saw `NA` up above. That means "Not Available" and it denotes missing data. There are also two more interesting values:

1. `Inf` (`-Inf`) means $\infty$ (or $-\infty$) and arises from things like `1/0` (or `log(0)`, which gives `-Inf`).
1. `NaN` means "Not a Number" and it arises from situations where you can't evaluate something and it doesn't have an obvious limit, like `0/0` or `Inf/-Inf` or `0*Inf`.

* If you wish to test whether something is `NA` or `NaN`, you have `is.na(x)` and `is.nan(x)`, which return logical vectors.
* The same goes for testing if things are finite or infinite:

```{r}
x <- c(NA, 2, Inf, 4, NaN, 6)
is.nan(x)       # only the NaN
is.na(x)        # both NA and NaN
is.infinite(x)  # only Inf or -Inf
```

### Modes of Missing Data

Here is something to be aware of: missing values, like non-missing values, carry around their mode. Try this:

```{r}
x <- c(1, 2, NA, 4, "5")
x
x[3]  # this extracts the third element of x
c(10,20,30,x[3])
c(10, 20, 30, NA)  # this is a "fresh" NA, no coercion
```

## Vectorization {#vectorization}

* In R, the term _vectorization_ refers to the fact that, in many cases, when you apply a function to a vector, it applies the function to every element of the vector.
* This is apparent in many of the _operators_ and we will see it in plenty of other functions, too.
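As a quick taste of what this means for ordinary functions, here is a small sketch using two base R functions, `sqrt()` and `nchar()`. The vectors `squares` and `some.names` are hypothetical and only there for illustration.

```{r}
# hypothetical example vectors
squares    <- c(1, 4, 9, 16)
some.names <- c("fred", "boing")

# one call, applied to every element; no loop needed
sqrt(squares)       # 1 2 3 4
nchar(some.names)   # 4 5
```

In both cases the result is a vector with one value per input element, in the same order as the input.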
### Most Operators are Vectorized

This is _incredibly important_! All the mathematical operators, like `+`, `-`, `*`, the logical operators, like `&` (AND) and `|` (OR), and the comparison operators, like `<` and `>`, are hungry to operate _element-wise_ on every element of a vector. Example:

```{r}
fish.lengths <- c(121, 95, 87, 142)
fish.weights <- c(1011, 505, 702, 900)
fish.fatness <- fish.weights / fish.lengths
fish.fatness
```

### Vectorization is so important...

That we are going to open up a whole new lecture that starts with it.