--- title: "Choosing things in dataframes" --- ## Packages The usual: ```{r} library(tidyverse) ``` ## Doing things with data frames Let's go back to our Australian athletes: ```{r read-athletes, echo=F} #| message = FALSE my_url <- "http://ritsokiguess.site/datafiles/ais.txt" athletes <- read_tsv(my_url) ``` ```{r choosing-R-1} athletes ``` ## Choosing a column ```{r choosing-R-2} athletes %>% select(Sport) ``` ## Choosing several columns ```{r choosing-R-3} athletes %>% select(Sport, Hg, BMI) ``` ## Choosing consecutive columns ```{r choosing-R-4} athletes %>% select(Sex:WCC, BMI) ``` ```{r} 1:5 ``` ## Choosing all-but some columns ```{r choosing-R-5} athletes %>% select(-(RCC:LBM)) ``` ```{r} athletes %>% select(Sex:Sport, Ht:Wt) ``` ## Select-helpers Other ways to select columns: those whose name: - `starts_with` something - `ends_with` something - `contains` something - `matches` a "regular expression" - `everything()` select all the columns ## Columns whose names begin with S ```{r choosing-R-6} athletes %>% select(starts_with("S")) ``` ## Columns whose names end with C either uppercase or lowercase: ```{r choosing-R-7} athletes %>% select(ends_with("c")) ``` ## Case-sensitive This works with any of the select-helpers: ```{r choosing-R-8} athletes %>% select(ends_with("C", ignore.case=FALSE)) ``` ## Column names containing letter R ```{r choosing-R-9} athletes %>% select(contains("r")) ``` ## Exactly two characters, ending with T In regular expression terms, this is `^.t$`: - `^` means "start of text" - `.` means "exactly one character, but could be anything" - `$` means "end of text". ```{r choosing-R-10} athletes %>% select(matches("^.t$")) ``` ## Choosing columns by property - Use `where` as with summarizing several columns - eg, to choose text columns: ```{r choosing-R-11} athletes %>% select(where(is.character)) ``` ## Choosing rows by number ```{r choosing-R-12} athletes %>% slice(16:25) ``` ## Non-consecutive rows ```{r choosing-R-13} athletes %>% slice(10, 13, 17, 42) ``` ## A random sample of rows ```{r choosing-R-14} athletes %>% slice_sample(n=8) ``` ## Rows for which something is true ```{r choosing-R-15} athletes %>% filter(Sport == "Tennis") ``` ## More complicated selections ```{r choosing-R-16} athletes %>% filter(Sport == "Tennis", RCC < 5) ``` ## Another way to do "and" ```{r choosing-R-17} athletes %>% filter(Sport == "Tennis") %>% filter(RCC < 5) ``` ## Either/Or ```{r choosing-R-18} athletes %>% filter(Sport == "Tennis" | RCC > 5) ``` ## Sorting into order ```{r choosing-R-19} athletes %>% arrange(RCC) ``` ## Breaking ties by another variable ```{r choosing-R-20} athletes %>% arrange(RCC, BMI) ``` ## Descending order ```{r choosing-R-21} athletes %>% arrange(desc(BMI)) ``` ## "The top ones" ```{r choosing-R-22} athletes %>% arrange(desc(Wt)) %>% slice(1:7) %>% select(Sport, Wt) ``` ## Another way ```{r choosing-R-23} athletes %>% slice_max(order_by = Wt, n=7) %>% select(Sport, Wt) ``` ## Create new variables from old ones ```{r new-from-old} athletes %>% mutate(wt_lb = Wt * 2.2) %>% select(Sport, Sex, Wt, wt_lb) %>% arrange(Wt) ``` ## Turning the result into a number Output is always data frame unless you explicitly turn it into something else, eg. the weight of the heaviest athlete, as a number: ```{r to-number} athletes %>% arrange(desc(Wt)) %>% pluck("Wt", 1) -> heavy heavy heavy * 2.2 ``` Or the 20 heaviest weights in descending order: ```{r choosing-R-24} athletes %>% arrange(desc(Wt)) %>% slice(1:20) %>% pluck("Wt") ``` ## Another way to do the last one ```{r choosing-R-25} athletes %>% arrange(desc(Wt)) %>% slice(1:20) %>% pull("Wt") ``` `pull` grabs the column you name *as a vector* (of whatever it contains). ## To find the mean height of the women athletes Two ways: ```{r choosing-R-26} athletes %>% group_by(Sex) %>% summarize(m = mean(Ht)) ``` ```{r choosing-R-27} athletes %>% filter(Sex == "female") %>% summarize(m = mean(Ht)) ``` ## Summary of data selection/arrangement "verbs" {.smaller} | Verb | Purpose | |:---------------|:-----------------------------------------------------------------| | `select` | Choose columns | | `slice` | Choose rows by number | | `slice_sample` | Choose random rows | | `slice_max` | Choose rows with largest values on a variable (also `slice_min`) | | `filter` | Choose rows satisfying conditions | | `arrange` | Sort in order by column(s) | | `mutate` | Create new variables | | `group_by` | Create groups to work with | | `summarize` | Calculate summary statistics (by groups if defined) | | `pluck` | Extract items from data frame | | `pull` | Extract a single column from a data frame as a vector | ## Looking things up in another data frame - Suppose you are working in the nails department of a hardware store and you find that you have sold these items: ```{r choosing-R-28, message=FALSE} my_url <- "http://ritsokiguess.site/datafiles/nail_sales.csv" sales <- read_csv(my_url) sales ``` ## Product descriptions and prices - but you don't remember what these product codes are, and you would like to know the total revenue from these sales. - Fortunately you found a list of product descriptions and prices: ```{r choosing-R-29, message=FALSE} my_url <- "http://ritsokiguess.site/datafiles/nail_desc.csv" desc <- read_csv(my_url) desc ``` ## The lookup - How do you "look up" the product codes to find the product descriptions and prices? - `left_join`. ```{r choosing-R-30} sales %>% left_join(desc) ``` ## What we have - this looks up all the rows in the *first* dataframe that are also in the *second*. - by default matches all columns with same name in two dataframes (`product_code` here) - get *all* columns in *both* dataframes. The rows are the ones for that `product_code`. So now can work out how much the total revenue was: ```{r choosing-R-31} sales %>% left_join(desc) %>% mutate(product_revenue = sales*price) %>% summarize(total_revenue = sum(product_revenue)) ``` ## More comments - if any product codes are not matched, you get NA in the added columns - anything in the *second* dataframe that was not in the first does not appear (here, any products that were not sold) - other variations (examples follow): - if there are two columns with the same name in the two dataframes, and you only want to match on one, use `by` with one column name - if the columns you want to look up have different names in the two dataframes, use `by` with a "named list" ## Matching on only some matching names - Suppose the `sales` dataframe *also* had a column `qty` (which was the quantity sold): ```{r choosing-R-32} sales %>% rename("qty"="sales") -> sales1 sales1 ``` - The `qty` in `sales1` is the quantity sold, but the `qty` in `desc` is the number of nails in a package. These should *not* be matched: they are different things. ## Matching only on product code ```{r choosing-R-33} sales1 %>% left_join(desc, join_by(product_code)) ``` - Get `qty.x` (from `sales1`) and `qty.y` (from `desc`). ## Matching on different names 1/2 - Suppose the product code in `sales` was just `code`: ```{r choosing-R-34} sales %>% rename("code" = "product_code") -> sales2 sales2 ``` - How to match the two product codes that have different names? ## Matching on different names 2/2 - Use `join_by`, but like this: ```{r choosing-R-35} sales2 %>% left_join(desc, join_by(code == product_code)) ``` ## Other types of join - `right_join`: interchanges roles, looking up keys from second dataframe in first. - `anti_join`: give me all the rows in the first dataframe that are *not* in the second. (Use this eg. to see whether the product descriptions are incomplete.) - `full_join`: give me all the rows in both dataframes, with missings as needed. ## Full join here ```{r choosing-R-36} sales %>% full_join(desc) ``` - The missing `sales` for "masonry nail" says that it was in the lookup table `desc`, but we didn't sell any. ## The same thing, but with `anti_join` Anything in first df but not in second? ```{r} desc %>% anti_join(sales) # sales %>% anti_join(desc) ``` Masonry nails are the only thing in our product description file that we did not sell any of.