---
title: "Numerical Summaries"
---

## Summarizing data in R 1/2

- Have seen `summary` (5-number summary, plus mean, of each numeric column). But what if we want:
  - a summary or two of just one column
  - a count of observations in each category of a categorical variable
  - summaries by group
  - a different summary of all columns (e.g. SD)
- To do this, meet the pipe operator `%>%`. This takes an input data frame, does something to it, and outputs the result. (Keyboard shortcut to learn: `Ctrl-Shift-M`.)

## Summarizing data in R 2/2

- Output from a pipe can be used as input to something else, so can have a sequence of pipes.
- Summaries include: `mean`, `median`, `min`, `max`, `sd`, `IQR`, `quantile` (for obtaining quartiles or any percentile), `n` (for counting observations).
- Use our Australian athletes data again.

## Packages for this section

```{r numsum-R-1}
library(tidyverse)
```

```{r numsum-R-2, echo=FALSE, message=FALSE}
my_url <- url("http://ritsokiguess.site/datafiles/ais.txt")
athletes <- read_tsv(my_url)
```

```{r}
summary(athletes)
```

## Summarizing one column

- Mean height:

```{r numsum-R-3}
athletes %>% summarize(m=mean(Ht))
```

or to get mean and SD of BMI:

```{r numsum-R-4}
athletes %>% summarize(m = mean(BMI), s = sd(BMI)) -> d
d
```

This doesn't work, because `BMI` exists only as a column inside `athletes`, not as a variable of its own:

```{r}
#| error: true
mean(BMI)
```

## Quartiles

- `quantile` calculates percentiles ("fractiles"), so for the quartiles we want the 25th and 75th percentiles:

```{r numsum-R-5}
athletes %>% summarize(
  Q1=quantile(Wt, 0.25),
  Q3=quantile(Wt, 0.75))
```

## Creating new columns

- These weights are in kilograms. Maybe we want to summarize the weights in pounds.
- Convert kg to lb by multiplying by 2.2 (approximately).
- Create new column and summarize that:

```{r numsum-R-6}
athletes %>% mutate(wt_lb=Wt*2.2) %>%
  summarize(Q1_lb=quantile(wt_lb, 0.25),
            Q3_lb=quantile(wt_lb, 0.75))
```

## Counting how many

For example, number of athletes in each sport:

```{r numsum-R-7}
athletes %>% count(Sport)
```

## Counting how many, variation 2

Another way (which will make sense in a moment):

```{r numsum-R-8}
athletes %>% group_by(Sport) %>%
  summarize(count=n())
```

## Summaries by group

- Might want separate summaries for each "group", e.g. mean and SD of height for males and females. Strategy is `group_by` (to define the groups) and then `summarize`:

```{r numsum-R-9}
athletes %>%
  group_by(Sex) %>%
  summarize(mean_Ht = mean(Ht), sd_Ht = sd(Ht))
```

## Count plus stats

- If you want number of observations per group plus some stats, you need to go the `n()` way:

```{r}
athletes %>%
  group_by(Sex) %>%
  summarize(n = n(), mean_Ht = mean(Ht), sd_Ht = sd(Ht))
```

- This explains the second variation on counting: `n()` counts the observations within each group, so it answers "within each sport (or Sex), how many athletes were there?"

## Summarizing several columns

- Standard deviation of each (numeric) column:

```{r numsum-R-10}
athletes %>%
  summarize(across(where(is.numeric), \(x) sd(x)))
```

- Median and IQR of all columns whose name starts with H:

```{r numsum-R-11}
athletes %>%
  summarize(across(starts_with("H"),
                   list(med = \(x) median(x), iqr = \(x) IQR(x))))
```

## Same thing by group

```{r numsum-R-post-15}
athletes %>%
  group_by(Sex) %>%
  summarize(across(starts_with("H"),
                   list(med = \(h) median(h), iqr = \(h) IQR(h))))
```

Likewise for columns whose name ends with C:

```{r}
athletes %>%
  group_by(Sex) %>%
  summarize(across(ends_with("C"),
                   list(med = \(h) median(h), iqr = \(h) IQR(h))))
```
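
## Checking a grouped summary by hand

- A minimal sketch of the `group_by` + `summarize` + `across` pattern on a tiny data frame, small enough to check by hand. The data frame `toy` and its values are made up for illustration (they are not the athletes data); only the column names `Sex`, `Ht` and `Hg` are borrowed from it.

```{r}
library(tidyverse)

# Hypothetical toy data (assumption: invented values, not from ais.txt)
toy <- tribble(
  ~Sex,     ~Ht, ~Hg,
  "female", 160, 13,
  "female", 170, 14,
  "female", 180, 15,
  "male",   170, 15,
  "male",   180, 16,
  "male",   190, 17
)

# Count per group, plus median and IQR of every column starting with H
toy_summary <- toy %>%
  group_by(Sex) %>%
  summarize(
    n = n(),
    across(starts_with("H"),
           list(med = \(x) median(x), iqr = \(x) IQR(x)))
  )
toy_summary
```

- Each summary column is named as the original column, an underscore, then the name given in the list: `Ht_med`, `Ht_iqr`, `Hg_med`, `Hg_iqr`. With three values per group, the median is the middle value (170 and 180 for `Ht`), and the IQR of `Ht` is 10 in both groups.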