---
title: "Statistical summaries"
author: Abhijit Dasgupta
date: BIOF 339

---

```{r setup, include=FALSE, child = here::here('slides/templates/setup.Rmd')}
```

---

## Where we've been

1. Understand what tidy data is
1. Manipulate data to make it tidy (tidyr, dplyr)
1. Transform particular variables
1. Write basic functions
1. High-throughput analyses
    - Lists of data sets
    - `map` to apply similar processes to each data set
    - for-loops to repeat same recipe on multiple data sets or objects

---

## Where we're going

1. Creating data summaries
1. Basic statistical comparisons between groups
1. Creating tables 
    -  Table 1
    -  Tables for analytic results

--

The basic assumption we'll make is that we will start with a tidy data set.


---
class: middle, center

# Statistical summaries

---

## Univariate summaries

**Single summaries**  
.pull-left[
- Mean (`mean`)
- Variance(`var`)
- Standard deviation (`sd`)
- Count (`nrow` or `dplyr::n` or `dplyr::n_distinct`)
]
.pull-right[
- Median ('median')
- Inter-quartile range (`IQR`)
- Mean absolute deviation (`mad`)
- Minimum (`min`) and Maximum (`max`)
]

--

**Multiple summaries**  
- Quantiles (`quantile`)
- Range (`range`)

---
class: middle, center

# Summarizing the breast cancer expression dataset

---

## Mean

```{r s1, eval = T, echo = T}
brca <- rio::import('../data/BreastCancer_Expression.csv')
brca %>% 
  summarize(across(starts_with('NP'), 
               mean, na.rm=T)) #<<
```

---

## Median

```{r s2, eval = T, echo = T}
brca %>% 
  summarize(across(starts_with('NP'), 
               median, na.rm=T)) #<<
```

---

## Standard deviation

```{r s3, eval = T, echo = T}
brca %>% 
  summarize(across(starts_with('NP'),
               sd, na.rm=T)) #<<
```

---

## Multiple summaries together

```{r s4, eval = T, echo = T}
brca %>% 
  summarize(across(starts_with('NP'),
               c(mean,
                 median,
                 sd), na.rm=T))
```

---

## Multiple summaries together

```{r s5, eval = T, echo = T}
brca %>% 
  summarize(across(-1, # got tired of typing
               c('Mean'=mean,
                 'Median' = median,
                 'SD'=sd), na.rm=T)) 
```

---

## Multiple summaries together

.left-column70[
```{r s7, eval = T, echo = T}
brca %>% 
  summarize(across(-1, 
               c('Mean' = mean,
                 'Median' = median,
                 'SD' = sd), na.rm=T)) %>% 
  pivot_longer(cols=everything(),
               names_to='variable', 
               values_to='value') %>% 
  # extract(variable, c('ID','Statistic'), 
          # regex = '(NP_\\d+)_([A-Za-z]+)') %>% 
  separate(variable, #<<
          c("Type",'ID','Statistic'), sep='_') %>% #<<
  pivot_wider(names_from = Statistic, values_from = value) %>% #<<
  unite(ID, c('Type','ID'), sep='_') #<<
```
]
.right-column70[
You could replace the highlighted code with

```{r, eval=F}
extract(variable, 
        c('ID','Statistic'),
        regex = '(NP_\\d+)_([A-Za-z]+)') %>% 
  pivot_wider(
    names_from=Statistic,
    values_from=value)
```

]

---
class: middle, inverse, center

# Summarizing a data set

---

## Data set summary

There is a function `summary` that will give you summaries of all the variables. It's nice for looking at the data, but the output format isn't very good for further manipulation

```{r s8, eval = T, echo = T}
summary(brca[,-1]) # Omit first column
```

---
class: middle, center

# Maybe an easier way?

---

## The `tableone` package

The `tableone` package is meant to create, you guessed it, Table 1. 

It is quite a convenient package for most purposes and saves gobs of time

---

## The `tableone` package

.pull-left[
```{r t1, eval = F, echo = T}
library(tableone)
tab1 <- CreateTableOne(data=brca[,-1])
tab1
```
]
.pull-right[
```{r 08-Summaries-8, eval=T, echo = F, ref.label="t1"}
```
]

---

## The `tableone` package

.pull-left[
```{r t2, eval = F, echo = T}
library(tableone)
tab1 <- CreateTableOne(data = brca[-1])
print(tab1, nonnormal = names(brca)[-1])
```

You have to give the variable names of those you think are non-normally distributed and need to be summarized by the median
]
.pull-right[
```{r 08-Summaries-9, eval=T, echo = F, ref.label="t2"}
```
]

---

## The `tableone` package

.pull-left[
```{r t3, eval = F, echo = T}
library(tableone)
tab1 <- CreateTableOne(data = brca[-1])
kableone(print(tab1, nonnormal = names(brca)[-1]),
         format='html')
```
]
.pull-right[
```{r 08-Summaries-10, eval=T, echo = F, ref.label="t3"}
```
]

---
class: middle, center

# Mixed data

---

Let's first put the expression and clinical data together

```{r 08-Summaries-11}
library(rio)
brca1 <- import('../data/clinical_data_breast_cancer_hw.csv')
brca2 <- import('../data/BreastCancer_Expression.csv')
brca <- left_join(brca1, brca2, by=c('Complete.TCGA.ID' = 'TCGA_ID')) %>% 
  mutate(Age.at.Initial.Pathologic.Diagnosis =
           as.numeric(Age.at.Initial.Pathologic.Diagnosis)) %>% 
  mutate(ER.Status = ifelse(ER.Status %in% c('Positive','Negative'),
                            ER.Status, NA))

summary(brca)
```

---

Let's first put the expression and clinical data together

```{r 08-Summaries-12}
library(rio)
brca1 <- import('../data/clinical_data_breast_cancer_hw.csv')
brca2 <- import('../data/BreastCancer_Expression.csv')
brca <- left_join(brca1, brca2, by=c('Complete.TCGA.ID' = 'TCGA_ID')) %>% 
  mutate(Age.at.Initial.Pathologic.Diagnosis =
           as.numeric(Age.at.Initial.Pathologic.Diagnosis)) %>% 
  mutate(ER.Status = ifelse(ER.Status %in% c('Positive','Negative'),
                            ER.Status, NA),
         HER2.Final.Status = ifelse(HER2.Final.Status=='Equivocal', 
                                    NA, HER2.Final.Status)) %>% 
  mutate(across(is.character, as.factor)) %>% 
  mutate(Complete.TCGA.ID = as.character(Complete.TCGA.ID)) #<<

str(brca)
```

---

Identify which variables are categorical (factors) and which are continuous (numeric)

```{r 08-Summaries-13}
catvars <- brca %>% select(where(is.factor)) %>% names()
ctsvars <- brca %>% select(where(is.numeric)) %>% names()
```

---

```{r tab1, eval = T, echo = T}
CreateCatTable(vars = catvars, data = brca)
```

---

```{r tab2, eval = T, echo = T}
CreateContTable(vars = ctsvars, data = brca)
```

---

.pull-left[
```{r tab2a, eval = F, echo = T}
brca <- brca %>% 
  rename(
    'Age'='Age.at.Initial.Pathologic.Diagnosis',
    'Last.Contact' = 'Days.to.Date.of.Last.Contact',
    'Death' = 'Days.to.date.of.Death'
  )
ctsvars <- brca %>% 
  select(where(is.numeric))%>% names()
CreateContTable(vars = ctsvars, data = brca)

```
]
.pull-right[
```{r 08-Summaries-16, eval=T, echo = F, ref.label="tab2a"}
```
]

---

## Putting it together

```{r tab3, eval = T, echo = T}
CreateTableOne(vars = c(catvars, ctsvars),
               data = brca)
```

---

## Putting it together

```{r tab4, eval = T, echo = T}
CreateTableOne(data = brca[,-1])
```


---

class: middle, center

# Grouped summaries

---

.left-column70[
```{r grp1, eval = T, echo = T, warning=F, message=F}
brca %>% 
  group_by(ER.Status) %>% 
  summarize(across(starts_with('NP'),
               mean))
```
]
.right-column70[
There are missing values now, so we have to use `na.rm=T`.
]

---

.left-column70[
```{r grp1a, eval = T, echo = T}
brca %>% 
  group_by(ER.Status) %>% 
  summarize(across(starts_with('NP'),
               mean, na.rm=T)) #<<
```
]
.right-column70[
We still have a row for the missing values of ER.Status
]

---

.left-column70[
```{r grp1b, eval = T, echo = T}
brca %>% 
  filter(!is.na(ER.Status)) %>% #<<
  group_by(ER.Status) %>% 
  summarize(across(starts_with('NP'),
               mean, na.rm=T))
```
]
.right-column70[
How about reversing the rows and columns for readability

]

---

.pull-left[
```{r grp1c, eval = F, echo = T}
brca %>% 
  filter(!is.na(ER.Status)) %>%
  group_by(ER.Status) %>% 
  summarize(across(starts_with('NP'),
               mean, na.rm=T)) %>% 
  pivot_longer(names_to='ID', values_to='value', #<<
               cols = c(-ER.Status)) %>% #<<
  pivot_wider(names_from = ER.Status, #<<
              values_from=value) #<<
```
]
.pull-right[
```{r 08-Summaries-22, eval=T, echo = F, ref.label="grp1c"}
```
]

---

### Using `tableone`

```{r grp1d, eval = T, echo = T}
CreateTableOne(
  data = brca %>% filter(!is.na(ER.Status)),
  vars = brca %>% 
    select(starts_with('NP')) %>% 
    names(),
  strata = 'ER.Status', # single quotes, not backticks
  test = F)
```

---

## Alternatives to **tableone**

+ [table1](https://github.com/benjaminrich/table1)
+ [gtsummary](https://cran.r-project.org/package=gtsummary)
+ [flextable](https://davidgohel.github.io/flextable/)
+ [arsenal](https://github.com/eheinzen/arsenal)

---

## arsenal

```{r, results='asis'}
library(arsenal)
summary(tableby(ER.Status ~ ., data = brca[,-1])) # Here . implies all other variables.
```