---
title: "TCGA Lung cancer"
output: html_notebook
---

##  1. Introduction
Lung cancer one of major cancers, accounting for [2.09 million deaths out of the 9.6 million total cancer deaths in 2018](https://www.who.int/en/news-room/fact-sheets/detail/cancer). There are two main types of lung cancer: small cell lung carcinoma (SCLC) and non-small cell lung carcinoma (NSCLC). NSCLC is responsible for 85–90% of lung cancer cases and its two largest subtypes are lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC). 

LUAD develops in the periphery of the lungs and may be associated with smoking, but is the most common lung cancer type among non-smokers. In contrast, LUSC accounts for 25–30% of all total lung cancer cases while LUAD accounts for 40% of all total lung cancer cases. LUSC are likely to be found in the middle of the lungs and is associated with smoking. 

In this tutorial, we will explore the sample dataset of LUSC and LUAD from the The Cancer Genome Atlas (TCGA), a public resource for the genomic dataset. Here we use two types of lung cancers - LUAD and LUSC and will examine which information we can use for genomic analyses of lung cancers.

## 2. Explore your data
We will load a package and dataset for the tutorial.

```{r message=FALSE}
# Load packages
library(ggplot2, quietly = T)
library(tidyverse, quietly = T)
library(readr, quietly = T)

# Load dataset
# wget https://www.dropbox.com/s/xi1ilrp6uzwh4i8/data_20190822_luad_lusc_rnaseq_normalized.RData
load('data_20190822_luad_lusc_rnaseq_normalized.RData')
```

Let's explore which columns are provided in this dataset. 

```{r}
colnames(d_luad)
```

```{r}
colnames(d_lusc)
```

### 2.1. Which tissues or samples are available for your analysis?

Lung cancer samples in the dataset are provided in tumor or normal tissues. The column `shortLetterCode` contains [Sample Type Codes](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes) to describe the type of tissues collected in the dataset.

- TP: Primary solid Tumor
- TR: Recurrent solid Tumor
- NT: Solid Tissue Normal

Check which samples are included in the LUAD dataset.

```{r}
d_luad %>% count(shortLetterCode)
```

Check which samples are included in the LUSC dataset.

```{r}
d_lusc %>% count(shortLetterCode)
```

Let's make a simple bar plot to compare the number of tissues between two tissue types. We will look at the LUAD samples first.

```{r}
d_luad %>%
  count(shortLetterCode) %>% 
  ggplot(., aes(shortLetterCode, n, fill=shortLetterCode)) + 
  geom_bar(stat="identity", position=position_dodge()) 
```

Now we can plot the LUSC samples.

```{r}
d_lusc %>%
  count(shortLetterCode) %>% 
  ggplot(., aes(shortLetterCode, n, fill=shortLetterCode)) + 
  geom_bar(stat="identity", position=position_dodge()) 
```

We made two separate plots for the LUAD and LUSC dataset. It would be great if we can merge them into one plot. Here I put the code for this. It would be good practice if you comment or delete the line that you don't understand fully, you will compare the difference between outcomes. Also, please figure out what Qs do in this plot.

```{r}
bind_rows(d_luad %>% 
            mutate(type='luad') %>%
            select(type, shortLetterCode, tumor_stage), 
          d_lusc %>% 
            mutate(type='lusc') %>%
            select(type, shortLetterCode, tumor_stage)) %>%
  count(shortLetterCode, type) %>% 
  complete(type, shortLetterCode, fill = list(n = 0)) %>% # Q1: What is this?
  ggplot(., aes(shortLetterCode, n, fill=shortLetterCode)) + 
  geom_bar(stat="identity", position=position_dodge()) + 
  facet_wrap(~type, scales = 'free_x', ncol=5) # Q2: What is this?
```


### 2.2. Distribution of age at diagnosis

We will plot the distribution of age at diagnosis using histogram. Few things you can check:
- Is the distribution continuous?
- Is it fowlloing the normal distribution?
- What is the scale on the x-axis?
- If the distribution is stratified, which makes it?

```{r}
# Plot histogram
ggplot(d_luad, aes(age_at_diagnosis)) + geom_histogram(bins=100)
```

The x-axis is a day scale. Let's convert it to year.

```{r}
# Change the axis
ggplot(d_luad, aes(age_at_diagnosis/365)) + geom_histogram(bins=100)
```

To create a smooth density, we use the geom_density.

```{r}
ggplot(d_luad, aes(age_at_diagnosis)) + geom_density()
```


More plot types on distribution. First we can try the plot for both histogram and density.

```{r}
# Plot both histogram and density plot
ggplot(d_luad, aes(age_at_diagnosis/365)) + 
  geom_histogram(bins=100, colour="black", fill="white") + 
  geom_density(alpha=.2, fill="#FF6666")

```

What did you miss from this?

Let's plot a bit different version. We will replace the y-axis of the histogram with the y-axis of the density plot. 

```{r}
# Histogram with density plot
ggplot(d_luad, aes(age_at_diagnosis/365)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white")+
 geom_density(alpha=.2, fill="#FF6666") 
```

Q: Do you think whether it is good visualization for the distribution?


From the plot above, we still found the bump around the center. What would contribute to this stratification? Let's look at some variables from other columns. First we can try information from the `gender` column. 

```{r}
ggplot(d_luad, aes(age_at_diagnosis/365, fill=gender)) + geom_histogram(bins=100)
```

Histogram would be okay but not best visualization. What else we can try?

```{r}
ggplot(d_luad, aes(gender, age_at_diagnosis/365, fill=gender)) + 
  geom_boxplot()
```

Can you tell the difference between boxplot and density plot? We can try a violin plot.

```{r}
ggplot(d_luad, aes(gender, age_at_diagnosis/365, fill=gender)) + 
  geom_violin()
```

Can you tell the difference between boxplot and violin plot? 

```{r}
# ggplot(d_luad, aes(gender, age_at_diagnosis/365, fill=gender)) +
#   geom_boxplot() + 
#   geom_violin()

# ggplot(d_luad, aes(gender, age_at_diagnosis/365, fill=gender)) +
#   geom_violin() + 
#   geom_boxplot()

# ggplot(d_luad, aes(gender, age_at_diagnosis/365)) +
#   geom_violin(aes(fill=gender)) + 
#   geom_boxplot(fill='white')

ggplot(d_luad, aes(gender, age_at_diagnosis/365)) +
  geom_violin(aes(fill=gender)) + 
  geom_boxplot(fill='white', width=0.25)
```


```{r}
ggplot(d_luad, aes(gender, age_at_diagnosis/365)) + 
  geom_violin(aes(fill=gender))  + 
  geom_boxplot(fill='white') + 
  geom_dotplot(binaxis='y', stackdir='center', dotsize=1, alpha=0.5)

# Try different options from the manual
# http://www.sthda.com/english/wiki/ggplot2-violin-plot-quick-start-guide-r-software-and-data-visualization
```


Do you think gender is the reason? If not, how about tumor stages?

```{r}
ggplot(d_luad, aes(age_at_diagnosis/365)) + 
  labs(x ='Age at Diagnosis (years)', y = 'Number of LUAD patients',
       title = 'TCGA: Lung cancer adenocarcinoma') +
  geom_histogram(bins=100) 
```


#### More tasks for the class.

- Q1. Add labels for the plot.
- Q2. Change the color for categories.
- Q3. Save to PDF file
- Q4. Select the column with continuous information and plot the distribution by yourself.

```{r}
# assign p for plot
p <- ggplot(d_luad, aes(age_at_diagnosis/365)) + 
  labs(x ='Age at Diagnosis (years)', y = 'Number of LUAD patients',
       title = 'TCGA: Lung cancer adenocarcinoma') +
  geom_histogram(bins=100) 


# save to file
ggsave('plot.TCGA_LUAD.histogram_AgeOfDX.20191002.pdf', p, width = 16, height = 9)

```


### 2.3. Tumor stages of lung cancers

First information you might want to explore is staging cancers. Different types of staging systems are used for different types of cancer. You can read [further information in general staging rules](https://www.nhs.uk/common-health-questions/operations-tests-and-procedures/what-do-cancer-stages-and-grades-mean/), or specifics in lung cancers ([e.g. stage IA](https://www.cancer.gov/publications/dictionaries/cancer-terms/def/stage-ia-non-small-cell-lung-cancer)). 


```{r}
# Count the number of tumors in the LUAD dataset
counts_tumor <- d_luad %>% count(tumor_stage)

# Try bar plot
ggplot(counts_tumor, aes( tumor_stage, n)) + geom_bar(stat="identity")

```


Now you can combine both for visualization.


```{r}
# Cancer type by tumor stages
bind_rows(d_luad %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='luad') %>%
            select(type, tumor_stage), 
          d_lusc %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='lusc') %>%
            select(type, tumor_stage)) %>% 
  count(type, tumor_stage) %>%
  mutate(tumor_stage = factor(tumor_stage, levels=rev(unique(tumor_stage)))) %>%
  ggplot(aes( tumor_stage, n, fill=type)) + 
  labs(title = 'Cancer type by tumor stages',
       x = '', y='Number of samples for TCGA RNA-seq') + 
  theme_minimal() + geom_bar(stat="identity") + 
  facet_wrap(~type) + coord_flip() 
```

### 2.4. Cancer type by gender

Let's try another information - gender. We will describe this by a bar plot like above. After plotting, which information you can read from this?

NB: I am not pretty sure why TCGA put gender, not sex in the dataset, in addition to mixed use of female/male with gender. 

```{r}
# Cancer type by Sex
bind_rows(d_luad %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='luad') %>%
            select(type, gender), 
          d_lusc %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='lusc') %>%
            select(type, gender)) %>% 
  count(type, gender) %>%
  ggplot(aes( gender, n, fill=type)) + 
  labs(title = 'Cancer type by Sex',
       x = '', y='Number of samples for TCGA RNA-seq') + 
  theme_minimal() + geom_bar(stat="identity") + 
  facet_wrap(~type) + coord_flip() 
```

Is there difference in tumor stages by gender? Let's check with LUSC.

```{r}
# Add gender with stages
bind_rows(d_luad %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='luad') %>%
            select(type, tumor_stage, gender), 
          d_lusc %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='lusc') %>%
            select(type, tumor_stage, gender)) %>% 
  count(type, tumor_stage, gender) %>% 
  complete(gender, type, tumor_stage, fill = list(n = 0)) %>%
  mutate(tumor_stage = factor(tumor_stage, levels=rev(unique(tumor_stage))), 
         gender = factor(gender)) %>%
  ggplot(., aes(tumor_stage, n, fill=gender)) + 
  geom_bar(stat="identity", position=position_dodge()) + 
  facet_wrap(~type) + coord_flip() 

```

### 2.5. Site of biopsy

```{r}
# Cancer type by biopsy
bind_rows(d_luad %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='luad') %>%
            select(type, site_of_resection_or_biopsy), 
          d_lusc %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='lusc') %>%
            select(type, site_of_resection_or_biopsy)) %>% 
  count(type, site_of_resection_or_biopsy) %>%
  ggplot(aes( site_of_resection_or_biopsy, n, fill=type)) + 
  labs(title = 'Cancer type by biopsy',
       x = '', y='Number of samples for TCGA RNA-seq') + 
  theme_minimal() + geom_bar(stat="identity") + 
  facet_wrap(~type) + coord_flip() 

```

Plot the years at diagnosis by tissue or organ of origin. We would also include difference by gender. 

```{r}
bind_rows(d_luad %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='luad') %>%
            select(type, gender, age_at_diagnosis, tissue_or_organ_of_origin), 
          d_lusc %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='lusc') %>%
            select(type, gender, age_at_diagnosis, tissue_or_organ_of_origin)) %>%
  ggplot(., aes(gender, age_at_diagnosis/365, fill=gender)) + 
  geom_boxplot() + labs(y='year at diagnosis') + 
  facet_wrap(~tissue_or_organ_of_origin)
```