---
title: "SCN2A mutations from Sanders et al. (2018)"
output: html_notebook
---

Update: 2019/09/15

## 1. Introduction

### 1.1. Background

*SCN2A* is a voltage-gated sodium channel gene that encodes the neuronal sodium channel *NaV1.2* and plays a critical role in action potential initiation during early neurodevelopment. The latest study demonstrated that it is loss of function mutations that in *SCN2A* that lead to autism spectrum disorders (ASD), in contrast to gain of function, which leads to infantile seizures ([Ben-Shalom 2018](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5796785/)). 

In this tutorial, we will handle genetic data for *SCN2A* mutations identified in latest genomic studies, and explore their patterns using R basic functions. Our tutorial will utilize the summary data from [Sanders et al. (2018)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6015533/). We can obtain the list of mutations in the supplementary table 1 (Table S1).

### 1.2. Aims

What we will do with this dataset

- Be familiar with a data frame format.
- Practice functions you have learnt from the Chapter 3 and 4.
- Unboxing the dataset from a scientific journal ;)

## 2. Explore your data

### 2.1. Unboxing your dataset

We will read the txt file from the shared Dropbox link into your workspace. The file includes the first tab of the Table S1. You use `read.csv` function to read the file from the URL. Then, you create the `d` object.

```{r}
options(stringsAsFactors = F)
d = read.csv('https://www.dropbox.com/s/he7r6v4hxu0ibno/tableS1.Sanders2018.txt?dl=1', sep='\t')
```

Let's explore the object you just loaded. How would you check the class of the object `d`?

```{r}
# Please type the command below
class(d)
```

It shows that the `d` object is '_________'. 

Then, let's overview this data frame. We will use `head` function to print out the first few lines.

```{r}
head(d)
```

When you execute code in a notebook chunk, an output will be visible immediately beneath the input. From this, you can see several rows and columns in the data frame. Let's look at the first column `PatientID` and check which class it is. 

```{r}
# Please type the command below
head(d$PatientID)
```

Right. It's '___________'. How about the class of the column `Effect`?

```{r}
# Please type the command below
class(d$TrueRecurrence)
```

To check the class of columns, you don't need to type an individual column. We can overview the summary of the dataset using `summary` function. Which column has the `character` class?

```{r}
# Please type the command below
summary(d)
```

### 2.2. Difference between data frame and matrix

Here we will convert the data frame into a matrix, and compare which part will be different in this. To convert a data frame into a matrix, you can use the command called `as.matrix`.

```{r}
m = as.matrix(d)
```

Let's overview the matrix object. Can you tell difference with data frame?

```{r}
head(m[,'TrueRecurrence'])
```

---

### 2.3. Subsetting and Sorting

Some patients who have the SCN2A mutation (hereafter called "SCN2A patient") often have seizures. So we want to know when the seizure occurs in development.

Let's check the class first.

```{r}
class(d$SeizureOnsetDays)
```

Why this column contains character? Let's head the first few lines.

```{r}
head(d$SeizureOnsetDays)
```

It seems that some rows contain samples who do not have seizure or unknown information. It's represented by `"."`, and also recorded in another column called `Seizures`.

```{r}
head(d$Seizures)
```

So we want to subset rows where the seizure phenotype is available.

```{r}
d1 = d[d$Seizures=='Y',]
```

Let's see how many samples have seizure phenotypes?
Then, you can ask when is the earliest days for the representation of seizure phenotype? How can we check this? The fisrt, as seen previously the `SeizureOnsetDays` column is character so we cannot apply functions for numeric. 

```{r}
head(d1$SeizureOnsetDays)
```

So we have to convert this into numeric first. 

```{r}
d1$SeizureOnsetDays2 <- as.numeric(d1$SeizureOnsetDays)
```

Hmm. There's an warning for `NA` introduction. This is because some rows do not have character that we can properly convert from character to numeric. So possible solutions are either you can bear with this in your downstream analyses or 2) convert character into an appropriate form of numeric conversion.

Then, the question is how can we find the rows with `NA`? We will ask whether the rows contains `NA` or not using `is.na` function. This will return boolean as to `NA` presence.

```{r}
head(is.na(d1$SeizureOnsetDays2))
```

See we can find some rows with `NA`. One of them is the 7th row. Let's see how it looks like.

```{r}
d1$SeizureOnsetDays[7]
```

Here you have `<` (angle bracket) in the character so it won't properly converted to numeric information. Did you find more of these cases?

```{r}
d1[is.na(d1$SeizureOnsetDays2),]$SeizureOnsetDays
```

Our `NAs` all contains `<`, which prevent converting a character into a numeric.

We would fix for downstream analyses. For example, we can convert `<365` into `365`. One function we can try is `gsub`. This replace your string into a format that you may not get `NA`. For example, 

```{r}
# gsub('pattern in your character', 'new character you want to replace', vectors for your character)
d1$SeizureOnsetDays3 <- gsub('<', '', d1$SeizureOnsetDays)
head(d1$SeizureOnsetDays3)
```

Let's convert them into numerics.

```{r}
d1$SeizureOnsetDays3 <- as.numeric(d1$SeizureOnsetDays3)
```

Did you get warning for this?
Now we can ask our initial question. When is the earliest day for having seizure?

```{r}
min(d1$SeizureOnsetDays3)
```


### 2.4. Exercises

- Load the dataset using functions from the `tibble` package.
- Describe columns and set up a plan for downstream analyses.
- Explore other columns for frequency.

---

## 3. More details in SCN2A mutations

The dataset contains more details for genetic mutations in SCN2A patients. From this information, what can we analyze further? Here I list up few questions we can examine further. 

- Finding the position of the genetic mutations within SCN2A. Which information you would use? If you are not familiar with positional information on genetic variants (or mutations), please find the **Figure 1** or slides [Mutation from the Genetics BSMS205](https://docs.google.com/presentation/d/1-HXAbXKSBv8TrirsTwn9ZwRMki5TAau1f-t9EooMSkM/edit?usp=sharing).
- Counting the recurrent mutations at the same position (in other words, the same mutations from different patients), and examine whether the patients have similar phenotype.
- Finding the position where different consequences mutations occur. Please note that "consequences" are loss-of-function (Nonsense, Frameshift) or missense. 
- Sketch a plot for visualising your analysis.

<img src="https://ars.els-cdn.com/content/image/1-s2.0-S1525157810603546-gr2.gif" width="500">

**Figure 1. Standard mutation nomenclature** ([Ogino et al. 2007](https://www.sciencedirect.com/science/article/pii/S1525157810603546)). According to this guideline, genetic mutations can be represented for coding DNA reference sequences (`c.` prefix ) and protein-level amino acid sequences (`p.` prefix). 

### 3.1. For-loops and Vectorization

Here we examine more details of genetic mutations as to their functional consequence and position of SCN2A mutations. In the dataset includes, there are two columns called `c.DNA` and `p.Protein`, containing the cDNA or protein position for the genetic mutations. 

During these exercises, we will look at the concept of `for-loop` and `vectorization`, which you learn from the Chapter 4.4 and 4.5. Let's look at the column `p.Protein`. It contains protein positions from each patient. What would you check at the first place? 

- task 1
- task 2
- task 3
- task N

```{r}
# Write down your code to explore this column. 
head(d$p.Protein)
```

If you need more information on SCN2A, please visit [the Uniprot page for SCN2A](https://www.uniprot.org/uniprot/Q99250)

Then, I would remove the characters from the string so we can have only numerics for positions. Here I use `gsub` function to extract numerics from string. 

```{r}
# Remove non-numeric characters from the string
gsub('[^0-9]', '', 'p.R102X')
```

In the dataset, we have many rows for protein positions. One way we might try is to set up `for-loop` to process each row. 

```{r}
# Please write a code
for (i in 1:263) {
  a <- gsub('[^0-9]', '', d[i, 'p.Protein'] )
}
```

Do you think this is an effective approach? As we have done in your assignment, `for-loop` is not great choice to process vectors because R can enable `______________` for a shorter and clearer code. So this mean you can apply `gsub` on vector and return your output to another column (could be new assign). 

```{r}
# Please write a code
```

### 3.2. Counting the recurrent mutations

Recurrent mutations are the ones that the same genetic mutations occur in multiple individuals. Recurrent mutations can be common 1) when the mutation does not affect on natural selection, 2) when the mutation is beneficial, 3) in the hospot for a disease or strongly associated with trait. However, given we are dealing with the genetic mutations from rare disorders, the mutations in the dataset are supposed to be uniquely present in general population. Otherwise, the recurrent mutations can indicate strong association with the phenotype. 

To assess the recurrent mutations, the first thing we can try is to examine whether the same mutations occur in multiple individuals. Since the dataset contains individual patients for each row, we can simply check the frequency using: 

```{r}
# Please write a code
c <- as.data.frame(table(d$p.Protein))
```

or we can check the number of unique variants in the dataset by:

```{r}
# Please write a code
```

How many unique variants you can find? and which variants are occurred in multiple times?

Then, you can use other columns to check frequency for different groups. Which columns you can use for more grouping?

- ?
- ?
- ?

If you find that, please check the recurrent mutations for each group.

```{r}
# Please write a code
```


### 3.3. Finding the position where different consequences of mutations occur

If you checked the recurrent mutations, you might want to find a locus where two or more variants occur. Such loci might indicate functionally important position of the gene and you might find some insight as to pathophysiology. 

```{r}
# Please write a code
```

### 3.4. Sketch a plot for visualising your analysis

Please submit your hand-drawing for the plot you would like to create.