---
title: "Worksheet for Day 5 - building workflows"
author: "YOUR NAME HERE"
date: "`r Sys.Date()`"
output:
    html_document:
        theme: journal
#    slidy_presentation:
#        theme: default
---


This worksheet was created by [darachm](https://github.com/darachm) and
[Samson](https://github.com/samson920) with a [Creative Commons
Attribution-NonCommercial-ShareAlike 4.0
License](https://github.com/darachm/dll-r/blob/main/LICENSE.md).

# Instructions for this Worksheet

Write your code and/or your response underneath each task prompt. To render
your worksheet, click on the Knit button in Rstudio, or run the 
`render` function directly from your R console by copying and
pasting in this line of code: 

```{r,renderdoc,eval=F}
rmarkdown::render("dll-r_Day5_Lab.Rmd")
```

_( Don't understand what this is doing? Bring that up when we get to the
rmarkdown and package sections! )_

Be sure that your code is able to execute and that the result you expect is 
shown in the resulting html file.

------

# Day 5, Section 4 - workflows etc.

## 4.1 exercises

### do this after we finish 4.1

Work together. 
Take your work from yesterday, the work making ggplots.
Didn't make any ggplots? Now's your chance!!!

Organize it into a separate project directory, maybe called something like
`tidyverse-demo`. 
Include the Rmd scripts in an appropriate directory (`scripts`?),
the data in the `data` directory.
Write the Rmd so you can generate your whole report
by running `render` (in Console or via the 'Knit' button).

CAUTION: reading or writing files may not work! Instead of using something
like:

```{r,indemo,eval=F}
read.csv("that_file.csv")
```

you may have to look "up" and "over" (remember from day1/2?), like so:

```{r,indemo2,eval=F}
read.csv("../data/that_file.csv")
```

and for writing plots into an output directory, you would also need
to specify a path like "../outputs/that.jpeg"

In your group, discuss topics *such as*:

- How you organize your work, and what you think would be useful or not useful.
- How do you usually save notes/code/instructions?
- How do you share this with others?
- What are some disadvantages to this system?
- How do you leave instructions for future users?

## 4.2 exercises

Read 4.2.1 - 4.2.4, and after each of these attempt the exercises below.

4.2.1 and 4.2.3 are the most important to do!

### do after 4.2.1

Take a look at the files in the `data/viral_structural_proteins/` directory.

Read in one of the viral structural protein files.

You can check out the help for `read.delim` to learn what two arguments
you need to set to read one of these files, to control (1) what's in between
different fields and (2) if you need a header. 

Additionally, what other argument can you set to prevent the character strings
from being turned into factors?

### do after 4.2.2

What are two ways to access the protein sequence?

Use `grepl` to determine it has a certain combination of letters (make
something up, like "QQ") in the sequence using `grepl`.

Then, copy and modify this code to turn every set of those letters into a
different set of letters (or more or less letters!).
Use `gsub` or `sub`.

How do we calculate the number of characters in this protein sequence?
( Try searching with "??`number of characters`" )

### do after 4.2.3

- Write a simple for loop that just prints out ascending numbers.
    Whatever numbers. You choose.
- Loop through the first five or so filenames, printing them out.
- Modify the loop to only process files that end in "5.tsv".
- Modify this to actually read the file and print out the protein
    sequence.
- Modify this to calculate and print how long the protein sequence is,
    and print it out.

### do after 4.2.4

- Try making different sequences with `seq`. I suggest looking at the help `?`
    - Do 5 to 10 by 1's
    - Do 10 to 5 by 1's
    - Do 5 to 100 by 5's
- Write (or modify) the loop across files to calculate and save the
    length of the protein sequence in a vector, maybe call it `lengthz`.
    Do this for all the protein sequence files.
- Plot the result, the vector of lengths of protein sequences,
    using a histogram.

**Totally optional extra challenges**
- Modify your workflow/script here to look for occurrences of the "PRRA"
    motif in the sequences.
- In what viruses, what proteins does it pop up?
    How does that affect your interpretation?
- Does it show up where you'd expect it to?
    What might that suggest about this dataset?
- How might we come up with a null model for this? 
    How often would this occur? Under what assumptions? 
    How do you test that?


## 4.3 exercises

Installing things! 

### do after 4.3.1

- What's your current session info?
- What packages do you have installed?
- Where are they located on your computer?

### do after 4.3.2

Install the `stringdist` library from CRAN, and use it to: 

- compare the distance between strings "ATCGATCG" and "ATCGAACG"
        using the methods "hamming" and "lv"
- compare the distance between strings "ATCGATCG" and "ATCGAATCGC"
        using the methods "hamming" and "lv"

Check out the 
[official page for the package on CRAN](https://cran.r-project.org/web/packages/stringdist/index.html).
Look at the "reference manual" and the "vingette". 
Somtimes these are very technical (like this one), other ones (like `cowplot`
below) have more tutorials.

### do after 4.3.3

- Use `rmarkdown::render()` on this file to typeset it into an HTML file.
    Make sure you have the `rmarkdown` package installed. 

### do after 4.3.4

- Install the `colorblindcheck` package from user `nowosad`.
    It is located at `https://github.com/nowosad/colorblindcheck`.
- Use it to `palette_check` out the 
    default^[I _believe_ this is the default, not sure.] 
    color palette from R (`rainbow(8)`). Use `plot=T` to make sure you are
    plotting it for inspection.
- How does this palette look, for folks with different color vision mappings?
    Any obvious problems?


## 4.5 some bigger group problems


Now is an opportunity for you to put together everything that you've learned to
start asking questions about data. There are a few datasets listed below, along
with questions about the data. First, design some analyses you can do to answer
the question. Some examples might be comparing the means between certain
populations, plotting the distribution of a variable (or multiple variables). 
For all of these datasets, make sure to explore the data a little bit with
commands like head() and tail() before jumping into your analysis so you're
familiar with the data.

Let's split into groups.
Work together, make sure everyone is moving along nicely.
If you're having trouble, ask for help from your peers, and/or teaching folks.
Some are short, some are longer - if you finish one, attempt another!


**SARS2 antibody escape**

Starr et al Bloom used the Awesome Power Of Yeast Genetics to map
SARS-CoV-2 antibody escape probabilities early on in the pandemic.
Download the dataset [here](https://science.sciencemag.org/content/371/6531/850/tab-figures-data),
read it in, and analyze it.
Consider making a heatmap using `geom_tile`, and grouping/aggregating
the data by amino acid types, or antibodies added.

- Are there different amino acids that tend to be mutated to enable escape?
- Are there particular regions where changes are associated with escape?

**Alzheimer's Microarray Analysis**

These data are originally from
https://www.kaggle.com/andrewgao/alzheimer-microarray-analysis. You can
download them from https://github.com/darachm/dll-r/tree/main/data

- What gene is most associated with Alzheimer's disease (based on the unadjusted
p-value)?
- How many genes have unadjusted p-values less than 0.05?
- Do some reading on p-values. Why do you think the adjusted p-value is included
in the data?


**Mortality Data**

The CDC tracks data on mortality in the United States. We have downloaded a
dataset on mortality by state, by week across a few years. You can download the
data from https://github.com/darachm/dll-r/tree/main/data, though the data were
originally from
https://www.kaggle.com/codebreaker619/usa-weekly-counts-of-deaths-by-state-and-causes.

- What state has the highest number of deaths per year based on the data reported
here? Do you think this is a fair way to look at the data? What can you do to
help make the analysis more fair across states if you wanted to understand
which states had the highest death rates?
- What cause of death has increased the most from 2014 to 2018? Which has
decreased?
- What week (based on the MMWR week column) generally has the highest mortality?

**Calculating codon freuencies in the viral proteins**

Let's say you are actually looking for motif occurances in the viral protein
dataset we analyzed earlier, and doing
a careful job about balancing your test cases and inputs, etc.
Your null model might depend on how often certain amino acids tend to
occur. 

- How might you write some R code to count up each occurance of each amino 
    acid in this dataset (all ~240 proteins)?

- Can you figure out how to do this, how to calculate the amino acid frequencies
    across all the viral structural proteins in all the files?

**Mass cytometry data**

Mass cytometry is [so cool](https://youtu.be/eNKMdVMglvI?t=374) 
(well, 7500 kelvin, so hot I guess). It generates 
[data like this](https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-020-15956-9/MediaObjects/41467_2020_15956_MOESM3_ESM.xlsx).
Each one of those is a single cell, for all those parameters.
It's like single-cell RNA seq, but way more accurate on a per-cell measure
(and protein not RNA).

Take a look at that last tab in the Excel file, and export it as a CSV
(or TSV if you prefer).

- Analyze what factors are changing as the melanoma experiment progresses.


**Iris Dataset**

You can load this dataset from the `data` folder, `iris.csv`.

- What variable is most useful for distinguishing between species?
- What species has the biggest variability in flower size?