# Exercises

This chapter only contains exercises. The solutions are in the next chapter, which has a numbering parallel to this one.

To work on the exercises it is probably best to create either an RMarkdown Notebook (File → New File → R Notebook) or an RMarkdown document (File → New File → R Markdown... → choose Document as type and give it a title). A Notebook has the advantage that you can toggle the visibility of the code, but it has fewer output options (no pdf or Word). I strongly suggest you play around with both.

This is a link to an [R cheat sheet](figures/R_cheatsheet.pdf) that you could use for selecting the right function for a task.

If you want, you can download this chapter as a separate file and work in that. You can download it [here](https://raw.githubusercontent.com/MichielNoback/davur1_gitbook/master/07-exercises.Rmd). Before commencing you should put in a new Markdown header at the top of the file, with at least these contents:

```
---
title: "Exercise solutions DAVuR1"
author: ""
---
```

#### Plotting rules {-}

With all plots, take care to adhere to the rules regarding titles and other decorations. Tip: the site [Quick-R](http://www.statmethods.net/) has nice detailed information with examples on the different plot types and their configuration. Especially the section [on plotting](http://www.statmethods.net/graphs/index.html) is helpful for these assignments.

## Toolbox

### Get set up

Install these tools on your own PC or laptop, in this order:

1. R itself: [https://cran.r-project.org/bin/windows/base/](https://cran.r-project.org/bin/windows/base/)
2. RStudio: [https://rstudio.com/products/rstudio/download/](https://rstudio.com/products/rstudio/download/) (choose the free version of course)
3. [Optionally] If you want to generate pdf documents you should also install a LaTeX distribution: MiKTeX or TinyTeX for Windows, or MacTeX on MacOS. If you want to keep things simple I suggest you stick to HTML and MS Word output (Word can also export to PDF).

### Résumé

Create an R Markdown document called Résumé.Rmd and write your curriculum vitae in it using Markdown syntax. Use the R Markdown Reference Guide or Cheat Sheet, or this [Markdown Cheatsheet](https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf) (you don't need R for your résumé so markdown alone is enough).

## Basic R

### Math in the console

In the console, calculate the following:

$31 + 11$

$66 - 24$

$\frac{126}{3}$

$12^2$

$\sqrt{256}$

$\frac{3 \times (4+\sqrt{8})}{5^3}$

### First look at functions

**A** View the help page for `paste()`. There are two variants of this function. Which? And what is the difference between them? Use both variants to generate _exactly_ this message `"welcome to R"` from these arguments: `"welcome ", "to ", "R"`

**B** What does the `abs` function do? What is returned by `abs(-20)` and what is `abs(20)`?

**C** What does the `c` function do? What is the difference in returned value (result) of `c()` when you combine either `1`, `3` and `"a"` as arguments, or `1`, `2` and `3`? Use the function `class()` to answer this.

**D** What does the function `install.packages()` do? Use it to install the package `RColorBrewer`. We'll have a look at this package later on.

### Variables

Create three variables with the given values: x = 20, y = 10 and z = 3. Next, calculate the following with these variables:

1. $x+y$
2. $x^z$
3. $q = x \times y \times z$
4. $\sqrt{q}$
5. $\frac{q}{\pi}$ (pi is simply pi in R)
6. $\log_{10}{(x \times y)}$

### Vectors

#### Circles {-}

The circumference of a circle with radius $r$ is $2\pi r$; the surface of a sphere with that radius is $4\pi r^2$ and its volume is $\frac{4}{3}\pi r^3$. Given this vector of radii,

```{r radiuses-defined, eval = FALSE}
radiuses <- c(0, 1, 2, pi, 4)
```

**A** Calculate their circumference.

**B** Calculate their surface.

**C** Calculate their volume.

#### Creating vectors {-}

Create the following vectors, _as efficiently as possible_. The functions `rep()`, `seq()`, `paste0()` and the colon operator `:` can be used, in any combination.

**A** `[1] 1 2 5 1 2 5`

**B** `[1] 9 9 9 8 8 8 7 7 7 6 6 6 5 5 5`

**C** ` [1] 1 1 1 4 4 4 9 9 9 1 1 1 4 4 4 9 9 9`

**D** ` [1] "1a" "2b" "3c" "4d" "5e" "1a" "2b" "3c" "4d" "5e"`

**E** `[1] "0z" "0.2y" "0.4x" "0.6w" "0.8v" "1u"`

**F** `[1] "505" "404" "303" "202" "101" "000"`

**G [Challenge]** `[1] "0.5A5.0" "0.4B4.0" "0.3C3.0" "0.2D2.0" "0.1E1.0"`

### Stair walking and heart rate

The vectors below hold data for a staircase walking experiment. A subject of normal weight and height was asked to ascend a (long) staircase wearing a heart-rate monitor. The subject's heart rate was recorded for different stair heights. Create a **line plot** showing the dependence of heart rate (y axis) on stair height (x axis).

```{r eval = F}
#number of steps on the stairs
stair_height <- c(0, 5, 10, 15, 20, 25, 30, 35)
#heart rate after ascending the stairs
heart_rate <- c(66, 65, 67, 69, 73, 79, 86, 97)
```

### More subjects

The experiment from the previous question was extended with three more subjects. One of these subjects was, like the first, of normal weight, whereas the other two were obese. The data are given below. Create a single **scatter plot** with connector lines between the points showing the data for all four subjects. Give the normal-weight subjects a green line and symbol and the obese subjects a red line and symbol. You can add new data series to a plot by using the `points(x, y)` function. Use the `ylim` argument to adjust the Y-axis range.

```{r eval = F}
#number of steps on the stairs
stair_height <- c(0, 5, 10, 15, 20, 25, 30, 35)
#heart rates for subjects with normal weight
heart_rate_1 <- c(66, 65, 67, 69, 73, 79, 86, 97)
heart_rate_2 <- c(61, 61, 63, 68, 74, 81, 89, 104)
#heart rates for obese subjects
heart_rate_3 <- c(58, 60, 67, 71, 78, 89, 104, 121)
heart_rate_4 <- c(69, 73, 77, 83, 88, 96, 102, 127)
```

### Chickens on a diet

The body weights of chicks were measured at birth and every second day thereafter until day 20. They were also measured on day 21. In the experiment there were four groups of chicks on different protein diets. Here are the data for the first four chicks. Chicks one and two were on diet 1 and chicks three and four were on diet 2. Create a single line plot showing the data for all four chicks. Give each chick its own color.

```{r eval = F}
# chick weight data
time <- c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 21)
chick_1 <- c(42, 51, 59, 64, 76, 93, 106, 125, 149, 171, 199, 205)
chick_2 <- c(40, 49, 58, 72, 84, 103, 122, 138, 162, 187, 209, 215)
chick_3 <- c(42, 53, 62, 73, 85, 102, 123, 138, 170, 204, 235, 256)
chick_4 <- c(41, 49, 61, 74, 98, 109, 128, 154, 192, 232, 280, 290)
```

### Chicken bar plot

With the data from the previous question, create a bar plot of the maximum weights of the chicks.

### Discoveries

The R language comes with a wealth of datasets for you to use as practice materials. We will see several of these.
One of these datasets is the time series `discoveries`, holding the numbers of "great" inventions and scientific discoveries in each year from 1860 to 1959. Type its name in the console to see it. Create plot(s) answering these questions:

**A** What is the frequency distribution of the number of discoveries per year? Use the `barplot()` and `table()` functions for this.

**B** What is the 5-number summary of discoveries per year?

**C** What is the trend over time for the numbers of discoveries per year?

PS: This is actually not a simple vector but a vector with some time-related attributes. It is called a time series (a `ts` class), but this does not really matter for this assignment.

### Lung cancer

The R `datasets` package has three related time series concerning lung cancer deaths. These are `ldeaths`, `mdeaths` and `fdeaths` for total, male and female deaths respectively. Create a line plot of the monthly mortality showing all three of these datasets. Use the `legend()` function to add a legend to the plot, as demonstrated in this example:

```{r legend-demo, out.width='80%', fig.asp=.75, fig.align='center'}
t <- 1:5
y1 <- c(2, 3, 5, 4, 6)
y2 <- c(1, 3, 4, 5, 7)
plot(t, y1, type = "b", ylab = "response", ylim = c(0, 8))
points(t, y2, col = "blue", type = "b")
legend("topleft",
       legend = c("series 1", "series 2"),
       col = c("black", "blue"),
       pch = 1,
       lty = 1)
```

**A** Create the mentioned line plot. Do you see trends and/or patterns and if so, can you explain these?

**B** Create a combined boxplot of the three time series. Are there outliers? If so, can you figure out when these occurred, and why?

## Complex datatypes

This section serves you some datatype challenges.

### Creating factors

**A** Given this vector:

```{r eval = FALSE}
animal_risk <- c(2, 4, 1, 1, 2, 4, 1, 4, 1, 1, 2, 1)
```

and these possible levels:

- 1: harmless
- 2: risky
- 3: dangerous
- 4: deadly

Create a factor from this data and then barplot the result.

**B** Given this data, a simulation of the wealth distribution over "poor", "middle class", "wealthy" and "rich":

```{r eval = FALSE}
set.seed(1234)
wealth_male <- sample(x = letters[1:4], size = 1000, replace = TRUE, prob = c(0.7, 0.17, 0.12, 0.01))
wealth_female <- sample(x = letters[1:4], size = 1000, replace = TRUE, prob = c(0.8, 0.15, 0.497, 0.003))
```

Create a factor from these two and report the cumulative percentage of its individual levels starting at the most abundant level, combined for male and female. Hint: use `table()` and `prop.table()`.

Next, create a side-by-side barplot of this data. Don't forget the legend!

### A dictionary with a named vector

Almost all programming languages have a (hash)map or dictionary data structure for storing so-called "key-value" pairs. They make it possible to "look up" the value belonging to a "key". That is where the term dictionary comes from. A dictionary holds keys (the words) and their meaning (values). R does not have a dictionary type, but you could make a dict-like structure using a **_vector with named elements_**. Here follows an example. If I wanted to create a DNA codon translation table, and use it to translate a piece of DNA, I could do something like what is shown below (there are only 4 of the 64 codons included).
See if you can figure out what is going on there.

```{r named-vector}
## define codon table as named vector
codons <- c("Gly", "Pro", "Lys", "Ser")
names(codons) <- c("GGA", "CCU", "AAA", "AGU")

## the DNA to translate
my_DNA <- "GGACCUAAAAGU"
my_prot <- ""

## iterate the DNA and take only every third position
for (i in seq(1, nchar(my_DNA), by = 3)) {
    ## extract the codon starting at this position and look up its residue
    codon <- substr(my_DNA, i, i + 2)
    my_prot <- paste(my_prot, codons[codon])
}
print(my_prot)
```

**A** Make a modified copy of this code chunk in such a way that no spaces are present between the amino acid residues (use help on `paste()` to figure this out) and that single-letter codes of amino acids are used instead of three-letter codes.

**B** **[Challenge]** Here is a vector called `nuc_weights`. It holds the weights for the nucleotides A, C, G and U respectively. Make it a named vector, iterate `my_DNA` from the above code chunk and calculate its molecular weight.

```{r eval = F}
nuc_weights <- c(491.2, 467.2, 507.2, 482.2)
```

### Protein concentrations with Lowry

The Lowry method is a widely used spectroscopic method to quantify protein amounts in solutions. Here are some real Lowry results for the quantification of proteins. The calibration curve is made using BSA (bovine serum albumin) as a standard. The concentrations are in mg/ml:

```{r}
bsa_conc <- c(0, 0, 0.025, 0.025, 0.075, 0.075, 0.125, 0.125)
```

Note that the duplicate measurements (duplos) are all put into one vector. We'll deal with that later. These are the obtained absorption values at 750 nm (again for the duplos):

```{r}
OD750 <- c(0.063, 0.033, 0.16, 0.181, 0.346, 0.352, 0.491, 0.488)
```

**A** Combine these two vectors in a dataframe and assign it to a variable with the name `dilution`. Do not add column names yet!

**B** Now add the appropriate column names: `prot_conc` and `absorption`.

**C** You forgot to include the last datapoints (again these are duplos):

```{r}
bsa_conc2 <- c(0.175, 0.175, 0.25, 0.25)
OD750_2 <- c(0.597, 0.595, 0.743, 0.742)
```

Generate a dataframe assigned to the variable `df_temp` from these datapoints. This time, add the appropriate column names when creating the dataframe. Be sure you use exactly the same column names.

**D** Now add the second dataframe `df_temp` to the dataframe with the name `dilution` and assign it to the name `dilution` (thus overwriting the original variable).

**E** Generate a line plot with the following properties:

- Title: "Absorbance as a function of protein concentration".
- Axis labels with clearly indicated quantities and units.
- Datapoints and connector lines should be visible and have a blue color.
- Y-axis lower limit should be 0 and the upper limit should be 1.
- The plot character (symbol) should be a filled circle. Use `col = rgb(0, 0, 1, 0.5)` to get transparent blue plot symbols.

**F** Generate a new dataframe from the dataframe `dilution`, now with the duplo measurements side-by-side. So, instead of 2 you will now have 4 columns.

- First generate a dataframe `even` using the even-numbered rows.
- Then generate a dataframe `odd` using the odd-numbered rows.
- Finally combine these two new dataframes into one and assign it to the variable `dilution_duplo`.

Hint: use subsetting with TRUE and FALSE and vector recycling; see the sketch after part G.

**G** The result will be a dataframe with two columns named `prot_conc` and two columns named `absorption`. Delete the second column named `prot_conc` (this is an exact copy after all!) and rename the columns to `prot_conc`, `abs1` and `abs2`.
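
In case the recycling hint in part F is unclear, here is a minimal sketch (using a made-up dataframe `toy`, not the exercise data) showing how a short logical vector is recycled over the rows of a dataframe when subsetting:

```{r recycling-demo}
## a toy dataframe with six rows (made-up values)
toy <- data.frame(id = 1:6, value = c(10, 11, 20, 21, 30, 31))

## c(TRUE, FALSE) is recycled over the rows: rows 1, 3 and 5 are selected
toy[c(TRUE, FALSE), ]

## the complement selects rows 2, 4 and 6
toy[c(FALSE, TRUE), ]
```
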
**H** Calculate the mean of the two absorbance measurements and add it as a column named `mean_abs`.

**I** Generate a **bar** plot with the following properties:

- Title: "Absorbance versus protein concentration".
- Axis labels with clearly indicated quantities and units.
- Y-axis lower limit should be 0, upper limit should be 1.
- Plot the mean absorbance on the Y-axis.
- Bars should have a green color.

### HPLC data

The data below are from an educational experiment, "Thymol quantification in Thyme". These data were obtained by a reverse-phase HPLC method, using thymol standards for the calibration curve. Peak areas were calculated using an integrator device. Note that peak areas are dimensionless. The thymol concentration is in μg/ml.

- co = concentration
- pa = peak area

```{r}
co <- c(100, 100, 75, 75, 50, 50, 25, 25, 10, 10, 5, 5)
pa <- c(1969017, 1858012, 1399762, 1449423, 963014, 832137, 467856, 562012, 200123, 145545, 94567, 64752)
```

**A** Add the vectors to a dataframe named `hplc_data`. Use appropriate and clear column names.

**B** The concentrations are in descending order. Use `order()` to sort the dataframe so that the concentrations are in ascending order.

**C** Generate a scatter plot with the following settings:

- Title: "Peak area as a function of thymol concentration"
- Axis labels with clearly indicated quantities and units.
- Datapoints should be visible (no connector lines).
- Datapoints should be blue.
- Add a red colored linear regression line with a line width of 2.

### Airquality

The `airquality` dataset is also one of the datasets included in the `datasets` package. We'll explore this for a few questions.

**A** Create a scatterplot of Temperature as a function of Solar radiation. Is there, as you might naively expect, a strong correlation? You could use `cor.test()` to find out. Add a linear model using `lm()` to extend your plot.

**B** Create a boxplot series of `Temp` as a function of `Month` (use `?boxplot` to find out how this works). What appears to be the warmest month?

**C** What date (day/month) has the lowest recorded temperature? Which the highest? Please give temperature values in Celsius, not Fahrenheit! (Yes, this is an extra challenge!)

**D** Create a histogram of the wind speeds, and add a thick blue vertical line for the value of the mean and a thick red line for the median (use `abline()` for this).

**E** Use the `pairs()` function with argument `panel = panel.smooth` to plot all pairwise correlations between Ozone, Solar radiation, Wind and Temperature. Which pair shows the strongest correlation in your opinion? Verify this using the `cor()` function after removing incomplete cases. Create a separate, well annotated scatterplot of this pair.

[**Challenge**] Through the `summary()` function you can obtain the R-squared value (the strength of the correlation) like this: `summary(your_linear_model)$r.squared`. Can you place it within the plot using the `text()` function?

### File reading practice

The files in [this folder on GitHub](https://github.com/MichielNoback/davur1_gitbook/tree/master/data/file_reading) contain some (simulated) gene-array data. The dataset contains a selection of induced transcripts after some stimulus.

The columns represent:

- the mRNA entry
- fold induction after some stimulus
- the protein entry number
- protein length in number of amino acids of the corresponding NP entry
- the protein family the protein belongs to
- the predicted cellular localization

Whatever the contents of a file, you always need to address (some of) these questions:

- Are there comment lines at the top?
- Is there a header line with column names?
- What is the column separator?
- Are there quotes around character data?
- How are missing values encoded?
- How are numeric values encoded?
- What is the type in each column?

Also: read the help for the `read.table` function carefully.

Read the content of each file in the `file_reading` directory into a dataframe object and assign it to the variable name `df`. There are 15 different files to practice your file loading skills. You don't need to download them manually; simply use this:

```{r eval = FALSE}
download.file(url = "https://raw.githubusercontent.com/MichielNoback/davur1_gitbook/master/data/file_reading/file01.txt",
              destfile = "file01.txt")
```

Then you'll have the file in the current working directory. Replace the file number as needed.

### Bird observations

You will explore a bird observation dataset, downloaded from the [GOLDEN GATE AUDUBON SOCIETY](http://goldengateaudubon.org/birding-resources/observations/). This file lists bird observations collected by this bird monitoring group in the San Francisco Bay Area. I already cleaned it a bit and placed it here: [data/Observations-Data-2014.csv](data/Observations-Data-2014.csv). You can download it as follows:

```{r eval = FALSE}
file_name <- "Observations-Data-2014.csv"
remote_url <- paste0("https://raw.githubusercontent.com/MichielNoback/davur1_gitbook/master/data/", file_name)
download.file(url = remote_url, destfile = file_name)
```

Load the observation data into R and assign it to a variable called `bird_obs`. From here on, it is assumed that you have the dataframe `bird_obs` loaded. This series of exercises deals with cleaning and transforming data, and exploring a cleaned dataset using basic plotting techniques and descriptive statistics.

**A** First, explore the raw data as they are.

- What data on bird observations were recorded (i.e. what kind of variables do you have)?
- What did R do to the original column names?
- Are all column names clear to you?

**B** How many bird observations were recorded?

**C** The column holding the observation "Number" is actually not a number. Into what type has R converted it?

**D** Convert the "Number" column into an integer column using `as.integer()`, but assign it to a new column called "Count" (i.e. do not overwrite the original values). Compare the first 50 values or so of these two columns. What happened to the data? Is this OK?

**E** The previous question has shown that converting factors to numbers is a bit dangerous. It is often easiest to convert characters to numbers. The best way to do this is by using the `as.is` argument of the `read.table()` function (it accepts a vector of column indices or names). So, which columns should be loaded as real factor data and which as plain character data? Use `read.table()` and the `as.is` argument to reload the data, and then transform the `Number` column to integer again as `Count`.

**F** Compare the first 50 values of the Number and Count columns again. Has the conversion succeeded? How many `Number` values could not be transformed into an integer value?
Hint: use `is.na()`.

**G** Explore the sighting counts:

- Report the sighting with the maximum number of birds in a single sighting.
- What is the mean sighting count?
- What is the median of the sighting count?

**H** Is the `Count` variable a normally distributed value? You can use `hist(...)`, `table()` or `plot(density(...))` to explore this further.

**I** Explore the species constitution:

- How many different species were recorded?
- How many genera do they constitute?
- What species from the genus "Puffinus" have been observed?

Hint: use the function `unique()` here.

[**Challenge**] The number of unique "common names" should equal the number of unique "scientific names", since each species has a common name and a scientific name. The scientific name can be generated from Genus and Species. Investigate this. Is it correct? If not, what went wrong?

**J [Challenge]** This is a challenge exercise for those who like to grind their brains! Think of a strategy to "rescue" the NAs that appear after transforming "Number" to "Count". Hint: use `gsub()` or `grep()`.

## Functions

### The `cut()` function I

Suppose we have a gene expression dataset from a gene array experiment. The vector below contains the fold change of genome-wide mRNA expression after exposure of cells to dihydrotestosterone. A value above 1 means an induction of gene expression, a value below 1 means a repression of gene expression. However, we know that the accuracy of the fold change measurements is approximately plus or minus 0.1. This means that we can set the breakpoints as follows:

- range 0 < x < 0.9 is a repression
- range 0.9 <= x < 1.1 is no net change
- range x >= 1.1 is an induction

```{r eval = F}
fold_change <- c(0.10, 8.50, 0.95, 56.08, 0.90, 99.15, 1.00, 0.01, 0.65, 1.10, 2.34, 62.2)
```

**A** Use the `cut()` function to convert the data into a factor that tells you if there is an `induction`, no net change (`no_change`) or a `repression`. Make sure that you get an ordered factor. Assign the factor to the variable `induct_type`.

**B** Create a data frame of both vectors and assign it to `induction`. Use `fold_change` and `induction_type` as column names.

**C** Create a well annotated plot of the induction type that shows the number of genes that are repressed, have no change or are induced upon dihydrotestosterone treatment. Bars should have a blue color. The bars should be named `repressed`, `no change`, `induced`.

### The `cut()` function II

We will use the `cut()` function on a slightly larger dataset with fictitious gene array data. The text file is located [here](https://raw.githubusercontent.com/MichielNoback/davur1_gitbook/master/data/gene_expression_data.txt). You can simply use that as argument to `read.table()`:

```{r eval = F}
remote_location <- "https://raw.githubusercontent.com/MichielNoback/davur1_gitbook/master/data/gene_expression_data.txt"
read.table(remote_location) # add further arguments as needed
```

The columns are:

- transcript_ID: NM_ entry number of the reference mRNA.
- fold_change: the same rules as in the previous exercise for repression, no change and induction apply here.
- family: the protein family that the gene encodes for.
- location: the cellular localization of the gene product.

**A** Load the file and assign it to the variable `gene_array`.

**B** Add a column `induction_repression` to the data frame based on the `fold_change` column. Use the `cut()` function to accomplish this; if `cut()` is still giving you trouble, see the sketch below. The same rules apply for the breakpoints.
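
In case `cut()` itself is the hurdle (it is needed in both this and the previous exercise), the following minimal sketch, on made-up numbers rather than the gene array data, shows how the `breaks`, `labels` and `ordered_result` arguments work together:

```{r cut-demo}
## some made-up measurements
values <- c(0.3, 1.5, 12.0, 7.2, 0.8)

## three bins: (0, 1], (1, 10] and (10, Inf], with readable labels
sizes <- cut(values,
             breaks = c(0, 1, 10, Inf),
             labels = c("small", "medium", "large"),
             ordered_result = TRUE)
sizes
table(sizes)
```
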
**C** Create a well annotated plot of the induction type that shows the number of genes that are repressed, have no change or are induced upon dihydrotestosterone treatment. Bars should be blue. The bars should be named `repressed`, `no change`, `induced`. Make sure that the bars are within the axis borders.

## Regular Expressions

### Restriction enzymes

**A** The restriction enzyme PacI has the recognition sequence "TTAATTAA". Define (at least) three alternative regex patterns that will catch these sites.

**B** The restriction enzyme SfiI has the recognition sequence "GGCCNNNNNGGCC". Define (at least) three alternative regex patterns that will catch these sites.

### Prosite Patterns

**A** The Prosite pattern PS00211 (ABC-transporter-1; https://prosite.expasy.org/PS00211) has the pattern:

"[LIVMFYC]-[SA]-[SAPGLVFYKQH]-G-[DENQMW]-[KRQASPCLIMFW]-[KRNQSTAVM]-[KRACLVM]-[LIVMFYPAN]-{PHY}-[LIVMFW]-[SAGCLIVP]-{FYWHP}-{KRHP}-[LIVMFYWSTA]."

Translate it into a regex pattern. Info on the syntax is here: https://prosite.expasy.org/prosuser.html#conv_pa

**B** The Prosite pattern PS00018 (EF-hand calcium-binding domain; https://prosite.expasy.org/PS00018) has the pattern:

"D-{W}-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-x(2)-[DE]-[LIVMFYW]."

Translate it into a regex pattern.

You could practice more by simply browsing Prosite. Test your pattern by fetching the proteins referred to within the Prosite pattern details page.

### Fasta Headers

The fasta sequence format is a very common sequence file format used in molecular biology. It looks like this (I omitted most of the actual protein sequences for better representation):

```
>gi|21595364|gb|AAH32336.1| FHIT protein [Homo sapiens]
MSFRFGQHLIK...ALRVYFQ
>gi|15215093|gb|AAH12662.1| Fhit protein [Mus musculus]
MSFRFGQHLIK...RVYFQA
>gi|151554847|gb|AAI47994.1| FHIT protein [Bos taurus]
MSFRFGQHLIK...LRVYFQ
```

As you can see there are several distinct elements within the Fasta **_header_**, which is the description line above the actual sequence: one or more database identification strings, a protein description or name, and an organism name. Study the format; we are going to extract some elements from these fasta headers using the `stringr` package. Install it if you don't have it yet. Here is a small example:

```{r stringr-demo}
library(stringr)
hinfII_re <- "GA[GATC]TC"
sequences <- c("GGGAATCC", "TCGATTCGC", "ACGAGTCTA")
str_extract(string = sequences, pattern = hinfII_re)
```

Function `str_extract()` simply extracts the exact match of your regex (shown above). On the other hand, function `str_match()` supports **_group capture_** through enclosing parentheses:

```{r}
phones <- c("+31-6-23415239", "+49-51-55523146", "+31-50-5956566")
phones_re <- "\\+(\\d{2})-(\\d{1,2})" #matching country codes and area codes
matches <- str_match(phones, phones_re)
matches
```

Thus, each set of parentheses will yield a column in the returned matrix. Simply use its column index to get that result set:

```{r}
matches[, 2] ##the country codes
```

Now, given the fasta headers in [./data/fasta_headers.txt](./data/fasta_headers.txt), which you can simply load into a character vector using `readLines()` (see the sketch at the end of this section), extract the following.

**A**

- Extract all complete organism names.
- Extract all species-level organism names (omitting subspecies and strains etc.).

**B** Extract all **_first_** database identifiers. So from this header element `>gi|224017144|gb|EEF75156.1|` you should extract only `gi|224017144`.

**C** Extract all protein names/descriptions.
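
As a reminder of the workflow only (the patterns for A, B and C are up to you), the following sketch reads the headers into a character vector and applies `str_match()` with a toy pattern that merely captures the version digit of the first accession. The file path is an assumption; adjust it to where you saved `fasta_headers.txt`.

```{r fasta-headers-demo, eval = FALSE}
library(stringr)

## read all lines of the file into a character vector
## (path is an assumption; point it to your local copy of fasta_headers.txt)
headers <- readLines("data/fasta_headers.txt")

## toy pattern: capture the single digit between the dot and the "|" of the first accession
toy_re <- "\\.(\\d)\\|"
matches <- str_match(headers, toy_re)
head(matches[, 2]) ## column 1 holds the full match, column 2 the captured group
```
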

## Scripting

This section serves you some exercises that will help you improve your function-writing skills.

### Illegal reproductions

As an exercise, you will re-invent the wheel here for some statistical functions.

#### The mean {-}

Create a function, `my_mean()`, that duplicates the R function `mean()`, i.e. calculates and returns the mean of a vector of numbers, without actually using `mean()`.

#### Standard deviation {-}

Create a function, `my_sd()`, that duplicates the R function `sd()`, i.e. calculates and returns the standard deviation of a vector of numbers, without actually using `sd()`.

#### Maximum {-}

Suppose we have some random numbers:

```{r eval = FALSE}
set.seed(123)
my_nums <- sample(10000, 1000)
```

The use of `set.seed()` ensures that the generated random numbers will always be the same, for everyone who runs it, every time. Use a `for` loop and `if/else` flow control to find the largest number in `my_nums`. Do not use the built-in `max()` or `sort()` functions. Put this in a function with the name `my_max` and demonstrate its use. You may use the built-in `max()` function to verify your result.

#### Find match locations {-}

Create a function that will report the match locations of some value in a given vector. Name it `where_is_it()`. Use only `for()` and `if/else` to get to your solution, no functions except `c()`. So, this snippet should return the given expected result:

```{r eval = FALSE}
x <- c("Pro", "Glu", "Lys", "Pro", "Ser")
where_is_it(x, "Pro")
##expected:
#[1] 1 4
```

#### Median {-}

**[Challenge]** Create a function, `my_median()`, that duplicates the R function `median()`, i.e. calculates and returns the median of a vector of numbers. This is actually a bit harder than you might expect. Hint: use the `sort()` function.

### Count odd and even numbers

Create a function, `count_odd_even()`, that counts the number of even and odd numbers in an input vector. It should return the result as a _named vector_. Do not make use of R base functions.

### Add column compared to mean

Create a function that accepts as input:

- a dataframe
- a column name; this column name should refer to a numeric column

The function should return a copy of this dataframe with an extra column attached. The column with name 'compared_to_mean' should have values "greater" (greater than the mean) or "less" (less than or equal to the mean) when compared to the indicated column. Apply a check to the user's input and stop with an appropriate error message if something is not right (not a dataframe, column does not exist, not a numeric column).

### Interquantile ranges

Create a function that will calculate a custom "interquantile range". The function should accept three arguments: a numeric vector, a lower quantile and an upper quantile. It should return the difference (range) between these two quantile values. The lower quantile should default to 0 and the higher to 1, thus returning `max(x)` minus `min(x)` by default. The function therefore has this "signature":

```{r eval = FALSE}
interquantile_range <- function(x, lower = 0, higher = 1) {}
```

Perform some tests on the arguments to make a robust function: are all arguments numeric? To test your method, you can compare `interquantile_range(some_vector, 0.25, 0.75)` with `IQR(some_vector)`; they should be the same.

### Vector distance

Create a function, `distance(p, q)`, that will calculate and return the Euclidean distance between two vectors of equal length. A numeric vector can be seen as a point in multidimensional space.
Euclidean distance is defined as

$$d(p, q) = \sqrt{\sum_{i = 1}^{n}(q_i-p_i)^2}$$

where _p_ and _q_ are the two vectors and _n_ is the length of the two vectors. You should first perform a check whether the two vectors are of equal length and both of type `numeric` or `integer`. If not, the function should abort with an appropriate error message.

#### Other distance measures {-}

Extend the function of the previous assignment in such a way that a third argument is accepted, `method = `, which defaults to "euclidean". Other possible distance measures are "Manhattan" (same as "city block" and "taxicab") and [**Challenge**] Pearson correlation. Look up the equations for these in Wikipedia or some other place.

### Quadratic equation solver

Write a function that will use the quadratic formula (ABC formula) to solve an equation of the form $ax^2 + bx + c = 0$. Call the function `abc_solver`. Use flow control to perform the following:

- First calculate the discriminant (you should know the equation from your math lessons). The website of Khan Academy (https://www.khanacademy.org/) gives ample review.
- Test if the discriminant is negative. If so, return a _warning_ message that tells the user that the equation cannot be solved.
- Then test if the discriminant is positive. If so, there are two solutions for the equation. Return them in a _named_ vector.
- Then test if the discriminant is 0. If so, there is one solution for the equation. Return the solution as a _named_ vector.

Test your function for the following quadratic models:

- $x^2 + 2x + 3 = 0$ (no solution)
- $3x^2 + 7x + 4 = 0$ (two intersections)
- $2x^2 + 4x + 2 = 0$ (one intersection)

### G/C percentage of DNA

**[Challenge XL]** Create a function, `GC_perc()`, that calculates and returns the GC percentage of a DNA or RNA sequence. Accept as input a sequence and a flag, `strict`, indicating whether characters other than core DNA/RNA (GATUC) are accepted. If `strict = FALSE`, the percentage of other characters should be reported using a `warning()` call. If `strict = TRUE`, the function should terminate with an error message when such characters are present. Use `stop()` for this. `strict` should default to `TRUE`. NOTE: usage of `strict` can complicate things, so start with the core functionality! You can use `strsplit()` or `substr()` to get hold of individual sequence characters.

## Function `apply` and its relatives

In this section you will encounter some exercises revolving around the different flavors of apply.

### Whale selenium

On the course website under [Resources](https://michielnoback.github.io/bincourses/course_contents/davur/resources.html) you will find a link to the file `whale_selenium.txt`. You could download it into your working directory manually or use `download.file()` to obtain it. However, there is a third way to get its contents without actually downloading it as a local copy. You can read it directly using `read.table()` as shown here.

```{r read-remote-file, eval = FALSE}
whale_sel_url <- "https://raw.githubusercontent.com/MichielNoback/davur1/gh-pages/exercises/data/whale_selenium.txt"
whale_selenium <- read.table(whale_sel_url, header = T, row.names = 1)
```

Note: when you are going to load a file many times it is probably better to store a local copy.

**A** Report the means of both columns using `apply()`. (If `apply()` is still unfamiliar, see the sketch below.)
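
If the `MARGIN` argument or passing your own function to `apply()` is still unfamiliar, here is a minimal generic sketch, on a toy matrix rather than the whale data:

```{r apply-demo}
## a toy matrix with 3 rows and 2 columns
m <- matrix(1:6, nrow = 3)

## MARGIN = 1 applies the function over rows, MARGIN = 2 over columns
apply(m, 1, sum)

## any function can be passed, including your own anonymous function
apply(m, 2, function(x) max(x) - min(x))
```
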
**B** Report the standard deviation of both columns, using `apply()`.

**C** Report the standard error of the mean of both columns, using `apply()`. The SEM is calculated as

$$SEM = \frac{sd}{\sqrt{n}}$$

where $sd$ is the sample standard deviation and $n$ is the number of measurements. You should create the function calculating this statistic yourself.

**D** Using `apply()`, calculate the ratio of $Se_{tooth} / Se_{liver}$ and attach it to the `whale_selenium` dataframe as column `ratio`. Create a histogram of this ratio.

**E** Using `print()` and `paste()`, report the mean and the standard deviation of the ratio column, but do this with an inline expression, i.e. an expression embedded in the R Markdown paragraph text.

### ChickWeight

This exercise revolves around the `ChickWeight` dataset of the built-in `datasets` package.

**A** Report the number of chickens used in the experiment.

**B** Use `aggregate()` to get the mean weight of the chickens for the different Diets.

**C** Use `coplot()` to plot a panel with weight as a function of Time, split over Diet.

**D** [**Challenge**] Add a column called `weight_gain` to the dataframe holding values for the weight gain since the last measurement. Take special care with rows marking the boundaries between individual chickens! You could consider using a traditional for loop here. In the next course, we'll see a more efficient way of doing this.

**E** Split the `weight_gain` column on Diet and report the mean, median and standard deviation for each diet. If you were not successful in the previous question, load and attach the data from the file `ChickWeight_weight_gain.Rdata`, downloadable from [https://github.com/MichielNoback/davur1_gitbook/raw/master/data/ChickWeight_weight_gain.Rdata](https://github.com/MichielNoback/davur1_gitbook/raw/master/data/ChickWeight_weight_gain.Rdata). You can use this code chunk for downloading and loading the data into the variable `stored_weight_gain`. Don't forget to attach the column to the data frame!

```{r eval = F}
local_file <- "ChickWeight_weight_gain.Rdata"
download.file(paste0("https://github.com/MichielNoback/davur1_gitbook/raw/master/data/", local_file),
              local_file)
load(local_file)
```

**F** Create a (single-panel) boxplot for weight gain, split over Diet. Hint: read the `boxplot()` help page!

### Food constituents

The [food constituents dataset](https://raw.githubusercontent.com/MichielNoback/davur1_gitbook/master/data/food_constituents.txt) holds information on ingredients for different foods. Individual foods are simply marked with an id.

**A** Load the data and report the different food categories (`Type`). Also report the numbers of entries for each Type.

**B** What is the mean energy content of chocolate foods?

**C** What is the food category with the highest mean fat content?

**D** What food category has the highest mean energy content, and which has the lowest?

**E** **[Challenge]** Create a boxplot showing the difference in sugar content between drink and solid food.

**F** Assuming both saturated fats and sugar are bad for you, what food category do you consider the worst? Think of a means to answer this, explain it and carry it out.

### Urine properties

The [urine specimens dataset](https://github.com/MichielNoback/datasets/tree/master/urine) contains a `readme.txt` file and a data file (`urine.csv`; direct link: https://raw.githubusercontent.com/MichielNoback/datasets/master/urine/urine.csv). Study the readme to get an idea of the data.
Download the file like this:

```{r download-urine-data-1, eval = FALSE}
urine_file_name <- "urine.csv"
url <- paste0("https://raw.githubusercontent.com/MichielNoback/datasets/master/urine/", urine_file_name)
local_name <- paste0("../", urine_file_name) #specify your own folder!
download.file(url = url, destfile = local_name)
```

**A** Load the data into a dataframe with the name `urine`.

**B** Convert the column `r` into a factor with two levels, `yes` and `no`, and give it a better name: `ox_crystals`.

**C** Using `apply()`, report the mean and standard deviation of the numeric columns only. Give these with only two decimal digits. Use a _named vector_ so you get this output:

```
     gravity   ph   osmo  cond   urea calc
mean    1.02 6.03 615.04 20.90 266.41 4.14
sd      0.01 0.72 238.25  7.95 131.25 3.26
```

**D** Using `aggregate()`, report the mean of the numeric columns, but split over the `ox_crystals` variable levels. Which variables seem the most likely candidates for a relation with oxalate crystal formation?

**E** Use the `pairs()` plotting function to explore the pairwise relationships between the different numeric variables.

**F** Use the `heatmap.2()` and `cor()` functions together to create a heatmap of pairwise variable correlations. You will need to install the package `gplots` for the `heatmap.2()` function. Alternatively, use `heatmap()` from base R. Which visualization do you prefer: the `pairs()` or the `heatmap.2()`?

**G** [**Challenge**] There does not seem to be an interesting correlation between pH and any of the other variables. Sometimes this is because you are not looking in enough detail. Let's dig a little further. Use `cut()` to split the pH variable into a factor with three levels, "acidic", "neutral" and "basic", with breakpoints between them at pH 5.5 and 7. Next, use the `split()` and `lapply()` functions to calculate the mean, median and sd of the other numeric variables for each level of this pH factor. This exercise can of course be done in many ways. One of them is by using an `apply()` _within_ the applied function of `lapply()`.

### Bird observations revisited

This exercise revisits the bird observations dataset. You can download it [here](data/Observations-Data-2014.csv). (Re)load the dataset.

**A** Report the number of observations per `County`. Use both a textual and a barplot representation. With the barplot, you should order the bars according to observation numbers.

**B** Report the number of observations per `Observer.1`, but only for observers with more than 10 observations, ordered from high to low observation count. Use `order()` to achieve this.

**C** Which observer has the highest number of observations listed (and how many is that)?

**D** Report the different observed species (using `Common.name`) for each genus. **[Challenge]** Report only the 5 genera with the highest number of observed species.

**E** **[Challenge]** Create a dataframe holding the number of birds per day (use `Date.start`) and plot it with date on the x-axis and number of birds on the y-axis. Hint: use `as.Date()` to convert the character date to a real date field; a minimal sketch follows below. See this page on how you can do that: [Date Values](http://www.statmethods.net/input/dates.html).
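
If `as.Date()` is new to you, here is a minimal generic sketch. The dates and the `"%m/%d/%Y"` format string are made up for illustration; check which format the `Date.start` column actually uses before converting.

```{r as-date-demo}
## made-up character dates in month/day/year format
date_strings <- c("1/15/2014", "2/3/2014", "2/3/2014")

## convert to real Date objects by describing the input format
real_dates <- as.Date(date_strings, format = "%m/%d/%Y")
real_dates
class(real_dates)

## Date objects can be tabulated, sorted and used on a plot axis
table(real_dates)
```
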