### TUTORIAL #1 -- LOADING DATA AND PRE-PROCESSING IT ######################### # The goal of the first tutorial is to learn how to important Affymetrix data # into R and to pre-process it appropriately. We are going to be working with # dataset from the largest repository of microarray data, which is the Gene # Expression Omnibus (GEO) repository hosted by NCBI. # # We are going to look at a study of the response of Burkitt's lymphoma cells # to perturbation of the mRNA abundances of two oncogenes: Foxm1 and Myb. This # study was selected because it is a small cancer dataset that allows for both # univariate and multivariate statistical analyses. The dataset is small in that # it contains only 3 experimental conditions, each with three technical # replicates, for a total of 9 arrays. Further, it uses the relatively old U95Av2 # Affymetrix array, which assays about 12,000 transcripts. Thus there will be # approximately 100,000 data-points in this dataset. # # All GEO datasets are given an accession number, and the accession for the # dataset we will be using is GSE17172. The main GEO page for any given dataset # has a lot of information about the study. For this dataset it is at: # http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE17172 # # Normally you would download the supplementary file containing the raw data. # This file will be named: GSE39277_RAW.tar # You can decompress this file using an archive-manager (e.g. winzip or 7zip), # and subsequently decompress the .CEL files it contains. However in this case # for the purposes of easing the tutorial, we have decompressed and uploaded all # the data to the CBW wiki: # bioinformatics.ca/workshop_wiki/index.php/Bioinformatics_for_Cancer_Genomics_2015_Workshop_Wiki # # There are 9 files to download. I recommend storing these files in a new directory # on your computer, and keeping them separate from the scripts you will develop to # analyze this dataset. Strict separation of data and code is a fundamental principle # of good software-engineering that will make your life much, much easier over time! ### Question #1 -- Loading Data ############################################### # Use ReadAffy to load the data into an R object. # You should have the affy library installed on your computer as part of your # BioConductor installation. You can very this by typing: library(affy); # If this does not succeed, you can install the package by typing: source("http://bioconductor.org/biocLite.R"); biocLite(); # This will install a series of BioConductor "core" packages # Now, you will want to learn how to install the data. The function to do this is: ?ReadAffy; # You will also want to consult the affy package primer on how to use this function. # On machines with functional PDF processing you can view the primer by typing: vignette('affy'); # Sometimes the vignette function fails on machines with improperly-configured PDF # readers. The vignette document is still available in your R program directory, # under the library/affy/ subfolder. ### Question #2 -- Pre-Processing Data ######################################## # A) Use the expresso() function to pre-process the data with the RMA method. # # Next, you will want to pre-process the data. There are a very large number of # possible pre-processing methodologies. The affy package provides access to # these through the function: ?expresso # Use this function to pre-process the data in two different ways: according to # the RMA model: # background correction RMA # normalization method quantile # pm correction method pmonly # summary method median polish # B) Re-preprocess the data using an alternative CDF # When you initially loaded the CEL files from it into R, it is likely that your R # session automatically went to download the default Affymetrix CDF file. This file # was probably taken from the BioConductor web-site directly # Fortunately ReadAffy() can easily accept alternative CDF files (and thus # alternative Probe mappings) using the cdfname argument. Your first challenge # is to re-run the analyses from the first tutorial using an alternative # mapping. To get the mapping you will go to: # http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/CDF_download.asp # You will need to download the latest Entrez Gene ID CDFs for this microarray # platform (U95Av2)and install them into R (using standard package-installation # techniques -- ask for help if you get stuck here). # Once you have the package installed, re-load the data using ?ReadAffy and then # repreprocess with the alternative CDF (again using RMA). ### Question #3 -- Assess data quality ######################################## # A) Make histograms or density curves of the raw and pre-processed data # B) Make a heatmap of the correlation matrix from the entire experiment # Once your data has been pre-processed you will want to determine if it is # of sufficient quality to actually do any meaningful analysis. There are a # large number of EDA (exploratory data analysis) plots that can be constructed # to help in this, and several are already implemented in R/BioConductor, mostly # in the affy package. # Many of the QA/QC plots you might wish to make are directly available from # within the affy package itself. To make histograms of the raw data you can # simply use hist on an object of class Affybatch. For example raw.data <- ReadAffy(); hist(raw.data); # To make density curves of the normalized data: eset.rma <- expresso(raw.data, ...); ?plotDensity # And to make a heatmap of the inter-sample correlations you will need to: # a) use ?cor to generate a n x n correlation matrix (n = the sample-number) # b) use ?heatmap to create a mapping of samples # Consider changing the default scale parameter in heatmap -- can you see why? ### Question #4 -- Univariate statistical analysis ############################ # A) assess differential expression for each gene in the Foxm1 vs. NT comparison # B) assess differential expression for each gene in the Myb vs. NT comparison # # Once we are convinced of data-quality, the next step is to do a detailed # statistical analysis to assess which genes are being changed in this experiment. # This study involves knock-down of two genes via siRNA, with a non-specific siRNA # (non-targeting, NT) used as a control. There are some interesting multivariate # statistical approaches viable here, but we can start with some straight-forward # univariate analyses. # # The first step to doing this is to annotate the Affymetrix data with the # sample classifications. This can be done using the *phenodata* slot. To # achieve this you will want to create a tab-delimited text file that identifies # the filename of each sample (first column) and which experimental group it # belongs to (second column). You can read this phenodata directly into R # by using the phenoData parameter in ?ReadAffy # Next, you will want to use standard univariate statistical techniques. It is # controversial whether a parametric approach (i.e. a student's t-test): ?t.test # or a non-parametric test (i.e. Wilcoxon rank-sum test; u-test): ?wilcox.test # is more appropriate. With very low n studies, in general a t-test will be superior # You can use a for-loop to run a t-test or a u-test on each ProbeSet. Save these # values as a vector for later use. Repeat this procedure for both comparisons. # C) adjust for multiple-testing # Any experiment of this magnitude will generate a large number of false positives by # chance alone. If we select a 5% significance threshold and use a microarray platform # containing 10,000 genes, then we will have 0.05 x 10000 = 500 false-positives by # chance alone. # To control for this large effect, we can use a multiple-testing adjustment. There # are several types, but by far the most common for genomics studies is the false- # discovery rate adjustment (FDR). FDR adjustment slightly changes the interpretation # of a p-value: instead of giving the chance of a false-positive (i.e. of type I # error) on a single test, an FDR-adjusted p-value gives the fraction of all # statistically-significant tests that will be false-positives (i.e. it gives the # type I error rate amongst statistically-significant hits). # It is easy to adjust for multiple-testing using the function: ?p.adjust # Create a plot comparing the naive and the multiple-testing-adjusted p-values. # Create a histogram showing the distributions of naive & adjusted p-values. # D) repeat these analyses for both default and alternative CDFs ### Question #5 -- Compare the two experimental conditions #################### # Another experimental question one might ask is: how similar are the genes # affected by knock-down of Myb to those affected by knock-down of Foxm1. There # are two basic classes of techniques to look at this. # A) Make plot(s) to compare and contrast all p-values between the two conditions # The most direct way to compare these two experiments is simply to plot the # p-values from one study against the p-values from the other (i.e. a scatterplot). # The R function ?plot can do this. # # However a plot like this can miss the most interesting genes, which will be # clustered around the origin (i.e. close to p = 0). One solution to this problem # is to plot the data in log10-space rather than in normal-space. # Compare the two plots: which is clearer? Do they give different information? # B) Make plot(s) to compare statistically-significant hits between the conditions # The other basic approach is to compare only those genes that are statistically # significant. If a gene does not have a reasonably small p-value in at least # one of the two experiments, then it is unlikely to be of significant interest # in follow-up studies. It is therefore biologically reasonable to focus only on # statistically-significant results. # # The most common plot for doing that comparison is a Venn diagram. There are a # few different Venn diagram packages in R, and we will use the VennDiagram package # here. If necessary install the package and then load it using: library(VennDiagram); # You can then directly compare two vectors using: vector1 <- 1:150; vector2 <- 121:170; venn.diagram( list( A = vector1, B = vector2 ), filename = "Venn_test.tiff" ); # Make Venn diagrams that compare genes that are statistically significant at a # reasonable range of p-value thresholds. ### Question #6 -- Compare the experiments using effect size ################## # A) Calculate fold-changes for every gene in both experiments. # So far your analyses have only looked at statistical significance (p-values, # adjusted or not) as a measure of how interesting a gene is. However a gene # could conceivably be upregulated by Myb and down-regulated by Foxm1. Genes # showing this type of divergent behaviour are being treated the same as genes # showing convergent behaviour. We therefore need to consider effect-size. # # In statistics, looking at raw p-values is never sufficient. Instead one always # needs to assess how large the effect being studied is -- is it sufficiently # large to be of interest? With sufficient replication even very small effects # can become statistically significant, but this does not necessarily make them # biologically or clinically important. # # The most common measure of effect-size is the fold-change. Because RMA data is # in log-space, fold-changes are simply calculated as: # mean(treated-samples) - mean(control-samples) # If we were in normal-space this would instead be: # mean(treated-samples) / mean(control-samples) # # Calculate a fold-change for every gene in each experiment. Look at the distribution # of effect-sizes using a histogram (?hist): can you see the advantage of log- # transforming the data? # # B) Create volcano plots # The most common way to visualize the relationship between effect-size and # statistical significance is called a volcano plot. This is a scatter-plot with # the -log10 of the p-value on the y-axis (i.e. a measure of statistical # significance, with larger values being more significant) and the effect-size # (in our case fold-change) on the x-axis. Create this plot for each experiment. # # C) Compare effect-sizes between experiments # In question #5 you saw how to compare either all genes or only statistically- # significant ones between the experiments. You can now do the same using fold- # change cutoffs, or using combined cutoffs. For example, try this comparison: # genes with effect-sizes of at least 25% and unadjusted p-values below 0.05 ### Question #7 -- Compare the experiments using effect size ################## # So far all of our analysis has been done using the RMA pre-processing # technique. What happens if you repeat these studies using MAS5? How much do # the specific results change? # A) Repeat pre-processing with MAS5 using the ?expresso function again # B) Compare the results of the two methods overall # To compare the results of the two methods you will want to get access to the # "expression matrix". This can be accessed using the function: ?exprs # You will notice that the MAS5 data is in normal-space and the RMA data is in # log2-space, so you will want to bring them to a common-space for plotting ?log2 # You will also want to see how correlated the results of the two analyses are. # Calculate the correlation between the MAS5 and RMA preprocessed results # C) How many genes change ordering between the two pre-processing techniques? # How many of the ProbeSets on the array change ordering between the two pre- # processing methods. That is, ask if the ranks of samples are the same between # the two expression matrices for each row, and count the number of rows that are # altered. This can be easily visualized with a histogram. # This turns out to being equivalent to asking if the Spearman correlation is # exactly 1.0, so this question can be reduced to calculating that correlation # between every row of the two matrices. The key function is: ?cor # D) Repeat your univariate statistical analysis with MAS5-processed data # Do you get similar or different gene-lists using MAS5-processed data instead # of RMA processed data. Repeat the gene-list comparison techniques (p-values, # effect-sizes, and Venn diagrams) you learned above to compare the two # methods. # # Does it matter significantly how the data is pre-processed?