--- title: "Mass spectrometry and proteomics" author: - name: Laurent Gatto affiliation: de Duve Institute, UCLouvain output: BiocStyle::html_document --- # Introduction This tutorial requires `r BiocStyle::Biocpkg("MSnbase")` and `r BiocStyle::Biocexptpkg("msdata")` package. ```{r env, message = FALSE} library("MSnbase") library("msdata") ``` A longer version of the material is available [here](https://rawgit.com/lgatto/bioc-ms-prot/master/lab.html). # Data files in mass spectrometry Most community-driven formats described in the table are supported in `R`. We will see how to read and access these data in the following sections. ```{r datatab, results='asis', echo=FALSE} datatab <- data.frame(Type = c("raw", "identification", "quantitation", "peak lists", "quant and id", "protein db"), Format = c("mzML, mzXML, netCDF, mzData", "mzIdentML", "mzQuantML", "mgf", "mzTab", "fasta"), Package = c( "*[MSnbase](http://bioconductor.org/packages/MSnbase)* (read and write in version >= 2.3.13) via *[mzR](http://bioconductor.org/packages/mzR)*", paste("*[mzID](http://bioconductor.org/packages/mzID)* (read) and", "*[MSnbase](http://bioconductor.org/packages/MSnbase)* (read, via *[mzR](http://bioconductor.org/packages/mzR)*)"), "", "*[MSnbase](http://bioconductor.org/packages/MSnbase)* (read)", "*[MSnbase](http://bioconductor.org/packages/MSnbase)* (read)", "*[Biostrings](http://bioconductor.org/packages/Biostrings)*")) knitr::kable(datatab) ``` # Loading raw data Raw data files (in any of the above formats) is read into R using `MSnbase::readMSData` function that will return an list-like object of class `MSnExp`. ```{r data1} basename(fl3 <- msdata::proteomics(full.name = TRUE, pattern = "MS3TMT11")) (rw3 <- readMSData(fl3, mode = "onDisk")) ``` An `MSnExp` object contains - the raw data (spectra), accessible with `[` and `[[` - features data, accessible as a dataframe with `fData` We can also access specific data using - `msLevel(rw3)` - `rtime(rw3)` - `precursorMz(rw3)` - `centroided(rw3)` - ... ```{r ex1} table(msLevel(rw3), centroided(rw3)) ``` # Loading identification data ```{r id} basename(idf <- msdata::ident(full.name = TRUE)) iddf <- readMzIdData(idf) ``` ```{r fdrplot} library("ggplot2") ggplot(iddf, aes(x = MS.GF.RawScore, colour = isDecoy)) + geom_density() ``` # Combining raw and identification data ```{r addid} basename(quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"), full.name = TRUE, pattern = "mzXML$")) basename(identFile <- dir(system.file(package = "MSnbase", dir = "extdata"), full.name = TRUE, pattern = "dummyiTRAQ.mzid")) msexp <- readMSData(quantFile) fvarLabels(msexp) msexp <- addIdentificationData(msexp, identFile) fvarLabels(msexp) ``` # Quantitative data Quantiative data is stored a objects of class `MSnSet`, which can be ```{r msnset, fig.cap = "The MSnSet structure", echo = FALSE} knitr::include_graphics("./Figures/msnset.png") ``` 1. Created from an `MSnExp` object with the `quantify` function. ```{r quant} data(itraqdata) itraqdata msnset <- quantify(itraqdata, method = "trap", reporters = iTRAQ4) msnset ``` ```{r qvis, fig.cap = "Raw data and quantiative results"} plot(itraqdata[[1]], reporters = iTRAQ4, full = TRUE) exprs(msnset)[1, ] fData(msnset)[1, ] ``` 2. Created from a spreadsheet produced by a third party software using the `readMSnSet2` function. Before reading the spreadsheet, we need to identify which columns contain quantitation data (that will be put into the `exprs` slot) and the feature data (that will be put into the `fData` slot). **Download** the data [here](https://raw.githubusercontent.com/lgatto/bioc-ms-prot/master/data/cptac_peptides.txt). ```{r read} f <- "./data/cptac_peptides.txt" getEcols(f, split = "\t") e <- grepEcols(f, "Intensity ", split = "\t") ## careful at the space! (cptac <- readMSnSet2(f, ecol = e, fnames = "Sequence", sep = "\t")) ``` We can access the expression data with `exprs` and the feature meta-data with `fData`. For the sake of simplicity, we can clean up the feature variables and only keep those of interest. It is possible to do this interactively with ```{r selectfeats0, eval = FALSE} cptac <- selectFeatureData(cptac) ``` or by setting the feature variables of interest. ```{r selectfeats} cptac <- selectFeatureData(cptac, fcol = c("Proteins", "Potential.contaminant", "Reverse", "Sequence")) ``` Let's also add sample annotations: ```{r pdata} cptac$group <- rep(c("6A", "6B"), each = 3) cptac$sample <- rep(7:9, 2) sampleNames(cptac) <- sub("Intensity.", "", sampleNames(cptac)) pData(cptac) ``` # Data processing ## Filtering out contaminants ```{r conts} table(sel_conts <- fData(cptac)$Potential.contaminant != "+") ``` ```{r rev} table(sel_rev <- fData(cptac)$Reverse != "+") ``` ```{r filtering} (cptac <- cptac[sel_conts & sel_rev, ]) ``` Note how the filtering has been recorded in the object's processing log. ## Notes on missing values Unfortunately, some software uses 0 irrespecitve whether the data has intensity zero and when the data haven't been observer. Below we fix this. ```{r setna} exprs(cptac)[exprs(cptac) == 0] <- NA table(is.na(exprs(cptac))) ``` ```{r napac, fig.cap = "Overview of missing data"} napac <- cptac exprs(napac)[!is.na(exprs(napac))] <- 1 naplot(napac) ``` ```{r nna} fData(cptac)$nNA <- apply(exprs(cptac), 1, function(x) sum(is.na(x))) table(fData(cptac)$nNA) ``` (Note that some peptides aren't seen at all because these 6 samples are a subset of a larger dataset, and these features are present in the other acquisitions.) From here on, one could **filter** data with missing values, which however sacrifices a lot of data. ```{r filterNA} (cptac <- filterNA(cptac)) ``` Perform **imputation**, considering the underlying nature of missingness, i.e missing not at random (left-censored) or at random. ![Root-mean-square error (RMSE) observations standard deviation ratio (RSR), KNN and MinDet imputation. Lower (blue) is better.](./Figures/imp-sim.png) See Lazar *et al.* [Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies](http://dx.doi.org/10.1021/acs.jproteome.5b00981). The best solution is arguably to handle missing values at **the statistical test level** (see later). ## Log transformation ```{r log} (cptac <- log(cptac, base = 2)) ``` ## Normalisation Normalisation is handled by the `normalise` (or `normalize`) function. ```{r norm} (cptac <- normalise(cptac, method = "quantiles")) ``` ## Summarisation So far we have quantitation values at the peptide level, while the level of interest are proteins. We can take all peptides that are associated to a protein (or protein group), as defined by the `Proteins` feature variable and aggregate them using a preferred operation. ```{r featsfig, fig.cap = "Aggregation between different levels.", echo = FALSE} knitr::include_graphics("https://rformassspectrometry.github.io/Features/articles/Features_files/figure-html/featuresplot-1.png") ``` ```{r featsfig2, fig.cap = "Examples of aggregations.", echo = FALSE} knitr::include_graphics("https://rformassspectrometry.github.io/Features/articles/Features_files/figure-html/plotstat-1.png") ``` ```{r combine} (cptac <- combineFeatures(cptac, fcol = "Proteins", method = "mean")) ``` **As you will see in the next section, there are much better aggregation function than the mean!** # Session info {.unnumbered} ```{r sessionInfo, echo=FALSE} sessionInfo() ```