{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Machine learning and metagenomics to study microbial communities\n", "\n", "## Luis Pedro Coelho\n", "\n", "- [@luispedrocoelho](https://twitter.com/luispedrocoelho)\n", "- http://luispedro.org" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Microbial communities\n", "\n", "_Definition_: A microbial community is just many small organisms living together.\n", "\n", "Most often, we study _bacteria_ and _archea_ (prokaryotes, but some people do not like that word as it's not a monophyletic group).\n", "\n", "Less often, microeukaryotes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Q](Q.png) What are the (wet-)lab tools to analyse microbial communities?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Wetlab methods\n", "\n", "- Sequencing\n", " - WGS (metagenomics)\n", " - Amplicon (16S, typically; but other genes too)\n", " - Transcriptomics\n", "- Imaging\n", "- Metaproteomics\n", "- ...\n", "\n", "The biggest reason why we study prokaryotes more than eukaryotes is that it's easier." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Which environments are typically analysed?\n", "\n", "- Host associated microbiomes (with clinical implications)\n", "- Environmental samples (ocean, lakes, soil)\n", "- Wastewater treatment plants\n", "- Cheese\n", "- ..." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Introduction to metagenomics analysis\n", "\n", "(Most of this also applies to amplicon and metatranscriptomics sequencing)\n", "\n", "1. Collect sample\n", "2. Extract DNA (or RNA)\n", "3. Perhaps more manipulations (amplicon, RNA -> cDNA, ...)\n", "4. Send sample to sequencing centre\n", "5. Sequencing happens\n", "6. You get FASTQ files" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "#### Alternative pipeline\n", "\n", "1. Download FASTQ files from ENA (EBI) or SRA" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Raw metagenomics data (FastQ Files)\n", "\n", "- We cannot process metagenomics data in a 1-day course: too much computation, too many tools to introduce; not enough time.\n", "- We can, however, do some functional data analysis if we start with preprocessed data.\n", "- Still, let's discuss what can be done with these data and how they are processed in general." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# What does a FASTQ file look like?\n", "\n", "\n", " @HWI-D00647:25:C4P50ANXX:1:1101:1914:1994:0:N:0:AGGCAGAAAAGGCTAT\n", " TAAAAGCATCAAAGTTTTTATAAATTTTTTATCCAACTTAAACATTTTTTTCCTCCTAA\n", " +\n", " <>BGGGGGCGGGGGGGGGGG>GGGGGGGGGGGEGGGGEGGGGGGGGGGGGGG>GGBGGG\n", " @HWI-D00647:25:C4P50ANXX:1:1101:4595:1998:0:N:0:AGGCAGAAAAGGCTAT\n", " CTCTATCGCACGCTCCAGGCGCGCTATCCCCAGGCGATAATCGATGTGATGGCACCGGC\n", " +\n", " :>BGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGFGGGGGGGGGGGGGGGG\n", "\n", "![Q](Q.png)\n", "How would you process these data?" 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Gene catalog approach\n", "\n", "![Gene catalog](./MOCAT_TARA.png)\n", "\n", "\n", "\n", "Tools for this process:\n", "\n", "- [MOCAT](http://vm-lux.embl.de/~kultima/MOCAT/)\n", "- [ngless](https://ngless.rtfd.io)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Existing catalogs\n", "\n", "- Ocean\n", "- Host-associated: human gut, mouse gut, pig gut, human skin\n", "- ...\n", "\n", "More will come in the future." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Outputs of metagenomics processing\n", "\n", "Feature tables: (sample x feature) \n", "\n", "- taxonomic: features are species abundances or genus abundance ...\n", "- functional: features are gene groups, gene families\n", "\n", "\n", "Note: all of these are **relative abundances**. Sequencing can never give you absolute abundances by itself.\n", "\n", "\n", "It is the feature tables that are the input to the machine learning and other analyses." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "# Problems\n", "\n", "- [supervised] Colorectal cancer detection from fecal samples\n", "- [unsupervised] Automated inference of species existence (mOTUs/CAGs/...)\n", "- [supervised/unsupervised] As a **scientific tool** for relationship inferrence\n", "- ...\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "# Colorectal Cancer (CRC) Detection\n", "\n", "This section is based on work from ([Zeller et al., 2014](http://onlinelibrary.wiley.com/doi/10.15252/msb.20145645/abstract)).\n", "\n", "Problem:\n", "\n", "> Can we use the microbiome present in a fecal sample as a test for CRC?\n", "\n", "Background:\n", "\n", "- CRC has a low survival rate if detected late, but high if detected early enough\n", "- Current tests are either invasive and uncomfortable (colonoscopy) or weak in power (occult blood tests).\n", "- Some in vitro research had already hinted at a possible microbiome involvement" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Classification problem\n", "\n", "1. Get samples from healthy individuals and patients\n", "2. Generate data from both groups\n", "3. Build a classifier" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Practical I\n", "\n", "(These slides are adapted from Georg Zeller's work)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mechanics of Jupyter notebooks\n", "\n", "- You can run this *cell* by typing CTRL+ENTER or from the menu.\n", "- `?` after an object/function for help\n", "- *Restart* when stuck" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Hello World\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can run the cells out of order, but **variables** are preserved.\n", "\n", "Another way of saying it is that you are manipulating an environment underneath" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "name = \"Luis\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Hello {}\".format(name))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "name = 'Anna'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some basic imports\n", "\n", "These are general imports for data analysis in Python" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from matplotlib import pyplot as plt\n", "from matplotlib import style\n", "style.use('seaborn-white')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A magic command to make plots appear *inline*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "% matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import metrics\n", "from sklearn import linear_model\n", "from sklearn import cross_validation\n", "from sklearn import ensemble" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting the data\n", "\n", "In this case, the data is already in the form of feature tables.\n", "\n", "How to obtain these feature table from raw data is outside the scope of this tutorial. You can, however, read a tutorial on how to do so on the [NGLess website](http://ngless.embl.de/_static/gut-metagenomics-tutorial-presentation/gut_specI_tutorial.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features = pd.read_table('http://www.bork.embl.de/~zeller/mlbc_data/FR-CRC_N114_feat_specIclusters.tsv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(features.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Feature normalization\n", "\n", "Here is what we do:\n", "\n", "1. Convert to relative abundances\n", "2. Remove low abundance features\n", "3. Log transform the features\n", "4. Whiten the data\n", "\n", "![Q](Q.png) **Why?**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Why convert to relative abundances?\n", "\n", "Because the **number of sequences you obtain does not depend on any property of the sample**. It is purely technical." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features.sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features /= features.sum()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Why remove low abundance features?\n", "\n", "1. Removing some features is good for the classification (dimensionality reduction). **We are also doing this without looking at the classes** (i.e., unsupervised).\n", "2. Low abundance features are noisier (when you are closer to the detection limit, you get to the Law of Small Numbers: _small groups have large variance_)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "min_abundance = 1e-3\n", "print(\"Nr features before filtering: {}\".format(features.shape[0]))\n", "features = features[features.max(1) > min_abundance]\n", "print(\"Nr features after filtering: {}\".format(features.shape[0]))\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Unsupervised feature selection alternatives\n", "\n", "- Use average value (mean instead of max) or prevalence (i.e., fraction > 0)\n", "- Use the methods that you learned about yesterday\n", "\n", "It is often a heuristic process in any case. They will give you similar results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Why log transform the features?\n", "\n", "1. Variance normalization.\n", "2. Biologically, we care about **fold changes** not about absolute changes." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Variable normalization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig,ax = plt.subplots()\n", "ax.set_xscale('log')\n", "ax.set_yscale('log')\n", "ax.scatter(features.mean(1),features.var(1))\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Log transforming the data\n", "\n", "![Q](Q.png)\n", "\n", "What is the big issue with log-transforms?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "What to do about zeros in the counts table?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Empirical pseudo-count: use one order of magnitude smaller than the smallest detected value." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "pseudo_count = features.values[features.values > 0].min()/10.\n", "features = np.log10(pseudo_count + features)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "fig,ax = plt.subplots()\n", "ax.scatter(features.mean(1), features.std(1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not, perfect, but look at the Y-axis scale: 0-3 (previously, it was over several orders of magnitude)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Whitening the data\n", "\n", "Finally, we \"whiten\" the data: remove the mean and make the standard deviation be 1." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features = features.T\n", "features = (features - features.mean(0))/features.std(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note we also transposed the matrix to a more natural format." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Now, can we use a classifier to identify cancerous samples?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labels = pd.read_table('http://www.bork.embl.de/~zeller/mlbc_data/FR-CRC_N114_label.tsv',\n", " index_col=0, header=None, squeeze=True)\n", "print(labels)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "raw_features = features\n", "raw_labels = labels\n", "\n", "features = raw_features.values\n", "labels = raw_labels.values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rf = ensemble.RandomForestClassifier(n_estimators=101)\n", "prediction = cross_validation.cross_val_predict(rf, features, labels)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cross_validation.cross_val_score?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(prediction == labels).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Q](Q.png) Actually, we did not need to normalize the features for a random forest! Why? " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Let's try other classifiers\n", "\n", "What about a penalized logistic regression?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf = linear_model.LogisticRegressionCV()\n", "lg_predictions = cross_validation.cross_val_predict(clf, features, labels)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print((lg_predictions == labels).mean())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lg_predictions" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Can we get a soft prediction?\n", "\n", "Instead of just \"the best guess\", often it is useful to " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lg_predictions = np.zeros(len(labels), float)\n", "for train, test in cross_validation.KFold(len(labels), n_folds=3, shuffle=True):\n", " clf.fit(features[train], labels[train])\n", " lg_predictions[test] = clf.predict_proba(features[test]).T[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lg_fpr, lg_tpr,_ = metrics.roc_curve(labels, lg_predictions)\n", "fig,ax = plt.subplots()\n", "ax.plot(lg_tpr, lg_fpr)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Q](Q.png) Do you understand what the curve above is?" 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# ROC Curves\n", "\n", "X-axis: false positive rate, (1 -specificity), (how many objects that are classified as positive are actually negative?)\n", "\n", "Y-axis: sensitivity, recall, true positive rate (how many objects that are positive are indeed classified as positive?)\n", "\n", "## Alternative way to think about it:\n", "\n", "1. Order all the examples by the classifier\n", "2. Start at (0,0)\n", "3. Go down the list of examples as ordered by the classifier\n", " - if true, go up\n", " - if false, go right" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Not all errors are the same: context matters." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Area under the curve" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fpr, tpr,_ = metrics.roc_curve(labels, lg_predictions)\n", "auc_roc = metrics.roc_auc_score(labels, -lg_predictions)\n", "fig,ax = plt.subplots()\n", "ax.plot(tpr, fpr)\n", "print(\"AUC: {}\".format(auc_roc))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Biomarker discovery\n", "\n", "We can use these models for biomarker discovery:\n", "\n", "We use the models to infer which features are important (you probably talked about this yesterday).\n", "\n", "Many classifiers can also output a feature relevance measure." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Lasso is sparse\n", "\n", "\n", "Blackboard diagram:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Lasso in scikit-learn" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "clf = linear_model.LogisticRegressionCV(penalty='l1', solver='liblinear', Cs=100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf.fit(features, labels)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_values = pd.Series(np.abs(clf.coef_).ravel(), index=raw_features.columns)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "feature_values.sort_values(inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig,ax = plt.subplots()\n", "_=ax.plot(clf.coefs_paths_[1].mean(0))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "feature_values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Comparing classifiers\n", "\n", "So, which classifier is better?\n", "\n", "Let's run the same evaluation scheme as before:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "rf = ensemble.RandomForestClassifier(n_estimators=101)\n", "lasso_predictions = np.zeros(len(labels), float)\n", "rf_predictions = np.zeros(len(labels), float)\n", "for train, test in cross_validation.KFold(len(labels), n_folds=3, shuffle=True):\n", " clf.fit(features[train], labels[train])\n", " lasso_predictions[test] = clf.predict_proba(features[test]).T[0]\n", " rf.fit(features[train], labels[train])\n", " rf_predictions[test] = 
rf.predict_proba(features[test]).T[0]\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "fig,ax = plt.subplots()\n", "\n", "lg_fpr, lg_tpr,_ = metrics.roc_curve(labels, lg_predictions)\n", "lg_auc_roc = metrics.roc_auc_score(labels, -lg_predictions)\n", "ax.plot(lg_tpr, lg_fpr)\n", "print(\"Logistic regression (L2) AUC: {}\".format(lg_auc_roc))\n", "\n", "\n", "la_fpr, la_tpr,_ = metrics.roc_curve(labels, lasso_predictions)\n", "la_auc_roc = metrics.roc_auc_score(labels, -lasso_predictions)\n", "ax.plot(la_tpr, la_fpr)\n", "print(\"Logistic regression (L1) AUC: {}\".format(la_auc_roc))\n", "\n", "rf_fpr, rf_tpr,_ = metrics.roc_curve(labels, rf_predictions)\n", "rf_auc_roc = metrics.roc_auc_score(labels, -rf_predictions)\n", "ax.plot(rf_tpr, rf_fpr)\n", "print(\"RF AUC: {}\".format(rf_auc_roc))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# For a classifier, use Random Forests\n", "\n", "Which classifier should I use?\n", "\n", "For small to moderate sized problems, use Random Forests.\n", "\n", "- _Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?_ Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, Dinani Amorim in [JMLR](http://www.jmlr.org/papers/v15/delgado14a.html)\n", "- _I TRIED A BUNCH OF THINGS: THE DANGERS OF UNEXPECTED OVERFITTING IN CLASSIFICATION_ by Michael Skocik, John Collins, Chloe Callahan-Flintoft, Howard Bowman [BioRXiv preprint](http://biorxiv.org/content/early/2016/10/03/078816)\n", "\n", "Other classifiers can have some advantages: Lasso is sparse, for example." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let's continue looking at using the Lasso for biomarker discovery:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Increasing the penalty makes the model sparser\n", "\n", "We can also think of it as \"decreasing the budget\" of the classifier." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "clf = linear_model.LogisticRegression(C=0.05, penalty='l1', solver='liblinear')\n", "clf.fit(features, labels)\n", "feature_values = pd.Series(np.abs(clf.coef_).ravel(), index=raw_features.columns)\n", "feature_values.sort_values(inplace=True)\n", "feature_values" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Lasso is not very stable\n", "\n", "Lasso is _winner takes all_.\n", "\n", "Back to blackboard:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "clf = linear_model.LogisticRegression(C=0.1, penalty='l1', solver='liblinear')\n", "\n", "fvalues = []\n", "for _ in range(100):\n", " selected = (np.random.random(len(features)) < 0.9)\n", " clf.fit(features[selected], labels[selected])\n", " feature_values = pd.Series(np.abs(clf.coef_).ravel(), index=raw_features.columns)\n", " fvalues.append(feature_values)\n", "fvalues = pd.concat(fvalues, axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the code above, we try the Lasso 100 times on different parts of the data (getting ~90% each time)." 
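] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This resampling scheme is a form of _stability selection_. scikit-learn of this vintage ships a ready-made implementation, `RandomizedLogisticRegression` (deprecated and removed in later releases, so treat this cell as a sketch against the old API); the selection frequencies from our manual loop follow below:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Stability selection with the (old) built-in implementation; a sketch\n", "rlr = linear_model.RandomizedLogisticRegression(C=0.1, sample_fraction=0.9,\n", "                                                n_resampling=100)\n", "rlr.fit(features, labels)\n", "# scores_ holds the fraction of resamplings in which each feature was selected\n", "pd.Series(rlr.scores_, index=raw_features.columns).sort_values()"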
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "frac_selected =(fvalues > 0).mean(1)\n", "frac_selected.sort_values(inplace=True)\n", "frac_selected" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Final notes on CRC\n", "\n", "- There is no causality model.\n", " - Maybe the tumour is a good environment for certain microbes\n", " - Maybe certain microbes cause cancer\n", " - Maybe people with subclinical symptoms subtly change their diet\n", " \n", "The two first hypotheses are much more likely, but we cannot rule out the third." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "## The same approach can be applied in other contexts\n", "\n", "Issues are how do you " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Discovery of species using clustering\n", "\n", "Remember we have a jumble of genes and for many species (most in the open environment) we have no reference genomes.\n", "\n", "Genes from the same species should correlate accross samples.\n", "\n", "We can attempt to cluster together the genes from the same species!\n", "\n", "## Many related approaches\n", "\n", "- CAGs\n", "- mOTUs\n", "- ..." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# We have several genes which we don't know where they belong\n", "\n", "\n", "![](metagenomics-species-nbt.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](nbt.2939-F1.jpg)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Subspecies/strain Clustering\n", "\n", "We can go one level deeper and cluster at below species level (subspecies/strain level).\n", "\n", "![](subspecies-3a.png)\n", "\n", "(Costea, Coelho, et al., in review)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![And now for something completely different](https://static1.squarespace.com/static/5342b8d7e4b0cc3fc1bc6079/t/577af0fe197aea2767b508bf/1467674907628/?format=500w)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Tutorial 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the metadata (sheet index 8 is the right one). 
The original data is at [http://ocean-microbiome.embl.de/companion.html](http://ocean-microbiome.embl.de/companion.html).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tara_data = pd.read_excel('http://ocean-microbiome.embl.de/data/OM.CompanionTables.xlsx', sheetname=['Table W1', 'Table W8'], index_col=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "meta = tara_data['Table W8']\n", "meta" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Remove the non-data lines at the bottom:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "meta = meta.select(lambda ix: ix.startswith('TARA'))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Now, we need to do some ID manipulation\n", "\n", "(Let's look at the tables in the supplement first, before we look at this obscure code)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "samples = tara_data['Table W1']\n", "pangea = samples['PANGAEA sample identifier']\n", "meta.index = meta.index.map(dict([(p,t) for t,p in pangea.to_dict().items()]).get)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A lot of bioinformatics is just converting file types and identifiers." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We are going to focus on the surface samples, in the prokaryotic size fraction:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "meta = meta.select(lambda sid: '_SRF_' in sid)\n", "meta = meta.select(lambda sid: '0.22-3' in sid or '0.22-1.6' in sid)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Taxonomic tables" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "raw_mOTUs = pd.read_table('http://ocean-microbiome.embl.de/data/mOTU.linkage-groups.relab.release.tsv',\n", " index_col=0)\n", "mOTUs = raw_mOTUs.copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get just the relevant rows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mOTUs" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mOTUs = mOTUs[meta.index]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mOTUs.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select abundant ones" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mOTUs = mOTUs[mOTUs.max(1) > 1e-2]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(mOTUs > 0).mean().mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transpose" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mOTUs = mOTUs.T" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(mOTUs.shape)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# PCA Plot\n", "\n", "- Principal component analysis (PCA)\n", "- Principal coordinate analysis 
"- Multidimensional scaling (MDS)\n", "- Non-metric (non-Euclidean) multidimensional scaling\n", "\n", "All of these look for a low-dimensional representation that approximately preserves distances. I.e., if $x_i$, $x_j$ are the original (high dimensional) vectors, and $p_i$, $p_j$ are their transformed counterparts (low dimensional), then\n", "\n", "$|x_i - x_j| \\approx |p_i - p_j|$\n", "\n", "PCA is one of the simplest methods, very classical (matrix operations) and very deep (it shows up in many forms). " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "[Wikipedia article on PCA](https://en.wikipedia.org/wiki/Principal_component_analysis)\n", "\n", "![PCA](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/GaussianScatterPCA.svg/330px-GaussianScatterPCA.svg.png)" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# PCA in scikit-learn" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "from sklearn import decomposition\n", "pca = decomposition.PCA(2)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can perform it in a single call:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca_decomp = pca.fit_transform(mOTUs)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig,ax = plt.subplots()\n", "ax.scatter(pca_decomp.T[0], pca_decomp.T[1], s=60)" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Adding metadata to the plot" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regions = samples['Ocean and sea regions (IHO General Sea Areas 1953) [MRGID registered at www.marineregions.com]']\n", "regions = regions.reindex(meta.index)\n", "regions = pd.Categorical(regions)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "COLORS = np.array([\n", "    '#7fc97f',\n", "    '#beaed4',\n", "    '#fdc086',\n", "    '#ffff99',\n", "    '#386cb0',\n", "    '#f0027f',\n", "    '#bf5b17',\n", "    '#666666',\n", "])" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "fig,ax = plt.subplots()\n", "ax.scatter(pca_decomp.T[0], pca_decomp.T[1], c=COLORS[regions.codes], s=60)" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Hmm. We \"forgot\" to log transform.\n", "\n", "What is the largest feature?" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mOTUs.mean()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig,ax = plt.subplots()\n", "ax.scatter(pca_decomp.T[0], mOTUs['unassigned'], c=COLORS[regions.codes], s=60)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "So, yeah, without log-normalization, PC1 is almost completely determined by the single largest feature.\n", "\n", "In our case, this is even worse, because the largest feature is the unassigned fraction!"
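] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One alternative (a sketch of what one _could_ do, not what we do next) is to simply drop the `unassigned` column before the PCA. Below, we instead keep it and log-transform, which also removes its dominance:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch of an alternative: PCA without the 'unassigned' column.\n", "# We leave the mOTUs table itself untouched and continue with the\n", "# log-transform in the next cell.\n", "no_un = mOTUs.drop('unassigned', axis=1)\n", "no_un_decomp = decomposition.PCA(2).fit_transform(no_un)\n", "fig,ax = plt.subplots()\n", "ax.scatter(no_un_decomp.T[0], no_un_decomp.T[1], c=COLORS[regions.codes], s=60)"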
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "pc = mOTUs.values[mOTUs.values > 0].min() / 10.\n", "mOTUs = np.log10(mOTUs + pc)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca_decomp = pca.fit_transform(mOTUs)\n", "fig,ax = plt.subplots()\n", "ax.scatter(pca_decomp.T[0], pca_decomp.T[1], c=COLORS[regions.codes],s=60)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Explained variance in PCA\n", "\n", "It is, in general, impossible to get a perfect low dimensional representation of a high dimensional space" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca_decomp = pca.fit_transform(mOTUs)\n", "fig,ax = plt.subplots()\n", "ax.scatter(pca_decomp.T[0], pca_decomp.T[1], c=COLORS[regions.codes],s=60)\n", "ax.set_xlabel('Explained variance: {:.1%}'.format(pca.explained_variance_ratio_[0]))\n", "ax.set_ylabel('Explained variance: {:.1%}'.format(pca.explained_variance_ratio_[1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You **always want to the explained variance**.\n", "\n", "Ideally, 50% or more should be explained by the two axes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Correlation of temperature with PC1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "fig,ax = plt.subplots()\n", "temperature = meta['Mean_Temperature [deg C]*']\n", "ax.scatter(temperature, pca_decomp.T[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check with standard statistics that this is indeed a strong correlation:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy import stats\n", "stats.spearmanr(temperature, pca_decomp.T[0])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# How about predicting temperature?\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Regression, not classification\n", "\n", "We are predicting a continuous output, not just a single class." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Elastic net regression\n", "\n", "OLS: $\\beta = \\arg\\min | \\beta X - y |^2$\n", "\n", "Lasso: $\\beta = \\arg\\min | \\beta X - y |^2 + \\alpha |\\sum_i\\beta_i|$\n", "\n", "Ridge: $\\beta = \\arg\\min | \\beta X - y |^2 + \\alpha |\\sum_i\\beta_i|^2$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Elastic net: $\\beta = \\arg\\min | \\beta X - y |^2 + \\alpha_1 |\\sum_i\\beta_i| + \\alpha_2 |\\sum_i\\beta_i|^2$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ideally, elastic net is a \"soft Lasso\": still sparse, but not as greedy.\n", "\n", "**If you just want a regression method, _use elastic nets_.**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Feature normalization\n", "\n", "We use much of the same procedure as before, except we used _a different normalization_: rank transform.\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from scipy import stats\n", "ranked = np.array([stats.rankdata(mOTUs.iloc[i]) for i in range(len(mOTUs))])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "predictor = linear_model.ElasticNetCV(n_jobs=4)\n", "\n", "cv = cross_validation.LeaveOneOut(len(temperature))\n", "prediction = cross_validation.cross_val_predict(predictor, ranked, temperature, cv=cv)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig,ax = plt.subplots()\n", "ax.scatter(temperature, prediction)\n", "ax.plot([0,30], [0,30], 'k:')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"R2: {}\".format(metrics.r2_score(temperature, prediction)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# R² as a measure of prediction\n", "\n", "- R²_CV_ or R² or Q² or Coefficient of Determination: nomenclature is not 100% standard.\n", "- It's variance explained.\n", "- Alternatively: it's the improvement over a _null model_.\n", "\n", "It can be negative!\n", "\n", "\n", "$R^2 = 1 - \\frac{\\sum_i | y_i - \\hat{y}_i|^2}{\\sum_i |y_i - \\bar{y}|^2}$\n", "\n", "It can be very sensitive to outliers!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# What about spatial auto-correlation?\n", "\n", "![Q](Q.png) What is spatial auto-correlation?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Q](Q.png) How do we solve the issue of spatial auto-correlation?" 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Cross-validation as a solution to auto-correlations\n", "\n", "- Cross-validation can be a very powerful scientific tool\n", "\n", "http://luispedro.org/files/Coelho2013_Bioinformatics_extra/crossvalidation.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cv = cross_validation.LeaveOneLabelOut(labels=regions)\n", "prediction = cross_validation.cross_val_predict(predictor, ranked, temperature, cv=cv)\n", "print(\"R2: {}\".format(metrics.r2_score(temperature, prediction)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "ax.scatter(temperature, prediction)\n", "ax.plot([0,30], [0,30],'k:')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Regression to the mean in penalized linear models\n", "\n", "Models will often regress to the mean.\n", "\n", "This is OK." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Be very careful with group/batch effects\n", "\n", "- They are pervasive\n", "- They can kill your generalization\n", "- They can trick you into thinking your system works better than it does\n", "\n", "\n", "References\n", "\n", "- _Determining the subcellular location of new proteins from microscope images using local features_ by Coelho et al. (2013) Bioinformatics [DOI: 10.1093/bioinformatics/btt392](http://bioinformatics.oxfordjournals.org/content/29/18/2343.short)\n", "- _Assessing and tuning brain decoders: Cross-validation, caveats, and guidelines_ by Varoquaux et al. NeuroImage (2016) [DOI: 10.1016/j.neuroimage.2016.10.038](http://www.sciencedirect.com/science/article/pii/S105381191630595X)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Problem: can we build a model that can generalize across studies?\n", "\n", "Big issues\n", "\n", "- Not the same technology (Illumina vs 454 and different library preps)\n", "- Not the same sequencing depth\n", "\n", "![Q](Q.png) Suggestions?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Here is what we did:\n", "\n", "1. Downsample (randomly) our data to the GOS depth\n", "2. Only used presences/absence (encoded as 0.0/1.0)\n", "\n", "Worked well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Questions before we switch topics again?" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.14" } }, "nbformat": 4, "nbformat_minor": 1 }