{ "metadata": { "name": "", "signature": "sha256:f151df0e6f32f7ad981a4b98eba6fad2277ddfd8540393a65c8288d8dd27c7ed" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Classifying pediatric IBD stool samples (work in progress)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is a recoding of the analysis used in the PLoSONE paper: [Non-Invasive Mapping of the Gastrointestinal Microbiota Identifies Children with Inflammatory Bowel Disease](http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0039242) using python, sklearn and pandas.\n", "\n", "[We](http://almlab.mit.edu) decided that the SLiME package, as it was packaged for the publication of the paper, should not be available anymore. This notebook replaces it, replicating the analysis executed on the paper with more up-to-date tools and (hopefully soon) expanding on its conclusion. \n", "I hope this can be the starting point for others trying to follow the same approach and improve upon it. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from scipy import interp\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.multiclass import OneVsRestClassifier\n", "from sklearn.cross_validation import cross_val_score, StratifiedKFold, train_test_split, KFold\n", "from sklearn.metrics import roc_curve, roc_auc_score, auc\n", "from sklearn.preprocessing import LabelEncoder, label_binarize" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Loading data\n", "The data came from two rounds of 16S sequencing of previously collected stool samples. Here we will use the OTU tables directly, which were created by using the RDP classifier and were subsequently normalized (details in the paper's methods).\n", "\n", "Sequencing was performed at the [Broad Institute](https://www.broadinstitute.org/). The first round of sequencing was dubbed CHIMP (Children Hospital IBD Pediatric), while the second round of sequencing -- performed following the request of an anonymous peer reviewer -- was termed 'blind validation'. Its purpose was to further validate the algorithm trained on the CHIMP dataset, as the reviewer did not think sufficient a \"leave 20% out\" approach on CHIMP was sufficient to demonstrate robust prediction. These were used as training and test set in the last figure of the paper respectively.\n", "\n", "It is useful to join the two data sets here." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#get the CHIMP training data\n", "\n", "X_chimp = pd.read_csv('data/chimp/chimp.Qsorted.rdpout.xtab.norm', delimiter=\"\\t\", index_col=0)\n", "y_chimp = pd.read_csv('data/chimp/sampledata.training.chimp.csv', index_col=0)\n", "\n", "#just make sure the labels are the same\n", "X_chimp.sort_index(inplace=True)\n", "y_chimp.sort_index(inplace=True)\n", "assert (X_chimp.index == y_chimp.index).all()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "## do the same for the blind validation test data\n", "X_blind = pd.read_csv('data/chimp/blind.sorted.rdpout.xtab.norm',\n", " delimiter=\"\\t\", index_col=0)\n", "y_blind = pd.read_csv('data/chimp/sampledata.validation.blind.csv',\n", " index_col=0)\n", "\n", "X_blind.sort_index(inplace=True)\n", "y_blind.sort_index(inplace=True)\n", "assert (X_blind.index == y_blind.index).all()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "#concatenate using pandas\n", "X = pd.concat([X_chimp, X_blind], keys=['chimp','blind'])\n", "X.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", " | \n", " | cls_Actinobacteria | \n", "cls_Alphaproteobacteria | \n", "cls_Bacilli | \n", "cls_Bacteroidia | \n", "cls_Betaproteobacteria | \n", "cls_Clostridia | \n", "cls_Cyanobacteria | \n", "cls_Deltaproteobacteria | \n", "cls_Epsilonproteobacteria | \n", "cls_Erysipelotrichi | \n", "... | \n", "phylum_Euryarchaeota | \n", "phylum_Firmicutes | \n", "phylum_Fusobacteria | \n", "phylum_Lentisphaerae | \n", "phylum_NA | \n", "phylum_Proteobacteria | \n", "phylum_Spirochaetes | \n", "phylum_Synergistetes | \n", "phylum_TM7 | \n", "phylum_Verrucomicrobia | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
chimp | \n", "003A | \n", "0.000000 | \n", "0.000549 | \n", "0.014827 | \n", "0.002197 | \n", "0.000275 | \n", "0.230917 | \n", "0 | \n", "0.000000 | \n", "0 | \n", "0.000000 | \n", "... | \n", "0 | \n", "0.245744 | \n", "0.023064 | \n", "0 | \n", "0.000000 | \n", "0.728995 | \n", "0 | \n", "NaN | \n", "NaN | \n", "0 | \n", "
004A | \n", "0.000000 | \n", "0.000000 | \n", "0.002486 | \n", "0.754195 | \n", "0.000000 | \n", "0.230889 | \n", "0 | \n", "0.000932 | \n", "0 | \n", "0.001865 | \n", "... | \n", "0 | \n", "0.238036 | \n", "0.000000 | \n", "0 | \n", "0.000000 | \n", "0.007147 | \n", "0 | \n", "NaN | \n", "NaN | \n", "0 | \n", "|
005A | \n", "0.006521 | \n", "0.000000 | \n", "0.026084 | \n", "0.000000 | \n", "0.000000 | \n", "0.908706 | \n", "0 | \n", "0.000000 | \n", "0 | \n", "0.054451 | \n", "... | \n", "0 | \n", "0.990545 | \n", "0.000000 | \n", "0 | \n", "0.000326 | \n", "0.002608 | \n", "0 | \n", "NaN | \n", "NaN | \n", "0 | \n", "|
008A | \n", "0.000315 | \n", "0.000000 | \n", "0.000210 | \n", "0.837024 | \n", "0.035995 | \n", "0.112499 | \n", "0 | \n", "0.000000 | \n", "0 | \n", "0.000840 | \n", "... | \n", "0 | \n", "0.113758 | \n", "0.000000 | \n", "0 | \n", "0.000000 | \n", "0.048903 | \n", "0 | \n", "NaN | \n", "NaN | \n", "0 | \n", "|
009A | \n", "0.001291 | \n", "0.000000 | \n", "0.001550 | \n", "0.823864 | \n", "0.024277 | \n", "0.118027 | \n", "0 | \n", "0.000000 | \n", "0 | \n", "0.000000 | \n", "... | \n", "0 | \n", "0.119576 | \n", "0.000000 | \n", "0 | \n", "0.000000 | \n", "0.055269 | \n", "0 | \n", "NaN | \n", "NaN | \n", "0 | \n", "
5 rows \u00d7 284 columns
\n", "