{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tomancak et al Drosophila Data with GPy and Pandas\n", "\n", "#### presented at the EBI BioPreDyn Course 'The Systems Biology Modelling Cycle'\n", "\n", "### Neil Lawrence, 12th May 2014 with help from Marta Milo for the Bioconductor Portion\n", "\n", "#### Updated 18th March 2015 for latest GPy edition.\n", "\n", "### Introduction\n", "\n", "In this demonstration we are going to load pre-processed Affymetrix time series data. The data is pre-processed in R using bioconductor. The results are then loaded into a python for further analysis with a Gaussian process.\n", "\n", "The analysis we are going to do is to check whether any given set of replicates is 'valid'. In other words, were the replicates drawn from the same Gaussian process. To do this we will fit two Gaussian process models, one which assumes that the replicates are independent and one that assumes that they are drawn from the same process. We will evaluate the likelihood of each. The result could be used in a likelihood ratio test.\n", "\n", "### Preparing the Environment\n", "\n", "Firstly we load in the data set by downloading `cel` files if the analyzed files aren't already present on the disk. We load in `rmagic` if that data isn't present to do the preprocessing in the R PUMA package from bioconductor." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n", "import numpy as np\n", "import GPy\n", "import pods\n", "%matplotlib inline\n", "import matplotlib as plt\n", "plt.rcParams['figure.figsize'] = (10.0, 4.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Bioconductor for Processing\n", "\n", "Now we process the `cel` files using the PUMA package from bioconductor in R. This step can take a long time. The R code checks whether the preprocessed results are present and skips the processing of the `cel` files if they are. If you want to run the processing from scratch, delete any processed results stored in `datapath` and set the variable in the next box `process_data_in_R` to `True`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# This downloads cel files if they are not present. \n", "# These cel files would be needed if you want to do \n", "# the full Bioconductor analysis below.\n", "process_data_in_R = False\n", "if process_data_in_R:\n", " %load_ext rmagic\n", " datapath = os.path.join(pods.datasets.data_path, 'fruitfly_tomancak')\n", " # download the original cel files and prepare to process!\n", " data_set = 'fruitfly_tomancak_cel_files'\n", " if not pods.datasets.data_available(data_set):\n", " data = pods.datasets.download_data(data_set)\n", "else:\n", " # download the puma-processed affymetrix data.\n", " data = pods.datasets.fruitfly_tomancak()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This portion of the code will do the PUMA analysis of the gene\n", "expression data in R. To run it needs to have the cel files\n", "from this fruitfly.org FTP server. The code above dowloads all the `cel` files and stores them in the path given by `datapath`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%R -i datapath\n", "returnpath <- getwd()\n", "setwd(datapath)\n", "if(!file.exists(\"tomancak_exprs.csv\")) {\n", " source(\"http://www.bioconductor.org/biocLite.R\")\n", " biocLite(\"puma\")\n", " library(puma)\n", " print(\"Processing data with PUMA\")\n", " expfiles <- c(paste(\"embryo_tc_4_\", 1:12, \".CEL\", sep=\"\"), paste(\"embryo_tc_6_\", 1:12, \".CEL\", sep=\"\"), paste(\"embryo_tc_8_\", 1:12, \".CEL\", sep=\"\"))\n", " library(puma)\n", " drosophila_exp_set <- justmmgMOS(filenames=expfiles, celfile.path=datapath)\n", " pData(drosophila_exp_set) <- data.frame(\"time.h\" = rep(1:12, 3), row.names=rownames(pData(drosophila_exp_set))) \n", " write.reslts(drosophila_exp_set, file='tomancak')\n", "}\n", "else {\n", " print(\"Processed data found on disk.\")\n", "}\n", "setwd(returnpath)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read Gene Expression Data into Pandas Dataframe\n", "\n", "Once the analysis is complete, the results are stored in the file `tomancak_exprs.csv`. To save you doing the full analysis, we can download a pre-processed version of this file using this command." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = pods.datasets.fruitfly_tomancak()\n", "X = data['X']\n", "Y = data['Y']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The gene expression data is now loaded into the python environment. We have made use of `pandas` a python library for handling data structures. It provides us with a `DataFrame` object which gives some functionality similar to that of `R` for basic analysis.\n", "\n", "### Are these Replicates Really Valid?\n", "\n", "Next we are going to write a couple of python functions that allow us to do some simple processing of the gene expression data. The idea will be to see if the data is best modelled through a series of *identical* Gaussian processes, or through series of *independent* Gaussian processes. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def fit_probe(id, independent=False):\n", " \"\"\"Fit a set of probe repeats as either independent or correlated.\"\"\"\n", " \n", " # set up the covariance function.\n", " lengthscale = 2.\n", " if independent:\n", " kern = GPy.kern.RBF(1,lengthscale=lengthscale)**GPy.kern.Coregionalize(1,3, rank=0)\n", " name = 'independent gp'\n", " else:\n", " kern = GPy.kern.RBF(1,lengthscale=lengthscale)**GPy.kern.Coregionalize(1,3, rank=1)\n", " kern.coregion.W[0] = 1.\n", " kern.coregion.W[1] = 1.\n", " kern.coregion.W[2] = 1.\n", " kern.coregion.W.constrain_fixed()\n", " name = 'joint gp'\n", "\n", " kern += GPy.kern.Bias(2)\n", " \n", " m = GPy.models.GPRegression(X, Y[id][:, None], kern)\n", " m.name = name\n", " m.optimize(messages=True)\n", " return m" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'd like to display the model fits. The `GPy` software allows us to select which data to plot from the model. Below there's a function for plotting the data associated with each repliacte alongside the fit." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "from IPython.display import display\n", "\n", "def show_model_fit(m):\n", " display(m)\n", " fig, ax = plt.subplots(1, 3, sharex=False,sharey=True, figsize=(12, 3.5))\n", " symbols = ['x', 'x', 'x']\n", " replicate_str = 'replicate {}'\n", " #linecolors = [(1.,0.,0.), (0., 1., 0.), (0., 0., 1.)]\n", " mi, ma = np.inf, -np.inf\n", " for replicate in range(3):\n", " # Plot the result without noise (trying to estimate underlying gene expression)\n", " which_data_rows=np.nonzero(X[:,1]==replicate) \n", " data_symbol=symbols[replicate]\n", " pl = m.plot_f(ax=ax[replicate], fixed_inputs=[(1, replicate)], \n", " which_data_rows=which_data_rows,\n", " data_symbol=data_symbol,\n", " )\n", " ax[replicate].plot(m.X[which_data_rows, 0].flatten(), \n", " m.Y[which_data_rows].flatten(), \n", " data_symbol, c='k', mew=1.5,\n", " )\n", " ax[replicate].text(.98, .98, replicate_str.format(replicate), ha='right', va='top',\n", " transform=ax[replicate].transAxes)\n", " _mi, _ma = ax[replicate].get_ylim()\n", " if _mi < mi: mi = _mi\n", " if _ma > ma: ma = _ma\n", " for _ax in ax:\n", " _ax.set_ylim(mi, ma)\n", " ax[0].set_ylabel('gene expression [arbitrary]')\n", " ax[1].set_xlabel('time [hrs]')\n", " fig.text(0.5,1, m.name, ha='center', va='top', size=22, \n", " bbox=dict(facecolor='white', edgecolor='k', lw=.4, boxstyle='round'), clip_on=False)\n", " fig.tight_layout(rect=(0,0,1,1))\n", " #GPy.plotting.matplot_dep.base_plots.align_subplots(1, 3, xlim=ax[0].get_xlim(), ylim=ax[0].get_ylim())\n", " \n", "def fit_and_display(probe_id):\n", " mc = fit_probe(probe_id, independent=True)\n", " mi = fit_probe(probe_id, independent=False)\n", " show_model_fit(mc)\n", " show_model_fit(mi)\n", " return mc, mi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's fit each of the two models to a given probe id. First we select a probe_id where the correspondence between the repeats is not very strong (even if it's there)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n",
"Model: independent gp
\n",
"Objective: 71.0084915677
\n",
"Number of Parameters: 7
\n",
"Number of Optimization Parameters: 7
\n",
"Updates: True
\n",
"
independent_gp. | value | constraints | priors |
---|---|---|---|
sum.mul.rbf.variance | 0.481622218702 | +ve | |
sum.mul.rbf.lengthscale | 2.22933167662 | +ve | |
sum.mul.coregion.W | [] | ||
sum.mul.coregion.kappa | (3,) | +ve | |
sum.bias.variance | 13.0169956879 | +ve | |
Gaussian_noise.variance | 2.36743656163 | +ve |
\n",
"Model: joint gp
\n",
"Objective: 68.317333131
\n",
"Number of Parameters: 10
\n",
"Number of Optimization Parameters: 7
\n",
"Updates: True
\n",
"
joint_gp. | value | constraints | priors |
---|---|---|---|
sum.mul.rbf.variance | 1.72364735571 | +ve | |
sum.mul.rbf.lengthscale | 3.33381017053 | +ve | |
sum.mul.coregion.W | (3, 1) | fixed | |
sum.mul.coregion.kappa | (3,) | +ve | |
sum.bias.variance | 15.5420403587 | +ve | |
Gaussian_noise.variance | 1.9462549958 | +ve |
\n",
"Model: independent gp
\n",
"Objective: 56.9718543123
\n",
"Number of Parameters: 7
\n",
"Number of Optimization Parameters: 7
\n",
"Updates: True
\n",
"
independent_gp. | value | constraints | priors |
---|---|---|---|
sum.mul.rbf.variance | 2.09052725197 | +ve | |
sum.mul.rbf.lengthscale | 2.15096358834 | +ve | |
sum.mul.coregion.W | [] | ||
sum.mul.coregion.kappa | (3,) | +ve | |
sum.bias.variance | 47.3972473373 | +ve | |
Gaussian_noise.variance | 0.583374639802 | +ve |
\n",
"Model: joint gp
\n",
"Objective: 49.078643542
\n",
"Number of Parameters: 10
\n",
"Number of Optimization Parameters: 7
\n",
"Updates: True
\n",
"
joint_gp. | value | constraints | priors |
---|---|---|---|
sum.mul.rbf.variance | 1.6733336308 | +ve | |
sum.mul.rbf.lengthscale | 2.25933047866 | +ve | |
sum.mul.coregion.W | (3, 1) | fixed | |
sum.mul.coregion.kappa | (3,) | +ve | |
sum.bias.variance | 47.024207916 | +ve | |
Gaussian_noise.variance | 0.466843134315 | +ve |
\n",
"Model: independent gp
\n",
"Objective: 70.2981190262
\n",
"Number of Parameters: 8
\n",
"Number of Optimization Parameters: 8
\n",
"Updates: True
\n",
"
independent_gp. | value | constraints | priors |
---|---|---|---|
sum.mul.rbf.variance | 9.02815616863e-05 | +ve | |
sum.mul.rbf.lengthscale | 1.23195228302e-15 | +ve | |
sum.mul.coregion.W | [] | ||
sum.mul.coregion.kappa | (3,) | +ve | |
sum.bias.variance | 12.0830497819 | +ve | |
Student_T.t_scale2 | 1.40327776475 | +ve | |
Student_T.deg_free | 241230.532918 | +ve |
\n",
"Model: joint gp
\n",
"Objective: 68.3173355645
\n",
"Number of Parameters: 11
\n",
"Number of Optimization Parameters: 8
\n",
"Updates: True
\n",
"
joint_gp. | value | constraints | priors |
---|---|---|---|
sum.mul.rbf.variance | 1.72362972802 | +ve | |
sum.mul.rbf.lengthscale | 3.33382235324 | +ve | |
sum.mul.coregion.W | (3, 1) | fixed | |
sum.mul.coregion.kappa | (3,) | +ve | |
sum.bias.variance | 15.5408413253 | +ve | |
Student_T.t_scale2 | 1.9462960868 | +ve | |
Student_T.deg_free | 197085.121973 | +ve |
\n",
"Model: independent gp
\n",
"Objective: 56.9719772327
\n",
"Number of Parameters: 8
\n",
"Number of Optimization Parameters: 8
\n",
"Updates: True
\n",
"
independent_gp. | value | constraints | priors |
---|---|---|---|
sum.mul.rbf.variance | 825.413920886 | +ve | |
sum.mul.rbf.lengthscale | 2.15105493755 | +ve | |
sum.mul.coregion.W | [] | ||
sum.mul.coregion.kappa | (3,) | +ve | |
sum.bias.variance | 47.5224357048 | +ve | |
Student_T.t_scale2 | 0.583366971663 | +ve | |
Student_T.deg_free | 25080.1532746 | +ve |
\n",
"Model: joint gp
\n",
"Objective: 49.0786466464
\n",
"Number of Parameters: 11
\n",
"Number of Optimization Parameters: 8
\n",
"Updates: True
\n",
"
joint_gp. | value | constraints | priors |
---|---|---|---|
sum.mul.rbf.variance | 1.67355438177 | +ve | |
sum.mul.rbf.lengthscale | 2.25937641453 | +ve | |
sum.mul.coregion.W | (3, 1) | fixed | |
sum.mul.coregion.kappa | (3,) | +ve | |
sum.bias.variance | 47.0055298343 | +ve | |
Student_T.t_scale2 | 0.466869541971 | +ve | |
Student_T.deg_free | 649705.042094 | +ve |