{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ADAGE: Pan-cancer gene expression\n", "\n", "**Gregory Way 2017**\n", "\n", "This script trains a denoising autoencoder on pan-cancer gene expression data using Keras. It adapts the framework of the ADAGE (Analysis using Denoising Autoencoders of Gene Expression) model published by [Tan _et al_ 2015](https://doi.org/10.1128/mSystems.00025-15).\n", "\n", "An ADAGE model learns a non-linear, reduced-dimensional representation of gene expression data by bottlenecking the raw features into a smaller set. The model is trained by minimizing the reconstruction error between the input and its reconstruction.\n", "\n", "The specific model trained in this notebook compresses the gene expression input (the 5000 most variably expressed genes by median absolute deviation) into a single length-100 hidden vector, which is then decoded back to the original 5000 dimensions. The encoding (compression) layer has a `relu` activation and the decoding layer has a `sigmoid` activation. The weights of each layer are initialized with the Glorot uniform scheme. We include an l1 regularization term (see [`keras.regularizers.l1`](https://keras.io/regularizers/) for more details) to induce sparsity in the model, as well as a term controlling the probability of input feature dropout. Dropout is active only during training and provides the denoising aspect of the model. See [`keras.layers.noise.Dropout`](https://keras.io/layers/core/) for more details.\n", "\n", "We train the autoencoder with the Adadelta optimizer and a mean squared error (MSE) reconstruction loss.\n", "\n", "The pan-cancer ADAGE model is similar to Tybalt, but it does not constrain the latent features to follow a Gaussian distribution. Whether the VAE-learned features provide any additional benefit over ADAGE features remains an active research question. The VAE is a generative model and therefore permits straightforward generation of synthetic data. 
Additionally, we hypothesize that the VAE learns a manifold that can be interpolated to extract meaningful biological knowledge." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] } ], "source": [ "import os\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "import pydot\n", "import graphviz\n", "from keras.utils import plot_model\n", "from IPython.display import SVG\n", "from keras.utils.vis_utils import model_to_dot\n", "\n", "from keras.layers import Input, Dense, Dropout, Activation\n", "from keras.layers.noise import GaussianDropout\n", "from keras.models import Model\n", "from keras.regularizers import l1\n", "from keras import optimizers\n", "import keras" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.0.6\n" ] } ], "source": [ "print(keras.__version__)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "plt.style.use('seaborn-notebook')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sns.set(style=\"white\", color_codes=True)\n", "sns.set_context(\"paper\", rc={\"font.size\":14,\"axes.titlesize\":15,\"axes.labelsize\":20,\n", " 'xtick.labelsize':14, 'ytick.labelsize':14})" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(10459, 5000)\n" ] }, { "data": { "text/html": [ "
Preview of the gene expression input matrix:

| | RPS4Y1 | XIST | KRT5 | AGR2 | CEACAM5 | KRT6A | KRT14 | CEACAM6 | DDX3Y | KDM5D | ... | FAM129A | C8orf48 | CDK5R1 | FAM81A | C13orf18 | GDPD3 | SMAGP | C2orf85 | POU5F1B | CHST2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TCGA-02-0047-01 | 0.678296 | 0.289910 | 0.034230 | 0.0 | 0.0 | 0.084731 | 0.031863 | 0.037709 | 0.746797 | 0.687833 | ... | 0.440610 | 0.428782 | 0.732819 | 0.634340 | 0.580662 | 0.294313 | 0.458134 | 0.478219 | 0.168263 | 0.638497 |
| TCGA-02-0055-01 | 0.200633 | 0.654917 | 0.181993 | 0.0 | 0.0 | 0.100606 | 0.050011 | 0.092586 | 0.103725 | 0.140642 | ... | 0.620658 | 0.363207 | 0.592269 | 0.602755 | 0.610192 | 0.374569 | 0.722420 | 0.271356 | 0.160465 | 0.602560 |

2 rows × 5000 columns
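The architecture described in the introduction can be sketched with the Keras functional API. This is a minimal illustration, not the notebook's exact code: the `sparsity` and `noise` values below are placeholders standing in for the notebook's tuned hyperparameters.

```python
from keras.layers import Input, Dense, Dropout
from keras.models import Model
from keras.regularizers import l1

original_dim = 5000   # top 5000 genes by median absolute deviation
latent_dim = 100      # size of the compression (encoding) layer
sparsity = 1e-6       # l1 penalty inducing sparsity (placeholder value)
noise = 0.05          # input dropout probability, the denoising step (placeholder value)

expression_input = Input(shape=(original_dim,))
corrupted = Dropout(noise)(expression_input)      # active only during training
encoded = Dense(latent_dim, activation='relu',
                kernel_initializer='glorot_uniform',
                kernel_regularizer=l1(sparsity))(corrupted)
decoded = Dense(original_dim, activation='sigmoid',
                kernel_initializer='glorot_uniform')(encoded)

adage = Model(expression_input, decoded)
adage.compile(optimizer='adadelta', loss='mse')
```

Training is then a call such as `adage.fit(rnaseq_df.values, rnaseq_df.values, ...)`: the model reconstructs the uncorrupted input from its dropout-corrupted version.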
| | gene mean | gene abs(sum) |
|---|---|---|
| PPAN-P2RY11 | -0.001976 | 0.239137 |
| GSTT1 | 0.023419 | 0.232442 |
| GSTM1 | 0.005316 | 0.219905 |
| EIF1AY | -0.048686 | 0.209455 |
| DDX3Y | -0.034089 | 0.207383 |
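The input matrix shown above is built by keeping the most variable genes and scaling each gene to [0, 1], the output range of the sigmoid decoder. The following is a hedged sketch of that preprocessing in pandas; the function names are hypothetical and this is not the notebook's actual code.

```python
import pandas as pd

def select_mad_genes(expr_df, num_genes=5000):
    """Keep the genes (columns) with the largest median absolute deviation (MAD)."""
    mad = (expr_df - expr_df.median(axis=0)).abs().median(axis=0)
    top_genes = mad.sort_values(ascending=False).index[:num_genes]
    return expr_df.loc[:, top_genes]

def min_max_scale(expr_df):
    """Scale each gene to the [0, 1] interval to match the sigmoid output."""
    return (expr_df - expr_df.min()) / (expr_df.max() - expr_df.min())

# Toy example: 4 samples x 3 genes; gene B is constant, so its MAD is 0
toy = pd.DataFrame({'A': [0.0, 1.0, 2.0, 3.0],
                    'B': [5.0, 5.0, 5.0, 5.0],
                    'C': [0.0, 10.0, 0.0, 10.0]})
subset = select_mad_genes(toy, num_genes=2)
print(sorted(subset.columns))  # -> ['A', 'C'], the two most variable genes
```

Applying `min_max_scale` to the selected subset then yields a matrix of values in [0, 1], matching the preview of the input data above.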