{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Predict regulatory regions from the DNA sequence with Janggu" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we illustrate several variants how to predict regulatory regions (of a toy example) from the DNA sequence.\n", "The reference genome is made up of a concatenation of Oct4 and Mafk binding sites and we shall use all regions on chromosome 'pseudo1' as training\n", "and 'pseudo2' as test chromosomes." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/wkopp/anaconda3/envs/jdev/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", " from ._conv import register_converters as _register_converters\n", "Using TensorFlow backend.\n" ] } ], "source": [ "import os\n", "\n", "import numpy as np\n", "from keras import Model\n", "from keras import backend as K\n", "from keras.layers import Conv2D\n", "\n", "from keras.layers import Dense\n", "from keras.layers import GlobalAveragePooling2D\n", "from keras.layers import Input\n", "from keras.layers import Reshape\n", "\n", "from pkg_resources import resource_filename\n", "\n", "from janggu import Janggu\n", "from janggu import Scorer\n", "from janggu import inputlayer\n", "from janggu import outputdense\n", "from janggu.data import Bioseq\n", "from janggu.data import Cover\n", "from janggu.data import ReduceDim\n", "from janggu.layers import DnaConv2D\n", "from janggu.layers import LocalAveragePooling2D\n", "from janggu.utils import ExportClustermap\n", "from janggu.utils import ExportTsne\n", "from janggu.utils import ExportTsv\n", "\n", "from IPython.display import Image\n", "\n", "np.random.seed(1234)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to specify the output directory in which the results are stored." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "os.environ['JANGGU_OUTPUT'] = '/home/wkopp/janggu_examples'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Specify the DNA sequence feature order. Order 1, 2 and 3 correspond to mono-, di- and tri-nucleotide based features (see Tutorial)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "order = 3" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# load the dataset\n", "# The pseudo genome represents just a concatenation of all sequences\n", "# in sample.fa and sample2.fa. 
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# load the dataset\n", "# The pseudo genome represents just a concatenation of all sequences\n", "# in sample.fa and sample2.fa. Therefore, the results should be almost\n", "# identical to the models obtained from classify_fasta.py.\n", "REFGENOME = resource_filename('janggu', 'resources/pseudo_genome.fa')\n", "# ROI contains regions spanning positive and negative examples\n", "ROI_TRAIN_FILE = resource_filename('janggu', 'resources/roi_train.bed')\n", "ROI_TEST_FILE = resource_filename('janggu', 'resources/roi_test.bed')\n", "# PEAK_FILE only contains positive examples\n", "PEAK_FILE = resource_filename('janggu', 'resources/scores.bed')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Load the datasets for training and testing." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Reusing the same dataset for training and test regions with view" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "When a dataset has been loaded with the store_whole_genome=True option, it is possible to reuse the same\n", "dataset with different regions of interest. To this end, the view method creates another view on the dataset.\n", "The advantage of this option is that the memory footprint remains the same." ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from janggu.data import view" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "DNA_TRAIN = Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,\n", "                                         roi=ROI_TRAIN_FILE,\n", "                                         order=order,\n", "                                         binsize=200,\n", "                                         store_whole_genome=True)\n", "\n", "LABELS_TRAIN = ReduceDim(Cover.create_from_bed('peaks', roi=ROI_TRAIN_FILE,\n", "                                               bedfiles=PEAK_FILE,\n", "                                               binsize=200,\n", "                                               resolution=200,\n", "                                               storage='sparse',\n", "                                               store_whole_genome=True))\n", "\n", "\n", "DNA_TEST = view(DNA_TRAIN, ROI_TEST_FILE)\n", "LABELS_TEST = view(LABELS_TRAIN, ROI_TEST_FILE)" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((7797, 198, 1, 64), (200, 198, 1, 64))" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DNA_TRAIN.shape, DNA_TEST.shape" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((7797, 1), (200, 1))" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LABELS_TRAIN.shape, LABELS_TEST.shape" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Define and fit a model" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Neural networks can also be defined using the Janggu wrappers for keras.\n", "This offers a few additional advantages, including reduced redundancy when defining models and automated evaluation." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "First, we define the model with keras inside a decorated function. The decorators automatically instantiate the input and output layers with the correct dimensionality, which otherwise would have to be specified explicitly (a sketch of an explicit definition follows below for comparison)."
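] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a hedged sketch, illustrative only and not executed in this notebook, of an equivalent model written in plain keras: the input shape, taken from the DNA_TRAIN shape above, and the sigmoid output layer have to be stated by hand, and the layer names 'dna' and 'peaks' are purely illustrative. The decorated definition actually used in this notebook follows in the next cell.\n", "\n", "```python\n", "# Hedged sketch: explicit keras definition of roughly the same architecture.\n", "# The input shape (198, 1, 64) corresponds to order-3 encoded 200 bp bins.\n", "xin = Input(shape=(198, 1, 64), name='dna')\n", "layer = DnaConv2D(Conv2D(30, (21, 1), activation='relu'))(xin)\n", "layer = GlobalAveragePooling2D(name='motif')(layer)\n", "xout = Dense(1, activation='sigmoid', name='peaks')(layer)\n", "explicit_model = Model(inputs=xin, outputs=xout)\n", "explicit_model.compile(optimizer='adadelta', loss='binary_crossentropy',\n", "                       metrics=['acc'])\n", "```"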
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "@inputlayer\n", "@outputdense('sigmoid')\n", "def double_stranded_model_dnaconv(inputs, inp, oup, params):\n", " \"\"\" keras model for scanning both DNA strands.\n", "\n", " A more elegant way of scanning both strands for motif occurrences\n", " is achieved by the DnaConv2D layer wrapper, which internally\n", " performs the convolution operation with the normal kernel weights\n", " and the reverse complemented weights.\n", " \"\"\"\n", " with inputs.use('dna') as layer:\n", " # the name in inputs.use() should be the same as the dataset name.\n", " layer = DnaConv2D(Conv2D(params[0], (params[1], 1),\n", " activation=params[2]))(layer)\n", " output = GlobalAveragePooling2D(name='motif')(layer)\n", " return inputs, output\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the model can be instantiated accordingly. We will also use a specific model name (which is optional)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "modelname = 'dna2peak_ex4_order{}'.format(order)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# create a new model object\n", "model = Janggu.create(template=double_stranded_model_dnaconv,\n", " modelparams=(30, 21, 'relu'),\n", " inputs=DNA_TRAIN,\n", " outputs=LABELS_TRAIN,\n", " name=modelname)\n", "\n", "model.compile(optimizer='adadelta', loss='binary_crossentropy',\n", " metrics=['acc'])\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Model fitting is similar as before." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note also the use of ReduceDim which converts the original 4D Cover object to 2D table-like data structure. This is just for convenience, since it is also possible to set up a model as in the previous example." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/100\n", "244/244 [==============================] - 4s 15ms/step - loss: 0.6372 - acc: 0.6297\n", "Epoch 2/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.5187 - acc: 0.7661\n", "Epoch 3/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.4618 - acc: 0.7984\n", "Epoch 4/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.4225 - acc: 0.8230\n", "Epoch 5/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.3930 - acc: 0.8318\n", "Epoch 6/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.3680 - acc: 0.8473\n", "Epoch 7/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.3450 - acc: 0.8548\n", "Epoch 8/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.3237 - acc: 0.8672\n", "Epoch 9/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.3023 - acc: 0.8795\n", "Epoch 10/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.2814 - acc: 0.8922\n", "Epoch 11/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.2626 - acc: 0.8986\n", "Epoch 12/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.2437 - acc: 0.9101\n", "Epoch 13/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.2253 - acc: 0.9174\n", "Epoch 14/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.2110 - acc: 0.9263\n", "Epoch 15/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.1967 - acc: 0.9319\n", "Epoch 16/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.1844 - acc: 0.9379\n", "Epoch 17/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.1737 - acc: 0.9402\n", "Epoch 18/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.1634 - acc: 0.9438\n", "Epoch 19/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.1546 - acc: 0.9486\n", "Epoch 20/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.1475 - acc: 0.9503\n", "Epoch 21/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.1395 - acc: 0.9535\n", "Epoch 22/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.1324 - acc: 0.9571\n", "Epoch 23/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.1265 - acc: 0.9604\n", "Epoch 24/100\n", "244/244 [==============================] - 3s 13ms/step - loss: 0.1206 - acc: 0.9609\n", "Epoch 25/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.1153 - acc: 0.9644\n", "Epoch 26/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.1102 - acc: 0.9657\n", "Epoch 27/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.1048 - acc: 0.9663\n", "Epoch 28/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.1006 - acc: 0.9700\n", "Epoch 29/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0973 - acc: 0.9714\n", "Epoch 30/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0936 - acc: 0.9720\n", "Epoch 31/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0906 - acc: 0.9721\n", "Epoch 32/100\n", "244/244 [==============================] - 3s 12ms/step - 
loss: 0.0865 - acc: 0.9744\n", "Epoch 33/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0837 - acc: 0.9743\n", "Epoch 34/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0803 - acc: 0.9780\n", "Epoch 35/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0778 - acc: 0.9791\n", "Epoch 36/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0755 - acc: 0.9793\n", "Epoch 37/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0728 - acc: 0.9809\n", "Epoch 38/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0700 - acc: 0.9817\n", "Epoch 39/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0681 - acc: 0.9814\n", "Epoch 40/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0657 - acc: 0.9831\n", "Epoch 41/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0641 - acc: 0.9834\n", "Epoch 42/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0612 - acc: 0.9849\n", "Epoch 43/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0597 - acc: 0.9862\n", "Epoch 44/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0576 - acc: 0.9852\n", "Epoch 45/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0565 - acc: 0.9863\n", "Epoch 46/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0545 - acc: 0.9871\n", "Epoch 47/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0531 - acc: 0.9877\n", "Epoch 48/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0520 - acc: 0.9875\n", "Epoch 49/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0496 - acc: 0.9886\n", "Epoch 50/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0485 - acc: 0.9890\n", "Epoch 51/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0471 - acc: 0.9895\n", "Epoch 52/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0455 - acc: 0.9907\n", "Epoch 53/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0444 - acc: 0.9905\n", "Epoch 54/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0433 - acc: 0.9903\n", "Epoch 55/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0423 - acc: 0.9912\n", "Epoch 56/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0410 - acc: 0.9910\n", "Epoch 57/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0396 - acc: 0.9915\n", "Epoch 58/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0386 - acc: 0.9924\n", "Epoch 59/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0376 - acc: 0.9918\n", "Epoch 60/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0366 - acc: 0.9931\n", "Epoch 61/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0348 - acc: 0.9924\n", "Epoch 62/100\n", "244/244 [==============================] - 3s 10ms/step - loss: 0.0347 - acc: 0.9932\n", "Epoch 63/100\n", "244/244 [==============================] - 2s 10ms/step - loss: 0.0334 - acc: 0.9942\n", "Epoch 64/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0326 - acc: 0.9937\n", "Epoch 65/100\n", "244/244 [==============================] - 3s 
12ms/step - loss: 0.0315 - acc: 0.9940\n", "Epoch 66/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0311 - acc: 0.9944\n", "Epoch 67/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0299 - acc: 0.9944\n", "Epoch 68/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0292 - acc: 0.9953\n", "Epoch 69/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0289 - acc: 0.9949\n", "Epoch 70/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0279 - acc: 0.9947\n", "Epoch 71/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0270 - acc: 0.9954\n", "Epoch 72/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0262 - acc: 0.9955\n", "Epoch 73/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0255 - acc: 0.9962\n", "Epoch 74/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0249 - acc: 0.9968\n", "Epoch 75/100\n", "244/244 [==============================] - 3s 10ms/step - loss: 0.0238 - acc: 0.9971\n", "Epoch 76/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0238 - acc: 0.9962\n", "Epoch 77/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0229 - acc: 0.9971\n", "Epoch 78/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0223 - acc: 0.9976\n", "Epoch 79/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0216 - acc: 0.9973\n", "Epoch 80/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0213 - acc: 0.9980\n", "Epoch 81/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0207 - acc: 0.9978\n", "Epoch 82/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0201 - acc: 0.9983\n", "Epoch 83/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0194 - acc: 0.9982\n", "Epoch 84/100\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "244/244 [==============================] - 3s 14ms/step - loss: 0.0189 - acc: 0.9981\n", "Epoch 85/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0188 - acc: 0.9978\n", "Epoch 86/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0183 - acc: 0.9983\n", "Epoch 87/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0177 - acc: 0.9986\n", "Epoch 88/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0171 - acc: 0.9983\n", "Epoch 89/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0167 - acc: 0.9988\n", "Epoch 90/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0164 - acc: 0.9986\n", "Epoch 91/100\n", "244/244 [==============================] - 3s 12ms/step - loss: 0.0156 - acc: 0.9990\n", "Epoch 92/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0155 - acc: 0.9992\n", "Epoch 93/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0151 - acc: 0.9988\n", "Epoch 94/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0146 - acc: 0.9992\n", "Epoch 95/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0143 - acc: 0.9990\n", "Epoch 96/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0140 - acc: 0.9991\n", "Epoch 97/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0136 - acc: 
0.9992\n", "Epoch 98/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0133 - acc: 0.9994\n", "Epoch 99/100\n", "244/244 [==============================] - 3s 11ms/step - loss: 0.0130 - acc: 0.9992\n", "Epoch 100/100\n", "221/244 [==========================>...] - ETA: 0s - loss: 0.0127 - acc: 0.9994" ] } ], "source": [ "hist = model.fit(DNA_TRAIN, LABELS_TRAIN, epochs=100)\n", "\n", "print('#' * 40)\n", "print('loss: {}, acc: {}'.format(hist.history['loss'][-1],\n", " hist.history['acc'][-1]))\n", "print('#' * 40)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# do the evaluation on the independent test data\n", "model.evaluate(DNA_TEST, LABELS_TEST, datatags=['test'],\n", " callbacks=['auc', 'auprc', 'roc', 'prc'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "evaluation_folder = os.path.join(os.environ['JANGGU_OUTPUT'], 'evaluation', modelname, 'test')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Image(os.path.join(evaluation_folder, 'prc.png'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Image(os.path.join(evaluation_folder, 'roc.png'))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }