{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Cancer Type Classification using Deep-Learning\n", "## S.Ravichandran" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This document explains how to use genomic expression data to classify different cancer/tumor sites/types. This workshop is a follow-up to the NCI-DOE Pilot1 benchmark, also called TC1. You can read about the project here: https://github.com/ECP-CANDLE/Benchmarks/tree/master/Pilot1/TC1\n", "\n", "For classification, we use a Deep-Learning procedure called a 1D-Convolutional Neural Network (CONV1D; https://en.wikipedia.org/wiki/Convolutional_neural_network). \n", "NCI Genomic Data Commons (GDC; https://gdc.cancer.gov/) is the source of the RNASeq expression data. \n", "\n", "We start with genomic data preparation and then show how to use the data to build a CONV1D model that can classify different cancer types. Please note that there is more than one way to extract data from GDC. What I am describing is one possible way. 
\n", "\n", "This notebook continues from the data preparation step, which is available here: \n", "https://github.com/ravichas/ML-TC1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part-2: Convolutional Neural Network" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load some libraries" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] } ], "source": [ "from __future__ import print_function\n", "import os, sys, gzip, glob, json, time, argparse\n", "import warnings\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "import pandas as pd\n", "from pandas.io.json import json_normalize\n", "import numpy as np\n", "\n", "from sklearn import preprocessing\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import accuracy_score\n", "from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler\n", "from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n", "\n", "from keras.utils import to_categorical\n", "from keras import backend as K\n", "from keras.layers import Input, Dense, Dropout, Activation, Conv1D, MaxPooling1D, Flatten\n", "from keras import optimizers\n", "from keras.optimizers import SGD, Adam, RMSprop\n", "from keras.models import Sequential, Model, model_from_json, model_from_yaml\n", "from keras.utils import np_utils\n", "from keras.callbacks import ModelCheckpoint, CSVLogger, ReduceLROnPlateau\n", "from keras.callbacks import EarlyStopping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let us read the input data and outcome class data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Read the feature matrix and the outcome (class label) files\n", "TC1data3 = pd.read_csv(\"Data/TC1-data3stypes.tsv\", sep=\"\\t\", low_memory=False)\n", "outcome = pd.read_csv(\"Data/TC1-outcome-data3stypes.tsv\", 
sep=\"\\t\", low_memory=False, header=None)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 60400 | 60401 | 60482 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.716923 | 0.0 | 1.951998 | 1.167483 | 0.667981 | 1.274099 | 1.258272 | 1.837351 | 1.000251 | 1.991821 | 0.0 | 0.0 | 0.0 |
| 1 | 1.979573 | 0.0 | 1.939303 | 0.946014 | 0.828050 | 1.338521 | 1.215231 | 2.298950 | 1.974058 | 1.744890 | 0.0 | 0.0 | 0.0 |
| 2 | 1.681222 | 0.0 | 2.016686 | 0.789298 | 0.930981 | 1.167504 | 1.026718 | 2.058239 | 1.776646 | 1.510484 | 0.0 | 0.0 | 0.0 |
| 3 | 1.640044 | 0.0 | 1.669994 | 0.821958 | 0.426876 | 1.214174 | 1.673027 | 1.904529 | 0.867674 | 1.526440 | 0.0 | 0.0 | 0.0 |
| 4 | 1.800725 | 0.0 | 2.013062 | 0.743211 | 0.652487 | 0.935054 | 1.102839 | 2.068075 | 1.405575 | 1.674716 | 0.0 | 0.0 | 0.0 |