{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Determination of optimal biomarker for the decription of stochastic disease dynamics from logitudinal data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The general idea\n", "Assume that stochastic disease dynamics can be described by a Markov chain. It means that disease can be roughly characterized by a number of states. And when at the time moment ($t$) the disease is in a particular state ($i$) it has a probability $P_{\\Delta t} (j|i)$ to go to another state ($j$) at the next time moment ($t+\\Delta t$). If one known all the transition probabilities $P_{\\Delta t} (j|i)$, he can evaluet possible disease trajectories starting from current state, and, in particular, compute the probability to end up healthy or dead, roughly speaking, after some time.\n", "\n", "How can one determine $P_{\\Delta t} (j|i)$? Assume that one has a very large set of patients, whose state is monitored at repeated time intervals of $\\Delta t$. By the state we mean the collection of all the parameters of the patient, i.e., metobolome, genomee, proteome, etc, that could be important for the diease. Such a large type of data sets is an example of big longitudinal data. Now, if we have a collection of patients that were at some time moment at state $i$ and were found at the next time moment at different states, the transition probability can be estmated as $P_{\\Delta t} (j|i)=n(j|i)/n(i)$, where $n(j|i)$ is the total number of times a patient was found at state $i$ at time $t$ and at state $j$ at time $t+\\Delta t$ and $n(i)=\\sum_j n(j|i)$. As one can see it is important that each patient has more than one meausrement, at least two, i.e., the data should be longitudinal. There are methods to determine biomarker using non-longitudinal data, however they can not be used to construct a Markov chain model of desease dynamics.\n", "\n", "Using longitudinal data and Markov chain model of dynamics allows one to avoid the following problem associated with standard methods that use non-longitudinal data. Such methods divide patients into two groups, e.g., healthy and ill, where one seeks a parameter or a set of parameters that can be used to differentiate between the patients in two groups. The weak point in such methods is the necessity to accurately divide patients into the two groups. Usually one assumes the existence of a gold standard method, that is used for such division. However, such accurate methods do not always exist, and second they are likley to be very accurate only for patients that are very healthy or very ill. The classification of patients in the grey zone, between the two extreme cases, e.g., a patient having a dissease in the initial stage, is likely to be problematic. And hence such patients can not be included in the training sets to determine the biomarker. However diagnostics of such patients is of outmost importance, since it is easier to fight disease at its initial stage. The Markov chain model avoids this problem by constructing the model of the dynamics of the entire disease, so that one can predict the probability to end up in any state of interest, starting from any state of interst. In particular, the probability to end up completely healthy vs very ill, starting from any state of the disease. In other words, one needs only to define the final states, where the patient is either very healthy or very ill, which can be done with a very good accuracy by a golden standard or other reasonably good method.\n", "\n", "Another difference with the standard methods of biomarker determination and analysis, which is a consequence of the assumed stochastic model of disease dynamics, is that we predict not outcomes, i.e., whether a patient is ill or healthy, or whether a patient will survive or die, but rather the probabilities of such outcomes.\n", "To illuminate the difference consider a hypothetical case with a few absolutely identical patients in absolutely identical conditions. The standard approches demand that they all have identical outcome, and difference in outcome means that we have missed some important characteristic difference between the patients.\n", "The stochastic dynamics means that the patients may end up having different outcome, one may day another may survive, and it is not due to inaccuracy of the analysis, but due to inherent stochasticity. Another consequence, is that one may not use the standard ROC analysis of accuracy and specificity for diseases with stochastic dynamics.\n", "\n", "## Optimal biomarker\n", "Determining the complete model of the disease dynamics, i.e., the Markov chain, is a very difficult task, mainly do to exponentially large amount of data required, because of the exponentially large configuration space. And while it might be possible in the future, we are not there yet. One way to significantly restrict the size of the configuration space, and correspondingly, the amount of data required, is to consider just a few important variables, or in extreme case just one. It is clear that one looses a lot of information, thus it is important to select variables in an optimal way, so that the important information is preserved. One such optimal variable, known as the committor function, is an ideal candidate for optimal biomarker. To define it, consider a Markov chain model for a disease dynamics. Define two end or boundary states, corresponding to perfectly healthy or very ill patient. We assume that if a patient, during disease dynamics visits any of these two states, he/she will stay in them forever. The committor function for state $i$, $q(i)$, equals the probability to end up in healthy state rather than in ill state starting from the current state. It is, literally, the likelihood of positive outcome, the most important information one would like to know about the prognosis of the patient dynamics. This optimal variable has a number of nice properties, in particule it can be used to construct simplified models of dynamics, such that they can be used to compute _exactly_ some important properties of the original dynamics. It is extensively used in the analysis of protein folding simulations. \n", "\n", "A nice property of the commitor function, useful here, is that it can be constructed without constructing first the complete Markov chain model of the dynamics. For an equilibrium dynamics with detailed balance, one can show that the commitor function satisfies the following variational principle: it provide minimums to the total squared displacement $\\Delta r^2$. More specifically, consider a long multidimensional trajectory $\\vec{X}(t)$ recorded with time interval $\\Delta t$. Let $R(\\vec{X})$ be function of configuration space that computes the value of putatie optimal variable $r$. Let $r(t)=R(\\vec{X(t)})$ be the time-series of the putative optimal variable. We assume that the two end/boundary states, defined as $A$ and $B$ are mapped to $r(A)=0$ and $r(B)=1$. The total squared displacment is defined as $\\Delta r^2=\\sum_t [r(t+\\Delta t)-r(t)]^2$. Then $\\Delta r^2$ reachies minimum when $r(i)$ equals $q(i)$ the committor function for the Markov chain with the boundary states A and B. Which suggest the following approach for determining $q$. One suggest a functional form with many parameters $r=R(\\vec{X},\\alpha_i)$ as an approximation to $q$. The parameters $\\alpha_i$ are determined to provide minimum to the $\\Delta r^2$ functional. If function $R$ linearly dependes of $\\alpha_i$, then their values can be found analytically. In case one have no idea how the functional form $R(\\vec{X},\\alpha_i)$ may look like, one may determine $q(i)$ in a non-parametric way, without using $R(\\vec{X},\\alpha_i)$ at all. These ideas has been sucessfully applied to the analysis of protein folding simulations.\n", "\n", "One complication in using the described approach to determine the committor or optimal biomarker from patient trajectories, is that oned does not have a single long equilbrium trajectory but rather a set of short trajectories for different patients. It means that the sampling is not at equilbrium and the detailed balance, which was assumed in the derivation of the variational principle is not satisfied. A simple idea, that has been sucessfuly tested, is to assume that changes to optimal variable due to non-equilbrium sampling are negligible, so that one can use the variational functional without modifications. After the optiaml biomarker has been deterrmined, one may unbias the sampling, to obtian equilibrium properties of the dynamics. We will use such an approach here.\n", "\n", "\n", "## The overall analysis workflow\n", "\n", "The simplest analysis consists of the following steps:\n", " \n", "- Reading in the patients data using the panda library.\n", " \n", "- After reading the data might be preprocessed. For example, some of the variables may be removed. The NMR spectra, used in metabolomics, could be coarsegrained. The variables could be transformed, popular transformations include $\\sqrt{x}$, $\\log(x+\\epsilon)$. More advanced steps may include the variable or feature selection by using e.g., the lasso or $L_1$ regularisation.\n", " \n", "- An optimal biomarker is constructed from all the variables by using subroutines from the module tools. In particular, commitor function determines optimal biomarker in a supervised way, using the known clinical outcomes for the patients. The evec function determines a few optimal biomarkers in an unsupervised way, by determining a few lowest eigenmodes/eigenvectors of the stochastic disease dynamics. \n", "\n", "- Visualisation of the results. Informative plots may include the following. Patient trajectories: plot of optimal biomarker time-series for each patient. It shows how the state of a patient changes with time. - Free energy landscape of the disease dynamics. This plot provides a simple, while quantitatively accurate picture of the stochastic disease dynamics as diffusion along the optimal biomarker coordinate. The picture can be supplemented with the commitor function as the function of the optimal biomarker. While optimal biomarker and committor function a should be the same, due to biased sampling, small number of parameters or poor approximation via functional form, they may differ. Leave-one-out cross validation provides cross validation test, and can be used to demonstrate the robustness of the results, that there is no overfitting, and that the obtained optimal biomarker can be used to analyze new patients.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis of dynamics of patient recovery after kidney transplant\n", "We will analyse the stochastic dynamics of patient recovery after kidney transplant. The biomarker is constructed based on NMR spectra of blood from 18 patients, taken before, during, and after the operation (upto 7 days). More details can be found in Krivov et al. PLoS Comput. Biol. 10 (2014) e1003685.\n", "\n", "\n", "### Reading data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "data=pd.read_excel('test.xlsx')\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The patients data are stored in data object, which is an instance of DataFrame object of pandas module. data.head() gives a summary (first 5 rows) of its content, e.g., it shows that we have 140 samples. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The shape of the data object shows that we have 31427 variables i.e., the NMR spectral lines. The first column shows the NMR shift in PPM.\n", "\n", "Before constructng the optimal biomarker, one may want to visualize the spectra. To do this for patient _k1_ one can use the plot function of pandas:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "ax=plt.gca()\n", "data.plot(x='ppm',y='k1 pre',ax=ax)\n", "data.plot(x='ppm',y='k1 int',ax=ax)\n", "data.plot(x='ppm',y='k1 post',ax=ax)\n", "data.plot(x='ppm',y='k1 01',ax=ax)\n", "data.plot(x='ppm',y='k1 03',ax=ax)\n", "data.plot(x='ppm',y='k1 04',ax=ax)\n", "data.plot(x='ppm',y='k1 07',ax=ax) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As one can see the spectra are very complex and it is difficult to see the difference between different outcomes.\n", "\n", "To compute the optimal biomarker - the committor, we first need to coarsegrain the spectra, otherwise the number of variables is too large, which leads to overfitting, as we will see later. The following will truncate the data to the segment ppm min=0 and ppm max=8.0 and bin it with interval dppm=0.32. Other prepocessing of the data can be done at this point. For example, we remove samples taken before and after the operation 'pre' and 'post' and rename sample 'int' (during operation) to day 0. We also convert to MultiIndex to handle longitudinal data more transparently.\n", "\n", "### Coarsegraning the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def cleanup(data): #modified to work with pandas 0.24.2\n", " for c in data.columns[1:]: # remove pre and post and convert int to 0\n", " pat,time=c.split()\n", " if time=='pre' or time=='post':data.drop(columns=c,inplace=True)\n", " if time=='int':data.rename(columns={c:pat+' 0'},inplace=True)\n", " data.set_index('ppm',inplace=True) # move ppm column to index\n", " multi={} # create multiIndex: first index - patient, second index - time\n", " for c in data.columns:\n", " pat,time=c.split()\n", " time=float(time)\n", " multi[c]=pat,time\n", " data.rename(columns=multi,inplace=True)\n", " mi=pd.MultiIndex.from_tuples(data.columns)\n", " data=data.reindex(columns=mi)\n", " data.columns.names=['patient','time']\n", " return data\n", " \n", "def coarsegrain(data,ppmmin,ppmmax,dppm):\n", " import numpy\n", " bins=numpy.arange(ppmmin,ppmmax+dppm+1e-5,dppm)\n", " cdata=data.groupby(pd.cut(data.index, bins=bins)).sum()\n", " cdata.index.names=['ppm']\n", " return cdata\n", "\n", "data=cleanup(data)\n", "cdata=coarsegrain(data,ppmmin=0,ppmmax=8.0,dppm=0.32)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Coarse-grained data with MultiIndex looks like" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cdata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see a subset of data for patient 'k1'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cdata['k1']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Due to MultiIndex, visualization now can be done more strighforwardly. The coarse-grained spectra for patient 'k1':" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cdata['k1'].plot.bar(figsize=(12,4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Determining the committor\n", "To compute the commitor one needs to define the two boundary or end states, which is usually done by clinical assesment. In our case the patients, based on a clinical assesment at the last day, are divided into three groups:\n", "- AR - Acute Rejection\n", "- DGF - Delayed Graft Function\n", "- PF - Perfect Function" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "AR=['k8', 'k12', 'k14', 'k21', 'k27']\n", "DGF=['k4', 'k7', 'k9', 'k15', 'k17', 'k18', 'k24']\n", "PF=['k1', 'k2', 'k3', 'k5', 'k6', 'k10', 'k11', 'k13', 'k16', 'k25', 'k26', 'k33', 'k34']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets consider dynamics between PF and DGF states. Correspondingly, we define the boundary nodes A and B as" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DGF_PF={}\n", "for pat in DGF:DGF_PF[pat]='B'\n", "for pat in PF:DGF_PF[pat]='A'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are ready to compute an approximation to the committor function, as a linear combination of variables in the sample $q=\\sum_\\alpha c_\\alpha x_\\alpha$, or to project the multidimensional data onto the committor coordinate." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def p2committor(data,AB,maskx=None,lxuse=None,lmb=0):\n", " import numpy\n", " nx,nt=data.shape\n", " if maskx is None:maskx=numpy.ones(nx,'i')\n", " if lxuse is not None:\n", " maskx=numpy.zeros(len(lppm),'i')\n", " for i in lxuse: maskx[i]=1\n", " i21=[i for i in range(nx) if maskx[i]==1]\n", " nx2=len(i21)\n", " import numpy\n", " lx=numpy.zeros((nx2,nt),'float64') # array of samples/variables\n", " bval=numpy.zeros(nt,'int') # boundary values for supervised optimization\n", " lp=numpy.zeros(nt,'int') # list/index of patients\n", " skip=numpy.zeros(nt,'int') # index of patients to skip, used for cross-validation\n", " \n", " ipat={}# patient to number\n", " for it in range(nt):\n", " pat,time=data.columns[it]\n", " if pat not in ipat:ipat[pat]=len(ipat)+1\n", " if ipat[pat]