1. Predictive Modelling. Predictive modelling is about using historical data to build a model that predicts future outcomes. It is not a single algorithm, but rather a sequence of computational tasks. The key steps are:
   + defining the prediction target;
   + constructing the right patient cohort;
   + constructing the right features (observation window, index date, prediction window and diagnosis date);
   + building models to predict the outcome.

   The model can address a classification problem (predicting whether the patient will develop heart failure or not) or a regression problem (predicting the costs a patient will incur). The final step of this pipeline is to assess how good the model is through performance evaluation (leave-one-out cross-validation, K-fold cross-validation, randomized cross-validation), as sketched below.
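   As a concrete illustration of the evaluation step, here is a minimal sketch of K-fold cross-validation for a binary heart-failure classifier using scikit-learn. The feature matrix `X` and labels `y` are synthetic stand-ins for features that would really be built from each patient's observation window.

   ```python
   import numpy as np
   from sklearn.linear_model import LogisticRegression
   from sklearn.model_selection import KFold, cross_val_score

   rng = np.random.default_rng(0)
   X = rng.normal(size=(500, 20))    # 500 patients x 20 features from the observation window
   y = rng.integers(0, 2, size=500)  # 1 = heart failure within the prediction window

   # evaluate a simple classifier with 5-fold cross-validation
   model = LogisticRegression(max_iter=1000)
   cv = KFold(n_splits=5, shuffle=True, random_state=0)
   scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
   print(f"5-fold AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
   ```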
2. Computational Phenotyping. Computational phenotyping is about turning messy Electronic Health Records (EHRs) into meaningful clinical concepts. The input to computational phenotyping is raw patient data from many sources, such as demographic information, diagnoses, medications, procedures, lab tests and clinical notes. The phenotyping algorithm converts this raw patient data into medical concepts or phenotypes. The main usage of this data is to support clinical operations such as billing, or to support genomic studies.
   + Medical Ontology. One of the things that makes healthcare a unique domain for Big Data analytics is the existence of structured medical knowledge. A current effort in this direction is the OMOP common data model.
   + The most popular medical ontology is SNOMED (Systematized Nomenclature of Medicine). It is a huge graph of medical concepts and their relations with each other.
   + The patient comes to a hospital to get a lab test. The result of this lab test is stored using a LOINC code.
   + The lab test result goes to the doctor, who diagnoses the patient with different ICD codes.
   + Once we have the diagnosis for this patient, we want to treat them with a medical procedure, represented by a CPT code.
   + The patient can also take some medication, represented by an NDC code.
   + A tensor for EHR data: after tensor factorisation, each rank-one component is a candidate phenotype, and $\lambda$ captures the importance of each phenotype (see the sketch below).
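   Below is a minimal sketch of tensor-based phenotyping: a synthetic patients × diagnoses × medications count tensor is factorized with CP decomposition via `tensorly`, and the returned weights play the role of $\lambda$. The data and the rank `R = 5` are illustrative assumptions, not from the source (published phenotyping methods typically use non-negative variants; plain CP keeps the sketch short).

   ```python
   import numpy as np
   import tensorly as tl
   from tensorly.decomposition import parafac

   rng = np.random.default_rng(0)
   # synthetic counts: 100 patients x 30 diagnosis codes x 20 medication codes
   tensor = tl.tensor(rng.poisson(1.0, size=(100, 30, 20)).astype(float))

   R = 5  # number of candidate phenotypes (illustrative choice)
   weights, factors = parafac(tensor, rank=R, normalize_factors=True)
   weights = tl.to_numpy(weights)  # lambda_r: importance of each phenotype
   patients, diagnoses, medications = [tl.to_numpy(f) for f in factors]

   # rank phenotypes by lambda and show the codes that define each one
   for r in np.argsort(weights)[::-1]:
       top_dx = np.argsort(diagnoses[:, r])[::-1][:3].tolist()
       top_rx = np.argsort(medications[:, r])[::-1][:3].tolist()
       print(f"phenotype {r}: lambda={weights[r]:.2f}, top dx={top_dx}, top rx={top_rx}")
   ```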
3. Patient Similarity. In healthcare, the traditional paradigm has been evidence-based medicine, in which decisions are based on well-designed and well-conducted research, typically Randomized Clinical Trials (RCTs), and those guidelines are then applied in practice. The problem with RCTs is that they require a controlled environment and population, and they test one thing at a time, which is expensive and time-consuming. Patient similarity algorithms use healthcare data to identify groups of patients sharing similar characteristics (see the sketch after this list). Patient similarity can potentially give rise to a new paradigm called Precision Medicine, where personalized decision making is recommended after conducting Pragmatic Trials (PTs) based on EHR data and measuring similarity among patients.
   + [Real world evidence](https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence): the FDA provides a [guideline](https://www.fda.gov/regulatory-information/search-fda-guidance-documents/submitting-documents-using-real-world-data-and-real-world-evidence-fda-drugs-and-biologics-guidance) on using real-world data and real-world evidence when submitting drugs, biologics, etc. For example, see a paper published recently in [Lancet](https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(19)32317-7/fulltext).
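   Below is a minimal sketch of a patient similarity query, assuming each patient is represented by a vector of diagnosis-code counts (a synthetic stand-in here); it retrieves the $k$ most similar patients by cosine similarity, which is one common choice among many possible similarity measures.

   ```python
   import numpy as np
   from sklearn.metrics.pairwise import cosine_similarity

   rng = np.random.default_rng(0)
   # 200 patients x 50 diagnosis-code counts (synthetic stand-in)
   X = rng.poisson(0.3, size=(200, 50)).astype(float)

   def most_similar(index_patient: int, k: int = 5) -> np.ndarray:
       """Return indices of the k patients most similar to the index patient."""
       sims = cosine_similarity(X[index_patient : index_patient + 1], X)[0]
       sims[index_patient] = -np.inf  # exclude the index patient themself
       return np.argsort(sims)[::-1][:k]

   print(most_similar(0))  # the 5 nearest neighbours of patient 0
   ```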
# Population-Level Estimation

- Population-level effect estimation refers to the estimation of **average causal effects** of exposures (e.g. medical interventions such as drug exposures or procedures) on specific health outcomes of interest.
- Direct effect estimation: estimating the effect of an exposure on the risk of an outcome, as compared to no exposure.
- Comparative effect estimation: estimating the effect of an exposure (the target exposure) on the risk of an outcome, as compared to another exposure (the comparator exposure).
- In both cases, the patient-level causal effect contrasts a **factual outcome**, i.e., what happened to the exposed patient, with a **counterfactual** outcome, i.e., what would have happened had the exposure not occurred (direct) or had a different exposure occurred (comparative).
- Since any one patient reveals only the factual outcome (the fundamental problem of causal inference), the various effect estimation designs employ different analytic devices to shed light on the counterfactual outcomes.
- In the causal inference language:
  + We observe $N$ units, indexed by $i = 1,\dots,N$, drawn randomly from a large population. For each unit, there is a pair of potential outcomes: $Y_i(0)$ for the outcome under the control treatment and $Y_i(1)$ for the outcome under the active treatment. In addition, each unit has a vector of covariates denoted by $X_i$.
  + Each unit is exposed to a single treatment: $W_i = 0$ if unit $i$ receives the control treatment and $W_i = 1$ if unit $i$ receives the active treatment. We therefore observe for each unit the triple $(W_i, Y_i, X_i)$, where $Y_i$ is the realized outcome: $$Y_i \equiv Y_i(W_i) = \begin{cases}Y_i(0) & \mbox{if $W_i=0$}\\ Y_i(1) & \mbox{if $W_i=1$} \end{cases}$$
  + Population average treatment effect: $$\tau_P = E[Y_i(1) - Y_i(0)]$$
  + Sample average treatment effect: $$\tau_s = \frac{1}{N}\sum_{i=1}^N \left( Y_i(1) - Y_i(0)\right)$$
  + **Assumption 1 (strong ignorability)** $$\left\{ Y_i(0), Y_i(1)\right\} \perp W_i \mid X_i$$
  + **Assumption 2 (overlap)** $$0 < Pr(W_i=1|X_i) < 1$$
  + If these two assumptions hold, we can estimate the conditional average treatment effect $\tau(x)$ (and hence the average treatment effect, by averaging over $x$) as $$\begin{align} \tau(x) &\equiv E[Y_i(1) - Y_i(0)|X_i = x]\\ & = E[Y_i(1)|X_i = x] - E[Y_i(0)|X_i = x]\\ & = E[Y_i(1)|X_i = x, W_i=1] - E[Y_i(0)|X_i = x, W_i=0]\\ & = E[Y_i|X_i = x, W_i=1] - E[Y_i|X_i = x, W_i=0] \end{align}$$ where the third equality uses Assumption 1 (treatment assignment is ignorable given $X_i$), the last equality uses the fact that the realized outcome equals the potential outcome under the received treatment, and Assumption 2 guarantees that both conditional expectations are well defined.
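To make the identification argument concrete, below is a small illustrative simulation (not from the source): treatment assignment depends on a binary covariate $X_i$, so the naive difference of means is confounded, while the covariate-adjusted estimator $\hat\tau = \sum_x \hat\tau(x)\,\hat{Pr}(X_i=x)$ recovers the true effect $\tau = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
X = rng.integers(0, 2, size=N)            # binary covariate
p = np.where(X == 1, 0.8, 0.2)            # Pr(W_i = 1 | X_i): overlap holds
W = rng.random(N) < p                     # treatment assignment depends on X
Y0 = 2.0 * X + rng.normal(size=N)         # potential outcome under control
Y1 = Y0 + 1.0                             # potential outcome under treatment (tau = 1)
Y = np.where(W, Y1, Y0)                   # realized (factual) outcome

naive = Y[W].mean() - Y[~W].mean()        # confounded: X affects both W and Y
adjusted = sum(                           # sum_x tau_hat(x) * Pr_hat(X = x)
    (Y[W & (X == x)].mean() - Y[~W & (X == x)].mean()) * np.mean(X == x)
    for x in (0, 1)
)
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}, true tau: 1.00")
```

## Study designs

### The Cohort Method Design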