## Bayes Theorem

### Rohan L. Fernando

### June 2015

## Motivation


In whole-genome analyses, the number $k$ of marker covariates typically
exceeds the number $n$ of observations. In this situation, least
squares methods cannot be used to simultaneously estimate the effects of
all $k$ marker covariates. One of the most widely used methods to
overcome this problem is Bayesian inference, where prior information
about marker effects is combined with the data to make inferences about
the marker effects. In Bayesian inference, inferences are based on
conditional probabilities, and Bayes theorem is a statement about
conditional probability.

## Conditional Probability of $X$ Given $Y$


Suppose $X$ and $Y$ are two random variables with joint probability
distribution $\Pr(X,Y)$. Then, the conditional probability of $X$ given
$Y$ is given by Bayes theorem as

$$\Pr(X|Y) = \frac{\Pr(X,Y)}{\Pr(Y)} \tag{1}$$

where $\Pr(Y)$ is the probability distribution of $Y$. 

Similarly, the conditional probability of $Y$ given $X$ is

$$\Pr(Y|X)=\frac{\Pr(X,Y)}{\Pr(X)},$$ 

which upon rearranging gives

$$\Pr(X,Y)=\Pr(Y|X)\Pr(X). \tag{2}$$

Then, substituting (2) in (1) gives

$$\begin{aligned}
\Pr(X|Y) &= \frac{\Pr(X,Y)}{\Pr(Y)}\\
 &= \frac{\Pr(Y|X)\Pr(X)}{\Pr(Y)},
\end{aligned}$$

which is the form of Bayes theorem that is used for inference about $X$
given $Y$.
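
The identity can be checked numerically on a small discrete example. The sketch below is illustrative only: the joint distribution `joint` and all function names are assumptions made up for this check, not part of the original notes. It computes $\Pr(X|Y)$ once from definition (1) and once from the rearranged form, and confirms that the two agree.

```python
from fractions import Fraction

# Hypothetical joint distribution Pr(X, Y) for two binary variables.
# The numbers are illustrative assumptions; keys are (x, y) pairs.
joint = {
    (0, 0): Fraction(3, 10),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(2, 10),
    (1, 1): Fraction(4, 10),
}

def marginal_x(x):
    """Pr(X = x), obtained by summing the joint over y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_y(y):
    """Pr(Y = y), obtained by summing the joint over x."""
    return sum(p for (_, yi), p in joint.items() if yi == y)

def cond_x_given_y(x, y):
    """Equation (1): Pr(X | Y) = Pr(X, Y) / Pr(Y)."""
    return joint[(x, y)] / marginal_y(y)

def cond_y_given_x(y, x):
    """Pr(Y | X) = Pr(X, Y) / Pr(X)."""
    return joint[(x, y)] / marginal_x(x)

def bayes_x_given_y(x, y):
    """Rearranged form: Pr(X | Y) = Pr(Y | X) Pr(X) / Pr(Y)."""
    return cond_y_given_x(y, x) * marginal_x(x) / marginal_y(y)

# Both routes give the same conditional probability for every (x, y).
for x in (0, 1):
    for y in (0, 1):
        assert cond_x_given_y(x, y) == bayes_x_given_y(x, y)

print(cond_x_given_y(1, 1))  # 4/5
```

Exact fractions are used here instead of floating point so that the equality check is not affected by rounding.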

Bayes Theorem by Example
----------------------------------------------

Here we consider a simple example to justify formula (1). The following
table gives the joint distribution of smoking and lung cancer in a
hypothetical population of 1,000,000 individuals.

$$
\begin{array}{l|rrr}
 & \text{Smoker: Yes} & \text{Smoker: No} & \text{Total} \\
\hline
\text{Cancer: Yes} & 42,500 & 7,500 & 50,000 \\
\text{Cancer: No} & 207,500 & 742,500 & 950,000 \\
\hline
\text{Total} & 250,000 & 750,000 & 1,000,000
\end{array}
$$

Given these numbers, consider how you would compute the relative
frequency of lung cancer among smokers. There are a total of 250,000
smokers in this population, and among these 250,000 individuals, 42,500
have lung cancer. So the relative frequency of lung cancer among smokers
is $\frac{42,500}{250,000}$. As we reason below, this relative frequency
is also the conditional probability of lung cancer given that the
individual is a smoker.
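
As a quick arithmetic check, the counts from the table can be entered directly and the relative frequency computed from them. This is a minimal sketch; the dictionary layout and the names `counts`, `smokers`, and `rel_freq_cancer_among_smokers` are assumptions chosen for illustration.

```python
# Counts from the table, keyed by (smoking status, cancer status).
counts = {
    ("smoker", "cancer"): 42_500,
    ("nonsmoker", "cancer"): 7_500,
    ("smoker", "no_cancer"): 207_500,
    ("nonsmoker", "no_cancer"): 742_500,
}

# Total population and number of smokers, recovered from the cell counts.
population = sum(counts.values())
smokers = sum(n for (status, _), n in counts.items() if status == "smoker")

# Relative frequency of lung cancer among smokers.
rel_freq_cancer_among_smokers = counts[("smoker", "cancer")] / smokers

print(population)                     # 1000000
print(smokers)                        # 250000
print(rel_freq_cancer_among_smokers)  # 0.17
```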

1. The frequentist definition of probability of an event is the
 limiting value of its relative frequency in a “large” number of
 trials.

2. Suppose we sample individuals with replacement from the 250,000
 smokers and compute the relative frequency of lung cancer in the
 sample.

3. It can be shown that as the sample size goes to infinity, this
 relative frequency will approach $\frac{42,500}{250,000}=0.17$.

4. This ratio can also be written as
 $$\frac{42,500/1,000,000}{250,000/1,000,000}=0.17.$$

5. The ratio in the numerator is the joint probability of smoking and
 lung cancer, and the ratio in the denominator is the marginal
 probability of smoking, so this conditional probability has exactly
 the form of formula (1).
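
The sampling argument in steps 2 to 4 can be sketched as a small simulation. This is an illustration under stated assumptions rather than part of the original notes: the function name `simulated_relative_frequency`, the fixed seed, and the labelling of smokers by integers are all hypothetical choices. It draws smokers with replacement, tracks the relative frequency of lung cancer, and compares it with the ratio of the joint probability to the marginal probability.

```python
import random
from fractions import Fraction

POPULATION = 1_000_000
SMOKERS = 250_000
SMOKERS_WITH_CANCER = 42_500

def simulated_relative_frequency(n_draws, seed=0):
    """Draw smokers with replacement; return the fraction with lung cancer."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    hits = 0
    for _ in range(n_draws):
        # Label the smokers 0..249,999 and treat the first 42,500 labels
        # as the individuals with lung cancer.
        if rng.randrange(SMOKERS) < SMOKERS_WITH_CANCER:
            hits += 1
    return hits / n_draws

# Step 4: joint probability divided by marginal probability.
joint = Fraction(SMOKERS_WITH_CANCER, POPULATION)  # Pr(smoker, cancer)
marginal = Fraction(SMOKERS, POPULATION)           # Pr(smoker)
conditional = joint / marginal                     # Pr(cancer | smoker)

# The simulated relative frequency should settle near the exact value
# as the number of draws grows.
for n in (100, 10_000, 200_000):
    print(n, simulated_relative_frequency(n))
print("exact:", float(conditional))
```

With a small number of draws the simulated value can be noticeably off; it is only in the limit of many draws that it settles at the conditional probability.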

