#+AUTHOR: Arnaud Legrand
#+TITLE: Introduction to Probabilities and Statistics
#+DATE: MOSIG Lecture
#+STARTUP: beamer overview indent
#+TAGS: noexport(n)
#+LaTeX_CLASS: beamer
#+LaTeX_CLASS_OPTIONS: [11pt,xcolor=dvipsnames,presentation]
#+OPTIONS: H:3 num:t toc:nil \n:nil @:t ::t |:t ^:nil -:t f:t *:t <:t
#+LATEX_HEADER: \input{org-babel-style-preembule.tex}
#+LATEX_HEADER: \usepackage{commath}
#+LaTeX: \input{org-babel-document-preembule.tex}
* List                                                             :noexport:
** TODO Jean-Marc
- I did not mention the moment generating series, which would get confused
  with the generating function in the discrete case $\Esp t^X$, whose
  coefficients at 0 are the probabilities and whose coefficients at 1 are the
  moments
- Actually $M_X$ is mostly a computational tool; there are many difficulties
  in properly defining its domain when the variables are real-valued
  (singularities, ...). I think the right intuition should be given (the
  Fourier transform, of which $M_X$ is an avatar).
- On the other hand, I did present the characteristic function in class,
  together with the relevant properties
- Before the Bienaymé-Chebyshev inequality one should present Markov's
  inequality, which is really fundamental: it links the expectation and the
  cdf. Chebyshev then appears as an obvious corollary and the process can be
  generalized (which is done later when studying large deviations)
- The last limit on slide 21 is not correct; it is better to say that we have
  an approximation $\frac{S_n}{n} \simeq \mathcal{N}(\mu, \frac{\sigma^2}{n})$
- Regarding notation, since there are several types of convergence in
  probability theory (in law, in probability, almost sure, in $L^1$, ...,
  $L^p$, ...), it is better to always be explicit; here we have convergence in
  law, which I denote $\stackrel{\longrightarrow}{\mathcal{L}}$
- Slide 25: to match the variances one has to divide by $\sqrt{2}$ (this
  explains the $\sqrt{n}$ in the formula)
* Brief reminder on probabilities
** 
*** Probabilities
\vspace{-.3em}
- Using probabilities makes it possible to model *uncertainty* that may result
  from *incomplete information* or *imprecise measurements*
\pause

A *random variable* (or stochastic variable) is, roughly speaking, a variable
whose value results from a measurement (or an observation)

You can think of it as a *small box*:
- Every time you open the box, you get a _different_ value.
- I will use this box analogy throughout the whole lecture and I encourage you
  to ask yourself what the box can be in your own studies\medskip
\pause
- Formally a *probability space* is defined by $(\Omega, \F, \P)$ where:
  - $\Omega$, the *sample space*, is the set of all possible *outcomes*
    - \Eg all the possible combinations of your DNA with that of your {girl|boy}friend
    - You may or may not be able to observe the outcome directly.
  \pause
  - \F is the set of *events*, where an event is a set containing zero or more outcomes
    - \Eg the event "the DNA corresponds to a girl with blue eyes"
    - An event is somehow more tangible and can generally be observed
  \pause
  - The *probability measure* $\P:\F \to [0,1]$ is a function returning an event's probability
    #+LaTeX: ($\P$("having a brown-eyed baby girl") = 0.0005)
*** Continuous random variable
- A *random variable* associates a *numerical value* with *outcomes*
  \begin{equation*}
    \rv{X}: \Omega \to \R
  \end{equation*}
  \vspace{-1.6em}
  - \Eg the weight of the baby at birth (assuming it solely depends on DNA,
    which is quite false, but it's for the sake of the example)
- Since many computer science experiments are based on time measurements, we
  focus on *continuous* variables
- \textbf{Note:} To distinguish random variables, which are complex objects,
  from other mathematical objects, they will always be written in blue capital
  letters in this set of slides (\eg \rv{X})
- The probability measure on \Omega induces probabilities on the *values* of \rv{X}
  - $\P(\rv{X}=0.5213)$ is generally 0 as the outcome never exactly matches
  - $\P(0.5213\leq\rv{X}\leq0.5214)$ may however be non-zero
*** Probability distribution
#+begin_src R :results output graphics :file "pdf_babel/Gamma_distribution.pdf" :exports none :width 6 :height 3 :session
library(ggplot2)
library(ggthemes)
df = data.frame(x=c(-2,10), y=c(0,.3))
func = dgamma
pfunc = pgamma
args = list(shape = 3)
xmin = 1
xmax = 6
x = seq(from=xmin,to=xmax,length.out=50)
y = do.call(func,c(list(x=x),args))
area = data.frame(x=x, y=y)
integral = diff(range(do.call(pfunc,c(list(q=c(xmin,xmax)),args))))
label = paste("P(paste(",xmin," <= X) <= ",xmax,") == ", integral)
r2.value <- 0.90
p = ggplot(data=df,aes(x=x,y=y)) + geom_point(size=0) + theme_classic() +
    stat_function(fun = func, colour = "darkgreen", args = args) +
    geom_area(data=area,aes(x=x,y=y),fill="lightskyblue2") +
    geom_text(x=.7*max(df$x),y=.7*max(df$y), label=label, parse=T) +
    ylab(expression(paste(f[X],"(",italic(w),")"))) + xlim(df$x) +
    xlab(expression(italic(w)))
p
# ggsave(p,file="pdf_babel/Gamma_distribution.pdf",width=6,height=4)
#+end_src

#+RESULTS:
[[file:pdf_babel/Gamma_distribution.pdf]]

A *probability distribution* (a.k.a. *probability density function* or
p.d.f.) is used to describe the probabilities of different *values* occurring
- A random variable \rv{X} has density $f_X$, where $f_X$ is a non-negative
  and integrable function, if: \qquad\quad
  $\displaystyle \boxed{\P[a \leq \rv{X} \leq b] = \int_a^b f_X(w) \, \dif w}$
  (see the numerical check at the end of this slide)
#+BEGIN_EXPORT latex
%\vspace{-.5em}
\begin{columns}
  \begin{column}{.7\linewidth}
    \includegraphics[width=\linewidth]{pdf_babel/Gamma_distribution.pdf}
  \end{column}
  \begin{column}{.3\linewidth}
    \begin{boxedminipage}{1.1\linewidth}
      \scriptsize Note: \texttt{the} $\text{X}$ in \hbox{$1\leq\text{X}\leq6$}
      \textit{should be in blue...}
    \end{boxedminipage}
  \end{column}
\end{columns}
#+END_EXPORT
# \vspace{-.5em}
- \textbf{Note}: people often confuse the sample space with the random
  variable. Try to make the distinction when modeling your system, it will
  help you.
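A minimal numerical check of the boxed formula, using the same Gamma density
(shape 3) and the same bounds as in the figure:
#+begin_src R :results output :exports both :session
## P(1 <= X <= 6) for X ~ Gamma(shape = 3), computed in two equivalent ways
pgamma(6, shape = 3) - pgamma(1, shape = 3)          # via the cumulative distribution function
integrate(dgamma, lower = 1, upper = 6, shape = 3)   # by integrating the density f_X directly
#+end_src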
*** Characterizing a random variable
The probability density function *fully characterizes* the random variable,
but it is also a complex object
- It may be symmetrical or not
- It may have one or several *modes*
- It may have a bounded support or not, hence the random variable may have a
  *minimal* and/or a *maximal* value
- The *median* cuts the probabilities in half
#+begin_src R :results output graphics :file "pdf_babel/distribution_characteristics.pdf" :exports none :width 6 :height 3 :session
library(ggplot2)
library(ggthemes)
xmin = -2
xmax = 10
ymin = 0
ymax = .3
func = dgamma
pfunc = pgamma
args = list(shape = 3)
x = seq(from=xmin,to=xmax,length.out=500)
y = do.call(pfunc,c(list(q=x),args))
dfminx = data.frame(x=x,y=y)
minx = tail(dfminx[dfminx$y==0,],n=1)$x
if(length(minx)==0) {minx=NA}
maxx = head(dfminx[dfminx$y==1,],n=1)$x
if(length(maxx)==0) {maxx=NA}
medianx = tail(dfminx[dfminx$y<.5,],n=1)$x
if(length(medianx)==0) {medianx=NA}
y = do.call(func,c(list(x=x),args))
dfminx = data.frame(x=x,y=y)
modex = dfminx[dfminx$y==max(dfminx$y),]$x
espx = sum(dfminx$x*dfminx$y)*diff(range(head(dfminx$x,n=2)))
dfstat = data.frame(name = c("min", "median", "max","mode","expected value"),
                    x = c(minx,medianx,maxx,modex,espx), y = ymax)
p = ggplot(data=dfstat,aes(x=x,y=y,color=name)) + geom_line(alpha=1) +
    xlim(xmin,xmax) + ylim(ymin,ymax) + theme_classic() + ylab("f(x)") +
    stat_function(fun = func, colour = "black", args = args) +
    geom_vline(aes(xintercept=x,color=name)) +
    guides(colour = guide_legend(""))
p
# ggsave(p,file="pdf_babel/Gamma_distribution.pdf",width=6,height=4)
#+end_src

#+RESULTS:
[[file:pdf_babel/distribution_characteristics.pdf]]

#+BEGIN_CENTER
\includegraphics[width=.7\linewidth]{pdf_babel/distribution_characteristics.pdf}
#+END_CENTER
\vspace{-1em}
\bf These are interesting aspects of $\mathbf{f_X}$ but they barely summarize it
*** Expected value and variance
- When one speaks of the "expected price", "expected height", etc. one means
  the *expected value* of a random variable that is a price, a height, etc.
  \begin{align*}
    \E[\rv{X}] & = x_1p_1 + x_2p_2 + \ldots + x_kp_k = \int_{-\infty}^\infty x f_X(x)\, \dif x
  \end{align*}
  The expected value of \rv{X} is the ``average value'' of \rv{X}.\smallskip

  It is \textbf{not} the most probable value. The mean is _one_ aspect of the
  distribution of \rv{X}. The *median* or the *mode* are other interesting
  aspects.
- The *variance* is a measure of how far the values of a random variable are
  spread out from each other. If a random variable \rv{X} has the expected
  value (mean) $\mu = \E[\rv{X}]$, then the variance of \rv{X} is given by:
  #+BEGIN_EXPORT latex
  \begin{align*}
    \Var(\rv{X}) &= \E\left[(\rv{X} - \mu)^2 \right] = \int_{-\infty}^\infty (x-\mu)^2 f_X(x)\, \dif x
  \end{align*}
  #+END_EXPORT
- The *standard deviation $\sigma$* is the square root of the variance. This
  normalization allows one to compare it with the expected value
* Moment Generating Function
** Intuitions
*** Definition
Working with the density function is not always convenient, especially when
*summing* random variables (it implies *convolving* the pdf, as illustrated by
the short simulation below). We need an /alternate representation/.\medskip

How could we summarize a random variable?
- By its mean, its variance, its skewness, \dots by its moments $\mu_k = \E(\rv{X}^k)$
- It is not clear whether this would be sufficient, although we would know a
  lot about $f_{\rv{X}}$.
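A minimal illustration of this convolution effect (the figure name
=sum_convolution.pdf= is only an illustrative choice, not part of the original
deck): the sum of two independent Uniform(0,1) variables has the triangular
density on $[0,2]$.
#+begin_src R :results output graphics :file pdf_babel/sum_convolution.pdf :exports none :width 6 :height 3 :session
## Density of X1 + X2 for two independent Uniform(0,1) variables:
## the histogram of the sum matches the convolution of the two densities,
## i.e. the triangular density 1 - |x - 1| on [0,2].
library(ggplot2)
s = runif(50000) + runif(50000)
ggplot(data.frame(s=s), aes(x=s)) +
  geom_histogram(aes(y = ..density..), binwidth=.05) +
  stat_function(fun = function(x) pmax(0, 1 - abs(x - 1)), colour = "darkgreen") +
  theme_classic() + xlab("x") + ylab("density")
#+end_src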
Let's define the *moment generating function* \M as follows:
#+BEGIN_EXPORT latex
\begin{align*}
  \M & = \E\left(e^{t\rv{X}}\right) = \E\left(\sum_{k=0}^\infty \frac{t^k\rv{X}^k}{k!}\right)
       = \sum_{k=0}^\infty \frac{t^k\E\left(\rv{X}^k\right)}{k!} = \sum_{k=0}^\infty \mu_k\frac{t^k}{k!}\\
     & = \int e^{tx}f_{\rv{X}}(x)dx
\end{align*}
#+END_EXPORT
*** Deriving moments with the mgf
Remember we have $\displaystyle \M = \sum_{k=0}^\infty \mu_k\frac{t^k}{k!}$

Therefore $\displaystyle \frac{d^n \MM}{dt^n}(0) = \mu_n$ \medskip

All the moments of $\rv{X}$ are encoded in \M. Is there more?
*** Characterization of a distribution through the mgf
Let's assume that $\rv{X}$ is discrete $\left((x_1,p_1),\dots, (x_n,p_n)\right)$
with $x_1<\dots<x_n$. Then:
#+BEGIN_EXPORT latex
$$\M = p_1e^{tx_1} + p_2e^{tx_2} + \dots + p_ne^{tx_n}$$
#+END_EXPORT
Knowing \M, one can recover the $(x_i,p_i)$ (the dominant exponential term as
$t\to\infty$ reveals $x_n$ and $p_n$; subtract it and iterate), hence the whole
distribution of $\rv{X}$. More generally, two random variables with the same
mgf (finite in a neighborhood of 0) have the same distribution: \M
*characterizes* the distribution of $\rv{X}$.
** Law of Large Numbers
*** Chebyshev's inequality
**** Chebyshev's inequality
Let \rv{X} be a random variable with finite expected value $\mu$ and finite
variance $\Var(\rv{X})$. Let $\epsilon>0$ be any positive real number. Then
$\P(|\rv{X}-\mu|\geq \epsilon) \leq \frac{\Var(\rv{X})}{\epsilon^2}$.
**** Proof
#+BEGIN_EXPORT latex
\begin{align*}
  \Var(\rv{X}) = \int (x-\mu)^2f(x).dx & \geq \int_{|x-\mu|\geq\epsilon} (x-\mu)^2f(x).dx \\
  & \geq \int_{|x-\mu|\geq\epsilon} \epsilon^2f(x).dx = \epsilon^2 \underbrace{\int_{|x-\mu|\geq\epsilon} f(x).dx}_{\P(|\rv{X}-\mu|\geq \epsilon)}
\end{align*}
#+END_EXPORT
*** Law of Large Numbers
**** Law of Large Numbers
Let \rv{X_1}, \rv{X_2}, \dots, \rv{X_n} be a sequence of independent and
identically distributed random variables with finite expected value
$\mu=\E(\rv{X_i})$ and finite variance $\sigma^2=\Var(\rv{X_i})$.

Let $\rv{S_n} = \rv{X_1} + \rv{X_2} + \dots + \rv{X_n}$. Then for any
$\epsilon>0$, $\P(|\rv{S_n}/n-\mu|\geq \epsilon) \xrightarrow[n\to\infty]{} 0$.
**** Proof
The $\rv{X_i}$ are i.i.d., hence:
- $\Var(\rv{S_n}) = n.\sigma^2 \leadsto \Var(\rv{S_n}/n) = \sigma^2/n$.
- $\E(\rv{S_n}/n) = \mu$.
Using Chebyshev's inequality:
#+BEGIN_EXPORT latex
$$\P(|\rv{S_n}/n-\mu|\geq \epsilon) \leq \frac{\sigma^2}{n\epsilon^2} \xrightarrow[n\to\infty]{} 0 \text{ (for a fixed $\varepsilon$)}$$
#+END_EXPORT
*** Illustration: convergence in probability
#+begin_src R :results output graphics :file pdf_babel/tcl_large_numbers.pdf :exports results :width 7 :height 2 :session
library(ggplot2)
N=10000;
repsamp = function(N,r) {
  x = 0 ;
  for(i in 1:r) {
    x = x + sample(x=c(0,1),replace=T,N);
  }
  x = x/r
  data.frame(x=x,r=r)
}
df = rbind(repsamp(N,1), repsamp(N,10), repsamp(N,100), repsamp(N,1000),repsamp(N,10000));
ggplot(df,aes(x=x)) + geom_histogram(binwidth=.05) + facet_wrap(~r,nrow=1) +
  theme_bw() + scale_x_continuous(breaks=c(0,.5,1)) + xlab("Sn")
#+end_src

#+RESULTS:
[[file:pdf_babel/tcl_large_numbers.pdf]]

So we do converge to a spike, but how?

Assume $\sigma=1$ and that we aim at a precision of $\epsilon=.1$. For
$n=500$, the previous formula only gives us
$\P(|\rv{S_n}/n-\mu|\geq \epsilon) \leq \frac{\sigma^2}{n\epsilon^2} = \frac{100}{n} = 0.2 \quad\frowny$

In general, for an $\alpha$ confidence interval (i.e.,
$\P(|\rv{S_n}/n-\mu|\le \delta)\ge\alpha$), we get
$\delta=\frac{1}{\sqrt{1-\alpha}}.\frac{\sigma}{\sqrt{n}}$ (checked numerically
below)

| $\alpha$ | Chebyshev's Range $\frowny$   | CLT range $\smiley$           |
|----------+-------------------------------+-------------------------------|
| .95      | $4.47\frac{\sigma}{\sqrt{n}}$ | $1.96\frac{\sigma}{\sqrt{n}}$ |
| .999     | $31.6\frac{\sigma}{\sqrt{n}}$ | $3.29\frac{\sigma}{\sqrt{n}}$ |
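These multipliers can be recovered numerically; a minimal check with the same
assumptions as above ($\sigma=1$, $\epsilon=.1$, $n=500$):
#+begin_src R :results output :exports both :session
## Chebyshev bound vs CLT approximation for P(|Sn/n - mu| >= eps)
sigma = 1; eps = .1; n = 500
sigma^2 / (n * eps^2)               # Chebyshev bound: 0.2
2 * pnorm(-eps * sqrt(n) / sigma)   # CLT approximation: ~0.025
## Half-width multipliers of the table (in units of sigma/sqrt(n))
sapply(c(.95, .999), function(a) 1 / sqrt(1 - a))          # Chebyshev: 4.47, 31.6
sapply(c(.95, .999), function(a) qnorm(1 - (1 - a) / 2))   # CLT: 1.96, 3.29
#+end_src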
** Central Limit Theorem
*** Central Limit Theorem [\textbf{CLT}]
- Let $\{\rv{X_1}, \rv{X_2}, \dots, \rv{X_n}\}$ be a random sample of size $n$
  (\ie a sequence of *independent* and *identically distributed* random
  variables with expected value $\mu$ and variance $\sigma^2$)
- We know that $\E(\rv{S_n}/n) = \mu$ and $\Var(\rv{S_n}) = n\sigma^2$.
- Let's define the *standardized mean* of these random variables as:
  #+BEGIN_EXPORT latex
  $$\displaystyle \rv{S^*_n} = \frac{\rv{S_n}-n\mu}{\sqrt{n\sigma^2}}$$
  #+END_EXPORT
  We have $\E(\rv{S^*_n}) = 0$ and $\Var(\rv{S^*_n}) = 1$.
- For large $n$, the distribution of $\rv{S^*_n}$ is approximately *normal*
  (convergence in law)
  #+BEGIN_EXPORT latex
  \begin{equation*}
    \rv{S^*_n} \xrightarrow[n\to\infty]{\mathcal{L}} \N\left(0,1\right)
  \end{equation*}
  #+END_EXPORT
  Or equivalently, for large $n$ (an approximation rather than a limit, since
  the right-hand side still depends on $n$):
  #+BEGIN_EXPORT latex
  \begin{equation*}
    \frac{\rv{S_n}}{n} \approx \N\left(\mu,\frac{\sigma^2}{n}\right)
  \end{equation*}
  #+END_EXPORT
*** CLT Illustration: the mean smooths distributions
#+begin_src R :results output graphics :file "pdf_babel/CLT_illustration.pdf" :exports none :width 9 :height 6 :session
library(ggplot2)
library(ggthemes)
triangle <- function(n=10) { sqrt(runif(n)) }
broken <- function(n=10) { x=runif(n); x/(1-x); }
broken_mid <- function(n=10) { x=(runif(n)+runif(n))/2; x/(1-x); }
generate <- function(n=50000,N=c(1,2,5,10,15,20,30,100), law=c("unif","binom","triangle")) {
  df=data.frame();
  for(l in law) {
    for(p in N) {
      X=rep.int(0,n);
      for(i in 1:p) {
        X = X + switch(l, unif = runif(n), binom = rbinom(n,1,.5),
                       exp=rexp(n,rate = 2), norm=rnorm(n,mean = .5),
                       triangle=triangle(n)-1/6, broken=broken(n),
                       broken_mid=broken_mid(n));
      }
      X = X/p;
      df=rbind(df,data.frame(N=p,SN=X,law=l));
    }
  }
  df;
}
d=generate()
ggplot(data=d,aes(x=SN)) + geom_density(aes(y = ..density..)) + facet_grid(law~N) +
  theme_classic() + xlab("") + scale_x_continuous(breaks=c(0,.5,1))
#+end_src

#+RESULTS:
[[file:pdf_babel/CLT_illustration.pdf]]

Start with an *arbitrary* distribution and compute the distribution of
$\rv{S_n}/n$ for increasing values of $n$.
#+BEGIN_CENTER
#+LaTeX: \includegraphics<1>[width=.8\linewidth]{pdf_babel/CLT_illustration.pdf}
#+END_CENTER
*** The Normal distribution
#+BEGIN_EXPORT latex
\begin{overlayarea}{\linewidth}{4.5cm}
  \begin{center}%
    \includegraphics<1>[height=4.5cm]{pdf_babel/normal_distribution.pdf}%
    \includegraphics<2>[height=4.5cm]{images/Standard_deviation_diagram.pdf}%
  \end{center}
  \vspace{-5.7cm}
  \begin{flushright}
    \fbox{\textbf{Density}: $f_{\mu,\sigma}(x)= \frac{1}{\sigma\sqrt{2\pi}} e^{ -\frac{(x-\mu)^2}{2\sigma^2} }$}
  \end{flushright}
\end{overlayarea}
\uncover<1->{The smaller the variance the more ``spiky'' the distribution.}
\uncover<2->{
#+END_EXPORT
- Dark blue is less than one standard deviation from the mean: \approx 68% of the set.
- Two standard deviations from the mean (medium and dark blue): \approx 95%
- Three standard deviations (light, medium, and dark blue): \approx 99.7%
  (these percentages follow directly from the normal cdf, see the check below)
#+LaTeX: }
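A one-line check of these percentages from the normal cdf:
#+begin_src R :results output :exports both :session
## P(|X - mu| <= k*sigma) for k = 1, 2, 3 and a normal variable
sapply(1:3, function(k) pnorm(k) - pnorm(-k))
#+end_src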
*** The Normal distribution (property 1)
The family of normal distributions is *closed under linear transformations*:
if \rv{X} is normally distributed with mean $\mu$ and standard deviation
$\sigma$, then the variable \rv{Y} = a\rv{X} + b is also normally distributed,
with mean $a\mu + b$ and standard deviation $|a|\sigma$.
#+begin_src R :results output graphics :file pdf_babel/normal_distribution_linearity.pdf :exports results :width 11 :height 4 :session
par(mfrow=c(1,2))
hist(rnorm(10000,mean=0,sd=1),breaks=30)
hist(3*rnorm(10000,mean=0,sd=1)+10,breaks=30)
par(mfrow=c(1,1))
#+end_src

#+RESULTS:
[[file:pdf_babel/normal_distribution_linearity.pdf]]
*** The Normal distribution (property 2)
*Convolution*: if \rv{X_1} and \rv{X_2} are two independent normal random
variables, with means $\mu_1$, $\mu_2$ and standard deviations $\sigma_1$,
$\sigma_2$, then their sum $\rv{X_1} + \rv{X_2}$ will also be normally
distributed, with mean $\mu_1 + \mu_2$ and variance $\sigma_1^2 + \sigma_2^2$.
#+begin_src R :results output graphics :file pdf_babel/normal_distribution_convolution.pdf :exports results :width 11 :height 4 :session
hist(rnorm(10000,mean=2,sd=3) + rnorm(10000,mean=3,sd=4),breaks=30)
#+end_src

#+RESULTS:
[[file:pdf_babel/normal_distribution_convolution.pdf]]

Intuitively, if $\rv{S^*_n}$ converges to something (say $\L$), it "/has to/"
be a normal distribution (the $\frac{1}{\sqrt{2}}$ is what keeps the variance
equal to 1):
#+BEGIN_EXPORT latex
$$\frac{1}{\sqrt{2}}(\underbrace{\rv{S^*_{1\dots n}}}_{\sim\L} + \underbrace{\rv{S^*_{n+1\dots2n}}}_{\sim\L}) = \underbrace{\rv{S^*_{2n}}}_{\sim\L}$$
#+END_EXPORT
*** Moment generating function of the normal distribution
Let's assume $\rv{X}\sim\N(0,1)$.
#+BEGIN_EXPORT latex
\begin{align*}
  \M & = \int e^{tx}f_\N(x).dx = \int e^{tx}\frac{e^{-\frac{x^2}{2}}}{\sqrt{2\pi}}dx = \int \frac{e^{\frac{1}{2}(-x^2+2tx)}}{\sqrt{2\pi}}dx \\
     & = \int \frac{e^{\frac{1}{2}(-(x-t)^2+t^2)}}{\sqrt{2\pi}}dx = e^{\frac{t^2}{2}} \int \frac{e^{\frac{-(x-t)^2}{2}}}{\sqrt{2\pi}}dx = e^{\frac{t^2}{2}} \int \frac{e^{\frac{-x^2}{2}}}{\sqrt{2\pi}}dx \\
     & = e^{\frac{t^2}{2}}
\end{align*}
#+END_EXPORT
Actually, if we assume $\rv{X}\sim\N(\mu,\sigma^2)$, one can easily prove in
the same way that:
#+BEGIN_EXPORT latex
\begin{align*}
  \M = e^{\mu t + \frac{1}{2}\sigma^2t^2}
\end{align*}
#+END_EXPORT
*** Proof of the CLT
#+BEGIN_EXPORT latex
$\boxed{
  \begin{array}{l}
    \M = \E(e^{t\rv{X}}) = 1+\E(\rv{X})\,t+ \E(\rv{X}^2)\frac{t^2}{2}+o(t^2)\\
    \leadsto \log(\M[\rv{X-\mu}]) = \sigma^2\frac{t^2}{2}+o(t^2)
  \end{array}
}$ \hfill
$\boxed{\begin{cases}\rv{S_n}=\rv{X_1}+\dots+\rv{X_n}\\ \rv{S^*_n}=\frac{\rv{S_n}-n\mu}{\sigma\sqrt{n}}\end{cases}}$
#+END_EXPORT
We have:
#+BEGIN_EXPORT latex
\begin{align*}
  \M[\rv{S^*_n}] & = \E(e^{t\rv{S^*_n}}) = \E(e^{t{\frac{\rv{S_n}-n\mu}{\sigma\sqrt{n}}}}) = \E(e^{\frac{t}{\sigma\sqrt{n}}(\rv{S_n}-n\mu)}) = \MM[\rv{S_n-n\mu}]\left(\frac{t}{\sigma\sqrt{n}}\right)\\
  & = \Bigg(\MM[\rv{X-\mu}]\Big(\underbrace{\frac{t}{\sigma\sqrt{n}}}_{\xrightarrow[n\to\infty]{} 0}\Big)\Bigg)^n \text{\qquad (since $\M[\rv{X}+\rv{Y}]=\M[\rv{X}]\M[\rv{Y}]$ for independent variables)}\\
  & = \exp\left(\!n\log\left(\!\MM[\rv{X-\mu}]\left(\!\frac{t}{\sigma\sqrt{n}}\!\right)\!\right)\!\right) = \exp\left(\!n\left(\!\sigma^2\frac{t^2}{2n\sigma^2}+o\left(\!\frac{t^2}{n}\!\right)\!\right)\!\right)\\
  & = \exp\left(\frac{t^2}{2} + n\,o(t^2/n)\right) \xrightarrow[n\to\infty]{} e^{t^2/2} \text{, which is the mgf of $\N(0,1)$}\qed
\end{align*}
#+END_EXPORT
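A quick Monte-Carlo sanity check that $e^{t^2/2}$ is indeed the mgf of
$\N(0,1)$ (the sample size is an illustrative choice):
#+begin_src R :results output :exports both :session
## E(exp(t*X)) for X ~ N(0,1) should be close to exp(t^2/2)
t = c(-1, .5, 1, 2)
X = rnorm(1e6)
rbind(empirical = sapply(t, function(s) mean(exp(s*X))), exact = exp(t^2/2))
#+end_src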
*** CLT = convergence of laws
The law of \rv{S^*_n} converges to $\N(0,1)$. In other words, whatever the
initial law of $\rv{X}$:
#+LaTeX: $$\lim_{n\to\infty} \P[a<\rv{S^*_n}<b] = \frac{1}{\sqrt{2\pi}}\int_a^b e^{-\frac{x^2}{2}} \dif x$$
*** Confidence interval for the mean
#+begin_src R :results output graphics :file pdf_babel/CI_illustration.pdf :exports none :session
library(ggplot2)
## The original setup of this block was lost; the values below are
## illustrative assumptions: n sample means, each computed from N samples.
n = 100                                      # number of repeated experiments (assumption)
N = 1000                                     # number of samples per experiment (assumption)
mu = 1776                                    # true mean (assumption)
sigma = 10                                   # true standard deviation (assumption)
X = replicate(n, mean(rnorm(N, mean = mu, sd = sigma)))  # observed sample means
ci = 2 * sigma / sqrt(N)                     # ~95% confidence half-width
length(X[X >= 1775.5 & X <= 1776.6])/length(X)
df = data.frame(x = X, y = seq(1:length(X)))
df$valid = 1
df[abs(df$x - mu) > ci, ]$valid = 0
ggplot(df, aes(x = x, y = y, color = factor(valid))) + geom_point() +
  geom_errorbarh(aes(xmax = x - ci, xmin = x + ci)) + geom_vline(xintercept = mu) +
  theme_classic() + guides(colour = guide_legend("")) + xlim(mu-3*ci,mu+3*ci) +
  ylab("Trial #") + xlab("Observation: sample mean with \nconfidence interval") +
  coord_flip() + ggtitle(paste(n," observations of the mean of ",N," samples"))
#+end_src

#+RESULTS:
[[file:pdf_babel/CI_illustration.pdf]]

#+BEGIN_EXPORT latex
\begin{overlayarea}{\linewidth}{4.5cm}
  \begin{center}%
    \includegraphics<1>[height=4.5cm]{images/Standard_deviation_diagram.pdf}%
    \includegraphics<2>[height=4.5cm]{pdf_babel/CI_illustration.pdf}%
  \end{center}
\end{overlayarea}
#+END_EXPORT
When $n$ is large:
#+BEGIN_EXPORT latex
\begin{center}
  \scalebox{.9}{$\displaystyle \P\left(\mu\in \left[\frac{\rv{S_n}}{n}-2\frac{\sigma}{\sqrt{n}},\frac{\rv{S_n}}{n}+2\frac{\sigma}{\sqrt{n}}\right]\right) = \P\left(\frac{\rv{S_n}}{n}\in \left[\mu-2\frac{\sigma}{\sqrt{n}},\mu+2\frac{\sigma}{\sqrt{n}}\right]\right) \approx 95\%$}
\end{center}
\uncover<2>{There is a 95\% chance that the \alert{true mean} lies within $2\frac{\sigma}{\sqrt{n}}$ of the \alert{sample mean}.}
#+END_EXPORT
*** Without any particular hypothesis
- Assume you have evaluated two *alternatives* $A$ and $B$ on $n$ different *setups*
- You therefore consider the associated random variables \rv{A} and \rv{B} and
  try to *estimate* their expected values $\mu_A$ and $\mu_B$
#+BEGIN_EXPORT latex
\begin{center}
  \begin{overlayarea}{.9\linewidth}{4.5cm}
    \begin{center}%
      \includegraphics<1>[scale=.911,subfig=1]{fig/2sample_comp_1.fig}%
      \includegraphics<2>[scale=.911,subfig=2]{fig/2sample_comp_2.fig}%
      \includegraphics<3->[scale=.911,subfig=3]{fig/2sample_comp_3.fig}%
    \end{center}
  \end{overlayarea}
\end{center}
\vspace{-.8em}
\begin{overlayarea}{\linewidth}{1.5cm}%
\only<1>{The two 95\% confidence intervals do not overlap\vspace{-.8em}
  \begin{flushright}
    $\leadsto \mu_A<\mu_B$ with more than 90\% confidence \smiley
  \end{flushright}
}%
\only<2>{The two 95\% confidence intervals do overlap\vspace{-.8em}
  \begin{flushright}
    $\leadsto$ Nothing can be concluded \frowny\\
    Reduce the C.I.?
  \end{flushright}
}%
\only<3>{The two 70\% confidence intervals do not overlap\vspace{-.8em}
  \begin{flushright}
    $\leadsto\mu_A<\mu_B$ with less than 50\% confidence \frowny
    $\leadsto$ more experiments...
  \end{flushright}
}%
\only<4->{The width of the confidence interval is proportional to $\frac{\sigma}{\sqrt{n}}$\vspace{-.8em}
  \begin{flushright}
    You can estimate how many more experiments you need\smiley\\
    4 times more to halve it! \frowny Try to \alert{reduce variance} if you can...\smiley
  \end{flushright}
}
\end{overlayarea}
#+END_EXPORT
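As a quick numerical illustration of that last point (the values of $\sigma$
and $n$ are illustrative): since the confidence-interval width scales as
$\sigma/\sqrt{n}$, quadrupling $n$ halves it.
#+begin_src R :results output :exports both :session
## ~95% CI width (4*sigma/sqrt(n)) for sigma = 1: quadrupling n halves the width
sigma = 1
n = c(100, 400, 1600)
data.frame(n = n, ci.width = 4 * sigma / sqrt(n))
#+end_src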