{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fonnesbeck/Bios8366/blob/master/notebooks/Section4_5-Model-Building-with-PyMC.ipynb)\n", "\n", "# Building Models in PyMC\n", "\n", "Now that we have been introduced to PyMC and Aesara at a very high level, let's now take a more detailed look at PyMC's API as it relates to building models.\n", "\n", "Bayesian inference begins with specification of a probability model relating unknown variables to data. PyMC provides the basic building blocks for Bayesian probability models: stochastic random variables, deterministic variables, and factor potentials. \n", "\n", "A **stochastic random variable** is a factor whose value is not completely determined by its parents, while the value of a **deterministic random variable** is entirely determined by its parents. Most models can be constructed using only these two variable types. The third quantity, the **factor potential**, is *not* a variable but simply a\n", "log-likelihood term or constraint that is added to the joint log-probability to modify it. \n", "\n", "## The Distribution class\n", "\n", "A stochastic variable is represented in PyMC by a `Distribution` class. This structure adds functionality to Aesara's `aesara.tensor.random.op.RandomVariable` class, mainly by registering it with an associated PyMC `Model`. As we demonstrated in a previous section, `Distribution` objects are only usable inside of a `Model` context.\n", "\n", "`Distribution` subclasses (i.e. implementations of specific statistical distributions) will accept several arguments when constructed:\n", "\n", "`name`\n", ": Name for the new model variable. This argument is **required**, and is used as a label and index value for the variable.\n", "\n", "`shape`\n", ": The variable's shape.\n", "\n", "`total_size`\n", ": The overall size of the variable (this variable will not exist for scalars).\n", "\n", "`dims`\n", ": A tuple of dimension names known to the model.\n", "\n", "`transform`\n", ": A transformation to be applied to the distribution when used by the model, especially when the distribution is constrained.\n", "\n", "`initval`\n", ": Numeric or symbolic untransformed initial value of matching shape, or one of the following initial value strategies: \"moment\", \"prior\". Depending on the sampler's settings, a random jitter may be added to numeric, symbolic or moment-based initial values in the transformed space.\n", "\n", "`model`\n", ": The PyMC model to which the variable belongs.\n", "\n", "\n", "As we previewed in the introduction, `Distribution` has a classmethod `dist` that returns a **stateless** probability distribution of that type; that is, without being wrapped in a PyMC random variable object. Sometimes we wish to use a particular statistical distribution, without using it as a variable in a model; for example, to generate random numbers from the distribution. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "\n", "x = pm.Exponential.dist(1, size=1000)\n", "sns.histplot(x.eval());" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creation of stochastic random variables\n", "\n", "Stochastic random variables with standard distributions provided by PyMC can be created in a single line using special subclasses of the `Distribution` class. 
For example, as we have seen, the uniformly-distributed discrete variable `switchpoint` in the coal mining disasters model is created using the automatic interface as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pymc as pm\n", "\n", "with pm.Model() as disaster_model:\n", "\n", " switchpoint = pm.DiscreteUniform('switchpoint', lower=0, upper=110)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, the rate parameters can automatically be given exponential priors:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with disaster_model:\n", " early_mean = pm.Exponential('early_mean', lam=1)\n", " late_mean = pm.Exponential('late_mean', lam=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PyMC includes most of the probability density functions (for continuous variables) and probability mass functions (for discrete variables) used in statistical modeling. These distributions are divided into five distinct categories:\n", "\n", "* Univariate continuous\n", "* Univariate discrete\n", "* Multivariate\n", "* Mixture\n", "* Timeseries\n", "\n", "Probability distributions are all subclasses of `Distribution`, which in turn has two major subclasses: `Discrete` and `Continuous`. In terms of data types, a `Continuous` random variable is given whichever floating point type is defined by `aesara.config.floatX`, while `Discrete` variables are given `int16` types when `aesara.config.floatX` is `float32`, and `int64` otherwise." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "switchpoint.dtype" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "early_mean.dtype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Multivariate and Timeseries random variables are vector-valued, rather than scalar (though `Continuous` and `Discrete` variables may have non-scalar values)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "early_mean.shape.eval()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pm.MvNormal.dist(mu=np.ones(2), cov=np.eye(2)).shape.eval()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with pm.Model():\n", "\n", " p = pm.Beta('p', 1, 1, shape=(3, 3))\n", "\n", "p.shape.eval()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of the `Distribution` subclasses included in PyMC will have two key methods, `random()` and `logp()`, which are used to generate random values and compute the log-probability of a value, respectively.\n", "\n", "```python\n", "class SomeDistribution(Continuous):\n", " def __init__(...):\n", " ...\n", "\n", " def random(self, point=None, size=None):\n", " ...\n", " return random_samples\n", "\n", " def logp(self, value):\n", " ...\n", " return total_log_prob\n", "```\n", "\n", "PyMC expects the `logp()` method to return a log-probability evaluated at the passed `value` argument. This method is used internally by all of the inference methods to calculate the model log-probability that is used for fitting models. The `random()` method is used to simulate values from the variable, and is used internally for posterior predictive checks." 
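, "\n", "The user-facing counterparts of these methods can be tried out on a stand-alone distribution created with `dist`: `pm.logp` evaluates the log-probability of a value, and random values can be generated with `pm.draw`. A minimal sketch (the `pm.draw` helper is an assumption here; it is available in recent PyMC releases):\n", "\n", "```python\n", "import pymc as pm\n", "\n", "exp_dist = pm.Exponential.dist(lam=1)\n", "\n", "# Log-probability of the value 2 under an Exponential(1) distribution: log(1) - 1*2 = -2\n", "pm.logp(exp_dist, 2).eval()\n", "\n", "# Forward random draws from the same distribution\n", "pm.draw(exp_dist, draws=5)\n", "```"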
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have a well-behaved density function, we can use it in a model to build a model log-likelihood function. Almost any Aesara function can be turned into a\n", "distribution using the `DensityDist` function. For exmaple, the uniformly-distributed discrete stochastic variable `switchpoint` in the disasters model could alternatively be created from a function that computes its log-probability as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "with pm.Model():\n", " \n", " def uniform_logp(value, lower=0, upper=111):\n", " \"\"\"The switchpoint for the rate of disaster occurrence.\"\"\"\n", " return pm.math.switch((value > upper) | (value < lower), -np.inf, -np.log(upper - lower + 1))\n", "\n", " switchpoint = pm.DensityDist('switchpoint', logp=uniform_logp, dtype='int64')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pm.logp(switchpoint, 4).eval()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pm.logp(switchpoint, 44).eval()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pm.logp(switchpoint, -4).eval()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A couple of things to notice: while the function specified for the `logp` argument can be an arbitrary Python function, it must use **Aesara operators and functions** in its body. This is because one or more of the arguments passed to the function may be `TensorVariables`, and they must be supported. \n", "\n", "To emphasize, the Python function passed to `DensityDist` should compute the *log*-density or *log*-probability of the variable. That is why the return value in the example above is `-log(upper-lower+1)` rather than `1/(upper-lower+1)`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Contexts and Random Variables\n", "\n", "As we have seen, the canonical way to specify PyMC models is using a `Model` context manager. Generally speaking, a context manager is a Python idiom that does the following:\n", "\n", "```python\n", " VAR = EXPR\n", " VAR.__enter__()\n", " try:\n", " USERCODE\n", " finally:\n", " VAR.__exit__()\n", "\n", "```\n", "\n", "As an analogy, `Model` is a tape machine that records what is being added to the model; it keeps track the random variables (observed or unobserved) and other model components. The model context then computes some simple model properties, builds a **bijection** mapping that transforms between Python dictionaries and numpy/Aesara ndarrays. , More importantly, a `Model` contains methods to compile Aesara functions that take Random Variables--that are also\n", "initialised within the same model--as input.\n", "\n", "Within a model context, random variables are essentially Aesara `TensorVariables`:\n", "\n", "```python\n", "with pm.Model() as model:\n", " z = pm.Normal('z', mu=0., sigma=5.) # ==> aesara.tensor.var.TensorVariable\n", " x = pm.Normal('x', mu=z, sigma=1., observed=5.) # ==> aesara.tensor.var.TensorVariable\n", "x.logp({'z': 2.5}) # ==> -4.0439386\n", "model.logp({'z': 2.5}) # ==> -6.6973152\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Observed Random Variables\n", "\n", "Stochastic random variables whose values are observed (*i.e.* data likelihoods) are represented by a different class than unobserved random variables. 
An `ObservedRV` object is instantiated any time a stochastic variable is specified with data passed as the `observed` argument. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with disaster_model:\n", " \n", " disasters = pm.Poisson('disasters', mu=3, observed=[3,4,1,2,0,2,2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, for custom distributions, the `observed` data can be passed to `DensityDist`:\n", "\n", "```python\n", "def logp(value, failure, lam):\n", " return (failure * log(lam) - lam * value).sum()\n", "\n", "exp_surv = pm.DensityDist('exp_surv', failure, lam, logp=logp, observed=t)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An important responsibility of `ObservedRV` is to automatically handle missing values in the data, when they are present (or rather, absent). More on this later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Auto-transformation\n", "\n", "To support efficient sampling by PyMC's MCMC algorithms, any continuous variables that are constrained to a sub-interval of the real line are automatically transformed so that their support is unconstrained. This frees sampling algorithms from having to deal with boundary constraints.\n", "\n", "For example, the gamma distribution is positive-valued. If we define one for a model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with pm.Model() as model:\n", " g = pm.Gamma('g', 1, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will create an entry in the model's `value_vars` attribute, which holds the values of each random variable actually used by the model's log-likelihood.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.value_vars" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the name suggests, the variable ``g`` has been log-transformed, and this is the space over which posterior sampling takes place. When a sample is drawn, the value of the transformed variable is simply back-transformed to recover the original variable.\n", "\n", "By default, auto-transformed variables are ignored when summarizing and plotting model output, since they are not generally of interest to the user." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deterministic Variables\n", "\n", "A deterministic variable is one whose values are **completely determined** by the values of its parents. For example, in our disasters model, `rate` is a deterministic variable:\n", "\n", "```python\n", "with disaster_model:\n", " \n", " rate = pm.Deterministic('rate', pm.math.switch(switchpoint >= np.arange(112), early_mean, late_mean))\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "so `rate`'s value can be computed exactly from the values of its parents `early_mean`, `late_mean` and `switchpoint`.\n", "\n", "There are two types of deterministic variables in PyMC:\n", "\n", "#### Anonymous deterministic variables\n", "\n", "The easiest way to create a deterministic variable is to operate on or transform one or more variables in a model directly. 
For example, the simplest way to specify the `rate` variable above is as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with disaster_model:\n", " \n", " rate = pm.math.switch(switchpoint >= np.arange(112), early_mean, late_mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, let's say we wanted to use the mean of the `early_mean` and `late_mean` variables somewhere in our model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with disaster_model:\n", " \n", " mean_of_means = (early_mean + late_mean)/2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are called *anonymous* variables because we did not wrap them with a call to `Deterministic`, which takes a name as its first argument. We simply specified the variable as a Python (or, Aesara) expression. This is therefore the simplest way to construct a deterministic variable. The only caveat is that the values generated by anonymous deterministics at each iteration of an MCMC run, for example, are not recorded to the resulting trace. So, this approach is only appropriate for intermediate values in your model that you do not wish to obtain posterior estimates for, alongside the other variables in the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Named deterministic variables\n", "\n", "To ensure that deterministic variables' values are accumulated during sampling, they should be instantiated using the **named deterministic** interface; this uses the `Deterministic` function to create the variable. Two things happen when a variable is created this way:\n", "\n", "1. The variable is given a name (passed as the first argument)\n", "2. The variable is appended to the model's list of random variables, which ensures that its values are tallied.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with disaster_model:\n", " \n", " rate = pm.Deterministic('rate', pm.math.switch(switchpoint >= np.arange(112), early_mean, late_mean))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "disaster_model.named_vars" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Factor Potentials\n", "\n", "For some applications, we want to be able to modify the joint density by incorporating terms that don't correspond to probabilities of variables conditional on parents, for example:\n", "\n", "$$p(x_0, x_1, \\ldots, x_{N-1}) \\propto \\prod_{i=0}^{N-2} \\psi_i(x_i, x_{i+1})$$\n", "\n", "In other cases we may want to add probability terms to existing models. For example, suppose we want to constrain the difference between the early and late means in the disaster model to be less than 1, so that the joint density becomes: \n", "\n", "$$p(y,\\tau,\\lambda_1,\\lambda_2) \\propto p(y|\\tau,\\lambda_1,\\lambda_2) p(\\tau) p(\\lambda_1) p(\\lambda_2) I(|\\lambda_2-\\lambda_1| \\lt 1)$$\n", "\n", "We call such log-probability terms **factor potentials** (Jordan 2004). Bayesian\n", "hierarchical notation doesn't accommodate these potentials. 
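\n", "\n", "On the log scale, the indicator term above simply adds a quantity that is zero when the constraint is satisfied and $-\\infty$ otherwise, so that (up to an additive constant)\n", "\n", "$$\\log p(y,\\tau,\\lambda_1,\\lambda_2) = \\log p(y|\\tau,\\lambda_1,\\lambda_2) + \\log p(\\tau) + \\log p(\\lambda_1) + \\log p(\\lambda_2) + \\log I(|\\lambda_2-\\lambda_1| \\lt 1)$$\n", "\n", "This is exactly the kind of term that a factor potential contributes to the model's joint log-probability, and it is what the `switch` expression below implements.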
\n", "\n", "### Creation of Potentials\n", "\n", "A potential can be created via the `Potential` function, in a way very similar to `Deterministic`'s named interface:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with disaster_model:\n", " \n", " rate_constraint = pm.Potential('rate_constraint', pm.math.switch(at.abs(early_mean-late_mean)>1, -np.inf, 0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function takes just a `name` as its first argument and an expression returning the appropriate log-probability as the second argument.\n", "\n", "A common use of a factor potential is to represent an observed likelihood, where the **observations are partly a function of model variables**. In the contrived example below, we are representing the error in a linear regression model as a zero-mean normal random variable. Thus, the \"data\" in this scenario is the residual, which is a function both of the data and the regression parameters. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = np.array([15, 10, 16, 11, 9, 11, 10, 18, 11])\n", "x = np.array([1, 2, 4, 5, 6, 8, 19, 18, 12])\n", "\n", "with pm.Model() as arma_model:\n", "\n", " sigma = pm.HalfCauchy('sigma', 5)\n", " beta = pm.Normal('beta', 0, sd=2)\n", " mu = pm.Normal('mu', 0, sd=10)\n", "\n", " err = y - (mu + beta*x)\n", " \n", " like = pm.Potential('like', \n", " pm.logp(\n", " pm.Normal.dist(0, sd=sigma), \n", " err\n", " )\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This parameterization would not be compatible with an observed stochastic, because the `err` term would become fixed in the likelihood and not be allowed to change during sampling." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise: Bioassay model\n", "\n", "Gelman et al. (2003) present an example of an acute toxicity test, commonly performed on animals to estimate the toxicity of various compounds.\n", "\n", "In this dataset `log_dose` includes 4 levels of dosage, on the log scale, each administered to 5 rats during the experiment. The response variable is death, the number of positive responses to the dosage.\n", "\n", "The number of deaths can be modeled as a binomial response, with the probability of death being a linear function of dose:\n", "\n", "$$\\begin{aligned}\n", "y_i &\\sim \\text{Bin}(n_i, p_i) \\\\\n", "\\text{logit}(p_i) &= a + b x_i\n", "\\end{aligned}$$\n", "\n", "The common statistic of interest in such experiments is the LD50, the dosage at which the probability of death is 50%.\n", "\n", "Specify this model in PyMC:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Log dose in each group\n", "log_dose = [-.86, -.3, -.05, .73]\n", "\n", "# Sample size in each group\n", "n = 5\n", "\n", "# Outcomes\n", "deaths = [0, 1, 3, 5]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Write your answer here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sampling with MCMC\n", "\n", "PyMC's core business is using Markov chain Monte Carlo to fit virtually any probability model. This involves the assignment and coordination of a suite of **step methods**, each of which is responsible for updating one or more variables. 
\n", "\n", "The user's interface to PyMC's sampling algorithms is the `sample` function:\n", "\n", "```python\n", "pm.sample(\n", " draws=1000,\n", " step=None,\n", " init='auto',\n", " n_init=200000,\n", " initvals=None,\n", " trace=None,\n", " chains=None,\n", " cores=None,\n", " tune=1000,\n", " progressbar=True,\n", " model=None,\n", " random_seed=None)\n", "```\n", "\n", "`sample` assigns particular samplers to model variables, and generates samples from them. The `draws` argument\n", "controls the total number of MCMC iterations. PyMC can automate most of the details of sampling, outside of the selection of the number of draws, using default settings for several parameters that control how the sampling is set up and conducted. However, users may manually intervene in the specification of the sampling by passing values to a number of keyword argumetns for `sample`.\n", "\n", "### Assigning step methods\n", "\n", "The `step` argument allows users to assign a MCMC sampling algorithm to the entire model, or to a subset of the variables in the model. For example, if we wanted to use the Metropolis-Hastings sampler to fit our model, we could pass an instance of that step method to `sample` via the `step` argument:\n", "\n", "```python\n", "with my_model:\n", "\n", " trace = pm.sample(1000, step=pm.Metropolis())\n", "```\n", "\n", "or if we only wanted to assign `Metropolis` to a parameter called `beta`:\n", "\n", "```python\n", "with my_model:\n", "\n", " trace = pm.sample(1000, step=pm.Metropolis(vars=[beta]))\n", "```\n", "\n", "When `step` is not specified by the user, PyMC will assign step methods to variables automatically. To do so, each step method implements a class method called `Competence`. This method returns a value from 0 (incompatible) to 3 (ideal), based on the attributes of the random variable in question. `sample()` assigns the step method that returns the highest competence value to each of its unallocated stochastic random variables. In general:\n", "\n", "* Binary variables will be assigned to `BinaryMetropolis` (Metropolis-Hastings for binary values)\n", "* Discrete variables will be assigned to `Metropolis`\n", "* Continuous variables will be assigned to `NUTS` (No U-turn Sampler)\n", "\n", "### Starting values\n", "\n", "The `start` argument allows for the specification of starting values for stochastic random variables in the model. MCMC algorithms begin by initializing all unknown quantities to arbitrary starting values. Though in theory the value can be any value under the support of the distribution describing the random variable, we can make sampling more difficult if an initial value is chosen in the extreme tail of the distribution, for example. 
If starting values are not passed by the user, default values are chosen from the mean, median or mode of the distribution.\n", "\n", "As suggested in the previous section on approximation methods, it is sometimes useful to initialize an MCMC simulation at the maximum *a posteriori* (MAP) estimate:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with bioassay_model:\n", " \n", " posterior_mode = pm.find_MAP()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "posterior_mode" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with bioassay_model:\n", " trace = pm.sample(100, step=pm.Metropolis(), cores=2, initvals=posterior_mode)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we are sampling more than one Markov chain from our model, it is often recommended to initialize each chain to different starting values, so that lack of convergence can be more easily detected (see the *Model Checking* section). \n", "\n", "### Storing samples\n", "\n", "Notice in the above call to `sample` that output is assigned to a variable we have called `trace`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This `InferenceData` object is a data structure that stores the samples from an MCMC run as grouped attributes. Each group (the posterior samples, for example) is an `xarray.Dataset` object, which is a dictionary-like object that stores the samples in multi-dimensional arrays.\n", "\n", "The xarray components include:\n", "\n", "- **Data variables** are the actual values generated from the MCMC draws\n", "- **Dimensions** are the named axes of the data variables\n", "- **Coordinates** are pointers to specific slices or points in the `xarray.Dataset`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parallel sampling\n", "\n", "Nearly all modern desktop computers have multiple CPU cores, and running multiple MCMC chains is an **embarrassingly parallel** computing task. It is therefore relatively simple to run chains in parallel in PyMC. This is done by setting the `cores` argument in `sample` to some value between 2 and the number of cores on your machine (you can specify more chains than cores, but you will not gain efficiency by doing so). The default value of `cores` is `None`, which will select the number of CPUs on your machine, to a maximum of 4. \n", "\n", "> Keep in mind that some chains might themselves be multithreaded via openmp or BLAS. In those cases it might be faster to set this to 1.\n", "\n", "By default, PyMC will sample a minimum of 2 and a maximum of `cores` chains. However, the number of chains sampled can be set independently of the number of cores by specifying the `chains` argument." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with bioassay_model:\n", " ptrace = pm.sample(100, tune=100, chains=4, cores=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Running $n$ iterations with $c$ chains will result in $c \\times n$ samples."
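, "\n", "\n", "Because posterior draws are indexed by separate `chain` and `draw` dimensions, it is often convenient to pool them into a single flat sample dimension with xarray. A minimal sketch using the `ptrace` result from above:\n", "\n", "```python\n", "# Combine the chain and draw dimensions into one flat 'sample' dimension\n", "pooled = ptrace.posterior['alpha'].stack(sample=('chain', 'draw'))\n", "\n", "# e.g. (400,) for the 4 chains of 100 draws sampled above\n", "pooled.shape\n", "```"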
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ptrace.posterior['alpha'].shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generating several chains is generally recommended because it aids in model checking, allowing statistics such as the potential scale reduction factor ($\\hat{R}$) and effective sample size to be calculated.\n", "\n", "### Reproducible sampling\n", "\n", "A practical drawback of using stochastic sampling methods for statistical inference is that it can be more difficult to reproduce individual results, due to the fact that sampling involves the use of pseudo-random number generation. To aid in reproducibility (and debugging), it can be helpful to set a **random number seed** prior to sampling. The `random_seed` argument can be used to set PyMC's random number generator to a particular seed integer, which results in the same sequence of random numbers each time the seed is set to the same value." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with bioassay_model:\n", " rtrace = pm.sample(100, cores=2, random_seed=42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rtrace.posterior['beta'].sel(draw=slice(0, 5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Setting the same seed for another run of the same model will generate the same sequence of samples:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with bioassay_model:\n", " rtrace = pm.sample(100, cores=2, random_seed=42)\n", "\n", "rtrace.posterior['beta'].sel(draw=slice(0, 5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step methods\n", "\n", "Step method classes handle individual stochastic variables, or sometimes groups of them. They are responsible for making the variables they handle take **single MCMC steps** conditional on the rest of the model. Each PyMC step method (usually subclasses of `ArrayStep`) implements a method called `astep()`, which is called iteratively by `sample`. \n", "\n", "All step methods share an optional argument `vars` that allows a particular subset of variables to be handled by the step method instance. Particular step methods will have additional arguments for setting parameters and preferences specific to that sampling algorithm.\n", "\n", "> NB: when a PyMC function or method has an argument called `vars` it is expecting a list of variables (*i.e.* the variables themselves), whereas arguments called `var_names` expect a list of variables names (*i.e.* strings)\n", "\n", "### HamiltonianMC\n", "\n", "The Hamiltonian Monte Carlo algorithm is implemented in the `HamiltonianMC` class. Being a gradient-based sampler, it is only suitable for **continuous random variables**. Several optional arguments can be provided by the user. The algorithm is **non-adaptive**, so the parameter values passed at instantiation are fixed at those values throughout sampling.\n", "\n", "`HamiltonianMC` requires a scaling matrix parameter `scaling`, which is analogous to the variance parameter for the jump proposal distribution in Metropolis-Hastings, although it is used somewhat differently here. The matrix gives an approximate shape of the posterior distribution, so that `HamiltonianMC` does not make jumps that are too large in some directions and too small in other directions. 
It is important to set this scaling parameter to a reasonable value to facilitate efficient sampling. This is especially true for models that have many unobserved stochastic random variables or models with highly non-normal posterior distributions. \n", "\n", "Fortunately, `HamiltonianMC` can often make good guesses for the scaling parameters. If you pass a point in parameter space (as a dictionary of variable names to parameter values, the same format as returned by `find_MAP`), it will look at the **local curvature** of the log posterior-density (the diagonal of the Hessian matrix) at that point to guess values for a good scaling vector. Also, the MAP estimate is often a good point to use to initiate sampling. \n", "\n", "- `scaling` \n", ": Scaling for momentum distribution. If a 1-dimensional array is passed, it is interpreted as a matrix diagonal.\n", " \n", "- `step_scale` \n", ": Size of steps to take, automatically scaled down by $1/n^{0.25}$. Defaults to .25.\n", " \n", "- `path_length` \n", ": Total length to travel during the leapfrog integration. Defaults to 2.\n", " \n", "- `is_cov` \n", ": Flag for treating scaling as a covariance matrix/vector, if True. Treated as precision otherwise.\n", " \n", "- `step_rand` \n", ": A function which takes the step size and returns a new one, used to randomize the step size at each iteration.\n", "\n", "\n", "### NUTS\n", "\n", "`NUTS` is the No U-turn Sampler of Hoffman and Gelman (2014), an adaptive version of Hamiltonian MC that **automatically tunes** the step size and number of steps on the fly. \n", "\n", "In addition to the arguments to `HamiltonianMC`, `NUTS` takes additional parameters that control the tuning. The most important of these is the target acceptance rate for the Metropolis acceptance phase of the algorithm, `target_accept`. \n", "If NUTS struggles to sample efficiently, raising this parameter above the default target rate of 0.8 will sometimes improve sampling (the original recommendation by Hoffman & Gelman was 0.6). Setting the target rate very high will also make the sampler more conservative, however, taking many small steps at every iteration. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with bioassay_model:\n", " trace_90 = pm.sample(1000, tune=1000, target_accept=0.9)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trace_90.sample_stats['acceptance_rate'].mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import arviz as az\n", "\n", "az.plot_trace(trace_90, var_names=['alpha']);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is rarely a reason to use `HamiltonianMC` rather than `NUTS`, which is the default sampler for continuous variables in PyMC." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Metropolis\n", "\n", "``Metropolis`` implements a Metropolis-Hastings step, as described in the theory section, and is designed to handle float- and integer-valued variables.\n", "\n", "A `Metropolis` step method can be instantiated with any of several optional arguments:\n", "\n", "- `S`\n", ": This sets the proposal standard deviation or covariance matrix.\n", "\n", "- `proposal_dist`\n", ": A function that generates zero-mean random deviates used as proposals. 
Defaults to the normal distribution.\n", "\n", "- `scaling`\n", ": An initial scale factor for the proposal.\n", "\n", "- `tune_interval`\n", ": The number of iterations between tuning updates to the `scaling` factor.\n", "\n", "When the step method is instantiated, the `proposal_dist` is parameterized with the value passed for `S`. While sampling, the value of `scaling` is used to scale the value proposed by `proposal_dist`, and this value is tuned throughout the MCMC run. During tuning, the acceptance ratio of the step method is examined, and this scaling factor\n", "is updated accordingly. Tuning only occurs when the acceptance rate is **lower than 20%** or **higher than 50%**; rates between 20-50% are considered optimal for Metropolis-Hastings sampling. The default tuning interval (`tune_interval`) is 100 iterations.\n", "\n", "Although tuning will continue throughout the sampling loop, it is important to verify that the\n", "**diminishing tuning** condition of [Roberts and Rosenthal (2007)](http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.jap/1183667414) is satisfied: the\n", "amount of tuning should decrease to zero, or tuning should become very infrequent.\n", "\n", "`Metropolis` handles discrete variable types automatically by rounding the proposed values and casting them to integers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### BinaryMetropolis\n", "\n", "While binary (boolean) variables can be handled by the `Metropolis` step method, sampling will be very inefficient. The `BinaryMetropolis` class is optimized to handle binary variables, which can take only one of two possible values. The only tuneable parameter is the `scaling` argument, which is used to vary the Bernoulli probability:\n", "\n", " p_jump = 1. - .5 ** self.scaling\n", "\n", "This value is compared to pseudo-random numbers generated by the step method, to determine whether a 0 or 1 is proposed.\n", "\n", "`BinaryMetropolis` will be automatically selected for random variables that are distributed as Bernoulli, or categorical with only 2 categories." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Slice\n", "\n", "Though the Metropolis-Hastings algorithm is easy to implement for a variety of models, its efficiency is poor. We have seen that it is possible to tune Metropolis samplers, but it would be nice to have a \"black-box\" method that works for arbitrary continuous distributions, which we may know little about a priori.\n", "\n", "The **slice sampler** (Neal 2003) improves upon the Metropolis sampler by being both efficient and easy to program generally. The idea is to first sample an auxiliary value $y$, given the current value of $x$, uniformly over the interval $(0, f(x))$, and then, conditional on this value of $y$, to sample $x$ uniformly on the slice $S = \\{x : y < f(x)\\}$.\n", "\n", "The steps required to perform a single iteration of the slice sampler to update the current value of $x_i$ are as follows:\n", "\n", "1. Sample $y$ uniformly on $(0, f(x_i))$.\n", "2. Use this value $y$ to define a horizontal *slice* $S = \\{x : y < f(x)\\}$.\n", "3. Establish an interval $I = (x_a, x_b)$ around $x_i$ that contains most of the slice.\n", "4. Sample $x_{i+1}$ from the region of the slice overlapping $I$.\n", "\n", "Hence, slice sampling employs an **auxiliary variable** ($y$) that is not retained at the end of the iteration. Note that in practice one may operate on the log scale such that $g(x) = \\log(f(x))$ to avoid floating-point underflow. 
In this case, the auxiliary variable becomes $z = log(y) = g(x_i) − e$, where $e \\sim \\text{Exp}(1)$, resulting in the slice $S = \\{x : z < g(x)\\}$.\n", "\n", "There are many ways of establishing and sampling from the interval $I$, with the only restriction being that the resulting Markov chain leaves $f(x)$ **invariant**. The objective is to include as much of the slice as possible, so that the potential step size can be large, but not (much) larger than the slice, so that the sampling of invalid points is minimized. Ideally, we would like it to be the slice itself, but it may not always be feasible to determine (and certainly not automatically).\n", "\n", "In PyMC3, the `Slice` class implements the **univariate** slice sampler. It is suitable for univariate, continuous variables. There is a single user-defined parameter `w`, which sets the width of the initial slice. If not specified, it defaults to a width of 1." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with bioassay_model:\n", " \n", " slice_trace = pm.sample(2000, cores=2, step=pm.Slice())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "az.plot_trace(slice_trace, var_names=['alpha','beta']);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PyMC also includes implementations of stochastic gradient Markov chain Monte Carlo (SGMCMC, Nemeth & Fearnhead 2019) and Multi-Level Delayed Acceptance MCMC (MLDA, Dodwell et al. 2019), which we will not cover here. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imputation of Missing Data\n", "\n", "As with most textbook examples, the models we have examined so far assume that the associated data are complete. That is, there are no **missing values** corresponding to any observations in the dataset. However, many real-world datasets have missing observations, usually due to some logistical problem during the data collection process. The easiest way of dealing with observations that contain missing values is simply to exclude them from the analysis. However, this results in loss of information if an excluded observation contains valid values for other quantities, and can bias results. An alternative is to impute the missing values, based on information in the rest of the model.\n", "\n", "For example, consider a survey dataset for some wildlife species:\n", "\n", " Count Site Observer Temperature\n", " ------- ------ ---------- -------------\n", " 15 1 1 15\n", " 10 1 2 NA\n", " 6 1 1 11\n", "\n", "Each row contains the number of individuals seen during the survey, along with three covariates: the site on which the survey was conducted, the observer that collected the data, and the temperature during the survey. If we are interested in modelling, say, population size as a function of the count and the associated covariates, it is difficult to accommodate the second observation because the temperature is missing (perhaps the thermometer was broken that day). Ignoring this observation will allow us to fit the model, but it wastes information that is contained in the other covariates.\n", "\n", "In a Bayesian modelling framework, missing data are accommodated simply by treating them as **unknown model parameters**. 
Values for the missing data $\\tilde{y}$ are estimated naturally, using the posterior predictive distribution:\n", "\n", "$$p(\\tilde{y}|y) = \\int p(\\tilde{y}|\\theta) f(\\theta|y) d\\theta$$\n", "\n", "This describes additional data $\\tilde{y}$, which may either be considered unobserved data or potential future observations. We can use the posterior predictive distribution to model the likely values of missing data.\n", "\n", "Consider the coal mining disasters data introduced previously. Assume that two years of data are missing from the time series; we indicate this in the data array by the use of an arbitrary placeholder value, `-999`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "disasters_missing = np.array([ 4, 5, 4, 0, 1, 4, 3, 4, 0, 6, 3, 3, 4, 0, 2, 6,\n", "3, 3, 5, 4, 5, 3, 1, 4, 4, 1, 5, 5, 3, 4, 2, 5,\n", "2, 2, 3, 4, 2, 1, 3, -999, 2, 1, 1, 1, 1, 3, 0, 0,\n", "1, 0, 1, 1, 0, 0, 3, 1, 0, 3, 2, 2, 0, 1, 1, 1,\n", "0, 1, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 1, 1, 0, 2,\n", "3, 3, 1, -999, 2, 1, 1, 1, 1, 2, 4, 2, 0, 0, 1, 4,\n", "0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1])\n", "\n", "N = len(disasters_missing)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To estimate these values in PyMC, we need to convert these placeholder values to `np.nan` (or `None`) values so that they can be handled by the model as missing values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "disasters_missing = np.where(disasters_missing == -999, np.nan, disasters_missing)\n", "disasters_missing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This array can then be passed to the model likelihood, which recognizes the `nan` values as missing and replaces them with stochastic variables of the desired type. For the coal mining disasters problem, recall that disaster events were modeled as Poisson variates:\n", "\n", "```python\n", "disasters = Poisson('disasters', mu=rate, observed=disasters_missing)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each element in `disasters` is a Poisson random variable, irrespective of whether the observation was missing or not. The difference is that actual observations are assumed to be data stochastics, while the missing\n", "values are unobserved stochastics. The latter are considered unknown, rather than fixed, and therefore estimated by the fitting algorithm, just as unknown model parameters are.\n", "\n", "The model is otherwise unchanged from the complete data case." 
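, "\n", "\n", "An equivalent way of marking the missing values is to pass a NumPy **masked array** as the observed data, rather than converting the placeholders to `np.nan`. A minimal sketch using `numpy.ma` (the short array here is only for illustration):\n", "\n", "```python\n", "import numpy as np\n", "\n", "# Entries equal to -999 are masked, and hence treated as missing\n", "np.ma.masked_values([3, 4, -999, 1, 0, -999, 2], value=-999)\n", "```"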
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with pm.Model() as missing_data_model:\n", "\n", " # Prior for distribution of switchpoint location\n", " switchpoint = pm.DiscreteUniform('switchpoint', lower=0, upper=N)\n", " # Priors for pre- and post-switch mean number of disasters\n", " early_mean = pm.Exponential('early_mean', lam=1.)\n", " late_mean = pm.Exponential('late_mean', lam=1.)\n", "\n", " # Allocate appropriate Poisson rates to years before and after current\n", " # switchpoint location\n", " idx = np.arange(N)\n", " rate = pm.Deterministic('rate', pm.math.switch(switchpoint >= idx, early_mean, late_mean))\n", "\n", " # Data likelihood\n", " disasters = pm.Poisson('disasters', rate, observed=disasters_missing)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we have used the `masked_array` function, rather than\n", "`masked_equal`, and the value -999 as a placeholder for missing data.\n", "The result is the same." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with missing_data_model:\n", " trace_missing = pm.sample(1000, tune=1000, cores=2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "missing_data_model.vars" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "az.summary(trace_missing, var_names=['disasters_missing'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Load the titanic dataset, and construct an appropriate model to predict passenger survival rate. Summarize the parameter estimates." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "titanic = pd.read_excel('https://raw.githubusercontent.com/fonnesbeck/Bios8366/master/data/titanic.xls', index_col='name')\n", "titanic.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## References\n", "\n", "1. Ching & Chen. 2007. Transitional Markov chain Monte Carlo method for Bayesian model updating, model class selection and model averaging. Journal of Engineering Mechanics 2007\n", "2.\tHoffman MD, Gelman A. 2014. The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. The Journal of Machine Learning Research. 15(1):1593-1623.\n", "3. M.I. Jordan. 2004. Graphical models. Statist. Sci., 19(1):140–155.\n", "4. Neal, R. M. 2003. Slice sampling. The Annals of Statistics, 31(3), 705–767. doi:10.1111/1467-9868.00198" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" }, "latex_envs": { "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0 } }, "nbformat": 4, "nbformat_minor": 2 }