{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Multiple Comparisons" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_This setup code is required to run in an IPython notebook_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "sns.set_style(\"darkgrid\")\n", "plt.rc(\"figure\", figsize=(16, 6))\n", "plt.rc(\"savefig\", dpi=90)\n", "plt.rc(\"font\", family=\"sans-serif\")\n", "plt.rc(\"font\", size=14)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Reproducability\n", "import numpy as np\n", "\n", "gen = np.random.default_rng(23456)\n", "# Common seed used throughout\n", "seed = gen.integers(0, 2**31 - 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The multiple comparison procedures all allow for examining aspects of superior predictive ability. There are three available:\n", "\n", "* `SPA` - The test of Superior Predictive Ability, also known as the Reality Check (and accessible as `RealityCheck`) or the bootstrap data snooper, examines whether any model in a set of models can outperform a benchmark.\n", "* `StepM` - The stepwise multiple testing procedure uses sequential testing to determine which models are superior to a benchmark.\n", "* `MCS` - The model confidence set which computes the set of models which with performance indistinguishable from others in the set.\n", "\n", "All procedures take **losses** as inputs. That is, smaller values are preferred to larger values. This is common when evaluating forecasting models where the loss function is usually defined as a positive function of the forecast error that is increasing in the absolute error. Leading examples are Mean Square Error (MSE) and Mean Absolute Deviation (MAD)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The test of Superior Predictive Ability (SPA)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This procedure requires a $t$-element array of benchmark losses and a $t$ by $k$-element array of model losses. The null hypothesis is that no model is better than the benchmark, or \n", "\n", "$$ H_0: \\max_i E[L_i] \\geq E[L_{bm}] $$\n", "\n", "where $L_i$ is the loss from model $i$ and $L_{bm}$ is the loss from the benchmark model.\n", "\n", "This procedure is normally used when there are many competing forecasting models such as in the study of technical trading rules. The example below will make use of a set of models which are all equivalently good to a benchmark model and will serve as a *size study*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Study Design" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The study will make use of a measurement error in predictors to produce a large set of correlated variables that all have equal expected MSE. The benchmark will have identical measurement error and so all models have the same expected loss, although will have different forecasts.\n", "\n", "The first block computed the series to be forecast." 
, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import statsmodels.api as sm\n", "from numpy.random import randn\n", "\n", "t = 1000\n", "factors = randn(t, 3)\n", "beta = np.array([1, 0.5, 0.1])\n", "e = randn(t)\n", "y = factors.dot(beta) + e" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The next block computes the benchmark factors and the model factors by contaminating the original factors with noise. The models are estimated on the first 500 observations and predictions are made for the second 500. Finally, losses are constructed from these predictions." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Measurement noise\n", "bm_factors = factors + randn(t, 3)\n", "# Fit using first half, predict second half\n", "bm_beta = sm.OLS(y[:500], bm_factors[:500]).fit().params\n", "# MSE loss\n", "bm_losses = (y[500:] - bm_factors[500:].dot(bm_beta)) ** 2.0\n", "# Number of models\n", "k = 500\n", "model_factors = np.zeros((k, t, 3))\n", "model_losses = np.zeros((500, k))\n", "for i in range(k):\n", "    # Add measurement noise\n", "    model_factors[i] = factors + randn(t, 3)\n", "    # Compute regression parameters\n", "    model_beta = sm.OLS(y[:500], model_factors[i, :500]).fit().params\n", "    # Prediction and losses\n", "    model_losses[:, i] = (y[500:] - model_factors[i, 500:].dot(model_beta)) ** 2.0" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Finally the SPA can be used. The SPA requires the **losses** from the benchmark and the models as inputs. Other inputs allow the bootstrap used to be changed or set various options regarding studentization of the losses. `compute` does the real work, and then `pvalues` contains the bootstrap p-values for the null.\n", "\n", "In this case, one would not reject. The three p-values correspond to different re-centerings of the losses. In general, the `consistent` p-value should be used. It should always be the case that\n", "\n", "$$ \\text{lower} \\leq \\text{consistent} \\leq \\text{upper}. $$\n", "\n", "See the original papers for more details." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from arch.bootstrap import SPA\n", "\n", "spa = SPA(bm_losses, model_losses, seed=seed)\n", "spa.compute()\n", "spa.pvalues" ] }
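, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the null concrete, it helps to look at the loss differentials $d_i = L_{bm} - L_i$, which are positive when model $i$ beats the benchmark. The SPA bootstraps the distribution of the largest studentized average differential. The sketch below computes a simplified sample analogue of that statistic from the losses above; the plain-variance studentization is an assumption made to keep the sketch short, while the SPA itself uses a bootstrap estimate of the long-run variance." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loss differentials: column i is the benchmark loss minus model i's loss,\n", "# so positive averages favor the model over the benchmark\n", "d = bm_losses[:, None] - model_losses\n", "n = d.shape[0]\n", "# Studentized average differential for each model (simplified scaling)\n", "stats = np.sqrt(n) * d.mean(0) / d.std(0)\n", "# The SPA compares the largest of these to its bootstrap distribution\n", "print(stats.max())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The same blocks can be repeated to perform a simulation study. Here I only use 100 replications since this should complete in a reasonable amount of time. I also set `reps=250` to limit the number of bootstrap replications in each application of the SPA (the default is a more reasonable 1000)." ] }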
, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Save the pvalues\n", "pvalues = []\n", "b = 100\n", "seeds = gen.integers(0, 2**31 - 1, b)\n", "# Repeat 100 times\n", "for j in range(b):\n", "    if j % 10 == 0:\n", "        print(j)\n", "    factors = randn(t, 3)\n", "    beta = np.array([1, 0.5, 0.1])\n", "    e = randn(t)\n", "    y = factors.dot(beta) + e\n", "\n", "    # Measurement noise\n", "    bm_factors = factors + randn(t, 3)\n", "    # Fit using first half, predict second half\n", "    bm_beta = sm.OLS(y[:500], bm_factors[:500]).fit().params\n", "    # MSE loss\n", "    bm_losses = (y[500:] - bm_factors[500:].dot(bm_beta)) ** 2.0\n", "    # Number of models\n", "    k = 500\n", "    model_factors = np.zeros((k, t, 3))\n", "    model_losses = np.zeros((500, k))\n", "    for i in range(k):\n", "        model_factors[i] = factors + randn(t, 3)\n", "        model_beta = sm.OLS(y[:500], model_factors[i, :500]).fit().params\n", "        # MSE loss\n", "        model_losses[:, i] = (y[500:] - model_factors[i, 500:].dot(model_beta)) ** 2.0\n", "    # Lower the bootstrap replications to 250\n", "    spa = SPA(bm_losses, model_losses, reps=250, seed=seeds[j])\n", "    spa.compute()\n", "    pvalues.append(spa.pvalues)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Finally the p-values can be plotted. Ideally they should form a $45^{\\circ}$ line, indicating that the size is correct. Both the consistent and upper p-values perform well. The lower p-value has too many small values." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "pvalues_df = pd.DataFrame(pvalues)\n", "for col in pvalues_df:\n", "    values = pvalues_df[col].to_numpy(copy=True)\n", "    values.sort()\n", "    pvalues_df[col] = values\n", "# Change the index so that the x-values are between 0 and 1\n", "pvalues_df.index = pd.Index(np.linspace(0.005, 0.995, 100))\n", "fig = pvalues_df.plot()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Power" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The SPA also has power to reject when the null is violated. The simulation is modified so that the amount of measurement error differs across models, and so that some models are actually better than the benchmark. The p-values should be small, indicating rejection of the null." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Number of models\n", "k = 500\n", "model_factors = np.zeros((k, t, 3))\n", "model_losses = np.zeros((500, k))\n", "for i in range(k):\n", "    scale = (2500.0 - i) / 2500.0\n", "    model_factors[i] = factors + scale * randn(t, 3)\n", "    model_beta = sm.OLS(y[:500], model_factors[i, :500]).fit().params\n", "    # MSE loss\n", "    model_losses[:, i] = (y[500:] - model_factors[i, 500:].dot(model_beta)) ** 2.0\n", "\n", "spa = SPA(bm_losses, model_losses, seed=seed)\n", "spa.compute()\n", "spa.pvalues" ] }
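, { "cell_type": "markdown", "metadata": {}, "source": [ "Before plotting, a quick illustrative check (not required by the SPA) counts how many models have a lower average sample loss than the benchmark. Since the noise scale shrinks with the model index, most models should beat it." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Count the models whose average sample loss beats the benchmark's\n", "n_better = int((model_losses.mean(0) < bm_losses.mean()).sum())\n", "print(f\"{n_better} of {k} models have a lower average loss than the benchmark\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Here the average losses are plotted. The higher index models are clearly better than the lower index models -- and the benchmark model (which is equivalent to model.0)." ] }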
, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_losses_df = pd.DataFrame(\n", "    model_losses, columns=[\"model.\" + str(i) for i in range(k)]\n", ")\n", "avg_model_losses = pd.DataFrame(model_losses_df.mean(axis=0), columns=[\"Average loss\"])\n", "fig = avg_model_losses.plot(style=[\"o\"])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Stepwise Multiple Testing (StepM)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Stepwise Multiple Testing is similar to the SPA and has the same null. The primary difference is that it identifies the set of models that are superior to the benchmark, rather than only testing whether any model is better." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from arch.bootstrap import StepM\n", "\n", "stepm = StepM(bm_losses, model_losses_df)\n", "stepm.compute()\n", "print(\"Model indices:\")\n", "print([model.split(\".\")[1] for model in stepm.superior_models])" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "better_models = pd.concat(\n", "    [model_losses_df.mean(axis=0), model_losses_df.mean(axis=0)], axis=1\n", ")\n", "better_models.columns = pd.Index([\"Same or worse\", \"Better\"])\n", "better = better_models.index.isin(stepm.superior_models)\n", "worse = np.logical_not(better)\n", "better_models.loc[better, \"Same or worse\"] = np.nan\n", "better_models.loc[worse, \"Better\"] = np.nan\n", "fig = better_models.plot(style=[\"o\", \"s\"], rot=270)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## The Model Confidence Set" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The model confidence set takes a set of **losses** as its input and finds the set of models whose losses are not statistically different from the best while controlling the familywise error rate. The primary output is a set of p-values: models with a p-value above the test size are in the MCS. Small p-values indicate that the model is easily rejected from the set that contains the best model." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from arch.bootstrap import MCS\n", "\n", "# Limit the size of the set\n", "losses = model_losses_df.iloc[:, ::20]\n", "mcs = MCS(losses, size=0.10)\n", "mcs.compute()\n", "print(\"MCS P-values\")\n", "print(mcs.pvalues)\n", "print(\"Included\")\n", "included = mcs.included\n", "print([model.split(\".\")[1] for model in included])\n", "print(\"Excluded\")\n", "excluded = mcs.excluded\n", "print([model.split(\".\")[1] for model in excluded])" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "status = pd.DataFrame(\n", "    [losses.mean(axis=0), losses.mean(axis=0)], index=[\"Excluded\", \"Included\"]\n", ").T\n", "status.loc[status.index.isin(included), \"Excluded\"] = np.nan\n", "status.loc[status.index.isin(excluded), \"Included\"] = np.nan\n", "fig = status.plot(style=[\"o\", \"s\"])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 4 }