{
 "metadata": {
  "name": "",
  "signature": "sha256:11278e6b7bba23662e5cb47cf1d0051bdac0d5f368e5de5e5ae814d866cf0c8b"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "SARIMAX: Model selection, missing data"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The example mirrors Durbin and Koopman (2012), Chapter 8.4 in application of Box-Jenkins methodology to fit ARMA models. The novel feature is the ability of the model to work on datasets with missing values."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%matplotlib inline"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "import pandas as pd\n",
      "from scipy.stats import norm\n",
      "import statsmodels.api as sm\n",
      "import matplotlib.pyplot as plt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import requests\n",
      "from StringIO import StringIO\n",
      "from zipfile import ZipFile\n",
      "\n",
      "# Download the dataset\n",
      "dk = requests.get('http://www.ssfpack.com/files/DK-data.zip').content\n",
      "zipped = ZipFile(StringIO(dk))\n",
      "df = pd.read_table(\n",
      "    StringIO(zipped.read('internet.dat')),\n",
      "    skiprows=1, header=None, sep='\\s+', engine='python',\n",
      "    names=['internet','dinternet']\n",
      ")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Model Selection\n",
      "\n",
      "As in Durbin and Koopman, we force a number of the values to be missing."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Get the basic series\n",
      "dta_full = df.dinternet[1:].values\n",
      "dta_miss = dta_full.copy()\n",
      "\n",
      "# Remove datapoints\n",
      "missing = np.r_[6,16,26,36,46,56,66,72,73,74,75,76,86,96]-1\n",
      "dta_miss[missing] = np.nan"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Then we can consider model selection using the Akaike information criteria (AIC), but running the model for each variant and selecting the model with the lowest AIC value.\n",
      "\n",
      "There are a couple of things to note here:\n",
      "\n",
      "- When running such a large batch of models, particularly when the autoregressive and moving average orders become large, there is the possibility of poor maximum likelihood convergence. Below we ignore the warnings since this example is illustrative.\n",
      "- We use the option `enforce_invertibility=False`, which allows the moving average polynomial to be non-invertible, so that more of the models are estimable.\n",
      "- Several of the models do not produce good results, and their AIC value is set to NaN. This is not surprising, as Durbin and Koopman note numerical problems with the high order models."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import warnings\n",
      "\n",
      "aic_full = pd.DataFrame(np.zeros((6,6), dtype=float))\n",
      "aic_miss = pd.DataFrame(np.zeros((6,6), dtype=float))\n",
      "\n",
      "warnings.simplefilter('ignore')\n",
      "\n",
      "# Iterate over all ARMA(p,q) models with p,q in [0,6]\n",
      "for p in range(6):\n",
      "    for q in range(6):\n",
      "        if p == 0 and q == 0:\n",
      "            continue\n",
      "            \n",
      "        # Estimate the model with no missing datapoints\n",
      "        mod = sm.tsa.statespace.SARIMAX(dta_full, order=(p,0,q), enforce_invertibility=False)\n",
      "        try:\n",
      "            res = mod.fit()\n",
      "            aic_full.iloc[p,q] = res.aic\n",
      "        except:\n",
      "            aic_full.iloc[p,q] = np.nan\n",
      "        \n",
      "        # Estimate the model with missing datapoints\n",
      "        mod = sm.tsa.statespace.SARIMAX(dta_miss, order=(p,0,q), enforce_invertibility=False)\n",
      "        try:\n",
      "            res = mod.fit()\n",
      "            aic_miss.iloc[p,q] = res.aic\n",
      "        except:\n",
      "            aic_miss.iloc[p,q] = np.nan"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For the models estimated over the full (non-missing) dataset, the AIC chooses ARMA(1,1) or ARMA(3,0). Durbin and Koopman suggest the ARMA(1,1) specification is better due to parsimony.\n",
      "\n",
      "$$\n",
      "\\text{Replication of:}\\\\\n",
      "\\textbf{Table 8.1} ~~ \\text{AIC for different ARMA models.}\\\\\n",
      "\\newcommand{\\r}[1]{{\\color{red}{#1}}}\n",
      "\\begin{array}{lrrrrrr}\n",
      "\\hline\n",
      "q &      0 &      1 &      2 &      3 &      4 &      5 \\\\\n",
      "\\hline\n",
      "p &     {} &     {} &     {} &     {} &     {} &     {} \\\\\n",
      "0 &   0.00 & 549.81 & 519.87 & 520.27 & 519.38 & 518.86 \\\\\n",
      "1 & 529.24 & \\r{514.30} & 516.25 & 514.58 & 515.10 & 516.28 \\\\\n",
      "2 & 522.18 & 516.29 & 517.16 & 515.77 & 513.24 & 514.73 \\\\\n",
      "3 & \\r{511.99} & 513.94 & 515.92 & 512.06 & 513.72 & 514.50 \\\\\n",
      "4 & 513.93 & 512.89 &    nan &    nan & 514.81 & 516.08 \\\\\n",
      "5 & 515.86 & 517.64 &    nan &    nan &    nan &    nan \\\\\n",
      "\\hline\n",
      "\\end{array}\n",
      "$$\n",
      "\n",
      "For the models estimated over missing dataset, the AIC chooses ARMA(1,1)\n",
      "\n",
      "$$\n",
      "\\text{Replication of:}\\\\\n",
      "\\textbf{Table 8.2} ~~ \\text{AIC for different ARMA models with missing observations.}\\\\\n",
      "\\begin{array}{lrrrrrr}\n",
      "\\hline\n",
      "q &      0 &      1 &      2 &      3 &      4 &      5 \\\\\n",
      "\\hline\n",
      "p &     {} &     {} &     {} &     {} &     {} &     {} \\\\\n",
      "0 &   0.00 & 488.93 & 464.01 & 463.86 & 462.63 & 463.62 \\\\\n",
      "1 & 468.01 & \\r{457.54} & 459.35 & 458.66 & 459.15 & 461.01 \\\\\n",
      "2 & 469.68 &    nan & 460.48 & 459.43 & 459.23 & 460.47 \\\\\n",
      "3 & 467.10 & 458.44 & 459.64 & 456.66 & 459.54 & 460.05 \\\\\n",
      "4 & 469.00 & 459.52 &    nan & 463.04 & 459.35 & 460.96 \\\\\n",
      "5 & 471.32 & 461.26 &    nan &    nan & 461.00 & 462.97 \\\\\n",
      "\\hline\n",
      "\\end{array}\n",
      "$$\n",
      "\n",
      "**Note**: the AIC values are calculated differently than in Durbin and Koopman, but show overall similar trends."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Postestimation\n",
      "\n",
      "Using the ARMA(1,1) specification selected above, we perform in-sample prediction and out-of-sample forecasting."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Statespace\n",
      "mod = sm.tsa.statespace.SARIMAX(dta_miss, order=(1,0,1))\n",
      "res = mod.fit()\n",
      "print res.summary()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# In-sample one-step-ahead predictions\n",
      "predict_res = res.predict(full_results=True)\n",
      "\n",
      "predict = predict_res.forecasts\n",
      "cov = predict_res.forecasts_error_cov\n",
      "predict_idx = np.arange(len(predict[0]))\n",
      "\n",
      "# 95% confidence intervals\n",
      "critical_value = norm.ppf(1 - 0.05 / 2.)\n",
      "std_errors = np.sqrt(cov.diagonal().T)\n",
      "ci = np.c_[\n",
      "    (predict - critical_value*std_errors)[:, :, None],\n",
      "    (predict + critical_value*std_errors)[:, :, None],\n",
      "][0].T"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Out-of-sample forecasts and confidence intervals\n",
      "nforecast = 20\n",
      "forecast = res.forecast(nforecast)\n",
      "forcast_idx = len(dta_full) + np.arange(nforecast)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Graph\n",
      "fig, ax = plt.subplots(figsize=(12,6))\n",
      "ax.xaxis.grid()\n",
      "ax.plot(predict_idx, dta_miss, 'k.')\n",
      "\n",
      "# Plot\n",
      "ax.plot(predict_idx, predict[0], 'gray');\n",
      "ax.fill_between(predict_idx, ci[0], ci[1], alpha=0.1)\n",
      "\n",
      "ax.plot(forcast_idx[-20:], forecast[0], 'k--', linestyle='--', linewidth=2)\n",
      "\n",
      "ax.set(title='Figure 8.9 - Internet series');"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}