{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Building an ARIMA Model for a Financial Dataset\n",
    "\n",
    "In this notebook, you will build an ARIMA model for AAPL stock closing prices. The lab objectives are:\n",
    "\n",
    "* Pull data from Google Cloud Storage into a Pandas dataframe\n",
    "* Learn how to prepare raw stock closing data for an ARIMA model\n",
    "* Apply the Dickey-Fuller test \n",
    "* Build an ARIMA model using the statsmodels library"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Make sure you restart the Python kernel after executing the `pip install` command below__! After you restart the kernel you don't have to execute the command again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --user statsmodels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import datetime\n",
    "\n",
    "%config InlineBackend.figure_format = 'retina'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import data from Google Clod Storage\n",
    "\n",
    "In this section we'll read some ten years' worth of AAPL stock data into a Pandas dataframe. We want to modify the dataframe such that it represents a time series. This is achieved by setting the date as the index. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('gs://cloud-training/ai4f/AAPL10Y.csv')\n",
    "\n",
    "df['date'] = pd.to_datetime(df['date'])\n",
    "df.sort_values('date', inplace=True)\n",
    "df.set_index('date', inplace=True)\n",
    "\n",
    "print(df.shape)\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare data for ARIMA \n",
    "\n",
    "The first step in our preparation is to resample the data such that stock closing prices are aggregated on a weekly basis. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_week = df.resample('w').mean()\n",
    "df_week = df_week[['close']]\n",
    "df_week.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's create a column for weekly returns. Take the log to of the returns to normalize large fluctuations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_week['weekly_ret'] = np.log(df_week['close']).diff()\n",
    "df_week.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# drop null rows\n",
    "df_week.dropna(inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_week.weekly_ret.plot(kind='line', figsize=(12, 6));"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "udiff = df_week.drop(['close'], axis=1)\n",
    "udiff.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test for stationarity of the udiff series\n",
    "\n",
    "Time series are stationary if they do not contain trends or seasonal swings. The Dickey-Fuller test can be used to test for stationarity. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import statsmodels.api as sm\n",
    "from statsmodels.tsa.stattools import adfuller"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rolmean = udiff.rolling(20).mean()\n",
    "rolstd = udiff.rolling(20).std()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(12, 6))\n",
    "orig = plt.plot(udiff, color='blue', label='Original')\n",
    "mean = plt.plot(rolmean, color='red', label='Rolling Mean')\n",
    "std = plt.plot(rolstd, color='black', label = 'Rolling Std Deviation')\n",
    "plt.title('Rolling Mean & Standard Deviation')\n",
    "plt.legend(loc='best')\n",
    "plt.show(block=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Perform Dickey-Fuller test\n",
    "dftest = sm.tsa.adfuller(udiff.weekly_ret, autolag='AIC')\n",
    "dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])\n",
    "for key, value in dftest[4].items():\n",
    "    dfoutput['Critical Value ({0})'.format(key)] = value\n",
    "    \n",
    "dfoutput"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With a p-value < 0.05, we can reject the null hypotehsis. This data set is stationary."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ACF and PACF Charts\n",
    "\n",
    "Making autocorrelation and partial autocorrelation charts help us choose hyperparameters for the ARIMA model.\n",
    "\n",
    "The ACF gives us a measure of how much each \"y\" value is correlated to the previous n \"y\" values prior.\n",
    "\n",
    "The PACF is the partial correlation function gives us (a sample of) the amount of correlation between two \"y\" values separated by n lags excluding the impact of all the \"y\" values in between them. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from statsmodels.graphics.tsaplots import plot_acf\n",
    "\n",
    "# the autocorrelation chart provides just the correlation at increasing lags\n",
    "fig, ax = plt.subplots(figsize=(12,5))\n",
    "plot_acf(udiff.values, lags=10, ax=ax)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from statsmodels.graphics.tsaplots import plot_pacf\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(12,5))\n",
    "plot_pacf(udiff.values, lags=10, ax=ax)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The table below summarizes the patterns of the ACF and PACF.\n",
    "\n",
    "<img src=\"../imgs/How_to_Read_PACF_ACF.jpg\" alt=\"drawing\" width=\"300\" height=\"300\"/>\n",
    "\n",
    "The above chart shows that reading PACF gives us a lag \"p\" = 3 and reading ACF gives us a lag \"q\" of 1. Let's Use Statsmodel's ARMA with those parameters to build a model. The way to evaluate the model is to look at AIC - see if it reduces or increases. The lower the AIC (i.e. the more negative it is), the better the model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build ARIMA Model\n",
    "\n",
    "Since we differenced the weekly closing prices, we technically only need to build an ARMA model. The data has already been integrated and is stationary. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from statsmodels.tsa.arima_model import ARMA\n",
    "\n",
    "# Notice that you have to use udiff - the differenced data rather than the original data. \n",
    "ar1 = ARMA(tuple(udiff.values), (3, 1)).fit()\n",
    "ar1.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our model doesn't do a good job predicting variance in the original data (peaks and valleys)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(12, 8))\n",
    "plt.plot(udiff.values, color='blue')\n",
    "preds = ar1.fittedvalues\n",
    "plt.plot(preds, color='red')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's make a forecast 2 weeks ahead:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "steps = 2\n",
    "\n",
    "forecast = ar1.forecast(steps=steps)[0]\n",
    "\n",
    "plt.figure(figsize=(12, 8))\n",
    "plt.plot(udiff.values, color='blue')\n",
    "\n",
    "preds = ar1.fittedvalues\n",
    "plt.plot(preds, color='red')\n",
    "\n",
    "plt.plot(pd.DataFrame(np.array([preds[-1],forecast[0]]).T,index=range(len(udiff.values)+1, len(udiff.values)+3)), color='green')\n",
    "plt.plot(pd.DataFrame(forecast,index=range(len(udiff.values)+1, len(udiff.values)+1+steps)), color='green')\n",
    "plt.title('Display the predictions with the ARIMA model')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The forecast is not great but if you tune the hyper parameters some more, you might be able to reduce the errors."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}