{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Building an ARIMA Model for a Financial Dataset\n", "\n", "In this notebook, you will build an ARIMA model for AAPL stock closing prices. The lab objectives are:\n", "\n", "* Pull data from Google Cloud Storage into a Pandas dataframe\n", "* Learn how to prepare raw stock closing data for an ARIMA model\n", "* Apply the Dickey-Fuller test\n", "* Build an ARIMA model using the statsmodels library" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Make sure you restart the Python kernel after executing the `pip install` command below__! After you restart the kernel you don't have to execute the command again." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install --user statsmodels" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import datetime\n", "\n", "%config InlineBackend.figure_format = 'retina'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import data from Google Cloud Storage\n", "\n", "In this section we'll read about ten years' worth of AAPL stock data into a Pandas dataframe. We want the dataframe to represent a time series, which we achieve by setting the date as the index. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('gs://cloud-training/ai4f/AAPL10Y.csv')\n", "\n", "df['date'] = pd.to_datetime(df['date'])\n", "df.sort_values('date', inplace=True)\n", "df.set_index('date', inplace=True)\n", "\n", "print(df.shape)\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare data for ARIMA \n", "\n", "The first step in our preparation is to resample the data such that stock closing prices are aggregated on a weekly basis. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_week = df.resample('w').mean()\n", "df_week = df_week[['close']]\n", "df_week.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a column for weekly returns. Take the log to of the returns to normalize large fluctuations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_week['weekly_ret'] = np.log(df_week['close']).diff()\n", "df_week.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# drop null rows\n", "df_week.dropna(inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_week.weekly_ret.plot(kind='line', figsize=(12, 6));" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "udiff = df_week.drop(['close'], axis=1)\n", "udiff.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test for stationarity of the udiff series\n", "\n", "Time series are stationary if they do not contain trends or seasonal swings. The Dickey-Fuller test can be used to test for stationarity. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import statsmodels.api as sm\n", "from statsmodels.tsa.stattools import adfuller" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rolmean = udiff.rolling(20).mean()\n", "rolstd = udiff.rolling(20).std()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(12, 6))\n", "orig = plt.plot(udiff, color='blue', label='Original')\n", "mean = plt.plot(rolmean, color='red', label='Rolling Mean')\n", "std = plt.plot(rolstd, color='black', label = 'Rolling Std Deviation')\n", "plt.title('Rolling Mean & Standard Deviation')\n", "plt.legend(loc='best')\n", "plt.show(block=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Perform Dickey-Fuller test\n", "dftest = sm.tsa.adfuller(udiff.weekly_ret, autolag='AIC')\n", "dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])\n", "for key, value in dftest[4].items():\n", " dfoutput['Critical Value ({0})'.format(key)] = value\n", " \n", "dfoutput" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With a p-value < 0.05, we can reject the null hypotehsis. This data set is stationary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ACF and PACF Charts\n", "\n", "Making autocorrelation and partial autocorrelation charts help us choose hyperparameters for the ARIMA model.\n", "\n", "The ACF gives us a measure of how much each \"y\" value is correlated to the previous n \"y\" values prior.\n", "\n", "The PACF is the partial correlation function gives us (a sample of) the amount of correlation between two \"y\" values separated by n lags excluding the impact of all the \"y\" values in between them. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from statsmodels.graphics.tsaplots import plot_acf\n", "\n", "# the autocorrelation chart provides just the correlation at increasing lags\n", "fig, ax = plt.subplots(figsize=(12,5))\n", "plot_acf(udiff.values, lags=10, ax=ax)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from statsmodels.graphics.tsaplots import plot_pacf\n", "\n", "fig, ax = plt.subplots(figsize=(12,5))\n", "plot_pacf(udiff.values, lags=10, ax=ax)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The table below summarizes the patterns of the ACF and PACF.\n", "\n", "\"drawing\"\n", "\n", "The above chart shows that reading PACF gives us a lag \"p\" = 3 and reading ACF gives us a lag \"q\" of 1. Let's Use Statsmodel's ARMA with those parameters to build a model. The way to evaluate the model is to look at AIC - see if it reduces or increases. The lower the AIC (i.e. the more negative it is), the better the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Build ARIMA Model\n", "\n", "Since we differenced the weekly closing prices, we technically only need to build an ARMA model. The data has already been integrated and is stationary. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from statsmodels.tsa.arima_model import ARMA\n", "\n", "# Notice that you have to use udiff - the differenced data rather than the original data. \n", "ar1 = ARMA(tuple(udiff.values), (3, 1)).fit()\n", "ar1.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our model doesn't do a good job predicting variance in the original data (peaks and valleys)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(12, 8))\n", "plt.plot(udiff.values, color='blue')\n", "preds = ar1.fittedvalues\n", "plt.plot(preds, color='red')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make a forecast 2 weeks ahead:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "steps = 2\n", "\n", "# forecast() returns (forecast, stderr, conf_int); keep only the point forecasts\n", "forecast = ar1.forecast(steps=steps)[0]\n", "\n", "plt.figure(figsize=(12, 8))\n", "plt.plot(udiff.values, color='blue')\n", "\n", "preds = ar1.fittedvalues\n", "plt.plot(preds, color='red')\n", "\n", "# green: a segment joining the last fitted value to the first forecast, then the forecasts\n", "plt.plot(pd.DataFrame(np.array([preds[-1],forecast[0]]).T,index=range(len(udiff.values)+1, len(udiff.values)+3)), color='green')\n", "plt.plot(pd.DataFrame(forecast,index=range(len(udiff.values)+1, len(udiff.values)+1+steps)), color='green')\n", "plt.title('Display the predictions with the ARIMA model')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The forecast is not great, but if you tune the hyperparameters some more, you might be able to reduce the errors." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 4 }