{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n", "\n", "
Author: Denis Mironov (@dmironov)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#
Forecasting stock returns with ARIMA, and prices with Ridge and Lasso
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Feature and data explanation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Description of the task" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the project, we are analyzing five assets: \n", " \n", " - Barrick Gold Corporation (ABX) Basic Industries\n", " - Walmart Inc. (WMT) Consumer Services\n", " - Caterpillar Inc (CAT) Capital Goods\n", " - BP p.l.c. (BP) Energy\n", " - Ford Motor Company (F) Capital Goods \n", " - General Electric Company (GE) Energy \n", " \n", "based on Yahoo Finance (https://finance.yahoo.com) data. In order to reproduce the results of the project, the data for the assets should be either collected manually from Yahoo Finance for the daily time-period from the 10th of December 2013 to 7th of December 2018 or be downloaded here https://github.com/dmironov1993/Data\n", "\n", "In this project, our goal is to predict stock returns with Autoregressive Integrated Moving Average (ARIMA) model and stock prices with Ridge and Lasso. ARIMA model is defined by three parameters $(p,d,q)$, where $p$ - the order of the autoregressive model, $d$ - the degree of differencing and $q$ - the order of the moving-average model. The choice of these parameters is done by brute-forcing (grid search) and choosing the model with the lowest Akaike Information Criterion. If time-series is stationary, we are left with p and q parameters, while $d=0$. Such a model is usually called ARMA, and we use this abbreviation hereinafter. However, since Python does not have several important libraries as those in R, for instance the libraries for ARMA-GARCH simultaneous analysis, stock prices will be predicted as well by using machine learning (ML) algorithms such as Ridge and Lasso. At the appropriate stage of our work, we will generate additional features. ML algorithms will be tuned by using grid search of hyperparameters. The performance of ML algorithms will be tested on a houldout sample. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2. Libraries and Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we import libraries and load the data in a CSV form which we are going to analysis and working with." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Jupiter notebook setup and Importing libraries \n", "# By default, all figures are shown in 'png'. 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2. Libraries and Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we import the libraries and load the CSV data which we are going to analyze and work with." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Jupyter notebook setup and library imports \n", "# By default, all figures are shown in 'png'. Switching to 'svg' gives higher-quality figures\n", "%config InlineBackend.figure_format = 'png'\n", "import warnings\n", "warnings.simplefilter('ignore')\n", "\n", "# Data manipulations\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# Visualization\n", "import seaborn as sns\n", "from matplotlib import pyplot as plt\n", "%matplotlib inline\n", "from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot\n", "import plotly\n", "import plotly.graph_objs as go\n", "from plotly import tools\n", "import plotly.plotly as py\n", "init_notebook_mode(connected=True)\n", "\n", "# ARIMA (ARMA) modelling\n", "import statsmodels.api as sm\n", "import statsmodels.tsa.api as smt\n", "import statsmodels.tsa.stattools as ts\n", "\n", "# Statistics\n", "import scipy.stats as scs\n", "from scipy.stats import skew\n", "from scipy.stats import kurtosis\n", "from statsmodels.tsa.stattools import kpss\n", "\n", "# Ljung-Box test (to check whether residuals are white noise)\n", "from statsmodels.stats.diagnostic import acorr_ljungbox\n", "\n", "\n", "# Metrics for ML (ARIMA/ARMA has an embedded AIC criterion)\n", "from sklearn.metrics import mean_absolute_error\n", "\n", "# Hyperparameter tuning and validation\n", "from sklearn.model_selection import GridSearchCV, TimeSeriesSplit \n", "from sklearn.preprocessing import StandardScaler\n", "\n", "# Machine learning algorithms\n", "#from sklearn.linear_model import LassoCV, RidgeCV\n", "from sklearn.linear_model import Ridge, Lasso" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Symbols of assets\n", "asset_names = ['ABX','WMT','CAT','BP','F','GE']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# function for loading csv\n", "def read_csv(symbols):\n", " \n", " \"\"\" \n", " reading csv files\n", " \n", " Input: list \n", " - symbols of traded stocks \n", " \n", " Output: tuple\n", " - dataframes of traded stocks\n", " \n", " \"\"\"\n", " \n", " ListofAssets_df = []\n", " for asset in symbols:\n", " ListofAssets_df.append(pd.read_csv('%s.csv' % asset, sep=',')\\\n", " .rename(columns={'Adj Close': '%s_Adj_close' % asset})\\\n", " .sort_values(by='Date', ascending=False))\n", " \n", " return tuple(ListofAssets_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# number of dataframes within read_csv\n", "print (len(read_csv(asset_names)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# number of rows and columns within each df\n", "for df in read_csv(asset_names):\n", " print (df.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# view of one of the dataframes\n", "read_csv(asset_names)[0].head(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# information about the structure of the data\n", "read_csv(asset_names)[0].info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that there are no missing values in the datasets of our interest."
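 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can also be verified explicitly; the following quick check (an illustrative addition, not part of the original pipeline) counts the missing values in each dataframe:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# total number of missing values in each dataframe (should be zero for all assets)\n", "for name, df in zip(asset_names, read_csv(asset_names)):\n", "    print (name, df.isnull().sum().sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each printed total should be zero, confirming there are no gaps to fill."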
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the obtained historical prices, we have the following information for each of the asset:\n", "\n", " - **Date**: Date\n", " - **Open**: Open price within a date\n", " - **High**: The highest price within a date\n", " - **Low**: The lowest price within a date\n", " - **Close**: Close price within a date\n", " - **`NAME`_Adj_close**: Adjusted close price in the end of a date\n", " - **Volume**: Trading volume\n", " \n", "In addition to this information, we are interested in daily returns of the assets. Here the latter will be calculated by using\n", "\n", " ###
 $ r^{(i)}_{t} := \\ln\\left(\\frac{P^{(i)}_{t}}{P^{(i)}_{t-1}} \\right) = \\ln\\left( P^{(i)}_{t}\\right) - \\ln\\left( P^{(i)}_{t-1} \\right),$
\n", "\n", "where $ln$ stands for natural logarithm, $P^{(i)}_{t}$ is the adjusted close price of the asset ${i}$ at the time moment ${t}$, while $P^{(i)}_{t-1}$ is at the previous moment of time: ${t-1}$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# function for adding log-returns column to dataframes \n", "def add_log_returns(assets_df, symbols):\n", " \n", " \"\"\" \n", " Calculating returns of the assets\n", " \n", " Input:\n", " - assets_df: is a tuple of dataframes\n", " - symbols: list with symbols of traded stocks in the same order as those in assets_df\n", " Output:\n", " - tuple of dataframes with returns, dataframe's index is Date now, \n", " dataframe is also sorted by index in ascending order.\n", " \n", " \"\"\"\n", " \n", " ListofAssets_df = []\n", " num_asset = 0\n", " \n", " if len(assets_df) == len(symbols):\n", " \n", " for df in assets_df:\n", " adj_closing_price = df['%s_Adj_close' % symbols[num_asset]]\n", " log_array = np.log(np.array(adj_closing_price))\n", " log_return_array = log_array - np.append(log_array[1:], np.nan)\n", " log_return_df = pd.DataFrame(log_return_array,\n", " columns=['%s_returns' % symbols[num_asset]])\n", " df = df.reset_index().drop(columns=['index'])\n", " df = pd.concat((df, log_return_df), axis=1)\n", " df['Date'] = df['Date'].apply(pd.to_datetime)\n", " df = df.set_index('Date')\n", " df = df.sort_index(ascending=True)\n", " df.dropna(axis=0, inplace=True)\n", " ListofAssets_df.append(df)\n", " num_asset += 1\n", " \n", " return tuple(ListofAssets_df)\n", " \n", " else:\n", " print ('Number of DataFrames and Number of assets considered should be equal')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "asset_dfs = add_log_returns(read_csv(asset_names), asset_names)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we have added returns, year, month and day columns\n", "for df in asset_dfs:\n", " print (df.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "asset_dfs[0].head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now each of the dataframes, in addition to already presented six columns such as Open, High, Low, Close, NAME_Adj_close and Volume, has one additional column related to returns. Note that Date is index now rather than a column. Dataframes have been sorted by Date in ascending order. The latter is needed further for correct time series cross validation. Number of rows has been reduced to 1257 (1258 originally) since our data does not allow to calculate returns before 11th of December 2013." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For convenience, we present two functions. One to create a dataframe consisting of only adjusted close prices of our assets, while the other one is to create a dataframe consisting of only asset returns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def asset_adj_close(list_of_df, symbols):\n", " \"\"\"\n", " \n", " Input:\n", " - list_of_df: list of dataframes. 
Each of the latter should have \n", " a column corresponding to adjusted close price\n", " - symbols: list of asset symbols taken from a stock market\n", " Output: \n", " - pandas dataframe consisting of only adjusted close prices of the considered assets\n", " \n", " \"\"\"\n", " \n", " adj_close = []\n", " number = 0\n", " for asset in symbols:\n", " df = list_of_df[number]['%s_Adj_close' % asset]\n", " adj_close.append(df)\n", " number += 1\n", " \n", " return pd.concat(adj_close, axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "adj_close = asset_adj_close(asset_dfs, asset_names)\n", "adj_close.head(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def asset_returns(list_df, symbols):\n", " \"\"\"\n", " \n", " Input: \n", " - list_df: list of dataframes each of which contains an asset returns column\n", " - symbols: list of asset symbols taken from a stock market\n", " Output: \n", " - a dataframe consisting of only asset returns\n", "\n", " \"\"\"\n", " \n", " asset_returns = []\n", " k = 0\n", " for asset in symbols:\n", " asset_returns.append(list_df[k]['%s_returns' % asset])\n", " k += 1\n", "\n", " return pd.concat(asset_returns, axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "returns = asset_returns(asset_dfs, asset_names)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "returns.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we perform primary visual data analysis of adjusted close prices and returns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Primary data analysis, Primary visual data analysis, Insights and found dependencies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The structure of this section is as follows:\n", "\n", "- Correlation and Pairplot\n", "- A normal quantile-quantile plot and Comparison of their Kernel Density Estimation (KDE) to the closest parametric normal distribution\n", "- Skewness, Kurtosis, Max value, Min value, Mean and Variance\n", "- Box plot\n", "- Time Series plot\n", "- Autocorrelation Function and Partial Autocorrelation Function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1. Correlation and Pairplot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we provide a correlation table and pairwise scatter plots for both prices and returns of Barrick Gold Corporation (ABX), Walmart Inc. (WMT), Caterpillar Inc (CAT), BP p.l.c. (BP), Ford Motor Company (F) and General Electric Company (GE). Comparing Pearson and Spearman correlations lets us gauge the effect of large outliers on the overall picture of how the assets move."
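 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a toy illustration of why we compare the two (synthetic numbers, not our asset data): Spearman correlation is computed on ranks, so a single large outlier barely affects it, while Pearson correlation is pulled away from 1." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# one outlier (100.0) in an otherwise perfectly monotone relationship\n", "x = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])\n", "y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])\n", "print (x.corr(y, method='pearson'))   # well below 1, dragged by the outlier\n", "print (x.corr(y, method='spearman'))  # exactly 1.0, since the ranks agree"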
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we present two auxiliary functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def corr_plot(df, symbols, days=252, title='Returns'):\n", " \n", " \"\"\"\n", " Input: \n", " - df: a dataframe consisting only of data which correlations will be plotted\n", " - symbols: a list of the companies stocks names\n", " - days: represent the number of days to the past, set it to 0 to get consider the whole range\n", " - title: the plot title in accordance with df\n", " Output: \n", " - illustration of assets correlations\n", " \n", " \"\"\"\n", " \n", " fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,8))\n", " sns.heatmap(df[-days:].corr(method='pearson'), annot=True, fmt='.2f', cmap=\"YlGnBu\",\n", " xticklabels=symbols, yticklabels=symbols, ax=axes[0])\n", " sns.heatmap(df[-days:].corr(method='spearman'), annot=True, fmt='.2f', cmap=\"YlGnBu\",\n", " xticklabels=symbols, yticklabels=symbols, ax=axes[1])\n", " axes[0].set_title('Pearson correlation (%s)' % title)\n", " axes[1].set_title('Spearman correlation (%s)' % title)\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def pairplot(df, days=252, width=9, height=9):\n", " \n", " \"\"\"\n", " Input:\n", " - df: a dataframe consisting only of data which will be plotted\n", " - width and height: a corresponding size of the plot\n", " Output: \n", " - illustration of assets pairplot\n", " \n", " \"\"\"\n", " g = sns.pairplot(df[-days:]);\n", " g.fig.set_size_inches(width,height)\n", " plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1.1 Asset Returns " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corr_plot(returns, asset_names, title='Returns')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the figure right above one can see that correlation between asset returns does not change much between two types of correlations. The returns tend to comoving within 252 days time interval." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's take a look at the whole picture of pair-correlations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pairplot(returns, width=9, height=9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1.2 Asset Prices " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corr_plot(adj_close, asset_names, title='Adj close price')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the figure right above, Pearson correlation for asset adjusted close price is demonstrated on the left, while Spearman correlation is on the right. We see that some of the prices tend to comoving while others moving in opposite directions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pairplot(adj_close, width=9, height=9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above is the scatter plot for adjusted close prices. This allows us to see general tendency of asset pairs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to the correlation analysis, the assets considered here tend to comoving. However, we should\n", "also remember what time-interval is considered. 
Here we take into account a 252-day interval, corresponding to about one trading year. If some other time interval is of interest, the correlation coefficients for all of the assets change accordingly. It is not possible to know in advance whether the correlation between the assets will become higher, remain the same or become lower when the time-period changes, but such an analysis gives an estimate of their plausible correlation values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2. A normal quantile-quantile plot and Comparison of their Kernel Density Estimation (KDE) to the closest parametric normal distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section, we discuss the return and adjusted close price distributions of Barrick Gold Corporation (ABX), Walmart Inc. (WMT), Caterpillar Inc (CAT), BP p.l.c. (BP), Ford Motor Company (F) and General Electric Company (GE). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As previously, we introduce several auxiliary functions first." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# normal probability density function with mean mu and standard deviation sigma\n", "def norm_function(x, mu, sigma):\n", " return (1/(sigma*np.sqrt(2*np.pi))) * np.exp(-(x-mu)**2 / (2*sigma**2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def qqplot_returns(df, symbols, option='returns', size=(10,5), sharex=False, sharey=False, wspace=0.6, hspace=0.4):\n", " \n", " \"\"\"\n", " \n", " Input:\n", " - df: a dataframe with the columns to plot\n", " - symbols: a list of asset symbols\n", " - option: 'returns' or 'adj_close', in accordance with df\n", " - size: figure size\n", " - sharex, sharey: whether the subplots share axes\n", " - wspace, hspace: spacing between subplots\n", " \n", " \n", " Output:\n", " - illustration of a normal quantile-quantile plot and a KDE\n", " \n", " \"\"\"\n", " nrows = len(symbols)\n", " if nrows != 1:\n", " fig, axes = plt.subplots(nrows=nrows, ncols=2, sharex=sharex, sharey=sharey, figsize=size)\n", " plt.subplots_adjust(wspace=wspace, hspace=hspace)\n", " for i in range(nrows):\n", " \n", " if option == 'returns':\n", " name = '%s' % symbols[i]\n", " name_return = '%s_returns' % symbols[i]\n", " elif option == 'adj_close':\n", " name = '%s' % symbols[i]\n", " name_return = '%s_Adj_close' % symbols[i]\n", " \n", " probplot = sm.ProbPlot(df[name_return], dist='norm') \n", " fig = probplot.qqplot(line='q', ax=axes[i,0])\n", " axes[i,0].set_title('Normal Q-Q Plot (%s)' % name)\n", "\n", " sns.distplot(df[name_return], kde=True, hist=False, ax=axes[i,1], color='black')\n", " count, mean, std, min_, q1, q2, q3, max_ = df[name_return].describe()\n", " xx = np.linspace(min_, max_, 1000)\n", " axes[i,1].plot(xx, norm_function(xx, mean, std), '--', color='red') \n", " axes[i,1].set_title(name)\n", " axes[i,1].set_xlabel('')\n", " plt.show()\n", " \n", " else: \n", " # single-asset case\n", " if option == 'returns':\n", " name = '%s' % symbols[0]\n", " name_return = '%s_returns' % symbols[0]\n", " elif option == 'adj_close':\n", " name = '%s' % symbols[0]\n", " name_return = '%s_Adj_close' % symbols[0]\n", " \n", " fig, axes = plt.subplots(nrows=1, ncols=2, sharex=sharex, sharey=sharey, figsize=size)\n", " plt.subplots_adjust(wspace=wspace, hspace=hspace)\n", "\n", " probplot = sm.ProbPlot(df[name_return], dist='norm') \n", " fig = probplot.qqplot(line='q', ax=axes[0])\n", " axes[0].set_title('Normal Q-Q Plot (%s)' % name)\n", "\n", " sns.distplot(df[name_return], kde=True, hist=False, ax=axes[1], 
color='black')\n", " count, mean, std, min_, q1, q2, q3, max_ = df[name_return].describe()\n", " xx = np.linspace(min_, max_, 1000)\n", " axes[1].plot(xx, norm_function(xx, mean, std), '--', color='red') \n", " axes[1].set_title(name)\n", " axes[1].set_xlabel('')\n", " plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2.1 Asset Prices " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "qqplot_returns(adj_close, asset_names, option='adj_close',\n", " size=(10,20), sharex=False, sharey=False, wspace=0.6, hspace=0.6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the left-hand side the normal quantile-quantile plots are shown, while on the right-hand side the kernel density estimates and their closest parametric normal distributions are illustrated. Red lines correspond to the normal distribution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that some of the adjusted close prices have bi-modal distributions, while others have an even more complex structure which does not appear anywhere near normal." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2.2 Asset Returns " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "qqplot_returns(returns, asset_names, option='returns',\n", " size=(10,20), sharex=False, sharey=False, wspace=0.6, hspace=0.6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Red lines correspond to the normal distribution. Note that the return distributions are not normal: both kinds of plots show fatter tails and higher kurtosis. Nevertheless, their structure is closer to the normal distribution than that of the adjusted close prices." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this stage of the work we may conclude that fat tails could become a problem for our ARIMA or ARMA modelling, since because of them we may not capture all the information in the time-series. Let's keep this in mind and move on." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3. Skewness, Kurtosis, Max value, Min value, Mean and Variance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's gather more statistics about the target values: their skewness, kurtosis, max value, min value, mean and variance."
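 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For reference (standard definitions matching what scipy.stats computes), with central moments $m_j = \\frac{1}{n}\\sum_{t=1}^{n}\\left(x_t - \\bar{x}\\right)^j$, the sample skewness and the excess kurtosis reported below are\n", "\n", "$$ g_1 = \\frac{m_3}{m_2^{3/2}}, \\qquad g_2 = \\frac{m_4}{m_2^{2}} - 3. $$\n", "\n", "A normal distribution has $g_1 = 0$ and $g_2 = 0$, so these two columns measure the departure from normality noted in the Q-Q plots above."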
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def stats(df, symbols):\n", " \"\"\"\n", " \n", " Input: \n", " - symbols: a list of asset symbols\n", " \n", " Output:\n", " - a dataframe containing information such as Skewness, Kurtosis, Max value,\n", " Min value, Mean and Variance of the df\n", " \n", " \"\"\"\n", " \n", " stat = pd.DataFrame(index=asset_names, \n", " columns=['Skewness','Kurtosis','Max value',\n", " 'Min value','Mean','Variance'])\n", " \n", " stat['Skewness'] = skew(df, axis=0)\n", " stat['Kurtosis'] = kurtosis(df, axis=0)\n", " stat['Max value'] = df.agg('max').values\n", " stat['Min value'] = df.agg('min').values\n", " stat['Mean'] = df.agg('mean').values\n", " stat['Variance'] = df.agg('var').values\n", "\n", " return stat" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stats(adj_close, asset_names)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stats(returns, asset_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4. Box plot of Returns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Box plot will help us to get the precise information about outliers and how they fit in the whole picture." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def asset_box_plot(df, symbols, title=None, width=700, height=400, \n", " jitter=0.2, pointpos=-1.5, boxpoints = 'suspectedoutliers'):\n", " \"\"\"\n", " \n", " Input: \n", " - df: a dataframe which columns will be plotted, \n", " - symbols: a list of symbols in the order the same as that in dataframe.\n", " - title:\n", " - width: \n", " - height: \n", " - jitter: \n", " - pointpos: \n", " \n", " Output: \n", " - Box plot illustrated by plotly library\n", " \n", " \"\"\"\n", " \n", " data=[]\n", " for i in range(len(symbols)):\n", " trace = go.Box(y = df.iloc[:,i],\n", " name = symbols[i],\n", " jitter=jitter,\n", " pointpos=pointpos,\n", " boxpoints = boxpoints)\n", "\n", " data.append(trace)\n", " \n", " \n", " layout = go.Layout(title = title,\n", " autosize=False,\n", " width=width,\n", " height=height)\n", "\n", " fig = go.Figure(data=data,layout=layout)\n", "\n", " iplot(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4.1 Asset Price" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "asset_box_plot(adj_close, asset_names, title='Adj close price', boxpoints = 'suspectedoutliers')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4.2 Asset Returns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "asset_box_plot(returns, asset_names, title='Returns')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.5. 
Time-series plots" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A function for plotting the stock price history of all assets\n", "def plot_prices(df, symbols, width=500, height=300):\n", " \"\"\"\n", " \n", " Input: \n", " - df: dataframe containing only adjusted close price columns\n", " - symbols: list with symbols of traded stocks\n", " \n", " Output: asset price plot drawn with plotly\n", " \n", " \"\"\"\n", " traces = []\n", " for asset in symbols:\n", " trace = go.Scatter(\n", " x=df.index,\n", " y=df['%s_Adj_close' % asset],\n", " name = '%s price' % asset)\n", " traces.append(trace)\n", " \n", " layout = go.Layout(title='Adj close price history', \n", " autosize=False,\n", " width=width,\n", " height=height)\n", " fig = go.Figure(data=traces, layout=layout) \n", " \n", " return iplot(fig, show_link=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.5.1 Asset Prices " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_prices(adj_close, asset_names, height=600, width=800)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.5.2 Asset Returns " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A function for plotting the history of log returns for all assets\n", "def plot_returns(df, symbols, width=500, height=300):\n", " \"\"\"\n", " \n", " Input: \n", " - df: dataframe containing only asset returns columns\n", " - symbols: list with symbols of traded stocks\n", " \n", " Output: asset log returns plot drawn with plotly\n", " \n", " \"\"\"\n", " traces = []\n", " for asset in symbols:\n", " trace = go.Scatter(\n", " x=df.index,\n", " y=df['%s_returns' % asset],\n", " name = '%s returns' % asset,\n", " opacity=0.8)\n", " traces.append(trace)\n", " \n", " layout = go.Layout(title='Returns history',\n", " autosize=False,\n", " width=width,\n", " height=height)\n", " fig = go.Figure(data=traces, layout=layout) \n", " \n", " return iplot(fig, show_link=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_returns(returns, asset_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It might be useful to see all the return values separately."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A function for plotting history of log returns for all assets\n", "def plot_returns_indiv(df, symbols, width=800, height=600):\n", " \"\"\"\n", " \n", " Input: First: list with symbols of traded stocks\n", " Second: iterator with dataframes of traded stocks\n", " \n", " Output: asset log returns plotting individually with plotly\n", " \n", " \"\"\"\n", " traces = []\n", " count = 0\n", " for asset in symbols:\n", " trace = go.Scatter(\n", " x=df.index,\n", " y=df['%s_returns' % asset],\n", " name = '%s returns' % asset,\n", " opacity=0.8)\n", " traces.append(trace)\n", " count += 1\n", " \n", " fig = tools.make_subplots(rows=int(len(symbols)/2+0.5), cols=2, shared_yaxes=True)\n", "\n", " i = 0\n", " while i < count:\n", " for ncol in [1,2]:\n", " for nrow in range(1, int(len(symbols)/2+1.5)):\n", " fig.append_trace(traces[i], nrow, ncol)\n", " i += 1\n", " \n", " fig['layout'].update(width=width, height=height, title='Returns History')\n", " \n", " return iplot(fig, show_link=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_returns_indiv(returns, asset_names);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the figures of the return time-series one can see that \n", "\n", "1. The assets do not have long-term deviations from the mean and are mainly oscillating over some constant values that is almost zero. This is a property of stationary process.\n", "2. There are periods of high volatility followed by periods of relatively tranquility (volatility clustering).\n", "3. One can also notice that volatility clustering does not indicate a lack of stationarity but rather can be viewed as a type of dependence in the conditional variance of each series." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we are going to determine whether price and return time-series are stationary by finding Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.6. Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When a time series is observed, a natural question is whether it appears to be stationary. In this section, we plot ACF and PACF in order to check the series on the presence of stationarity. We also perform Augmented Dickey-Fuller test as well as Kwiatkowski-Phillips-Schmidt-Shin test. The former tests null hypothesis that timeseries is not stationary against an alternative hypothesis that the time-series is stationary, while the latter test is designed to test null stationarity against an alternative of non-stationarity." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def stationary_analysis(df, symbols, lags, option='price', figsize=(10,20), wspace=0.3, hspace=0.5):\n", " \"\"\"\n", " Plot time series, its Autocorrelation Function (ACF) and Partial Autocorrelation function (PACF)\n", " \n", " Input: \n", " - df: Dataframe where columns represent either price or return values. \n", " - symbols: Whether stationary analysis is perfored for price or return is \n", " - lags: regulated by option parameter. 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we are going to determine whether the price and return time-series are stationary by examining the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.6. Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When a time series is observed, a natural question is whether it appears to be stationary. In this section, we plot the ACF and PACF in order to check the series for stationarity. We also perform the Augmented Dickey-Fuller test as well as the Kwiatkowski-Phillips-Schmidt-Shin test. The former tests the null hypothesis that a time-series is not stationary against the alternative that it is stationary, while the latter tests the null hypothesis of stationarity against the alternative of non-stationarity." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def stationary_analysis(df, symbols, lags, option='price', figsize=(10,20), wspace=0.3, hspace=0.5):\n", " \"\"\"\n", " Plot the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) of time series\n", " \n", " Input: \n", " - df: Dataframe whose columns represent either price or return values \n", " - symbols: list of asset symbols to analyze \n", " - lags: number of lags to show in the plots \n", " - option: 'price' or 'return'; regulates whether the analysis is performed for prices or returns\n", " - figsize: figure size\n", " - wspace: horizontal spacing between subplots \n", " - hspace: vertical spacing between subplots\n", " \n", " Output: \n", " - Illustration of the Autocorrelation function in the left column, \n", " and the Partial autocorrelation function in the right\n", " \"\"\"\n", " \n", " nrows = df.shape[1]\n", " fig, axes = plt.subplots(nrows=nrows, ncols=2, figsize=figsize)\n", " plt.subplots_adjust(wspace=wspace, hspace=hspace)\n", " row = 0\n", " for asset in symbols:\n", " if option == 'price':\n", " smt.graphics.plot_acf(df['%s_Adj_close' % asset], lags=lags, ax=axes[row,0])\n", " smt.graphics.plot_pacf(df['%s_Adj_close' % asset], lags=lags, ax=axes[row,1])\n", " axes[row,0].set_title('%s Adj close (ACF)' % asset)\n", " axes[row,1].set_title('%s Adj close (PACF)' % asset)\n", " row += 1\n", " elif option == 'return':\n", " smt.graphics.plot_acf(df['%s_returns' % asset], lags=lags, ax=axes[row,0])\n", " smt.graphics.plot_pacf(df['%s_returns' % asset], lags=lags, ax=axes[row,1])\n", " axes[row,0].set_title('%s returns (ACF)' % asset)\n", " axes[row,1].set_title('%s returns (PACF)' % asset)\n", " row += 1 \n", " \n", " return plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def adf_kpss(df, symbols, option='price'):\n", " \n", " \"\"\"\n", " Input:\n", " - df: a dataframe containing only either prices or returns\n", " - symbols: list of asset symbols to analyze\n", " - option: either 'price' or 'returns', in accordance with df\n", " \n", " Output:\n", " - a dataframe containing Augmented Dickey-Fuller (ADF) and Kwiatkowski–Phillips–Schmidt–Shin (KPSS) \n", " test p-values for the asset prices or returns \n", " \n", " \"\"\"\n", " res = pd.DataFrame(index=symbols, columns=['ADF','KPSS'])\n", " \n", " if option == 'price':\n", " adf_price = []\n", " kpss_price = []\n", " for i in range(len(symbols)):\n", " adf_price.append(ts.adfuller(df.iloc[:,i])[1]) # [1] is the p-value\n", " kpss_price.append(kpss(df.iloc[:,i])[1])\n", " res['ADF'] = np.array(adf_price)\n", " res['KPSS'] = np.array(kpss_price)\n", " \n", " elif option == 'returns':\n", " adf_returns = []\n", " kpss_returns = []\n", " for i in range(len(symbols)):\n", " adf_returns.append(ts.adfuller(df.iloc[:,i])[1])\n", " kpss_returns.append(kpss(df.iloc[:,i])[1])\n", " res['ADF'] = np.array(adf_returns)\n", " res['KPSS'] = np.array(kpss_returns)\n", " \n", " return res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.6.1 Asset Prices " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stationary_analysis(adj_close, asset_names, option='price', lags=30, figsize=(20,20))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above are the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) for the six price time-series (Barrick Gold Corporation (ABX), Walmart Inc. (WMT), Caterpillar Inc (CAT), BP p.l.c. (BP), Ford Motor Company (F) and General Electric Company (GE)). One can see that each ACF shows a very slow (barely noticeable), linear decay pattern, which is typical of a non-stationary time series. Besides, each PACF has one significant spike at lag 1 (lag 0 is not accounted for), meaning that (almost) all the higher-order autocorrelations are effectively explained by the lag-1 autocorrelation. Such ACF and PACF are typical of non-stationary time-series. 
There are also formal test statistics that either support or reject our assumptions.\n", "\n", "To test our initial guess about the behavior of the price time-series, we perform the Augmented Dickey-Fuller\n", "(ADF) test and the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "adf_kpss_price = pd.DataFrame(index=asset_names, columns=['ADF','KPSS'])\n", "\n", "adf_price = []\n", "kpss_price = []\n", "for i in range(len(asset_names)):\n", " adf_price.append(ts.adfuller(adj_close.iloc[:,i])[1])\n", " kpss_price.append(kpss(adj_close.iloc[:,i])[1])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "adf_kpss_price['ADF'] = np.array(adf_price)\n", "adf_kpss_price['KPSS'] = np.array(kpss_price)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "adf_kpss_price" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This table contains the p-values used to decide whether or not to reject the null hypotheses. Since the ADF test gives p-values larger than 0.05, we do not have enough evidence to reject its null hypothesis, which states that a time-series is not stationary, at the 95% confidence level. Besides, the KPSS p-values show that we reject its null hypothesis, which states that a time-series is stationary, at the 95% confidence level for all assets but ABX. The latter is probably a deviation due to insufficient data. In general, the combination of the ADF and KPSS test results confirms that the time-series of the adjusted close prices are not stationary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.6.2 Asset Returns " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stationary_analysis(returns, asset_names, option='return', lags=20, figsize=(20,20))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ACF and PACF indicate that the time-series of returns should be stationary. Let's support this by performing the ADF and KPSS tests." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "adf_kpss_returns = pd.DataFrame(index=asset_names, columns=['ADF','KPSS'])\n", "\n", "adf_returns = []\n", "kpss_returns = []\n", "for i in range(len(asset_names)):\n", " adf_returns.append(ts.adfuller(returns.iloc[:,i])[1])\n", " kpss_returns.append(kpss(returns.iloc[:,i])[1])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "adf_kpss_returns['ADF'] = np.array(adf_returns)\n", "adf_kpss_returns['KPSS'] = np.array(kpss_returns)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "adf_kpss_returns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the table we see that for all the assets except GE, the ADF null hypothesis that a time-series is not stationary may be rejected at the 95% confidence level, while the KPSS p-values indicate that there is not enough evidence to reject its null hypothesis that a time-series is stationary at the 95% confidence level. We may therefore conclude that the return time-series are stationary, meaning that we can use ARMA for the ABX, WMT, CAT, BP and F assets. However, we should be careful with GE; to be on the safe side, we include the $d$ parameter when doing the ARIMA grid search for GE. A compact joint reading of the two tests is sketched right below."
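 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following one-liner (an illustrative addition) encodes the joint decision rule used above: a series is treated as stationary when the ADF test rejects its null (p-value < 0.05) and the KPSS test fails to reject its null (p-value > 0.05)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# True = both tests agree that the return series is stationary\n", "(adf_kpss_returns['ADF'] < 0.05) & (adf_kpss_returns['KPSS'] > 0.05)"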
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Metrics and Model selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1. ARIMA Model and Ljung-Box test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we build ARIMA models for the stationary time-series of asset returns following these steps:\n", "\n", "1. Identify the $p$ and $q$ values of ARIMA by grid search based on the **Akaike Information Criterion (AIC)**. \n", "2. Plot figures and investigate the residual distributions (whether the residuals have white-noise properties or not)\n", "3. Apply the Ljung-Box test to the ARIMA residuals (the null hypothesis is that the residuals are white noise)\n", "4. Make predictions by using the ARIMA models\n", "\n", "We use material from:\n", "\n", "1. https://mlcourse.ai/notebooks/blob/master/jupyter_english/topic09_time_series/topic9_part1_time_series_python.ipynb?flush_cache=true\n", "\n", "2. http://www.blackarbs.com/blog/time-series-analysis-in-python-linear-models-to-garch/11/1/2016" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define a function that fits ARIMA(p, d, q) models and\n", "# picks the best order and final model based on the AIC\n", "\n", "def arima_model(df, symbol):\n", "\n", " \"\"\"\n", " \n", " Input: \n", " - df: dataframe with asset returns columns\n", " - symbol: symbol of the asset whose returns are modelled\n", " \n", " Output:\n", " - best_aic: the lowest AIC found during the grid search\n", " - best_order: the corresponding (p, d, q) order\n", " - best_mdl: the fitted model with the lowest AIC\n", " \n", " \"\"\"\n", " \n", " asset = df['%s_returns' % symbol] \n", " best_aic = np.inf \n", " best_order = None\n", " best_mdl = None\n", "\n", " pq_rng = range(6) # [0,1,2,3,4,5] (for GE this was extended up to range(8))\n", " d_rng = range(1) # [0] (for GE we used range(2))\n", " \n", " for p in pq_rng:\n", " for d in d_rng:\n", " for q in pq_rng:\n", " try:\n", " tmp_mdl = smt.ARIMA(asset, order=(p,d,q)).fit(method='mle', trend='nc')\n", " tmp_aic = tmp_mdl.aic\n", " if tmp_aic < best_aic:\n", " best_aic = tmp_aic\n", " best_order = (p, d, q)\n", " best_mdl = tmp_mdl\n", " except: continue\n", "\n", "\n", "# print(' {}: aic: {:6.5f} | order: {}'.format(symbol, best_aic, best_order))\n", " return best_aic, best_order, best_mdl" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def tsplot(y, lags=None, figsize=(10, 8), style='bmh', name='asset'):\n", " if not isinstance(y, pd.Series):\n", " y = pd.Series(y)\n", " with plt.style.context(style): \n", " fig = plt.figure(figsize=figsize)\n", " #mpl.rcParams['font.family'] = 'Ubuntu Mono'\n", " layout = (3, 2)\n", " ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)\n", " acf_ax = plt.subplot2grid(layout, (1, 0))\n", " pacf_ax = plt.subplot2grid(layout, (1, 1))\n", " qq_ax = plt.subplot2grid(layout, (2, 0))\n", " pp_ax = plt.subplot2grid(layout, (2, 1))\n", " \n", " y.plot(ax=ts_ax)\n", " ts_ax.set_title('Time Series Analysis Plots: %s' % name, fontsize=20)\n", " smt.graphics.plot_acf(y, lags=lags, ax=acf_ax, alpha=0.5)\n", " smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax, alpha=0.5)\n", " sm.qqplot(y, line='s', ax=qq_ax)\n", " qq_ax.set_title('QQ Plot') \n", " scs.probplot(y, sparams=(y.mean(), y.std()), plot=pp_ax)\n", "\n", " plt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# run the ARMA model for the asset returns\n", "# and save the ARMA output as dictionaries\n", "\"\"\"\n", "aic_dict = {}\n", "order_dict = {}\n", "mdl_dict = {}\n", "for symbol in asset_names:\n", "    aic_dict[symbol], order_dict[symbol], mdl_dict[symbol] = arima_model(returns, symbol)\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To save time, we mention the results of the grid search for ARMA:\n", "\n", "- ABX: 
ARMA(3,2)\n", "- WMT: ARMA(0,2)\n", "- CAT: ARMA(1,0)\n", "- BP: ARMA(5,4)\n", "- F: ARMA(5,5)\n", "- GE: ARIMA(6,1,7)\n", "\n", "The grid search was based on the Akaike Information Criterion. The search range was restricted in order to obtain parsimonious models, and $d=0$ was used because the return time-series were shown above to be stationary (with GE as the cautious exception)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "order_dict = {}\n", "mdl_dict = {}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "order_dict['ABX'] = (3, 0, 2)\n", "order_dict['WMT'] = (0, 0, 2)\n", "order_dict['CAT'] = (1, 0, 0)\n", "order_dict['BP'] = (5, 0, 4)\n", "order_dict['F'] = (5, 0, 5)\n", "order_dict['GE'] = (6, 1, 7)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mdl_dict['ABX'] = smt.ARIMA(returns.ABX_returns, order=(3,0,2)).fit(method='mle', trend='nc')\n", "mdl_dict['WMT'] = smt.ARIMA(returns.WMT_returns, order=(0,0,2)).fit(method='mle', trend='nc')\n", "mdl_dict['CAT'] = smt.ARIMA(returns.CAT_returns, order=(1,0,0)).fit(method='mle', trend='nc')\n", "mdl_dict['BP'] = smt.ARIMA(returns.BP_returns, order=(5,0,4)).fit(method='mle', trend='nc')\n", "mdl_dict['F'] = smt.ARIMA(returns.F_returns, order=(5,0,5)).fit(method='mle', trend='nc')\n", "mdl_dict['GE'] = smt.ARIMA(returns.GE_returns, order=(6,1,7)).fit(method='mle', trend='nc')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#for symbol in asset_names:\n", "# print (mdl_dict[symbol].summary())\n", "# tsplot(mdl_dict[symbol].resid, lags=30)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print (mdl_dict['ABX'].summary())\n", "tsplot(mdl_dict['ABX'].resid, lags=30, name='ABX')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print (mdl_dict['WMT'].summary())\n", "tsplot(mdl_dict['WMT'].resid, lags=30, name='WMT')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print (mdl_dict['CAT'].summary())\n", "tsplot(mdl_dict['CAT'].resid, lags=30, name='CAT')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print (mdl_dict['BP'].summary())\n", "tsplot(mdl_dict['BP'].resid, lags=30, name='BP')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print (mdl_dict['F'].summary())\n", "tsplot(mdl_dict['F'].resid, lags=30, name='F')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print (mdl_dict['GE'].summary())\n", "tsplot(mdl_dict['GE'].resid, lags=30, name='GE')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From these figures one can see that there is almost no autocorrelation left in the residuals. This is checked formally with the Ljung-Box test." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Ljung-Box test**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ljung_box = {}\n", "for symbol in asset_names:\n", " # [1] holds the p-values for lags 1..15\n", " ljung_box[symbol] = acorr_ljungbox(mdl_dict[symbol].resid, lags=15)[1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ljung_box" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We cannot reject the null hypothesis that the residuals are white noise at the 95% confidence level."
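 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For easier reading, the p-values can be arranged into a table (a purely presentational sketch; rows correspond to lags 1 through 15):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# one column per asset, one row per lag; all p-values should exceed 0.05\n", "pd.DataFrame(ljung_box, index=range(1, 16))"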
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** ARMA predictions**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot 21 day forecast\n", "n_steps = 21\n", "holdout_signal = {}\n", "for symbol in asset_names:\n", " plt.style.use('bmh')\n", " fig = plt.figure(figsize=(20,10))\n", " ax = plt.gca()\n", "\n", " ts = returns['%s_returns' % symbol].iloc[-255:].copy()\n", " ts.plot(ax=ax, label='%s Returns' % symbol)\n", " pred = mdl_dict[symbol].predict()[-255:]\n", " pred.plot(ax=ax, style='r-', label='In-sample prediction')\n", " \n", " holdout_signal[symbol] = np.sign(pred.values)\n", "\n", " styles = ['b-', '0.2', '0.75', '0.2', '0.75']\n", " f, err95, ci95 = mdl_dict[symbol].forecast(steps=n_steps) # 95% CI\n", " _, err99, ci99 = mdl_dict[symbol].forecast(steps=n_steps, alpha=0.01) # 99% CI\n", " idx = pd.date_range(returns.index[-1], periods=n_steps, freq='D')\n", " fc_95 = pd.DataFrame(np.column_stack([f, ci95]), index=idx, columns=['forecast', 'lower_ci_95', 'upper_ci_95'])\n", " fc_99 = pd.DataFrame(np.column_stack([ci99]), index=idx, columns=['lower_ci_99', 'upper_ci_99'])\n", " fc_all = fc_95.combine_first(fc_99)\n", " fc_all.plot(ax=ax, style=styles)\n", " plt.fill_between(fc_all.index, fc_all.lower_ci_95, fc_all.upper_ci_95, color='gray', alpha=0.7)\n", " plt.fill_between(fc_all.index, fc_all.lower_ci_99, fc_all.upper_ci_99, color='gray', alpha=0.2)\n", " plt.title('{} Day {} Return Forecast\\nARIMA{}'.format(n_steps, symbol, order_dict[symbol]))\n", " plt.legend(loc='best', fontsize=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to collect more data for GE in order to model its returns with ARMA rather than ARIMA." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While predicting returns for the assets, one can also notice that there are volatility clusters meaning that we need to use Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model. Often, GARCH model is applied to ARIMA residuals to tackle this problem, while such an approach is not self-consistent. In order to make the predictions self-consistent, one needs to model return-series with ARMA-GARCH model. However, there is no such a library in Python. In order to get a better result we need to create ARMA-GARCH model ourselves. However, it is not a trivial task and goes beyond the scope of this work. This is one of possible cases for futher improvement of the solution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are not going to develop ARMA-GARCH approach here, but we can predict asset prices by using machine learning technique. Ridge and Lasso algorithms will be used for this purpose." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. 
Data preprocessing, cross-validation, adjustment of model hyperparameters, creation of new features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**This section also covers metrics and model selection.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1 Adjusted close price predictions with Ridge and Lasso " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before making any predictions, let's first introduce several functions which will be useful in our modelling, as well as the grid parameters for Ridge and Lasso.\n", "\n", "Many functions and approaches are taken from:\n", "\n", "https://mlcourse.ai/notebooks/blob/master/jupyter_english/topic09_time_series/topic9_part1_time_series_python.ipynb?flush_cache=true" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# function to split the dataset into train and test\n", "def timeseries_train_test_split(X, y, test_size):\n", " \"\"\"\n", " Perform a train-test split with respect to the time series structure\n", " \"\"\"\n", " \n", " # get the index after which the test set starts\n", " test_index = int(len(X)*(1-test_size))\n", " \n", " X_train = X.iloc[:test_index]\n", " y_train = y.iloc[:test_index]\n", " X_test = X.iloc[test_index:]\n", " y_test = y.iloc[test_index:]\n", " \n", " return X_train, X_test, y_train, y_test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# function to prepare the data.\n", "# the original function is presented in mlcourse.ai in lesson 9 part 1 (time-series);\n", "# here its slightly modified version is used\n", "def prepareData(series, lag_start=1, lag_end=20, test_size=0.2):\n", " \n", " \"\"\"\n", " series: pd.DataFrame\n", " - dataframe with timeseries\n", "\n", " lag_start: int\n", " - initial step back in time to slice target variable \n", " example - lag_start = 1 means that the model \n", " will see yesterday's values to predict today\n", "\n", " lag_end: int\n", " - final step back in time to slice target variable\n", " example - lag_end = 4 means that the model \n", " will see up to 4 days back in time to predict today\n", "\n", " test_size: float\n", " - size of the test dataset after train/test split as percentage of dataset\n", " \n", " \"\"\"\n", " \n", " # copy of the initial dataset\n", " data = pd.DataFrame(series.copy())\n", " data.columns = [\"y\"]\n", " \n", " # lags of series\n", " for i in range(lag_start, lag_end):\n", " data[\"lag_{}\".format(i)] = data.y.shift(i)\n", " \n", " # datetime features\n", " data['year'] = data.index.year\n", " data['month'] = data.index.month\n", " data['day'] = data.index.day\n", " data['weekday'] = data.index.weekday\n", " data['season'] = data['month'].apply(lambda x: '1' if x in [12,1,2] else\\\n", " ('2' if x in [3,4,5] else ('3' if x in [6,7,8] else '4')))\n", " \n", " time_feat = ['year','month','day','weekday','season']\n", " time_feat_df = pd.DataFrame(data[time_feat], index=data.index)\n", " time_feat_df[time_feat] = time_feat_df[time_feat].astype('str')\n", " time_feat_df = pd.get_dummies(time_feat_df)\n", " data = pd.concat((data, time_feat_df), axis=1)\n", " data.drop(columns=time_feat, inplace=True)\n", " \n", " y = data.dropna().y\n", " X = data.dropna().drop(['y'], axis=1)\n", " X_train, X_test, y_train, y_test = timeseries_train_test_split(X, y, test_size=test_size)\n", "\n", " return X_train, X_test, y_train, y_test" ] }
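, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the generated feature set concrete, here is a quick illustrative peek (an added sketch, not required by the pipeline) at what prepareData produces: lag features plus one-hot-encoded calendar features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# inspect the shape and the first few generated columns for one asset\n", "X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = prepareData(adj_close.ABX_Adj_close)\n", "print (X_tr_demo.shape, X_te_demo.shape)\n", "print (list(X_tr_demo.columns[:6]))" ] }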
features\n", "def feature_scaling(series, lag_start=1, lag_end=20, test_size=0.2):\n", " \n", " X_train, X_test, y_train, y_test = prepareData(series, lag_start=lag_start, \\\n", " lag_end=lag_end, test_size=test_size)\n", " scaler = StandardScaler()\n", " X_train_scaled = scaler.fit_transform(X_train)\n", " X_test_scaled = scaler.transform(X_test)\n", " \n", " return X_train_scaled, X_test_scaled, y_train, y_test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "time_split = TimeSeriesSplit(n_splits=3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def grid_search(estimator, X, y, grid_param, scoring='neg_mean_absolute_error', idd=False, cv=time_split,\n", " figsize=(10,5)):\n", " \n", " gsearch = GridSearchCV(estimator = estimator, \n", " param_grid = grid_param, \n", " scoring=scoring,\n", " iid=idd, \n", " cv=cv,\n", " n_jobs=-1,\n", " verbose=True)\n", " gsearch.fit(X, y) \n", " return gsearch" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def mean_absolute_percentage_error(y_true, y_pred): \n", " return np.mean(np.abs((y_true - y_pred) / y_true)) * 100" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plotResults(model, X_test, y_test, figsize=(15, 7), asset_name = ''):\n", " \n", " \n", " prediction = model.predict(X_test)\n", " plt.figure(figsize=figsize)\n", " plt.plot(prediction, \"g\", label=\"%s prediction\" % asset_name, linewidth=2.0)\n", " plt.plot(y_test.values, label=\"actual\", linewidth=2.0)\n", " \n", " \n", " error = mean_absolute_percentage_error(y_test, prediction)\n", " plt.title(\"Mean absolute percentage error {0:.2f}%\".format(error))\n", " plt.legend(loc=\"best\")\n", " plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Hyperparameters for grid search**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "param_ridge = {'alpha': [25,10,4,2,1.0,0.8,0.5,0.3,0.2,0.1,0.05,0.02,0.01]}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "param_lasso = {'alpha': [25,10,4,2,1.0,0.8,0.5,0.3,0.2,0.1,0.05,0.02,0.01]}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check one more time that all the data are in the ascending order" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "adj_close.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5. 
Predictions for hold-out samples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.1 ABX predictions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_ABX_train_scaled, X_ABX_test_scaled, y_ABX_train, y_ABX_test = feature_scaling(adj_close.ABX_Adj_close)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[(el[0].shape, el[1].shape) for el in time_split.split(X_ABX_train_scaled)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.1.1 ABX Ridge" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ridge_ABX = grid_search(estimator = Ridge(),\n", " X=X_ABX_train_scaled, \n", " y=y_ABX_train,\n", " grid_param=param_ridge,\n", " cv=time_split)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotResults(ridge_ABX, X_ABX_test_scaled, y_ABX_test, figsize=(15, 7), asset_name='ABX')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.1.2 ABX Lasso" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lasso_ABX = grid_search(estimator = Lasso(),\n", " X=X_ABX_train_scaled, \n", " y=y_ABX_train,\n", " grid_param=param_lasso,\n", " cv=time_split)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotResults(lasso_ABX, X_ABX_test_scaled, y_ABX_test, figsize=(15, 7), asset_name='ABX')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.2 WMT predictions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_WMT_train_scaled, X_WMT_test_scaled, y_WMT_train, y_WMT_test = feature_scaling(adj_close.WMT_Adj_close)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[(el[0].shape, el[1].shape) for el in time_split.split(X_WMT_train_scaled)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.2.1 WMT Ridge" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ridge_WMT = grid_search(estimator = Ridge(),\n", " X=X_WMT_train_scaled, \n", " y=y_WMT_train,\n", " grid_param=param_ridge, \n", " cv=time_split)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotResults(ridge_WMT, X_WMT_test_scaled, y_WMT_test, figsize=(15, 7), asset_name='WMT')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.2.2 WMT Lasso" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lasso_WMT = grid_search(estimator = Lasso(),\n", " X=X_WMT_train_scaled, \n", " y=y_WMT_train,\n", " grid_param=param_lasso, \n", " cv=time_split)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotResults(lasso_WMT, X_WMT_test_scaled, y_WMT_test, figsize=(15, 7), asset_name='WMT')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.3 CAT predictions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_CAT_train_scaled, X_CAT_test_scaled, y_CAT_train, y_CAT_test = feature_scaling(adj_close.CAT_Adj_close)" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[(el[0].shape, el[1].shape) for el in time_split.split(X_CAT_train_scaled)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.3.1 CAT Ridge" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ridge_CAT = grid_search(estimator = Ridge(),\n", " X=X_CAT_train_scaled, \n", " y=y_CAT_train,\n", " grid_param=param_ridge, \n", " cv=time_split)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotResults(ridge_CAT, X_CAT_test_scaled, y_CAT_test, figsize=(15, 7), asset_name='CAT')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.3.2 CAT Lasso" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lasso_CAT = grid_search(estimator = Lasso(),\n", " X=X_CAT_train_scaled, \n", " y=y_CAT_train,\n", " grid_param=param_lasso,\n", " cv=time_split)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotResults(lasso_CAT, X_CAT_test_scaled, y_CAT_test, figsize=(15, 7), asset_name='CAT')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.4. BP predictions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_BP_train = prepareData(adj_close.BP_Adj_close)\n", "X_BP_train_scaled, X_BP_test_scaled, y_BP_train, y_BP_test = feature_scaling(adj_close.BP_Adj_close)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[(el[0].shape, el[1].shape) for el in time_split.split(X_CAT_train_scaled)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.4.1 BP Ridge" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ridge_BP = grid_search(estimator = Ridge(),\n", " X=X_BP_train_scaled, \n", " y=y_BP_train,\n", " grid_param=param_ridge)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotResults(ridge_BP, X_BP_test_scaled, y_BP_test, figsize=(15, 7), asset_name='BP')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.4.2 BP Lasso" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lasso_BP = grid_search(estimator = Lasso(),\n", " X=X_BP_train_scaled, \n", " y=y_BP_train,\n", " grid_param=param_lasso,\n", " cv=time_split)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotResults(lasso_BP, X_BP_test_scaled, y_BP_test, figsize=(15, 7), asset_name='BP')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.5 F predictions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_F_train = prepareData(adj_close.F_Adj_close)\n", "X_F_train_scaled, X_F_test_scaled, y_F_train, y_F_test = feature_scaling(adj_close.F_Adj_close)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[(el[0].shape, el[1].shape) for el in time_split.split(X_F_train_scaled)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.5.1 F Ridge" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ridge_F = grid_search(estimator = Ridge(),\n", " X=X_F_train_scaled, \n", " y=y_F_train,\n", " grid_param=param_ridge)" ] }, { "cell_type": "code", "execution_count": null, 
"metadata": {}, "outputs": [], "source": [ "plotResults(ridge_F, X_F_test_scaled, y_F_test, figsize=(15, 7), asset_name='F')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.5.2 F Lasso" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lasso_F = grid_search(estimator = Lasso(),\n", " X=X_F_train_scaled, \n", " y=y_F_train,\n", " grid_param=param_lasso,\n", " cv=time_split)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotResults(lasso_F, X_F_test_scaled, y_F_test, figsize=(15, 7), asset_name='F')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.6. GE predictions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_GE_train = prepareData(adj_close.GE_Adj_close)\n", "X_GE_train_scaled, X_GE_test_scaled, y_GE_train, y_GE_test = feature_scaling(adj_close.GE_Adj_close)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.6.1 GE Ridge" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ridge_GE = grid_search(estimator = Ridge(),\n", " X=X_GE_train_scaled, \n", " y=y_GE_train,\n", " grid_param=param_ridge)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotResults(ridge_GE, X_GE_test_scaled, y_GE_test, figsize=(15, 7), asset_name='GE')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.6.2 GE Lasso" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lasso_GE = grid_search(estimator = Lasso(),\n", " X=X_GE_train_scaled, \n", " y=y_GE_train,\n", " grid_param=param_lasso,\n", " cv=time_split)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotResults(lasso_GE, X_GE_test_scaled, y_GE_test, figsize=(15, 7), asset_name='GE')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this project, we have provided a graphical analysis of United Continental Holdings Inc. (UAL), BP p.l.c. (BP), American Water Works Company Inc. (AWK), Ford Motor Company (F), General Electric Company (GE) and Walmart Inc. (WMT) returns as well as adjusted close prices. ARMA model has also been fit to the data individually to model returns as well as Ridge and Lasso ML algorithms have been applied to forecast prices. However, there are numerous ways to improve and strengthen the predictions.\n", "\n", "1. Blending of ML algorithms to predict whether the price goes up or down. \n", "2. ARMA-GARCH simulteneous modelling should drastically improve the result since, as we discussed above, there are volatility clusted which cannot be encompassed by ARMA alone. However, if we simply apply GARCH to ARMA-residuals, the approach will not be self-consistent.\n", "3. Robust feature tuning is required before applying ML algorithms. Open price can be added as an additional feature to predict adjusted close prices.\n", "4. Additing additional features. One of possible ways to generate additional feature is to forecast adjusted close price of NASDAQ or S&P500 index, and use it as an extra input parameter while tuning ML algorithm.\n", "5. Here 6 assets were individually modelled while they can be modelled together in order to improve the quality of predictions and reduce the overall of risk by building a portfolio. 
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }