Forecasting stock returns with ARIMA, and prices with Ridge and Lasso
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Feature and data explanation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 Description of the task"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the project, we are analyzing five assets: \n",
" \n",
" - Barrick Gold Corporation (ABX) Basic Industries\n",
" - Walmart Inc. (WMT) Consumer Services\n",
" - Caterpillar Inc (CAT) Capital Goods\n",
" - BP p.l.c. (BP) Energy\n",
" - Ford Motor Company (F) Capital Goods \n",
" - General Electric Company (GE) Energy \n",
" \n",
"based on Yahoo Finance (https://finance.yahoo.com) data. In order to reproduce the results of the project, the data for the assets should be either collected manually from Yahoo Finance for the daily time-period from the 10th of December 2013 to 7th of December 2018 or be downloaded here https://github.com/dmironov1993/Data\n",
"\n",
"In this project, our goal is to predict stock returns with Autoregressive Integrated Moving Average (ARIMA) model and stock prices with Ridge and Lasso. ARIMA model is defined by three parameters $(p,d,q)$, where $p$ - the order of the autoregressive model, $d$ - the degree of differencing and $q$ - the order of the moving-average model. The choice of these parameters is done by brute-forcing (grid search) and choosing the model with the lowest Akaike Information Criterion. If time-series is stationary, we are left with p and q parameters, while $d=0$. Such a model is usually called ARMA, and we use this abbreviation hereinafter. However, since Python does not have several important libraries as those in R, for instance the libraries for ARMA-GARCH simultaneous analysis, stock prices will be predicted as well by using machine learning (ML) algorithms such as Ridge and Lasso. At the appropriate stage of our work, we will generate additional features. ML algorithms will be tuned by using grid search of hyperparameters. The performance of ML algorithms will be tested on a houldout sample. "
]
},
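{
"cell_type": "markdown",
"metadata": {},
"source": [
"The order-selection procedure described above can be sketched as follows. This is an illustrative snippet on a synthetic placeholder series (`series` is a hypothetical stand-in, not one of the project's assets): we loop over candidate $(p,d,q)$ orders, fit each model, and keep the fit with the lowest AIC."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative AIC grid search over ARIMA orders (hypothetical placeholder series)\n",
"import itertools\n",
"import numpy as np\n",
"import statsmodels.api as sm\n",
"\n",
"np.random.seed(0)\n",
"series = np.cumsum(np.random.randn(250))  # placeholder for a price/return series\n",
"\n",
"best_aic, best_order = np.inf, None\n",
"for p, d, q in itertools.product(range(3), range(2), range(3)):\n",
"    try:\n",
"        fit = sm.tsa.ARIMA(series, order=(p, d, q)).fit()\n",
"    except Exception:\n",
"        continue  # some orders may fail to converge; skip them\n",
"    if fit.aic < best_aic:\n",
"        best_aic, best_order = fit.aic, (p, d, q)\n",
"\n",
"print(best_order, best_aic)"
]
},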
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2. Libraries and Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we import libraries and load the data in a CSV form which we are going to analysis and working with."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Jupiter notebook setup and Importing libraries \n",
"# By default, all figures are shown in 'png'. If the latter is changed to 'svg', higher quality is guaranteed\n",
"%config InlineBackend.figure_format = 'png'\n",
"import warnings\n",
"warnings.simplefilter('ignore')\n",
"\n",
"# Data manipulations\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Visualization\n",
"import seaborn as sns\n",
"from matplotlib import pyplot as plt\n",
"%matplotlib inline\n",
"from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot\n",
"import plotly\n",
"import plotly.graph_objs as go\n",
"from plotly import tools\n",
"import plotly.plotly as py\n",
"init_notebook_mode(connected=True)\n",
"\n",
"# ARIMA (ARMA) modelling\n",
"import statsmodels.api as sm\n",
"import statsmodels.tsa.api as smt\n",
"import statsmodels.tsa.stattools as ts\n",
"\n",
"# Statistics\n",
"import scipy.stats as scs\n",
"from scipy.stats import skew\n",
"from scipy.stats import kurtosis\n",
"from statsmodels.tsa.stattools import kpss\n",
"\n",
"# Ljung-box test (to check whether residuals are white noise)\n",
"from statsmodels.stats.diagnostic import acorr_ljungbox\n",
"\n",
"\n",
"# Metrics for ML (ARIMA/ARMA has embedded AIC criterion)\n",
"from sklearn.metrics import mean_absolute_error\n",
"\n",
"# Hyperparameter tuning and validation\n",
"from sklearn.model_selection import GridSearchCV, TimeSeriesSplit \n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Machine learning algorithms\n",
"#from sklearn.linear_model import LassoCV, RidgeCV\n",
"from sklearn.linear_model import Ridge, Lasso"
]
},
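{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the tuning workflow these imports support (synthetic data and a hypothetical parameter grid, not the project's final setup): scale the features, then grid-search the Ridge regularization strength with `TimeSeriesSplit`, so validation folds never precede their training folds in time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: time-series-aware hyperparameter search for Ridge (synthetic data)\n",
"from sklearn.pipeline import make_pipeline\n",
"\n",
"rng = np.random.RandomState(0)\n",
"X = rng.randn(300, 5)                                  # placeholder feature matrix\n",
"y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + 0.1 * rng.randn(300)\n",
"\n",
"pipe = make_pipeline(StandardScaler(), Ridge())\n",
"grid = GridSearchCV(pipe,\n",
"                    param_grid={'ridge__alpha': [0.01, 0.1, 1.0, 10.0]},\n",
"                    cv=TimeSeriesSplit(n_splits=5),     # forward-chaining folds\n",
"                    scoring='neg_mean_absolute_error')\n",
"grid.fit(X, y)\n",
"print(grid.best_params_)"
]
},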
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Symbols of assets\n",
"asset_names = ['ABX','WMT','CAT','BP','F','GE']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# function for loading csv\n",
"def read_csv(symbols):\n",
" \n",
" \"\"\" \n",
" reading csv file\n",
" \n",
" Input: list \n",
" - symbols of traded stocks \n",
" \n",
" Output: tuple\n",
" - dataframes of traded stocks\n",
" \n",
" \"\"\"\n",
" \n",
" ListofAssets_df = []\n",
" for asset in symbols:\n",
" ListofAssets_df.append(pd.read_csv('%s.csv' % asset, sep=',')\\\n",
" .rename(columns={'Adj Close': '%s_Adj_close' % asset})\\\n",
" .sort_values(by='Date', ascending=False))\n",
" \n",
" return tuple(ListofAssets_df)"
]
},
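{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to work with the tuple that `read_csv` returns (a hypothetical helper for illustration, not used below): merge the per-asset adjusted-close columns into a single frame on `Date`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical helper: combine the '<SYMBOL>_Adj_close' columns into one frame\n",
"from functools import reduce\n",
"\n",
"def combine_adj_close(dfs, symbols):\n",
"    cols = [df[['Date', '%s_Adj_close' % s]] for df, s in zip(dfs, symbols)]\n",
"    # inner-merge on Date so only dates present for every asset are kept\n",
"    return reduce(lambda a, b: a.merge(b, on='Date', how='inner'), cols)"
]
},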
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# number of dataframes within read_csv\n",
"print (len(read_csv(asset_names)))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# number of rows and columns within each df\n",
"for df in read_csv(asset_names):\n",
" print (df.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# view of one of dataframes\n",
"read_csv(asset_names)[0].head(2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# information about structure of data\n",
"read_csv(asset_names)[0].info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that there are no missing values in the datasets of our interest."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the obtained historical prices, we have the following information for each of the asset:\n",
"\n",
" - **Date**: Date\n",
" - **Open**: Open price within a date\n",
" - **High**: The highest price within a date\n",
" - **Low**: The lowest price within a date\n",
" - **Close**: Close price within a date\n",
" - **`NAME`_Adj_close**: Adjusted close price in the end of a date\n",
" - **Volume**: Trading volume\n",
" \n",
"In addition to this information, we are interested in daily returns of the assets. Here the latter will be calculated by using\n",
"\n",
" ###