--- title: "Econ 425T Homework 6" subtitle: "Due Mar 24, 2023 @ 11:59PM" author: "YOUR NAME and UID" date: "`r format(Sys.time(), '%d %B, %Y')`" format: html: theme: cosmo embed-resources: true number-sections: true toc: true toc-depth: 4 toc-location: left code-fold: false engine: knitr knitr: opts_chunk: fig.align: 'center' # fig.width: 6 # fig.height: 4 message: FALSE cache: false --- Load Python libraries. ```{python} #| code-fold: true # Load the pandas library import pandas as pd # Load numpy for array manipulation import numpy as np # Load seaborn plotting library import seaborn as sns import matplotlib.pyplot as plt # For read file from url import io import requests # Set font sizes in plots sns.set(font_scale = 1.) # Display all columns pd.set_option('display.max_columns', None) ``` ## New York Stock Exchange (NYSE) data (1962-1986) (10 pts) ::: {#fig-nyse}

![](ISL_fig_10_14.pdf){width=600px height=600px}

Historical trading statistics from the New York Stock Exchange. Daily values of the normalized log trading volume, DJIA return, and log volatility are shown for a 24-year period from 1962-1986. We wish to predict trading volume on any day, given the history on all earlier days. To the left of the red bar (January 2, 1980) is training data, and to the right test data.

:::

The [`NYSE.csv`](https://raw.githubusercontent.com/ucla-econ-425t/2023winter/master/slides/data/NYSE.csv) file contains three daily time series from the New York Stock Exchange (NYSE) for the period Dec 3, 1962-Dec 31, 1986 (6,051 trading days):

- `Log trading volume` ($v_t$): This is the fraction of all outstanding shares that are traded on that day, relative to a 100-day moving average of past turnover, on the log scale.

- `Dow Jones return` ($r_t$): This is the difference between the log of the Dow Jones Industrial Index on consecutive trading days.

- `Log volatility` ($z_t$): This is based on the absolute values of daily price movements.

```{python}
# Read in NYSE data from url
url = "https://raw.githubusercontent.com/ucla-econ-425t/2023winter/master/slides/data/NYSE.csv"
s = requests.get(url).content.decode('utf-8')
NYSE = pd.read_csv(io.StringIO(s), index_col = 0)
NYSE
```

The **autocorrelation** at lag $\ell$ is the correlation of all pairs $(v_t, v_{t-\ell})$ that are $\ell$ trading days apart. These sizable correlations give us confidence that past values will be helpful in predicting the future.

```{python}
#| code-fold: true
#| label: fig-nyse-autocor
#| fig-cap: "The autocorrelation function for log volume. We see that nearby values are fairly strongly correlated, with correlations above 0.2 as far as 20 days apart."
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plt.figure()
plot_acf(NYSE['log_volume'], lags = 20)
plt.show()
```

Do a similar plot for (1) the correlation between $v_t$ and the lag-$\ell$ `Dow Jones return` $r_{t-\ell}$, and (2) the correlation between $v_t$ and the lag-$\ell$ `Log volatility` $z_{t-\ell}$.

## Project goal

Our goal is to forecast daily `Log trading volume`, using the various machine learning algorithms we learned in this class.

The data set is already split into train (before Jan 1st, 1980, $n_{\text{train}} = 4,281$) and test (after Jan 1st, 1980, $n_{\text{test}} = 1,770$) sets.

In general, we would tune the lag $L$ to achieve the best forecasting performance. In this project, we fix $L = 5$. That is, we always use the previous five trading days' data to forecast today's `log trading volume`.

Pay attention to the nuances of splitting time series data for cross-validation. Study and use [`TimeSeriesSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) in Scikit-Learn. Make sure to use the same splits when tuning the different machine learning algorithms.

Use the $R^2$ between forecast and actual values as the cross-validation and test evaluation criterion.

## Baseline method (20 pts)

We use the straw man (using yesterday's value of `log trading volume` to predict today's) as the baseline method. Evaluate the $R^2$ of this method on the test data.
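Below is a minimal sketch of this baseline, assuming the CSV carries a boolean `train` column flagging the pre-1980 rows (the split is described above; adapt if the indicator is stored differently):

```{python}
#| eval: false
from sklearn.metrics import r2_score

# Straw-man forecast: yesterday's log_volume predicts today's
v = NYSE['log_volume']
pred = v.shift(1)  # v_{t-1}

# Assumption: a boolean `train` column marks the pre-1980 (training) rows
test = ~NYSE['train']
r2_score(v[test], pred[test])
```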
## Autoregression (AR) forecaster (30 pts)

- Let
$$
y = \begin{pmatrix} v_{L+1} \\ v_{L+2} \\ v_{L+3} \\ \vdots \\ v_T \end{pmatrix}, \quad
M = \begin{pmatrix}
1 & v_L & v_{L-1} & \cdots & v_1 \\
1 & v_{L+1} & v_L & \cdots & v_2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & v_{T-1} & v_{T-2} & \cdots & v_{T-L}
\end{pmatrix}.
$$

- Fit an ordinary least squares (OLS) regression of $y$ on $M$, giving
$$
\hat v_t = \hat \beta_0 + \hat \beta_1 v_{t-1} + \hat \beta_2 v_{t-2} + \cdots + \hat \beta_L v_{t-L},
$$
known as an **order-$L$ autoregression** model, or **AR($L$)**.

- Tune AR(5) with elastic net (lasso + ridge) regularization, using all 3 features, on the training data, and evaluate the test performance. (A sketch of this workflow appears at the end of this document.)

- Hint: [Workflow: Lasso](https://ucla-econ-425t.github.io/2023winter/slides/06-modelselection/workflow_lasso.html) is a good starting point.

## Autoregression by MLP (30 pts)

- Use the same features as in AR($L$). Tune the MLP and evaluate the test performance.

- Hint: [Workflow (MLP with scikit-learn)](https://ucla-econ-425t.github.io/2023winter/slides/10-nn/heart_mlp/workflow_mlp_sklearn.html) or [Workflow (MLP with Keras Tuner)](https://ucla-econ-425t.github.io/2023winter/slides/10-nn/heart_mlp/workflow_mlp_kerastune.html) is a good starting point.

## LSTM forecaster (30 pts)

- We extract many short mini-series of input sequences $X = \{X_1, X_2, \ldots, X_L\}$ with a predefined lag $L$:
$$
X_1 = \begin{pmatrix} v_{t-L} \\ r_{t-L} \\ z_{t-L} \end{pmatrix}, \quad
X_2 = \begin{pmatrix} v_{t-L+1} \\ r_{t-L+1} \\ z_{t-L+1} \end{pmatrix}, \quad \cdots, \quad
X_L = \begin{pmatrix} v_{t-1} \\ r_{t-1} \\ z_{t-1} \end{pmatrix},
$$
and
$$
Y = v_t.
$$

- Tune the LSTM and evaluate the test performance. (A sketch of the input-tensor construction appears at the end of this document.)

## Random forest forecaster (30 pts)

- Use the same features as in AR($L$) for the random forest. Tune the random forest and evaluate the test performance.

- Hint: [Workflow: Random Forest for Prediction](https://ucla-econ-425t.github.io/2023winter/slides/08-tree/workflow_rf_reg.html) is a good starting point.

## Boosting forecaster (30 pts)

- Use the same features as in AR($L$) for boosting. Tune the boosting algorithm and evaluate the test performance.

- Adventurous students should try to learn and use [XGBoost](https://github.com/dmlc/xgboost) instead of Scikit-Learn.

## Summary (30 pts)

Your score for this question is largely determined by your final test performance.

Summarize the performance of the different machine learning forecasters in the following format.

| Method        | CV $R^2$ | Test $R^2$ |
|:-------------:|:--------:|:----------:|
| Baseline      |          |            |
| AR(5)         |          |            |
| AR(5) MLP     |          |            |
| LSTM          |          |            |
| Random Forest |          |            |
| Boosting      |          |            |
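For reference, here is a minimal sketch of the lag-feature construction and time-series-aware tuning for the AR(5) elastic net above. The column names `DJ_return` and `log_volatility` (following the ISLR2 version of this data), the boolean `train` indicator, the number of CV splits, and the hyperparameter grid are all illustrative assumptions, not prescribed values.

```{python}
#| eval: false
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

L = 5  # lag fixed by the project description

# Lagged features of all three series (assumed column names, per ISLR2)
cols = ['log_volume', 'DJ_return', 'log_volatility']
X = pd.concat(
    {f'{c}_lag{l}': NYSE[c].shift(l) for c in cols for l in range(1, L + 1)},
    axis = 1
).dropna()
y = NYSE['log_volume'].loc[X.index]

# Assumption: a boolean `train` column marks the pre-1980 rows
is_train = NYSE['train'].loc[X.index]
X_train, y_train = X[is_train], y[is_train]
X_test, y_test = X[~is_train], y[~is_train]

# Time-series-aware CV splits; reuse this one object for every forecaster
tscv = TimeSeriesSplit(n_splits = 5)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('enet', ElasticNet(max_iter = 10000))])
param_grid = {'enet__alpha': np.logspace(-4, 0, 20),
              'enet__l1_ratio': [0.1, 0.5, 0.9]}
search = GridSearchCV(pipe, param_grid, cv = tscv, scoring = 'r2')
search.fit(X_train, y_train)

print('CV R^2:  ', search.best_score_)
print('Test R^2:', search.score(X_test, y_test))
```

Swapping the `ElasticNet` estimator for an `MLPRegressor`, `RandomForestRegressor`, or a gradient-boosting model reuses the same features for the other tabular forecasters, and reusing the single `tscv` object in every `GridSearchCV` call ensures all methods are tuned on identical splits, as required above.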
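Similarly, a minimal sketch of the LSTM input-tensor construction described above, together with an illustrative (not prescribed) Keras architecture; it assumes the same column names as the previous sketch and that TensorFlow/Keras is available in the environment:

```{python}
#| eval: false
from tensorflow import keras

L = 5
feats = NYSE[['log_volume', 'DJ_return', 'log_volatility']].to_numpy()
v = NYSE['log_volume'].to_numpy()
T = len(NYSE)

# Row for target v_t stacks the L preceding days (X_1, ..., X_L)
X_lstm = np.stack([feats[t - L:t] for t in range(L, T)])  # shape (T - L, L, 3)
Y_lstm = v[L:]                                            # targets v_t

# Illustrative architecture: one LSTM layer feeding a linear output
model = keras.Sequential([
    keras.Input(shape = (L, 3)),
    keras.layers.LSTM(12),   # number of hidden units is a tuning parameter
    keras.layers.Dense(1),
])
model.compile(optimizer = 'adam', loss = 'mse')
```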