{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Predicting Stock Prices" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This project demonstrates how machine learning can be applied to one of the most challenging problems in the financial world - \"to predict the unpredictable\" - predicting a stock price.\n", "\n", "In this project, I have used only the techniques we studied in Topic 9 of the course on time series analysis.\n", "\n", "As a use case to show the predictive power of very simple algorithms such as Lasso and Ridge regression, I have downloaded the data for a very well-known stock in India - **\"TATA MOTORS\"**.\n", "\n", "The data is available here:\n", "\n", "https://in.finance.yahoo.com/quote/TATAMOTORS.NS/history?period1=662754600&period2=1544985000&interval=1d&filter=history&frequency=1d\n", "\n", "There are a few important characteristics which I would like to outline here:\n", "\n", "1. We have around 28 years of information (from 02 Jan 1991 till 14 Dec 2018)\n", "\n", "\n", "2. There are two types of prices given in the data:\n", "\n", " a. Closing Prices, which do not take into account corporate actions such as dividend declarations or share splits.\n", " \n", " b. Adjusted Closing Prices, which do account for the effect of dividend payments and stock splits. [**This will be our target variable**]\n", "\n", " \n", "3. 
The data contains other information as well, which may not be useful for our analysis.\n", "\n", "So, let's dive in.\n", "\n", "First we will import all the basic libraries into Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading Libraries and Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from fbprophet import Prophet\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "#setting figure size\n", "from matplotlib.pyplot import rcParams\n", "rcParams['figure.figsize'] = 20,10\n", "\n", "#for normalizing data\n", "from sklearn.preprocessing import MinMaxScaler\n", "scaler = MinMaxScaler(feature_range=(0, 1))\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# loading basic ML algorithms\n", "from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV\n", "from sklearn.metrics import mean_absolute_error, mean_squared_error\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.model_selection import TimeSeriesSplit, cross_val_score\n", "\n", "# a few more powerful algorithms which, as we will see later, don't perform as well as the basic ones\n", "import xgboost\n", "import lightgbm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load the downloaded data\n", "import os\n", "os.chdir('C:\\\\Users\\\\Abhik\\\\mlcourse.ai\\\\mlcourse.ai-master\\\\data')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load the data into pandas dataframe\n", "df = pd.read_csv('TATAMOTORS.NS.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# since there are a few NaN values, we should remove these first\n", 
"df.dropna(axis=0, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Lets check the data once again\n", "df.head(6)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We now need to convert the Dates into Pandas Date format\n", "df['Date'] = pd.to_datetime(df.Date,format='%Y-%m-%d')\n", "df.index = df['Date']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Better to check the data once again\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## EDA and Feature Engineering" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot the Graph for Adjusted Closing Price\n", "\n", "from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot\n", "import plotly\n", "import plotly.graph_objs as go\n", "\n", "init_notebook_mode(connected=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trace1 = go.Scatter(\n", " x=df.Date,\n", " y=df['Adj Close'],\n", " name='Closing Price'\n", ")\n", "data = [trace1]\n", "layout = {'title': 'Adjusted Closing Price'}\n", "fig = go.Figure(data=data, layout=layout)\n", "iplot(fig, show_link=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Shape of the Data\n", "df.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Lets create a new dataset in which we will only store the required inputs.\n", "\n", "#setting index as date values\n", "df['Date'] = pd.to_datetime(df.Date,format='%Y-%m-%d')\n", "df.index = df['Date']\n", "\n", "#sorting\n", "data = df.sort_index(ascending=True, axis=0)\n", "\n", "#creating a separate dataset\n", "new_data = pd.DataFrame(index=range(0,len(df)),columns=['Date', 'Close'])\n", "\n", "for i in 
range(0,len(data)):\n", " new_data['Date'][i] = data['Date'][i]\n", " new_data['Close'][i] = data['Adj Close'][i]\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Lets check the Data once again\n", "new_data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We will create a number of features on the Dates\n", "\n", "new_data['year'] = new_data['Date'].map(lambda x : x.year)\n", "new_data['month'] = new_data['Date'].map(lambda x : x.month)\n", "new_data['day_week'] = new_data['Date'].map(lambda x : x.dayofweek)\n", "new_data['quarter'] = new_data['Date'].map(lambda x : x.quarter)\n", "new_data['week'] = new_data['Date'].map(lambda x : x.week)\n", "new_data['quarter_start'] = new_data['Date'].map(lambda x : x.is_quarter_start)\n", "new_data['quarter_end'] = new_data['Date'].map(lambda x : x.is_quarter_end)\n", "new_data['month_start'] = new_data['Date'].map(lambda x : x.is_month_start)\n", "new_data['month_end'] = new_data['Date'].map(lambda x : x.is_month_end)\n", "new_data['year_start'] = new_data['Date'].map(lambda x : x.is_year_start)\n", "new_data['year_end'] = new_data['Date'].map(lambda x : x.is_year_end)\n", "new_data['week_year'] = new_data['Date'].map(lambda x : x.weekofyear)\n", "new_data['quarter_start'] = new_data['quarter_start'].map(lambda x: 0 if x is False else 1)\n", "new_data['quarter_end'] = new_data['quarter_end'].map(lambda x: 0 if x is False else 1)\n", "new_data['month_start'] = new_data['month_start'].map(lambda x: 0 if x is False else 1)\n", "new_data['month_end'] = new_data['month_end'].map(lambda x: 0 if x is False else 1)\n", "new_data['year_start'] = new_data['year_start'].map(lambda x: 0 if x is False else 1)\n", "new_data['year_end'] = new_data['year_end'].map(lambda x: 0 if x is False else 1)\n", "new_data['day_month'] = new_data['Date'].map(lambda x: x.daysinmonth)\n", "\n", "# Create a feature which could be important - 
Markets are only open Monday to Friday, so flag Mondays and Fridays (the first and last trading days of the week).\n", "mon_fri_list = [0,4]\n", "new_data['mon_fri'] = new_data['day_week'].map(lambda x: 1 if x in mon_fri_list else 0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Re-indexing the data\n", "new_data.index = new_data['Date']\n", "new_data.drop('Date', inplace=True, axis=1)\n", "new_data.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lags are very important features for any time-series prediction, as they capture the autocorrelation between the current value and past observations.\n", "\n", "Here we create lags of 1 to 21 days (`range(1, 22)` excludes its upper bound); the market is open for around 22 days in a month" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i in range(1, 22):\n", "    new_data[\"lag_{}\".format(i)] = new_data.Close.shift(i)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_data.head(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's create dummies for categorical features\n", "\n", "cols = ['year', 'month', 'day_week', 'quarter', 'week', \n", "        'quarter_start', 'quarter_end', 'week_year', 'mon_fri', 'year_start', 'year_end',\n", "        'month_start', 'month_end', 'day_month']\n", "\n", "for i in cols:\n", "    new_data = pd.concat([new_data.drop([i], axis=1), \n", "                          pd.get_dummies(new_data[i], prefix=i)\n", "                         ], axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Dropping NAs (the first rows have no lag values) and re-indexing again\n", "\n", "new_data = new_data.dropna()\n", "new_data = new_data.reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_data.info()" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Target Variable\n", "y = new_data.Close.values\n", "y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Splitting the Data into Train-Test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Creating splitting index\n", "\n", "test_index = int(len(new_data) * (1 - 0.30))\n", "test_index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we dont want to look into immediate future, we are creating a window of 2 days. This means, training data will stop at day x-1 and test data will start at x+1." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# splitting whole dataset on train and test\n", "\n", "X_train = new_data.loc[:test_index-1].drop(['Close'], axis=1)\n", "y_train = new_data.loc[:test_index-1][\"Close\"]\n", "X_test = new_data.loc[test_index+1:].drop([\"Close\"], axis=1)\n", "y_test = new_data.loc[test_index+1:][\"Close\"] " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Lets visualize the train and test data together\n", "plt.figure(figsize=(16,8))\n", "plt.plot(y_train)\n", "plt.plot(y_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Scaling the Data\n", "\n", "from sklearn.preprocessing import StandardScaler\n", "scaler = StandardScaler()\n", "X_train_scaled = scaler.fit_transform(X_train)\n", "X_test_scaled = scaler.transform(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Machine Learning implementations" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First we will use the simplest of them all - Linear Regression\n", "\n", "from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, Lasso, Ridge\n", "lr = LinearRegression()\n", 
"lr.fit(X_train_scaled, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For Cross Validation (CV) on Time Series data, we will use **TimeSeries Split** for CV.\n", "\n", "Let's see Mean Absolute Error for our simplest model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import TimeSeriesSplit\n", "from sklearn.model_selection import cross_val_score\n", "tscv = TimeSeriesSplit(n_splits=5)\n", "cv = cross_val_score(lr, X_train_scaled, y_train, scoring = 'neg_mean_absolute_error', cv=tscv)\n", "mae = cv.mean()*(-1)\n", "mae" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oh Gosh!! Linear Regression failed miserably to predict the pattern. Lets try regularized linear models.\n", "\n", "But before that, we will use the plotting module written in Topic 9 of the course to plot some nice graphs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plotModelResults(model, df_train, df_test, y_train, y_test, plot_intervals=False, plot_anomalies=False, scale=1.96, cv=tscv):\n", " \"\"\"\n", " Plots modelled vs fact values\n", " \n", " model: fitted model \n", " \n", " df_train, df_test: splitted featuresets\n", " \n", " y_train, y_test: targets\n", " \n", " plot_intervals: bool, if True, plot prediction intervals\n", " \n", " scale: float, sets the width of the intervals\n", " \n", " cv: cross validation method, needed for intervals\n", " \n", " \"\"\"\n", " # making predictions for test\n", " prediction = model.predict(df_test)\n", " \n", " plt.figure(figsize=(20, 7))\n", " plt.plot(prediction, \"g\", label=\"prediction\", linewidth=2.0)\n", " plt.plot(y_test.values, label=\"actual\", linewidth=2.0)\n", " \n", " if plot_intervals:\n", " # calculate cv scores\n", " cv = cross_val_score(\n", " model, \n", " df_train, \n", " y_train, \n", " cv=cv, \n", " scoring=\"neg_mean_squared_error\"\n", " )\n", "\n", " # 
calculate cv error deviation\n", "        deviation = np.sqrt(cv.std())\n", " \n", "        # calculate lower and upper intervals\n", "        lower = prediction - (scale * deviation)\n", "        upper = prediction + (scale * deviation)\n", " \n", "        plt.plot(lower, \"r--\", label=\"upper bound / lower bound\", alpha=0.5)\n", "        plt.plot(upper, \"r--\", alpha=0.5)\n", " \n", "        if plot_anomalies:\n", "            # mark actual values that fall outside the interval\n", "            anomalies = np.array([np.nan]*len(y_test))\n", "            anomalies[y_test < lower] = y_test[y_test < lower]\n", "            anomalies[y_test > upper] = y_test[y_test > upper]\n", "            plt.plot(anomalies, \"o\", markersize=10, label = \"Anomalies\")\n", " \n", "    # calculate overall quality on test set (actual values first, predictions second)\n", "    mae = mean_absolute_error(y_test, prediction)\n", "    mape = mean_absolute_percentage_error(y_test, prediction)\n", "    plt.title(\"MAE {}, MAPE {}%\".format(round(mae), round(mape, 2)))\n", "    plt.legend(loc=\"best\")\n", "    plt.grid(True);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another plotting helper, this time for the coefficients" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def getCoefficients(model):\n", "    \"\"\"Returns sorted coefficient values of the model\"\"\"\n", "    coefs = pd.DataFrame(model.coef_, X_train.columns)\n", "    coefs.columns = [\"coef\"]\n", "    coefs[\"abs\"] = coefs.coef.apply(np.abs)\n", "    return coefs.sort_values(by=\"abs\", ascending=False).drop([\"abs\"], axis=1) \n", " \n", "\n", "def plotCoefficients(model):\n", "    \"\"\"Plots sorted coefficient values of the model\"\"\"\n", "    coefs = getCoefficients(model)\n", " \n", "    plt.figure(figsize=(20, 7))\n", "    coefs.coef.plot(kind='bar')\n", "    plt.grid(True, axis='y')\n", "    plt.hlines(y=0, xmin=0, xmax=len(coefs), linestyles='dashed')\n", "    plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will define a loss metric, *Mean Absolute Percentage Error* (MAPE), which expresses the mean absolute error as a percentage of the actual values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def 
mean_absolute_percentage_error(y_true, y_pred): \n", "    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see the plot for Linear Regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotModelResults(lr, X_train_scaled, X_test_scaled, y_train, y_test, plot_intervals=True, plot_anomalies=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This plot does not tell us much apart from the fact that our model has fared poorly in predicting the pattern.\n", "\n", "Let's see the plot for coefficients" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotCoefficients(lr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see the correlation matrix and heat map for the features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "plt.figure(figsize=(15,10))\n", "sns.heatmap(X_train.corr())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not much information can be derived from this heat map; the only crucial observation is that prices in a few years are completely uncorrelated." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create our next model - Lasso Regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lasso = LassoCV(cv =tscv, max_iter=10000)\n", "lasso.fit(X_train_scaled, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotModelResults(lasso, \n", "                 X_train_scaled, \n", "                 X_test_scaled,\n", "                 y_train, \n", "                 y_test,\n", "                 plot_intervals=True, plot_anomalies=True)\n", "plotCoefficients(lasso)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "coef = getCoefficients(lasso)\n", "# count the coefficients that Lasso shrank exactly to zero\n", "(coef['coef'] == 0).sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oh wow!\n", "\n", "Around 181 features were of no value and were eliminated by the Lasso regression\n", "\n", "Let's see the important features (top 10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "coef.sort_values(by='coef', ascending=False).head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It turns out that **Lag 1** is the most important feature\n", "\n", "Let's see how close our predicted values are to the actual values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import Lasso\n", "lasso = Lasso(max_iter=10000, random_state=17)\n", "\n", "lasso.fit(X_train_scaled, y_train)\n", "y_pred = lasso.predict(X_test_scaled)\n", "\n", "columns = ['Close_actual', 'Close_pred']\n", "df_pred_lasso = pd.DataFrame(columns = columns)\n", "\n", "df_pred_lasso.Close_actual = y_test\n", "df_pred_lasso.Close_pred = y_pred" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(15,8))\n", "plt.plot(df_pred_lasso)\n", "plt.plot(df_pred_lasso.Close_pred, \"b--\", 
label=\"prediction\", linewidth=1.0)\n", "plt.plot(df_pred_lasso.Close_actual, \"r--\", label=\"actual\", linewidth=1)\n", "plt.legend(loc=\"best\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_pred_lasso['diff'] = df_pred_lasso.Close_actual - df_pred_lasso.Close_pred\n", "# express the difference as a percentage, as done for the other models\n", "df_pred_lasso['perc_diff'] = ((df_pred_lasso['diff']) / (df_pred_lasso['Close_pred']))*100\n", "\n", "df_pred_lasso.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Amazing!!\n", "\n", "Lasso Regression has done a very nice job of predicting the adjusted closing price of this stock\n", "\n", "We can also run PCA to eliminate more features and noise from the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "from sklearn.pipeline import make_pipeline\n", "\n", "def plotPCA(pca):\n", "    \"\"\"\n", "    Plots accumulated percentage of explained variance by component\n", " \n", "    pca: fitted PCA object\n", "    \"\"\"\n", "    components = range(1, pca.n_components_ + 1)\n", "    variance = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)\n", "    plt.figure(figsize=(20, 10))\n", "    plt.bar(components, variance)\n", " \n", "    # additionally mark the level of 95% of explained variance \n", "    plt.hlines(y = 95, xmin=0, xmax=len(components), linestyles='dashed', colors='red')\n", " \n", "    plt.xlabel('PCA components')\n", "    plt.ylabel('variance')\n", "    plt.xticks(components)\n", "    plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create PCA object: pca\n", "pca = PCA()\n", "\n", "\n", "# Train PCA on scaled data\n", "pca = pca.fit(X_train_scaled)\n", "\n", "# plot explained variance\n", "plotPCA(pca)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca_comp = PCA(0.95).fit(X_train_scaled)\n", "print('We need %d components to 
explain 95%% of variance' \n", "      % pca_comp.n_components_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PCA needs only 73 components to explain 95% of the variance. \n", "\n", "Let's fit and transform the train and test data with these components" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use the fitted number of components (n_components_), not the 0.95 constructor argument\n", "pca = PCA(n_components=pca_comp.n_components_).fit(X_train_scaled)\n", "\n", "pca_features_train = pca.transform(X_train_scaled)\n", "pca_features_test = pca.transform(X_test_scaled)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's run the Linear Regression model once again to see if there are any improvements since last time" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lr.fit(pca_features_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotModelResults(lr, pca_features_train, pca_features_test, y_train, y_test, plot_intervals=True, plot_anomalies=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Super!\n", "\n", "PCA has resulted in an improvement in the linear regression model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's run another model - Ridge Regression - and see how it fares" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import Ridge\n", "ridge = Ridge(max_iter=10000, random_state=17)\n", "\n", "ridge.fit(X_train_scaled, y_train)\n", "y_pred = ridge.predict(X_test_scaled)\n", "\n", "columns = ['Close_actual', 'Close_pred']\n", "df_pred_ridge = pd.DataFrame(columns = columns)\n", "\n", "df_pred_ridge.Close_actual = y_test\n", "df_pred_ridge.Close_pred = y_pred" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(15,8))\n", "plt.plot(df_pred_ridge)\n", "plt.plot(df_pred_ridge.Close_pred, \"b--\", 
label=\"prediction\", linewidth=1.0)\n", "plt.plot(df_pred_ridge.Close_actual, \"r--\", label=\"actual\", linewidth=1.0)\n", "plt.legend(loc=\"best\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_pred_ridge['diff'] = df_pred_ridge.Close_actual - df_pred_ridge.Close_pred\n", "df_pred_ridge['perc_diff'] = ((df_pred_ridge['diff']) / (df_pred_ridge['Close_pred']))*100\n", "df_pred_ridge.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not bad at all!\n", "\n", "Lasso and Ridge turned out to be quite close and are already superstars\n", "\n", "Let's see the plots for Ridge" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import RidgeCV\n", "ridge = RidgeCV(cv=tscv)\n", "ridge.fit(X_train_scaled, y_train)\n", "\n", "plotModelResults(ridge, X_train_scaled, X_test_scaled, y_train, y_test, plot_intervals=True, plot_anomalies=True)\n", "plotCoefficients(ridge)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's see how Lasso and Ridge perform on the PCA-transformed data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import Lasso\n", "# use a separate name so that the Lasso class itself is not shadowed\n", "lasso_pca = Lasso(max_iter=10000)\n", "lasso_pca.fit(pca_features_train, y_train)\n", "\n", "from sklearn.linear_model import Ridge\n", "ridge = Ridge(max_iter=10000, random_state=17)\n", "ridge.fit(pca_features_train, y_train)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotModelResults(lasso_pca, pca_features_train, pca_features_test, y_train, y_test, plot_intervals=True, plot_anomalies=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotModelResults(ridge, pca_features_train, pca_features_test, y_train, y_test, plot_intervals=True, plot_anomalies=True)" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "### FB Prophet\n", "\n", "Now lets use FB-Prophet to predict the pattern" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fbprophet import Prophet\n", "import logging\n", "logging.getLogger().setLevel(logging.ERROR)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_new = df['Close']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_new" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Lets see the monthly pattern over the years\n", "monthly_df = df_new.resample('M').apply(sum)\n", "plt.figure(figsize=(15,10))\n", "plt.plot(monthly_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating Dataset for FB-Prophet" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_n = df_new.reset_index()\n", "df_n.columns = ['ds', 'y']\n", "df_n = df_n.reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prediction_size = 30 # prediction for one-month\n", "train_df = df_n[:-prediction_size]\n", "train_df.tail(n=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fitting the model and Creating Future Dataframes including the history" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m = Prophet()\n", "m.fit(train_df);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "future = m.make_future_dataframe(periods=prediction_size)\n", "future.tail(n=3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forecast = m.predict(future)\n", "forecast.tail(n=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating Plots to see the patterns predicted by FB-Prophet" ] }, { "cell_type": 
"code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m.plot(forecast)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m.plot_components(forecast)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above plots are self explanatory but few are important observations:\n", "\n", "1. On Wednesdays price of this stock on an average goes up\n", "2. August / September, prices are on an average goes down\n", "3. After financial crisis of 2008, stock has picked up well and reached to its peak in around 2013" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets combine Historic and Forecast data together" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def make_comparison_dataframe(historical, forecast):\n", " \"\"\"Join the history with the forecast.\n", " \n", " The resulting dataset will contain columns 'yhat', 'yhat_lower', 'yhat_upper' and 'y'.\n", " \"\"\"\n", " return forecast.set_index('ds')[['yhat', 'yhat_lower', 'yhat_upper']].join(historical.set_index('ds'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cmp_df = make_comparison_dataframe(df_n, forecast)\n", "cmp_df.tail(n=3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prediction_size=10 # 10 days prediction\n", "cmp_df_pred = cmp_df[-prediction_size:]\n", "cmp_df_pred['MAE'] = cmp_df_pred['y'] - cmp_df_pred['yhat']\n", "cmp_df_pred['MAPE'] = 100* cmp_df_pred['MAE'] / cmp_df_pred['y']\n", "\n", "print('average MAE:', np.mean(np.abs(cmp_df_pred['MAE'])))\n", "print('average MAPE:', np.mean(np.abs(cmp_df_pred['MAPE'])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "FB-Prophet has not done well so far in comparison with Lasso and Ridge.\n", "\n", "Lets normalize the data using Box-Cox transformation and see if these results have improved" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def inverse_boxcox(y, lambda_):\n", " return np.exp(y) if lambda_ == 0 else np.exp(np.log(lambda_ * y + 1) / lambda_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df2 = train_df.copy().set_index('ds')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy import stats\n", "import statsmodels.api as sm\n", "train_df2['y'], lambda_prophet = stats.boxcox(train_df2['y'])\n", "train_df2.reset_index(inplace=True)\n", "train_df2.head(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m2 = Prophet()\n", "m2.fit(train_df2)\n", "future2 = m2.make_future_dataframe(periods=prediction_size)\n", "forecast2 = m2.predict(future2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for column in ['yhat', 'yhat_lower', 'yhat_upper']:\n", " forecast2[column] = inverse_boxcox(forecast2[column], lambda_prophet)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting the new components " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m2.plot_components(forecast2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets create a module for forecast errors" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def calculate_forecast_errors(df, prediction_size):\n", " \"\"\"Calculate MAPE and MAE of the forecast.\n", " \n", " Args:\n", " df: joined dataset with 'y' and 'yhat' columns.\n", " prediction_size: number of days at the end to predict.\n", " \"\"\"\n", " \n", " # Make a copy\n", " df = df.copy()\n", " \n", " # Now we calculate the values of e_i and p_i according to the formulas given in the article above.\n", " df['e'] = df['y'] - df['yhat']\n", " df['p'] = 100 * df['e'] 
/ df['y']\n", " \n", " # Recall that we held out the values of the last `prediction_size` days\n", " # in order to predict them and measure the quality of the model. \n", " \n", " # Now cut out the part of the data which we made our prediction for.\n", " predicted_part = df[-prediction_size:]\n", " \n", " # Define the function that averages absolute error values over the predicted part.\n", " error_mean = lambda error_name: np.mean(np.abs(predicted_part[error_name]))\n", " \n", " # Now we can calculate MAPE and MAE and return the resulting dictionary of errors.\n", " return {'MAPE': error_mean('p'), 'MAE': error_mean('e')}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cmp_df2 = make_comparison_dataframe(df_n, forecast2)\n", "for err_name, err_value in calculate_forecast_errors(cmp_df2, prediction_size).items():\n", " print(err_name, err_value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Box Cox has improved the results but still not up to the levels of Lasso and Ridge" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m2.plot(forecast2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cmp_df2.tail(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "FB Prophet has not fared well in comparison with Lasso and Ridge (see predicted results are very far from actual values).\n", "\n", "Now lets run 2 very powerful algorithms and see if they can beat Lasso and Ridge" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "#sys.path.append('/Users/dmitrys/xgboost/python-package/')\n", "from xgboost import XGBRegressor \n", "\n", "xgb = XGBRegressor()\n", "xgb.fit(X_train_scaled, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotModelResults(xgb, X_train_scaled, X_test_scaled, 
y_train, y_test, plot_intervals=True, plot_anomalies=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lgb = lightgbm.LGBMRegressor()\n", "lgb.fit(X_train_scaled, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotModelResults(lgb, X_train_scaled, X_test_scaled, y_train, y_test, plot_intervals=True, plot_anomalies=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not at all!!\n", "\n", "Tree-based algorithms cannot extrapolate beyond the range of target values seen during training, so they tend to struggle on a trending time series, which is evident from the above results.\n", "\n", "Now we will do some stacking and see if the results of Lasso and Ridge can be improved further.\n", "\n", "Here we will use three regressors:\n", "\n", "1. Elastic Net (base)\n", "2. Ridge (base)\n", "3. Lasso (Meta)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mlxtend.regressor import StackingRegressor\n", "from sklearn.linear_model import ElasticNet\n", "\n", "clf1 = ElasticNet(max_iter=10000)\n", "clf2 = ridge\n", "\n", "\n", "sclf = StackingRegressor(regressors=[clf1, clf2], \n", "                         meta_regressor=lasso)\n", "\n", "sclf.fit(X_train_scaled, y_train)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotModelResults(sclf, X_train_scaled, X_test_scaled, y_train, y_test, plot_intervals=True, plot_anomalies=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_pred = sclf.predict(X_test_scaled)\n", "\n", "columns = ['Close_actual', 'Close_pred']\n", "df_pred_sclf = pd.DataFrame(columns = columns)\n", "\n", "df_pred_sclf.Close_actual = y_test\n", "df_pred_sclf.Close_pred = y_pred\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(15,8))\n", 
"plt.plot(df_pred_sclf)\n", "plt.plot(df_pred_sclf.Close_pred, \"b--\", label=\"prediction\", linewidth=0.5)\n", "plt.plot(df_pred_sclf.Close_actual, \"r--\", label=\"actual\", linewidth=0.5)\n", "plt.legend(loc=\"best\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_pred_sclf['diff'] = df_pred_sclf.Close_actual - df_pred_sclf.Close_pred\n", "df_pred_sclf['perc_diff'] = ((df_pred_sclf['diff']) / (df_pred_sclf['Close_pred']))*100\n", "df_pred_sclf.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Results have slightly improved. It turns out that regularized Lasso and Ridge regressions gave the best results. MAPE is around 1.76% and MAE is around INR 6. This is remarkable and it can be further improved through the ways of Hyperparameter tuning or through some advanced techniques such as LSTM." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }