{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Predict Temporal Series" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import addutils.toc ; addutils.toc.js(ipy_notebook=True)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import scipy.io\n", "import numpy as np\n", "import pandas as pd\n", "from addutils import css_notebook\n", "css_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1 Predict Univariate Temporal Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the simpler case in temporal data prediction: we have just one measure to predict. The easier task is to made a one-step prediction. Here we test different ensamble methods on a multi-step prediction.\n", "\n", "Here for simplicity we consider data sampled with a constant sample time. in this way, when we speak about time, it can be either time or the number of the sample.\n", "At every time-step the system will have an output **y(t)** and our goal is to predict the future output nk steps forward **y(t+nk)**.\n", "\n", "To fit the regressor we built a vector of **nk+1** past data samples: **[y(t-nk), ... , y(t-2), y(t-1), y(t)]**. Past measured values that build the regressor vector are referred as **Standard Regressor Components**. Is instead we define the values in the form of arbitrary user-defined functions of delayed input variables we call it **Custom Regressor Components**. In this case we will use just Standard Regressor Components.\n", "\n", "Usually predicting something many timesteps in advance is very hard: here we want to test the predictive capabilitied of many different ensemble regressors by predicting 9 timesteps in the future.\n", "\n", "Hint: try to change nb (the regressor vector length) to see how this affect the prediction" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ " \n", "\n", "\n", " \n", "\n", "
\n", " \n", " BokehJS successfully loaded.\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import bokeh.plotting as bk\n", "bk.output_notebook()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import warnings\n", "from time import time\n", "warnings.filterwarnings(\"ignore\", category=DeprecationWarning)\n", "pd.set_option('display.max_columns', None,\n", " 'display.float_format', '{:,.3f}'.format)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "n_samples = 700\n", "error = 0.02\n", "df = pd.DataFrame(np.zeros((n_samples,3)), columns=['t', 'y(t)', 'y_true'])\n", "df['t'] = np.linspace(0, 6, n_samples)\n", "df['y_true'] = np.sin(2*df['t']) + 0.88*np.cos(11 * df['t']) + \\\n", " 0.3*np.sin(29*df['t']) + 0.1*np.sin(47*df['t'])\n", "df['y(t)'] = df['y_true'] + np.random.normal(0, error, n_samples)\n", "\n", "nb = 40 # Number of past input terms used to predict the current output.\n", "nk = 9 # Samples delayed from input to output (prediction time steps)\n", "\n", "for i in range(1,nb+1):\n", " col_name = 'y(t-%i)' %(i,)\n", " df.insert(1, col_name, df['y(t)'].shift(i))\n", "col_to_predict = 'y(t+%i)' %(nk,)\n", "df[col_to_predict] = df['y(t)'].shift(-nk)\n", "\n", "first_row, last_row = nb, df.shape[0]-(nk+1)\n", "X = df[df.columns[1:nb+2]].ix[first_row:last_row]\n", "y = df[col_to_predict].ix[first_row:last_row]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RFR Fitted in 31.633[s]\n", "RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=15,\n", " max_features='auto', max_leaf_nodes=None, min_samples_leaf=2,\n", " min_samples_split=4, min_weight_fraction_leaf=0.0,\n", " n_estimators=50, n_jobs=1, oob_score=False, random_state=None,\n", " verbose=0, warm_start=False)\n", "\n", "\n", "XTR Fitted in 26.685[s]\n", "ExtraTreesRegressor(bootstrap=True, criterion='mse', max_depth=10,\n", " max_features=4, max_leaf_nodes=None, min_samples_leaf=2,\n", " min_samples_split=4, min_weight_fraction_leaf=0.0,\n", " n_estimators=25, n_jobs=1, oob_score=False, random_state=None,\n", " verbose=0, warm_start=False)\n", "\n", "\n", "ADB Fitted in 33.769[s]\n", "AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',\n", " n_estimators=150, random_state=None)\n", "\n", "GTB Fitted in 27.986[s]\n", "GradientBoostingRegressor(alpha=0.9, init=None, learning_rate=0.2, loss='ls',\n", " max_depth=2, max_features=None, max_leaf_nodes=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=50,\n", " random_state=None, subsample=0.5, verbose=0, warm_start=False)\n" ] } ], "source": [ "from sklearn import datasets, cross_validation, ensemble, metrics, grid_search\n", "param_rfr = {'n_estimators': [25, 50],\n", " 'max_depth': [15],\n", " 'min_samples_split' : [1, 2, 4],\n", " 'min_samples_leaf' : [2],\n", " 'bootstrap' : [True]}\n", "\n", "param_xtr = {'n_estimators':[25, 50],\n", " 'max_features':[2, 3, 4],\n", " 'max_depth':[10, 15, 20],\n", " 'min_samples_split':[1, 2, 4],\n", " 'min_samples_leaf':[2],\n", " 'bootstrap':[True]}\n", "\n", "param_adb = {'n_estimators':[75, 150],\n", " 'learning_rate':[0.5, 1.0]}\n", "\n", "param_gtb = {'n_estimators':[50, 100],\n", " 'learning_rate':[0.2, 0.5],\n", " 'subsample':[0.5, 1.0],\n", " 'max_depth':[1, 2]}\n", "\n", "# Fit the Random Forest Regressor\n", "t0 = time()\n", "rfr = grid_search.GridSearchCV(ensemble.RandomForestRegressor(), \n", " param_rfr, cv=15, verbose=0, n_jobs=-1).fit(X, y)\n", "print('\\nRFR Fitted in %0.3f[s]' %(time() - t0))\n", "rfr_best = rfr.best_estimator_\n", "rfr_pred = rfr_best.predict(X)\n", "print(rfr_best)\n", "df['y RFR'] = float('nan')\n", "df['y RFR'][first_row:last_row+1] = rfr_pred\n", "\n", "# Fit the Extra Trees Regressor\n", "t0 = time()\n", "xtr = grid_search.GridSearchCV(ensemble.ExtraTreesRegressor(), \n", " param_xtr, cv=15, verbose=0, n_jobs=-1).fit(X, y)\n", "print('\\n\\nXTR Fitted in %0.3f[s]' %(time() - t0))\n", "xtr_best = xtr.best_estimator_\n", "xtr_pred = xtr_best.predict(X)\n", "print(xtr_best)\n", "df['y XTR'] = float('nan')\n", "df['y XTR'][first_row:last_row+1] = xtr_pred\n", "\n", "# Fit the AdaBoost Regressor\n", "t0 = time()\n", "adb = grid_search.GridSearchCV(ensemble.AdaBoostRegressor(), \n", " param_adb, cv=15, verbose=0, n_jobs=-1).fit(X, y)\n", "print('\\n\\nADB Fitted in %0.3f[s]' %(time() - t0))\n", "adb_best = adb.best_estimator_\n", "adb_pred = adb_best.predict(X)\n", "print(adb_best)\n", "df['y ADB'] = float('nan')\n", "df['y ADB'][first_row:last_row+1] = adb_pred\n", "\n", "# Fit the Gradient Tree Boosting\n", "t0 = time()\n", "gtb = grid_search.GridSearchCV(ensemble.GradientBoostingRegressor(), \n", " param_gtb, cv=15, verbose=0, n_jobs=-1).fit(X, y)\n", "print('\\nGTB Fitted in %0.3f[s]' %(time() - t0))\n", "gtb_best = gtb.best_estimator_\n", "gtb_pred = gtb_best.predict(X)\n", "print(gtb_best)\n", "df['y GTB'] = float('nan')\n", "df['y GTB'][first_row:last_row+1] = gtb_pred\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig1 = bk.figure(plot_width=700, plot_height=400, \n", " title='Prediction with nb=%i past input terms' % nb)\n", "fig1.title_text_font_size = '11pt'\n", "fig1.line(df['t'], df['y_true'].shift(-nk), color='black', \n", " alpha=0.8, legend='Ground Truth')\n", "fig1.circle(df['t'], df[col_to_predict], color='black', \n", " alpha=0.5, legend='Measured', size=3)\n", "fig1.line(df['t'], df['y RFR'], color='green', \n", " alpha=1.0, legend='RFR %s'%(col_to_predict,))\n", "fig1.line(df['t'], df['y XTR'], color='red', \n", " alpha=0.4, legend='XTR %s'%(col_to_predict,))\n", "fig1.line(df['t'], df['y ADB'], color='blue', \n", " alpha=0.4, legend='ADB %s'%(col_to_predict,))\n", "fig1.line(df['t'], df['y GTB'], color='magenta', \n", " alpha=0.8, legend='GTB %s'%(col_to_predict,))\n", "\n", "fig2 = bk.figure(plot_width=700, plot_height=400,\n", " title='Prediction Error with nb=%i past input terms' % nb)\n", "fig2.title_text_font_size = '11pt'\n", "fig2.line(df['t'], df[col_to_predict]-df[col_to_predict], color='black', \n", " alpha=0.5, legend='Measured')\n", "fig2.line(df['t'], df['y RFR']-df[col_to_predict], color='green', \n", " alpha=1.0, legend='RFR %s'%(col_to_predict,))\n", "fig2.line(df['t'], df['y XTR']-df[col_to_predict], color='red', \n", " alpha=0.6, legend='XTR %s'%(col_to_predict,))\n", "fig2.line(df['t'], df['y ADB']-df[col_to_predict], color='blue', \n", " alpha=0.3, legend='ADB %s'%(col_to_predict,))\n", "fig2.line(df['t'], df['y GTB']-df[col_to_predict], color='magenta', \n", " alpha=0.3, legend='GTB %s'%(col_to_predict,))\n", "fig2.legend.orientation = 'bottom_right'\n", "\n", "bk.show(bk.gridplot([[fig1], [fig2]]))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2 Compute the Features Importance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we see how a very shallow Decision Tree (a Random forest with just one tree) is able to discover the most important features in presence of strong measurement noise. We generate a sinusoidal signal with a period of 15 samples and fit the tree to predict one step in advance using 22 hystorical steps in the regressor. Even in presence of noise, the tree correctly find the signal periodicity by selecting the features y(t-7) and y(t-15) as the most important in the regressor.\n", "\n", "Hint: try to change error, n_estimators and cv values to see how the model behaves. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RFR Fitted in 0.613[s]\n", "RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,\n", " max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " n_estimators=1, n_jobs=1, oob_score=False, random_state=None,\n", " verbose=0, warm_start=False)\n" ] }, { "data": { "text/html": [ "\n", "
\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.gridspec as gridspec\n", "t = np.arange(0.0, 30.0*np.pi, np.pi/50.0)\n", "error = 0.1\n", "df = pd.DataFrame(np.zeros((t.shape[0],3)), columns=['t', 'y(t)', 'y_true'])\n", "\n", "df['t'] = t\n", "df['y_true'] = np.sin(2*np.pi*df['t'])\n", "df['y(t)'] = df['y_true'] + np.random.normal(0, error, t.shape[0])\n", "nb = 22 # Number of past input terms used to predict the current output.\n", "nk = 1 # Samples delayed from input to output (prediction time steps)\n", "\n", "for i in range(1,nb+1):\n", " col_name = 'y(t-%i)' %(i,)\n", " df.insert(1, col_name, df['y(t)'].shift(i))\n", "col_to_predict = 'y(t+%i)' %(nk,)\n", "df[col_to_predict] = df['y(t)'].shift(-nk)\n", "\n", "first_row, last_row = nb, df.shape[0]-(nk+1)\n", "X = df[df.columns[1:nb+2]].ix[first_row:last_row]\n", "y = df[col_to_predict].ix[first_row:last_row]\n", "\n", "param_rfr1 = {'n_estimators': [1],\n", " 'max_depth': [2],\n", " 'min_samples_split' : [2],\n", " 'min_samples_leaf' : [1],\n", " 'bootstrap' : [True]}\n", "\n", "# Fit the Random Forest Regressor\n", "t0 = time()\n", "rfr = grid_search.GridSearchCV(ensemble.RandomForestRegressor(), \n", " param_rfr1, cv=31, verbose=0, n_jobs=-1).fit(X, y)\n", "print('\\nRFR Fitted in %0.3f[s]' %(time() - t0))\n", "rfr_best = rfr.best_estimator_\n", "rfr_pred = rfr_best.predict(X)\n", "print(rfr_best)\n", "df['y RFR2'] = float('nan')\n", "df['y RFR2'][first_row:last_row+1] = rfr_pred\n", "\n", "feature_importance_raw = rfr_best.feature_importances_\n", "feature_importance = 100.0 * (feature_importance_raw/feature_importance_raw.max())\n", "feature_importance_sorted_idx = np.argsort(feature_importance)\n", "cols = list(X.columns[feature_importance_sorted_idx])\n", "\n", "fig1 = bk.figure(plot_width=700, plot_height=250,\n", " title='Prediction with nb=%i past input terms' % nb)\n", "fig1.title_text_font_size = '11pt'\n", "fig1.legend.orientation = 'bottom_right'\n", "fig1.line(df['t'], df['y_true'].shift(-nk), color='red', \n", " alpha=0.8, legend='Ground Truth')\n", "fig1.line(df['t'], df['y RFR2'], color='green', \n", " alpha=1.0, legend='RFR %s'%(col_to_predict,))\n", "fig1.circle(df['t'], df[col_to_predict], size=3, color='black', \n", " alpha=0.5, legend='Measured')\n", "\n", "fig2 = bk.figure(plot_width=700, plot_height=250, \n", " title='Prediction Error with nb=%i past input terms' % nb)\n", "fig2.title_text_font_size = '10pt'\n", "fig2.line(df['t'], df[col_to_predict]-df[col_to_predict], color='black', \n", " alpha=0.5, line_width=1.5, legend='Measured')\n", "fig2.line(df['t'], df['y RFR2']-df[col_to_predict], color='green', \n", " alpha=1.0, legend='RFR %s'%(col_to_predict,))\n", "\n", "fig3 = bk.figure(plot_width=700, plot_height=700, title='Features Importance', \n", " y_range=cols, x_range=(0,100))\n", "fig3.title_text_font_size = '10pt'\n", "fig3.xaxis.axis_label = 'Relative Importance'\n", "fig3.xaxis.axis_label_text_font_size = '10pt'\n", "fig3.segment(0, cols, feature_importance[feature_importance_sorted_idx], \n", " cols, line_width=25)\n", "\n", "bk.show(bk.gridplot([[fig1], [fig2], [fig3]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "Visit [www.add-for.com]() for more tutorials and updates.\n", "\n", "This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }