{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TSBoost quick start" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import libs & data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's import tsboost, pandas and plotly for some visualisation (pip install --upgrade plotly)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import tsboost as tsb\n", "import pandas as pd\n", "import numpy as np\n", "np.random.seed(2017)\n", "\n", "import plotly as py\n", "import plotly.graph_objs as go\n", "py.offline.init_notebook_mode()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's load the air passenger data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datevolume
1391960-08-01606
1401960-09-01508
1411960-10-01461
1421960-11-01390
1431960-12-01432
\n", "
" ], "text/plain": [ " date volume\n", "139 1960-08-01 606\n", "140 1960-09-01 508\n", "141 1960-10-01 461\n", "142 1960-11-01 390\n", "143 1960-12-01 432" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv(\"air_passenger.csv\", sep=\";\")\n", "data.date = pd.to_datetime(data.date, dayfirst=True)\n", "data.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Those datas represents the evolution of air transport passengers, let's visualise it:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "linkText": "Export to plot.ly", "plotlyServerURL": "https://plot.ly", "responsive": true, "showLink": false }, "data": [ { "type": "scatter", "uid": "b8c7691e-6fb0-4731-b957-5106c1f44dfd", "x": [ "1949-01-01", "1949-02-01", "1949-03-01", "1949-04-01", "1949-05-01", "1949-06-01", "1949-07-01", "1949-08-01", "1949-09-01", "1949-10-01", "1949-11-01", "1949-12-01", "1950-01-01", "1950-02-01", "1950-03-01", "1950-04-01", "1950-05-01", "1950-06-01", "1950-07-01", "1950-08-01", "1950-09-01", "1950-10-01", "1950-11-01", "1950-12-01", "1951-01-01", "1951-02-01", "1951-03-01", "1951-04-01", "1951-05-01", "1951-06-01", "1951-07-01", "1951-08-01", "1951-09-01", "1951-10-01", "1951-11-01", "1951-12-01", "1952-01-01", "1952-02-01", "1952-03-01", "1952-04-01", "1952-05-01", "1952-06-01", "1952-07-01", "1952-08-01", "1952-09-01", "1952-10-01", "1952-11-01", "1952-12-01", "1953-01-01", "1953-02-01", "1953-03-01", "1953-04-01", "1953-05-01", "1953-06-01", "1953-07-01", "1953-08-01", "1953-09-01", "1953-10-01", "1953-11-01", "1953-12-01", "1954-01-01", "1954-02-01", "1954-03-01", "1954-04-01", "1954-05-01", "1954-06-01", "1954-07-01", "1954-08-01", "1954-09-01", "1954-10-01", "1954-11-01", "1954-12-01", "1955-01-01", "1955-02-01", "1955-03-01", "1955-04-01", "1955-05-01", "1955-06-01", "1955-07-01", "1955-08-01", "1955-09-01", 
"1955-10-01", "1955-11-01", "1955-12-01", "1956-01-01", "1956-02-01", "1956-03-01", "1956-04-01", "1956-05-01", "1956-06-01", "1956-07-01", "1956-08-01", "1956-09-01", "1956-10-01", "1956-11-01", "1956-12-01", "1957-01-01", "1957-02-01", "1957-03-01", "1957-04-01", "1957-05-01", "1957-06-01", "1957-07-01", "1957-08-01", "1957-09-01", "1957-10-01", "1957-11-01", "1957-12-01", "1958-01-01", "1958-02-01", "1958-03-01", "1958-04-01", "1958-05-01", "1958-06-01", "1958-07-01", "1958-08-01", "1958-09-01", "1958-10-01", "1958-11-01", "1958-12-01", "1959-01-01", "1959-02-01", "1959-03-01", "1959-04-01", "1959-05-01", "1959-06-01", "1959-07-01", "1959-08-01", "1959-09-01", "1959-10-01", "1959-11-01", "1959-12-01", "1960-01-01", "1960-02-01", "1960-03-01", "1960-04-01", "1960-05-01", "1960-06-01", "1960-07-01", "1960-08-01", "1960-09-01", "1960-10-01", "1960-11-01", "1960-12-01" ], "y": [ 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118, 115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140, 145, 150, 178, 163, 172, 178, 199, 199, 184, 162, 146, 166, 171, 180, 193, 181, 183, 218, 230, 242, 209, 191, 172, 194, 196, 196, 236, 235, 229, 243, 264, 272, 237, 211, 180, 201, 204, 188, 235, 227, 234, 264, 302, 293, 259, 229, 203, 229, 242, 233, 267, 269, 270, 315, 364, 347, 312, 274, 237, 278, 284, 277, 317, 313, 318, 374, 413, 405, 355, 306, 271, 306, 315, 301, 356, 348, 355, 422, 465, 467, 404, 347, 305, 336, 340, 318, 362, 348, 363, 435, 491, 505, 404, 359, 310, 337, 360, 342, 406, 396, 420, 472, 548, 559, 463, 407, 362, 405, 417, 391, 419, 461, 472, 535, 622, 606, 508, 461, 390, 432 ] } ], "layout": {} }, "text/html": [ "
\n", " \n", " \n", "
\n", " \n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "trace = [go.Scatter(x=data.date, y=data.volume)]\n", "py.offline.iplot(trace)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Forecast few years ahead (Production usage)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a start, let's just forecast 5 years without doing any feature engenering / tuning\n", "\n", "We just need to set some data configuration and precise how many steps ahead we want to forecast" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "data_config = {\n", " 'target' : \"volume\", # column we want to forecast\n", " 'date' : \"date\", # colum containing the date\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a monthly problem so 5 years = 12*5 = 60 months" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "model = tsb.TSRegressor(horizons=60)\n", "\n", "results = model.fit_predict(data, **data_config)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_last_datahorizondateforecast
551960-12-01561965-08-01916.146024
561960-12-01571965-09-01748.292623
571960-12-01581965-10-01697.493432
581960-12-01591965-11-01590.572526
591960-12-01601965-12-01632.378526
\n", "
" ], "text/plain": [ " date_last_data horizon date forecast\n", "55 1960-12-01 56 1965-08-01 916.146024\n", "56 1960-12-01 57 1965-09-01 748.292623\n", "57 1960-12-01 58 1965-10-01 697.493432\n", "58 1960-12-01 59 1965-11-01 590.572526\n", "59 1960-12-01 60 1965-12-01 632.378526" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Column \"date_last_data\" reprents the starting point from which forecasts have been made, and is the last data we have in our disposal in a production context.\n", "Column \"date\" represents the effective date for the forecast. horizon = date - date_last_data (in the time step unit of the problem, here it's months)\n", "\n", "TSBoost will be in this mode when no cross validation dates is put as input parameters \"cv_dates\".\n", "\n", "in this case, it's like we are in 1960-12-01, and we forecast the futur from this date.\n", "\n", "Let's visualise the forecasts that we've just done:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "linkText": "Export to plot.ly", "plotlyServerURL": "https://plot.ly", "responsive": true, "showLink": false }, "data": [ { "name": "data", "type": "scatter", "uid": "c6d7f504-90f4-4cb1-9f40-1e0004f52c4e", "x": [ "1949-01-01", "1949-02-01", "1949-03-01", "1949-04-01", "1949-05-01", "1949-06-01", "1949-07-01", "1949-08-01", "1949-09-01", "1949-10-01", "1949-11-01", "1949-12-01", "1950-01-01", "1950-02-01", "1950-03-01", "1950-04-01", "1950-05-01", "1950-06-01", "1950-07-01", "1950-08-01", "1950-09-01", "1950-10-01", "1950-11-01", "1950-12-01", "1951-01-01", "1951-02-01", "1951-03-01", "1951-04-01", "1951-05-01", "1951-06-01", "1951-07-01", "1951-08-01", "1951-09-01", "1951-10-01", "1951-11-01", "1951-12-01", "1952-01-01", "1952-02-01", "1952-03-01", "1952-04-01", "1952-05-01", "1952-06-01", "1952-07-01", 
"1952-08-01", "1952-09-01", "1952-10-01", "1952-11-01", "1952-12-01", "1953-01-01", "1953-02-01", "1953-03-01", "1953-04-01", "1953-05-01", "1953-06-01", "1953-07-01", "1953-08-01", "1953-09-01", "1953-10-01", "1953-11-01", "1953-12-01", "1954-01-01", "1954-02-01", "1954-03-01", "1954-04-01", "1954-05-01", "1954-06-01", "1954-07-01", "1954-08-01", "1954-09-01", "1954-10-01", "1954-11-01", "1954-12-01", "1955-01-01", "1955-02-01", "1955-03-01", "1955-04-01", "1955-05-01", "1955-06-01", "1955-07-01", "1955-08-01", "1955-09-01", "1955-10-01", "1955-11-01", "1955-12-01", "1956-01-01", "1956-02-01", "1956-03-01", "1956-04-01", "1956-05-01", "1956-06-01", "1956-07-01", "1956-08-01", "1956-09-01", "1956-10-01", "1956-11-01", "1956-12-01", "1957-01-01", "1957-02-01", "1957-03-01", "1957-04-01", "1957-05-01", "1957-06-01", "1957-07-01", "1957-08-01", "1957-09-01", "1957-10-01", "1957-11-01", "1957-12-01", "1958-01-01", "1958-02-01", "1958-03-01", "1958-04-01", "1958-05-01", "1958-06-01", "1958-07-01", "1958-08-01", "1958-09-01", "1958-10-01", "1958-11-01", "1958-12-01", "1959-01-01", "1959-02-01", "1959-03-01", "1959-04-01", "1959-05-01", "1959-06-01", "1959-07-01", "1959-08-01", "1959-09-01", "1959-10-01", "1959-11-01", "1959-12-01", "1960-01-01", "1960-02-01", "1960-03-01", "1960-04-01", "1960-05-01", "1960-06-01", "1960-07-01", "1960-08-01", "1960-09-01", "1960-10-01", "1960-11-01", "1960-12-01" ], "y": [ 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118, 115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140, 145, 150, 178, 163, 172, 178, 199, 199, 184, 162, 146, 166, 171, 180, 193, 181, 183, 218, 230, 242, 209, 191, 172, 194, 196, 196, 236, 235, 229, 243, 264, 272, 237, 211, 180, 201, 204, 188, 235, 227, 234, 264, 302, 293, 259, 229, 203, 229, 242, 233, 267, 269, 270, 315, 364, 347, 312, 274, 237, 278, 284, 277, 317, 313, 318, 374, 413, 405, 355, 306, 271, 306, 315, 301, 356, 348, 355, 422, 465, 467, 404, 347, 305, 336, 340, 318, 362, 348, 363, 435, 491, 
505, 404, 359, 310, 337, 360, 342, 406, 396, 420, 472, 548, 559, 463, 407, 362, 405, 417, 391, 419, 461, 472, 535, 622, 606, 508, 461, 390, 432 ] }, { "name": "forecast", "type": "scatter", "uid": "a5f22662-17e8-408b-81ce-332b36e540c4", "x": [ "1961-01-01", "1961-02-01", "1961-03-01", "1961-04-01", "1961-05-01", "1961-06-01", "1961-07-01", "1961-08-01", "1961-09-01", "1961-10-01", "1961-11-01", "1961-12-01", "1962-01-01", "1962-02-01", "1962-03-01", "1962-04-01", "1962-05-01", "1962-06-01", "1962-07-01", "1962-08-01", "1962-09-01", "1962-10-01", "1962-11-01", "1962-12-01", "1963-01-01", "1963-02-01", "1963-03-01", "1963-04-01", "1963-05-01", "1963-06-01", "1963-07-01", "1963-08-01", "1963-09-01", "1963-10-01", "1963-11-01", "1963-12-01", "1964-01-01", "1964-02-01", "1964-03-01", "1964-04-01", "1964-05-01", "1964-06-01", "1964-07-01", "1964-08-01", "1964-09-01", "1964-10-01", "1964-11-01", "1964-12-01", "1965-01-01", "1965-02-01", "1965-03-01", "1965-04-01", "1965-05-01", "1965-06-01", "1965-07-01", "1965-08-01", "1965-09-01", "1965-10-01", "1965-11-01", "1965-12-01" ], "y": [ 440.56751577548334, 419.86480383632517, 463.97069088511864, 486.57665162129547, 490.66275614786997, 570.8042458821101, 667.5303592969328, 634.5132708034196, 537.3890463134463, 491.2767812939254, 416.8078230376567, 464.12017940833846, 514.6009299401613, 483.3558574676411, 530.6022512412969, 550.5216468982093, 545.9746028743596, 624.8241618652991, 724.8925086202904, 692.0771319422081, 592.9657018303127, 533.7245517967915, 461.75242687068715, 498.1734044093377, 523.5010318653983, 500.8019206109825, 530.7883339902107, 570.8367878010105, 568.5795730221934, 651.3258244909375, 773.0349672811617, 766.9961128981896, 643.282225579592, 575.7021947820449, 487.56717647894436, 530.0870351342096, 565.4525150624941, 529.3086592906668, 570.649150093889, 612.4041903791202, 647.4446404789927, 735.085890130265, 848.7844867593914, 848.6354112089664, 705.8822186274433, 633.7801426467489, 542.0795167909423, 
572.6184142685335, 621.9193532253564, 581.5571850004283, 609.0585864043348, 669.3191420867789, 694.0458403360813, 774.936507636749, 908.7714356857423, 916.1460243433164, 748.2926229576502, 697.4934317301685, 590.572526250777, 632.3785263609514 ] } ], "layout": {} }, "text/html": [ "
\n", " \n", " \n", "
\n", " \n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "traces = [\n", " go.Scatter(x=data.date, y=data.volume, name=\"data\"),\n", " go.Scatter(x=results.date, y=results.forecast, name=\"forecast\"), \n", "]\n", "py.offline.iplot(traces)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross Validation for Feature Engineering & Tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TSBoost allows data scientists to be in a classical machine learning context, so any feature engenering can be done as well as meta parmeters tuning\n", "\n", "We can use period cross validation, which take time into consideration, to try to validate the use of the feature engenering & tuning in the future.\n", "\n", "We need to create periods with dates. Let's make 2 crossvalidation periods, one for Feature engineering & tuning validation, and one for final testing" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "cv_valid = tsb.TSRegressor.generate_dates(begin_date=\"1959-01-01\", horizon=12, time_step=\"month\")\n", "cv_test = tsb.TSRegressor.generate_dates(begin_date=\"1960-01-01\", horizon=12, time_step=\"month\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "let's visualise those 2 CVs" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "linkText": "Export to plot.ly", "plotlyServerURL": "https://plot.ly", "responsive": true, "showLink": false }, "data": [ { "name": "other_data", "type": "scatter", "uid": "89609f31-9c9e-4aa6-a41f-92fee17b4e52", "x": [ "1949-01-01", "1949-02-01", "1949-03-01", "1949-04-01", "1949-05-01", "1949-06-01", "1949-07-01", "1949-08-01", "1949-09-01", "1949-10-01", "1949-11-01", "1949-12-01", "1950-01-01", "1950-02-01", "1950-03-01", "1950-04-01", "1950-05-01", "1950-06-01", "1950-07-01", "1950-08-01", "1950-09-01", "1950-10-01", "1950-11-01", "1950-12-01", 
"1951-01-01", "1951-02-01", "1951-03-01", "1951-04-01", "1951-05-01", "1951-06-01", "1951-07-01", "1951-08-01", "1951-09-01", "1951-10-01", "1951-11-01", "1951-12-01", "1952-01-01", "1952-02-01", "1952-03-01", "1952-04-01", "1952-05-01", "1952-06-01", "1952-07-01", "1952-08-01", "1952-09-01", "1952-10-01", "1952-11-01", "1952-12-01", "1953-01-01", "1953-02-01", "1953-03-01", "1953-04-01", "1953-05-01", "1953-06-01", "1953-07-01", "1953-08-01", "1953-09-01", "1953-10-01", "1953-11-01", "1953-12-01", "1954-01-01", "1954-02-01", "1954-03-01", "1954-04-01", "1954-05-01", "1954-06-01", "1954-07-01", "1954-08-01", "1954-09-01", "1954-10-01", "1954-11-01", "1954-12-01", "1955-01-01", "1955-02-01", "1955-03-01", "1955-04-01", "1955-05-01", "1955-06-01", "1955-07-01", "1955-08-01", "1955-09-01", "1955-10-01", "1955-11-01", "1955-12-01", "1956-01-01", "1956-02-01", "1956-03-01", "1956-04-01", "1956-05-01", "1956-06-01", "1956-07-01", "1956-08-01", "1956-09-01", "1956-10-01", "1956-11-01", "1956-12-01", "1957-01-01", "1957-02-01", "1957-03-01", "1957-04-01", "1957-05-01", "1957-06-01", "1957-07-01", "1957-08-01", "1957-09-01", "1957-10-01", "1957-11-01", "1957-12-01", "1958-01-01", "1958-02-01", "1958-03-01", "1958-04-01", "1958-05-01", "1958-06-01", "1958-07-01", "1958-08-01", "1958-09-01", "1958-10-01", "1958-11-01", "1958-12-01" ], "y": [ 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118, 115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140, 145, 150, 178, 163, 172, 178, 199, 199, 184, 162, 146, 166, 171, 180, 193, 181, 183, 218, 230, 242, 209, 191, 172, 194, 196, 196, 236, 235, 229, 243, 264, 272, 237, 211, 180, 201, 204, 188, 235, 227, 234, 264, 302, 293, 259, 229, 203, 229, 242, 233, 267, 269, 270, 315, 364, 347, 312, 274, 237, 278, 284, 277, 317, 313, 318, 374, 413, 405, 355, 306, 271, 306, 315, 301, 356, 348, 355, 422, 465, 467, 404, 347, 305, 336, 340, 318, 362, 348, 363, 435, 491, 505, 404, 359, 310, 337, 360 ] }, { "name": "cv_valid", "type": 
"scatter", "uid": "ee1f7131-4ff6-4c8c-8270-3f3e556d9250", "x": [ "1959-01-01", "1959-02-01", "1959-03-01", "1959-04-01", "1959-05-01", "1959-06-01", "1959-07-01", "1959-08-01", "1959-09-01", "1959-10-01", "1959-11-01", "1959-12-01" ], "y": [ 360, 342, 406, 396, 420, 472, 548, 559, 463, 407, 362, 405 ] }, { "name": "cv_test", "type": "scatter", "uid": "f0886e43-23f1-4b59-939f-a2bb67acf20a", "x": [ "1960-01-01", "1960-02-01", "1960-03-01", "1960-04-01", "1960-05-01", "1960-06-01", "1960-07-01", "1960-08-01", "1960-09-01", "1960-10-01", "1960-11-01", "1960-12-01" ], "y": [ 417, 391, 419, 461, 472, 535, 622, 606, 508, 461, 390, 432 ] } ], "layout": {} }, "text/html": [ "
\n", " \n", " \n", "
\n", " \n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "traces = [\n", " go.Scatter(x=data[data.date < \"1959-01-01\"].date, y=data[data.date <= \"1959-01-01\"].volume, name=\"other_data\"),\n", " go.Scatter(x=data[data.date.isin(cv_valid)].date, y=data[data.date.isin(cv_valid)].volume, name=\"cv_valid\"), \n", " go.Scatter(x=data[data.date.isin(cv_test)].date, y=data[data.date.isin(cv_test)].volume, name=\"cv_test\"),\n", "]\n", "\n", "py.offline.iplot(traces)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Engineering example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make a 12 months ahead forecast algorithm for speeding things up:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "horizons = 12" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Any features can be added to the datas. You have to be carefull not to use to the future you didn't acces in the past.\n", "\n", "Tsboost create by default lag features from past of the target we wish to forecast, and also temporal features with the informations contained in date column (month, year, day of the week, etc...)\n", "\n", "We can use this with the meta parameter \"inner_feature_eng\" to activate it or not.\n", "\n", "Let's try if this feature engenering is relevant for our problem and can help improving accuracy.\n", "Let's test the algorithm without feature engineering on our validation period :" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "model1 = tsb.TSRegressor(horizons=horizons,\n", " inner_feature_eng=False)\n", "\n", "forecast1_cv_valid = model1.fit_predict(data, cv_dates=cv_valid, **data_config)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_last_datahorizondateforecast
01958-12-0111959-01-01344.489314
11959-01-0111959-02-01334.546558
21959-02-0111959-03-01387.062901
31959-03-0111959-04-01380.945582
41959-04-0111959-05-01412.223502
\n", "
" ], "text/plain": [ " date_last_data horizon date forecast\n", "0 1958-12-01 1 1959-01-01 344.489314\n", "1 1959-01-01 1 1959-02-01 334.546558\n", "2 1959-02-01 1 1959-03-01 387.062901\n", "3 1959-03-01 1 1959-04-01 380.945582\n", "4 1959-04-01 1 1959-05-01 412.223502" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "forecast1_cv_valid.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we analyse the data, it looks like the 60 months forecast we did before, but the column \"date_last_data\" varies, so that we can have each 12 horizons for every dates in the cross validation period and compare those forecasts equitably." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TSBoost has static function to get the mape for each horizon on the period, you can of course use whatever residual analyse you want" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "results1_cv_valid = tsb.TSRegressor.get_result(data, forecast1_cv_valid, metric=\"mape\", **data_config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's do the same but with the inner feature engenering of TSBoost:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "model2 = tsb.TSRegressor(horizons=horizons,\n", " inner_feature_eng=True)\n", "\n", "forecast2_cv_valid = model2.fit_predict(data, cv_dates=cv_valid, **data_config)\n", "results2_cv_valid = tsb.TSRegressor.get_result(data, forecast2_cv_valid, metric=\"mape\", **data_config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "let's visualise how those 2 models performed on the validation period with a graph\n", "\n", "on x-axis you have horizon (short term to long term forecasts)\n", "\n", "on y-axis you have the MAPE (Mean Avergae Percentage Error) did on the validation period datas" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { 
"data": { "application/vnd.plotly.v1+json": { "config": { "linkText": "Export to plot.ly", "plotlyServerURL": "https://plot.ly", "responsive": true, "showLink": false }, "data": [ { "name": "model_no_FE", "type": "scatter", "uid": "fb7c5c36-22cc-47b6-8b8d-c4a681fca578", "x": [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 ], "y": [ 3.0003694822787614, 4.436469333709874, 5.386330641704578, 5.497087885902341, 6.066248093066456, 6.683115974777851, 6.188264996083092, 6.720896664053186, 7.75262522822861, 6.279520750220243, 7.237973934637075, 7.678781412017184 ] }, { "name": "model_with_FE", "type": "scatter", "uid": "8119ef29-c1cc-4529-8016-40fef9f499ac", "x": [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 ], "y": [ 2.55576237256856, 3.510780546938893, 4.126838786889175, 4.890154569649497, 5.4580950743551355, 5.34994255283042, 5.346613656704505, 6.146355020069933, 6.374065370068511, 6.431766398402175, 6.6662263505836465, 4.9364817725700565 ] } ], "layout": { "xaxis": { "title": { "text": "horizons" } }, "yaxis": { "title": { "text": "MAPE" } } } }, "text/html": [ "
\n", " \n", " \n", "
\n", " \n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# some plotly configs for graphs\n", "layout = go.Layout(xaxis=dict(title= 'horizons'), yaxis=dict(title= 'MAPE'))\n", "\n", "traces = [\n", " go.Scatter(x=results1_cv_valid.index, y=results1_cv_valid.forecast, name=\"model_no_FE\"),\n", " go.Scatter(x=results2_cv_valid.index, y=results2_cv_valid.forecast, name=\"model_with_FE\"),\n", "]\n", "\n", "fig= go.Figure(data=traces, layout=layout)\n", "py.offline.iplot(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that on this period, adding feature enginnering from TSBoost works quiet well on almost all horizons.\n", "\n", "We have a global patern that should appear on almost every time series problems : \n", "\n", "the mean error of long term forecast is bigger that the shorter terms, we can observe a log shape fashion (to see it clearer we could have had selected larger period for cross validation & test periods).\n", "\n", "It's a general rule for most time serie problems : it's easier to forecast short term than long term. Mainly because of shift distributions: variable ditributions has higher chance to be more similar to short term distribution than long term one" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tuning example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is two main ways to do tuning :\n", "- we can tune tsboost meta parameters.\n", "\n", "- we can tune optimizer (xgboost or lightgbm) meta parameters\n", "\n", "By default, tsboost use the instance of xgboost.XGBRegressor() with default instance parameters, but we can use a non default instance.\n", "\n", "We can also chose lightgbm.LGBMRegressor() as optimizer. 
Let's see how xgboost's brother performs on this problem :" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "model3 = tsb.TSRegressor(horizons=horizons,\n", " optimizer=\"lgbm\",\n", " inner_feature_eng=True)\n", "\n", "forecast3_cv_valid = model3.fit_predict(data, cv_dates=cv_valid, **data_config)\n", "results3_cv_valid = tsb.TSRegressor.get_result(data, forecast3_cv_valid, metric=\"mape\", **data_config)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "linkText": "Export to plot.ly", "plotlyServerURL": "https://plot.ly", "responsive": true, "showLink": false }, "data": [ { "name": "model_xgboost_with_FE", "type": "scatter", "uid": "ed9321e5-e9cf-43cc-9ad7-f78db1c7456f", "x": [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 ], "y": [ 2.55576237256856, 3.510780546938893, 4.126838786889175, 4.890154569649497, 5.4580950743551355, 5.34994255283042, 5.346613656704505, 6.146355020069933, 6.374065370068511, 6.431766398402175, 6.6662263505836465, 4.9364817725700565 ] }, { "name": "model_lightgbm_with_FE", "type": "scatter", "uid": "163a27d8-0070-4043-88da-ab00cbc7f711", "x": [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 ], "y": [ 2.2139149668277764, 2.6917497260231467, 3.2891646542995194, 3.0038631903299553, 3.3359970446440634, 4.142725608492236, 4.348678983068264, 4.74723389705568, 5.586527977640689, 6.206552887675495, 5.22419367891539, 5.1153586824255886 ] } ], "layout": { "xaxis": { "title": { "text": "horizons" } }, "yaxis": { "title": { "text": "MAPE" } } } }, "text/html": [ "
\n", " \n", " \n", "
\n", " \n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "traces = [\n", " go.Scatter(x=results2_cv_valid.index, y=results2_cv_valid.forecast, name=\"model_xgboost_with_FE\"),\n", " go.Scatter(x=results3_cv_valid.index, y=results3_cv_valid.forecast, name=\"model_lightgbm_with_FE\"),\n", "]\n", "\n", "fig= go.Figure(data=traces, layout=layout)\n", "py.offline.iplot(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can observ a nice performance of lightgbm with the feature engineering.\n", "\n", "It is to be noted that it is better to find good features than overtuned algorithm." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test models on the future CV period" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the test CV period to validate (or not) if our previous feature engineering / tuning have some positive effect on the future. And maybe conclude that it will be also the case for after (production usage)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "results1 = model1.fit_predict(data, cv_dates=cv_test, **data_config)\n", "results1 = tsb.TSRegressor.get_result(data, results1, metric=\"mape\", **data_config)\n", "\n", "results3 = model3.fit_predict(data, cv_dates=cv_test, **data_config)\n", "results3 = tsb.TSRegressor.get_result(data, results3, metric=\"mape\", **data_config)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "linkText": "Export to plot.ly", "plotlyServerURL": "https://plot.ly", "responsive": true, "showLink": false }, "data": [ { "name": "model1", "type": "scatter", "uid": "dbae1282-cf74-4f07-876d-eba91d894584", "x": [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 ], "y": [ 3.557016687581981, 3.2046702059793137, 3.5784940699072103, 4.300402480627134, 3.866059741452242, 3.7170508223693584, 4.662861594497444, 4.505176849886731, 4.293815174158836, 
4.232933726778993, 4.937544606462944, 5.535243859889725 ] }, { "name": "model3", "type": "scatter", "uid": "5b7e8b47-0f32-403e-9c09-55898284c002", "x": [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 ], "y": [ 3.5268386238303275, 3.2014714126492674, 3.2370894483637156, 3.0149670498179546, 2.7939668914848874, 3.2957712660350844, 3.5335754864426554, 3.7779284667349935, 3.8739954343806944, 4.3084282327786045, 4.485047262115917, 4.586108569973484 ] } ], "layout": { "xaxis": { "title": { "text": "horizons" } }, "yaxis": { "title": { "text": "MAPE" } } } }, "text/html": [ "
\n", " \n", " \n", "
\n", " \n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "traces = [\n", " go.Scatter(x=results1.index, y=results1.forecast, name=\"model1\"),\n", " go.Scatter(x=results3.index, y=results3.forecast, name=\"model3\"),\n", "]\n", "fig= go.Figure(data=traces, layout=layout)\n", "py.offline.iplot(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that on the last CV period, the MAPE improvement is globaly there (even if short term could need some investigation)\n", "\n", "We can suppose that the feature engenering & tuning should give better for the future (even if it's not completly sure with distribution drift)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## If satisfied, we can use this last algo to forecast 12 steps ahead in production mode" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "results = model3.fit_predict(data, **data_config)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "linkText": "Export to plot.ly", "plotlyServerURL": "https://plot.ly", "responsive": true, "showLink": false }, "data": [ { "name": "data", "type": "scatter", "uid": "d01c32da-77bd-4286-bed2-d06d93da66b4", "x": [ "1949-01-01", "1949-02-01", "1949-03-01", "1949-04-01", "1949-05-01", "1949-06-01", "1949-07-01", "1949-08-01", "1949-09-01", "1949-10-01", "1949-11-01", "1949-12-01", "1950-01-01", "1950-02-01", "1950-03-01", "1950-04-01", "1950-05-01", "1950-06-01", "1950-07-01", "1950-08-01", "1950-09-01", "1950-10-01", "1950-11-01", "1950-12-01", "1951-01-01", "1951-02-01", "1951-03-01", "1951-04-01", "1951-05-01", "1951-06-01", "1951-07-01", "1951-08-01", "1951-09-01", "1951-10-01", "1951-11-01", "1951-12-01", "1952-01-01", "1952-02-01", "1952-03-01", "1952-04-01", "1952-05-01", "1952-06-01", "1952-07-01", "1952-08-01", "1952-09-01", "1952-10-01", "1952-11-01", "1952-12-01", "1953-01-01", 
"1953-02-01", "1953-03-01", "1953-04-01", "1953-05-01", "1953-06-01", "1953-07-01", "1953-08-01", "1953-09-01", "1953-10-01", "1953-11-01", "1953-12-01", "1954-01-01", "1954-02-01", "1954-03-01", "1954-04-01", "1954-05-01", "1954-06-01", "1954-07-01", "1954-08-01", "1954-09-01", "1954-10-01", "1954-11-01", "1954-12-01", "1955-01-01", "1955-02-01", "1955-03-01", "1955-04-01", "1955-05-01", "1955-06-01", "1955-07-01", "1955-08-01", "1955-09-01", "1955-10-01", "1955-11-01", "1955-12-01", "1956-01-01", "1956-02-01", "1956-03-01", "1956-04-01", "1956-05-01", "1956-06-01", "1956-07-01", "1956-08-01", "1956-09-01", "1956-10-01", "1956-11-01", "1956-12-01", "1957-01-01", "1957-02-01", "1957-03-01", "1957-04-01", "1957-05-01", "1957-06-01", "1957-07-01", "1957-08-01", "1957-09-01", "1957-10-01", "1957-11-01", "1957-12-01", "1958-01-01", "1958-02-01", "1958-03-01", "1958-04-01", "1958-05-01", "1958-06-01", "1958-07-01", "1958-08-01", "1958-09-01", "1958-10-01", "1958-11-01", "1958-12-01", "1959-01-01", "1959-02-01", "1959-03-01", "1959-04-01", "1959-05-01", "1959-06-01", "1959-07-01", "1959-08-01", "1959-09-01", "1959-10-01", "1959-11-01", "1959-12-01", "1960-01-01", "1960-02-01", "1960-03-01", "1960-04-01", "1960-05-01", "1960-06-01", "1960-07-01", "1960-08-01", "1960-09-01", "1960-10-01", "1960-11-01", "1960-12-01" ], "y": [ 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118, 115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140, 145, 150, 178, 163, 172, 178, 199, 199, 184, 162, 146, 166, 171, 180, 193, 181, 183, 218, 230, 242, 209, 191, 172, 194, 196, 196, 236, 235, 229, 243, 264, 272, 237, 211, 180, 201, 204, 188, 235, 227, 234, 264, 302, 293, 259, 229, 203, 229, 242, 233, 267, 269, 270, 315, 364, 347, 312, 274, 237, 278, 284, 277, 317, 313, 318, 374, 413, 405, 355, 306, 271, 306, 315, 301, 356, 348, 355, 422, 465, 467, 404, 347, 305, 336, 340, 318, 362, 348, 363, 435, 491, 505, 404, 359, 310, 337, 360, 342, 406, 396, 420, 472, 548, 559, 463, 407, 362, 405, 
417, 391, 419, 461, 472, 535, 622, 606, 508, 461, 390, 432 ] }, { "name": "forecast", "type": "scatter", "uid": "99175bfe-48be-479d-9534-7e62edd45f51", "x": [ "1961-01-01", "1961-02-01", "1961-03-01", "1961-04-01", "1961-05-01", "1961-06-01", "1961-07-01", "1961-08-01", "1961-09-01", "1961-10-01", "1961-11-01", "1961-12-01" ], "y": [ 450.2638389907019, 428.05687263739975, 457.26433759714723, 485.42809359388843, 501.3687538498249, 562.0645857353153, 670.5218926811143, 646.3867528683888, 546.8874736248398, 503.3624114270947, 416.19199242798624, 466.68461823910593 ] } ], "layout": {} }, "text/html": [ "
\n", " \n", " \n", "
\n", " \n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "traces = [\n", " go.Scatter(x=data.date, y=data.volume, name=\"data\"),\n", " go.Scatter(x=results.date, y=results.forecast, name=\"forecast\"), \n", "]\n", "\n", "py.offline.iplot(traces)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is important to note that we tuned and make feature engenering for 12 step ahead algorithm at the same time for this example.\n", "\n", "Tsboost creates independant step forecasts, so each horizon can have it's own feature engineering & tuning without affecting the other horizons.\n", "\n", "If we only need a 3 months steps ahead algorithm for example, we can set horizons like this:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "model = tsb.TSRegressor(horizons=[3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "you could also do batches of horizons:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "model = tsb.TSRegressor(horizons=[3,4,5])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }