{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"tags": [
"hide_input"
]
},
"source": [
"# Why feature learning is better than simple propositionalization\n",
"\n",
"In this notebook, we compare getML to featuretools and tsfresh, both of which are open-source libraries for feature engineering. We find that getML's advanced feature learning algorithms yield significantly better predictions on this dataset. We then discuss why that is.\n",
"\n",
"Summary:\n",
"\n",
"- Prediction type: __Regression model__\n",
"- Domain: __Air pollution__\n",
"- Prediction target: __pm 2.5 concentration__\n",
"- Source data: __Multivariate time series__\n",
"- Population size: __41757__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Background\n",
"\n",
"Many data scientists and AutoML tools use propositionalization methods for feature engineering. These methods usually work as follows:\n",
"\n",
"- Generate a large number of hard-coded features,\n",
"- Use feature selection to pick a percentage of these features.\n",
"\n",
"By contrast, getML offers feature learning: it adapts machine learning techniques such as decision trees or gradient boosting to the problem of extracting features from relational data and time series.\n",
"\n",
"In this notebook, we will benchmark getML against [featuretools](https://www.featuretools.com/) and [tsfresh](https://tsfresh.readthedocs.io/en/latest/). Both of these libraries use propositionalization approaches for feature engineering.\n",
"\n",
"As our example dataset, we use a publicly available dataset on air pollution in Beijing, China: [Beijing PM2.5 Data](https://archive.ics.uci.edu/dataset/381/beijing+pm2+5+data). The dataset was originally used in the following study:\n",
"> Liang, X., Zou, T., Guo, B., Li, S., Zhang, H., Zhang, S., Huang, H. and Chen, S. X. (2015). Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating. Proceedings of the Royal Society A, 471, 20150257.\n",
"\n",
"We find that getML significantly outperforms featuretools and tsfresh in terms of predictive accuracy ([see Discussion](#3.-Discussion)). Our findings indicate that getML's feature learning algorithms adapt better to data sets and also scale better due to their lower memory requirements."
]
},
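{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make this contrast concrete, here is a minimal sketch in pandas (the column and feature names are made up for illustration) of what propositionalization amounts to: a fixed set of hard-coded aggregations over a rolling window, followed by feature selection. A feature learner, by contrast, searches for the aggregations and conditions that best fit the target.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical hourly sensor readings.\n",
"df = pd.DataFrame({'reading': [1.0, 2.0, 4.0, 8.0, 16.0]})\n",
"\n",
"# Propositionalization: generate many hard-coded aggregations over a rolling window ...\n",
"window = df['reading'].rolling(3, min_periods=1)\n",
"features = pd.DataFrame({\n",
"    'mean_3h': window.mean(),\n",
"    'max_3h': window.max(),\n",
"    'min_3h': window.min(),\n",
"})\n",
"\n",
"# ... and then select a subset of these columns by some feature selection criterion.\n",
"```"
]
},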
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of contents\n",
"\n",
"1. [Loading data](#1.-Loading-data)\n",
"2. [Predictive modeling](#2.-Predictive-modeling)\n",
"3. [Discussion](#3.-Discussion)\n",
"4. [Conclusion](#4.-Conclusion)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We start the analysis with the setup of our session."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"PYARROW_IGNORE_TIMEZONE\"] = \"1\"\n",
"from pathlib import Path\n",
"\n",
"from urllib import request\n",
"\n",
"import getml\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"from scipy.stats import pearsonr\n",
"\n",
"from utils.load import load_or_retrieve\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# NOTE: featuretools and tsfresh have substantial resource requirements, so prepared results can be used instead of re-running them via the RUN_FEATURETOOLS and RUN_TSFRESH flags.\n",
"\n",
"RUN_FEATURETOOLS = False\n",
"RUN_TSFRESH = False\n",
"\n",
"if RUN_FEATURETOOLS:\n",
" from utils import FTTimeSeriesBuilder\n",
"\n",
"if RUN_TSFRESH:\n",
" from utils import TSFreshBuilder"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"getML engine is already running.\n",
"Loading pipelines... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"\n",
"Connected to project 'air_pollution'\n"
]
}
],
"source": [
"getml.engine.launch(home_directory=Path.home(), allow_remote_ips=True, token='token')\n",
"getml.set_project(\"air_pollution\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Loading data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 Download from source\n",
"\n",
"Downloading the raw data from the UCI Machine Learning Repository and transforming it into a prediction-ready format takes time. To get to the getML model building as fast as possible, we have prepared the data for you and excluded the preparation code from this notebook."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Loading population...\n",
" 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n"
]
}
],
"source": [
"data = getml.datasets.load_air_pollution()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Predictive modeling\n",
"\n",
"\n",
"### 2.1 Pipeline 1: Complex features, 7 days"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we split our data. We introduce a [simple, time-based split](https://docs.getml.com/latest/api/split/getml.data.split.time.html) and use all data up to 2013-12-31 for training and everything from 2014-01-01 onward for testing."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" train \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 1 \n",
" \n",
" \n",
" train \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 2 \n",
" \n",
" \n",
" train \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 3 \n",
" \n",
" \n",
" train \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 4 \n",
" \n",
" \n",
" train \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" ... \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
\n",
"\n",
" \n",
" 41757 rows \n",
" \n",
" type: StringColumnView \n",
" \n",
"
\n"
],
"text/plain": [
" \n",
" 0 train\n",
" 1 train\n",
" 2 train\n",
" 3 train\n",
" 4 train\n",
" ... \n",
"\n",
"\n",
"41757 rows\n",
"type: StringColumnView"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"split = getml.data.split.time(\n",
" population=data, time_stamp=\"date\", test=getml.data.time.datetime(2014, 1, 1)\n",
")\n",
"\n",
"split"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For our first experiment, we will learn complex features and allow a memory of up to seven days. That means that, at any given point in time, the algorithm is allowed to look back seven days into the past."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"data model \n",
" \n",
"
diagram
\n",
"
population population date <= date Memory: 7.0 days \n",
"
\n",
"\n",
" \n",
"
staging
\n",
" \n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" data frames \n",
" \n",
" \n",
" \n",
" staging table \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" POPULATION__STAGING_TABLE_1 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 1 \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" POPULATION__STAGING_TABLE_2 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
\n",
"
\n",
" \n",
"container \n",
"\n",
"
\n",
"
population
\n",
" \n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" subset \n",
" \n",
" \n",
" \n",
" name \n",
" \n",
" \n",
" \n",
" rows \n",
" \n",
" \n",
" \n",
" type \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" test \n",
" \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" 8661 \n",
" \n",
" \n",
" \n",
" View \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 1 \n",
" \n",
" \n",
" train \n",
" \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" 33096 \n",
" \n",
" \n",
" \n",
" View \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
\n",
"
\n",
"
\n",
"
peripheral
\n",
" \n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" name \n",
" \n",
" \n",
" \n",
" rows \n",
" \n",
" \n",
" \n",
" type \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" 41757 \n",
" \n",
" \n",
" \n",
" DataFrame \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
\n",
"
\n",
"
"
],
"text/plain": [
"data model\n",
"\n",
" population:\n",
" columns:\n",
" - DEWP: numerical\n",
" - TEMP: numerical\n",
" - PRES: numerical\n",
" - Iws: numerical\n",
" - Is: numerical\n",
" - ...\n",
"\n",
" joins:\n",
" - right: 'population'\n",
" time_stamps: (population.date, population.date)\n",
" relationship: 'many-to-many'\n",
" memory: 604800.0\n",
" lagged_targets: False\n",
"\n",
" population:\n",
" columns:\n",
" - DEWP: numerical\n",
" - TEMP: numerical\n",
" - PRES: numerical\n",
" - Iws: numerical\n",
" - Is: numerical\n",
" - ...\n",
"\n",
"\n",
"container\n",
"\n",
" population\n",
" subset name rows type\n",
" 0 test population 8661 View\n",
" 1 train population 33096 View\n",
"\n",
" peripheral\n",
" name rows type \n",
" 0 population 41757 DataFrame"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"time_series1 = getml.data.TimeSeries(\n",
" population=data,\n",
" alias=\"population\",\n",
" split=split,\n",
" time_stamps=\"date\",\n",
" memory=getml.data.time.days(7),\n",
")\n",
"\n",
"time_series1"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"Pipeline(data_model='population',\n",
" feature_learners=['RelMT'],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=['population'],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['getML: RelMT', 'memory: 7d', 'complex features']) "
],
"text/plain": [
"Pipeline(data_model='population',\n",
" feature_learners=['RelMT'],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=['population'],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['getML: RelMT', 'memory: 7d', 'complex features'])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"relmt = getml.feature_learning.RelMT(\n",
" num_features=10,\n",
" loss_function=getml.feature_learning.loss_functions.SquareLoss,\n",
" seed=4367,\n",
" num_threads=1,\n",
")\n",
"\n",
"predictor = getml.predictors.XGBoostRegressor(n_jobs=1)\n",
"\n",
"pipe1 = getml.pipeline.Pipeline(\n",
" tags=[\"getML: RelMT\", \"memory: 7d\", \"complex features\"],\n",
" data_model=time_series1.data_model,\n",
" feature_learners=[relmt],\n",
" predictors=[predictor],\n",
")\n",
"\n",
"pipe1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is good practice to always check your data model first, even though `check(...)` is also called by `fit(...)`; this gives us the chance to make last-minute changes."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking data model...\n",
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"Checking... 100% |██████████| [elapsed: 00:02, remaining: 00:00] \n",
"\n",
"OK.\n"
]
}
],
"source": [
"pipe1.check(time_series1.train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now fit the pipeline on the training set and evaluate the results both in-sample and out-of-sample."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking data model...\n",
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"\n",
"OK.\n",
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"RelMT: Training features... 100% |██████████| [elapsed: 01:32, remaining: 00:00] \n",
"RelMT: Building features... 100% |██████████| [elapsed: 00:15, remaining: 00:00] \n",
"XGBoost: Training as predictor... 100% |██████████| [elapsed: 00:02, remaining: 00:00] \n",
"\n",
"Trained pipeline.\n",
"Time taken: 0h:1m:48.467388\n",
"\n"
]
},
{
"data": {
"text/html": [
"Pipeline(data_model='population',\n",
" feature_learners=['RelMT'],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=['population'],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['getML: RelMT', 'memory: 7d', 'complex features', 'container-lahmUK']) "
],
"text/plain": [
"Pipeline(data_model='population',\n",
" feature_learners=['RelMT'],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=['population'],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['getML: RelMT', 'memory: 7d', 'complex features', 'container-lahmUK'])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe1.fit(time_series1.train)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"RelMT: Building features... 100% |██████████| [elapsed: 00:04, remaining: 00:00] \n",
"\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" date time \n",
" \n",
" \n",
" \n",
" set used \n",
" \n",
" \n",
" \n",
" target \n",
" \n",
" \n",
" \n",
" mae \n",
" \n",
" \n",
" \n",
" rmse \n",
" \n",
" \n",
" \n",
" rsquared \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" 2024-02-21 14:55:35 \n",
" \n",
" \n",
" \n",
" train \n",
" \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" 35.1664 \n",
" \n",
" \n",
" \n",
" 50.9038 \n",
" \n",
" \n",
" \n",
" 0.6925 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 1 \n",
" \n",
" \n",
" 2024-02-21 14:55:39 \n",
" \n",
" \n",
" \n",
" test \n",
" \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" 39.6596 \n",
" \n",
" \n",
" \n",
" 57.5014 \n",
" \n",
" \n",
" \n",
" 0.6306 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
"
],
"text/plain": [
" date time set used target mae rmse rsquared\n",
"0 2024-02-21 14:55:35 train pm2.5 35.1664 50.9038 0.6925\n",
"1 2024-02-21 14:55:39 test pm2.5 39.6596 57.5014 0.6306"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe1.score(time_series1.test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 Pipeline 2: Complex features, 1 day\n",
"\n",
"For our second experiment, we will learn complex features but allow a memory of only one day."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"data model \n",
" \n",
"
diagram
\n",
"
population population date <= date Memory: 1.0 days \n",
"
\n",
"\n",
" \n",
"
staging
\n",
" \n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" data frames \n",
" \n",
" \n",
" \n",
" staging table \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" POPULATION__STAGING_TABLE_1 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 1 \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" POPULATION__STAGING_TABLE_2 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
\n",
"
\n",
" \n",
"container \n",
"\n",
"
\n",
"
population
\n",
" \n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" subset \n",
" \n",
" \n",
" \n",
" name \n",
" \n",
" \n",
" \n",
" rows \n",
" \n",
" \n",
" \n",
" type \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" test \n",
" \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" 8661 \n",
" \n",
" \n",
" \n",
" View \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 1 \n",
" \n",
" \n",
" train \n",
" \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" 33096 \n",
" \n",
" \n",
" \n",
" View \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
\n",
"
\n",
"
\n",
"
peripheral
\n",
" \n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" name \n",
" \n",
" \n",
" \n",
" rows \n",
" \n",
" \n",
" \n",
" type \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" 41757 \n",
" \n",
" \n",
" \n",
" DataFrame \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
\n",
"
\n",
"
"
],
"text/plain": [
"data model\n",
"\n",
" population:\n",
" columns:\n",
" - DEWP: numerical\n",
" - TEMP: numerical\n",
" - PRES: numerical\n",
" - Iws: numerical\n",
" - Is: numerical\n",
" - ...\n",
"\n",
" joins:\n",
" - right: 'population'\n",
" time_stamps: (population.date, population.date)\n",
" relationship: 'many-to-many'\n",
" memory: 86400.0\n",
" lagged_targets: False\n",
"\n",
" population:\n",
" columns:\n",
" - DEWP: numerical\n",
" - TEMP: numerical\n",
" - PRES: numerical\n",
" - Iws: numerical\n",
" - Is: numerical\n",
" - ...\n",
"\n",
"\n",
"container\n",
"\n",
" population\n",
" subset name rows type\n",
" 0 test population 8661 View\n",
" 1 train population 33096 View\n",
"\n",
" peripheral\n",
" name rows type \n",
" 0 population 41757 DataFrame"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"time_series2 = getml.data.TimeSeries(\n",
" population=data,\n",
" alias=\"population\",\n",
" split=split,\n",
" time_stamps=\"date\",\n",
" memory=getml.data.time.days(1),\n",
")\n",
"\n",
"time_series2"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Pipeline(data_model='population',\n",
" feature_learners=['RelMT'],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=['population'],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['getML: RelMT', 'memory: 1d', 'complex features']) "
],
"text/plain": [
"Pipeline(data_model='population',\n",
" feature_learners=['RelMT'],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=['population'],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['getML: RelMT', 'memory: 1d', 'complex features'])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"relmt = getml.feature_learning.RelMT(\n",
" num_features=10,\n",
" loss_function=getml.feature_learning.loss_functions.SquareLoss,\n",
" seed=4367,\n",
" num_threads=1,\n",
")\n",
"\n",
"predictor = getml.predictors.XGBoostRegressor(n_jobs=1)\n",
"\n",
"pipe2 = getml.pipeline.Pipeline(\n",
" tags=[\"getML: RelMT\", \"memory: 1d\", \"complex features\"],\n",
" data_model=time_series2.data_model,\n",
" feature_learners=[relmt],\n",
" predictors=[predictor],\n",
")\n",
"\n",
"pipe2"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking data model...\n",
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"Checking... 100% |██████████| [elapsed: 00:02, remaining: 00:00] \n",
"\n",
"OK.\n"
]
}
],
"source": [
"pipe2.check(time_series2.train)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking data model...\n",
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"\n",
"OK.\n",
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"RelMT: Training features... 100% |██████████| [elapsed: 00:28, remaining: 00:00] \n",
"RelMT: Building features... 100% |██████████| [elapsed: 00:02, remaining: 00:00] \n",
"XGBoost: Training as predictor... 100% |██████████| [elapsed: 00:02, remaining: 00:00] \n",
"\n",
"Trained pipeline.\n",
"Time taken: 0h:0m:32.22083\n",
"\n"
]
},
{
"data": {
"text/html": [
"Pipeline(data_model='population',\n",
" feature_learners=['RelMT'],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=['population'],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['getML: RelMT', 'memory: 1d', 'complex features', 'container-JKQRr0']) "
],
"text/plain": [
"Pipeline(data_model='population',\n",
" feature_learners=['RelMT'],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=['population'],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['getML: RelMT', 'memory: 1d', 'complex features', 'container-JKQRr0'])"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe2.fit(time_series2.train)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"RelMT: Building features... 100% |██████████| [elapsed: 00:01, remaining: 00:00] \n",
"\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" date time \n",
" \n",
" \n",
" \n",
" set used \n",
" \n",
" \n",
" \n",
" target \n",
" \n",
" \n",
" \n",
" mae \n",
" \n",
" \n",
" \n",
" rmse \n",
" \n",
" \n",
" \n",
" rsquared \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" 2024-02-21 14:56:13 \n",
" \n",
" \n",
" \n",
" train \n",
" \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" 38.1593 \n",
" \n",
" \n",
" \n",
" 55.3541 \n",
" \n",
" \n",
" \n",
" 0.6366 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 1 \n",
" \n",
" \n",
" 2024-02-21 14:56:14 \n",
" \n",
" \n",
" \n",
" test \n",
" \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" 47.5451 \n",
" \n",
" \n",
" \n",
" 66.9418 \n",
" \n",
" \n",
" \n",
" 0.4901 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
"
],
"text/plain": [
" date time set used target mae rmse rsquared\n",
"0 2024-02-21 14:56:13 train pm2.5 38.1593 55.3541 0.6366\n",
"1 2024-02-21 14:56:14 test pm2.5 47.5451 66.9418 0.4901"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe2.score(time_series2.test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 Pipeline 3: Simple features, 7 days\n",
"\n",
"For our third experiment, we will learn simple features and allow a memory of up to seven days."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"data model \n",
" \n",
"
diagram
\n",
"
population population date <= date Memory: 7.0 days \n",
"
\n",
"\n",
" \n",
"
staging
\n",
" \n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" data frames \n",
" \n",
" \n",
" \n",
" staging table \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" POPULATION__STAGING_TABLE_1 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 1 \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" POPULATION__STAGING_TABLE_2 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
\n",
"
\n",
" \n",
"container \n",
"\n",
"
\n",
"
population
\n",
" \n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" subset \n",
" \n",
" \n",
" \n",
" name \n",
" \n",
" \n",
" \n",
" rows \n",
" \n",
" \n",
" \n",
" type \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" test \n",
" \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" 8661 \n",
" \n",
" \n",
" \n",
" View \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 1 \n",
" \n",
" \n",
" train \n",
" \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" 33096 \n",
" \n",
" \n",
" \n",
" View \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
\n",
"
\n",
"
\n",
"
peripheral
\n",
" \n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" name \n",
" \n",
" \n",
" \n",
" rows \n",
" \n",
" \n",
" \n",
" type \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" 41757 \n",
" \n",
" \n",
" \n",
" DataFrame \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
\n",
"
\n",
"
"
],
"text/plain": [
"data model\n",
"\n",
" population:\n",
" columns:\n",
" - DEWP: numerical\n",
" - TEMP: numerical\n",
" - PRES: numerical\n",
" - Iws: numerical\n",
" - Is: numerical\n",
" - ...\n",
"\n",
" joins:\n",
" - right: 'population'\n",
" time_stamps: (population.date, population.date)\n",
" relationship: 'many-to-many'\n",
" memory: 604800.0\n",
" lagged_targets: False\n",
"\n",
" population:\n",
" columns:\n",
" - DEWP: numerical\n",
" - TEMP: numerical\n",
" - PRES: numerical\n",
" - Iws: numerical\n",
" - Is: numerical\n",
" - ...\n",
"\n",
"\n",
"container\n",
"\n",
" population\n",
" subset name rows type\n",
" 0 test population 8661 View\n",
" 1 train population 33096 View\n",
"\n",
" peripheral\n",
" name rows type \n",
" 0 population 41757 DataFrame"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"time_series3 = getml.data.TimeSeries(\n",
" population=data,\n",
" alias=\"population\",\n",
" split=split,\n",
" time_stamps=\"date\",\n",
" memory=getml.data.time.days(7),\n",
")\n",
"\n",
"time_series3"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Pipeline(data_model='population',\n",
" feature_learners=['FastProp'],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=['population'],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['getML: FastProp', 'memory: 7d', 'simple features']) "
],
"text/plain": [
"Pipeline(data_model='population',\n",
" feature_learners=['FastProp'],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=['population'],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['getML: FastProp', 'memory: 7d', 'simple features'])"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fast_prop = getml.feature_learning.FastProp(\n",
" loss_function=getml.feature_learning.loss_functions.SquareLoss,\n",
" num_threads=1,\n",
" aggregation=getml.feature_learning.FastProp.agg_sets.All,\n",
")\n",
"\n",
"predictor = getml.predictors.XGBoostRegressor(n_jobs=1)\n",
"\n",
"pipe3 = getml.pipeline.Pipeline(\n",
" tags=[\"getML: FastProp\", \"memory: 7d\", \"simple features\"],\n",
" data_model=time_series3.data_model,\n",
" feature_learners=[fast_prop],\n",
" predictors=[predictor],\n",
")\n",
"\n",
"pipe3"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking data model...\n",
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"Checking... 100% |██████████| [elapsed: 00:02, remaining: 00:00] \n",
"\n",
"OK.\n"
]
}
],
"source": [
"pipe3.check(time_series3.train)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking data model...\n",
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"\n",
"OK.\n",
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"FastProp: Trying 378 features... 100% |██████████| [elapsed: 00:37, remaining: 00:00] \n",
"FastProp: Building features... 100% |██████████| [elapsed: 00:21, remaining: 00:00] \n",
"XGBoost: Training as predictor... 100% |██████████| [elapsed: 00:18, remaining: 00:00] \n",
"\n",
"Trained pipeline.\n",
"Time taken: 0h:1m:16.310261\n",
"\n"
]
},
{
"data": {
"text/html": [
"Pipeline(data_model='population',\n",
" feature_learners=['FastProp'],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=['population'],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['getML: FastProp', 'memory: 7d', 'simple features', 'container-O5cthD']) "
],
"text/plain": [
"Pipeline(data_model='population',\n",
" feature_learners=['FastProp'],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=['population'],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['getML: FastProp', 'memory: 7d', 'simple features', 'container-O5cthD'])"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe3.fit(time_series3.train)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"FastProp: Building features... 100% |██████████| [elapsed: 00:06, remaining: 00:00] \n",
"\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" date time \n",
" \n",
" \n",
" \n",
" set used \n",
" \n",
" \n",
" \n",
" target \n",
" \n",
" \n",
" \n",
" mae \n",
" \n",
" \n",
" \n",
" rmse \n",
" \n",
" \n",
" \n",
" rsquared \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" 2024-02-21 14:57:32 \n",
" \n",
" \n",
" \n",
" train \n",
" \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" 35.9677 \n",
" \n",
" \n",
" \n",
" 50.7711 \n",
" \n",
" \n",
" \n",
" 0.7036 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 1 \n",
" \n",
" \n",
" 2024-02-21 14:57:38 \n",
" \n",
" \n",
" \n",
" test \n",
" \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" 45.4586 \n",
" \n",
" \n",
" \n",
" 62.6197 \n",
" \n",
" \n",
" \n",
" 0.5617 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
"
],
"text/plain": [
" date time set used target mae rmse rsquared\n",
"0 2024-02-21 14:57:32 train pm2.5 35.9677 50.7711 0.7036\n",
"1 2024-02-21 14:57:38 test pm2.5 45.4586 62.6197 0.5617"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe3.score(time_series3.test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4 Pipeline 4: Simple features, 1 day\n",
"\n",
"For our fourth experiment, we will learn simple features and allow a memory of up to one day."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"data model \n",
" \n",
"
diagram
\n",
"
population population date <= date Memory: 1.0 days \n",
"
\n",
"\n",
" \n",
"
staging
\n",
" \n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" data frames \n",
" \n",
" \n",
" \n",
" staging table \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" POPULATION__STAGING_TABLE_1 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 1 \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" POPULATION__STAGING_TABLE_2 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
\n",
"
\n",
" \n",
"container \n",
"\n",
"
\n",
"
population
\n",
" \n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" subset \n",
" \n",
" \n",
" \n",
" name \n",
" \n",
" \n",
" \n",
" rows \n",
" \n",
" \n",
" \n",
" type \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" test \n",
" \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" 8661 \n",
" \n",
" \n",
" \n",
" View \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 1 \n",
" \n",
" \n",
" train \n",
" \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" 33096 \n",
" \n",
" \n",
" \n",
" View \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
\n",
"
\n",
"
\n",
"
peripheral
\n",
" \n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" name \n",
" \n",
" \n",
" \n",
" rows \n",
" \n",
" \n",
" \n",
" type \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" population \n",
" \n",
" \n",
" \n",
" 41757 \n",
" \n",
" \n",
" \n",
" DataFrame \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
\n",
"
\n",
"
"
],
"text/plain": [
"data model\n",
"\n",
" population:\n",
" columns:\n",
" - DEWP: numerical\n",
" - TEMP: numerical\n",
" - PRES: numerical\n",
" - Iws: numerical\n",
" - Is: numerical\n",
" - ...\n",
"\n",
" joins:\n",
" - right: 'population'\n",
" time_stamps: (population.date, population.date)\n",
" relationship: 'many-to-many'\n",
" memory: 86400.0\n",
" lagged_targets: False\n",
"\n",
" population:\n",
" columns:\n",
" - DEWP: numerical\n",
" - TEMP: numerical\n",
" - PRES: numerical\n",
" - Iws: numerical\n",
" - Is: numerical\n",
" - ...\n",
"\n",
"\n",
"container\n",
"\n",
" population\n",
" subset name rows type\n",
" 0 test population 8661 View\n",
" 1 train population 33096 View\n",
"\n",
" peripheral\n",
" name rows type \n",
" 0 population 41757 DataFrame"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"time_series4 = getml.data.TimeSeries(\n",
" population=data,\n",
" alias=\"population\",\n",
" split=split,\n",
" time_stamps=\"date\",\n",
" memory=getml.data.time.days(1),\n",
")\n",
"\n",
"time_series4"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Pipeline(data_model='population',\n",
" feature_learners=['FastProp'],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=['population'],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['getML: FastProp', 'memory: 1d', 'simple features']) "
],
"text/plain": [
"Pipeline(data_model='population',\n",
" feature_learners=['FastProp'],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=['population'],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['getML: FastProp', 'memory: 1d', 'simple features'])"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fast_prop = getml.feature_learning.FastProp(\n",
" loss_function=getml.feature_learning.loss_functions.SquareLoss,\n",
" num_threads=1,\n",
" aggregation=getml.feature_learning.FastProp.agg_sets.All,\n",
")\n",
"\n",
"predictor = getml.predictors.XGBoostRegressor(n_jobs=1)\n",
"\n",
"pipe4 = getml.pipeline.Pipeline(\n",
" tags=[\"getML: FastProp\", \"memory: 1d\", \"simple features\"],\n",
" data_model=time_series4.data_model,\n",
" feature_learners=[fast_prop],\n",
" predictors=[predictor],\n",
")\n",
"\n",
"pipe4"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking data model...\n",
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"Checking... 100% |██████████| [elapsed: 00:02, remaining: 00:00] \n",
"\n",
"OK.\n"
]
}
],
"source": [
"pipe4.check(time_series4.train)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking data model...\n",
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"\n",
"OK.\n",
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"FastProp: Trying 378 features... 100% |██████████| [elapsed: 00:07, remaining: 00:00] \n",
"FastProp: Building features... 100% |██████████| [elapsed: 00:03, remaining: 00:00] \n",
"XGBoost: Training as predictor... 100% |██████████| [elapsed: 00:18, remaining: 00:00] \n",
"\n",
"Trained pipeline.\n",
"Time taken: 0h:0m:27.714016\n",
"\n"
]
},
{
"data": {
"text/html": [
"Pipeline(data_model='population',\n",
" feature_learners=['FastProp'],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=['population'],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['getML: FastProp', 'memory: 1d', 'simple features', 'container-EGLb2M']) "
],
"text/plain": [
"Pipeline(data_model='population',\n",
" feature_learners=['FastProp'],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=['population'],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['getML: FastProp', 'memory: 1d', 'simple features', 'container-EGLb2M'])"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe4.fit(time_series4.train)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"FastProp: Building features... 100% |██████████| [elapsed: 00:01, remaining: 00:00] \n",
"\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" date time \n",
" \n",
" \n",
" \n",
" set used \n",
" \n",
" \n",
" \n",
" target \n",
" \n",
" \n",
" \n",
" mae \n",
" \n",
" \n",
" \n",
" rmse \n",
" \n",
" \n",
" \n",
" rsquared \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" 2024-02-21 14:58:08 \n",
" \n",
" \n",
" \n",
" train \n",
" \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" 38.3028 \n",
" \n",
" \n",
" \n",
" 55.2472 \n",
" \n",
" \n",
" \n",
" 0.6438 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 1 \n",
" \n",
" \n",
" 2024-02-21 14:58:09 \n",
" \n",
" \n",
" \n",
" test \n",
" \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" 44.2526 \n",
" \n",
" \n",
" \n",
" 63.4191 \n",
" \n",
" \n",
" \n",
" 0.5462 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
"
],
"text/plain": [
" date time set used target mae rmse rsquared\n",
"0 2024-02-21 14:58:08 train pm2.5 38.3028 55.2472 0.6438\n",
"1 2024-02-21 14:58:09 test pm2.5 44.2526 63.4191 0.5462"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe4.score(time_series4.test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.5 Using featuretools\n",
"\n",
"To make things a bit easier, we have written high-level wrappers around featuretools and tsfresh which we placed in a separate module (`utils`)."
]
},
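  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both wrappers have to solve the same underlying problem: for every prediction point, they materialize the window of past observations over which the aggregations are computed. The sketch below illustrates that expansion in plain Python; the function is ours for illustration and is not part of either library:\n",
    "\n",
    "```python\n",
    "def expand_windows(time_stamps, memory):\n",
    "    # For each prediction time t, collect the indices of all\n",
    "    # observations in the window (t - memory, t]. Every row\n",
    "    # reappears in many windows, which is why propositionalization\n",
    "    "    # tools can be so memory-hungry.\n",
    "    return [\n",
    "        (i, [j for j, s in enumerate(time_stamps) if t - memory < s <= t])\n",
    "        for i, t in enumerate(time_stamps)\n",
    "    ]\n",
    "```\n",
    "\n",
    "With hourly data and a memory of one day, every observation is duplicated up to 24 times; with seven days, up to 168 times.\n"
   ]
  },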
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"data_train_pandas = time_series1.train.population.to_pandas()\n",
"data_test_pandas = time_series1.test.population.to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"tsfresh and featuretools require the time series to have ids. Since there is only a single time series, that series has the same id."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"data_train_pandas[\"id\"] = 1\n",
"data_test_pandas[\"id\"] = 1"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"if RUN_FEATURETOOLS:\n",
" ft_builder = FTTimeSeriesBuilder(\n",
" num_features=200,\n",
" horizon=pd.Timedelta(days=0),\n",
" memory=pd.Timedelta(days=1),\n",
" column_id=\"id\",\n",
" time_stamp=\"date\",\n",
" target=\"pm2.5\",\n",
" )\n",
" #\n",
" featuretools_training = ft_builder.fit(data_train_pandas)\n",
" featuretools_test = ft_builder.transform(data_test_pandas)\n",
"\n",
" data_featuretools_training = getml.data.DataFrame.from_pandas(\n",
" featuretools_training, name=\"featuretools_training\"\n",
" )\n",
" data_featuretools_test = getml.data.DataFrame.from_pandas(\n",
" featuretools_test, name=\"featuretools_test\"\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading 'featuretools_training' from disk (project folder).\n",
"\n",
"Loading 'featuretools_test' from disk (project folder).\n",
"\n"
]
}
],
"source": [
"if not RUN_FEATURETOOLS:\n",
" data_featuretools_training = load_or_retrieve(\n",
" \"https://static.getml.com/datasets/air_pollution/featuretools/featuretools_training.csv\"\n",
" )\n",
" data_featuretools_test = load_or_retrieve(\n",
" \"https://static.getml.com/datasets/air_pollution/featuretools/featuretools_test.csv\"\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"def set_roles_featuretools(df):\n",
" df.set_role([\"date\"], getml.data.roles.time_stamp)\n",
" df.set_role([\"pm2.5\"], getml.data.roles.target)\n",
" df.set_role([\"date\"], getml.data.roles.time_stamp)\n",
" df.set_role(df.roles.unused, getml.data.roles.numerical)\n",
" df.set_role([\"id\"], getml.data.roles.unused_float)\n",
" return df\n",
"\n",
"df_featuretools_training = set_roles_featuretools(data_featuretools_training)\n",
"df_featuretools_test = set_roles_featuretools(data_featuretools_test)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Pipeline(data_model='population',\n",
" feature_learners=[],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=[],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['featuretools', 'memory: 1d', 'simple features']) "
],
"text/plain": [
"Pipeline(data_model='population',\n",
" feature_learners=[],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=[],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['featuretools', 'memory: 1d', 'simple features'])"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predictor = getml.predictors.XGBoostRegressor()\n",
"\n",
"pipe5 = getml.pipeline.Pipeline(\n",
" tags=[\"featuretools\", \"memory: 1d\", \"simple features\"], predictors=[predictor]\n",
")\n",
"\n",
"pipe5"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking data model...\n",
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"Checking... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"\n",
"OK.\n"
]
}
],
"source": [
"pipe5.check(df_featuretools_training)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking data model...\n",
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"\n",
"OK.\n",
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"XGBoost: Training as predictor... 100% |██████████| [elapsed: 00:08, remaining: 00:00] \n",
"\n",
"Trained pipeline.\n",
"Time taken: 0h:0m:8.163012\n",
"\n"
]
},
{
"data": {
"text/html": [
"Pipeline(data_model='population',\n",
" feature_learners=[],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=[],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['featuretools', 'memory: 1d', 'simple features']) "
],
"text/plain": [
"Pipeline(data_model='population',\n",
" feature_learners=[],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=[],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['featuretools', 'memory: 1d', 'simple features'])"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe5.fit(df_featuretools_training)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" date time \n",
" \n",
" \n",
" \n",
" set used \n",
" \n",
" \n",
" \n",
" target \n",
" \n",
" \n",
" \n",
" mae \n",
" \n",
" \n",
" \n",
" rmse \n",
" \n",
" \n",
" \n",
" rsquared \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" 2024-02-21 14:58:19 \n",
" \n",
" \n",
" \n",
" featuretools_training \n",
" \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" 38.0455 \n",
" \n",
" \n",
" \n",
" 54.4693 \n",
" \n",
" \n",
" \n",
" 0.6567 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 1 \n",
" \n",
" \n",
" 2024-02-21 14:58:19 \n",
" \n",
" \n",
" \n",
" featuretools_test \n",
" \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" 45.3084 \n",
" \n",
" \n",
" \n",
" 64.2717 \n",
" \n",
" \n",
" \n",
" 0.5373 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
"
],
"text/plain": [
" date time set used target mae rmse rsquared\n",
"0 2024-02-21 14:58:19 featuretools_training pm2.5 38.0455 54.4693 0.6567\n",
"1 2024-02-21 14:58:19 featuretools_test pm2.5 45.3084 64.2717 0.5373"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe5.score(df_featuretools_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.6 Using tsfresh\n",
"\n",
"Next, we construct features with tsfresh. tsfresh is based on pandas and rely\\ies on explicit copies for meny operations. This leads to an excessive memory consumption that renders tsfresh nearly unusable for real-world scenarios. Remeber, this is a relatively small data set.\n",
"\n",
"To limit the memory consumption, we undertake the following steps:\n",
"\n",
"- We limit ourselves to a memory of 1 day from any point in time. This is necessary, because tsfresh duplicates records for every time stamp. That means that looking back 7 days instead of one day, the memory consumption would be seven times as high.\n",
"- We extract only tsfresh's `MinimalFCParameters` and `IndexBasedFCParameters` (the latter is a superset of `TimeBasedFCParameters`).\n",
"\n",
"In order to make sure that tsfresh's features can be compared to getML's features, we also do the following:\n",
"\n",
"- We apply tsfresh's built-in feature selection algorithm.\n",
"- Of the remaining features, we only keep the 40 features most correlated with the target (in terms of the absolute value of the correlation).\n",
"- We add the original columns as additional features.\n"
]
},
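  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The correlation-based filtering step can be sketched in plain Python. This is a simplified stand-in for what our wrapper does internally; the function names are illustrative and not part of tsfresh:\n",
    "\n",
    "```python\n",
    "import statistics\n",
    "\n",
    "def pearson(xs, ys):\n",
    "    # Pearson correlation of two equally long sequences\n",
    "    mx, my = statistics.fmean(xs), statistics.fmean(ys)\n",
    "    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))\n",
    "    sx = sum((x - mx) ** 2 for x in xs) ** 0.5\n",
    "    sy = sum((y - my) ** 2 for y in ys) ** 0.5\n",
    "    return cov / (sx * sy)\n",
    "\n",
    "def top_k_features(features, target, k):\n",
    "    # Keep the k features most correlated with the target,\n",
    "    # ranked by the absolute value of the correlation.\n",
    "    ranked = sorted(\n",
    "        features.items(),\n",
    "        key=lambda kv: abs(pearson(kv[1], target)),\n",
    "        reverse=True,\n",
    "    )\n",
    "    return [name for name, _ in ranked[:k]]\n",
    "```\n",
    "\n",
    "In the notebook, k would be 40, applied after tsfresh's own feature selection.\n"
   ]
  },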
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" DEWP \n",
" TEMP \n",
" PRES \n",
" Iws \n",
" Is \n",
" Ir \n",
" pm2.5 \n",
" date \n",
" id \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" -16.0 \n",
" -4.0 \n",
" 1020.0 \n",
" 1.79 \n",
" 0.0 \n",
" 0.0 \n",
" 129.0 \n",
" 2010-01-02 00:00:00 \n",
" 1 \n",
" \n",
" \n",
" 1 \n",
" -15.0 \n",
" -4.0 \n",
" 1020.0 \n",
" 2.68 \n",
" 0.0 \n",
" 0.0 \n",
" 148.0 \n",
" 2010-01-02 01:00:00 \n",
" 1 \n",
" \n",
" \n",
" 2 \n",
" -11.0 \n",
" -5.0 \n",
" 1021.0 \n",
" 3.57 \n",
" 0.0 \n",
" 0.0 \n",
" 159.0 \n",
" 2010-01-02 02:00:00 \n",
" 1 \n",
" \n",
" \n",
" 3 \n",
" -7.0 \n",
" -5.0 \n",
" 1022.0 \n",
" 5.36 \n",
" 1.0 \n",
" 0.0 \n",
" 181.0 \n",
" 2010-01-02 03:00:00 \n",
" 1 \n",
" \n",
" \n",
" 4 \n",
" -7.0 \n",
" -5.0 \n",
" 1022.0 \n",
" 6.25 \n",
" 2.0 \n",
" 0.0 \n",
" 138.0 \n",
" 2010-01-02 04:00:00 \n",
" 1 \n",
" \n",
" \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" \n",
" \n",
" 33091 \n",
" -19.0 \n",
" 7.0 \n",
" 1013.0 \n",
" 114.87 \n",
" 0.0 \n",
" 0.0 \n",
" 22.0 \n",
" 2013-12-31 19:00:00 \n",
" 1 \n",
" \n",
" \n",
" 33092 \n",
" -21.0 \n",
" 7.0 \n",
" 1014.0 \n",
" 119.79 \n",
" 0.0 \n",
" 0.0 \n",
" 18.0 \n",
" 2013-12-31 20:00:00 \n",
" 1 \n",
" \n",
" \n",
" 33093 \n",
" -21.0 \n",
" 7.0 \n",
" 1014.0 \n",
" 125.60 \n",
" 0.0 \n",
" 0.0 \n",
" 23.0 \n",
" 2013-12-31 21:00:00 \n",
" 1 \n",
" \n",
" \n",
" 33094 \n",
" -21.0 \n",
" 6.0 \n",
" 1014.0 \n",
" 130.52 \n",
" 0.0 \n",
" 0.0 \n",
" 20.0 \n",
" 2013-12-31 22:00:00 \n",
" 1 \n",
" \n",
" \n",
" 33095 \n",
" -20.0 \n",
" 7.0 \n",
" 1014.0 \n",
" 137.67 \n",
" 0.0 \n",
" 0.0 \n",
" 23.0 \n",
" 2013-12-31 23:00:00 \n",
" 1 \n",
" \n",
" \n",
"
\n",
"
33096 rows × 9 columns
\n",
"
"
],
"text/plain": [
" DEWP TEMP PRES Iws Is Ir pm2.5 date id\n",
"0 -16.0 -4.0 1020.0 1.79 0.0 0.0 129.0 2010-01-02 00:00:00 1\n",
"1 -15.0 -4.0 1020.0 2.68 0.0 0.0 148.0 2010-01-02 01:00:00 1\n",
"2 -11.0 -5.0 1021.0 3.57 0.0 0.0 159.0 2010-01-02 02:00:00 1\n",
"3 -7.0 -5.0 1022.0 5.36 1.0 0.0 181.0 2010-01-02 03:00:00 1\n",
"4 -7.0 -5.0 1022.0 6.25 2.0 0.0 138.0 2010-01-02 04:00:00 1\n",
"... ... ... ... ... ... ... ... ... ..\n",
"33091 -19.0 7.0 1013.0 114.87 0.0 0.0 22.0 2013-12-31 19:00:00 1\n",
"33092 -21.0 7.0 1014.0 119.79 0.0 0.0 18.0 2013-12-31 20:00:00 1\n",
"33093 -21.0 7.0 1014.0 125.60 0.0 0.0 23.0 2013-12-31 21:00:00 1\n",
"33094 -21.0 6.0 1014.0 130.52 0.0 0.0 20.0 2013-12-31 22:00:00 1\n",
"33095 -20.0 7.0 1014.0 137.67 0.0 0.0 23.0 2013-12-31 23:00:00 1\n",
"\n",
"[33096 rows x 9 columns]"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_train_pandas"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"if RUN_TSFRESH:\n",
" tsfresh_builder = TSFreshBuilder(\n",
" num_features=200, memory=24, column_id=\"id\", time_stamp=\"date\", target=\"pm2.5\"\n",
" )\n",
" #\n",
" tsfresh_training = tsfresh_builder.fit(data_train_pandas)\n",
" tsfresh_test = tsfresh_builder.transform(data_test_pandas)\n",
" #\n",
" data_tsfresh_training = getml.data.DataFrame.from_pandas(\n",
" tsfresh_training, name=\"tsfresh_training\"\n",
" )\n",
" data_tsfresh_test = getml.data.DataFrame.from_pandas(\n",
" tsfresh_test, name=\"tsfresh_test\"\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"tsfresh does not contain built-in machine learning algorithms. In order to ensure a fair comparison, we use the exact same machine learning algorithm we have also used for getML: An XGBoost regressor with all hyperparameters set to their default values.\n",
"\n",
"In order to do so, we load the tsfresh features into the getML engine."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading 'tsfresh_training' from disk (project folder).\n",
"\n",
"Loading 'tsfresh_test' from disk (project folder).\n",
"\n"
]
}
],
"source": [
"if not RUN_TSFRESH:\n",
" data_tsfresh_training = load_or_retrieve(\n",
" \"https://static.getml.com/datasets/air_pollution/tsfresh/tsfresh_training.csv\"\n",
" )\n",
" data_tsfresh_test = load_or_retrieve(\n",
" \"https://static.getml.com/datasets/air_pollution/tsfresh/tsfresh_test.csv\"\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As usual, we need to set roles:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"def set_roles_tsfresh(df):\n",
" df.set_role([\"date\"], getml.data.roles.time_stamp)\n",
" df.set_role([\"pm2.5\"], getml.data.roles.target)\n",
" df.set_role([\"date\"], getml.data.roles.time_stamp)\n",
" df.set_role(df.roles.unused, getml.data.roles.numerical)\n",
" df.set_role([\"id\"], getml.data.roles.unused_float)\n",
" return df\n",
"\n",
"df_tsfresh_training = set_roles_tsfresh(data_tsfresh_training)\n",
"df_tsfresh_test = set_roles_tsfresh(data_tsfresh_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, our pipeline is very simple. It only consists of a single XGBoostRegressor."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Pipeline(data_model='population',\n",
" feature_learners=[],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=[],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['tsfresh', 'memory: 1d', 'simple features']) "
],
"text/plain": [
"Pipeline(data_model='population',\n",
" feature_learners=[],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=[],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['tsfresh', 'memory: 1d', 'simple features'])"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predictor = getml.predictors.XGBoostRegressor()\n",
"\n",
"pipe6 = getml.pipeline.Pipeline(\n",
" tags=[\"tsfresh\", \"memory: 1d\", \"simple features\"], predictors=[predictor]\n",
")\n",
"\n",
"pipe6"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking data model...\n",
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"Checking... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"\n",
"OK.\n"
]
}
],
"source": [
"pipe6.check(df_tsfresh_training)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking data model...\n",
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"\n",
"OK.\n",
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"XGBoost: Training as predictor... 100% |██████████| [elapsed: 00:06, remaining: 00:00] \n",
"\n",
"Trained pipeline.\n",
"Time taken: 0h:0m:5.970352\n",
"\n"
]
},
{
"data": {
"text/html": [
"Pipeline(data_model='population',\n",
" feature_learners=[],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=[],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['tsfresh', 'memory: 1d', 'simple features']) "
],
"text/plain": [
"Pipeline(data_model='population',\n",
" feature_learners=[],\n",
" feature_selectors=[],\n",
" include_categorical=False,\n",
" loss_function='SquareLoss',\n",
" peripheral=[],\n",
" predictors=['XGBoostRegressor'],\n",
" preprocessors=[],\n",
" share_selected_features=0.5,\n",
" tags=['tsfresh', 'memory: 1d', 'simple features'])"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe6.fit(df_tsfresh_training)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] \n",
"\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" date time \n",
" \n",
" \n",
" \n",
" set used \n",
" \n",
" \n",
" \n",
" target \n",
" \n",
" \n",
" \n",
" mae \n",
" \n",
" \n",
" \n",
" rmse \n",
" \n",
" \n",
" \n",
" rsquared \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" 2024-02-21 14:58:26 \n",
" \n",
" \n",
" \n",
" tsfresh_training \n",
" \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" 40.8062 \n",
" \n",
" \n",
" \n",
" 57.7874 \n",
" \n",
" \n",
" \n",
" 0.6106 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 1 \n",
" \n",
" \n",
" 2024-02-21 14:58:26 \n",
" \n",
" \n",
" \n",
" tsfresh_test \n",
" \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" 46.698 \n",
" \n",
" \n",
" \n",
" 65.9163 \n",
" \n",
" \n",
" \n",
" 0.5105 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
"
],
"text/plain": [
" date time set used target mae rmse rsquared\n",
"0 2024-02-21 14:58:26 tsfresh_training pm2.5 40.8062 57.7874 0.6106\n",
"1 2024-02-21 14:58:26 tsfresh_test pm2.5 46.698 65.9163 0.5105"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe6.score(df_tsfresh_test)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" target \n",
" \n",
" \n",
" \n",
" name \n",
" \n",
" \n",
" \n",
" correlation \n",
" \n",
" \n",
" \n",
" importance \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" feature_1_1 \n",
" \n",
" \n",
" \n",
" 0.7269 \n",
" \n",
" \n",
" \n",
" 0.18463717 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 1 \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" feature_1_2 \n",
" \n",
" \n",
" \n",
" 0.7046 \n",
" \n",
" \n",
" \n",
" 0.11726964 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 2 \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" feature_1_3 \n",
" \n",
" \n",
" \n",
" 0.7158 \n",
" \n",
" \n",
" \n",
" 0.08975971 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 3 \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" feature_1_4 \n",
" \n",
" \n",
" \n",
" 0.6811 \n",
" \n",
" \n",
" \n",
" 0.01235796 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 4 \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" feature_1_5 \n",
" \n",
" \n",
" \n",
" 0.7363 \n",
" \n",
" \n",
" \n",
" 0.27688485 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" ... \n",
" \n",
" \n",
" \n",
" ... \n",
" \n",
" \n",
" \n",
" ... \n",
" \n",
" \n",
" \n",
" ... \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 11 \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" temp \n",
" \n",
" \n",
" \n",
" -0.2112 \n",
" \n",
" \n",
" \n",
" 0.00403082 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 12 \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" pres \n",
" \n",
" \n",
" \n",
" 0.0811 \n",
" \n",
" \n",
" \n",
" 0.00672836 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 13 \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" iws \n",
" \n",
" \n",
" \n",
" -0.2166 \n",
" \n",
" \n",
" \n",
" 0.00111994 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 14 \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" is \n",
" \n",
" \n",
" \n",
" 0.0045 \n",
" \n",
" \n",
" \n",
" 0.00006808 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 15 \n",
" \n",
" \n",
" pm2.5 \n",
" \n",
" \n",
" \n",
" ir \n",
" \n",
" \n",
" \n",
" -0.0541 \n",
" \n",
" \n",
" \n",
" 0.00060757 \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"
"
],
"text/plain": [
" target name correlation importance\n",
" 0 pm2.5 feature_1_1 0.7269 0.18463717\n",
" 1 pm2.5 feature_1_2 0.7046 0.11726964\n",
" 2 pm2.5 feature_1_3 0.7158 0.08975971\n",
" 3 pm2.5 feature_1_4 0.6811 0.01235796\n",
" 4 pm2.5 feature_1_5 0.7363 0.27688485\n",
" ... ... ... ...\n",
"11 pm2.5 temp -0.2112 0.00403082\n",
"12 pm2.5 pres 0.0811 0.00672836\n",
"13 pm2.5 iws -0.2166 0.00111994\n",
"14 pm2.5 is 0.0045 0.00006808\n",
"15 pm2.5 ir -0.0541 0.00060757"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe1.features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.7 Studying features"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"```sql\n",
"DROP TABLE IF EXISTS \"FEATURE_1_5\";\n",
"\n",
"CREATE TABLE \"FEATURE_1_5\" AS\n",
"SELECT SUM( \n",
" CASE\n",
" WHEN ( t2.\"iws\" > 2.996864 ) AND ( t1.\"date\" - t2.\"date\" > 111439.618138 ) THEN COALESCE( t1.\"dewp\" - 1.514736211828887, 0.0 ) * 0.03601656954671504 + COALESCE( t1.\"temp\" - 11.89228926884439, 0.0 ) * -0.04267662247010789 + COALESCE( t1.\"is\" - 0.06612999800412481, 0.0 ) * -0.04725594429771846 + COALESCE( t1.\"ir\" - 0.2116958286208502, 0.0 ) * -0.06321612715292692 + COALESCE( t1.\"pres\" - 1016.466458208569, 0.0 ) * -0.01400320405937972 + COALESCE( t1.\"iws\" - 25.06146098064165, 0.0 ) * -0.001201539145976257 + COALESCE( t1.\"date\" - 1326377672.037789, 0.0 ) * -1.91747588977566e-07 + COALESCE( t2.\"dewp\" - 1.668379064426448, 0.0 ) * 0.001731759164268386 + COALESCE( t2.\"is\" - 0.06086063096820984, 0.0 ) * 0.06061734539826681 + COALESCE( t2.\"ir\" - 0.2111990813489665, 0.0 ) * -0.05403702019138458 + COALESCE( t2.\"temp\" - 12.06001450501632, 0.0 ) * -0.007673283821278228 + COALESCE( t2.\"pres\" - 1016.398404448205, 0.0 ) * 0.002389770661357365 + COALESCE( t2.\"iws\" - 25.00605463556588, 0.0 ) * 0.0006834212113433441 + COALESCE( t2.\"date\" - 1326341256.690439, 0.0 ) * 1.915110364308629e-07 + -3.8304206632990674e-02\n",
" WHEN ( t2.\"iws\" > 2.996864 ) AND ( t1.\"date\" - t2.\"date\" <= 111439.618138 OR t1.\"date\" IS NULL OR t2.\"date\" IS NULL ) THEN COALESCE( t1.\"dewp\" - 1.514736211828887, 0.0 ) * -0.09710953675504476 + COALESCE( t1.\"temp\" - 11.89228926884439, 0.0 ) * 0.1898035531135342 + COALESCE( t1.\"is\" - 0.06612999800412481, 0.0 ) * 0.2216653278054844 + COALESCE( t1.\"ir\" - 0.2116958286208502, 0.0 ) * 0.2229778643018364 + COALESCE( t1.\"pres\" - 1016.466458208569, 0.0 ) * 0.0137232972180141 + COALESCE( t1.\"iws\" - 25.06146098064165, 0.0 ) * 0.008757818778673529 + COALESCE( t1.\"date\" - 1326377672.037789, 0.0 ) * 3.828784685924325e-05 + COALESCE( t2.\"dewp\" - 1.668379064426448, 0.0 ) * 0.01390967975925731 + COALESCE( t2.\"is\" - 0.06086063096820984, 0.0 ) * -0.05439808805172083 + COALESCE( t2.\"ir\" - 0.2111990813489665, 0.0 ) * -0.1708184586046546 + COALESCE( t2.\"temp\" - 12.06001450501632, 0.0 ) * 0.04153068423706643 + COALESCE( t2.\"pres\" - 1016.398404448205, 0.0 ) * 0.06692208362422453 + COALESCE( t2.\"iws\" - 25.00605463556588, 0.0 ) * 0.0003737878687475554 + COALESCE( t2.\"date\" - 1326341256.690439, 0.0 ) * -3.82854279001483e-05 + -2.6193156234554085e+00\n",
" WHEN ( t2.\"iws\" <= 2.996864 OR t2.\"iws\" IS NULL ) AND ( t1.\"dewp\" > 11.000000 ) THEN COALESCE( t1.\"dewp\" - 1.514736211828887, 0.0 ) * 0.1393137945465734 + COALESCE( t1.\"temp\" - 11.89228926884439, 0.0 ) * -0.01214102681155058 + COALESCE( t1.\"is\" - 0.06612999800412481, 0.0 ) * -0.1338390047625948 + COALESCE( t1.\"ir\" - 0.2116958286208502, 0.0 ) * -0.0162258602121815 + COALESCE( t1.\"pres\" - 1016.466458208569, 0.0 ) * 0.001699285730186932 + COALESCE( t1.\"iws\" - 25.06146098064165, 0.0 ) * 0.003371800239128447 + COALESCE( t1.\"date\" - 1326377672.037789, 0.0 ) * -1.911896389633282e-07 + COALESCE( t2.\"dewp\" - 1.668379064426448, 0.0 ) * -0.06644334047600213 + COALESCE( t2.\"is\" - 0.06086063096820984, 0.0 ) * -0.1417763462234966 + COALESCE( t2.\"ir\" - 0.2111990813489665, 0.0 ) * -0.1305274026025503 + COALESCE( t2.\"temp\" - 12.06001450501632, 0.0 ) * -0.1337481445687078 + COALESCE( t2.\"pres\" - 1016.398404448205, 0.0 ) * -0.0159175033671927 + COALESCE( t2.\"iws\" - 25.00605463556588, 0.0 ) * 0.0526624167554332 + COALESCE( t2.\"date\" - 1326341256.690439, 0.0 ) * 1.855001999924372e-07 + 1.4765736620125673e+00\n",
" WHEN ( t2.\"iws\" <= 2.996864 OR t2.\"iws\" IS NULL ) AND ( t1.\"dewp\" <= 11.000000 OR t1.\"dewp\" IS NULL ) THEN COALESCE( t1.\"dewp\" - 1.514736211828887, 0.0 ) * 0.04638612658210784 + COALESCE( t1.\"temp\" - 11.89228926884439, 0.0 ) * -0.02616592034638174 + COALESCE( t1.\"is\" - 0.06612999800412481, 0.0 ) * -0.04279224385040904 + COALESCE( t1.\"ir\" - 0.2116958286208502, 0.0 ) * -0.02539472146735003 + COALESCE( t1.\"pres\" - 1016.466458208569, 0.0 ) * -0.0271755357448101 + COALESCE( t1.\"iws\" - 25.06146098064165, 0.0 ) * -0.002779614164530073 + COALESCE( t1.\"date\" - 1326377672.037789, 0.0 ) * -1.127852653091244e-06 + COALESCE( t2.\"dewp\" - 1.668379064426448, 0.0 ) * -0.009599325629289402 + COALESCE( t2.\"is\" - 0.06086063096820984, 0.0 ) * -0.2440324127160611 + COALESCE( t2.\"ir\" - 0.2111990813489665, 0.0 ) * -0.1198017640418239 + COALESCE( t2.\"temp\" - 12.06001450501632, 0.0 ) * -0.0723450470366292 + COALESCE( t2.\"pres\" - 1016.398404448205, 0.0 ) * -0.006849322627387147 + COALESCE( t2.\"iws\" - 25.00605463556588, 0.0 ) * -0.4129622526694541 + COALESCE( t2.\"date\" - 1326341256.690439, 0.0 ) * 1.129731320846992e-06 + -9.4331291663918275e+00\n",
" ELSE NULL\n",
" END\n",
") AS \"feature_1_5\",\n",
" t1.rowid AS rownum\n",
"FROM \"POPULATION__STAGING_TABLE_1\" t1\n",
"INNER JOIN \"POPULATION__STAGING_TABLE_2\" t2\n",
"ON 1 = 1\n",
"WHERE t2.\"date\" <= t1.\"date\"\n",
"AND ( t2.\"date__7_000000_days\" > t1.\"date\" OR t2.\"date__7_000000_days\" IS NULL )\n",
"GROUP BY t1.rowid;\n",
"```"
],
"text/plain": [
"'DROP TABLE IF EXISTS \"FEATURE_1_5\";\\n\\nCREATE TABLE \"FEATURE_1_5\" AS\\nSELECT SUM( \\n CASE\\n WHEN ( t2.\"iws\" > 2.996864 ) AND ( t1.\"date\" - t2.\"date\" > 111439.618138 ) THEN COALESCE( t1.\"dewp\" - 1.514736211828887, 0.0 ) * 0.03601656954671504 + COALESCE( t1.\"temp\" - 11.89228926884439, 0.0 ) * -0.04267662247010789 + COALESCE( t1.\"is\" - 0.06612999800412481, 0.0 ) * -0.04725594429771846 + COALESCE( t1.\"ir\" - 0.2116958286208502, 0.0 ) * -0.06321612715292692 + COALESCE( t1.\"pres\" - 1016.466458208569, 0.0 ) * -0.01400320405937972 + COALESCE( t1.\"iws\" - 25.06146098064165, 0.0 ) * -0.001201539145976257 + COALESCE( t1.\"date\" - 1326377672.037789, 0.0 ) * -1.91747588977566e-07 + COALESCE( t2.\"dewp\" - 1.668379064426448, 0.0 ) * 0.001731759164268386 + COALESCE( t2.\"is\" - 0.06086063096820984, 0.0 ) * 0.06061734539826681 + COALESCE( t2.\"ir\" - 0.2111990813489665, 0.0 ) * -0.05403702019138458 + COALESCE( t2.\"temp\" - 12.06001450501632, 0.0 ) * -0.007673283821278228 + COALESCE( t2.\"pres\" - 1016.398404448205, 0.0 ) * 0.002389770661357365 + COALESCE( t2.\"iws\" - 25.00605463556588, 0.0 ) * 0.0006834212113433441 + COALESCE( t2.\"date\" - 1326341256.690439, 0.0 ) * 1.915110364308629e-07 + -3.8304206632990674e-02\\n WHEN ( t2.\"iws\" > 2.996864 ) AND ( t1.\"date\" - t2.\"date\" <= 111439.618138 OR t1.\"date\" IS NULL OR t2.\"date\" IS NULL ) THEN COALESCE( t1.\"dewp\" - 1.514736211828887, 0.0 ) * -0.09710953675504476 + COALESCE( t1.\"temp\" - 11.89228926884439, 0.0 ) * 0.1898035531135342 + COALESCE( t1.\"is\" - 0.06612999800412481, 0.0 ) * 0.2216653278054844 + COALESCE( t1.\"ir\" - 0.2116958286208502, 0.0 ) * 0.2229778643018364 + COALESCE( t1.\"pres\" - 1016.466458208569, 0.0 ) * 0.0137232972180141 + COALESCE( t1.\"iws\" - 25.06146098064165, 0.0 ) * 0.008757818778673529 + COALESCE( t1.\"date\" - 1326377672.037789, 0.0 ) * 3.828784685924325e-05 + COALESCE( t2.\"dewp\" - 1.668379064426448, 0.0 ) * 0.01390967975925731 + COALESCE( t2.\"is\" - 
0.06086063096820984, 0.0 ) * -0.05439808805172083 + COALESCE( t2.\"ir\" - 0.2111990813489665, 0.0 ) * -0.1708184586046546 + COALESCE( t2.\"temp\" - 12.06001450501632, 0.0 ) * 0.04153068423706643 + COALESCE( t2.\"pres\" - 1016.398404448205, 0.0 ) * 0.06692208362422453 + COALESCE( t2.\"iws\" - 25.00605463556588, 0.0 ) * 0.0003737878687475554 + COALESCE( t2.\"date\" - 1326341256.690439, 0.0 ) * -3.82854279001483e-05 + -2.6193156234554085e+00\\n WHEN ( t2.\"iws\" <= 2.996864 OR t2.\"iws\" IS NULL ) AND ( t1.\"dewp\" > 11.000000 ) THEN COALESCE( t1.\"dewp\" - 1.514736211828887, 0.0 ) * 0.1393137945465734 + COALESCE( t1.\"temp\" - 11.89228926884439, 0.0 ) * -0.01214102681155058 + COALESCE( t1.\"is\" - 0.06612999800412481, 0.0 ) * -0.1338390047625948 + COALESCE( t1.\"ir\" - 0.2116958286208502, 0.0 ) * -0.0162258602121815 + COALESCE( t1.\"pres\" - 1016.466458208569, 0.0 ) * 0.001699285730186932 + COALESCE( t1.\"iws\" - 25.06146098064165, 0.0 ) * 0.003371800239128447 + COALESCE( t1.\"date\" - 1326377672.037789, 0.0 ) * -1.911896389633282e-07 + COALESCE( t2.\"dewp\" - 1.668379064426448, 0.0 ) * -0.06644334047600213 + COALESCE( t2.\"is\" - 0.06086063096820984, 0.0 ) * -0.1417763462234966 + COALESCE( t2.\"ir\" - 0.2111990813489665, 0.0 ) * -0.1305274026025503 + COALESCE( t2.\"temp\" - 12.06001450501632, 0.0 ) * -0.1337481445687078 + COALESCE( t2.\"pres\" - 1016.398404448205, 0.0 ) * -0.0159175033671927 + COALESCE( t2.\"iws\" - 25.00605463556588, 0.0 ) * 0.0526624167554332 + COALESCE( t2.\"date\" - 1326341256.690439, 0.0 ) * 1.855001999924372e-07 + 1.4765736620125673e+00\\n WHEN ( t2.\"iws\" <= 2.996864 OR t2.\"iws\" IS NULL ) AND ( t1.\"dewp\" <= 11.000000 OR t1.\"dewp\" IS NULL ) THEN COALESCE( t1.\"dewp\" - 1.514736211828887, 0.0 ) * 0.04638612658210784 + COALESCE( t1.\"temp\" - 11.89228926884439, 0.0 ) * -0.02616592034638174 + COALESCE( t1.\"is\" - 0.06612999800412481, 0.0 ) * -0.04279224385040904 + COALESCE( t1.\"ir\" - 0.2116958286208502, 0.0 ) * -0.02539472146735003 + 
COALESCE( t1.\"pres\" - 1016.466458208569, 0.0 ) * -0.0271755357448101 + COALESCE( t1.\"iws\" - 25.06146098064165, 0.0 ) * -0.002779614164530073 + COALESCE( t1.\"date\" - 1326377672.037789, 0.0 ) * -1.127852653091244e-06 + COALESCE( t2.\"dewp\" - 1.668379064426448, 0.0 ) * -0.009599325629289402 + COALESCE( t2.\"is\" - 0.06086063096820984, 0.0 ) * -0.2440324127160611 + COALESCE( t2.\"ir\" - 0.2111990813489665, 0.0 ) * -0.1198017640418239 + COALESCE( t2.\"temp\" - 12.06001450501632, 0.0 ) * -0.0723450470366292 + COALESCE( t2.\"pres\" - 1016.398404448205, 0.0 ) * -0.006849322627387147 + COALESCE( t2.\"iws\" - 25.00605463556588, 0.0 ) * -0.4129622526694541 + COALESCE( t2.\"date\" - 1326341256.690439, 0.0 ) * 1.129731320846992e-06 + -9.4331291663918275e+00\\n ELSE NULL\\n END\\n) AS \"feature_1_5\",\\n t1.rowid AS rownum\\nFROM \"POPULATION__STAGING_TABLE_1\" t1\\nINNER JOIN \"POPULATION__STAGING_TABLE_2\" t2\\nON 1 = 1\\nWHERE t2.\"date\" <= t1.\"date\"\\nAND ( t2.\"date__7_000000_days\" > t1.\"date\" OR t2.\"date__7_000000_days\" IS NULL )\\nGROUP BY t1.rowid;'"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe1.features.sort(by=\"importances\")[0].sql"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a typical [RelMT](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#relmt) feature, where the aggregation (`SUM` in this case) is applied conditionally to a set of linear models. Both the conditions and the weights of the linear models are learned by `RelMT`."
]
},
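{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "The logic of such a feature can be sketched in plain Python. The condition and weights below are invented for illustration; in the real feature above, they are learned by `RelMT`:\n",
  "\n",
  "```python\n",
  "# Sketch of a RelMT-style feature: a SUM aggregation over matched rows,\n",
  "# where each row's contribution comes from one of several linear models,\n",
  "# selected by learned conditions (all values here are made up).\n",
  "def relmt_feature(matched_rows):\n",
  "    total = 0.0\n",
  "    for row in matched_rows:\n",
  "        if row[\"iws\"] > 3.0:  # hypothetical learned condition\n",
  "            total += 0.04 * (row[\"dewp\"] - 1.5) - 0.01 * (row[\"temp\"] - 11.9)\n",
  "        else:  # hypothetical second learned branch\n",
  "            total += 0.14 * (row[\"dewp\"] - 1.5) + 0.05 * (row[\"iws\"] - 25.0)\n",
  "    return total\n",
  "```"
 ]
},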
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.8 Productionization\n",
"\n",
"It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's `sqlite3` module."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Creates a folder named air_pollution_pipeline containing the SQL code\n",
"pipe1.features.to_sql().save(\"air_pollution_pipeline\")"
]
},
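{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "The transpiled scripts can then be executed on any SQLite database that contains the staging tables. Below is a minimal sketch using Python's standard `sqlite3` module; it assumes the saved folder contains one `.sql` script per feature:\n",
  "\n",
  "```python\n",
  "import sqlite3\n",
  "from pathlib import Path\n",
  "\n",
  "def run_transpiled_sql(folder, db_path):\n",
  "    # Execute every transpiled .sql script against the target database.\n",
  "    conn = sqlite3.connect(db_path)\n",
  "    try:\n",
  "        for script in sorted(Path(folder).glob(\"*.sql\")):\n",
  "            conn.executescript(script.read_text())\n",
  "        conn.commit()\n",
  "    finally:\n",
  "        conn.close()\n",
  "\n",
  "# e.g. run_transpiled_sql(\"air_pollution_pipeline\", \"air_pollution.db\")\n",
  "```"
 ]
},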
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Discussion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have seen that getML outperforms tsfresh by more than 10 percentage points in terms of R-squared. We now want to analyze why that is.\n",
"\n",
"There are two possible hypotheses:\n",
"\n",
"- getML outperforms featuretools and tsfresh because it uses feature learning and is therefore able to produce more complex features.\n",
"- getML outperforms featuretools and tsfresh because it makes more efficient use of memory and is therefore able to look back further.\n",
"\n",
"Let's summarize our findings:"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>tool</th>\n",
"      <th>memory</th>\n",
"      <th>feature_complexity</th>\n",
"      <th>rsquared</th>\n",
"      <th>rmse</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>getML: RelMT</td>\n",
"      <td>7d</td>\n",
"      <td>complex</td>\n",
"      <td>63.1%</td>\n",
"      <td>57.5</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>getML: RelMT</td>\n",
"      <td>1d</td>\n",
"      <td>complex</td>\n",
"      <td>49.0%</td>\n",
"      <td>66.9</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>getML: FastProp</td>\n",
"      <td>7d</td>\n",
"      <td>simple</td>\n",
"      <td>56.2%</td>\n",
"      <td>62.6</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>getML: FastProp</td>\n",
"      <td>1d</td>\n",
"      <td>simple</td>\n",
"      <td>54.6%</td>\n",
"      <td>63.4</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>featuretools</td>\n",
"      <td>1d</td>\n",
"      <td>simple</td>\n",
"      <td>53.7%</td>\n",
"      <td>64.3</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>5</th>\n",
"      <td>tsfresh</td>\n",
"      <td>1d</td>\n",
"      <td>simple</td>\n",
"      <td>51.0%</td>\n",
"      <td>65.9</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" tool memory feature_complexity rsquared rmse\n",
"0 getML: RelMT 7d complex 63.1% 57.5\n",
"1 getML: RelMT 1d complex 49.0% 66.9\n",
"2 getML: FastProp 7d simple 56.2% 62.6\n",
"3 getML: FastProp 1d simple 54.6% 63.4\n",
"4 featuretools 1d simple 53.7% 64.3\n",
"5 tsfresh 1d simple 51.0% 65.9"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipes = [pipe1, pipe2, pipe3, pipe4, pipe5, pipe6]\n",
"\n",
"comparison = pd.DataFrame(\n",
" dict(\n",
" tool=[pipe.tags[0] for pipe in pipes],\n",
" memory=[pipe.tags[1].split()[1] for pipe in pipes],\n",
" feature_complexity=[pipe.tags[2].split()[0] for pipe in pipes],\n",
" rsquared=[f\"{pipe.rsquared:.1%}\" for pipe in pipes],\n",
" rmse=[f\"{pipe.rmse:.3}\" for pipe in pipes],\n",
" )\n",
")\n",
"\n",
"comparison"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The summary table shows that a combination of both hypotheses explains why getML outperforms featuretools and tsfresh: With a memory of one day, complex features do better than simple features. With a memory of seven days, simple features actually get worse, but combining a seven-day look-back with more complex features yields the best results.\n",
"\n",
"This suggests that getML outperforms featuretools and tsfresh because it makes more efficient use of memory and can thus look back further. And because RelMT uses feature learning and can build more complex features, it can make better use of that greater look-back window."
]
},
{
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"## 4. Conclusion\n",
"\n",
"We have compared getML's feature learning algorithms to the brute-force feature engineering approaches of featuretools and tsfresh on a dataset related to air pollution in Beijing, China. We found that getML significantly outperforms both. These results are consistent with the view that feature learning can yield significant improvements over simple propositionalization approaches.\n",
"\n",
"However, there are other datasets on which simple propositionalization performs well. Our suggestion is therefore to think of algorithms like `FastProp` and `RelMT` as tools in a toolbox. If a simple tool like `FastProp` gets the job done, then use that. But when you need more advanced approaches, like `RelMT`, you should have them at your disposal as well.\n",
"\n",
"You are encouraged to reproduce these results."
]
}
],
"metadata": {
"celltoolbar": "Tags",
"jupytext": {
"encoding": "# -*- coding: utf-8 -*-",
"formats": "ipynb,py:percent"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.18"
},
"toc": {
"base_numbering": 1
},
"vscode": {
"interpreter": {
"hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}