In [1]:
from tsfresh.feature_extraction import extract_features
from tsfresh.feature_extraction.settings import ComprehensiveFCParameters, MinimalFCParameters, EfficientFCParameters
from tsfresh.feature_extraction.settings import from_columns

import numpy as np
import pandas as pd


This notebook illustrates the `"fc_parameters"` and `"kind_to_fc_parameters"` dictionaries.

For a detailed explanation, see also http://tsfresh.readthedocs.io/en/latest/text/feature_extraction_settings.html

## Construct a time series container

We construct a time series container that includes two sensor time series, _"temperature"_ and _"pressure"_, for two devices, _"a"_ and _"b"_.

In [2]:
df = pd.DataFrame({"id": ["a", "a", "b", "b"], "temperature": [1,2,3,1], "pressure": [-1, 2, -1, 7]})
df

,id,pressure,temperature
0,a,-1,1
1,a,2,2
2,b,-1,3
3,b,7,1


## The default_fc_parameters

Which features tsfresh calculates is controlled by a dictionary called `fc_parameters`. It maps feature calculator names (the keys) to their parameters (the values).
The keys must be the names of functions in the `tsfresh.feature_extraction.feature_calculators` module.
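
To make the mapping concrete, here is a minimal pure-Python sketch of how such a dictionary could drive a computation. It is not how tsfresh is implemented: the `CALCULATORS` registry and the `apply_fc_parameters` helper are hypothetical names introduced only for illustration, and only parameterless calculators (value `None`) are handled.

```python
import statistics

# Hypothetical registry: calculator name -> function of a list of numbers.
# tsfresh resolves these names in its feature_calculators module instead.
CALCULATORS = {
    "length": len,
    "maximum": max,
    "minimum": min,
    "mean": statistics.mean,
    "sum_values": sum,
}


def apply_fc_parameters(values, fc_parameters):
    """Compute every feature named in fc_parameters for one series."""
    features = {}
    for name, params in fc_parameters.items():
        if params is not None:
            raise NotImplementedError("this sketch only handles None parameters")
        features[name] = CALCULATORS[name](values)
    return features


fc_parameters = {"maximum": None, "mean": None}
result = apply_fc_parameters([-1, 2], fc_parameters)
print(result)
```

Removing a key from the dictionary removes the corresponding feature from the result, which is the behaviour the rest of this notebook relies on.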

In the following, we load an example dictionary:

In [3]:
settings_minimal = MinimalFCParameters() # only a few basic features
settings_minimal

{'length': None,
 'maximum': None,
 'mean': None,
 'median': None,
 'minimum': None,
 'standard_deviation': None,
 'sum_values': None,
 'variance': None}

This dictionary can be passed to the `extract_features` method, resulting in a few basic time series features being calculated:

In [4]:
X_tsfresh = extract_features(df, column_id="id", default_fc_parameters = settings_minimal)
X_tsfresh.head()

Feature Extraction: 100%|██████████| 4/4 [00:00<00:00, 16336.14it/s]


id,pressure__length,pressure__maximum,pressure__mean,pressure__median,pressure__minimum,pressure__standard_deviation,pressure__sum_values,pressure__variance,temperature__length,temperature__maximum,temperature__mean,temperature__median,temperature__minimum,temperature__standard_deviation,temperature__sum_values,temperature__variance
a,2.0,2.0,0.5,0.5,-1.0,1.5,1.0,2.25,2.0,2.0,1.5,1.5,1.0,0.5,3.0,0.25
b,2.0,7.0,3.0,3.0,-1.0,4.0,6.0,16.0,2.0,3.0,2.0,2.0,1.0,1.0,4.0,1.0


By passing `settings_minimal` as the value of the `default_fc_parameters` parameter, these settings are used for all kinds of time series. In this case, the `settings_minimal` dictionary is used for both the _"temperature"_ and the _"pressure"_ time series.
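
As a sanity check, the _"pressure"_ values for device _"a"_ in the table can be reproduced with the standard library. The sketch assumes the population definitions of standard deviation and variance (dividing by n rather than n - 1), which is what the values 1.5 and 2.25 in the table are consistent with.

```python
import statistics

pressure_a = [-1, 2]  # pressure readings of device "a" from the container

features = {
    "length": len(pressure_a),
    "maximum": max(pressure_a),
    "mean": statistics.mean(pressure_a),
    "median": statistics.median(pressure_a),
    "minimum": min(pressure_a),
    "standard_deviation": statistics.pstdev(pressure_a),  # population std
    "sum_values": sum(pressure_a),
    "variance": statistics.pvariance(pressure_a),  # population variance
}
print(features)
```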

Now, let's say we want to remove the length feature and prevent it from being calculated. We just delete it from the dictionary.

In [5]:
del settings_minimal["length"]
settings_minimal

{'maximum': None,
 'mean': None,
 'median': None,
 'minimum': None,
 'standard_deviation': None,
 'sum_values': None,
 'variance': None}

Now, if we extract features with this reduced dictionary, the length feature will not be calculated.

In [6]:
X_tsfresh = extract_features(df, column_id="id", default_fc_parameters = settings_minimal)
X_tsfresh.head()

Feature Extraction: 100%|██████████| 4/4 [00:00<00:00, 1171.27it/s]


id,pressure__maximum,pressure__mean,pressure__median,pressure__minimum,pressure__standard_deviation,pressure__sum_values,pressure__variance,temperature__maximum,temperature__mean,temperature__median,temperature__minimum,temperature__standard_deviation,temperature__sum_values,temperature__variance
a,2.0,0.5,0.5,-1.0,1.5,1.0,2.25,2.0,1.5,1.5,1.0,0.5,3.0,0.25
b,7.0,3.0,3.0,-1.0,4.0,6.0,16.0,3.0,2.0,2.0,1.0,1.0,4.0,1.0


## The kind_to_fc_parameters

Now, let's say we do not want to calculate the same features for both kinds of time series. Instead, there should be a different set of features for each kind.

To do that, we can use the `kind_to_fc_parameters` parameter, which lets us specify exactly which `fc_parameters` we want to use for which kind of time series:

In [7]:
fc_parameters_pressure = {"length": None, 
                          "sum_values": None}

fc_parameters_temperature = {"maximum": None, 
                             "minimum": None}

kind_to_fc_parameters = {
    "temperature": fc_parameters_temperature,
    "pressure": fc_parameters_pressure
}

print(kind_to_fc_parameters)

{'pressure': {'length': None, 'sum_values': None}, 'temperature': {'minimum': None, 'maximum': None}}


So, in this case, the `length` and `sum_values` features are calculated for the _"pressure"_ sensor. For the _"temperature"_ signal, the `maximum` and `minimum` features are extracted instead.

In [8]:
X_tsfresh = extract_features(df, column_id="id", kind_to_fc_parameters = kind_to_fc_parameters)
X_tsfresh.head()

Feature Extraction: 100%|██████████| 4/4 [00:00<00:00, 1473.37it/s]


id,pressure__length,pressure__sum_values,temperature__maximum,temperature__minimum
a,2.0,1.0,2.0,1.0
b,2.0,6.0,3.0,1.0


So, let's say we lost the `kind_to_fc_parameters` dictionary. Or we applied a feature selection algorithm that dropped
irrelevant feature columns, so our extraction settings now contain features we no longer need.

In both cases, we can use the provided `from_columns` function to infer the settings dictionary from
the dataframe containing the features:

In [9]:
recovered_settings = from_columns(X_tsfresh)
recovered_settings

{'pressure': {'length': None, 'sum_values': None},
 'temperature': {'maximum': None, 'minimum': None}}
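
The idea behind this recovery can be sketched in a few lines of plain Python, assuming column names of the form `<kind>__<feature>` as in the table above. The helper name `settings_from_names` is hypothetical, and the sketch ignores feature parameters, which the real `from_columns` also handles.

```python
def settings_from_names(column_names):
    """Group '<kind>__<feature>' names into a kind_to_fc_parameters-style dict."""
    settings = {}
    for column in column_names:
        # Split only at the first double underscore: kind, then feature name.
        kind, feature = column.split("__", 1)
        settings.setdefault(kind, {})[feature] = None
    return settings


columns = [
    "pressure__length",
    "pressure__sum_values",
    "temperature__maximum",
    "temperature__minimum",
]
recovered = settings_from_names(columns)
print(recovered)
```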

Let's drop a column to show that the inferred settings dictionary really changes:

In [10]:
X_tsfresh.iloc[:, 1:]

id,pressure__sum_values,temperature__maximum,temperature__minimum
a,1.0,2.0,1.0
b,6.0,3.0,1.0


In [11]:
recovered_settings = from_columns(X_tsfresh.iloc[:, 1:])
recovered_settings

{'pressure': {'sum_values': None},
 'temperature': {'maximum': None, 'minimum': None}}

## More complex dictionaries

We provide predefined `fc_parameters` dictionaries with larger sets of features.

The `EfficientFCParameters` contain features and parameters that should be fast to calculate:

In [12]:
settings_efficient = EfficientFCParameters()
settings_efficient

{'abs_energy': None,
 'absolute_sum_of_changes': None,
 'agg_autocorrelation': [{'f_agg': 'mean'},
  {'f_agg': 'median'},
  {'f_agg': 'var'}],
 'agg_linear_trend': [{'attr': 'rvalue', 'chunk_len': 5, 'f_agg': 'max'},
  {'attr': 'rvalue', 'chunk_len': 5, 'f_agg': 'min'},
  {'attr': 'rvalue', 'chunk_len': 5, 'f_agg': 'mean'},
  {'attr': 'rvalue', 'chunk_len': 5, 'f_agg': 'var'},
  {'attr': 'rvalue', 'chunk_len': 10, 'f_agg': 'max'},
  {'attr': 'rvalue', 'chunk_len': 10, 'f_agg': 'min'},
  {'attr': 'rvalue', 'chunk_len': 10, 'f_agg': 'mean'},
  {'attr': 'rvalue', 'chunk_len': 10, 'f_agg': 'var'},
  {'attr': 'rvalue', 'chunk_len': 50, 'f_agg': 'max'},
  {'attr': 'rvalue', 'chunk_len': 50, 'f_agg': 'min'},
  {'attr': 'rvalue', 'chunk_len': 50, 'f_agg': 'mean'},
  {'attr': 'rvalue', 'chunk_len': 50, 'f_agg': 'var'},
  {'attr': 'intercept', 'chunk_len': 5, 'f_agg': 'max'},
  {'attr': 'intercept', 'chunk_len': 5, 'f_agg': 'min'},
  {'attr': 'intercept', 'chunk_len': 5, 'f_agg': 'mean'},
  {'at

The `ComprehensiveFCParameters` are the biggest set of features and take the longest to calculate:

In [13]:
settings_comprehensive = ComprehensiveFCParameters()
settings_comprehensive

{'abs_energy': None,
 'absolute_sum_of_changes': None,
 'agg_autocorrelation': [{'f_agg': 'mean'},
  {'f_agg': 'median'},
  {'f_agg': 'var'}],
 'agg_linear_trend': [{'attr': 'rvalue', 'chunk_len': 5, 'f_agg': 'max'},
  {'attr': 'rvalue', 'chunk_len': 5, 'f_agg': 'min'},
  {'attr': 'rvalue', 'chunk_len': 5, 'f_agg': 'mean'},
  {'attr': 'rvalue', 'chunk_len': 5, 'f_agg': 'var'},
  {'attr': 'rvalue', 'chunk_len': 10, 'f_agg': 'max'},
  {'attr': 'rvalue', 'chunk_len': 10, 'f_agg': 'min'},
  {'attr': 'rvalue', 'chunk_len': 10, 'f_agg': 'mean'},
  {'attr': 'rvalue', 'chunk_len': 10, 'f_agg': 'var'},
  {'attr': 'rvalue', 'chunk_len': 50, 'f_agg': 'max'},
  {'attr': 'rvalue', 'chunk_len': 50, 'f_agg': 'min'},
  {'attr': 'rvalue', 'chunk_len': 50, 'f_agg': 'mean'},
  {'attr': 'rvalue', 'chunk_len': 50, 'f_agg': 'var'},
  {'attr': 'intercept', 'chunk_len': 5, 'f_agg': 'max'},
  {'attr': 'intercept', 'chunk_len': 5, 'f_agg': 'min'},
  {'attr': 'intercept', 'chunk_len': 5, 'f_agg': 'mean'},
  {'at

You see those parameters as values in the `fc_parameters` dictionary? Those are the parameters of the feature calculators.

In detail, a value in an `fc_parameters` dictionary can be a list of dictionaries. Every dictionary in that list produces one feature.

So, for example

In [14]:
settings_comprehensive['large_standard_deviation']

[{'r': 0.05},
 {'r': 0.1},
 {'r': 0.15000000000000002},
 {'r': 0.2},
 {'r': 0.25},
 {'r': 0.30000000000000004},
 {'r': 0.35000000000000003},
 {'r': 0.4},
 {'r': 0.45},
 {'r': 0.5},
 {'r': 0.55},
 {'r': 0.6000000000000001},
 {'r': 0.65},
 {'r': 0.7000000000000001},
 {'r': 0.75},
 {'r': 0.8},
 {'r': 0.8500000000000001},
 {'r': 0.9},
 {'r': 0.9500000000000001}]

would trigger the calculation of 19 different `large_standard_deviation` features, one for r=0.05, one for r=0.10, and so on up to r=0.95. Let's just take them and extract some features:

In [15]:
settings_value_count = {'large_standard_deviation': settings_comprehensive['large_standard_deviation']}
settings_value_count

{'large_standard_deviation': [{'r': 0.05},
  {'r': 0.1},
  {'r': 0.15000000000000002},
  {'r': 0.2},
  {'r': 0.25},
  {'r': 0.30000000000000004},
  {'r': 0.35000000000000003},
  {'r': 0.4},
  {'r': 0.45},
  {'r': 0.5},
  {'r': 0.55},
  {'r': 0.6000000000000001},
  {'r': 0.65},
  {'r': 0.7000000000000001},
  {'r': 0.75},
  {'r': 0.8},
  {'r': 0.8500000000000001},
  {'r': 0.9},
  {'r': 0.9500000000000001}]}

In [16]:
X_tsfresh = extract_features(df, column_id="id", default_fc_parameters=settings_value_count)
X_tsfresh.head()

Feature Extraction: 100%|██████████| 4/4 [00:00<00:00, 772.36it/s]


id,pressure__large_standard_deviation__r_0.05,pressure__large_standard_deviation__r_0.1,pressure__large_standard_deviation__r_0.15,pressure__large_standard_deviation__r_0.2,pressure__large_standard_deviation__r_0.25,pressure__large_standard_deviation__r_0.3,pressure__large_standard_deviation__r_0.35,pressure__large_standard_deviation__r_0.4,pressure__large_standard_deviation__r_0.45,pressure__large_standard_deviation__r_0.5,...,temperature__large_standard_deviation__r_0.5,temperature__large_standard_deviation__r_0.55,temperature__large_standard_deviation__r_0.6,temperature__large_standard_deviation__r_0.65,temperature__large_standard_deviation__r_0.7,temperature__large_standard_deviation__r_0.75,temperature__large_standard_deviation__r_0.8,temperature__large_standard_deviation__r_0.85,temperature__large_standard_deviation__r_0.9,temperature__large_standard_deviation__r_0.95
a,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
b,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
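
The pattern of ones and zeros in this table can be checked by hand. The sketch below assumes that `large_standard_deviation` tests whether the standard deviation of a series exceeds `r` times its range (maximum minus minimum), and that the population standard deviation is used. The function name `large_std` is introduced here only for illustration.

```python
import statistics


def large_std(values, r):
    """True if the population std exceeds r times the value range (assumed definition)."""
    return statistics.pstdev(values) > r * (max(values) - min(values))


pressure_a = [-1, 2]  # std = 1.5, range = 3
r_values = [i * 0.05 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
flags = [float(large_std(pressure_a, r)) for r in r_values]
print(flags)
```

With a standard deviation of 1.5 and a range of 3, the test holds only for r below 0.5, which matches the _"pressure"_ columns up to r=0.45 being 1.0 and the column for r=0.5 being 0.0.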


The nice thing is that the parameters are encoded in the feature name, so it is possible to reconstruct
how the feature was calculated.

In [17]:
from_columns(X_tsfresh)

{'pressure': {'large_standard_deviation': [{'r': 0.05},
   {'r': 0.1},
   {'r': 0.15},
   {'r': 0.2},
   {'r': 0.25},
   {'r': 0.3},
   {'r': 0.35},
   {'r': 0.4},
   {'r': 0.45},
   {'r': 0.5},
   {'r': 0.55},
   {'r': 0.6},
   {'r': 0.65},
   {'r': 0.7},
   {'r': 0.75},
   {'r': 0.8},
   {'r': 0.85},
   {'r': 0.9},
   {'r': 0.95}]},
 'temperature': {'large_standard_deviation': [{'r': 0.05},
   {'r': 0.1},
   {'r': 0.15},
   {'r': 0.2},
   {'r': 0.25},
   {'r': 0.3},
   {'r': 0.35},
   {'r': 0.4},
   {'r': 0.45},
   {'r': 0.5},
   {'r': 0.55},
   {'r': 0.6},
   {'r': 0.65},
   {'r': 0.7},
   {'r': 0.75},
   {'r': 0.8},
   {'r': 0.85},
   {'r': 0.9},
   {'r': 0.95}]}}

This means that you should never change a column name. Otherwise, the information about how the feature was calculated can get lost.
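
A small sketch illustrates why. It assumes names of the form `<kind>__<feature>__<param>_<value>` with at most one numeric parameter, as in the columns above. `parse_feature_name` is a hypothetical helper, not part of tsfresh, and the renamed column `pressure_sd_5pct` is made up for the example.

```python
def parse_feature_name(column):
    """Split '<kind>__<feature>[__<param>_<value>]' into its parts.

    Only a single numeric parameter is supported in this sketch.
    """
    parts = column.split("__")
    if len(parts) < 2:
        raise ValueError(f"cannot recover settings from {column!r}")
    kind, feature = parts[0], parts[1]
    params = None
    if len(parts) > 2:
        # 'r_0.05' -> parameter name 'r', value 0.05
        name, _, value = parts[2].rpartition("_")
        params = {name: float(value)}
    return kind, feature, params


parsed = parse_feature_name("pressure__large_standard_deviation__r_0.05")
print(parsed)

try:
    parse_feature_name("pressure_sd_5pct")  # a renamed column
    renamed_ok = True
except ValueError:
    renamed_ok = False
print(renamed_ok)
```

Once the double-underscore structure is gone, neither the calculator name nor its parameters can be read back from the column.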