tsfresh

Generates hundreds of time series features.

Usage

This calculator automatically computes a large number of time series characteristics. The exact number of features added depends on the data and on the value of fc_parameters. tsfresh is used for systematic feature engineering from time series and other sequential data, which have in common that they are ordered by an independent variable.

This calculator can be used with the following method:

tsfresh

Examples:

  • You want to compute characteristics such as the maximum, minimum, or average quantity. Without tsfresh, you would have to compute each of them manually; tsfresh automates this process and returns all of these features at once.

  • The features extracted by tsfresh can also be used to cluster time series.

Recommendations:

  • For more information on how tsfresh works, you can read its documentation.

  • Apply a PCA after tsfresh to reduce the number of dimensions. Otherwise the model may receive too many input features and take a very long time to run.


Main Parameters

The bold options represent the default values when the parameters are optional.

  • input_columns List of columns used as input to the calculator, that is, the columns used to fill the output columns.

  • output_columns_prefix Prefix of the columns added when the output columns cannot be listed individually. Use it here, as this calculator adds several columns.

  • global (true, false) Whether this calculator should be run before the data is split for cross-validation during training.

  • steps [optional] (training, prediction, postprocessing) List of pipeline steps in which the columns from this calculator are added to the data. Note that when the training option is listed, the calculator is actually added during preprocessing.

  • store_in_model [optional] (true, false) Whether the columns calculated by this calculator should be stored in the model, to avoid recalculating them during prediction. This is only relevant if the calculated columns are added to both training and prediction. If this parameter is omitted, the values are not stored in the model. The following parameters only make sense if this parameter is set to true.

  • stored_columns [required if store_in_model is true] List of the columns to store, chosen among the output columns.

  • stored_keys [required if store_in_model is true] List of the columns used as keys to match the stored values to the data at prediction time (logically, they should be chosen among the input_columns).
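
The key-based lookup described for store_in_model can be pictured with a minimal sketch in plain Python. This is an illustration under assumptions, not the product's implementation: the function name `join_stored_values`, the in-memory dictionary used as storage, and the column names `ts_mean` and `ts_maximum` are all hypothetical. The handling of `default_value` is assumed from its appearance in the example configuration and is not documented in this section.

```python
def join_stored_values(rows, stored, stored_keys, stored_columns, default_value=None):
    """Attach stored columns to each row by matching on stored_keys.

    Hypothetical helper: `stored` maps a tuple of key values to the
    column values saved at training time.
    """
    default_value = default_value or {}
    joined = []
    for row in rows:
        # Build the lookup key from the key columns, in the configured order.
        key = tuple(row[k] for k in stored_keys)
        values = stored.get(key, {})
        enriched = dict(row)  # copy so the input rows are left untouched
        for col in stored_columns:
            # Assumption: fall back to the configured default when the key is unknown.
            enriched[col] = values.get(col, default_value.get(col))
        joined.append(enriched)
    return joined


# Values assumed to have been computed and stored during training.
stored = {("A", 1): {"ts_mean": 4.0, "ts_maximum": 6}}

prediction_rows = [
    {"locid": "A", "itemid": 1, "date": "2024-01-04"},
    {"locid": "C", "itemid": 9, "date": "2024-01-04"},  # key never seen in training
]

result = join_stored_values(
    prediction_rows,
    stored,
    stored_keys=["locid", "itemid"],
    stored_columns=["ts_mean", "ts_maximum"],
    default_value={"ts_mean": 0, "ts_maximum": 0},
)
```

In this sketch the first row receives the stored values for ("A", 1), while the second row, whose key was not stored, receives the defaults. It also shows why the stored_keys must match the names of the key columns exactly.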


Specific Parameters

  • column_value Column analyzed

  • column_sort Column used for sorting, generally date

  • group_by List of columns used as the time series resolution. The time series resolution identifies the individual time series in the dataset, based on categories or ids. It is unrelated to the time resolution of each series.

  • fc_parameters (comprehensive, efficient, minimal)

    The subset of features to compute with tsfresh. Defaults to comprehensive.
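
To make the roles of column_value, column_sort, and group_by concrete, here is a minimal sketch in plain Python. It does not call tsfresh and is not the calculator's implementation: the function name `summarize_series` and the sample rows are hypothetical, and the feature names only loosely mirror a handful of the statistics tsfresh computes.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def summarize_series(rows, column_value, column_sort, group_by):
    """Split rows into one series per group_by key, sort each series,
    and compute a few summary statistics on column_value."""
    series = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_by)
        series[key].append(row)

    features = {}
    for key, group in series.items():
        # Order each series by the sort column (typically a date).
        ordered = sorted(group, key=lambda r: r[column_sort])
        values = [r[column_value] for r in ordered]
        features[key] = {
            "length": len(values),
            "sum_values": sum(values),
            "mean": mean(values),
            "median": median(values),
            "standard_deviation": pstdev(values),  # population standard deviation
            "minimum": min(values),
            "maximum": max(values),
            # Order-dependent: sum of absolute differences between consecutive values.
            "absolute_sum_of_changes": sum(
                abs(b - a) for a, b in zip(values, values[1:])
            ),
        }
    return features


# Hypothetical sample data; the rows are deliberately not in date order.
rows = [
    {"locid": "A", "itemid": 1, "date": "2024-01-02", "qty_sold": 4},
    {"locid": "A", "itemid": 1, "date": "2024-01-01", "qty_sold": 2},
    {"locid": "A", "itemid": 1, "date": "2024-01-03", "qty_sold": 6},
    {"locid": "B", "itemid": 1, "date": "2024-01-01", "qty_sold": 5},
    {"locid": "B", "itemid": 1, "date": "2024-01-02", "qty_sold": 5},
]

features = summarize_series(
    rows, column_value="qty_sold", column_sort="date", group_by=["locid", "itemid"]
)
```

Here group_by yields two series, ("A", 1) and ("B", 1), each summarized independently. Sorting has no effect on statistics such as the mean, but it does change order-dependent features such as absolute_sum_of_changes, which is why column_sort is needed.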


Examples

  1. Consider a dataset containing daily sales data (qty_sold) for multiple products identified by a key (itemid) across multiple stores (locid). The user wants to generate many time series characteristics and then use them as features in a forecasting model.

    First, the tsfresh calculator generates the time series features. Then a PCA reduces the dimensions (in this example, 8 components are kept), and the PCA results are used as features.

    calculated_cols:
      tsfresh_per_locid_item_id:
        method: tsfresh
        input_columns:
        - locid
        - itemid
        - qty_sold
        - date
        output_columns_prefix: ts
        store_in_model: true
        stored_keys:
          - locid
          - itemid
        stored_columns:
          - ts.*
        params:
          column_value: qty_sold
          column_sort: date
          fc_parameters: minimal
          group_by:
          - locid
          - itemid
    
      compute_pca:
        method: pca
        input_columns:
          - ts.*
        output_columns_prefix: pca_tsf
        store_in_model: true
        stored_keys:
          - locid
          - itemid
        stored_columns:
          - pca_tsf.*
        params:
          n_components: 8
        default_value:
          pca_tsf_0: 0
          pca_tsf_1: 0
          pca_tsf_2: 0
          pca_tsf_3: 0
          pca_tsf_4: 0
          pca_tsf_5: 0
          pca_tsf_6: 0
          pca_tsf_7: 0
    
    features:
      numerical_columns:
        - pca_tsf.*
