glmm_encoder

Converts categorical values into numerical values using GLMM (generalized linear mixed model) encoding.

Usage

This calculator allows you to transform categorical values into numerical ones using generalized linear mixed models encoding.

Note that GLMM needs enough data to train the underlying model: the target column must contain at least 3 unique values. When that is not the case, this calculator raises a warning and fills the feature with -1.
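The fallback rule in the note above can be pictured with a short sketch. It is an illustration only, not the calculator's implementation: the function name `encode_or_fallback` and the `fit_encoder` callback are hypothetical, and only the rule itself (fewer than 3 unique target values leads to a warning and a column of -1) comes from this page.

```python
import warnings

MIN_UNIQUE_TARGET_VALUES = 3  # threshold stated on this page
FALLBACK_VALUE = -1           # fill value stated on this page


def encode_or_fallback(categories, target, fit_encoder):
    """Return one encoded value per row, or -1 everywhere if the target is too poor.

    `fit_encoder` is a hypothetical callable standing in for the real GLMM fit.
    It receives (categories, target) and returns a {category: value} mapping.
    """
    if len(set(target)) < MIN_UNIQUE_TARGET_VALUES:
        warnings.warn(
            "Not enough unique target values for GLMM encoding; filling with -1."
        )
        return [FALLBACK_VALUE] * len(categories)
    mapping = fit_encoder(categories, target)
    return [mapping[category] for category in categories]
```

With a target such as `[1, 1, 2]` (only two unique values), every row receives -1 and a warning is emitted, whatever the stand-in encoder would have returned.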

It is not necessary to use the glmm_encoder when the algorithm is lightgbm, because this algorithm handles categorical features natively.

This calculator can be used with the following method:

glmm_encoder

Examples:

  • Transform a product family column containing 10 different categorical values into a numerical column.

  • This calculator is often used on columns containing many different categorical values, to avoid one_hot_encoding and the large number of features it would create.


Main Parameters

The bold options represent the default values when the parameters are optional.

  • input_columns List of columns used as input of the calculator: the column you want to GLMM encode and the target (for example, the column to be forecast).

  • output_columns List of columns added by the calculator: the name of the column that will hold the result of the GLMM encoding of the categorical column specified in input_columns.

  • global (true, false) Whether this calculator should be run before the data is split for cross-validation during training.

  • steps [optional] (training, prediction, postprocessing) List of pipeline steps in which the columns from this calculator are added to the data. Note that when the training option is listed, the calculator is actually added during preprocessing.

  • store_in_model [optional] (true, false) Whether the columns calculated by this calculator should be stored in the model, to avoid recalculating them during prediction. This is only relevant if the calculated columns are added to both training and prediction. If this parameter is omitted, the values are not stored in the model. The following parameters only make sense if this parameter is set to true.

  • stored_columns [required if store_in_model is true] List indicating the columns to be stored among the output_columns.

  • stored_keys [required if store_in_model is true] List of columns used as join keys to match the stored values to the data at prediction time (logically, they should be chosen from the input_columns).
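To make the roles of store_in_model, stored_columns and stored_keys more concrete, here is a minimal sketch of a store-then-join mechanism. It is an assumption about the general idea, not the product's actual code: the helper names `build_store` and `apply_store` are hypothetical, and the handling of keys never seen during training (filled with `None` here) is not specified on this page.

```python
def build_store(rows, stored_keys, stored_columns):
    """Keep one entry per key combination, holding only the stored columns."""
    store = {}
    for row in rows:
        key = tuple(row[k] for k in stored_keys)
        store[key] = {column: row[column] for column in stored_columns}
    return store


def apply_store(rows, store, stored_keys, stored_columns, missing=None):
    """Join stored values onto prediction rows.

    Keys that were not seen at training time receive `missing`
    (an assumption made for this sketch).
    """
    joined = []
    for row in rows:
        key = tuple(row[k] for k in stored_keys)
        values = store.get(key, {column: missing for column in stored_columns})
        joined.append({**row, **values})
    return joined


# Training rows already carry the encoded column.
training_rows = [
    {"concat_famid": "fam_1", "concat_famid_glmm": 30.68},
    {"concat_famid": "fam_2", "concat_famid_glmm": 12.82},
]
store = build_store(training_rows, ["concat_famid"], ["concat_famid_glmm"])

# At prediction time the encoded value is looked up instead of being recomputed.
prediction_rows = apply_store(
    [{"concat_famid": "fam_2"}], store, ["concat_famid"], ["concat_famid_glmm"]
)
```

The point of the design is that prediction reuses the values learned during training, which is why stored_keys should identify a stored row unambiguously.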


Specific Parameters

  • values

    List of the categorical columns to encode.

  • target

    The name of the column to use as target. The Generalized Linear Mixed Model is trained based on the values of this column.
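For intuition about how values and target interact, the sketch below encodes each category as its target mean shrunk toward the global mean. This is a deliberately simplified partial-pooling estimate, not an actual GLMM fit and not the calculator's implementation: the function name `partial_pooling_encode` and the `prior_weight` parameter are hypothetical, and a real mixed model would estimate the amount of shrinkage from the data instead of taking it as an argument. The numbers it produces are not expected to match the calculator's output.

```python
from collections import defaultdict


def partial_pooling_encode(categories, target, prior_weight=1.0):
    """Encode each category as its target mean shrunk toward the global mean.

    `prior_weight` is a hypothetical knob: the ratio of within-category to
    between-category variance, which a real mixed model would estimate.
    """
    global_mean = sum(target) / len(target)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, value in zip(categories, target):
        sums[category] += value
        counts[category] += 1

    encoding = {}
    for category, total in sums.items():
        n = counts[category]
        weight = n / (n + prior_weight)  # more rows: trust the category mean more
        category_mean = total / n
        encoding[category] = global_mean + weight * (category_mean - global_mean)
    return encoding


encoding = partial_pooling_encode(
    ["fam_1", "fam_1", "fam_2", "fam_3"],  # values column
    [50, 43, 18, 9],                       # target column
)
```

Categories with few rows are pulled more strongly toward the global mean, which is the behaviour that makes this family of encodings useful for columns with many rare categories.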


Examples

  1. A given dataset contains sales data (qty_sold) for several products. These products are characterized by a family (concat_famid), which is a categorical column containing several values. The user wants to transform this categorical column into a numerical column.

    calculated_cols:
      glmm_encoded:
        method: glmm_encoder
        input_columns:
        - concat_famid
        - qty_sold
        output_columns:
        - concat_famid_glmm
        params:
          values:
          - concat_famid
          target: qty_sold
        store_in_model: True
        stored_keys:
        - concat_famid
        stored_columns:
        - concat_famid_glmm

    Example of output dataset:

    item_id  receipt_date  qty_sold  concat_famid  concat_famid_glmm
    877988   2024-01-01    50        fam_1         30.68
    556764   2024-01-01    43        fam_1         30.68
    321132   2024-01-01    18        fam_2         12.82
    121453   2024-01-01    9         fam_3         5.14
