case_na

Fills NaN values according to hierarchy.

Usage

This calculator creates a new column that starts as a copy of the first column in the priority list, then fills the NaN values of that column using the next columns in the priority list, in order.

This calculator can be used with the following method:

case_na

Examples:

  • Fill a price column containing nulls with a constant value.

  • Fill an average column per product containing nulls with less precise averages (sub_family, family), respecting an order of priority.


Main Parameters

The bold options represent the default values when the parameters are optional.

  • input_columns (list of columns used as input of the calculator): The list of columns that will be used to fill the output column.

  • output_columns (list of columns added by the calculator): Name of the filled column added to the dataset.

  • global (true, false) Whether this calculator should be run before the data is split for cross-validation during training.

  • steps [optional] (training, prediction, postprocessing) List of the pipeline steps in which the columns from this calculator are added to the data. Note that when the training option is listed, the calculator is actually added during preprocessing.

  • store_in_model [optional] (true, false) Whether the columns computed by this calculator should be stored in the model, to avoid recalculating them during prediction. This is only relevant if the calculated columns are added to both training and prediction. If this parameter is omitted, the values are not stored in the model. The following parameters only make sense if this parameter is set to true.

  • stored_columns [required if store_in_model is true] List indicating the columns to be stored among the output_columns.

  • stored_keys [required if store_in_model is true] List of the columns used to match the stored values to the prediction data when they are joined (logically, they are to be chosen from the input_columns).
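The role of stored_columns and stored_keys can be pictured as a key-based lookup. The following is only a minimal sketch of that idea, not the product's implementation: the helper names (store_values, join_stored), the column names (item_id, filled_value), and the choice to return None for unmatched keys are all assumptions made for illustration.

```python
def store_values(training_rows, stored_keys, stored_columns):
    """Map each combination of key values to the stored column values (hypothetical helper)."""
    stored = {}
    for row in training_rows:
        key = tuple(row[k] for k in stored_keys)
        stored[key] = {c: row[c] for c in stored_columns}
    return stored


def join_stored(prediction_rows, stored, stored_keys, stored_columns):
    """Attach stored values to prediction rows; unmatched keys get None (an assumption)."""
    joined = []
    for row in prediction_rows:
        key = tuple(row[k] for k in stored_keys)
        values = stored.get(key, {})
        joined.append({**row, **{c: values.get(c) for c in stored_columns}})
    return joined


# Hypothetical data: item_id plays the role of a stored key,
# filled_value the role of a stored output column.
training = [
    {"item_id": "A", "filled_value": 12.0},
    {"item_id": "B", "filled_value": 10.0},
]
stored = store_values(training, ["item_id"], ["filled_value"])
prediction = join_stored(
    [{"item_id": "B"}, {"item_id": "C"}], stored, ["item_id"], ["filled_value"]
)
```

In this sketch, the row with item_id B receives the value stored during training, while item_id C, which was never seen, has no stored value to join.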


Specific Parameters

  • priority

    Priority order to follow for substitution.

    NaN or NULL values are replaced by the first non-null value found among the columns of this list, taken in order.
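The substitution rule can be sketched in plain Python. This is an illustration of the behavior described above under stated assumptions, not the calculator's actual code: the function names (is_missing, fill_by_priority) are hypothetical, rows are modeled as dictionaries, and returning None when every column is missing is an assumption, since the source does not say what happens in that case.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_by_priority(row, priority):
    """Return the first non-missing value of `row` among the `priority` columns."""
    for column in priority:
        value = row.get(column)
        if not is_missing(value):
            return value
    return None  # assumption: every column in the priority list was missing


priority = [
    "average_per_item_id",
    "average_per_subfamily",
    "average_per_family",
    "global_average",
]
nan = float("nan")
rows = [
    {"average_per_item_id": 12.0, "average_per_subfamily": 10.0,
     "average_per_family": 9.0, "global_average": 8.0},
    {"average_per_item_id": nan, "average_per_subfamily": 10.0,
     "average_per_family": 9.0, "global_average": 8.0},
    {"average_per_item_id": None, "average_per_subfamily": None,
     "average_per_family": None, "global_average": 8.0},
]
cascaded = [fill_by_priority(row, priority) for row in rows]
print(cascaded)  # [12.0, 10.0, 8.0]
```

Each row keeps the most precise value available: the first row uses the product average, the second falls back to the subfamily average, and the third falls all the way back to the global average.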


Examples

  1. The user wants to calculate a precise sales average. To do so, they first calculate the following columns using the aggregate_val_group_by_key calculator:

       • the average per product (average_per_item_id)

       • the average per subfamily (average_per_subfamily)

       • the average per family (average_per_family)

       • the global average (global_average)

     Next, they use case_na to prioritize the average_per_item_id column. Where this column contains nulls, the values of the average_per_subfamily column are used; where that column also contains nulls, the average_per_family values are used, and so on. The result is stored in the cascaded_average column.

calculated_cols:
  cascading_average:
    method: case_na
    input_columns:
    - average_per_item_id
    - average_per_subfamily
    - average_per_family
    - global_average
    output_columns:
    - cascaded_average
    params:
      priority:
      - average_per_item_id
      - average_per_subfamily
      - average_per_family
      - global_average

Another way of performing the same calculation would be to use the hierarchical_aggregate calculator.

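  2. The user wants to fill a column (price) containing NaN values with a constant (15). Two calculators can be used in succession: constant creates the constant_price column, then case_na fills the nulls of price with it. The result is stored in the filling_price column.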

calculated_cols:
  add_constant:
    method: constant
    params:
      value: 15
    output_columns:
      - constant_price

  case_na_price:
    method: case_na
    params:
      priority:
        - price
        - constant_price
    input_columns:
      - price
      - constant_price
    output_columns:
      - filling_price
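To make the expected result of this configuration concrete, here is a small worked sketch in plain Python. The sample prices are invented for illustration and the is_missing helper is hypothetical; only the column names and the constant 15 come from the configuration above.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# Invented sample values for the price column.
price = [9.5, float("nan"), 20.0, None]
# Value of the constant_price column created by the `constant` calculator.
constant_price = 15

# price has priority; constant_price is used only where price is missing.
filling_price = [constant_price if is_missing(p) else p for p in price]
print(filling_price)  # [9.5, 15, 20.0, 15]
```

The existing prices are left untouched, and only the missing entries take the constant.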
