aggregate_val_group_by_key

Aggregates values of a column.

Usage

This calculator adds columns that contain values from other columns aggregated at a given resolution, that is, per group of rows sharing the same key columns.

This calculator can be used with the following method:

aggregate_val_group_by_key

Examples:

  • Compute the average sales per item.

  • Calculate the first sales date of each product in each store.

  • Calculate the maximum promotion applied to products sold per store per month.


Main Parameters

The bold options represent the default values when the parameters are optional.

  • input_columns: list of columns used as input by the calculator, that is, all columns to group by and all columns to aggregate.

  • output_columns: list of columns added by the calculator, that is, the names of the columns containing the values after aggregation.

  • global (true, false): whether this calculator is run before the data is split for cross-validation during training.

  • steps [optional] (training, prediction, postprocessing): list of the pipeline steps in which the columns from this calculator are added to the data. Note that when the training option is listed, the calculator is actually applied during preprocessing.

  • store_in_model [optional] (true, false): whether the columns computed by the calculator are stored in the model, so that they are not recalculated during prediction. This is only relevant if the calculated columns are added in both training and prediction. If this parameter is omitted, the values are not stored in the model. The following parameters only make sense if this parameter is set to true.

  • stored_columns [required if store_in_model is true]: list of the columns to store, chosen among the output_columns.

  • stored_keys [required if store_in_model is true]: list of the columns used to match the stored values to the right rows when they are joined onto the data for prediction (logically, they should be chosen from the input_columns).


Specific Parameters

  • aggregation (count, cov, cumcount, cummax, cummin, cumprod, cumsum, diff, first, head, hist, idxmax, idxmin, last, max, mean, median, min, nunique, size, std, sum): method used for aggregation, among the methods available in Python.

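  • val: list of columns among the input_columns that will be aggregated.

  • group_by: list of columns among the input_columns used as grouping keys, as in the examples below.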
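
The grouping and aggregation these parameters describe can be illustrated with a minimal Python sketch. It is not the calculator's actual implementation: the function name, its signature, and the choice to write the aggregated value back onto every row of the group are assumptions made for illustration, with `group_by` taken from the examples on this page.

```python
from collections import defaultdict
from statistics import mean


def aggregate_sketch(rows, group_by, val, output_column, agg=mean):
    """Hypothetical helper: add `output_column` = agg(val) per group to each row."""
    # Collect the values of `val` for every combination of grouping keys.
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[k] for k in group_by)
        groups[key].append(row[val])

    # Aggregate each group once.
    aggregated = {key: agg(values) for key, values in groups.items()}

    # Return copies of the rows with the aggregated value attached.
    return [
        {**row, output_column: aggregated[tuple(row[k] for k in group_by)]}
        for row in rows
    ]


rows = [
    {"item_identifier": "A", "quantity_sold": 2},
    {"item_identifier": "A", "quantity_sold": 4},
    {"item_identifier": "B", "quantity_sold": 10},
]
result = aggregate_sketch(
    rows,
    group_by=["item_identifier"],
    val="quantity_sold",
    output_column="mean_sales_per_item",
)
```

Passing `agg=max` or `agg=min` with several columns in `group_by` (for example a store and a month column) would cover cases such as the maximum promotion per store per month.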


Examples

  1. The dataset contains daily sales data (quantity_sold) for several products, each identified by a key (item_identifier), in a single store. The user wants to compute the average daily sales of each product.

calculated_cols:
  aggregate_sales_mnfid:
    method: aggregate_val_group_by_key
    input_columns:
    - item_identifier
    - quantity_sold
    output_columns:
    - mean_sales_per_item
    steps:
    - training
    params:
        aggregation: mean
        val:
        - quantity_sold
        group_by:
        - item_identifier
  2. The same scenario, but this time the feature cannot be recalculated at prediction time, because the quantity sold is the target value and is not available during the prediction step. It is therefore stored in the model during training and reused for prediction.

calculated_cols:
  aggregate_sales_mnfid:
    method: aggregate_val_group_by_key
    input_columns:
    - item_identifier
    - quantity_sold
    output_columns:
    - mean_sales_per_item
    steps:
    - training
    - prediction
    params:
        aggregation: mean
        val:
        - quantity_sold
        group_by:
        - item_identifier
    store_in_model: true
    stored_keys:
    - item_identifier
    stored_columns:
    - mean_sales_per_item
    default_value:
      mean_sales_per_item: 0

This can be simplified by omitting the steps field, since the steps listed above are its default values.

calculated_cols:
  aggregate_sales_mnfid:
    method: aggregate_val_group_by_key
    input_columns:
    - item_identifier
    - quantity_sold
    output_columns:
    - mean_sales_per_item
    params:
        aggregation: mean
        val:
        - quantity_sold
        group_by:
        - item_identifier
    store_in_model: true
    stored_keys:
    - item_identifier
    stored_columns:
    - mean_sales_per_item
    default_value:
      mean_sales_per_item: 0
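
To make the role of store_in_model, stored_keys and stored_columns more concrete, here is a rough Python sketch of one way stored values could be kept at training time and joined back at prediction time. The two helper functions are hypothetical. The use of default_value as a fallback for keys that were not seen during training is also an assumption, since this page does not document that field.

```python
def fit_stored_values(training_rows, stored_keys, stored_columns):
    """Hypothetical helper: remember the stored columns for each key combination."""
    stored = {}
    for row in training_rows:
        key = tuple(row[k] for k in stored_keys)
        # The aggregated value is the same for all rows of a group,
        # so keeping the last one seen is enough.
        stored[key] = {col: row[col] for col in stored_columns}
    return stored


def apply_stored_values(prediction_rows, stored, stored_keys, default_value):
    """Hypothetical helper: join stored values onto prediction rows by key."""
    joined = []
    for row in prediction_rows:
        key = tuple(row[k] for k in stored_keys)
        # Assumption: keys unseen during training receive the default value.
        joined.append({**row, **stored.get(key, default_value)})
    return joined


# Training rows already carry the aggregated column.
training_rows = [
    {"item_identifier": "A", "quantity_sold": 2, "mean_sales_per_item": 3},
    {"item_identifier": "A", "quantity_sold": 4, "mean_sales_per_item": 3},
    {"item_identifier": "B", "quantity_sold": 10, "mean_sales_per_item": 10},
]
stored = fit_stored_values(
    training_rows,
    stored_keys=["item_identifier"],
    stored_columns=["mean_sales_per_item"],
)

# At prediction time quantity_sold is not available.
prediction_rows = [{"item_identifier": "A"}, {"item_identifier": "C"}]
predicted = apply_stored_values(
    prediction_rows,
    stored,
    stored_keys=["item_identifier"],
    default_value={"mean_sales_per_item": 0},
)
```

In this sketch item A receives the mean learned during training, while item C, which is absent from the training data, falls back to 0.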
