# hierarchical\_aggregate

## Usage

{% hint style="info" %}
This calculator allows you to add columns containing aggregations of values from other columns to a given resolution. It will try to group by the different groups in input until it finds one with enough elements in it.
{% endhint %}

This calculator can be used with the following method:

<mark style="color:red;">**`hierarchical_aggregate`**</mark>

Examples:

* Compute average sales per items but use the category items average when there is not enough data.
* Calculate the maximum promotion applied to products sold per store per month.

***

## Main Parameters

{% hint style="success" %}
**The bold options** represent the default values when the parameters are optional.
{% endhint %}

* *<mark style="color:blue;">input\_columns</mark>* \
  list of columns used as input of the calculators: The list of columns that will be used to fill the output column.
* *<mark style="color:blue;">output\_columns</mark>* \
  list of columns added by the calculators : Name of the filled column added to the dataset.
* *<mark style="color:blue;">global</mark>* *(true, **false)*** \
  Should this calculator be performed before data splitting during training for cross-validation
* *<mark style="color:blue;">steps</mark>* \[optionnal] *(**training, prediction**, postprocessing*)\
  List of steps in a pipeline where columns from this calculator are added to the data. Note that when the training option is listed, the calculator is actually added during preprocessing.
* *<mark style="color:blue;">store\_in\_model</mark>* \[optionnal] *(true, **false)*** \
  Please indicate whether the "calculated" columns by the calculator should be stored in the model or not to avoid recalculating them during prediction. This is only relevant if the calculated columns are added to both training and prediction. Without this parameter, the values will not be stored in the model. The following parameters only make sense if this parameter is set to *true*.
* *<mark style="color:blue;">stored\_columns</mark>* \[required if *store\_in\_model is true*] \
  List indicating the columns to be stored among the *<mark style="color:blue;">output\_columns</mark>*.
* *<mark style="color:blue;">stored\_keys</mark>* \[required if *store\_in\_model is true*] \
  List indicating the columns to use for identifying the correct values to join on the data for prediction among the stored values (logically, they are to be chosen from the *input\_columns*).

***

## Specific Parameters

* <mark style="color:blue;">operation</mark> \
  (count, cov, cumcount, cummax, cummin, cumprod, cumsum, diff, first, head, hist, idxmax, idxmin, last, max, mean, median, min, size, std & sum)

  List of methods used for aggregation
* *<mark style="color:blue;">val</mark>*

  List of columns among the *<mark style="color:blue;">input\_columns</mark>* that will be aggregated.
* *<mark style="color:blue;">group\_by</mark>*

  List of columns among the *<mark style="color:blue;">input\_columns</mark>* used as keys to group aggregated cols by. The columns need to be organized by ordered group. The aggregation will try to use each group in order until one works.
* *<mark style="color:blue;">min\_matching\_rows</mark>*

  Number of rows to consider for the historical subset to be a match. Defaults to 1.

***

## Examples

1. Given dataset contains daily sales data (`quantity_sold`) of several products identified by a key **(**`item_identifier`**)** and a category (`category`). The user wants to compute the average sales of each product but some of the product don’t have sales and want to replace it by the average sales by category.

   ```yaml
   calculated_cols:
     aggregate_price:
       method: hierarchical_aggregate
       input_columns:
       - item_level_1
       - item_id
       - quantity_sold
       output_columns:
       - agg_mean_quantity_sold
       params:
         val: quantity_sold
         group_by:
           group_1:
           - item_level_1
           - item_id
           group_2:
           - item_level_1
         operation:
         - mean
   ```

   **Input :**

   | category  | item\_identifier | quantity\_sold |
   | --------- | ---------------- | -------------- |
   | vegetable | carrot           | 10             |
   | vegatable | carrot           | 5              |
   | vegetable | avocado          |                |
   | vegatable | mushroom         | 10             |

   **Output**

   | category  | item\_identifier | agg\_mean\_quantity\_sold |
   | --------- | ---------------- | ------------------------- |
   | vegetable | carrot           | 7.5                       |
   | vegatable | carrot           | 7.5                       |
   | vegetable | avocado          | 8.333333333               |
   | vegatable | mushroom         | 10                        |
2. The same scenario, but this time we do not want to recalculate this feature on all predictions since the quantity sold is the target value and is not available during the prediction step. Therefore, we will store it in the model to infer it from training.

   ```yaml
   calculated_cols:
     aggregate_price:
       method: hierarchical_aggregate
       input_columns:
       - item_level_1
       - item_id
       - quantity_sold
       output_columns:
       - agg_mean_quantity_sold
       params:
         val: quantity_sold
         group_by:
           group_1:
           - item_level_1
           - item_id
           group_2:
           - item_level_1
         operation:
         - mean
       store_in_model: true
       stored_keys:
       - item_identifier
       - cateogory
       stored_columns:
       - agg_mean_quantity_sold
   ```

   You can also precise the minimum of element present into a category with the parameter min\_matching\_rows

   ```yaml
   calculated_cols:
     aggregate_price:
       method: hierarchical_aggregate
       input_columns:
       - item_level_1
       - item_id
       - quantity_sold
       output_columns:
       - agg_mean_quantity_sold
       params:
         val: quantity_sold
         group_by:
           group_1:
           - item_level_1
           - item_id
           group_2:
           - item_level_1
         operation:
         - mean
         min_matching_rows: 10
       store_in_model: true
       stored_keys:
       - item_identifier
       - cateogory
       stored_columns:
       - agg_mean_quantity_sold
   ```
