get_from_dataset

Creates a left join with a second dataset from Verteego.

Usage

Used to join external data to the dataset used by the Verteego platform, similar to an SQL join or a VLOOKUP in a spreadsheet. Note, however, that it is always a left join: we do not want to remove any rows from the main dataset, which contains the lines corresponding to our scope of interest. Here, we are solely aiming to enrich those lines.
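As an illustration only (a pandas sketch with made-up toy data, not the platform's implementation), the left-join behaviour looks like this:

```python
import pandas as pd

# Main dataset: defines the scope of interest; every row must survive the join
main = pd.DataFrame({"item_id": [1, 2, 3], "qty": [10, 5, 3]})
# Secondary dataset: only covers part of the scope
extra = pd.DataFrame({"item_id": [1, 2], "item_desc": ["A", "B"]})

# how="left" keeps all rows of the main dataset, like the calculator does
out = main.merge(extra, on="item_id", how="left")
# item_id 3 is kept, with NaN in item_desc
```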

This calculator can be used with the following method:

get_from_dataset

Examples:

  • get additional information on item/point-of-sale/warehouse characteristics

  • add complex features computed outside the platform

  • add a promotional calendar or tag different periods


Main Parameters

The bold options represent the default values when the parameters are optional.

  • input_columns list of columns used as inputs for the calculator to perform the join. These columns must be present in the main dataset used by our Verteego calculation step.

  • output_columns list of columns added by the calculator during the join. These columns must be present in the external dataset.

  • global (true, false) Whether this calculator should be run before the data is split for cross-validation during training.

  • steps [optional] (training, prediction, postprocessing) List of pipeline steps in which the columns from this calculator are added to the data. Note that when the training option is listed, the calculator is actually applied during preprocessing.

  • store_in_model [optional] (true, false) Whether the columns calculated by this calculator should be stored in the model, to avoid recalculating them during prediction. This is only relevant if the calculated columns are added in both training and prediction. Without this parameter, the values are not stored in the model. The following parameters only make sense when it is set to true.

  • stored_columns [required if store_in_model is true] List indicating the columns to be stored among the output_columns.

  • stored_keys [required if store_in_model is true] List indicating the columns to use for identifying the correct values to join on the data for prediction among the stored values (logically, they are to be chosen from the input_columns).


Specific Parameters

  • file_name Name of the Verteego dataset from which data will be extracted.

  • default_values Default values to use when the external file does not contain values for a given key. Each value can be a string or a number.

  • join_options [optional] If not provided, the calculator expects to find the input_columns in the secondary dataset and performs an exact join between columns with the same name. For a more complex join (different column names, constraints other than equality, ...), describe the join logic as a dictionary of join options, where each key is an input column and its value holds the join options for that key.

    Possible options are:

    • equal: the left column must be equal to the specified column

    • greater_than: the left column must be strictly greater than the specified column

    • greater_than_or_equal: the left column must be greater than or equal to the specified column

    • lesser_than: the left column must be strictly less than the specified column

    • lesser_than_or_equal: the left column must be less than or equal to the specified column


Examples

  1. In this first example, product information data is extracted from the secondary dataset item_repository_tb.

    For this, a join is performed on the product identifier column item_id, which is present in both our main and secondary datasets. Only the columns listed in output_columns are added.

    • If item_repository_tb contains other columns, they are not taken into account.

    • If no matching row is found for an item_id of the main dataset, the provided default value is added, or NaN when no default is given for a column (for example, item_status is not in default_values, so its value will be NaN).

calculated_cols:
  item_desc_feat:
      method: get_from_dataset
      input_columns:
      - item_id
      output_columns:
      - item_desc
      - item_status
      - item_level_1_id
      - item_level_1_label
      - item_level_2_id
      - item_level_2_label
      - item_brand_level_1_id
      params:
        file_name: item_repository_tb
        default_values:
          item_desc: default_values
          item_level_1_id: 0
          item_level_1_label: default_values
          item_level_2_id: 0
          item_level_2_label: default_values
          item_brand_level_1_id: 0
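Conceptually, this first join behaves like the following pandas sketch (toy stand-in values, not the platform's actual code):

```python
import pandas as pd

# Toy stand-ins for the main dataset and item_repository_tb
main = pd.DataFrame({"item_id": [1, 2, 99]})
item_repository_tb = pd.DataFrame({
    "item_id": [1, 2],
    "item_desc": ["Apple", "Banana"],
    "item_status": ["active", "active"],
    "item_level_1_id": [100, 200],
})

# Only the listed output_columns are brought over from the secondary dataset
output_columns = ["item_desc", "item_status", "item_level_1_id"]
out = main.merge(item_repository_tb[["item_id"] + output_columns],
                 on="item_id", how="left")
# Defaults apply only to columns listed in default_values;
# item_status has no default, so it stays NaN for the unmatched item_id 99
out = out.fillna({"item_desc": "default_values", "item_level_1_id": 0})
```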
  2. Let's take the previous case and imagine that, during dataset preparation, an inconsistency was introduced: the main dataset uses the English name item_id, while the secondary dataset uses the French name identifiant_produit. We can then perform a more complex join: we use the join_options parameter on the main dataset key item_id, set the join method to equal, and point it at the corresponding secondary dataset column identifiant_produit.

calculated_cols:
  item_desc_feat:
      method: get_from_dataset
      input_columns:
      - item_id
      output_columns:
      - item_desc
      - item_status
      - item_level_1_id
      - item_level_1_label
      - item_level_2_id
      - item_level_2_label
      - item_brand_level_1_id
      params:
        file_name: item_repository_tb
        default_values:
          item_desc: default_values
          item_level_1_id: 0
          item_level_1_label: default_values
          item_level_2_id: 0
          item_level_2_label: default_values
          item_brand_level_1_id: 0
        join_options:
          item_id:
            equal: identifiant_produit
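In pandas terms (again a sketch with invented values, not the platform's code), the equal option with a different right-hand column name corresponds to a merge with distinct key names:

```python
import pandas as pd

main = pd.DataFrame({"item_id": [1, 2]})
# Secondary dataset keyed by the French column name
secondary = pd.DataFrame({
    "identifiant_produit": [1, 2],
    "item_desc": ["Pomme", "Banane"],
})

# left_on / right_on pair the differently named key columns
out = main.merge(secondary, left_on="item_id",
                 right_on="identifiant_produit", how="left")
out = out.drop(columns=["identifiant_produit"])
```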
  3. Lastly, let's change the context and work with dates.

    We are studying a dataset of daily sales receipts, where the column receipt_date holds the date of each receipt. We want to identify, for each sale, whether it occurred during a promotional period. For that, we have a dataset is_promotional_sales_date_and_first_day_view (France only) listing the sale periods, with the columns:

    • is_promotional_sales_day always equal to 1

    • type indicating the type of the sale period

    • duration: how long the sales period lasts

    • 2 columns begin_date and end_date indicating the first and last days of that sale period (inclusive).

    To flag the sales made during promotions, we add the value 1 for the rows whose receipt_date falls between the begin_date and end_date of a row in the secondary dataset, and 0 otherwise. We proceed as follows:

    calculated_cols:
      add_promotional_sales:
          global: true
          method: get_from_dataset
          input_columns:
          - receipt_date
          output_columns:
          - is_promotional_sales_day
          - type
          - duration
          params:
            file_name: is_promotional_sales_date_and_first_day_view
            default_values:
              is_promotional_sales_day: 0
            join_options:
              receipt_date:
                greater_than_or_equal: begin_date
                lesser_than_or_equal: end_date
    

    Example of the output:

| receipt_date | is_promotional_sales_day | type | duration | begin_date | end_date |
| --- | --- | --- | --- | --- | --- |
| 2023-01-08 | 0.0 | null | null | null | null |
| 2023-01-09 | 0.0 | null | null | null | null |
| 2023-01-10 | 0.0 | null | null | null | null |
| 2023-01-11 | 1.0 | winter | 27 | 2023-01-11 | 2023-02-07 |
| 2023-01-12 | 1.0 | winter | 27 | 2023-01-11 | 2023-02-07 |
| 2023-01-13 | 1.0 | winter | 27 | 2023-01-11 | 2023-02-07 |
| 2023-01-14 | 1.0 | winter | 27 | 2023-01-11 | 2023-02-07 |
| 2023-01-15 | 1.0 | winter | 27 | 2023-01-11 | 2023-02-07 |
| 2023-01-16 | 1.0 | winter | 27 | 2023-01-11 | 2023-02-07 |
| 2023-01-17 | 1.0 | winter | 27 | 2023-01-11 | 2023-02-07 |
| 2023-01-18 | 1.0 | winter | 27 | 2023-01-11 | 2023-02-07 |
| 2023-01-20 | 1.0 | winter | 27 | 2023-01-11 | 2023-02-07 |
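The range join above (greater_than_or_equal / lesser_than_or_equal) can be sketched in pandas as a cross join filtered on the range condition; this is a conceptual illustration with a single toy sales period, not the calculator's implementation:

```python
import pandas as pd

main = pd.DataFrame({"receipt_date": pd.to_datetime(
    ["2023-01-10", "2023-01-11", "2023-01-12"])})
periods = pd.DataFrame({
    "is_promotional_sales_day": [1],
    "type": ["winter"],
    "duration": [27],
    "begin_date": pd.to_datetime(["2023-01-11"]),
    "end_date": pd.to_datetime(["2023-02-07"]),
})

# Cross join, keep rows satisfying the range condition, then merge back
# onto main so unmatched dates are preserved (left join semantics)
cross = main.merge(periods, how="cross")
matched = cross[(cross["receipt_date"] >= cross["begin_date"])
                & (cross["receipt_date"] <= cross["end_date"])]
out = main.merge(matched, on="receipt_date", how="left")
# Apply the default value declared for is_promotional_sales_day
out["is_promotional_sales_day"] = out["is_promotional_sales_day"].fillna(0)
```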
