get_from_dataset
Creates a left join with a second dataset from Verteego.
Usage
Used to join external data to the dataset used by the Verteego platform, similar to an SQL join or a VLOOKUP in a spreadsheet. Note, however, that it is always a left join: we do not want to remove any rows from the main dataset, which contains the rows corresponding to our scope of interest. The goal here is solely to enrich those rows.
This calculator can be used with the following method:
get_from_dataset
Examples:
Get additional information on item/POS/warehouse characteristics
Add complex features computed outside the platform
Add a promotional calendar or tag different periods
Main Parameters
The bold options represent the default values when the parameters are optional.
input_columns List of columns used as inputs for the calculator to perform the join. These columns must be present in the main dataset used by the Verteego calculation step.
output_columns List of columns added by the calculator during the join. These columns must be present in the external dataset.
global (true, false) Whether this calculator should be run before the data is split for cross-validation during training.
steps [optional] (training, prediction, postprocessing) List of steps in a pipeline where columns from this calculator are added to the data. Note that when the training option is listed, the calculator actually runs during preprocessing.
store_in_model [optional] (true, false) Whether the columns computed by the calculator should be stored in the model to avoid recalculating them during prediction. This is only relevant if the calculated columns are added to both training and prediction. Without this parameter, the values are not stored in the model. The following parameters only apply when this parameter is set to true.
stored_columns [required if store_in_model is true] List indicating the columns to be stored among the output_columns.
stored_keys [required if store_in_model is true] List indicating the columns used to identify, among the stored values, the correct values to join onto the prediction data (logically, they should be chosen from the input_columns).
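As an illustration, these parameters could be combined as in the sketch below. This is a hypothetical YAML layout: the exact nesting and key names depend on how calculators are declared in your Verteego pipeline configuration, and the dataset and column names are made up.

```yaml
# Hypothetical sketch; the exact configuration syntax may differ.
calculator: get_from_dataset
file_name: item_repository_tb   # see Specific Parameters below
input_columns:
  - item_id                     # must exist in the main dataset
output_columns:
  - item_status                 # must exist in the external dataset
global: true                    # run before the cross-validation split
steps:
  - training                    # applied during preprocessing
  - prediction
store_in_model: true            # store joined values in the model
stored_columns:                 # subset of output_columns to store
  - item_status
stored_keys:                    # chosen from input_columns
  - item_id
```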
Specific Parameters
file_name Name of the Verteego dataset from which the data is extracted.
default_values Default values to use when the external file does not contain values for a given key. Each value can be a string or a number.
join_options [optional] If not provided, the calculator expects to find the input_columns in the secondary dataset and performs an exact join between columns with the same name. If a more complex join is needed (different column names, constraints other than equality, ...), the join logic should be described as a dictionary of join options where the keys are the input columns and the values are the specific join options for that key.
Possible options are:
equal: the left column must be equal to the specified column
greater_than: the left column must be greater than the specified column
greater_than_or_equal: the left column must be greater than or equal to the specified column
lesser_than: the left column must be less than the specified column
lesser_than_or_equal: the left column must be less than or equal to the specified column
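For example, a join_options dictionary combining an exact match and a date-range constraint might be written as follows. This is a sketch: the way each option maps to a secondary-dataset column is an assumption based on the examples below.

```yaml
# Hypothetical join_options sketch.
join_options:
  item_id:
    equal: identifiant_produit        # exact match on a differently named column
  receipt_date:
    greater_than_or_equal: begin_date # receipt_date >= begin_date
    lesser_than_or_equal: end_date    # receipt_date <= end_date
```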
Examples
In this first example, product information is extracted from the secondary dataset item_repository_tb. For this, a join is performed on the product identifier column item_id, which is present in both the main and secondary datasets. Only the columns listed in output_columns are added: if item_repository_tb contains other columns, they are not taken into account. If no matching row is found for an item_id in the main dataset, the provided default value is added, or NaN for any column without a default (for example, item_status is not in default_values, so its value will be NaN).
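A configuration along these lines might look like the sketch below; item_family is a hypothetical second output column, and the syntax follows the assumptions stated earlier.

```yaml
calculator: get_from_dataset
file_name: item_repository_tb
input_columns:
  - item_id
output_columns:
  - item_family          # hypothetical column, has a default below
  - item_status          # no default: unmatched rows get NaN
default_values:
  item_family: unknown   # used when no matching item_id is found
```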
Let's take the previous case and imagine that, during dataset preparation, an inconsistency was introduced: the main dataset uses the English name item_id, while the secondary dataset uses the French name identifiant_produit. We can then perform a more complex join: we use the join_options parameter on the main dataset key item_id, set the join method to equal, and point it to the corresponding column identifiant_produit in the secondary dataset.
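Expressed as configuration, this might become (same hedges as above):

```yaml
calculator: get_from_dataset
file_name: item_repository_tb
input_columns:
  - item_id
output_columns:
  - item_status
join_options:
  item_id:
    equal: identifiant_produit   # match item_id to the French-named column
```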
Lastly, let's change the context and work with dates.
We are studying a dataset of daily sales receipts. The column receipt_date corresponds to the daily date. We want to identify for each sale whether it occurred during a promotional period. For that, we have a dataset is_promotional_sales_date_and_first_day_view (only for France) listing the sale periods, with the following columns:
is_promotional_sales_day: always equal to 1
type: the type of the sale period
duration: how long the sale period lasts
begin_date and end_date: the first and last days of that sale period (inclusive)
To identify the sales made during promotions, we simply add the value 1 on the rows where the receipt_date falls between the begin_date and end_date of the same row in the secondary dataset, and set it to 0 otherwise. We proceed as follows:
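A hedged configuration sketch for this range join, reusing the assumed syntax from the previous examples (defaulting is_promotional_sales_day to 0 relies on the default_values parameter):

```yaml
calculator: get_from_dataset
file_name: is_promotional_sales_date_and_first_day_view
input_columns:
  - receipt_date
output_columns:
  - is_promotional_sales_day
  - type
  - duration
  - begin_date
  - end_date
join_options:
  receipt_date:
    greater_than_or_equal: begin_date # receipt_date >= begin_date
    lesser_than_or_equal: end_date    # receipt_date <= end_date
default_values:
  is_promotional_sales_day: 0         # rows outside any sale period
```

Example of the output: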
| receipt_date | is_promotional_sales_day | type | duration | begin_date | end_date |
| --- | --- | --- | --- | --- | --- |
| 2023-01-08 | 0.0 | null | null | null | null |
| 2023-01-09 | 0.0 | null | null | null | null |
| 2023-01-10 | 0.0 | null | null | null | null |
| 2023-01-11 | 1.0 | winter | 27 | 2023-01-11 | 2023-02-07 |
| 2023-01-12 | 1.0 | winter | 27 | 2023-01-11 | 2023-02-07 |
| 2023-01-13 | 1.0 | winter | 27 | 2023-01-11 | 2023-02-07 |
| 2023-01-14 | 1.0 | winter | 27 | 2023-01-11 | 2023-02-07 |
| 2023-01-15 | 1.0 | winter | 27 | 2023-01-11 | 2023-02-07 |
| 2023-01-16 | 1.0 | winter | 27 | 2023-01-11 | 2023-02-07 |
| 2023-01-17 | 1.0 | winter | 27 | 2023-01-11 | 2023-02-07 |
| 2023-01-18 | 1.0 | winter | 27 | 2023-01-11 | 2023-02-07 |
| 2023-01-20 | 1.0 | winter | 27 | 2023-01-11 | 2023-02-07 |