# get\_from\_dataset

## Usage

{% hint style="info" %}
Used to join external data to the dataset used by Verteego plateform, similar to an SQL join or a VLOOKUP in a spreadsheet. However, please note that it <mark style="color:red;">will always be a left join</mark>, as we do not want to remove any rows from our main dataset that contains the lines corresponding to our scope of interest. Here, we are solely aiming to specify those lines.
{% endhint %}

This calculator can be used with the following method:

<mark style="color:red;">**`get_from_dataset`**</mark>

Examples:

* get additional information on items/pos/warehouses caracteristics
* add additional complex features computed out of plateform
* add promotional calendar or tag different period

***

## Main Parameters

{% hint style="success" %}
**The bold options** represent the default values when the parameters are optional.
{% endhint %}

* *<mark style="color:blue;">input\_columns</mark>* \
  list of columns used as inputs for the calculator to perform the join. These columns must be present in the main dataset used by our Verteego calculation step.
* *<mark style="color:blue;">output\_columns</mark>* \
  list of columns added by the calculator during the join. These columns must be present in the external dataset.
* *<mark style="color:blue;">global</mark>* *(true, **false)*** \
  Should this calculator be performed before data splitting during training for cross-validation
* *<mark style="color:blue;">steps</mark>* \[optionnal] *(**training, prediction**, postprocessing*) \
  List of steps in a pipeline where columns from this calculator are added to the data. Note that when the training option is listed, the calculator is actually added during preprocessing.
* *<mark style="color:blue;">store\_in\_model</mark>* \[optionnal] *(true, **false)*** \
  Please indicate whether the "calculated" columns by the calculator should be stored in the model or not to avoid recalculating them during prediction. This is only relevant if the calculated columns are added to both training and prediction. Without this parameter, the values will not be stored in the model. The following parameters only make sense if this parameter is set to *true*.
* *<mark style="color:blue;">stored\_columns</mark>* \[required if *<mark style="color:blue;">store\_in\_model</mark> is true*] \
  List indicating the columns to be stored among the *<mark style="color:blue;">output\_columns</mark>*.
* *<mark style="color:blue;">stored\_keys</mark>* \[required if *<mark style="color:blue;">store\_in\_model</mark> is true*] \
  List indicating the columns to use for identifying the correct values to join on the data for prediction among the stored values (logically, they are to be chosen from the *<mark style="color:blue;">input\_columns</mark>*).

***

## Specific Parameters

* *<mark style="color:blue;">file\_name</mark> (possible values)* \
  name of the Verteego dataset from which we will extract data.
* *<mark style="color:blue;">default\_values</mark> (possible values)* \
  Default values to use when the external file does not contain values for a given key. Can be a string or a number.
* *<mark style="color:blue;">join\_options</mark>* \[optional] \
  If not provided, the calculator expects to find the *<mark style="color:blue;">input\_columns</mark>* in the secondary dataset to perform an exact join between columns with the same name. If a more complex join is desired (different name, constraints other than equality, ...), the join logic should be detailed as a dictionary of join options where the keys are the input columns and the values are the specific join options for that key.

  \
  *Possible options are:*

> <mark style="color:blue;">**`equal`**</mark>: the left column must be equal to the specified column&#x20;
>
> <mark style="color:blue;">**`greater_than`**</mark>: the left column must be greater than or equal to the specified column
>
> <mark style="color:blue;">**`greater_than_or_equal`**</mark>: the left column must be greater than the specified column
>
> <mark style="color:blue;">**`lesser_than`**</mark>: the left column must be less than the specified column&#x20;
>
> <mark style="color:blue;">**`lesser_than_or_equal`**</mark>: the left column must be less than or equal to the specified column

***

## Examples

1. In this first example, product information data is extracted from the secondary dataset **`item_repository_tb`**.

   For this, a join is performed on the product identifier column **`item_id`**, which **is present in both** our main and secondary datasets. Only the columns listed in *output\_columns* are added.

   * if **`item_repository_tb`** contains other columns, they will not be taken into account
   * If no matching row is found for a **`item_id`** in the main dataset, provided default value will be added, or *NaN* if no value is given for a column (example **`item_status`** not in ***`default_values`*** *, so the value will be NaN)*

```yaml
calculated_cols:
  item_desc_feat:
      method: get_from_dataset
      input_columns:
      - item_id
      output_columns:
      - item_desc
      - item_status
      - item_level_1_id
      - item_level_1_label
      - item_level_2_id
      - item_level_2_label
      - item_brand_level_1_id
      params:
        file_name: item_repository_tb
        default_values:
          item_desc: default_values
          item_level_1_id: 0
          item_level_1_label: default_values
          item_level_2_id: 0
          item_level_2_label: default_values
          item_brand_level_1_id: 0
```

2. Let's take the previous case and imagine that, during the dataset preparation, an inaccuracy was made: the main dataset contains an English name `item_id`, while the secondary dataset contains a French name `identifiant_produit`.\
   We can then perform a more complex join: we use the *<mark style="color:blue;">join\_options</mark>* parameter on the key of the main dataset `item_id` and set the join method as <mark style="color:blue;">**`equal`**</mark>, with the corresponding column in the secondary dataset as `identifiant_produit`

```
calculated_cols:
  item_desc_feat:
      method: get_from_dataset
      input_columns:
      - item_id
      output_columns:
      - item_desc
      - item_status
      - item_level_1_id
      - item_level_1_label
      - item_level_2_id
      - item_level_2_label
      - item_brand_level_1_id
      params:
        file_name: item_repository_tb
        default_values:
          item_desc: default_values
          item_level_1_id: 0
          item_level_1_label: default_values
          item_level_2_id: 0
          item_level_2_label: default_values
          item_brand_level_1_id: 0
        join_options:
          item_id:
            equal: identifiant_produit
```

3. Lastly, let's change the context and work with dates.

   We are studying a dataset of daily sales receipts. The column **`receipt_date`** corresponds to the daily date. We want to identify for each sale if it occurred during a promotional period, and for that, we have a dataset **`is_promotional_sales_date_and_first_day_view`** *(only for France)* listing the sale periods, with columns:

   * **`is_promotional_sales_day`** always equal to **1**
   * **`type`** indicating the type of the sale period
   * **`duration`** : how long does the sales period last
   * 2 columns **`begin_date`** and **`end_date`** indicating the first and last days of that sale period (inclusive).

   To identify the sales during promotions, we simply add the value 1 for the rows where the **`receipt_date`** is between the **`begin_date`** and **`end_date`** columns of the same row in the secondary dataset, and set it to 0 otherwise. We proceed as follows:

   ```yaml
   calculated_cols:
     add_promotional_sales:
         global: true
         method: get_from_dataset
         input_columns:
         - receipt_date
         output_columns:
         - is_promotional_sales_day
         - type
         - duration
         params:
           file_name: is_promotional_sales_date_and_first_day_view
           default_values:
             is_promotional_sales_day: 0
           join_options:
             receipt_date:
               greater_than_or_equal: begin_date
               lesser_than_or_equal: end_date

   ```

   \
   Example of the output:

| receipt\_date | is\_promotional\_sales\_day | type   | duration | begin\_date | end\_date  |
| ------------- | --------------------------- | ------ | -------- | ----------- | ---------- |
| 2023-01-08    | 0.0                         | null   | null     | null        | null       |
| 2023-01-09    | 0.0                         | null   | null     | null        | null       |
| 2023-01-10    | 0.0                         | null   | null     | null        | null       |
| 2023-01-11    | 1.0                         | winter | 27       | 2023-01-11  | 2023-02-07 |
| 2023-01-12    | 1.0                         | winter | 27       | 2023-01-11  | 2023-02-07 |
| 2023-01-13    | 1.0                         | winter | 27       | 2023-01-11  | 2023-02-07 |
| 2023-01-14    | 1.0                         | winter | 27       | 2023-01-11  | 2023-02-07 |
| 2023-01-15    | 1.0                         | winter | 27       | 2023-01-11  | 2023-02-07 |
| 2023-01-16    | 1.0                         | winter | 27       | 2023-01-11  | 2023-02-07 |
| 2023-01-17    | 1.0                         | winter | 27       | 2023-01-11  | 2023-02-07 |
| 2023-01-18    | 1.0                         | winter | 27       | 2023-01-11  | 2023-02-07 |
| 2023-01-20    | 1.0                         | winter | 27       | 2023-01-11  | 2023-02-07 |


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://doc.verteego.com/verteego-doc/pipelines/forecasting-pipelines/calculators/external-source/get_from_dataset.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
