# one\_hot\_encode

## Usage

{% hint style="info" %}
This calculator **transforms categorical columns into binary ones**. \
For each category, it creates a binary column containing 1 when the row belongs to that category and 0 otherwise. A single value can hold several categories at once if they are separated by a specific separator.

*Please note that:* \
\- Columns existing during preprocessing but missing during prediction are added and filled with False

\- Columns existing during prediction but missing during preprocessing are removed
{% endhint %}
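The alignment rule above can be illustrated with a minimal plain-Python sketch. It is not the calculator's implementation: the `align_columns` helper and the column names are hypothetical, and only the binary columns are represented.

```python
def align_columns(row, training_columns):
    """Align the binary columns of a prediction row on the training columns.

    Columns seen during preprocessing but absent from `row` are added as False;
    columns that were not seen during preprocessing are dropped.
    """
    return {col: row.get(col, False) for col in training_columns}


# Binary columns produced during preprocessing (hypothetical names).
training_columns = ["oh_blue", "oh_red", "oh_green"]

# At prediction time, "oh_green" is missing and "oh_yellow" is new.
prediction_row = {"oh_blue": True, "oh_red": False, "oh_yellow": True}

aligned = align_columns(prediction_row, training_columns)
print(aligned)  # {'oh_blue': True, 'oh_red': False, 'oh_green': False}
```

The model therefore always receives the same set of binary columns it was trained on, whatever categories appear at prediction time.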

{% hint style="danger" %}
Some algorithms already handle categorical encoding (for example `lightgbm`), so `one_hot_encode` is not necessary for them.
{% endhint %}

This calculator can be used with the following method:

<mark style="color:red;">**`one_hot_encode`**</mark>

Examples:

* Transform a product category with 5 different categories into 5 binary columns.
* This calculator is best used when there are only a few possible categories.
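One way to check whether a column has few enough categories is to count its distinct values before encoding it. The sketch below is only an illustration in plain Python: the `count_categories` helper and the `MAX_CATEGORIES` threshold are assumptions made for the example, not part of the calculator.

```python
def count_categories(rows, column):
    """Count the distinct values of `column` across `rows` (a list of dicts)."""
    return len({row[column] for row in rows})


rows = [
    {"item_id": 877988, "product_color": "blue"},
    {"item_id": 556764, "product_color": "red"},
    {"item_id": 321132, "product_color": "blue"},
    {"item_id": 121453, "product_color": "green"},
]

MAX_CATEGORIES = 20  # arbitrary threshold chosen for this sketch

n_colors = count_categories(rows, "product_color")  # 3 distinct colors
n_items = count_categories(rows, "item_id")         # one distinct value per row

# A column is a reasonable candidate when its cardinality stays small.
is_candidate = n_colors <= MAX_CATEGORIES
print(n_colors, n_items, is_candidate)
```

A column such as `item_id`, with one distinct value per row, would create as many binary columns as there are rows and is a poor candidate.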

***

## Main Parameters

{% hint style="success" %}
**The bold options** represent the default values when the parameters are optional.
{% endhint %}

* *<mark style="color:blue;">input\_columns</mark>* \
  List of columns used as input of the calculator, i.e. the columns used to fill the output columns.
* <mark style="color:blue;">output\_columns\_prefix</mark> \
  Prefix of the columns added when the output columns cannot be listed in advance.
* *<mark style="color:blue;">global</mark>* *(true, **false)*** \
  Whether this calculator should be applied before the data is split for cross-validation during training.
* *<mark style="color:blue;">steps</mark>* \[optional] *(**training, prediction**, postprocessing*) \
  List of steps in a pipeline where columns from this calculator are added to the data. Note that when the training option is listed, the calculator is actually added during preprocessing.
* *<mark style="color:blue;">store\_in\_model</mark>* \[optional] *(true, **false)*** \
  Whether the columns calculated by this calculator should be stored in the model, to avoid recalculating them during prediction. This is only relevant if the calculated columns are added to both training and prediction. When this parameter is omitted, the values are not stored in the model. The following parameters only make sense if this parameter is set to *true*.
* *<mark style="color:blue;">stored\_columns</mark>* \[required if *<mark style="color:blue;">store\_in\_model</mark> is true*] \
  List indicating the columns to be stored among the *<mark style="color:blue;">output\_columns</mark>*.
* *<mark style="color:blue;">stored\_keys</mark>* \[required if *<mark style="color:blue;">store\_in\_model</mark> is true*] \
  List indicating the columns to use for identifying the correct values to join on the data for prediction among the stored values (logically, they are to be chosen from the *<mark style="color:blue;">input\_columns</mark>*).

***

## Specific Parameters

* *<mark style="color:blue;">separator</mark>* \[optional] \
  Character used as separator in the input column. When a separator is provided, the columns to one-hot encode must be of type str.

***

## Examples

1. A given dataset contains sales data (`qty_sold`) for several products. These products are characterized by a family (`concat_famid`) and a color (`product_color`), which are categorical columns containing several values. \
   The user wants to transform those categorical columns into binary ones.

   ```yaml
   calculated_cols:
     ohe_cat_feat:
       method: one_hot_encode
       input_columns:
       - concat_famid
       - product_color
       output_columns_prefix: oh
   ```

**Input:**

| item\_id | receipt\_date | qty\_sold | concat\_famid | product\_color |
| -------- | ------------- | --------- | ------------- | -------------- |
| 877988   | 2024-01-01    | 50        | fam\_1        | blue           |
| 556764   | 2024-01-01    | 43        | fam\_1        | red            |
| 321132   | 2024-01-01    | 18        | fam\_2        | blue           |
| 121453   | 2024-01-01    | 9         | fam\_3        | green          |

**Result:**

<table><thead><tr><th width="117">item_id</th><th width="141">receipt_date</th><th>qty_sold</th><th width="116">oh_fam_1</th><th width="115">oh_fam_2</th><th width="121">oh_fam_3</th><th>oh_blue</th><th width="103">oh_red</th><th>oh_green</th></tr></thead><tbody><tr><td>877988</td><td>2024-01-01</td><td>50</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr><tr><td>556764</td><td>2024-01-01</td><td>43</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr><tr><td>321132</td><td>2024-01-01</td><td>18</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td></tr><tr><td>121453</td><td>2024-01-01</td><td>9</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td></tr></tbody></table>
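The transformation in this first example can be reproduced with a short plain-Python sketch. It is an illustration only, not the calculator's implementation: the local `one_hot_encode` function, the `<prefix>_<category>` naming rule and the list-of-dicts representation are assumptions, and `receipt_date` is left out for brevity.

```python
def one_hot_encode(rows, input_columns, prefix):
    """One-hot encode `input_columns` in `rows` (a list of dicts).

    Each input column is replaced by one 0/1 column per category,
    named "<prefix>_<category>" (naming assumed for this sketch).
    """
    # Categories of each column, in order of first appearance.
    categories = {
        col: list(dict.fromkeys(row[col] for row in rows))
        for col in input_columns
    }
    encoded = []
    for row in rows:
        # Keep the untouched columns, drop the encoded ones.
        out = {k: v for k, v in row.items() if k not in input_columns}
        for col in input_columns:
            for category in categories[col]:
                out[f"{prefix}_{category}"] = int(row[col] == category)
        encoded.append(out)
    return encoded


rows = [
    {"item_id": 877988, "qty_sold": 50, "concat_famid": "fam_1", "product_color": "blue"},
    {"item_id": 556764, "qty_sold": 43, "concat_famid": "fam_1", "product_color": "red"},
    {"item_id": 321132, "qty_sold": 18, "concat_famid": "fam_2", "product_color": "blue"},
    {"item_id": 121453, "qty_sold": 9, "concat_famid": "fam_3", "product_color": "green"},
]

encoded = one_hot_encode(rows, ["concat_famid", "product_color"], prefix="oh")
for row in encoded:
    print(row)
```

Each row ends up with exactly one 1 per encoded input column, which matches the 0/1 values of the result table.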

2. In this new example, a product can have multiple colors, separated by commas in the input dataset. `one_hot_encode` can handle this use case with the following config:

```yaml
calculated_cols:
  ohe_cat_feat:
    method: one_hot_encode
    input_columns:
    - product_color
    output_columns_prefix: oh
    params:
      separator: ","
```

**Input:**

| item\_id | receipt\_date | qty\_sold | product\_color |
| -------- | ------------- | --------- | -------------- |
| 877988   | 2024-01-01    | 50        | blue,red       |
| 556764   | 2024-01-01    | 43        | red,green      |
| 321132   | 2024-01-01    | 18        | blue,red,green |
| 121453   | 2024-01-01    | 9         | green          |

**Result:**

| item\_id | receipt\_date | qty\_sold | oh\_blue  | oh\_red  | oh\_green  |
| -------- | ------------- | --------- | --------- | -------- | ---------- |
| 877988   | 2024-01-01    | 50        | 1         | 1        | 0          |
| 556764   | 2024-01-01    | 43        | 0         | 1        | 1          |
| 321132   | 2024-01-01    | 18        | 1         | 1        | 1          |
| 121453   | 2024-01-01    | 9         | 0         | 0        | 1          |
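The multi-category behavior can be sketched in plain Python as follows. As before, this is an illustration under assumptions rather than the calculator's code: the `one_hot_encode_multi` helper and the `<prefix>_<category>` naming rule are hypothetical, and `receipt_date` is left out for brevity.

```python
def one_hot_encode_multi(rows, column, prefix, separator):
    """One-hot encode a string column holding several categories per value.

    Each value is split on `separator`; a row gets 1 in every column
    whose category appears in its value, and 0 elsewhere.
    """
    # All categories found after splitting, in order of first appearance.
    categories = list(
        dict.fromkeys(cat for row in rows for cat in row[column].split(separator))
    )
    encoded = []
    for row in rows:
        present = set(row[column].split(separator))
        out = {k: v for k, v in row.items() if k != column}
        for category in categories:
            out[f"{prefix}_{category}"] = int(category in present)
        encoded.append(out)
    return encoded


rows = [
    {"item_id": 877988, "qty_sold": 50, "product_color": "blue,red"},
    {"item_id": 556764, "qty_sold": 43, "product_color": "red,green"},
    {"item_id": 321132, "qty_sold": 18, "product_color": "blue,red,green"},
    {"item_id": 121453, "qty_sold": 9, "product_color": "green"},
]

encoded_multi = one_hot_encode_multi(rows, "product_color", prefix="oh", separator=",")
for row in encoded_multi:
    print(row)
```

Unlike the single-category case, a row can now contain several 1s for the same input column. The split relies on string methods, which is consistent with the requirement that the column be of type str when a separator is provided.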
