# clustering

## Usage

{% hint style="info" %}
This calculator groups data into clusters whose members share similar characteristics. Several clustering algorithms are available.
{% endhint %}

This calculator can be used with the following method:

<mark style="color:red;">**`clustering`**</mark>

Examples:

* Group together products with similar characteristics.
* Group together certain stores with common characteristics.

***

## Main Parameters

{% hint style="success" %}
**The bold options** represent the default values when the parameters are optional.
{% endhint %}

* *<mark style="color:blue;">input\_columns</mark>* \
  List of columns used as input to the calculator: the columns from which the clusters are built.
* *<mark style="color:blue;">output\_columns</mark>* \
  List of columns added by the calculator: the name of the column that contains the clustering results, i.e. the identifiers of the groups created by clustering.
* *<mark style="color:blue;">global</mark>* *(true, **false)*** \
  Whether this calculator is run before the data is split for cross-validation during training.
* *<mark style="color:blue;">steps</mark>* \[optional] *(**training, prediction**, postprocessing*) \
  List of steps in a pipeline where columns from this calculator are added to the data. Note that when the training option is listed, the calculator is actually added during preprocessing.
* *<mark style="color:blue;">store\_in\_model</mark>* \[optional] *(true, **false)*** \
  Whether the columns calculated by this calculator are stored in the model, to avoid recalculating them during prediction. This is only relevant if the calculated columns are added to both training and prediction. If this parameter is omitted, the values are not stored in the model. The following parameters only make sense if this parameter is set to *true*.
* *<mark style="color:blue;">stored\_columns</mark>* \[required if *<mark style="color:blue;">store\_in\_model</mark> is true*] \
  List indicating the columns to be stored among the *<mark style="color:blue;">output\_columns</mark>*.
* *<mark style="color:blue;">stored\_keys</mark>* \[required if *<mark style="color:blue;">store\_in\_model</mark> is true*] \
  List of columns used as keys to look up the stored values and join them onto the data at prediction time (logically, they are to be chosen from the *<mark style="color:blue;">input\_columns</mark>*).
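
As a rough illustration of how the main parameters fit together, the sketch below sets each of them explicitly. The column names are borrowed from the Examples section. Placing `global` and `steps` at the same level as `input_columns` is an assumption, since this page does not show them in a configuration. The values given for `global` and `steps` are simply the documented defaults.

```yaml
calculated_cols:
  cluster_calculation:
    method: clustering
    input_columns:
    - ts_.*
    output_columns:
    - cluster_id
    # Assumed placement: same level as input_columns.
    global: false          # documented default
    steps:                 # documented defaults
    - training
    - prediction
    # Store the results so they are not recalculated during prediction.
    store_in_model: true
    stored_columns:        # chosen among output_columns
    - cluster_id
    stored_keys:           # keys used to join stored values at prediction
    - item_id
```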

***

## Specific Parameters

* *<mark style="color:blue;">algo\_name</mark>*

  (`KMeans`, `MeanShift`, `DBSCAN`, `OPTICS`) \
  Algorithm used for clustering.
* *<mark style="color:blue;">algo\_args</mark>*

  Specific parameters for a given clustering algorithm.

  Example: for KMeans, you can specify the number of clusters you want (n\_clusters). See the scikit-learn documentation for details.
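
To illustrate that `algo_args` depends on the chosen algorithm, here is a minimal sketch of a `params` block for `DBSCAN`. It assumes that `algo_args` is passed unchanged to the corresponding scikit-learn estimator, which this page suggests but does not state. `eps` and `min_samples` are scikit-learn `DBSCAN` arguments, shown here with scikit-learn's default values. Unlike `KMeans`, `DBSCAN` takes no `n_clusters` argument; the number of clusters follows from the data and these two settings.

```yaml
params:
  algo_name: DBSCAN
  algo_args:
    eps: 0.5          # maximum distance between two neighbouring samples
    min_samples: 5    # samples needed in a neighbourhood to form a core point
```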

***

## Examples

1. Given temporal features of products (ts\_.\*), the user wants to create 30 clusters based on the temporal characteristics of products, to group together products with similar time series.

   ```yaml
   calculated_cols:
     cluster_calculation:
       method: clustering
       input_columns:
       - ts_.*
       output_columns:
       - cluster_id
       store_in_model: true
       stored_keys:
       - item_id
       stored_columns:
       - cluster_id
       params:
         algo_name: KMeans
         algo_args:
           n_clusters: 30
           random_state: 12
   ```
