clustering

Clusters a dataset based on a subset of its columns.

Usage

This calculator partitions the data into groups (clusters) whose members share similar characteristics. Several clustering algorithms are available.

This calculator can be used with the following method:

clustering

Examples:

  • Group together products with similar characteristics.

  • Group together certain stores with common characteristics.


Main Parameters

The bold options represent the default values when the parameters are optional.

  • input_columns List of columns used as input by the calculator: the columns from which the clusters are built.

  • output_columns List of columns added by the calculator: the name of the column that holds the clustering result, i.e. the group assigned by the clustering.

  • global (true, false) Whether this calculator should be run before the data is split for cross-validation during training.

  • steps [optional] (training, prediction, postprocessing) List of pipeline steps at which the columns from this calculator are added to the data. Note that when the training option is listed, the calculator is actually applied during preprocessing.

  • store_in_model [optional] (true, false) Whether the columns calculated by this calculator should be stored in the model so that they are not recalculated during prediction. This is only relevant if the calculated columns are added to both training and prediction. If this parameter is omitted, the values are not stored in the model. The following parameters only make sense if this parameter is set to true.

  • stored_columns [required if store_in_model is true] List of the columns, among the output_columns, to store in the model.

  • stored_keys [required if store_in_model is true] List of columns used as keys to join the stored values back onto the data at prediction time (logically, they should be chosen from the input_columns).
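
As a rough sketch of how the main parameters might fit together, the fragment below runs the calculator before the cross-validation split and adds its output at training and prediction. It assumes that global and steps sit at the same level as method, as store_in_model does in the example at the end of this page, and that steps is written as a YAML list. The calculator name product_groups and the columns price, weight and product_group are hypothetical.

```yaml
calculated_cols:
  product_groups:          # hypothetical calculator name
    method: clustering
    input_columns:         # hypothetical feature columns
    - price
    - weight
    output_columns:
    - product_group        # hypothetical output column
    global: true           # run before the cross-validation split
    steps:                 # assumed to be written as a list
    - training
    - prediction
    params:
      algo_name: KMeans
      algo_args:
        n_clusters: 5
```

Because store_in_model is omitted here, the cluster values would not be stored in the model.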


Specific Parameters

  • algo_name

    (KMeans, MeanShift, DBSCAN, OPTICS) Algorithm used for clustering.

  • algo_args

    Specific parameters for a given clustering algorithm.

    Example: for KMeans, you can specify the desired number of clusters (n_clusters). See the scikit-learn documentation for details.
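
    Other algorithms take different arguments. The fragment below is a minimal sketch that assumes algo_args is forwarded unchanged to the scikit-learn estimator named by algo_name. eps and min_samples are parameters of scikit-learn's DBSCAN, and the values shown are purely illustrative.

```yaml
params:
  algo_name: DBSCAN
  algo_args:
    eps: 0.5          # maximum distance between two neighbouring samples
    min_samples: 5    # neighbours required for a point to count as a core point
```

    Unlike KMeans, DBSCAN does not take a number of clusters: it derives the clusters from the density of the data, and scikit-learn labels outliers as noise (label -1).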


Examples

  1. Given temporal features of products (columns matching ts_.*), the user wants to create 30 clusters so that products with similar time series are grouped together.

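    calculated_cols:
      cluster_calculation:
        method: clustering
        # Pattern selecting the temporal feature columns
        input_columns:
        - ts_.*
        # Column that receives the cluster of each product
        output_columns:
        - cluster_id
        # Keep the clusters in the model, keyed by item_id, for prediction
        store_in_model: true
        stored_keys:
        - item_id
        stored_columns:
        - cluster_id
        params:
          algo_name: KMeans
          algo_args:
            n_clusters: 30
            random_state: 12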
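
A similar configuration could cover the second use case listed under Usage, grouping stores with common characteristics. The following is only a sketch: the calculator name store_clustering and the columns store_id, store_surface, store_revenue and store_cluster are hypothetical. It also assumes that algo_args is passed to scikit-learn's MeanShift, whose bandwidth parameter controls the size of the neighbourhood used to form clusters.

```yaml
calculated_cols:
  store_clustering:        # hypothetical calculator name
    method: clustering
    input_columns:         # hypothetical store features
    - store_surface
    - store_revenue
    output_columns:
    - store_cluster
    store_in_model: true
    stored_keys:
    - store_id             # hypothetical store identifier
    stored_columns:
    - store_cluster
    params:
      algo_name: MeanShift
      algo_args:
        bandwidth: 2.0     # illustrative value; tune to the scale of the features
```

MeanShift determines the number of clusters itself, so no n_clusters argument is given.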
