Datasets

Recipes in Verteego support flexible handling of datasets, enabling the import, export, and transformation of data as part of the decision-making workflow.

Import Data

Importing data into Verteego lets you initialize datasets for analysis and update them periodically as the source data changes.

Process for Data Import

  • Initial Import: Manually load the data for the first time using the "Data" tab on the Verteego web interface. This sets up the dataset in the system.

  • Subsequent Imports: To update or re-import the data as part of an automated workflow, use the import_from_dataset block in the recipe.

Parameters

  • type: Specifies the action to be performed. For data imports, this should always be set to import_from_dataset. Mandatory.

  • dataset_name: The name of the dataset to be imported. This should match the name of the dataset as defined in Verteego. Mandatory.

  • refresh: A boolean parameter that dictates how the dataset should be loaded. Default: true.

    • true (default value) – Reload the dataset from the source data. Use this option to ensure the dataset in Verteego is synchronized with the latest data from the external data source.

    • false – Clone the existing dataset in Verteego without reaching out to the external data source. This is useful when the data does not need to be up to date but should reproduce a consistent state for testing or analysis (a sketch of this variant follows the example below).

Example

Below is an example of a YAML configuration for importing a dataset, with an explicit instruction to refresh the data from the source:

import_item_referential:
  type: import_from_dataset
  params:
    dataset_name: item_referential_tb
    refresh: true
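
For comparison, here is a minimal sketch of the same step cloning the dataset already in Verteego instead of refreshing it from the source (the step name import_item_referential_snapshot is illustrative):

import_item_referential_snapshot:
  type: import_from_dataset
  params:
    dataset_name: item_referential_tb
    refresh: false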

Export Data

Verteego can export datasets to external destinations, using configurable parameters that define how the data should be integrated into the target environment.

Parameters

  • type: Specifies the action to be performed. For data exports, this should be set to export, as shown in the example below. Mandatory.

  • dataset_name: The identifier of the dataset to be exported. This is the name of the dataset within Verteego that you want to send to an external destination. Mandatory.

  • data_source: The target destination for the dataset. This could refer to a database, a file storage system, or any other data sink supported by Verteego. Mandatory.

  • method: Specifies the export mode. Default: overwrite. The available options are:

    • overwrite (default value) – Completely replaces the content at the destination with the new dataset.

    • append – Adds the new data to the existing dataset at the destination, preserving previous data and maintaining the established format and data schema.

Example

Here's how you would structure your YAML configuration for exporting data:

export_item_referential:
  type: export
  params:
    dataset_name: item_referential_tb
    data_source: bigquery_item_datasource
    method: append
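
Because overwrite is the default, an export that replaces the destination content can state the method explicitly or omit it. A minimal sketch with the method stated explicitly, reusing the same illustrative dataset and data source names:

replace_item_referential:
  type: export
  params:
    dataset_name: item_referential_tb
    data_source: bigquery_item_datasource
    method: overwrite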

Extract Pipeline Results as a Dataset

Verteego enables the extraction of data from various stages within a pipeline, which can then be stored as datasets for detailed analysis and further processing. This is accomplished using the import_from_pipeline block in the recipe.

Pipeline Stages for Data Extraction

  • Preprocessing (forecast pipeline): Data from the initial preparation phase can be exported. This typically includes cleaned and transformed data ready for modeling.

  • Prediction (forecast pipeline): This stage allows for the export of model predictions. Users can choose to include additional analytical insights such as Shapley values, which explain the contribution of each feature to the prediction, or export all features used in the prediction model.

  • Postprocessing (forecast pipeline): Data from the final adjustments made to predictions is exported. This might include recalibrated or refined prediction outputs after initial model processing.

  • Optimization (optimization pipeline): This stage exports the final result of the optimization.

Parameters

  • type: Specifies the action to be performed. For extracting pipeline data, this should be set to import_from_pipeline.

  • pipeline_name: The name of the pipeline from which to extract data.

  • pipeline_step: Indicates the specific stage of the pipeline (e.g., preprocessing, prediction, postprocessing, optimization) from which to extract data.

  • shapley_values: Optional boolean parameter that specifies whether to include Shapley values in the exported dataset. This is relevant for the prediction stage only and helps in understanding the influence of each input feature.

    • true – Include Shapley values in the dataset.

    • false (default value) – Exclude Shapley values from the dataset.

  • preprocessed_columns: Optional boolean parameter that specifies whether to include all calculated columns/features in the exported dataset. This is relevant for the prediction stage only.

    • true – Include calculated columns in the dataset.

    • false (default value) – Exclude calculated columns from the dataset.

Note: shapley_values and preprocessed_columns are available only for the forecast pipeline.

Example

extract_forecast_results:
  type: import_from_pipeline
  params:
    pipeline_name: sales_forecast_model
    pipeline_step: prediction
    shapley_values: true
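
For an optimization pipeline, the same block extracts the final optimization result; shapley_values and preprocessed_columns do not apply here. A minimal sketch, where the pipeline name assortment_optimization is illustrative:

extract_optimization_results:
  type: import_from_pipeline
  params:
    pipeline_name: assortment_optimization
    pipeline_step: optimization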
