Datasets
Recipes in Verteego allow for intricate handling of datasets, enabling the import, export, and transformation of data as part of the decision-making workflow.
Import Data
Importing data into Verteego allows you to initialize datasets for analysis and periodically update them based on changes in source data.
Process for Data Import
Initial Import: Manually load the data for the first time using the "Data" tab on the Verteego web interface. This sets up the dataset in the system.
Subsequent Imports: To update or re-import the data as part of an automated workflow, use the import_from_dataset block in the recipe.
Parameters
type: Specifies the action to be performed. For data imports, this should always be set to import_from_dataset. Mandatory.
dataset_name: The name of the dataset to be imported. This should match the name of the dataset as defined in Verteego. Mandatory.
refresh: A boolean parameter that dictates how the dataset should be loaded. Default: true.
true (default value) – Reload the dataset from the source data. Use this option to ensure the dataset in Verteego is synchronized with the latest data from the external datasource.
false – Clone the existing dataset in Verteego without reaching out to the external datasource. This is useful for operations where the data does not need to be the latest but should replicate a consistent state for testing or analysis.
Example
Below is an example of a YAML configuration for importing a dataset, with an explicit instruction to refresh the data from the source:
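A minimal sketch of such a block, using only the three parameters listed above. The dataset name sales_history is a hypothetical placeholder, and where the block sits inside the surrounding recipe is not shown here:

```yaml
# Import block sketch: the keys come from the parameter list above.
type: import_from_dataset
dataset_name: sales_history   # hypothetical name; must match the dataset as defined in Verteego
refresh: true                 # reload from the external datasource (this is also the default)
```

Setting refresh to false instead would clone the dataset already held in Verteego without contacting the external datasource.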
Export Data
Verteego can export datasets to an external destination. The parameters below define which dataset is sent, where it goes, and how it is written to the target.
Parameters
dataset_name: The identifier of the dataset to be exported. This is the name of the dataset within Verteego that you want to send to an external source. Mandatory.
data_source: The target destination for the dataset. This could refer to a database, a file storage system, or any other data sink supported by Verteego. Mandatory.
method: Specifies the export mode. Default: overwrite. The available options are:
overwrite (default value) – This mode will completely replace the content at the destination with the new dataset.
append – In this mode, new data will be added to the existing dataset at the destination, preserving previous data and maintaining the established format and data schema.
Example
Here's how you would structure your YAML configuration for exporting data:
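A minimal sketch limited to the three parameters documented above. The names sales_forecast and reporting_warehouse are hypothetical placeholders. This section does not state the type identifier of the export block, so it is left out rather than guessed:

```yaml
# Export block sketch: only the documented keys are shown.
# The block's type value is not given in this section and is deliberately omitted.
dataset_name: sales_forecast       # hypothetical name of a dataset within Verteego
data_source: reporting_warehouse   # hypothetical name of the target data sink
method: append                     # add rows to the destination; the default, overwrite, replaces its content
```

Choosing append assumes the destination already holds data with a compatible schema, since this mode preserves the existing format.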
Extract Pipeline results as Dataset
Verteego enables the extraction of data from various stages within a pipeline, which can then be stored as datasets for detailed analysis and further processing. This is accomplished using the import_from_pipeline method.
Pipeline Stages for Data Extraction
Preprocessing (forecast pipeline): Data from the initial preparation phase can be exported. This typically includes cleaned and transformed data ready for modeling.
Prediction (forecast pipeline): This stage allows for the export of model predictions. Users can choose to include additional analytical insights such as Shapley values, which explain the contribution of each feature to the prediction, or export all features used in the prediction model.
Postprocessing (forecast pipeline): Data from the final adjustments made to predictions is exported. This might include recalibrated or refined prediction outputs after initial model processing.
Optimization (optimization pipeline): This stage allows the final result of the optimization to be exported.
Parameters
type: Specifies the action to be performed. For extracting pipeline data, this should be set to import_from_pipeline.
pipeline_name: The name of the pipeline from which to extract data.
pipeline_step: Indicates the specific stage of the pipeline (e.g., preprocessing, prediction, postprocessing, optimization) from which to extract data.
shapley_values: Optional boolean parameter that specifies whether to include Shapley values in the exported dataset. This is relevant for the prediction stage only and assists in understanding the influence of each input feature.
true – Include values in the dataset.
false (default value) – Exclude values from the dataset.
preprocessed_columns: Optional boolean parameter that specifies whether to include all calculated columns/features in the exported dataset. This is relevant for the prediction stage only.
true – Include calculated columns in the dataset.
false (default value) – Exclude calculated columns from the dataset.
shapley_values and preprocessed_columns are available only for the Forecast Pipeline.
Example
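A minimal sketch that extracts predictions together with Shapley values, using only the parameters listed above. The pipeline name demand_forecast is a hypothetical placeholder, and any key that names the resulting dataset is not covered in this section, so none is shown:

```yaml
# Pipeline extraction sketch: the keys come from the parameter list above.
type: import_from_pipeline
pipeline_name: demand_forecast   # hypothetical pipeline name
pipeline_step: prediction        # one of: preprocessing, prediction, postprocessing, optimization
shapley_values: true             # prediction stage only; default is false
preprocessed_columns: false      # prediction stage only; default is false
```

For the preprocessing, postprocessing, or optimization stages, the two optional flags would simply be left out, since they apply to the prediction stage only.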