Getting started
My first Forecast
1. Overview
Let’s walk through a simple example, using the Verteego platform to generate predictions.
You will need:
Training file: contains the data points on which the model will be trained
Model configuration: a YAML file (created directly in Verteego) that defines your predictive model
Prediction file: contains the points for which you want predictions
Let’s say we want to predict sales for a given item, point-of-sales, and date.
We will need to:
Add the relevant train and predict datasets
Set up our configuration file
Launch a forecast pipeline run
Analyze the results
Add calculated features
Add features from external sources
2. A simple example
Context: Let’s say we have two products (item_id), 123456 and 987654. They’re sold in two different points-of-sales, with ids (pos_id) 50 and 100.
Goal : For each product and shop, we have the total sold quantity per day, from 2019 to 2023, and we would like to predict for 2024.
For the sake of this introductory example, the data is relatively simple:
Quantity sold is the same every day for a given shop, with one low-season value and one high-season value (spring-summer vs. fall-winter)
There’s a bit of growth every year. Here is a graph of the data we have at our disposal:
Let’s use the Verteego platform to predict sales for 2024. In this instance, we can easily extrapolate the values for 2024, assuming all trends remain the same; this extrapolation is represented below as dots. We’ll be able to contrast our predictions with it.
3. First Pipeline
3.1. Creating train/test datasets to train a model
To start to build a model for our scenario, we will split our existing data into train and test datasets.
A model will be built on data from 2019 to 2022 inclusive, i.e. the “train dataset”, and will be tested on 2023, the “test dataset”.
Train and test datasets should have the same columns. The only tolerated difference is that your test dataset might not contain the column to predict; in that case, you will not get scores, as Verteego cannot evaluate your predictions against what really happened.
We will be able to compare our predictions to the real sales, and therefore evaluate the quality of our model.
Here is a graph of the period of time we’ll use for training and testing. The vertical black line marks the split between our train and test data.
In real life, we usually cannot compare our predictions to anything, since the target is in the future, but here we’ll be able to compare to our easy extrapolation and evaluate the performance of the model.
In real life, once a final model is decided, we would retrain the model on all of our data, 2023 included, and use this new model to predict 2024.
3.2. Adding datasets to the platform
Verteego supports several Connectors which you can use to create datasources and datasets.
You can also upload CSV files (less than 50 MB).
For our example, we will use two CSV files:
How can you upload the data?
Go to the Data -> Datasets section and add a dataset (select the “Upload file” option).
Here is a sample of what they contain:
| pos_id | item_id | sales_date | qty_sold |
| --- | --- | --- | --- |
| 50 | 123456 | 2019-01-01 | 15 |
| 50 | 123456 | 2019-01-02 | 15 |
| 50 | 123456 | 2019-01-03 | 15 |
Verteego will validate your dataset after upload. This may take a few minutes. Once your dataset’s status has changed to "Valid", it is correctly formatted and ready to be used.
If you click on your dataset, you can see an overview (number of rows, etc.), but also quite a bit of detail for each variable in the Variables tab.
3.3. Creating the first configuration
To create your first model, you need to create a Pipeline to experiment in.
In the Pipelines section, create a new Forecast pipeline, and set its initial configuration to the YAML example below.
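As a hypothetical sketch of what such a configuration could look like (the key names below are assumptions; only the algorithm, target, and feature names come from this tutorial, and the real schema is described on the Configuration page):

```yaml
# Hypothetical sketch of a minimal configuration — key names are
# assumptions; refer to the Configuration page for the real schema.
model:
  algorithm: xgboost   # the algorithm used for this first pipeline
target: qty_sold       # the column we want to predict
features:
  categorical:
    - item_id          # product identifier
    - pos_id           # point-of-sales identifier
```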
For our example, this configuration is very simple: the only features our XGBoost model has are the item_id and the pos_id.
That is, the only variables we will feed into our model are the IDs of the items and of the points-of-sales.
The configuration tells Verteego which features you want to use and which algorithms you want to train and score with.
You can find more details on the various sections on the Configuration page.
3.4. Launching our first pipeline run
Once you’ve configured your pipeline, and your Train and Test datasets are valid, you are ready to launch a pipeline run.
In Pipelines -> Runs, click on “Run”. You can then:
Name your pipeline run
Specify which dataset to use for training (“Train Dataset”)
Specify which dataset to use for predicting and testing (“Test Dataset”)
Hit “Create” and the pipeline will launch automatically. It will run several steps, described in more detail in the Concepts section.
3.5. Analyzing first model's results
Once the pipeline has finished running, you can look at different metrics calculated on training and on predictions (scores).
3.5.1. Looking at Training details
You can look at a specific Training step, by clicking on Pipelines -> Trainings, then on the training of your choice.
The Training page summarizes:
Parameters of this model
Metrics of the model (on the training set).
Feature importance, which you might find useful in figuring out whether a feature adds signal or noise. In our example, item_id is the most useful feature.
3.5.2. Analyzing the scores
You can head over to Pipelines -> Scores to see the scores of your model.
A set of standard metrics are calculated:
MAPE -> (Mean Absolute Percentage Error) measures the accuracy of a forecast as a percentage. It’s the average of the ratio between the error and the actual observation.
WARNING ⇒ The lower the actual observation is, the higher the MAPE could be!
MAE -> (Mean Absolute Error) measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation.
RMSE -> (Root Mean Squared Error) is a quadratic scoring rule that measures the average magnitude of the error. It’s the square root of the average of squared differences between prediction and actual observation.
R2 -> is a measure of the goodness of fit of a model. In regression, the R2 coefficient of determination is a statistical measure of how well the regression predictions approximate the real data points. An R2 of 1 indicates that the regression predictions perfectly fit the data.
MSE -> (Mean Squared Error) is the average squared difference between the observed values and the values predicted by the model.
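For reference, with $y_i$ the actual observation, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the observations, and $n$ the number of test points, these metrics are defined as:

$$
\begin{aligned}
\text{MAPE} &= \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \\
\text{MAE} &= \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \\
\text{MSE} &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad \text{RMSE} = \sqrt{\text{MSE}} \\
R^2 &= 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}
\end{aligned}
$$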
You can refer to Performance analysis and ML model improvement to see how to interpret the different metrics.
They’re not all displayed by default. You can change which columns you see by clicking on the icon on the top right of the table.
This will allow you to compare your pipeline runs to one another. You can choose to show only the Score metrics (calculated on your Test set), or only the Train metrics, or both.
In our example
In Pipelines -> Scores, we can see we’ve got two new results:
PIPELINENAME_score_on_prediction
PIPELINENAME_score_on_postprocessing
For now, we can ignore the score_on_postprocessing (useful when you have postprocessed your raw results, e.g. rounded predicted quantities or replaced negative values with 0).
Looking at the score_on_prediction results:
We see a Mean Absolute Error (MAE) of 94. This is the average error over our predictions, which is not quite good considering the average qty_sold over the test data is 259.
Our coefficient of determination, R2, sits at 0.58, giving us room to improve, since the theoretical best R2 is 1.
For a more in-depth coverage of metrics and model evaluation in general, please see Performance analysis and ML model improvement.
3.5.3. Getting our predictions
You can click on Pipelines -> Predictions, then on the prediction of your pipeline. In the top-right, you can then choose to download the prediction file (as a CSV), or export it to an existing DataSource.
Here is a plot of our first predictions, as a dashed line. We can see the model essentially predicted an average value over our training dataset, for each item and point-of-sales.
4. V2 of Pipeline - Adding calculated features
One of the most powerful features of Verteego is its Calculators.
Calculators can be used to generate additional features with only a few lines of configuration.
For example:
Calculate average quantities at different levels (aggregate_val_group_by_key, hierarchical_aggregate)
Automatically extract seasonal patterns
Perform PCA
Generate clusters
Use TSFresh to generate hundreds of time-related features
Get weather information using GPS coordinates
and a lot more
4.1. Adding date attributes
Let’s use a simple calculator on our example.
So far, we’re not using our sales_date feature, and we are not capturing seasonality at all. We can add date attributes that could prove relevant, such as the month. Let’s add the date_attributes calculator to our calculated_cols block, as such:
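As a hypothetical sketch (the exact calculator syntax may differ; input_col is an assumed parameter name):

```yaml
calculated_cols:
  - calculator: date_attributes
    input_col: sales_date   # assumed parameter name for the date column
    outputs:
      - month               # extract the month from each sales_date
```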
Let’s not forget to also make month available to our model, as a categorical feature, by updating the features block of our configuration:
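Continuing the sketch above, the features block would become:

```yaml
features:
  categorical:
    - item_id
    - pos_id
    - month   # the new date attribute
```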
4.2. Analyzing metrics
If we launch a new pipeline run with this updated configuration, we get improved metrics, not only on training but also on the prediction scores.
Metrics on Training:
MAE from 69 to 26
R2 from 0.69 to 0.96
Metrics on Prediction:
MAE from 93 to 66
R2 from 0.56 to 0.86.
Results are really improving. Let’s visualize them. We can see that the model is capturing the seasonal pattern, but it is failing to capture the growth, and essentially returns an average per month over our train set.
5. V3 of Pipeline - Capturing growth
Our model is capturing seasonality, thanks to our month feature, but it is failing to capture growth. This is due to our use of XGBoost, which does not inherently extrapolate, though various techniques can be used to capture that growth.
5.1. Switching to LightGBM model (growth capturing)
What we’ll do is switch our model to LightGBM, which can be used to capture linear relationships between numerical variables. Let’s update our model configuration as such:
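Sketching this change in the same hypothetical syntax (whether the platform forwards LightGBM parameters this way is an assumption; linear_tree is the LightGBM option that fits linear models inside tree leaves):

```yaml
model:
  algorithm: lightgbm     # was: xgboost
  params:
    linear_tree: true     # let LightGBM fit linear relationships in its leaves
```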
Let’s add an ordinal output to our date_attributes calculator, which will output a number for each date that can then be used in a linear equation. This is our update to the date_attributes calculator:
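In our hypothetical sketch:

```yaml
calculated_cols:
  - calculator: date_attributes
    input_col: sales_date
    outputs:
      - month
      - ordinal   # one increasing integer per date, usable in a linear equation
```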
And finally, let’s add that ordinal to the numerical features for our model:
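Still in the same sketch:

```yaml
features:
  categorical:
    - item_id
    - pos_id
    - month
  numerical:
    - ordinal   # numerical, so LightGBM can fit a linear trend on it
```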
5.2. Analyzing metrics
In our example
Running with this new configuration, we get improved results: an MAE of 21 and an R2 of 0.98!
We’ve managed to improve these predictions and capture some of the growth, but it is not quite linear. This is because the quantity sold is not linear in the sales date; it is linear per season: relative to the sales date, it is constant over each 6-month period and increases linearly from one season of the same type to the next.
Here is a visualization of our predictions:
6. V4 of Pipeline - Modeling high and low seasons
In order to capture the true linear relationship in our data between high/low seasons and the sales, we need to identify the seasons.
Let’s use:
A calculator to differentiate between high season and low season
An external dataset to assign a number to each season. That way, LightGBM should capture the linear relationship between season_nb and qty_sold, whether in high season or low season.
6.1. Tagging high season and low season
The is_summer feature
Let’s update our configuration to add a true/false feature for whether we’re in the summer season. We can add the following calculator to our calculators section:
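The exact calculator is not fixed here, so this sketch assumes a hypothetical expression-style calculator:

```yaml
calculated_cols:
  # ... date_attributes as before ...
  - name: is_summer             # assumed output column name
    calculator: expression      # hypothetical calculator name
    formula: "month >= 4 and month <= 9"   # April to September inclusive
```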
This will make available a new feature called is_summer, which is true during the months from April to September inclusive.
6.2. Importing external data
The season_nb feature
For a numerical identifier for our seasons, let’s create a new "season" dataset from a file whose content looks like this:
| Start date | End date | season_nb |
| --- | --- | --- |
| 2018-10-01 | 2019-04-01 | 1 |
| 2019-04-01 | 2019-10-01 | 2 |
| 2019-10-01 | 2020-04-01 | 3 |
| 2020-04-01 | 2020-10-01 | 4 |
| … | … | … |
Once the file is valid, let’s add it as a feature in our configuration, via a get_from_dataset calculator.
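A hypothetical sketch of that calculator (parameter names are assumptions; the join presumably matches each sales_date to the season whose date range contains it):

```yaml
calculated_cols:
  - calculator: get_from_dataset
    dataset: season        # the "season" dataset we just uploaded
    output_col: season_nb  # the numerical season identifier
```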
Making the features available to the model
Let’s use our new is_summer and season_nb features in our model, via the features section:
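In our running sketch:

```yaml
features:
  categorical:
    - item_id
    - pos_id
    - month
    - is_summer   # high/low season flag
  numerical:
    - ordinal
    - season_nb   # numerical season identifier from the external dataset
```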
7. Final predictions
7.1. Final prediction 2023
Running another pipeline run with our updated configuration, we get an MAE of 17 and an R2 of 0.98. We have managed to capture the seasonality as well as the growth of our training dataset. See a final visualization here:
We are now quite satisfied with our model. As a reminder, this is our pipeline configuration now:
Final configuration
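Assembled from the sketches above (key names remain assumptions; only the algorithm, calculators, and feature names come from this tutorial):

```yaml
model:
  algorithm: lightgbm
  params:
    linear_tree: true
target: qty_sold
calculated_cols:
  - calculator: date_attributes
    input_col: sales_date
    outputs: [month, ordinal]
  - name: is_summer
    calculator: expression      # hypothetical calculator name
    formula: "month >= 4 and month <= 9"
  - calculator: get_from_dataset
    dataset: season
    output_col: season_nb
features:
  categorical: [item_id, pos_id, month, is_summer]
  numerical: [ordinal, season_nb]
```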
7.2. 2024 predictions
Let’s retrain it on the full data at our disposal, with sales ranging from 2019 to 2023, and predict our true unknown data, for 2024. Our new train and predict datasets for this exercise are:
After validating these datasets, we'll initiate a new pipeline run.
We'll compare our predictions with what a basic extrapolation would anticipate, observing that our predictions align reasonably well with expected outcomes.
However, it's important to note that our current dataset exhibits straightforward linear growth and simple step behavior.
Real-world data typically presents more complexity, especially when forecasting for numerous products across numerous points of sale. In such cases, visual inspection of predictions becomes challenging, and we must rely more on model metrics for evaluation.
8. Summary
Throughout this tutorial, we’ve learned how:
to incorporate datasets
to set up a basic pipeline
to experiment with diverse models
to integrate additional features using calculators
to navigate the iterative process and refine our approach
Leveraging the comprehensive metrics provided by Verteego, we tracked our progress and identified instances where performance enhancements were achieved, leading us to a model that met our satisfaction.
Ultimately, we successfully generated predictions for 2024 that align with our existing knowledge and expectations.
9. What’s next
Right now, we are using a LightGBM model with few settings. We might want to adjust its parameters, for instance:
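These are standard LightGBM parameters; whether Verteego forwards them exactly this way is an assumption:

```yaml
model:
  algorithm: lightgbm
  params:
    num_leaves: 63        # more leaves => more expressive trees
    learning_rate: 0.05   # smaller steps, usually paired with more trees
    n_estimators: 500     # number of boosting rounds
```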
We could also configure different Objectives, or use Hyper-Parameter Tuning to find the best parameters for us.
We can also try different models, or a Meta-model: for more info and a full list of available models, see the Model page.
Happy modeling!