Getting started
My first Forecast
1. Overview
Let’s walk through a simple example, using the Verteego platform to generate predictions.
You will need:
Training file: contains the data points on which the model will be trained
Model configuration: a YAML file (created directly in Verteego) that defines your predictive model
Prediction file: contains the points for which you want predictions
Let’s say we want to predict sales for a given item, point-of-sales, and date.
We will need to:
Add the relevant train and predict datasets
Set up our configuration file
Launch a forecast pipeline run
Analyze the results
Add calculated features
Add features from external sources
2. A simple example
Context: Let’s say we have two products (item_id), 123456 and 987654. They’re sold in two different points-of-sales, with ids (pos_id) 50 and 100.
Goal : For each product and shop, we have the total sold quantity per day, from 2019 to 2023, and we would like to predict for 2024.
For the sake of this introductory example, the data is relatively simple:
Quantity sold is the same every day for a given shop, with one low-season value and one high-season value (spring-summer vs. fall-winter)
There’s a bit of growth every year. Here is a graph of the data we have at our disposal:
Let’s use the Verteego platform to predict sales for 2024. In this instance, we can easily extrapolate the values for 2024, assuming all trends remain the same; this extrapolation is represented below as dots. We’ll be able to contrast our predictions with it.
3. First Pipeline
3.1. Creating train/test datasets to train a model
To start to build a model for our scenario, we will split our existing data into train and test datasets.
A model will be built on data from 2019 to 2022 inclusive, i.e. the “train dataset”, and will be tested on 2023, the “test dataset”.
Train and test datasets should have the same columns. The only tolerated difference is that your test dataset might not contain the column to predict; in that case, you will not get scores, as Verteego cannot evaluate your predictions against what really happened.
We will be able to compare our predictions to the real sales, and therefore evaluate the quality of our model.
Here is a graph of the period of time we’ll use for training and testing. The vertical black line marks the split between our train and test data.
In real life, we usually cannot compare our predictions to anything, since the target is in the future, but here we’ll be able to compare to our easy extrapolation and evaluate the performance of the model.
In real life, once a final model is decided, we would retrain the model on all of our data, 2023 included, and use this new model to predict 2024.
3.2. Adding datasets to the platform
Verteego supports several Connectors which you can use to create datasources and datasets.
You can also upload CSV files (less than 50 MB).
For our example, we will use two CSV files:
How can you upload the data?
Go to the Data -> Datasets section and add a dataset (select the “Upload file” option).
Here is a sample of what they contain:
| pos_id | item_id | sales_date | qty_sold |
| --- | --- | --- | --- |
| 50 | 123456 | 2019-01-01 | 15 |
| 50 | 123456 | 2019-01-02 | 15 |
| 50 | 123456 | 2019-01-03 | 15 |
Verteego will validate your dataset after upload. This may take a few minutes. Once your dataset’s status has changed to "Valid", it is correctly formatted and ready to be used.
If you click on your dataset, you can see an overview (number of rows, etc.), but also quite a bit of detail for each variable in the Variables tab.
3.3. Creating the first configuration
To create your first model, you need to create a Pipeline to experiment in.
In the Pipelines section, create a new Forecast pipeline, and set its initial configuration to the YAML example below.
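As a hypothetical sketch of what such a configuration could look like (the key names below are assumptions; only the algorithm, target, and feature names come from this tutorial, and the real schema is described on the Configuration page):

```yaml
# Hypothetical sketch of a minimal configuration — key names are
# assumptions; refer to the Configuration page for the real schema.
model:
  algorithm: xgboost   # the algorithm used for this first pipeline
target: qty_sold       # the column we want to predict
features:
  categorical:
    - item_id          # product identifier
    - pos_id           # point-of-sales identifier
```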
For our example, this configuration is very simple: the only features our XGBoost model has are the item_id and the pos_id.
That is, the only variables we will feed into our model are the IDs of the items and of the points-of-sales.
The configuration tells Verteego which features you want to use and which algorithms you want to train and score with.
You can find more details on the various sections on the Configuration page.
3.4. Launching our first pipeline run
Once you’ve configured your pipeline, and your Train and Test datasets are valid, you are ready to launch a pipeline run.
In Pipelines -> Runs, click on “Run”. You can then:
Name your pipeline run
Specify which dataset to use for training (“Train Dataset”)
Specify which dataset to use for predicting and testing (“Test Dataset”)
Hit “Create” and the pipeline will launch automatically. It will run several steps, described in more detail in the Concepts section.
3.5. Analyzing first model's results
Once the pipeline has finished running, you can look at different metrics calculated on training and on predictions (scores).
3.5.1. Looking at Training details
You can look at a specific Training step, by clicking on Pipelines -> Trainings, then on the training of your choice.
The Training page summarizes:
Parameters of this model
Metrics of the model (on the training set).
Feature importance, which you might find useful in figuring out whether a feature adds signal or noise. In our example, item_id is the most useful feature.
3.5.2. Analyzing the scores
You can head over to Pipelines -> Scores to see the scores of your model.
A set of standard metrics are calculated:
MAPE -> (Mean Absolute Percentage Error) measures the accuracy of a forecast as a percentage. It’s the average of the ratio between the error and the actual observation.
WARNING ⇒ The lower the actual observation is, the higher the MAPE could be!
MAE -> (Mean Absolute Error) measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation.
RMSE -> (Root Mean Squared Error) is a quadratic scoring rule that measures the average magnitude of the error. It’s the square root of the average of squared differences between prediction and actual observation.
R2 -> is a measure of the goodness of fit of a model. In regression, the R2 coefficient of determination is a statistical measure of how well the regression predictions approximate the real data points. An R2 of 1 indicates that the regression predictions perfectly fit the data.
MSE -> (Mean Squared Error) is the average squared difference between the observed values and the values predicted by the model.
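For reference, with $y_i$ the actual observation, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the observations, and $n$ the number of test points, these metrics are defined as:

$$
\begin{aligned}
\text{MAPE} &= \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \\
\text{MAE} &= \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \\
\text{MSE} &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad \text{RMSE} = \sqrt{\text{MSE}} \\
R^2 &= 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}
\end{aligned}
$$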
You can refer to Performance analysis and ML model improvement to see how to interpret the different metrics.
They’re not all displayed by default. You can change which columns you see by clicking on the icon on the top right of the table.
This will allow you to compare your pipeline runs to one another. You can choose to show only the Score metrics (calculated on your Test set), or only the Train metrics, or both.
In our example
In Pipelines -> Scores, we can see we’ve got two new results:
PIPELINENAME_score_on_prediction
PIPELINENAME_score_on_postprocessing
For now, we can ignore the score_on_postprocessing (useful when you have postprocessed your raw results, e.g. rounded predicted quantities or replaced negative values with 0).
Looking at the score_on_prediction results:
We see a Mean Absolute Error (MAE) of 94. This is the average error over our predictions, which is not quite good considering the average qty_sold over the test data is 259.
Our coefficient of determination, R2, sits at 0.58, giving us room to improve, since the theoretical best R2 is 1.
For a more in-depth coverage of metrics and model evaluation in general, please see Performance analysis and ML model improvement.
3.5.3. Getting our predictions
You can click on Pipelines -> Predictions, then on the prediction of your pipeline. In the top-right, you can then choose to download the prediction file (as a CSV), or export it to an existing DataSource.
Here is a plot of our first predictions, as a dashed line. We can see the model essentially predicted an average value over our training dataset, for each item and point-of-sales.
4. V2 of Pipeline - Adding calculated features
One of the most powerful features of Verteego is its Calculators.
Calculators can be used to generate additional features with only a few lines of configuration.
For example:
Calculate average quantities at different levels (aggregate_val_group_by_key, hierarchical_aggregate)
Automatically extract seasonal patterns
Perform PCA
Generate clusters
Use TSFresh to generate hundreds of time-related features
Get weather information using GPS coordinates
and a lot more
4.1. Adding date attributes
Let’s use a simple calculator on our example.
So far, we’re not using our sales_date feature, and we are not capturing seasonality at all. We can add date attributes that could prove relevant, such as the month. Let’s add the date_attributes calculator to our calculated_cols block, as such:
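As a hypothetical sketch (the exact calculator syntax may differ; input_col is an assumed parameter name):

```yaml
calculated_cols:
  - calculator: date_attributes
    input_col: sales_date   # assumed parameter name for the date column
    outputs:
      - month               # extract the month from each sales_date
```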
Let’s not forget to also make month available to our model, as a categorical feature, by updating the features block of our configuration:
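Continuing the sketch above, the features block would become:

```yaml
features:
  categorical:
    - item_id
    - pos_id
    - month   # the new date attribute
```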
4.2. Analyzing metrics
If we launch a new pipeline run with this updated configuration, we get improved metrics, not only on training but also on the prediction scores.
Metrics on Training:
MAE from 69 to 26
R2 from 0.69 to 0.96
Metrics on Prediction:
MAE from 93 to 66
R2 from 0.56 to 0.86.
Results are really improving. Let’s visualize them. We can see that the model is capturing the seasonal pattern, but it is failing to capture the growth, and essentially returns an average per month over our train set.
5. V3 of Pipeline - Capturing growth
Our model is capturing seasonality, thanks to our month feature, but it is failing to capture growth. This is due to our use of XGBoost, which does not inherently extrapolate, though various techniques can be used to capture that growth.
5.1. Switching to LightGBM model (growth capturing)
What we’ll do is switch our model to LightGBM, which can be used to capture linear relationships between numerical variables. Let’s update our model configuration as such:
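Sketching this change in the same hypothetical syntax (whether the platform forwards LightGBM parameters this way is an assumption; linear_tree is the LightGBM option that fits linear models inside tree leaves):

```yaml
model:
  algorithm: lightgbm     # was: xgboost
  params:
    linear_tree: true     # let LightGBM fit linear relationships in its leaves
```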
Let’s add an ordinal output to our date_attributes calculator, which will output a number for each date that can then be used in a linear equation. This is our update to the date_attributes calculator:
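In our hypothetical sketch:

```yaml
calculated_cols:
  - calculator: date_attributes
    input_col: sales_date
    outputs:
      - month
      - ordinal   # one increasing integer per date, usable in a linear equation
```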
And finally, let’s add that ordinal to the numerical features for our model:
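Still in the same sketch:

```yaml
features:
  categorical:
    - item_id
    - pos_id
    - month
  numerical:
    - ordinal   # numerical, so LightGBM can fit a linear trend on it
```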
5.2. Analyzing metrics
In our example
Running with this new configuration, we get improved results: an MAE of 21 and an R2 of 0.98!
We’ve managed to improve these predictions and capture some of the growth, but it is not quite linear. This is because the quantity sold is not linear in the sales date; it is linear per season: relative to the sales date, it is constant over each 6-month period and increases linearly from one season of the same type to the next.
Here is a visualization of our predictions:
6. V4 of Pipeline - Modeling high and low seasons
In order to capture the true linear relationship in our data between high/low seasons and the sales, we need to identify the seasons.
Let’s use:
A calculator to differentiate between high season and low season
An external dataset to assign a number to each season. That way, LightGBM should capture the linear relationship between season_nb and qty_sold, whether in high season or low season.
6.1. Tagging high season and low season
The is_summer feature
Let’s update our configuration to add a true/false feature for whether we’re in the summer season. We can add the following calculator to our calculators section:
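The exact calculator is not fixed here, so this sketch assumes a hypothetical expression-style calculator:

```yaml
calculated_cols:
  # ... date_attributes as before ...
  - name: is_summer             # assumed output column name
    calculator: expression      # hypothetical calculator name
    formula: "month >= 4 and month <= 9"   # April to September inclusive
```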
This will make available a new feature called is_summer, which is true during the months from April to September inclusive.
6.2. Importing external data
The season_nb feature
For a numerical identifier for our seasons, let’s create a new "season" dataset from a file whose content looks like this:
| Start date | End date | season_nb |
| --- | --- | --- |
| 2018-10-01 | 2019-04-01 | 1 |
| 2019-04-01 | 2019-10-01 | 2 |
| 2019-10-01 | 2020-04-01 | 3 |
| 2020-04-01 | 2020-10-01 | 4 |
| … | … | … |
Once the file is valid, let’s add it as a feature in our configuration, via a get_from_dataset calculator.
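A hypothetical sketch of that calculator (parameter names are assumptions; the join presumably matches each sales_date to the season whose date range contains it):

```yaml
calculated_cols:
  - calculator: get_from_dataset
    dataset: season        # the "season" dataset we just uploaded
    output_col: season_nb  # the numerical season identifier
```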
Making the features available to the model
Let’s use our new is_summer and season_nb features in our model, via the features section:
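In our running sketch:

```yaml
features:
  categorical:
    - item_id
    - pos_id
    - month
    - is_summer   # high/low season flag
  numerical:
    - ordinal
    - season_nb   # numerical season identifier from the external dataset
```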
7. Final predictions
7.1. Final prediction 2023
Running another pipeline run with our updated configuration, we get an MAE of 17 and an R2 of 0.98. We have managed to capture the seasonality as well as the growth of our training dataset. See a final visualization here:
We are now quite satisfied with our model. As a reminder, this is our pipeline configuration now:
Final configuration
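Assembled from the sketches above (key names remain assumptions; only the algorithm, calculators, and feature names come from this tutorial):

```yaml
model:
  algorithm: lightgbm
  params:
    linear_tree: true
target: qty_sold
calculated_cols:
  - calculator: date_attributes
    input_col: sales_date
    outputs: [month, ordinal]
  - name: is_summer
    calculator: expression      # hypothetical calculator name
    formula: "month >= 4 and month <= 9"
  - calculator: get_from_dataset
    dataset: season
    output_col: season_nb
features:
  categorical: [item_id, pos_id, month, is_summer]
  numerical: [ordinal, season_nb]
```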
7.2. 2024 predictions
Let’s retrain it on the full data at our disposal, with sales ranging from 2019 to 2023, and predict our true unknown data, for 2024. Our new train and predict datasets for this exercise are:
After validating these datasets, we'll initiate a new pipeline run.
We'll compare our predictions with what a basic extrapolation would anticipate, observing that our predictions align reasonably well with expected outcomes.
However, it's important to note that our current dataset exhibits straightforward linear growth and simple step behavior.
Real-world data typically presents more complexity, especially when forecasting for numerous products across numerous points of sale. In such cases, visual inspection of predictions becomes challenging, and we must rely more on model metrics for evaluation.
8. Summary
Throughout this tutorial, we’ve learned how:
to incorporate datasets
to set up a basic pipeline
to experiment with diverse models
to integrate additional features using calculators
to navigate the iterative process and refine our approach
Leveraging the comprehensive metrics provided by Verteego, we tracked our progress and identified instances where performance enhancements were achieved, leading us to a model that met our satisfaction.
Ultimately, we successfully generated predictions for 2024 that align with our existing knowledge and expectations.
9. What’s next
Right now, we are using a LightGBM model with few settings. We might want to adjust its parameters, for instance:
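These are standard LightGBM parameters; whether Verteego forwards them exactly this way is an assumption:

```yaml
model:
  algorithm: lightgbm
  params:
    num_leaves: 63        # more leaves => more expressive trees
    learning_rate: 0.05   # smaller steps, usually paired with more trees
    n_estimators: 500     # number of boosting rounds
```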
We could also configure different Objectives, or use Hyper-Parameter Tuning to find the best parameters for us.
We can also try different models, or a Meta-model: for more info and a full list of available models, see the Model page.
Happy modeling!