Performance analysis and ML model improvement
1. Metrics
1.1. MAE
Formula
$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$
where $y_i$ is the actual observation, $\hat{y}_i$ the forecast, and $n$ the number of forecast data points.
Definition
MAE (Mean Absolute Error) measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation.
Example:

| Actual observation | Forecast | Absolute error |
| --- | --- | --- |
| 1 | 2 | 1 |
| 100 | 101 | 1 |
| 100 | 110 | 10 |

Here, MAE = (1 + 1 + 10) / 3 = 4.
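A minimal sketch of this computation in Python (using NumPy; the variable names are illustrative, not the actual dataset schema):

```python
import numpy as np

# Values taken from the example table above
actual = np.array([1, 100, 100])
forecast = np.array([2, 101, 110])

# MAE: average of the absolute differences between forecast and actual observation
mae = np.mean(np.abs(forecast - actual))
print(mae)  # (1 + 1 + 10) / 3 = 4.0
```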
1.2. RMSE
Formula
$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$
Definition
RMSE (Root Mean Squared Error) is a quadratic scoring rule that measures the average magnitude of the error. It’s the square root of the average of squared differences between prediction and actual observation.
Example:

| Actual observation | Forecast | Absolute error |
| --- | --- | --- |
| 1 | 2 | 1 |
| 100 | 101 | 1 |
| 100 | 110 | 10 |

Here, RMSE = √((1² + 1² + 10²) / 3) ≈ 5.83.
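The same example computed as RMSE, sketched below (illustrative code, same assumptions as above):

```python
import numpy as np

actual = np.array([1, 100, 100])
forecast = np.array([2, 101, 110])

# RMSE: square root of the mean of the squared errors
rmse = np.sqrt(np.mean((forecast - actual) ** 2))
print(rmse)  # sqrt((1 + 1 + 100) / 3) ≈ 5.83
```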
1.3. MAPE
Formula
$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\%$
Definition
MAPE (Mean Absolute Percentage Error) measures the accuracy of a forecast system as a percentage. It is the average of the ratio between the absolute error and the actual observation.
The lower the actual observation is, the higher the MAPE can be.
Example:

| Actual observation | Forecast | Absolute error | Absolute percentage error (%) |
| --- | --- | --- | --- |
| 1 | 2 | 1 | 100 |
| 100 | 101 | 1 | 1 |
| 100 | 110 | 10 | 10 |

Here, MAPE = (100% + 1% + 10%) / 3 = 37%.
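A sketch of the MAPE computation for this example (illustrative code; note that it assumes all actual observations are non-zero):

```python
import numpy as np

actual = np.array([1, 100, 100])
forecast = np.array([2, 101, 110])

# MAPE: average of the absolute errors expressed as a percentage of the actual values
# (only valid when no actual observation is equal to zero)
mape = np.mean(np.abs(forecast - actual) / actual) * 100
print(mape)  # (100 + 1 + 10) / 3 = 37.0
```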
1.4. wMAPE
Formula
$\mathrm{wMAPE} = \frac{\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|}{\sum_{i=1}^{n} y_i} \times 100\%$
Definition
wMAPE (Weighted Mean Absolute Percentage Error) is a variant of MAPE in which the errors are weighted by the values of the actual observations (e.g. in sales forecasting, errors are weighted by sales volume).
Example:

| Actual observation | Forecast | Absolute error | Absolute percentage error (%) |
| --- | --- | --- | --- |
| 1 | 2 | 1 | 100 |
| 100 | 101 | 1 | 1 |
| 100 | 110 | 10 | 10 |

Here, wMAPE = (1 + 1 + 10) / (1 + 100 + 100) ≈ 6%, whereas MAPE = 37%.
Due to its small real quantity, the first data point contributes heavily to the high MAPE. By weighting the percentage error by the real quantity, we reduce the impact of this data point and obtain a more realistic view of the percentage error.
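A sketch of the corresponding wMAPE computation (illustrative code):

```python
import numpy as np

actual = np.array([1, 100, 100])
forecast = np.array([2, 101, 110])

# wMAPE: total absolute error divided by the total actual quantity
wmape = np.sum(np.abs(forecast - actual)) / np.sum(actual) * 100
print(wmape)  # 12 / 201 ≈ 5.97
```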
1.5. R²
Formula
$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$
where $\bar{y}$ is the mean of the actual observations.
Definition
R² is a measure of the goodness of fit of a model. In regression, the R² coefficient of determination is a statistical measure of how well the regression predictions approximate the real data points. An R² of 1 indicates that the regression predictions perfectly fit the data.
Values of R² below 0 occur when the model fits the data worse than a horizontal hyperplane at a height equal to the mean of the observed data, i.e. worse than simply predicting the mean of the actual observations for every point.
Example:

| Actual observation | Forecast | Absolute error | Mean of actual observations |
| --- | --- | --- | --- |
| 1 | 2 | 1 | 67 |
| 100 | 101 | 1 | 67 |
| 100 | 110 | 10 | 67 |

R² = 1 − (1² + 1² + 10²) / ((1 − 67)² + (100 − 67)² + (100 − 67)²) ≈ 0.98
⇒ here, 98% of the variation of the actual observations is explained by the model.
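A sketch of the R² computation for this example (illustrative code; scikit-learn's `r2_score` would give the same result):

```python
import numpy as np

actual = np.array([1, 100, 100])
forecast = np.array([2, 101, 110])

# Residual sum of squares: what the model fails to explain
ss_res = np.sum((actual - forecast) ** 2)           # 1 + 1 + 100 = 102
# Total sum of squares: variability around the mean of the actual values (67)
ss_tot = np.sum((actual - np.mean(actual)) ** 2)    # 66² + 33² + 33² = 6534
r2 = 1 - ss_res / ss_tot
print(r2)  # ≈ 0.98
```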
1.6. Global overprediction
Overprediction corresponds to the total amount by which forecasts exceed the actual observations. It can be expressed in units or as a percentage.
Example:

| Actual observation | Forecast | Absolute error |
| --- | --- | --- |
| 1 | 2 | 1 |
| 100 | 99 | 1 |
| 100 | 110 | 10 |

Here, global overprediction = (2 − 1) + (110 − 100) = 11 units.
1.7. Global underprediction
Underprediction corresponds to the total amount by which forecasts fall short of the actual observations. It can be expressed in units or as a percentage.
Example:

| Actual observation | Forecast | Absolute error |
| --- | --- | --- |
| 1 | 2 | 1 |
| 100 | 99 | 1 |
| 100 | 110 | 10 |

Here, global underprediction = 100 − 99 = 1 unit.
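A sketch of how both global overprediction and underprediction could be computed for the example above (illustrative code; expressing them as a percentage of the total actual volume is one possible convention):

```python
import numpy as np

actual = np.array([1, 100, 100])
forecast = np.array([2, 99, 110])

error = forecast - actual

# Global overprediction: total amount by which forecasts exceed actual observations
overprediction = error[error > 0].sum()    # 1 + 10 = 11 units
# Global underprediction: total amount by which forecasts fall short of actual observations
underprediction = -error[error < 0].sum()  # 1 unit

# Expressed as a percentage of the total actual volume
print(overprediction, 100 * overprediction / actual.sum())    # 11 units, ≈ 5.5 %
print(underprediction, 100 * underprediction / actual.sum())  # 1 unit, ≈ 0.5 %
```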
2. How to interpret metrics
To properly assess the performance of a sales forecast model, several metrics need to be combined.
2.1. MAE versus RMSE
The first level of analysis is based on the error expressed in units, i.e. MAE and RMSE.
The MAE reflects how many units of error the model makes on average for each forecast. By squaring the errors, the RMSE gives more weight to large errors: an RMSE much higher than the MAE indicates that the model makes large errors on a few forecast data points.
If you are in the latter situation, we encourage you to:
Identify these large-error forecast data points
Analyse their corresponding historical time series
Try to identify potential areas of improvement for your ML model.
These metrics can be used for all forecast data points.
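To illustrate how the RMSE penalizes large errors compared to the MAE, here is a small sketch with illustrative values: two forecast sets with the same total absolute error, one spread evenly and one concentrated in a single large error.

```python
import numpy as np

def mae(actual, forecast):
    return np.mean(np.abs(forecast - actual))

def rmse(actual, forecast):
    return np.sqrt(np.mean((forecast - actual) ** 2))

actual = np.array([100, 100, 100, 100])

# Four small errors of 5 units each
evenly_spread = np.array([105, 95, 105, 95])
# The same total absolute error (20 units) concentrated in one forecast
one_large_error = np.array([100, 100, 100, 120])

print(mae(actual, evenly_spread), rmse(actual, evenly_spread))      # 5.0, 5.0
print(mae(actual, one_large_error), rmse(actual, one_large_error))  # 5.0, 10.0
```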
2.2. MAE versus MAPE
As explained before, in contrast to the MAE, the MAPE expresses the error as a percentage of the real value. These two metrics are clearly complementary, as a low MAE does not tell you by itself whether the result is good or not.
For instance, an MAE of 100 does not mean the same thing whether the MAPE is 1% or 100%:
In the first case, MAE = 100 is a good performance, as it represents only a 1% error when the forecast is compared to the real value.
In the second case, MAPE = 100% means that the error is as large as the real value itself, so MAE = 100 is quite a big error!
We encourage you to always use these two metrics together, as it is not possible to properly assess the model accuracy using only one of them.
It is the minimal combination of metrics to use to correctly interpret forecast results.
Note that MAPE cannot be computed when the real value is equal to 0. In this case, only MAE and RMSE can be used!
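A sketch of how the two metrics could be reported together, excluding zero actual observations from the MAPE (illustrative code):

```python
import numpy as np

actual = np.array([0, 1, 100, 100])
forecast = np.array([2, 2, 101, 110])

# MAE can be computed on every data point
mae = np.mean(np.abs(forecast - actual))

# MAPE is undefined when the actual value is 0, so those points are excluded
mask = actual != 0
mape = np.mean(np.abs(forecast[mask] - actual[mask]) / actual[mask]) * 100

print(mae)   # (2 + 1 + 1 + 10) / 4 = 3.5
print(mape)  # computed on the 3 non-zero points only: 37.0
```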
2.3. MAPE versus wMAPE
These two metrics enable you to assess the percentage error. However, in comparison to the MAPE, the wMAPE weighs the percentage error by the ratio of the real quantity to the total real quantity.
The primary advantage of the wMAPE is its ability to mitigate large percentage errors made on small real quantities, providing a more realistic overview of the percentage error.
2.4. MAE versus R²
The R² is a classic machine learning metric that allows us to assess how much variability is explained by the ML model.
It achieves this by comparing the model's ability to account for variability with a scenario where the average of real values is used as the forecast (resulting in no variability, as all forecast values would be the same).
For example, if our ML model yields MAE = 2 but R² = 0, it indicates that the variability present in real values is not explained by the features used in the model, despite the seemingly low MAE.
Once again, similar to MAPE, R² is one of the key metrics to combine with MAE.
3. Model scoring at different levels
Using multiple metrics is important but not sufficient for obtaining a realistic view of a model's performance.
Additionally, it is crucial to combine various levels of analysis to understand the strengths and weaknesses of an ML model.
Model performance is often analyzed from different perspectives:
either by computing metrics on forecasts aggregated according to specific features, such as product category or higher-level time features
or by assessing metrics on different subsets, such as segregating actual observations based on their values (e.g., zero or positive).
3.1. Performance analysis on aggregated forecasts
The initial approach to analyzing forecast accuracy involves calculating various metrics (as discussed in the previous section) directly at the forecast resolution. While it's valuable to compute metrics at the finest level, we can also assess model performance at a higher resolution using different categorical features.
For example, if your forecasts are generated at the "product" level, as illustrated below, it would be beneficial to evaluate model performance at the product category level.
At product level (model resolution):

| Product | Category | Actual observation | Forecast | Error (forecast − actual) | Absolute error |
| --- | --- | --- | --- | --- | --- |
| 1 | A | 2 | 1 | -1 | 1 |
| 2 | A | 100 | 101 | 1 | 1 |
| 3 | B | 100 | 110 | 10 | 10 |
At the product category level:

| Category | Actual observation | Forecast | Error (forecast − actual) |
| --- | --- | --- | --- |
| A | 102 | 102 | 0 |
| B | 100 | 110 | 10 |
This type of analysis enables you to determine whether global variability has been modeled accurately or not.
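A possible way to build this category-level view with pandas (the column names "category", "actual" and "forecast" are assumptions about the dataset layout, not the actual schema):

```python
import pandas as pd

# Product-level forecasts (model resolution)
df = pd.DataFrame({
    "product":  [1, 2, 3],
    "category": ["A", "A", "B"],
    "actual":   [2, 100, 100],
    "forecast": [1, 101, 110],
})

# Sum actual observations and forecasts per category,
# then compute the error on the aggregated quantities
agg = df.groupby("category", as_index=False)[["actual", "forecast"]].sum()
agg["error"] = agg["forecast"] - agg["actual"]
print(agg)
#   category  actual  forecast  error
# 0        A     102       102      0
# 1        B     100       110     10
```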
3.2. Performance analysis on different scopes
3.2.1. Real values
In sales forecast datasets, two types of sales events can be found:
real sales event (i.e. positive real value)
no sale event (i.e. real value equal to zero)
Thus, we can analyze sales forecasts by categorizing real values and calculating metrics for sales events on one hand, and non-sale events on the other.
The latter are not initially present in the raw sales dataset but are added during the creation/formatting of training and test/prediction datasets (see section 'Cleaning and formatting sales data').
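A sketch of how metrics could be computed separately on these two scopes (illustrative code and column names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "actual":   [0, 0, 1, 100, 100],
    "forecast": [1, 0, 2, 101, 110],
})

# Non-sale events: actual observation equal to zero
no_sale = df[df["actual"] == 0]
# Real sales events: positive actual observation
sales = df[df["actual"] > 0]

for label, scope in [("no sale", no_sale), ("sale", sales)]:
    mae = np.mean(np.abs(scope["forecast"] - scope["actual"]))
    print(label, "MAE =", mae)
# no sale MAE = 0.5
# sale MAE = 4.0
```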
3.2.2. Categorical features
Another useful analysis is to assess model accuracy for each value of a specific categorical column. For instance, you may want to evaluate model performance for each product category:
| Product | Category | Actual observation | Forecast | Error (forecast − actual) | Absolute error |
| --- | --- | --- | --- | --- | --- |
| 1 | A | 2 | 1 | -1 | 1 |
| 2 | A | 100 | 101 | 1 | 1 |
| 3 | B | 100 | 110 | 10 | 10 |
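A sketch of how such a per-category breakdown could be computed (illustrative code and column names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "product":  [1, 2, 3],
    "category": ["A", "A", "B"],
    "actual":   [2, 100, 100],
    "forecast": [1, 101, 110],
})

# Compute MAE and wMAPE separately for each product category
for category, group in df.groupby("category"):
    abs_error = np.abs(group["forecast"] - group["actual"])
    mae = abs_error.mean()
    wmape = 100 * abs_error.sum() / group["actual"].sum()
    print(category, "MAE =", mae, "wMAPE =", round(wmape, 1))
# A MAE = 1.0 wMAPE = 2.0
# B MAE = 10.0 wMAPE = 10.0
```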
Here, it's evident that our performance is stronger in category A compared to category B. Consequently, we should focus our efforts on enhancing the model's performance in category B.
This could involve:
analyzing the composition of category B products
studying time series data
identifying potential outliers
implementing appropriate adjustments.
4. Model generation and improvement
Develop an initial model using minimal configuration parameters (refer to 'Minimal configuration').
Evaluate the performance of the initial model by creating plots comparing forecasts to actual observations and computing accuracy metrics as outlined previously (including various metrics, scoring across different scopes, and aggregation).
Identify areas of lowest performance and conduct a detailed analysis of historical data related to these areas to pinpoint potential outliers or irrelevant data that require cleaning.
Determine whether the model's performance is balanced. If not, assess whether the model tends to overpredict or underpredict.
a. In case of overprediction:
Examine for negative trends in historical data.
Assess if there's an underrepresentation of non-sale events in the training dataset compared to the test dataset.
Check if the training dataset is balanced with respect to the categories where the model overpredicts.
b. In case of underprediction:
Investigate for growth trends in historical data.
Determine if there's an overrepresentation of non-sale events in the training dataset compared to the test dataset.
Verify that the training dataset is balanced with respect to the categories where the model underpredicts.
After cleaning the dataset, enhance the model by exploring:
Different algorithms, tried sequentially or combined in a meta-model. For instance, if trends are discernible in historical data, the lightgbm algorithm may yield better results (see the sketch after this list).
Adjusting the model resolution if certain features exhibit significant variability, leading to substantial overprediction or underprediction. Creating a model for each value of such features could help mitigate this issue.
Incorporating external data (refer to 'External Data') to further improve model performance.
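As an illustration of the first point, a minimal sketch of training a LightGBM regressor as an alternative algorithm (the toy features and target below are placeholders, not the actual dataset):

```python
import lightgbm as lgb
import numpy as np

# Toy training data: replace with your own feature matrix and sales target
rng = np.random.default_rng(0)
X_train = rng.random((500, 4))
y_train = X_train @ np.array([3.0, 0.0, 1.5, 0.5]) + rng.normal(0, 0.1, 500)

# Gradient-boosted trees can capture trends that simpler algorithms may miss
model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
model.fit(X_train, y_train)

# Forecast on new data, then score it with the metrics described above
X_test = rng.random((10, 4))
forecast = model.predict(X_test)
print(forecast)
```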