Performance analysis and ML model improvement

1. Metrics

1.1. MAE

Formula

MAE = \frac{1}{n} \times \sum_{i=1}^n |y_{ref_i} - y_{pred_i}|

Definition

MAE (Mean Absolute Error) measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation.

Example:

| actual observation | forecast value | absolute error (units) |
| --- | --- | --- |
| 1 | 2 | 1 |
| 100 | 101 | 1 |
| 100 | 110 | 10 |

MAE = (1 + 1 + 10)/3 = 4
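As a quick check, the MAE of this example can be reproduced with a few lines of Python (a minimal sketch using numpy; `y_ref` and `y_pred` are simply the two columns of the table above):

```python
import numpy as np

y_ref = np.array([1, 100, 100])    # actual observations
y_pred = np.array([2, 101, 110])   # forecast values

# Mean of the absolute differences between actuals and forecasts
mae = np.mean(np.abs(y_ref - y_pred))
print(mae)  # 4.0
```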

1.2. RMSE

Formula

RMSE = \sqrt{\frac{1}{n} \times \sum_{i=1}^n (y_{ref_i} - y_{pred_i})^2}

Definition

RMSE (Root Mean Squared Error) is a quadratic scoring rule that measures the average magnitude of the error. It’s the square root of the average of squared differences between prediction and actual observation.

Example:

| actual observation | forecast value | absolute error (units) |
| --- | --- | --- |
| 1 | 2 | 1 |
| 100 | 101 | 1 |
| 100 | 110 | 10 |

RMSE = \sqrt{(1^2 + 1^2 + 10^2)/3} = 5.83
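Similarly, a minimal numpy sketch for the RMSE of this example:

```python
import numpy as np

y_ref = np.array([1, 100, 100])
y_pred = np.array([2, 101, 110])

# Square root of the mean of the squared errors
rmse = np.sqrt(np.mean((y_ref - y_pred) ** 2))
print(round(rmse, 2))  # 5.83
```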

1.3. MAPE

Formula

MAPE = \frac{1}{n} \times \sum_{i=1}^n \left|\frac{y_{ref_i} - y_{pred_i}}{y_{ref_i}}\right| \times 100\%

Definition

MAPE (Mean Absolute Percentage Error) measures the accuracy of a forecast system as a percentage. It is the average of the ratios between the absolute error and the actual observation.

Example:

| actual observation | forecast value | absolute error (units) | absolute error (%) |
| --- | --- | --- | --- |
| 1 | 2 | 1 | 100 |
| 100 | 101 | 1 | 1 |
| 100 | 110 | 10 | 10 |

MAPE = (100 + 1 + 10)/3 = 37\%
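A minimal numpy sketch for the MAPE of this example (note that this formula is undefined when an actual observation equals zero):

```python
import numpy as np

y_ref = np.array([1, 100, 100])
y_pred = np.array([2, 101, 110])

# Mean of the absolute percentage errors
mape = np.mean(np.abs((y_ref - y_pred) / y_ref)) * 100
print(round(mape, 2))  # 37.0
```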

1.4. wMAPE

Formula

wMAPE = \frac{\sum_{i=1}^n |y_{ref_i} - y_{pred_i}|}{\sum_{i=1}^n |y_{ref_i}|}

Definition

wMAPE (Weighted Mean Absolute Percentage Error) is a variant of MAPE in which errors are weighted by the values of the actual observations (e.g., in sales forecasting, errors are weighted by sales volume).

Example:

| actual observation | forecast value | absolute error (units) | absolute error (%) |
| --- | --- | --- | --- |
| 1 | 2 | 1 | 100 |
| 100 | 101 | 1 | 1 |
| 100 | 110 | 10 | 10 |

wMAPE = (1 + 1 + 10)/(1 + 100 + 100) = 12/201 \approx 5.97\%

Because of its small quantity, the first data point contributes heavily to the high MAPE. By weighting the percentage error by the real quantity, the impact of this data point is reduced and we get a more realistic view of the percentage error.
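A minimal numpy sketch of the wMAPE computation (total absolute error divided by total actual quantity):

```python
import numpy as np

y_ref = np.array([1, 100, 100])
y_pred = np.array([2, 101, 110])

# Total absolute error relative to the total actual quantity
wmape = np.sum(np.abs(y_ref - y_pred)) / np.sum(np.abs(y_ref)) * 100
print(round(wmape, 2))  # 5.97
```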

1.5. R²

Formula

R^2 = 1 - \frac{\sum_{i=1}^n (y_{ref_i} - y_{pred_i})^2}{\sum_{i=1}^n (y_{ref_i} - \overline{y}_{ref})^2}

Definition

R² is a measure of the goodness of fit of a model. In regression, the R² coefficient of determination is a statistical measure of how well the regression predictions approximate the real data points. An R² of 1 indicates that the regression predictions perfectly fit the data.

Values of R² below 0 occur when the model fits the data worse than a baseline that always predicts the mean of the observed data (equivalent to a horizontal hyperplane at a height equal to that mean).

Example:

| actual observation | forecast value | absolute error (units) | mean of actual observations |
| --- | --- | --- | --- |
| 1 | 2 | 1 | 67 |
| 100 | 101 | 1 | 67 |
| 100 | 110 | 10 | 67 |

R^2 = 1 - \frac{1^2 + 1^2 + 10^2}{(1-67)^2 + (100-67)^2 + (100-67)^2} = 1 - \frac{102}{6534} \approx 0.98

⇒ here, about 98% of the variation of the real data points is explained by the model.
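The same computation in a minimal numpy sketch (`sklearn.metrics.r2_score` would return the same value):

```python
import numpy as np

y_ref = np.array([1, 100, 100])
y_pred = np.array([2, 101, 110])

ss_res = np.sum((y_ref - y_pred) ** 2)           # residual sum of squares: 102
ss_tot = np.sum((y_ref - y_ref.mean()) ** 2)     # total sum of squares: 6534
r2 = 1 - ss_res / ss_tot
print(round(r2, 2))  # 0.98
```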

1.6. Global overprediction

Overprediction corresponds to the total amount by which forecasts exceed the actual observations. It can be expressed in units or as a percentage.

Example:

| actual observation | forecast value | absolute error (units) |
| --- | --- | --- |
| 1 | 2 | 1 |
| 100 | 99 | 1 |
| 100 | 110 | 10 |

overprediction (units) = 10 + 1 = 11
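A minimal numpy sketch of the overprediction computation (only the positive parts of the forecast errors are summed):

```python
import numpy as np

y_ref = np.array([1, 100, 100])
y_pred = np.array([2, 99, 110])

# Sum of the forecast excess over the actuals, ignoring underpredicted points
overprediction = np.sum(np.clip(y_pred - y_ref, 0, None))
print(overprediction)  # 11
```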

1.7. Global underprediction

Underprediction corresponds to the total amount by which forecasts fall short of the actual observations. It can be expressed in units or as a percentage.

Example:

| actual observation | forecast value | absolute error (units) |
| --- | --- | --- |
| 1 | 2 | 1 |
| 100 | 99 | 1 |
| 100 | 110 | 10 |

underprediction (units) = 1
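And the symmetric sketch for underprediction:

```python
import numpy as np

y_ref = np.array([1, 100, 100])
y_pred = np.array([2, 99, 110])

# Sum of the shortfall of the forecasts below the actuals
underprediction = np.sum(np.clip(y_ref - y_pred, 0, None))
print(underprediction)  # 1
```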

2. How to interpret metrics

To properly assess the performance of a sales forecast model, several metrics need to be combined.

2.1. MAE versus RMSE

The first level of analysis is based on the error expressed in units, i.e. the MAE and the RMSE.

The MAE reflects how many units of error the model makes on average for each forecast. Because it squares the errors, the RMSE gives more weight to large errors.

The closer the RMSE is to the MAE, the more evenly the error is distributed.

If the RMSE is much higher than the MAE, it means that some data points of your test set have very large errors.

If you are in the latter situation, we encourage you to:

  • Identify these large error forecast data points

  • Analyse their corresponding historical time series

  • Try to identify potential areas of improvement of your ML model.

These metrics can be computed over all forecast data points.
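For example, a minimal pandas sketch for isolating the data points whose errors drive the gap between RMSE and MAE (the columns `y_ref` and `y_pred` and the threshold are illustrative):

```python
import pandas as pd

# Illustrative forecast results at the model resolution
df = pd.DataFrame({
    "product_id": [1, 2, 3],
    "y_ref": [1, 100, 100],
    "y_pred": [2, 101, 110],
})
df["abs_error"] = (df["y_ref"] - df["y_pred"]).abs()

mae = df["abs_error"].mean()
# Threshold chosen for illustration: errors more than twice the MAE
large_errors = df[df["abs_error"] > 2 * mae]
print(large_errors)
```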

2.2. MAE versus MAPE

As explained above, unlike the MAE, the MAPE expresses the error as a percentage of the real value. These two metrics are clearly complementary: a low MAE does not tell you by itself whether the result is good or not.

For instance, an MAE of 100 does not mean the same thing whether the MAPE is 1% or 100%:

  • In the first case, MAE = 100 is a good performance, as it represents only a 1% error relative to the real value.

  • In the second case, MAPE = 100% means that the error is as large as the real value itself (for example, predicting 200 when the real value is 100), which makes MAE = 100 quite a big error!

We encourage you to always use these two metrics together, as it is not possible to properly assess model accuracy using only one of them.

This is the minimal combination of metrics needed to correctly interpret forecast results.

2.3. MAPE versus wMAPE

These two metrics enable you to assess the percentage error. However, in comparison to the MAPE, the wMAPE weights the percentage error by the ratio of the real quantity to the total real quantity.

The primary advantage of the wMAPE is its ability to mitigate large percentage errors made on small real quantities, providing a more realistic overview of the percentage error.

By comparing the MAPE and wMAPE, you can evaluate the model's error on small quantities.

The closer the MAPE is to the wMAPE, the better the model performs on small quantities.

2.4. MAE versus R²

The R² is a classic machine learning metric that allows us to assess how much of the variability is explained by the ML model.

It achieves this by comparing the model's ability to account for variability with a scenario where the average of real values is used as the forecast (resulting in no variability, as all forecast values would be the same).

For example, if our ML model yields MAE = 2 but R² = 0, it indicates that the variability present in real values is not explained by the features used in the model, despite the seemingly low MAE.

When dealing with an R² value close to 0, it's advisable to remove the features used in the model to avoid unnecessary computations and search for more relevant features.

Once again, similar to MAPE, R² is one of the key metrics to combine with MAE.

3. Model scoring at different levels

Using multiple metrics is important but not sufficient for obtaining a realistic view of a model's performance.

Additionally, it is crucial to combine various levels of analysis to understand the strengths and weaknesses of an ML model.

Model performance is often analyzed from different perspectives:

  • either by computing metrics on forecasts aggregated according to specific features, such as product category or higher-level time features

  • or by assessing metrics on different subsets, such as segregating actual observations based on their values (e.g., zero or positive).

3.1. Performance analysis on aggregated forecasts

The initial approach to analyzing forecast accuracy involves calculating various metrics (as discussed in the previous section) directly at the forecast resolution. While it's valuable to compute metrics at the finest level, we can also assess model performance at a higher resolution using different categorical features.

For example, if your forecasts are generated at the "product" level, as illustrated below, it would be beneficial to evaluate model performance at the product category level.

  • At product level (model resolution):

| product_id | product_category | actual observation | forecast value | error (units) | absolute error (units) |
| --- | --- | --- | --- | --- | --- |
| 1 | A | 2 | 1 | -1 | 1 |
| 2 | A | 100 | 101 | 1 | 1 |
| 3 | B | 100 | 110 | 10 | 10 |

MAE(Product) = (1 + 1 + 10)/3 = 4

  • At the product category level:

| product_category | actual observation | forecast value | error (units) |
| --- | --- | --- | --- |
| A | 102 | 102 | 0 |
| B | 100 | 110 | 10 |

MAE(ProductCategory) = (0 + 10)/2 = 5

This type of analysis enables you to determine whether global variability has been modeled accurately or not.
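A minimal pandas sketch of this two-level scoring (column names are illustrative and match the tables above):

```python
import pandas as pd

df = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_category": ["A", "A", "B"],
    "y_ref": [2, 100, 100],     # actual observations
    "y_pred": [1, 101, 110],    # forecast values
})

# MAE at the model resolution (product level)
mae_product = (df["y_ref"] - df["y_pred"]).abs().mean()       # 4.0

# MAE on forecasts aggregated at the product_category level
agg = df.groupby("product_category")[["y_ref", "y_pred"]].sum()
mae_category = (agg["y_ref"] - agg["y_pred"]).abs().mean()    # 5.0

print(mae_product, mae_category)
```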

3.2. Performance analysis on different scopes

3.2.1. Real values

In sales forecast datasets, two types of sales events can be found:

  • real sales event (i.e. positive real value)

  • no sale event (i.e. real value equal to zero)

Thus, we can analyze sales forecasts by categorizing real values and calculating metrics for sales events on one hand, and non-sale events on the other.

The latter are not initially present in the raw sales dataset but are added during the creation/formatting of training and test/prediction datasets (see section 'Cleaning and formatting sales data').

Therefore, it is crucial to evaluate how the model performs specifically on these 'artificial' data points added to achieve a more balanced dataset and to prevent over-prediction.
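A minimal pandas sketch of this split (the values are illustrative; the rows with `y_ref == 0` stand for the 'artificial' no-sale events):

```python
import pandas as pd

df = pd.DataFrame({
    "y_ref": [0, 0, 3, 120],     # actual observations, including no-sale events
    "y_pred": [1, 0, 2, 110],    # forecast values
})
abs_error = (df["y_ref"] - df["y_pred"]).abs()

# Score no-sale events and real sales events separately
mae_no_sale = abs_error[df["y_ref"] == 0].mean()   # 0.5
mae_sales = abs_error[df["y_ref"] > 0].mean()      # 5.5
print(mae_no_sale, mae_sales)
```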

3.2.2. Categorical features

Another useful analysis is to assess model accuracy for each value of a specific column. For instance, you may want to evaluate model performance for each product category (a pandas sketch is given at the end of this subsection):

| product_id | product_category | actual observation | forecast value | error (units) | absolute error (units) |
| --- | --- | --- | --- | --- | --- |
| 1 | A | 2 | 1 | -1 | 1 |
| 2 | A | 100 | 101 | 1 | 1 |
| 3 | B | 100 | 110 | 10 | 10 |

MAE(A) = (|-1| + |+1|)/2 = 1
MAE(B) = |10|/1 = 10

Here, it's evident that our performance is stronger in category A compared to category B. Consequently, we should focus our efforts on enhancing the model's performance in category B.

This could involve:

  • analyzing the composition of category B products

  • studying time series data

  • identifying potential outliers

  • implementing appropriate adjustments.
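A minimal pandas sketch of this per-category scoring (same illustrative columns as in the table above):

```python
import pandas as pd

df = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_category": ["A", "A", "B"],
    "y_ref": [2, 100, 100],
    "y_pred": [1, 101, 110],
})
df["abs_error"] = (df["y_ref"] - df["y_pred"]).abs()

# MAE computed separately for each product category
mae_per_category = df.groupby("product_category")["abs_error"].mean()
print(mae_per_category)  # A: 1.0, B: 10.0
```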

4. Model generation and improvement

  1. Develop an initial model using minimal configuration parameters (refer to 'Minimal configuration').

  2. Evaluate the performance of the initial model by creating plots comparing forecasts to actual observations and computing accuracy metrics as outlined previously (including various metrics, scoring across different scopes, and aggregation).

  3. Identify areas of lowest performance and conduct a detailed analysis of historical data related to these areas to pinpoint potential outliers or irrelevant data that require cleaning.

  4. Determine whether the model's performance is balanced. If not, assess whether the model tends to overpredict or underpredict.

    a. In case of overprediction:

    • Examine for negative trends in historical data.

    • Assess if there's an underrepresentation of non-sale events in the training dataset compared to the test dataset.

    • Check if the training dataset is balanced concerning categories where the model exhibits overperformance.

    b. In case of underprediction:

    • Investigate for growth trends in historical data.

    • Determine if there's an overrepresentation of non-sale events in the training dataset compared to the test dataset.

    • Verify if the training dataset is balanced regarding categories where the model demonstrates underperformance.

  5. After cleaning the dataset, enhance the model by exploring:

    • Different algorithms sequentially or using a meta-model. For instance, if trends are discernible in historical data, the lightgbm algorithm may yield better results.

    • Adjusting model resolution if certain features exhibit significant variability, leading to substantial overprediction or underprediction. Creating models for each value of such features could help mitigate this issue.

    • Incorporating external data (refer to 'External Data') to improve model performance further.
