# Performance analysis and ML model improvement

## 1. **Metrics**

### **1.1. MAE**

#### *<mark style="color:blue;">Formula</mark>*

$$
MAE = \frac{1}{n} \times \sum\_{i=1}^n|y\_{ref\_i} - y\_{pred\_i}|
$$

#### *<mark style="color:blue;">Definition</mark>*

<mark style="color:blue;">**MAE**</mark> (<mark style="color:blue;">**M**</mark>ean <mark style="color:blue;">**A**</mark>bsolute <mark style="color:blue;">**E**</mark>rror) measures the **average magnitude of the errors** in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation.

#### *<mark style="color:blue;">Example:</mark>*

<table><thead><tr><th width="187">Actual observation</th><th>forecast value</th><th>Absolute error (units)</th></tr></thead><tbody><tr><td>1</td><td>2</td><td>1</td></tr><tr><td>100</td><td>101</td><td>1</td></tr><tr><td>100</td><td>110</td><td>10</td></tr></tbody></table>

$$
MAE = (1+1+10)/3 = 4
$$
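As a minimal sketch, the MAE of the example above can be reproduced in plain Python. The function name `mean_absolute_error` is chosen here for illustration and is not part of any specific library.

```python
def mean_absolute_error(actuals, forecasts):
    """Average of |actual - forecast| over all data points."""
    if len(actuals) != len(forecasts) or not actuals:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)


# Values taken from the example table above
actuals = [1, 100, 100]
forecasts = [2, 101, 110]
mae = mean_absolute_error(actuals, forecasts)
print(mae)  # 4.0
```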

### **1.2. RMSE**

#### *<mark style="color:blue;">Formula</mark>*

$$
RMSE = \sqrt{\frac{1}{n} \times \sum\_{i=1}^n(y\_{ref\_i} - y\_{pred\_i})^2}
$$

#### *<mark style="color:blue;">Definition</mark>*

<mark style="color:blue;">**RMSE**</mark> (<mark style="color:blue;">**R**</mark>oot <mark style="color:blue;">**M**</mark>ean <mark style="color:blue;">**S**</mark>quared <mark style="color:blue;">**E**</mark>rror) is a **quadratic scoring** rule that **measures the average magnitude of the error**. It’s the square root of the average of squared differences between prediction and actual observation.

#### *<mark style="color:blue;">Example:</mark>*

| actual observation | forecast value | Absolute error (units) |
| ------------------ | -------------- | ---------------------- |
| 1                  | 2              | 1                      |
| 100                | 101            | 1                      |
| 100                | 110            | 10                     |

$$
RMSE = \sqrt{(1^2+1^2+10^2)/3} = 5.83
$$
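The same data points can be used to check the RMSE. This is an illustrative sketch; the helper name `root_mean_squared_error` is an assumption made for this example.

```python
import math


def root_mean_squared_error(actuals, forecasts):
    """Square root of the mean of squared differences."""
    squared = [(a - f) ** 2 for a, f in zip(actuals, forecasts)]
    return math.sqrt(sum(squared) / len(squared))


# Values taken from the example table above
rmse = root_mean_squared_error([1, 100, 100], [2, 101, 110])
print(round(rmse, 2))  # 5.83
```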

### **1.3. MAPE**

#### *<mark style="color:blue;">Formula</mark>*

$$
MAPE = \frac{1}{n} \times \sum\_{i=1}^n|\frac{y\_{ref\_i} - y\_{pred\_i}}{y\_{ref\_i}}|
$$

#### *<mark style="color:blue;">Definition</mark>*

<mark style="color:blue;">**MAPE**</mark> (<mark style="color:blue;">**M**</mark>ean <mark style="color:blue;">**A**</mark>bsolute <mark style="color:blue;">**P**</mark>ercentage <mark style="color:blue;">**E**</mark>rror) measures **accuracy of a forecast system as a percentage**. It’s the average of the ratio between the absolute error and the actual observation, usually multiplied by 100 to express it as a percentage.

{% hint style="warning" %}
The lower the actual observation is, the higher the MAPE could be.
{% endhint %}

#### *<mark style="color:blue;">Example:</mark>*

| actual observation | forecast value | Absolute error (units) | Absolute error (%) |
| ------------------ | -------------- | ---------------------- | ------------------ |
| 1                  | 2              | 1                      | 100                |
| 100                | 101            | 1                      | 1                  |
| 100                | 110            | 10                     | 10                 |

$$
MAPE = (100+1+10)/3 = 37
$$
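A possible way to compute the MAPE in percent is sketched below. The function name is hypothetical, and raising an error on zero actual values is a design choice of this example, since the ratio is undefined in that case.

```python
def mean_absolute_percentage_error(actuals, forecasts):
    """MAPE in percent. Undefined when an actual value is zero."""
    if any(a == 0 for a in actuals):
        raise ValueError("MAPE is undefined when an actual value is 0")
    ratios = [abs((a - f) / a) for a, f in zip(actuals, forecasts)]
    return 100 * sum(ratios) / len(ratios)


# Values taken from the example table above
mape = mean_absolute_percentage_error([1, 100, 100], [2, 101, 110])
print(round(mape))  # 37
```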

### **1.4. wMAPE**

#### *<mark style="color:blue;">Formula</mark>*

$$
wMAPE = \frac{\sum\_{i=1}^n|y\_{ref\_i}-y\_{pred\_i}|}{\sum\_{i=1}^n|y\_{ref\_i}|}
$$

#### *<mark style="color:blue;">Definition</mark>*

<mark style="color:blue;">**wMAPE**</mark> (<mark style="color:blue;">**W**</mark>eighted <mark style="color:blue;">**M**</mark>ean <mark style="color:blue;">**A**</mark>bsolute <mark style="color:blue;">**P**</mark>ercentage <mark style="color:blue;">**E**</mark>rror) is a **variant of MAPE** in which errors are **weighted by values of actual observations** *(e.g. in case of sales forecasting, errors are weighted by sales volume).*

#### *<mark style="color:blue;">Example:</mark>*

| actual observation | forecast value | Absolute error (units) | Absolute error (%) |
| ------------------ | -------------- | ---------------------- | ------------------ |
| 1                  | 2              | 1                      | 100                |
| 100                | 101            | 1                      | 1                  |
| 100                | 110            | 10                     | 10                 |

$$
wMAPE = (1 \times 100 + 100 \times 1 + 100 \times 10)/201 = 5.97
$$

Due to small quantities, **the first data point contributes largely to the high MAPE result.**\
By **weighting** the percentage error using the real quantity, we **reduce the impact of this data point and have a more realistic overview** of the percentage error.
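Weighting each percentage error by its actual value is equivalent to dividing the total absolute error by the total of the actual values. The sketch below uses that simplified form; the function name `weighted_mape` is an assumption made for this example.

```python
def weighted_mape(actuals, forecasts):
    """wMAPE in percent: total absolute error over total absolute actuals."""
    total_abs_error = sum(abs(a - f) for a, f in zip(actuals, forecasts))
    total_actual = sum(abs(a) for a in actuals)
    if total_actual == 0:
        raise ValueError("wMAPE is undefined when all actual values are 0")
    return 100 * total_abs_error / total_actual


# Values taken from the example table above
wmape = weighted_mape([1, 100, 100], [2, 101, 110])
print(round(wmape, 2))  # 5.97
```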

### **1.5. R²**

#### *<mark style="color:blue;">Formula</mark>*

$$
R^2 = 1-\frac{\sum\_{i=1}^n(y\_{ref\_i}-y\_{pred\_i})^2}{\sum\_{i=1}^n(y\_{ref\_i}- \overline{y}\_{ref})^2}
$$

![r2\_formula.PNG](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/2867f382-85ff-4676-864c-71d404cd8ded/r2_formula.png)

#### *<mark style="color:blue;">Definition</mark>*

**R² is a measure of the goodness of fit of a model.** In regression, the R² coefficient of determination is a statistical measure of **how well the regression predictions approximate the real data points.** An R² of 1 indicates that the regression predictions perfectly fit the data.

**Negative values of R²** occur when the model **fits the data worse than a constant predictor** *(equivalent to a horizontal hyperplane at a height equal to the mean of the observed data).* With the formula above, R² can never exceed 1.

#### *<mark style="color:blue;">Example:</mark>*

| actual observation | forecast value | Absolute error (units) | Mean of actual observations |
| ------------------ | -------------- | ---------------------- | --------------------------- |
| 1                  | 2              | 1                      | 67                          |
| 100                | 101            | 1                      | 67                          |
| 100                | 110            | 10                     | 67                          |

$$
R^2 = 1-\frac{1^2+1^2+10^2}{(1-67)^2+(100-67)^2+(100-67)^2} = 1-\frac{102}{6534} = 0.98
$$

Here, 98% of the variation in the actual observations is explained by the model.
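The R² formula can be applied directly to the three example points, as in the sketch below. The function name `r_squared` is hypothetical. The second call illustrates that a forecast worse than the mean of the actual values gives a negative R².

```python
def r_squared(actuals, forecasts):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actuals) / len(actuals)
    ss_res = sum((a - f) ** 2 for a, f in zip(actuals, forecasts))
    ss_tot = sum((a - mean_actual) ** 2 for a in actuals)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot


# Values taken from the example table above
r2 = r_squared([1, 100, 100], [2, 101, 110])
print(round(r2, 2))  # 0.98

# A forecast that is worse than predicting the mean everywhere
r2_bad = r_squared([1, 100, 100], [100, 1, 1])
print(r2_bad < 0)  # True
```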

### **1.6. Global overprediction**

**Overprediction corresponds to the total amount by which forecasts exceed the actual observations.** It can be expressed in units or as a percentage.

Example:

| actual observation | forecast value | Absolute error (units) |
| ------------------ | -------------- | ---------------------- |
| 1                  | 2              | 1                      |
| 100                | 99             | 1                      |
| 100                | 110            | 10                     |

$$
overprediction(units) = 10+1=11
$$

### **1.7. Global underprediction**

**Underprediction corresponds to the total amount by which forecasts fall below the actual observations.** It can be expressed in units or as a percentage.

Example:

| actual observation | forecast value | Absolute error (units) |
| ------------------ | -------------- | ---------------------- |
| 1                  | 2              | 1                      |
| 100                | 99             | 1                      |
| 100                | 110            | 10                     |

$$
underprediction(units)= 1
$$
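Global overprediction and underprediction in units can be computed together by splitting each error according to its sign. This is a minimal sketch with a hypothetical function name.

```python
def over_and_under_prediction(actuals, forecasts):
    """Return (overprediction, underprediction) in units."""
    over = sum(max(f - a, 0) for a, f in zip(actuals, forecasts))
    under = sum(max(a - f, 0) for a, f in zip(actuals, forecasts))
    return over, under


# Values taken from the example tables of sections 1.6 and 1.7
over, under = over_and_under_prediction([1, 100, 100], [2, 99, 110])
print(over, under)  # 11 1
```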

## 2. How to interpret metrics

When we want to assess the performance of a sales forecast model, we need to **combine several metrics.**

### **2.1. MAE versus RMSE**

The **first level of analysis** is based on the **error in units**, i.e. **MAE** and **RMSE**.

The **MAE reflects how many units of error the model makes on average for each forecast**. \
By squaring the errors, the **RMSE gives more weight to large errors.**

{% hint style="info" %}
The closer the **RMSE is to the MAE**, the more **evenly distributed the errors are.**

\
**If the RMSE is much higher than the MAE,** it means that **some of the data points** of your test set **have very large errors.**
{% endhint %}

If you are in the latter situation, we encourage you to:

* **Identify these large error forecast data points**
* **Analyse** their corresponding **historical time series**
* Try to **identify** potential **areas of improvement** for your ML model.

These metrics can be used for all forecast data points.
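The comparison between RMSE and MAE, and the search for the data points with the largest errors, can be sketched as follows. The RMSE/MAE ratio is always at least 1 and equals 1 only when all absolute errors are identical. The helper `largest_errors` and its `top_n` parameter are assumptions made for this example.

```python
import math


def largest_errors(actuals, forecasts, top_n=3):
    """Return (index, absolute error) pairs sorted by decreasing error."""
    errors = [(i, abs(a - f)) for i, (a, f) in enumerate(zip(actuals, forecasts))]
    return sorted(errors, key=lambda pair: pair[1], reverse=True)[:top_n]


actuals = [1, 100, 100]
forecasts = [2, 101, 110]
abs_errors = [abs(a - f) for a, f in zip(actuals, forecasts)]
mae = sum(abs_errors) / len(abs_errors)
rmse = math.sqrt(sum(e ** 2 for e in abs_errors) / len(abs_errors))

ratio = rmse / mae  # the further above 1, the more unbalanced the errors
worst = largest_errors(actuals, forecasts, top_n=1)
print(round(ratio, 2), worst)  # 1.46 [(2, 10)]
```

The indices returned by `largest_errors` point to the forecasts whose historical time series deserve a closer look.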

### **2.2. MAE versus MAPE**

As explained before, compared to the MAE, the **MAPE expresses the error as a percentage of the real value**. These **two metrics** are **complementary**: a low MAE does not tell you by itself whether the result is good or not.

For instance, an MAE of 100 does not mean the same thing when the MAPE is 1% as when it is 100%:

* In the first case, **MAE = 100 is a good performance as it represents only 1% of error** when we compare the forecast to the real value.
* In the second case, **MAPE = 100%** means that the error is as large as the real value itself (for example, the prediction is twice the real value), which means that **MAE = 100 is quite a large error**!

We encourage you to always **use these two metrics together**, as it is not possible to properly assess the model accuracy using only one of them.

It is the minimal combination of metrics to use to correctly interpret forecast results.

{% hint style="warning" %}
MAPE cannot be computed when a real value is equal to 0. In this case, only MAE and RMSE can be used!
{% endhint %}
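One possible way to handle this in practice, assumed here rather than prescribed by this page, is to compute the MAPE only on the non-zero real values and to report how many data points were left out. The zero-valued points then remain covered by MAE and RMSE. The function name is hypothetical.

```python
def mape_ignoring_zero_actuals(actuals, forecasts):
    """Return (MAPE in percent on non-zero actuals, number of skipped points)."""
    pairs = [(a, f) for a, f in zip(actuals, forecasts) if a != 0]
    skipped = len(actuals) - len(pairs)
    if not pairs:
        return None, skipped  # MAPE cannot be computed at all
    mape = 100 * sum(abs((a - f) / a) for a, f in pairs) / len(pairs)
    return mape, skipped


# Hypothetical data: the first real value is zero
mape, skipped = mape_ignoring_zero_actuals([0, 100, 100], [3, 101, 110])
print(round(mape, 2), skipped)  # 5.5 1
```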

### **2.3. MAPE versus wMAPE**

These two metrics enable you to assess the percentage error. However, in comparison to the MAPE, the **wMAPE weights each percentage error** **by the ratio of the real quantity to the total real quantity**.

The primary advantage of the wMAPE is its **ability to mitigate large percentage errors made on small real quantities**, providing a more realistic overview of the percentage error.

{% hint style="info" %}
By comparing the MAPE and wMAPE, you can evaluate the model's error on small quantities.

**The closer the MAPE is to the wMAPE, the better the model performs on small quantities.**
{% endhint %}

### **2.4. MAE versus R²**

The **R²** is a classic machine learning metric that allows us to assess **how much variability is explained by the ML model**.

It achieves this by comparing the model's ability to account for variability with a scenario where the average of real values is used as the forecast (resulting in no variability, as all forecast values would be the same).

For example, if our ML model yields **MAE = 2 but R² = 0**, it indicates that the **variability** present in real values **is not explained by the features** used in the model, despite the seemingly low MAE.

{% hint style="info" %}
When dealing with an R² value close to 0, it's advisable to remove the features used in the model to avoid unnecessary computations and search for more relevant features.
{% endhint %}

Once again, similar to MAPE, R² is one of the key metrics to combine with MAE.
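The situation described above (MAE = 2 with R² = 0) can be reproduced with a small hypothetical dataset in which the forecast is constant and equal to the mean of the real values:

```python
# Hypothetical data: the forecast never moves, the real values do
actuals = [8, 12, 8, 12]
forecasts = [10, 10, 10, 10]

mae = sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)

mean_actual = sum(actuals) / len(actuals)
ss_res = sum((a - f) ** 2 for a, f in zip(actuals, forecasts))
ss_tot = sum((a - mean_actual) ** 2 for a in actuals)
r2 = 1 - ss_res / ss_tot

print(mae, r2)  # 2.0 0.0
```

The MAE looks small, yet the model explains none of the variability of the real values.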

## 3. Model scoring at different levels

Using multiple metrics is important but not sufficient for obtaining a realistic view of a model's performance.

Additionally, it is crucial **to combine various levels of analysis to understand the strengths and weaknesses of an ML model.**

*Model performance is often analyzed from different perspectives:*

* either by **computing metrics** on forecasts **aggregated** according **to specific features**, such as product category or higher-level time features
* or by **assessing metrics on different subsets**, such as segregating actual observations based on their values (e.g., zero or positive).

### **3.1. Performance analysis on aggregated forecasts**

The initial approach to analyzing forecast accuracy involves calculating various metrics (as discussed in the previous section) directly at the forecast resolution. While it's valuable to compute metrics at the finest level, we can also **assess model performance at a coarser level of aggregation using different categorical features.**

For example, if your **forecasts are generated at the "product" level**, as illustrated below, it would be beneficial to **evaluate model performance at the product category level.**

* **At the product level (model resolution):**

| product\_id | product\_category | actual observation | forecast value | error (units) | absolute error (units) |
| ----------- | ----------------- | ------------------ | -------------- | ------------- | ---------------------- |
| 1           | A                 | 2                  | 1              | -1            | 1                      |
| 2           | A                 | 100                | 101            | 1             | 1                      |
| 3           | B                 | 100                | 110            | 10            | 10                     |

$$
MAE(Product) = (1+1+10)/3 = 4
$$

* **At the product category level:**

| product\_category | actual observation | forecast value | error (units) |
| ----------------- | ------------------ | -------------- | ------------- |
| A                 | 102                | 102            | 0             |
| B                 | 100                | 110            | 10            |

$$
MAE(ProductCategory)= (0+10)/2=5
$$

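This type of analysis enables you to determine whether global variability has been modeled accurately or not. Note that errors of opposite sign within a category cancel out when forecasts are aggregated, as for category A above.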
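The two levels of analysis can be sketched with a single helper that sums actual and forecast values per group before computing the MAE. The row layout and the function name `mae_at_level` are assumptions made for this example.

```python
from collections import defaultdict

# Product-level rows from the example above
rows = [
    {"product_id": 1, "product_category": "A", "actual": 2, "forecast": 1},
    {"product_id": 2, "product_category": "A", "actual": 100, "forecast": 101},
    {"product_id": 3, "product_category": "B", "actual": 100, "forecast": 110},
]


def mae_at_level(rows, key):
    """Aggregate actuals and forecasts by `key`, then compute the MAE."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum actual, sum forecast]
    for row in rows:
        totals[row[key]][0] += row["actual"]
        totals[row[key]][1] += row["forecast"]
    return sum(abs(a - f) for a, f in totals.values()) / len(totals)


mae_product = mae_at_level(rows, "product_id")
mae_category = mae_at_level(rows, "product_category")
print(mae_product, mae_category)  # 4.0 5.0
```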

### **3.2. Performance analysis on different scopes**

#### 3.2.1. Real values

In sales forecast datasets, two types of events can be found:

* real sales event (i.e. positive real value)
* no sale event (i.e. real value equal to zero)

Thus, we can **analyze sales forecasts** by categorizing **real values** and calculating metrics for sales events on one hand, and **non-sale events** on the other.

The latter are not initially present in the raw sales dataset but are **added during the creation/formatting of training and test/prediction datasets** *(see section 'Cleaning and formatting sales data')*.

{% hint style="info" %}
Therefore, it is crucial to evaluate how the model performs specifically on these 'artificial' data points added to achieve a more balanced dataset and to prevent over-prediction.
{% endhint %}
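A minimal sketch of this split is shown below, computing the MAE separately for non-sale events (real value equal to zero) and sale events (positive real value). The data and the function name are hypothetical.

```python
def mae_by_sale_type(actuals, forecasts):
    """MAE computed separately on zero and non-zero real values."""
    groups = {"no_sale": [], "sale": []}
    for a, f in zip(actuals, forecasts):
        groups["no_sale" if a == 0 else "sale"].append(abs(a - f))
    return {
        name: (sum(errs) / len(errs) if errs else None)
        for name, errs in groups.items()
    }


# Hypothetical data: two non-sale events followed by two sale events
result = mae_by_sale_type([0, 0, 5, 10], [1, 0, 4, 13])
print(result)  # {'no_sale': 0.5, 'sale': 2.0}
```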

#### 3.2.2. Categorical features

It is also useful to analyse model accuracy for each value of a specific column. For instance, you may want to evaluate model performance for each product category:

| product\_id | product\_category | actual observation | forecast value | error (units) | absolute error (units) |
| ----------- | ----------------- | ------------------ | -------------- | ------------- | ---------------------- |
| 1           | A                 | 2                  | 1              | -1            | 1                      |
| 2           | A                 | 100                | 101            | 1             | 1                      |
| 3           | B                 | 100                | 110            | 10            | 10                     |

$$
MAE(A)=(|-1|+|+1|)/2=1
$$

$$
MAE(B) = |10|/1 = 10
$$

Here, it's evident that our **performance is stronger in category A compared to category B.** Consequently, we should focus our efforts on enhancing the model's performance in category B.

*This could involve:*

* analyzing the composition of category B products
* studying time series data
* identifying potential outliers
* implementing appropriate adjustments.
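The per-category MAE computed above can be sketched by grouping the absolute errors on the chosen column. Unlike the aggregated analysis, the errors are kept at the product level and only the averaging is done per group. The row layout and the function name `mae_per_group` are assumptions made for this example.

```python
from collections import defaultdict

# Product-level rows from the example above
rows = [
    {"product_id": 1, "product_category": "A", "actual": 2, "forecast": 1},
    {"product_id": 2, "product_category": "A", "actual": 100, "forecast": 101},
    {"product_id": 3, "product_category": "B", "actual": 100, "forecast": 110},
]


def mae_per_group(rows, key):
    """Return a dict mapping each value of `key` to the MAE of its rows."""
    errors = defaultdict(list)
    for row in rows:
        errors[row[key]].append(abs(row["actual"] - row["forecast"]))
    return {group: sum(errs) / len(errs) for group, errs in errors.items()}


mae_by_category = mae_per_group(rows, "product_category")
print(mae_by_category)  # {'A': 1.0, 'B': 10.0}
```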

## 4. Model generation and improvement

1. **Develop an initial model** using minimal configuration parameters (refer to 'Minimal configuration').
2. **Evaluate the performance** of the initial model by **creating plots** comparing forecasts to actual observations and **computing accuracy metrics** as outlined previously *(including various metrics, scoring across different scopes, and aggregation).*
3. **Identify areas of lowest performance** and conduct a detailed analysis of historical data related to these areas to pinpoint potential outliers or irrelevant data that require cleaning.
4. **Determine whether the model's performance is balanced.** \
   If not, assess whether the model **tends to overpredict or underpredict.**<br>

   **a. In case of overprediction:**

   * **Examine for negative trends** in historical data.
   * Assess if there's an **underrepresentation of non-sale events** in the training dataset compared to the test dataset.
   * Check if the **training dataset is balanced** concerning categories where the model exhibits overprediction.<br>

   **b. In case of underprediction:**

   * Investigate for **growth trends** in historical data.
   * Determine if there's an **overrepresentation of non-sale events** in the training dataset compared to the test dataset.
   * Verify if the **training dataset is balanced** regarding categories where the model exhibits underprediction.<br>
5. After cleaning the dataset, **enhance the model** by exploring:
   * **Different algorithms** sequentially or using a meta-model. For instance, if trends are discernible in historical data, the `lightgbm` algorithm may yield better results.
   * **Adjusting model resolution** if certain features exhibit significant variability, leading to substantial overprediction or underprediction. Creating models for each value of such features could help mitigate this issue.
   * **Incorporating external data** (refer to 'External Data') to improve model performance further.


