By Sean Deaton in machine learning — May 15, 2023

On Imbalanced Datasets

What's the difference between macro and weighted averages, and do they matter on imbalanced datasets?

I recently came across some interesting results using scikit-learn’s classification_report.

              precision    recall  f1-score   support

           0       0.86      0.52      0.65        83
           1       0.88      0.98      0.92       287

    accuracy                           0.87       370
   macro avg       0.87      0.75      0.78       370
weighted avg       0.87      0.87      0.86       370

Confusion Matrix:
[[ 43  40]
 [  7 280]]
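The per-class rows of the report can be recovered from the confusion matrix alone. Here is a minimal sketch in plain Python, assuming scikit-learn's convention that rows are true labels and columns are predicted labels; the helper name `per_class_metrics` is my own and not part of any library.

```python
# Rows are true labels, columns are predicted labels (scikit-learn's convention).
confusion = [[43, 40],
             [7, 280]]

def per_class_metrics(matrix, cls):
    """Return (precision, recall, f1, support) for one class index."""
    true_pos = matrix[cls][cls]
    support = sum(matrix[cls])                   # samples that truly belong to cls
    predicted = sum(row[cls] for row in matrix)  # samples the model labeled as cls
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / support if support else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1, support

metrics = [per_class_metrics(confusion, c) for c in range(len(confusion))]
for cls, (p, r, f1, n) in enumerate(metrics):
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={n}")
```

Rounded to two decimals, these values line up with the two class rows in the report.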

The results show a noticeable difference between the macro and weighted averages for recall (0.75 vs 0.87) and, as a result, for F1 (0.78 vs 0.86); the weighted average is higher in both cases. This stems from the imbalanced nature of the dataset. The support column shows the number of samples in each class, 0 and 1, respectively. The samples belonging to class 1 make up about 77% of the samples (287 of 370). Given this relative imbalance between the two classes, the model could score well on accuracy by simply guessing that every sample belongs to class 1.

We can see this more clearly with a more extreme example, courtesy of Stack Overflow.

Classification Report :
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     28432
           1       0.02      0.02      0.02        49

    accuracy                           1.00     28481
   macro avg       0.51      0.51      0.51     28481
weighted avg       1.00      1.00      1.00     28481

Here, class 0 accounts for about 99.8% of all of the samples. With such an imbalanced dataset, a model can reach near-perfect accuracy by always predicting the majority class, even though it almost never identifies the minority class correctly.
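To make the point concrete, here is a small sketch of a hypothetical baseline that labels every sample as class 0. The class counts come from the support column above; the baseline itself is my own illustration, not the model from the Stack Overflow question.

```python
# Class counts taken from the support column of the report above.
support = {0: 28432, 1: 49}
total = sum(support.values())

# Hypothetical baseline: predict class 0 for every sample.
# Recall is 1.0 for class 0 (every true 0 is found) and 0.0 for class 1.
baseline_recall = {0: 1.0, 1: 0.0}

accuracy = support[0] / total
macro_recall = sum(baseline_recall.values()) / len(baseline_recall)
weighted_recall = sum(support[c] / total * baseline_recall[c] for c in support)

print(f"accuracy:        {accuracy:.4f}")
print(f"macro recall:    {macro_recall:.4f}")
print(f"weighted recall: {weighted_recall:.4f}")
```

The weighted recall equals the accuracy and looks excellent, while the macro recall sits at 0.5 and exposes the class that the baseline ignores.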

Macro Average

That’s exactly what the discrepancy between the macro and weighted averages is telling us. The macro average weights every class equally: it multiplies each class’s score by 1/n, where n is the number of classes, and adds the results together. Since the model performs binary classification, this weight is just 1/2. Consider the F1 scores for my dataset above, 0.65 and 0.92 for classes 0 and 1:

$$ avg_{macro} = \dfrac{1}{2}\times score_0 + \dfrac{1}{2} \times score_1 $$

$$ avg_{macro} = \dfrac{1}{2} \times 0.65 + \dfrac{1}{2} \times 0.92 $$

$$ avg_{macro} = 0.785 $$
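The same arithmetic as a short Python sketch; `macro_average` is a name I made up for illustration.

```python
def macro_average(scores):
    """Unweighted mean: every class counts the same, regardless of its size."""
    return sum(scores) / len(scores)

f1_scores = [0.65, 0.92]  # per-class F1 for classes 0 and 1, as rounded in the report
macro_f1 = macro_average(f1_scores)
print(round(macro_f1, 3))
```

The report itself shows 0.78 rather than 0.785 because scikit-learn averages the unrounded per-class scores, which come out slightly lower than the two-decimal values used here.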

Weighted Average

Not too bad. However, the weighted average instead multiplies each class’s score by the share of the samples that belong to that class. So instead of 1/n, we’ll multiply the score for class i by (samples in class i)/(total samples).

Consider again my dataset’s F1 results alongside the number of samples in each class, 83 and 287.

$$ avg_{weighted} = \dfrac{samples\space in\space class\space 0}{total\space number\space of\space samples} \times score_0 + \dfrac{samples\space in\space class\space 1}{total\space number\space of\space samples} \times score_1 $$

$$ avg_{weighted} = \dfrac{83}{83+287}\times 0.65 + \dfrac{287}{83+287} \times 0.92 $$

$$ avg_{weighted} = 0.859 $$
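And the weighted version as a sketch, again with a made-up helper name (`weighted_average`) and the rounded scores from the report:

```python
def weighted_average(scores, supports):
    """Mean of per-class scores, each weighted by that class's share of the samples."""
    total = sum(supports)
    return sum(n / total * s for s, n in zip(scores, supports))

f1_scores = [0.65, 0.92]  # per-class F1 for classes 0 and 1
supports = [83, 287]      # number of samples in classes 0 and 1
weighted_f1 = weighted_average(f1_scores, supports)
print(round(weighted_f1, 3))
```

Because class 1 holds most of the samples, its higher F1 score pulls the weighted average up toward 0.92.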

So that’s how we arrive at our macro and weighted averages.

How Did We Get Here?

Imbalanced datasets occur all of the time in practice, especially given that we often want to predict rare events or extreme outliers. Consider detecting skin cancer from pictures of patients. Assuming no bias in who is submitting their photos, it’s likely that most pictures will show no signs of skin cancer. The positive case, actually having skin cancer, may occur in one out of every five cases. Other applications may have much rarer events with 1000:1 odds or higher. Should we sample such that each class is given equal weight? Or should we train the model on a proportion it’s likely to see in the real world?

How Should We Approach It?

No, we should not split the classes 1:1. The question that we should instead be asking is, “is the minority class properly represented?” Specifically for my dataset, in the real world, do the members of class 1 typically account for 77% of the population? If so, great! We’re testing and training on realistic populations. Re-balancing the training data would negatively influence our model, as most classifiers assume that predictions follow “…the same distribution as the training data”.

Furthermore, we should examine if we have enough data to represent that minority class properly. Obviously, having 1 member of the minority class is not sufficient. Is having 10? Likely not. 100? Maybe. What about 10,000? Now we’re talking! It might be the case that we need more data. It’s reported that data-hungry models, like Random Forest Classifiers, should have sample sizes that are 200 times larger than the number of features. My dataset has 1,536 features, so my 370 events are far from the target amount of 307,200 (for both classes).
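The gap is easy to quantify. This sketch just applies the 200-samples-per-feature rule of thumb quoted above to my numbers; the multiplier is that reported heuristic, not a hard requirement.

```python
SAMPLES_PER_FEATURE = 200  # rule of thumb quoted above
n_features = 1536
n_samples = 370

target = SAMPLES_PER_FEATURE * n_features
shortfall = target - n_samples

print(f"target samples:  {target:,}")
print(f"current samples: {n_samples:,} ({n_samples / target:.2%} of target)")
print(f"shortfall:       {shortfall:,}")
```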

Finally, we should consider adjusting the threshold of the model’s prediction. Most models assume a threshold of 50%. As in, if the model is more than 50% sure, it’ll place an event in a particular class. But maybe we can adjust this value to be relevant to our context. A great example is with weather patterns. No meteorologist says it’s going to pour or not pour. Instead, they give likelihoods that an event will happen. Maybe that’ll help in this case.
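As a rough sketch of what adjusting the threshold means, assume we already have the model's predicted probability of class 1 for each sample (with scikit-learn classifiers these would typically come from `predict_proba`). The probabilities and the helper `predict_with_threshold` below are made up for illustration.

```python
def predict_with_threshold(probabilities, threshold=0.5):
    """Label a sample 1 only when the model is more than `threshold` sure of class 1."""
    return [1 if p > threshold else 0 for p in probabilities]

# Hypothetical predicted probabilities of class 1, invented for this example.
probs = [0.95, 0.80, 0.62, 0.55, 0.40, 0.10]

default_labels = predict_with_threshold(probs)                # 50% cut-off
strict_labels = predict_with_threshold(probs, threshold=0.7)  # demand more certainty

print(default_labels)
print(strict_labels)
```

Raising the cut-off moves the borderline samples (0.62 and 0.55) from class 1 to class 0. For a dataset like mine, where class 0 is the minority, that trades some recall on class 1 for recall on class 0; whether the trade is worth it depends on the cost of each kind of mistake.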

Another likely point of optimization lies in feature engineering. Perhaps those features aren’t very important to begin with. I’m not quite sure yet, but I’ll keep experimenting and maybe come back to update this post with what works well. I know definitively that I need more samples, so that’s the first order of business.

In machine learning, you have the data you have. Re-balancing involves removing events from the majority class or replicating events from the minority class. This either removes information or provides no new information. In cases where we’re considering replicating data, we should instead consider creating and labeling additional events to increase the population of the minority class.
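A toy sketch of both re-balancing strategies shows why neither adds information. The sample IDs are placeholders I invented; real pipelines would resample feature rows, but the bookkeeping is the same.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical sample IDs, made up for illustration.
majority = [f"maj_{i}" for i in range(10)]
minority = [f"min_{i}" for i in range(3)]

# Random oversampling: draw minority samples with replacement until the classes match.
oversampled = minority + rng.choices(minority, k=len(majority) - len(minority))

# Random undersampling: keep only as many majority samples as there are minority ones.
undersampled = rng.sample(majority, k=len(minority))

print(f"oversampled minority: {len(oversampled)} rows, {len(set(oversampled))} distinct")
print(f"undersampled majority: {len(undersampled)} kept, "
      f"{len(majority) - len(undersampled)} discarded")
```

Oversampling yields ten rows but still only three distinct samples, and undersampling throws away seven of the ten majority samples.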

Additional Reading

Both articles have been linked above, but to be explicit, I would highly recommend:

Gabe Verzino’s post “Why Balancing Classes is Over-Hyped” and his citations. It’s a great overview of the problem and how under and over-sampling aren’t likely the issues.
Foster Provost’s “Machine Learning from Imbalanced Data Sets 101” lays out a succinct argument against up or down sampling and instead focuses on adjusting thresholds, adding new data, and selecting representative samples.

Thanks for Reading!

This article, and others, are available for free at seandeaton.com and on Medium. Questions or comments? Let’s chat on LinkedIn!