
Class Imbalance in ML

Imbalanced classification refers to the classification predictive modelling problem where the number of examples in the training dataset for each class label is not balanced.

If a dataset consists of 10000 genuine and 10 fraudulent transactions, a classifier will tend to classify fraudulent transactions as genuine. The numbers explain why. Suppose the machine learning algorithm has two possible outputs, as follows:

Model 1 classified 7 out of 10 fraudulent transactions as genuine transactions and 10 out of 10000 genuine transactions as fraudulent transactions.

Model 2 classified 2 out of 10 fraudulent transactions as genuine transactions and 100 out of 10000 genuine transactions as fraudulent transactions.

If we take the number of mistakes as the measure of a model's performance, Model 1 has only 17 errors while Model 2 has 102. However, if we want to minimize the number of fraudulent transactions that get through, we should use Model 2, which misses only 2 of them instead of 7. An algorithm that simply minimizes the total number of errors will generally prefer Model 1, letting many fraudulent transactions pass undetected.
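The arithmetic behind this comparison can be checked with a short sketch. It uses only the counts given above; the variable names are illustrative, and accuracy is computed as the share of all 10010 transactions that were classified correctly.

```python
# Counts from the example: 10000 genuine and 10 fraudulent transactions.
TOTAL = 10000 + 10

# (fraud classified as genuine, genuine classified as fraud) for each model
models = {
    "Model 1": (7, 10),
    "Model 2": (2, 100),
}

results = {}
for name, (missed_fraud, false_alarms) in models.items():
    errors = missed_fraud + false_alarms
    accuracy = (TOTAL - errors) / TOTAL
    results[name] = (errors, accuracy)
    print(f"{name}: {errors} errors, accuracy = {accuracy:.4f}")
```

Both models score above 98% accuracy, and Model 1 scores higher even though it lets more fraud through, which is why the raw error count is a poor guide here.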

Better Metrics

We can use better metrics than just counting the errors. They are built from four possible outcomes of a prediction:

True Positive (TP) – An example that is positive and is classified correctly as positive

True Negative (TN) – An example that is negative and is classified correctly as negative

False Positive (FP) – An example that is negative but is classified wrongly as positive

False Negative (FN) – An example that is positive but is classified wrongly as negative
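As a minimal sketch of how these four outcomes can be counted, the function below compares true labels with predicted labels. The function name and the toy labels are made up for illustration, and it assumes that the fraudulent class is the positive one, encoded as 1.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # positive, correctly classified as positive
        elif actual != positive and predicted != positive:
            tn += 1  # negative, correctly classified as negative
        elif actual != positive and predicted == positive:
            fp += 1  # negative, wrongly classified as positive
        else:
            fn += 1  # positive, wrongly classified as negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Toy labels (illustrative only): 1 = fraudulent, 0 = genuine
y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)
```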

Now let's find the performance of our models with respect to our new metrics.

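In our case, the primary goal is to let as few fraudulent transactions through as possible, i.e., to have fewer false negatives (treating a fraudulent transaction as the positive class). The False Negative Rate (FNR) is the share of positive examples that are wrongly classified as negative: FNR = FN / (FN + TP). Calculating it for both models: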

Model 1:

FNR_M1 = 7 / (7 + 3)

FNR_M1 = 0.7

Model 2:

FNR_M2 = 2 / (2 + 8)

FNR_M2 = 0.2

Now we see that the False Negative Rate of Model 1 is 70%, while that of Model 2 is just 20%. For the goal of catching fraud, this makes Model 2 the better classifier, even though it makes more errors overall.
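The same calculation can be written as a small helper. This is only a sketch: the function name is illustrative, and the counts are the ones from the example (Model 1 misses 7 of 10 frauds and catches 3, Model 2 misses 2 and catches 8).

```python
def false_negative_rate(fn, tp):
    """FNR = FN / (FN + TP): the share of actual positives that were missed."""
    positives = fn + tp
    if positives == 0:
        raise ValueError("FNR is undefined when there are no positive examples")
    return fn / positives


fnr_m1 = false_negative_rate(fn=7, tp=3)
fnr_m2 = false_negative_rate(fn=2, tp=8)
print(f"FNR_M1 = {fnr_m1:.1f}, FNR_M2 = {fnr_m2:.1f}")
```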

#machine learning #classification #class imbalance #imbalanced dataset #metrics