Fixing Imbalanced Classification Problem
Imbalanced datasets pose a challenging problem because the classes are represented unequally. In a two-class imbalanced dataset, the ratio of training examples may be 1:100, and in scenarios such as fraud detection in claims, click-through prediction for an ad-serving company, and predicting airplane crashes or failures, the ratio might be even more extreme, say 1:1000 or 1:5000.
So how do we fix this?
1. Resampling the dataset
Resampling is one of the most straightforward ways of dealing with highly imbalanced datasets: it evens out the class counts.
1.1 Under Sampling:
Undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset.
Pros:
It can reduce run time and storage requirements by reducing the number of training samples when the training data set is huge.
Cons:
It can discard potentially useful information which could be important for building rule classifiers.
The sample chosen by random under-sampling may be biased and may not be an accurate representation of the population, which can lead to inaccurate results on the actual test data set.
Note: Under Sampling should only be done when we have a huge number of records.
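Random under-sampling can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the helper name random_undersample and the toy data are assumptions made for the example.

```python
import random

def random_undersample(X, y, majority_label, n_keep, seed=0):
    """Keep every non-majority row and a random subset of n_keep majority rows."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    other_idx = [i for i, label in enumerate(y) if label != majority_label]
    kept = rng.sample(majority_idx, n_keep)  # sampling without replacement
    idx = sorted(kept + other_idx)
    return [X[i] for i in idx], [y[i] for i in idx]

# Hypothetical toy data: 9900 examples of class 0 and 100 of class 1
y = [0] * 9900 + [1] * 100
X = [[float(i)] for i in range(len(y))]

X_res, y_res = random_undersample(X, y, majority_label=0, n_keep=100)
print(y_res.count(0), y_res.count(1))
```

The 9800 discarded majority rows are exactly the information loss listed under the cons above.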
1.2 Over Sampling:
Over Sampling involves randomly selecting examples from the minority class, with replacement, and adding them to the training dataset.
Pros:
Unlike under-sampling, this method leads to no information loss.
Often outperforms under-sampling
Cons:
It increases the likelihood of overfitting since it replicates the minority class events.
Note: Oversampling can be considered when we have fewer records
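As a rough sketch, random over-sampling only needs sampling with replacement from the minority rows. The helper name random_oversample and the toy data below are assumptions made for the example.

```python
import random

def random_oversample(X, y, minority_label, n_target, seed=0):
    """Duplicate random minority rows until the minority class has n_target rows."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    n_extra = max(0, n_target - len(minority_idx))
    extra = rng.choices(minority_idx, k=n_extra)  # sampling with replacement
    return X + [X[i] for i in extra], y + [y[i] for i in extra]

# Hypothetical toy data: 9900 examples of class 0 and 100 of class 1
y = [0] * 9900 + [1] * 100
X = [[float(i)] for i in range(len(y))]

X_res, y_res = random_oversample(X, y, minority_label=1, n_target=9900)
print(y_res.count(0), y_res.count(1))
```

The resampled minority class still contains only 100 distinct rows, which is why exact replication raises the risk of overfitting.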
1.2.1 SMOTE
Synthetic Minority Oversampling TEchnique, or SMOTE for short, is an oversampling method. It works by creating synthetic samples from the minority class instead of creating copies.
SMOTE first selects a minority class instance at random (X1) and finds its k (here, k=4) nearest minority class neighbours (X11, X12, X13, X14). It then chooses one of the k nearest neighbours at random, say X11, and connects X1 and X11 to form a line segment in the feature space. The synthetic instance is generated at a randomly chosen point on that line segment.
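The interpolation step can be sketched as follows. This is a simplified illustration of generating one synthetic point, not the reference SMOTE implementation: the function name smote_sample, the brute-force neighbour search, and the toy points are assumptions.

```python
import math
import random

def smote_sample(minority, k=4, rng=None):
    """Create one synthetic point between a random minority point and one of its k neighbours."""
    if rng is None:
        rng = random.Random(0)
    x1 = rng.choice(minority)
    # k nearest minority neighbours of x1 by Euclidean distance (x1 itself excluded)
    others = [p for p in minority if p is not x1]
    neighbours = sorted(others, key=lambda p: math.dist(x1, p))[:k]
    x11 = rng.choice(neighbours)
    gap = rng.random()  # position along the segment from x1 to x11, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x1, x11)]

# Hypothetical minority class points in a 2-D feature space
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [2.0, 2.0]]
synthetic = smote_sample(minority, k=4, rng=random.Random(42))
print(synthetic)
```

Because the new point is a convex combination of two existing minority points, it always lies on the segment between them.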
Now let's consider our dataset, where there are 9900 instances of class 0 and 100 instances of class 1.
After over sampling the minority class using SMOTE, the transformed dataset can be visualized as below:
Here we have 9900 instances for both class 0 and class 1.
Pros:
Mitigates the problem of overfitting caused by random oversampling as synthetic examples are generated rather than a replication of instances
Cons:
While generating synthetic examples SMOTE does not take into consideration neighbouring examples from other classes. This can result in an increase in the overlapping of classes and can introduce additional noise
SMOTE is not very effective for high dimensional data
1.3 Hybrid Approach (Under Sampling + Over Sampling)
The original SMOTE paper, "SMOTE: Synthetic Minority Over-sampling Technique" (Chawla et al., 2002), suggested a hybrid approach of combining SMOTE with random under-sampling of the majority class.
Here, we first oversample the minority class until it has 10 percent of the number of examples in the majority class (990, i.e. about 1,000), then use random under-sampling to reduce the majority class until the minority class is 50 percent of its size (1,980, i.e. about 2,000). The final class distribution after this sequence of transforms is a 1:2 ratio: about 1,000 examples in the minority class and about 2,000 in the majority class.
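The arithmetic behind these counts can be checked with a short calculation. The two ratios are read here as minority-to-majority ratios, which is also how imbalanced-learn interprets a float sampling_strategy; that reading, and the variable names, are assumptions of this sketch.

```python
n_majority, n_minority = 9900, 100

over_ratio = 0.1   # minority / majority after SMOTE
under_ratio = 0.5  # minority / majority after random under-sampling

# Step 1: oversample the minority class up to 10 percent of the majority class
n_minority_after = round(n_majority * over_ratio)

# Step 2: shrink the majority class until the minority class is 50 percent of it
n_majority_after = round(n_minority_after / under_ratio)

print(n_majority_after, n_minority_after)  # 1980 990
```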
2. Cost-Sensitive Learning
In cost-sensitive learning instead of each instance being either correctly or incorrectly classified, each class (or instance) is given a misclassification cost. Thus, instead of trying to optimize the accuracy, the problem is then to minimize the total misclassification cost. Here the penalty is associated with an incorrect prediction.
Scikit-learn models provide the class_weight parameter, where we can specify a higher weight for the minority class using a dictionary.
For the logistic regression, we calculate the loss per instance using binary cross-entropy.
Loss= −y log(p) − (1−y)log(1−p).
However, if we set the class weights to {0:1, 1:10}, the loss becomes:
NewLoss = −10*y log(p) − 1*(1−y)log(1−p).
So if our model gives a probability of 0.3 for a positive example and misclassifies it, NewLoss takes a value of -10*log(0.3) ≈ 12.04 (using the natural logarithm), and if our model gives a probability of 0.7 for a negative example and misclassifies it, NewLoss takes a value of -log(0.3) ≈ 1.20.
That means we penalize our model ten times more when it misclassifies a positive minority example in this case.
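The weighted loss is easy to verify numerically. The sketch below uses natural logarithms (an assumption; the factor of ten between the two losses does not depend on the base), and the function name weighted_log_loss is chosen for illustration.

```python
import math

def weighted_log_loss(y, p, class_weight):
    """Per-instance binary cross-entropy scaled by the weight of the true class."""
    w = class_weight[y]
    return -w * (y * math.log(p) + (1 - y) * math.log(1 - p))

class_weight = {0: 1, 1: 10}

loss_pos = weighted_log_loss(1, 0.3, class_weight)  # positive example, predicted p = 0.3
loss_neg = weighted_log_loss(0, 0.7, class_weight)  # negative example, predicted p = 0.7

print(round(loss_pos, 2), round(loss_neg, 2))
print(round(loss_pos / loss_neg, 6))
```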
There is no fixed rule for picking the right class weights, so they are a hyperparameter to be tuned. However, if we want to derive class weights from the distribution of the y variable, we can use compute_class_weight from sklearn.utils.class_weight.
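For reference, scikit-learn documents its "balanced" heuristic as n_samples / (n_classes * count of each class). The sketch below reimplements that formula in plain Python instead of calling the library; the helper name balanced_class_weights is an assumption.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical labels: 9900 examples of class 0 and 100 of class 1
y = [0] * 9900 + [1] * 100
weights = balanced_class_weights(y)
print(weights)
```

With this distribution the minority class gets a weight about 99 times larger than the majority class, mirroring the 99:1 imbalance.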
Cost-sensitive algorithms include Logistic Regression, Decision Trees, Support Vector Machines, Artificial Neural Networks, Bagged decision trees, Random Forest, Stochastic Gradient Boosting.
3. Ensemble Models
3.1 Bagging:
Bagging is an abbreviation of Bootstrap Aggregating. The conventional bagging algorithm involves generating ‘n’ different bootstrap training samples with replacement, training the algorithm on each bootstrapped sample separately, and then aggregating the predictions at the end.
Bagging is used to reduce overfitting and so build strong learners that generate accurate predictions. Unlike boosting, bagging allows replacement in the bootstrapped sample.
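A toy version of the procedure, assuming a one-dimensional threshold classifier as the base learner and majority voting as the aggregation step, might look like this. All function names and data here are illustrative.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Toy base learner: the 1-D threshold t (predict 1 if x > t) with the fewest training errors."""
    best = None
    for t in sorted(set(X)):
        errors = sum((x > t) != bool(label) for x, label in zip(X, y))
        if best is None or errors < best[0]:
            best = (errors, t)
    return best[1]

def bagging_fit(X, y, n_estimators=25, seed=0):
    """Train one stump per bootstrap sample."""
    rng = random.Random(seed)
    n = len(X)
    thresholds = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: n draws with replacement
        thresholds.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return thresholds

def bagging_predict(thresholds, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

X = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y = [0, 0, 0, 0, 1, 1, 1, 1]
model = bagging_fit(X, y)
print(bagging_predict(model, 0.05), bagging_predict(model, 0.95))
```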
Pros:
In noisy data environments, bagging often outperforms boosting
Lower misclassification rate for the bagged classifier than for a single base classifier
Reduces overfitting
Cons:
Bagging works only if the base classifiers are not bad to begin with. Bagging bad classifiers can further degrade performance
3.2 Boosting (AdaBoost):
Boosting is an ensemble technique to combine weak learners to create a strong learner that can make accurate predictions. Boosting starts out with a base classifier / weak classifier that is prepared on the training data.
For example, take a data set of 1000 observations, of which 20 are labelled fraudulent. Equal weights w1 are assigned to all observations, and the base classifier accurately classifies 400 of them.
The weight of each of the 600 misclassified observations is increased to w2, and the weight of each of the correctly classified observations is reduced to w3.
In each iteration, the reweighted observations are fed to the next weak classifier so that it concentrates on the examples misclassified so far. This process continues until the misclassification rate decreases significantly, resulting in a strong classifier.
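One common form of the reweighting step is the discrete AdaBoost update, sketched below. Note that the sketch assumes a weak learner that gets 600 of 1000 observations right rather than 400, since this particular update rule needs a weighted error below 0.5; the function name and data are illustrative.

```python
import math

def adaboost_reweight(weights, correct):
    """One discrete AdaBoost update: raise weights of misclassified rows, lower the rest."""
    err = sum(w for w, ok in zip(weights, correct) if not ok) / sum(weights)
    alpha = 0.5 * math.log((1 - err) / err)  # learner weight; positive when err < 0.5
    new = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new], alpha  # renormalise so the weights sum to 1

n = 1000
weights = [1 / n] * n                      # equal starting weights
correct = [True] * 600 + [False] * 400     # assumed outcome of the first weak learner

new_weights, alpha = adaboost_reweight(weights, correct)
print(new_weights[0], new_weights[-1], alpha)
```

After the update the misclassified observations carry half of the total weight, so the next weak learner cannot do well by ignoring them.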
Pros:
Good generalization; applicable to a wide range of classification problems
Very simple to implement
Cons:
Sensitive to noisy data and outliers
3.3 Gradient Boosting
AdaBoost either requires the users to specify a set of weak learners or randomly generates the weak learners before the actual learning process. The weight of each learner is adjusted at every step depending on whether it predicts a sample correctly.
Gradient Boosting, in contrast, builds the first learner on the training dataset to predict the samples, calculates the loss (the difference between the real value and the output of the first learner), and uses this loss to build an improved learner in the second stage.
At every step, the residual is computed as the negative gradient of the loss function with respect to the current predictions, and this new residual becomes the target variable for the subsequent iteration.
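The residual-fitting loop can be sketched for the simplest case. This illustration assumes squared-error loss on a regression target (where the negative gradient is just y minus the current prediction) and one-dimensional regression stumps as learners; it is not how any particular library implements gradient boosting.

```python
def fit_stump(X, residuals):
    """Regression stump on 1-D inputs: choose the split with the lowest squared error."""
    best = None
    for t in sorted(set(X))[:-1]:
        left = [r for x, r in zip(X, residuals) if x <= t]
        right = [r for x, r in zip(X, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(X, y, n_stages=20, learning_rate=0.5):
    base = sum(y) / len(y)          # stage 0: predict the mean
    preds = [base] * len(y)
    stages = []
    for _ in range(n_stages):
        # Residuals = negative gradient of 0.5 * (y - F)^2 with respect to F
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)  # the residuals are the new target
        stages.append(stump)
        preds = [pi + learning_rate * stump(xi) for pi, xi in zip(preds, X)]
    return base, stages, preds

def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)

X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
base, stages, preds = gradient_boost(X, y)
print(mse(y, [base] * len(y)), mse(y, preds))
```

Each stage shrinks the training error a little, which also shows why too many stages or too large a learning rate can overfit.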
Cons:
Gradient Boosted trees are harder to fit than Random forests
Might lead to overfitting if parameters are not tuned properly
3.3.1 Extreme Gradient Boosting (XGBoost)
Pros:
It is often cited as being about 10 times faster than conventional Gradient Boosting, as it implements parallel processing. It is highly flexible: users can define custom optimization objectives and evaluation criteria, and it has a built-in mechanism to handle missing values.
Unlike conventional gradient boosting, which stops splitting a node as soon as it encounters a negative loss reduction, XGBoost splits up to the maximum depth specified, then prunes the tree backwards and removes splits beyond which there is no positive gain.
In many cases, synthetic techniques such as SMOTE and MSMOTE tend to outperform the conventional oversampling and undersampling techniques. For better performance, we can use SMOTE or MSMOTE along with advanced boosting methods such as Gradient Boosting or XGBoost.











