Unmasking Multicollinearity
Multicollinearity is a common problem when estimating linear or generalized linear models. It occurs when the predictor variables are highly correlated with one another (as distinct from their correlation with the target variable), leading to unreliable and unstable estimates of the regression coefficients.
Regression analysis assumes a degree of independence among the predictor variables, and multicollinearity undermines that assumption. In the extreme case, where one predictor is an exact linear combination of the others, the coefficients cannot be uniquely estimated at all. It becomes a roadblock when we want to distinguish the individual effects of the predictor variables on the target variable.
Consider the following equation:
y = W0 + W1*X1 + W2*X2
W1 is the increase in y for a one-unit increase in X1 while X2 is held constant. But when X1 and X2 are highly correlated, a change in one is accompanied by a change in the other, so we may not be able to separate their individual effects on y.
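To make the instability concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the data are synthetic, the true coefficients (1, 2, 3) are arbitrary, and `coef_spread` is a hypothetical helper. It repeatedly refits the two-predictor model and measures how much the estimate of W1 varies from one sample to the next.

```python
import numpy as np

def coef_spread(rho, n=200, trials=500, seed=0):
    """Std. dev. of the fitted W1 across simulated datasets whose
    two predictors have correlation rho (synthetic data only)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    w1_hats = []
    for _ in range(trials):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        # Arbitrary "true" model: y = 1 + 2*X1 + 3*X2 + noise
        y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(0.0, 1.0, size=n)
        A = np.column_stack([np.ones(n), X])  # intercept column + predictors
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        w1_hats.append(w[1])
    return float(np.std(w1_hats))

spread_low = coef_spread(rho=0.0)    # uncorrelated predictors
spread_high = coef_spread(rho=0.95)  # highly correlated predictors
print(f"std of W1 estimate, rho=0.00: {spread_low:.3f}")
print(f"std of W1 estimate, rho=0.95: {spread_high:.3f}")
```

With highly correlated predictors the estimate of W1 should swing noticeably more between samples, even though the underlying relationship never changes.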
How do we detect multicollinearity?
The severity of the problem can be assessed with a statistic called the Variance Inflation Factor (VIF). The VIF can be calculated for each predictor by running a linear regression of that predictor on all the other predictors and then obtaining the R^2 from that regression.
The VIF for predictor i is then given by:
VIF_i = 1 / (1 - R_i^2)
where R_i^2 is the R^2 obtained when predictor i is regressed on the other predictors.
VIF estimates how much the variance of a coefficient is inflated because of linear dependence with other predictors.
It has a lower bound of 1 but no upper bound. A common rule of thumb is that a VIF greater than 5 indicates high multicollinearity. A VIF of up to 10 is usually tolerated only when the dataset has many categorical features or many features in general.
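The calculation described above (regress each predictor on the others, take the R^2, and convert it to a VIF) can be sketched directly with NumPy. This is an illustrative sketch, not the statsmodels implementation: `vif_manual` is a hypothetical helper and the three columns are synthetic.

```python
import numpy as np

def vif_manual(X):
    """VIF of each column of X, obtained by regressing that column
    on all the other columns (with an intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for i in range(p):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic predictors: b is strongly tied to a, c is unrelated to both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.3 * rng.normal(size=500)
c = rng.normal(size=500)

vifs = vif_manual(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], np.round(vifs, 2))))
```

The two linked columns should come out well above the rule-of-thumb threshold of 5, while the unrelated column should stay close to the lower bound of 1.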
Let's take a look at blood pressure data obtained from 20 individuals with high blood pressure (BP). The variables are:
blood pressure (y = BP, in mm Hg)
age (x1 = Age, in years)
weight (x2 = Weight, in kg)
body surface area (x3 = BSA, in sq m)
duration of hypertension (x4 = Dur, in years)
basal pulse (x5 = Pulse, in beats per minute)
stress index (x6 = Stress)
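Now let's calculate the VIF using the variance_inflation_factor function provided by the statsmodels package. After removing the target variable (BP) and the index, the VIFs are:
The VIFs obtained here are far too high. The cause is not the units as such: variance_inflation_factor does not add an intercept to the regressions it runs, so predictors whose means are far from zero produce inflated values. Let's fix this by transforming the dataset using StandardScaler(), which centers every predictor (adding a constant column to the data has the same effect), and then calculate the VIF again.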
Now, this looks reasonable. The VIFs of Weight and BSA are higher than 5, which makes sense because body surface area (BSA) and weight tend to rise and fall together.
The correlation matrix also allows us to investigate the dependence between multiple variables at the same time. The result is a table containing the Pearson correlation coefficients between each variable and the others.
Our correlation plot confirms the same: Weight and BSA have a high correlation of 0.88. We can fix this by eliminating one of the two variables (either Weight or BSA) from the feature list. Which feature to retain depends on the problem we are trying to solve. For now, we can go ahead and eliminate the BSA variable from the dataset and calculate the VIF of the remaining features.
Impressive! Just removing the BSA variable has reduced the VIF of the remaining predictors to a great extent.
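The correlation check and the drop step can be sketched as follows. The data are again synthetic stand-ins that reuse the column names, and both `high_corr_pairs` and the 0.8 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the blood pressure dataset).
rng = np.random.default_rng(1)
n = 300
weight = rng.normal(90.0, 5.0, size=n)
df = pd.DataFrame({
    "Age": rng.normal(50.0, 3.0, size=n),
    "Weight": weight,
    "BSA": 0.02 * weight + rng.normal(0.0, 0.05, size=n),  # tied to Weight
})

def high_corr_pairs(corr, threshold=0.8):
    """List (col_a, col_b, r) for every pair whose |r| exceeds threshold."""
    cols = list(corr.columns)
    found = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) > threshold:
                found.append((cols[i], cols[j], round(r, 2)))
    return found

corr = df.corr()               # Pearson correlation by default
pairs = high_corr_pairs(corr)
print(corr.round(2))
print("Highly correlated pairs:", pairs)

# Drop one member of the flagged pair and check again.
reduced = df.drop(columns=["BSA"])
pairs_after = high_corr_pairs(reduced.corr())
print("After dropping BSA:", pairs_after)
```

A pairwise check like this only catches dependence between two variables at a time, so it complements the VIF rather than replacing it.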
So how do we avoid multicollinearity?
Avoid creating new variables that depend directly on a variable already present. For example, when creating an 'Age' variable from 'DOB', one of the two should be dropped.
Check for identical variables recorded in different measuring units. For example, the Height variable can be present in both inches and centimeters.
When creating dummy variables for a categorical feature, set drop_first=True so that one category serves as the baseline; a full set of dummies always sums to 1, which is perfectly collinear with the intercept. A feature with only two values can simply be encoded directly. For example, a Result variable containing Pass and Fail can be encoded as 1 and 0 instead of creating dummy variables.
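The dummy-variable advice can be illustrated with pandas. The small frame below is made up for the example: the Result column follows the Pass/Fail case mentioned above, while the Grade column is a hypothetical three-category feature.

```python
import pandas as pd

# Made-up example data.
df = pd.DataFrame({
    "Grade": ["A", "B", "C", "A", "B"],            # hypothetical 3-level feature
    "Result": ["Pass", "Fail", "Pass", "Pass", "Fail"],
})

# One dummy per category: every row sums to 1, i.e. the columns are
# perfectly collinear with an intercept.
full = pd.get_dummies(df["Grade"], prefix="Grade", dtype=int)

# drop_first=True removes the first category, which becomes the baseline.
reduced = pd.get_dummies(df["Grade"], prefix="Grade", drop_first=True, dtype=int)

# A two-valued feature can be encoded directly as 0/1.
result_encoded = df["Result"].map({"Fail": 0, "Pass": 1})

print(full.sum(axis=1).tolist())
print(list(reduced.columns))
print(result_encoded.tolist())
```

With drop_first=True, each remaining coefficient is read as the difference from the dropped baseline category.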
When is it safe to ignore multicollinearity?
Multicollinearity inflates the standard errors of the coefficients, which makes the estimates unstable. This instability makes the coefficients risky to interpret and can mislead hypothesis tests on them.
However, multicollinearity does not reduce the predictive power or reliability of the model as a whole. It affects only calculations regarding individual predictors.
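A small simulation can illustrate this contrast. The sketch below rests on assumptions made purely for illustration: synthetic data, two predictors with correlation 0.98, arbitrary true coefficients, and test data drawn from the same distribution as the training data. It refits the model on many training samples and tracks both the fitted W1 and the prediction error on a fixed test set.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, trials = 100, 1000, 300
cov = np.array([[1.0, 0.98], [0.98, 1.0]])  # highly correlated predictors

def simulate(n):
    """Design matrix (with intercept column) and target, synthetic data."""
    X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    # Arbitrary "true" model: y = 1 + 2*X1 + 3*X2 + noise (noise std = 1)
    y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(0.0, 1.0, size=n)
    return np.column_stack([np.ones(n), X]), y

A_test, y_test = simulate(n_test)  # one fixed test set

w1_hats, rmses = [], []
for _ in range(trials):
    A, y = simulate(n_train)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    w1_hats.append(w[1])
    rmses.append(np.sqrt(np.mean((A_test @ w - y_test) ** 2)))

w1_sd = float(np.std(w1_hats))
rmse_mean = float(np.mean(rmses))
rmse_sd = float(np.std(rmses))
print(f"std of fitted W1 across samples: {w1_sd:.3f}")
print(f"test RMSE: mean {rmse_mean:.3f}, std {rmse_sd:.3f}")
```

The fitted W1 should vary widely from sample to sample while the test error stays close to the noise level. Note that this holds here because the test data share the correlation structure of the training data.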
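Dummy variables that represent a categorical variable with three or more categories often have high VIFs, particularly when the proportion of cases in the reference category is small. This is generally fine, because it does not affect the overall contribution of the categorical variable to the model.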