Dealing with Missing values - Introduction
Real-world data is often very dirty. Before performing any analysis we first need to clean the data: handle missing values, outliers, and so on. Missing data can have many causes.
Why estimate missing values?
Many algorithms cannot deal with missing values, for example algorithms that depend on distance measures (clustering, similarity searches). We therefore need to either remove the affected records or try to work out the best possible value for each gap.
Suppose you are modeling weight (Y) as a function of gender (X). Some respondents won't disclose their weight, so some values of Y are missing. There are three possible mechanisms for the nondisclosure:
Missing Completely At Random (MCAR)
There may be no particular reason why some respondents told you their weights and others didn't. That is, the probability that Y is missing has no relationship to X or Y. In this case our data are missing completely at random (MCAR).
Other examples: the dog ate the response sheets; observations missing because a page of the questionnaire was missing; data lost to a data processing error; data missing because of a change in the data collection procedure.
Missing At Random (MAR)
One gender may be less likely to disclose its weight. That is, the probability that Y is missing depends only on the value of X. Such data are missing at random (MAR).
Missing Not At Random (MNAR)
Heavy (or light) people, or rich females, may be less likely to disclose their weight. That is, the probability that Y is missing depends on the unobserved value of Y itself (or on other information we did not record, such as wealth). Such data are missing not at random (MNAR).
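The three mechanisms can be made concrete with a small simulation. The sketch below is only an illustration: the sample size, the weight distributions and the missingness probabilities are all made-up assumptions, and `make_missing` is a hypothetical helper, not a function from any package.

```r
set.seed(42)
n <- 1000
gender <- sample(c("F", "M"), n, replace = TRUE)                    # X, fully observed
weight <- rnorm(n, mean = ifelse(gender == "M", 80, 65), sd = 10)   # Y, the complete "truth"

# MCAR: every respondent has the same 20% chance of not answering.
p_mcar <- rep(0.2, n)
# MAR: the chance depends only on the observed X (gender).
p_mar <- ifelse(gender == "F", 0.35, 0.05)
# MNAR: the chance grows with the value of Y itself.
p_mnar <- plogis((weight - 85) / 5)

# Hypothetical helper: blank out each value with its own probability p.
make_missing <- function(y, p) ifelse(runif(length(y)) < p, NA, y)

weight_mcar <- make_missing(weight, p_mcar)
weight_mar  <- make_missing(weight, p_mar)
weight_mnar <- make_missing(weight, p_mnar)

# Compare the mean of the observed values with the true mean.
round(c(truth = mean(weight),
        MCAR  = mean(weight_mcar, na.rm = TRUE),
        MAR   = mean(weight_mar,  na.rm = TRUE),
        MNAR  = mean(weight_mnar, na.rm = TRUE)), 1)
```

Under these assumptions the MCAR mean should stay close to the truth, while the MAR and MNAR means drift away from it, because the people who answered are no longer a random subset of all respondents.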
How to find the type: MCAR, MAR or MNAR?
To identify the type of missing data, we need some idea of how the data were created.
If possible, we should also know why the data are missing.
Prefer methods that remain valid under MAR.
Avoid methods that are valid only under MCAR.
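One rough check, sketched here on simulated data, is to see whether missingness in Y is related to an observed variable such as gender. The data and the missingness rates below are assumptions made up for the illustration. A small p-value is evidence against MCAR; a large one does not prove MCAR, and no test on the observed data alone can rule out MNAR.

```r
set.seed(1)
n <- 500
gender <- factor(sample(c("F", "M"), n, replace = TRUE))
weight <- rnorm(n, mean = ifelse(gender == "M", 80, 65), sd = 10)
# Assumed MAR mechanism: women withhold their weight more often than men.
weight[runif(n) < ifelse(gender == "F", 0.4, 0.1)] <- NA

miss <- is.na(weight)

# Share of missing weights within each gender.
miss_rate <- tapply(miss, gender, mean)

# Chi-squared test of independence between missingness and gender.
test <- chisq.test(table(gender, miss))

miss_rate
test$p.value
```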
How to deal with missing data?
Exclude subjects with missing values: case deletion
Make a guess and replace the missing values with the guessed values
Fill in with a simple guess, e.g. the sample mean
Fill in with better-guessed values
[Figure: black points are missing values.]
Use R's complete.cases function to find the rows with complete data.
Case deletion gives a valid analysis under MCAR.
It is acceptable if the proportion of missing cases is small (<5%).
It can be inefficient and may discard a very high proportion of cases.
It may introduce substantial bias if the missing data are not MCAR (the complete cases may be unrepresentative of the population).
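A minimal sketch of case deletion in base R. The `survey` data frame is hypothetical and exists only to show the calls.

```r
# Hypothetical survey data; NA marks a value the respondent did not give.
survey <- data.frame(
  gender = c("F", "M", "F", "M", "F", "M"),
  age    = c(34, 51, NA, 45, 29, 38),
  weight = c(NA, 82, 60, NA, 58, 77)
)

ok <- complete.cases(survey)     # TRUE for rows without any NA
prop_incomplete <- mean(!ok)     # share of rows that deletion would discard
complete_rows <- survey[ok, ]    # keeps the same rows as na.omit(survey)

ok
prop_incomplete
complete_rows
```

Checking `prop_incomplete` before deleting anything is a cheap way to see whether the small-proportion rule of thumb is even close to holding.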
Filling in all missing values with zeros is easy, but it distorts the data disproportionately (it changes the statistical properties) and may also introduce bias. And why zero?
If your data follow a MAR missing-data mechanism, you can try other single imputation methods, such as imputing with the conditional or unconditional mean, or drawing from the conditional or unconditional distribution (these will be discussed in the next post).
Try KNN (k-nearest-neighbor) imputation, which may give better results than the other methods in many situations.
Single imputation is more difficult than case deletion, but in most cases it handles missing data better for your analysis.
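The single imputation options above can be compared on a toy example. Everything here is an assumption for illustration: the data are invented, the `height` column is a hypothetical extra predictor added so that nearest neighbors have something to be near to, and `knn_impute` is a hand-written helper rather than a package function.

```r
# Hypothetical data: gender (X) is complete, weight (Y) has two gaps.
d <- data.frame(
  gender = c("F", "F", "F", "F", "M", "M", "M", "M"),
  height = c(160, 162, 166, 163, 178, 183, 181, 180),
  weight = c(60, NA, 64, 62, 80, 84, NA, 82)
)
miss <- is.na(d$weight)

# Zero imputation: easy, but drags the mean down sharply.
w_zero <- ifelse(miss, 0, d$weight)

# Unconditional mean: every gap gets the overall observed mean.
w_mean <- ifelse(miss, mean(d$weight, na.rm = TRUE), d$weight)

# Conditional mean: every gap gets the observed mean of its own gender.
group_mean <- ave(d$weight, d$gender, FUN = function(w) mean(w, na.rm = TRUE))
w_cond <- ifelse(miss, group_mean, d$weight)

# k-nearest-neighbor imputation on one numeric predictor:
# average y over the k observed rows whose x is closest.
knn_impute <- function(y, x, k = 2) {
  obs <- which(!is.na(y))
  for (i in which(is.na(y))) {
    nearest <- obs[order(abs(x[obs] - x[i]))[seq_len(k)]]
    y[i] <- mean(y[nearest])
  }
  y
}
w_knn <- knn_impute(d$weight, d$height)

cbind(d, w_zero, w_mean, w_cond, w_knn)
```

Note that every method here fills each gap with a single fixed number, so the completed column looks more certain than it really is. That limitation is what multiple imputation tries to address.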
Multiple imputation: start by filling in random values, then iteratively predict the values of each variable until some convergence is reached
More difficult to implement
Requires (initially) more computations
More work involved in interpreting results
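To give a feel for the idea, here is a deliberately simplified base R sketch, not a full implementation. It assumes simulated data with only one incomplete variable, so no iteration across variables is needed; it uses a bootstrap to reflect uncertainty about the regression parameters, and it pools the results with Rubin's rules. The number of imputations `m` and all data-generating values are arbitrary assumptions.

```r
set.seed(7)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[runif(n) < plogis(x)] <- NA       # MAR: missingness depends on the observed x only
obs <- !is.na(y)

m <- 20                             # number of imputed data sets (assumed)
est <- numeric(m)
var_within <- numeric(m)
for (j in seq_len(m)) {
  # Bootstrap the observed rows so each imputation uses slightly different parameters.
  idx <- sample(which(obs), replace = TRUE)
  fit <- lm(y ~ x, data = data.frame(x = x[idx], y = y[idx]))
  # Impute each gap with a prediction plus random residual noise.
  y_imp <- y
  y_imp[!obs] <- predict(fit, data.frame(x = x[!obs])) +
    rnorm(sum(!obs), sd = summary(fit)$sigma)
  # Analysis of interest on the completed data: the mean of y.
  est[j] <- mean(y_imp)
  var_within[j] <- var(y_imp) / n
}

# Rubin's rules: pooled estimate, and within- plus between-imputation variance.
pooled_mean <- mean(est)
total_var <- mean(var_within) + (1 + 1 / m) * var(est)

c(complete_case = mean(y, na.rm = TRUE), pooled = pooled_mean, se = sqrt(total_var))
```

The between-imputation term in `total_var` is the extra uncertainty that a single imputation would hide, which is also why the results take more work to compute and interpret.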
R users need not worry: the R community has already built many packages for dealing with missing values.
mice: Multivariate Imputation by Chained Equations
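A minimal sketch of a typical workflow with the mice package, assuming it is installed (`install.packages("mice")`). The choice of the built-in `airquality` data, of `m = 5`, of predictive mean matching (`"pmm"`) and of the regression model are assumptions for illustration only.

```r
# Runs only if the mice package is available.
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)

  data(airquality)   # built-in data with NAs in Ozone and Solar.R

  # Create m = 5 imputed data sets using predictive mean matching.
  imp <- mice(airquality, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

  # Fit the same model on each completed data set.
  fit <- with(imp, lm(Ozone ~ Wind + Temp))

  # Combine the five fits into one set of estimates (Rubin's rules).
  pooled <- pool(fit)
  print(summary(pooled))

  # Extract the first completed data set for inspection.
  completed <- complete(imp, 1)
}
```

The three steps (impute, analyze each data set, pool) mirror the general multiple imputation recipe; only the final pooled summary should be reported, not the results from any single completed data set.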