The Missing Data Problem
🌼 Introduction
Missing data is one of the most common, and most dangerous, issues in machine learning. It can lower accuracy, bias results, and mislead your model entirely. In this episode, we explore practical strategies to detect, understand, and handle missing values while preserving your dataset’s true patterns.
🔍 1. Why Missing Data Happens
Human errors
Sensor failures
Data extraction issues
Optional fields in forms
Privacy restrictions
Understanding the source helps decide the right strategy.
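Before choosing a strategy, it helps to measure where the gaps are. The sketch below is a minimal example on a made-up DataFrame (the column names and values are purely illustrative) that counts missing cells per column and the number of incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "income": [52000, 61000, np.nan, np.nan],
    "city": ["Oslo", None, "Lima", "Pune"],
})

missing_count = df.isna().sum()               # missing cells per column
missing_share = df.isna().mean()              # fraction missing per column
rows_with_gaps = df.isna().any(axis=1).sum()  # rows with at least one gap

print(missing_count)
print(missing_share)
print("Incomplete rows:", rows_with_gaps)
```

A column with a high missing share, or gaps that cluster in particular rows, is usually a hint about which of the causes above is at work.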
🗑️ 2. Deletion Methods
a) Listwise Deletion (Complete Case Analysis)
Remove all rows with any missing value.
Pros: Simple, fast, and every remaining value is a real observation.
Cons: Can lose a lot of data → smaller training set; biased if the missingness is not random.
Use when: Missingness is rare and completely at random.
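A minimal sketch of listwise deletion with pandas, on a made-up two-column frame. It also reports how much data survives, which is the number to check before committing to this method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each column.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
})

complete = df.dropna()                    # drop every row with any missing value
kept_fraction = len(complete) / len(df)   # share of rows that survive

print(complete)
print("Rows kept:", kept_fraction)
```

Here two gaps in different rows already cost half of the dataset, which is why this method only suits data with very few missing values.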
b) Pairwise Deletion
Uses all available data for each calculation (e.g., correlations).
Pros: Keeps more data.
Cons: Harder to manage; inconsistent row counts.
Use when: Performing statistical analysis, not model training.
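As an illustration, pandas applies pairwise deletion when computing correlations: DataFrame.corr ignores missing values pair by pair. The sketch below uses made-up data and also counts how many rows back each pair, which makes the "inconsistent row counts" drawback visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; each column is missing a different row.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, np.nan],
    "y": [2.0, 4.0, np.nan, 8.0, 10.0],
    "z": [5.0, np.nan, 3.0, 2.0, 1.0],
})

# Each correlation uses only the rows where both columns are present.
pairwise_corr = df.corr()

# Number of rows available for each pair of columns.
observed = df.notna().astype(int)
pair_counts = observed.T @ observed

print(pairwise_corr)
print(pair_counts)
```

Every off-diagonal pair here rests on three rows, while listwise deletion would have left only two complete rows for all of them.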
🔧 3. Imputation Techniques
a) Mean / Median / Mode Imputation
Mean: For roughly symmetric numeric data without strong outliers
Median: For skewed data or data with outliers
Mode: For categorical features
Pros: Simple and fast
Cons: Reduces variability, can bias model
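The three strategies can be sketched with scikit-learn's SimpleImputer. The data is made up: an income column with one outlier, to show how far the mean and median fills can differ, and a categorical plan column for the mode.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric column with one outlier (400) and one gap.
income = np.array([[30.0], [32.0], [35.0], [np.nan], [400.0]])

# Hypothetical categorical column with one gap.
plan = np.array([["basic"], ["pro"], ["basic"], [np.nan]], dtype=object)

mean_filled = SimpleImputer(strategy="mean").fit_transform(income)
median_filled = SimpleImputer(strategy="median").fit_transform(income)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(plan)

print("mean fill:  ", mean_filled[3, 0])    # pulled upward by the outlier
print("median fill:", median_filled[3, 0])  # stays near the typical values
print("mode fill:  ", mode_filled[3, 0])    # most common category
```

The outlier drags the mean fill far above the typical incomes, which is the reason to prefer the median on skewed data.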
b) Forward Fill / Backward Fill
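Fill each gap with the previous valid value (forward fill) or the next valid value (backward fill).
Best for time-series data. Note that backward fill uses future values, so avoid it when building forecasting features.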
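A minimal sketch with pandas on a made-up daily temperature series. The limit argument caps how many consecutive gaps are filled, which can stop a stale value from being carried too far.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps in the middle and at the end.
ts = pd.Series(
    [20.1, np.nan, np.nan, 21.4, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

forward = ts.ffill()           # carry the last valid value forward
backward = ts.bfill()          # pull the next valid value backward
capped = ts.ffill(limit=1)     # fill at most one consecutive gap

print(forward)
print(backward)   # the final value stays missing: nothing comes after it
print(capped)
```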
c) KNN Imputation
Find the k nearest samples (measured on the features that are observed) and average their values for the missing feature.
Pros: Captures relationships across features
Cons: Slow for large datasets
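A small sketch using scikit-learn's KNNImputer on a made-up array. The third row is missing its last feature, and its two closest rows on the observed features supply the fill value.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: row 2 is missing its last feature.
X = np.array([
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [0.9, 1.9, np.nan],
    [8.0, 9.0, 50.0],
])

imputer = KNNImputer(n_neighbors=2)   # average over the 2 nearest rows
X_filled = imputer.fit_transform(X)

print(X_filled[2])   # last value is the mean of 10.0 and 12.0
```

The distant fourth row is ignored, so the fill reflects the local structure instead of the global column mean (which would be 24.0 here).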
d) Iterative Imputation (MICE)
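Model each feature with missing values as a function of the other features, and cycle through the features for several rounds.
Pros: Often the most accurate when features are strongly related
Cons: Expensive; risk of leakage if not fitted inside CV folds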
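One way to try this is scikit-learn's IterativeImputer, which is inspired by MICE but returns a single imputed dataset by default and is still flagged as experimental (hence the extra import). The sketch below uses synthetic data where the second feature is roughly twice the first, hides some values, and measures how close the fills come to the truth.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to unlock the import below)
from sklearn.impute import IterativeImputer

# Synthetic data (an assumption for illustration): x2 is about 2 * x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])

# Hide roughly 20% of the second feature.
X_missing = X.copy()
mask = rng.random(200) < 0.2
X_missing[mask, 1] = np.nan

imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X_missing)

# Mean absolute error of the fills against the hidden true values.
mae = np.abs(X_filled[mask, 1] - X[mask, 1]).mean()
print("Imputation MAE:", mae)
```

Because the imputer learns the relationship between the two features, its error should be far below what a plain mean fill would give on this data.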
📈 4. Impact on Model Performance
Missing-data handling changes the dataset, so always compare:

Method          | Pros               | Cons      | Impact
Listwise        | Clean data         | Data loss | Good if missing <5%
Mean Imputation | Stable             | Bias risk | Works for simple models
KNN             | Captures structure | Expensive | Often improves ML accuracy
Iterative       | Most accurate      | Slow      | Best for critical datasets
🧪 5. Real Example Comparison (Python Snippet)
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumes data.csv holds numeric features plus a "target" column.
df = pd.read_csv("data.csv")
X = df.drop("target", axis=1)
y = df["target"]

# Split once, before imputing, so every method sees the same split
# and the imputers never learn from the test rows (avoids leakage).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

methods = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
}

for name, imputer in methods.items():
    # Fit the imputer on the training split only, then apply it to both splits.
    X_train_imp = imputer.fit_transform(X_train)
    X_test_imp = imputer.transform(X_test)
    model = LogisticRegression(max_iter=1000).fit(X_train_imp, y_train)
    preds = model.predict(X_test_imp)
    print(name, accuracy_score(y_test, preds))
This prints the test-set accuracy for each imputation strategy.
🎯 6. Best Practices
Never fit the imputer on the full dataset → fit it on the training split inside each cross-validation fold to avoid leakage
Choose deletion only if missing data is minimal
Compare multiple methods
Use advanced imputation (KNN, iterative) for complex datasets
For time-series → prefer forward/backward fill
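The first practice is easiest to follow by putting the imputer in a scikit-learn pipeline, so it is refitted on the training part of every fold. A minimal sketch on synthetic data (the dataset, the 10% missing rate, and the median strategy are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic classification data with about 10% of cells knocked out.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer is a pipeline step, so each CV fold fits it on its own
# training rows and only transforms the held-out rows.
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(pipe, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Swapping SimpleImputer for another imputer in the pipeline lets you compare methods under the same leakage-free setup.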
⭐ Conclusion
Handling missing data properly can dramatically improve model accuracy, reduce bias, and stabilise performance. The right technique depends on your data distribution, missingness pattern, and model type.












