Schooling, Machine Learning and Bias and Variance
Remember our good old school days? What if we sent machines to school? Would it make them better? Interested? Read on. It's just an analogy.
Remember how we studied all year to learn many subjects, took class tests, and finally sat the year-end exam? The fewer errors we made in a test, the more marks we got and the higher our percentage. The whole point was to prove that we had learnt properly over the year, and we did that by answering the questions correctly. The teacher's job was to evaluate the answers and suggest corrective measures so that the student could minimize errors in the final exam. Those corrective measures were mainly applied at the mid-term test.
Machine learning is quite similar. The machine learns from a large dataset (called the training set) and forms a hypothesis (denoted h(theta)). It then performs model selection as a corrective measure using the cross-validation set (much like our mid-term exam), and is finally evaluated on the test set (X(test)), which is more like our final exam. We judge the hypothesis to be good if it makes few errors on that test set.
A rule of thumb for splitting the dataset is to first shuffle it randomly and then take:
60% for the training set: used for learning.
20% for the cross-validation set: used for model selection, such as choosing the degree of the polynomial features or the regularization parameter (not covered in detail here; that would be a separate discussion).
20% for the test set: the final exam.
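The shuffle-then-split rule above can be sketched in a few lines of Python. This is only an illustration: the function name split_dataset, its parameters, and the fixed seed are assumptions made for the example, not something prescribed by the rule of thumb.

```python
import random

def split_dataset(examples, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle a copy of `examples` and split it into train / CV / test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_cv = round(len(shuffled) * cv_frac)
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]  # whatever remains (about 20%)
    return train, cv, test

# Hypothetical dataset: the integers 0..99 stand in for 100 examples.
train_set, cv_set, test_set = split_dataset(range(100))
print(len(train_set), len(cv_set), len(test_set))
```

Shuffling before slicing matters: if the data is sorted in any way (by date, by label), an unshuffled split would give the three sets different distributions.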
Now, forming a hypothesis boils down to finding the parameter vector theta that minimizes the difference between predicted behavior and actual behavior. A good way to check whether your hypothesis is learning is to plot the error, i.e. the cost function J(theta), against the training set size. If you start with a few training examples and then move on to a larger training set, the training error will increase, because a handful of examples is easy to fit almost perfectly while many examples are not. Now measure the error of each of those trained hypotheses on your cross-validation (CV) set. The CV error would typically decrease as the training set grows, because a hypothesis trained on more data generalizes better.
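As a rough sketch of how such a learning curve can be computed, the snippet below fits a straight line on growing slices of a training set and records both errors. Everything specific here is an assumption for illustration: the synthetic data, the use of mean squared error as the cost, and the helper names mean_squared_error and learning_curve.

```python
import numpy as np

def mean_squared_error(theta, x, y):
    """Average squared gap between the fitted line and the actual values."""
    predictions = theta[0] * x + theta[1]  # theta = [slope, intercept]
    return float(np.mean((predictions - y) ** 2))

def learning_curve(x_train, y_train, x_cv, y_cv, sizes):
    """Return (train_errors, cv_errors), one entry per training-set size."""
    train_errors, cv_errors = [], []
    for m in sizes:
        # Fit a straight line using only the first m training examples.
        theta = np.polyfit(x_train[:m], y_train[:m], 1)
        train_errors.append(mean_squared_error(theta, x_train[:m], y_train[:m]))
        # The CV error is always measured on the full CV set.
        cv_errors.append(mean_squared_error(theta, x_cv, y_cv))
    return train_errors, cv_errors

# Synthetic data (an assumption for the demo): a noisy straight line.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 160)
y = 2 * x + 1 + rng.normal(0, 1, 160)
x_train, y_train = x[:120], y[:120]
x_cv, y_cv = x[120:], y[120:]

sizes = [2, 5, 10, 20, 40, 80, 120]
train_errors, cv_errors = learning_curve(x_train, y_train, x_cv, y_cv, sizes)
for m, tr, cv in zip(sizes, train_errors, cv_errors):
    print(f"m={m:4d}  train error={tr:.3f}  cv error={cv:.3f}")
```

With only two training examples a line passes through both points, so the training error starts at essentially zero and can only go up from there. Plotting the two lists against sizes gives the learning curve.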
With more and more data, one of two things may happen:
(Case 1) The gap between the training error and the cross-validation error closes, but both errors stay high.
(Case 2) A significant gap remains between the training error and the cross-validation error: the training error is low while the cross-validation error is much higher.
Case 1 signifies that our hypothesis is under-fitting the data; this is known as high bias. It means that getting more data is NOT going to improve the hypothesis. Instead:
Get additional features, or add polynomial features, and/or
Decrease the regularization parameter (lambda).
Case 2 signifies that our hypothesis is over-fitting the data; this is known as high variance. It means that getting more data is likely to improve the hypothesis. So, to fix high variance:
Get more training examples, and/or
Try a smaller set of features, and/or
Increase the regularization parameter (lambda).
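To see what turning the regularization knob looks like in practice, here is a minimal sketch that fits a degree-8 polynomial with several values of lambda and picks the one with the lowest cross-validation error. The synthetic data, the candidate lambda values, the choice of an L2 (ridge) penalty, and the helper names poly_features, fit_ridge and mse are all assumptions made for this example.

```python
import numpy as np

def poly_features(x, degree):
    """Columns x^0, x^1, ..., x^degree (the first column is the intercept)."""
    return np.vander(x, degree + 1, increasing=True)

def fit_ridge(X, y, lam):
    """Least squares with an L2 penalty lam on every weight except the intercept."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # leave the intercept unregularized
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def mse(X, y, theta):
    """Mean squared error of the predictions X @ theta."""
    return float(np.mean((X @ theta - y) ** 2))

# Synthetic data (an assumption for the demo): a noisy sine wave.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.2, 60)
X = poly_features(x, degree=8)
X_train, y_train = X[:30], y[:30]
X_cv, y_cv = X[30:], y[30:]

lambdas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_errors, cv_errors = [], []
for lam in lambdas:
    theta = fit_ridge(X_train, y_train, lam)
    train_errors.append(mse(X_train, y_train, theta))
    cv_errors.append(mse(X_cv, y_cv, theta))

# Model selection: keep the lambda that does best on the CV set.
best_lambda = lambdas[int(np.argmin(cv_errors))]
for lam, tr, cv in zip(lambdas, train_errors, cv_errors):
    print(f"lambda={lam:7.3f}  train error={tr:.4f}  cv error={cv:.4f}")
print("best lambda by CV error:", best_lambda)
```

The training error can only rise as lambda grows, since the penalty pulls the fit away from the training data. A very small lambda leans towards high variance and a very large one towards high bias, which is why the choice is made on the cross-validation set rather than the training set.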
Using Principal Component Analysis (PCA) to reduce the dimensionality of the data as a way to fix high variance is an extremely BAD IDEA, and a very common misuse of PCA. PCA picks its components without looking at the labels, so it may throw away information the hypothesis needs. Increase the regularization parameter instead.
Recommended approach in machine learning:
Start with a simple, quick-and-dirty algorithm. Do not spend too much time on the algorithm at first.
Plot learning curves to see whether more data, more features, etc. are likely to help.
Manually examine the cross-validation examples on which the algorithm made errors.
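For the last step, a small helper can pull out the cross-validation examples with the largest errors so they can be inspected by hand. This is a sketch under assumptions: it presumes a regression-style setting where the error is the absolute difference between prediction and label, and the names worst_examples and predict are hypothetical.

```python
def worst_examples(examples, labels, predict, k=5):
    """Return the k (error, example, label) triples with the largest absolute error."""
    scored = [(abs(predict(x) - y), x, y) for x, y in zip(examples, labels)]
    scored.sort(key=lambda item: item[0], reverse=True)  # biggest mistakes first
    return scored[:k]

# Hypothetical hypothesis and a tiny hand-made CV set.
predict = lambda x: 2 * x
cv_examples = [1, 2, 3, 4]
cv_labels = [2, 4, 10, 8.5]

worst = worst_examples(cv_examples, cv_labels, predict, k=2)
for error, x, y in worst:
    print(f"x={x}  actual={y}  predicted={predict(x)}  error={error}")
```

Looking at what the worst cases have in common often suggests which new feature to add next.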













