Stacking,Blending
The basic idea behind stacked generalization is to use a pool of base classifiers, then using another classifier to combine their predictions, with the aim of reducing the generalization error.
Let’s say you want to do 2-fold stacking (Generalize to k-fold ):
Split the train set in 2 parts: train_a and train_b
Fit a first-stage model on train_a and create predictions for train_b
Fit the same model on train_b and create predictions for train_a
Finally fit the model on the entire train set and create predictions for the test set. That is to say , we find the best model via 2-fold CV.
Now train a second-stage stacker model on the probabilities from the first-stage model(s).
A stacker model gets more information on the problem space by using the first-stage predictions as features, than if it was trained in isolation.
With blending, instead of creating out-of-fold predictions for the train set, you create a small holdout set of say 10% of the train set. The stacker model then trains on this holdout set only.
Blending has a few benefits:
It is simpler than stacking.
It wards against an information leak: The generalizers and stackers use different data.
You do not need to share a seed for stratified folds with your teammates. Anyone can throw models in the ‘blender’ and the blender decides if it wants to keep that model or not.
The cons are:
You use less data overall
The final model may overfit to the holdout set.
Your CV is more solid with stacking (calculated over more folds) than using a single small holdout set.
"""Kaggle competition: Predicting a Biological Response.Blending {RandomForests, ExtraTrees, GradientBoosting} + stretching to[0,1]. The blending scheme is related to the idea Jose H. Solorzanopresented here:http://www.kaggle.com/c/bioresponse/forums/t/1889/question-about-the-process-of-ensemble-learning/10950#post10950'''You can try this: In one of the 5 folds, train the models, then usethe results of the models as 'variables' in logistic regression overthe validation data of that fold'''. Or at least this is theimplementation of my understanding of that idea :-)The predictions are saved in test.csv. The code below created my bestsubmission to the competition:- public score (25%): 0.43464- private score (75%): 0.37751- final rank on the private leaderboard: 17th over 711 teams :-)Note: if you increase the number of estimators of the classifiers,e.g. n_estimators=1000, you get a better score/rank on the privatetest set.Copyright 2012, Emanuele Olivetti.BSD license, 3 clauses.""" from __future__ import divisionimport numpy as npimport load_datafrom sklearn.cross_validation import StratifiedKFoldfrom sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifierfrom sklearn.linear_model import LogisticRegression def logloss(attempt, actual, epsilon=1.0e-15): """Logloss, i.e. the score of the bioresponse competition. """ attempt = np.clip(attempt, epsilon, 1.0-epsilon) return - np.mean(actual * np.log(attempt) + (1.0 - actual) * np.log(1.0 - attempt)) if __name__ == '__main__': np.random.seed(0) # seed to shuffle the train set n_folds = 10 verbose = True shuffle = False X, y, X_submission = load_data.load() if shuffle: idx = np.random.permutation(y.size) X = X[idx] y = y[idx] skf = list(StratifiedKFold(y, n_folds)) clfs = [RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'), RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'), ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='gini'), ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'), GradientBoostingClassifier(learn_rate=0.05, subsample=0.5, max_depth=6, n_estimators=50)] print "Creating train and test sets for blending." dataset_blend_train = np.zeros((X.shape[0], len(clfs))) dataset_blend_test = np.zeros((X_submission.shape[0], len(clfs))) for j, clf in enumerate(clfs): print j, clf dataset_blend_test_j = np.zeros((X_submission.shape[0], len(skf))) for i, (train, test) in enumerate(skf): print "Fold", i X_train = X[train] y_train = y[train] X_test = X[test] y_test = y[test] clf.fit(X_train, y_train) y_submission = clf.predict_proba(X_test)[:,1] dataset_blend_train[test, j] = y_submission dataset_blend_test_j[:, i] = clf.predict_proba(X_submission)[:,1] dataset_blend_test[:,j] = dataset_blend_test_j.mean(1) print print "Blending." clf = LogisticRegression() clf.fit(dataset_blend_train, y) y_submission = clf.predict_proba(dataset_blend_test)[:,1] print "Linear stretch of predictions to [0,1]" y_submission = (y_submission - y_submission.min()) / (y_submission.max() - y_submission.min()) print "Saving Results." np.savetxt(fname='test.csv', X=y_submission, fmt='%0.9f')








