DataDriven @datascistuff-blog - Tumblr Blog

Stacking, Blending

The basic idea behind stacked generalization is to use a pool of base classifiers and then use another classifier to combine their predictions, with the aim of reducing the generalization error.

Let’s say you want to do 2-fold stacking (this generalizes to k-fold):

Split the train set in 2 parts: train_a and train_b

Fit a first-stage model on train_a and create predictions for train_b

Fit the same model on train_b and create predictions for train_a

Finally, fit the model on the entire train set and create predictions for the test set. That is to say, every train-set prediction comes from a model that did not see that row, exactly as in 2-fold CV.

Now train a second-stage stacker model on the probabilities from the first-stage model(s).

A stacker model gets more information on the problem space by using the first-stage predictions as features than it would if it were trained in isolation.
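The 2-fold procedure above can be sketched in plain Python. This is only an illustration under loud assumptions: the data is synthetic, the base models (fit_class_mean_model) and the gradient-descent stacker (fit_logistic) are toy stand-ins invented for this sketch, and none of the names come from a library.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_class_mean_model(X, y, feature):
    """Toy base model: score a row by how much closer one feature is to the class-1 mean."""
    m0 = sum(x[feature] for x, t in zip(X, y) if t == 0) / y.count(0)
    m1 = sum(x[feature] for x, t in zip(X, y) if t == 1) / y.count(1)
    return lambda x: sigmoid(abs(x[feature] - m0) - abs(x[feature] - m1))

def out_of_fold(fit, X, y, folds):
    """Predict every training row with a model that never saw that row."""
    oof = [0.0] * len(X)
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in held_out:
            oof[i] = model(X[i])
    return oof

def fit_logistic(F, y, lr=0.5, epochs=500):
    """Tiny logistic-regression stacker trained by batch gradient descent."""
    w = [0.0] * (len(F[0]) + 1)  # bias + one weight per base model
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for f, t in zip(F, y):
            err = sigmoid(w[0] + sum(wi * fi for wi, fi in zip(w[1:], f))) - t
            grad[0] += err
            for j, fj in enumerate(f):
                grad[j + 1] += err * fj
        w = [wi - lr * g / len(F) for wi, g in zip(w, grad)]
    return lambda f: sigmoid(w[0] + sum(wi * fi for wi, fi in zip(w[1:], f)))

random.seed(0)
# Synthetic data (an assumption): two noisy features, class 1 shifted upward on both.
y_train = [i % 2 for i in range(200)]
X_train = [[t + random.gauss(0, 1), t + random.gauss(0, 1)] for t in y_train]
y_test = [i % 2 for i in range(100)]
X_test = [[t + random.gauss(0, 1), t + random.gauss(0, 1)] for t in y_test]

base_fits = [lambda X, y, f=f: fit_class_mean_model(X, y, f) for f in (0, 1)]
idx = list(range(len(X_train)))
random.shuffle(idx)
folds = [idx[:100], idx[100:]]  # train_a and train_b

# Stage 1: out-of-fold predictions become the stacker's training features.
oof_cols = [out_of_fold(fit, X_train, y_train, folds) for fit in base_fits]
stack_train = [list(row) for row in zip(*oof_cols)]

# Refit each base model on the full train set to score the test set.
full_models = [fit(X_train, y_train) for fit in base_fits]
stack_test = [[m(x) for m in full_models] for x in X_test]

# Stage 2: the stacker learns how to combine the base predictions.
stacker = fit_logistic(stack_train, y_train)
test_probs = [stacker(f) for f in stack_test]
accuracy = sum((p > 0.5) == bool(t) for p, t in zip(test_probs, y_test)) / len(y_test)
print(f"stacked test accuracy: {accuracy:.2f}")
```

Passing more index lists in folds turns the same out_of_fold function into k-fold stacking.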

With blending, instead of creating out-of-fold predictions for the train set, you create a small holdout set of say 10% of the train set. The stacker model then trains on this holdout set only.

Blending has a few benefits:

It is simpler than stacking.

It wards against an information leak: The generalizers and stackers use different data.

You do not need to share a seed for stratified folds with your teammates. Anyone can throw models in the ‘blender’ and the blender decides if it wants to keep that model or not.

The cons are:

You use less data overall

The final model may overfit to the holdout set.

Your CV is more solid with stacking (calculated over more folds) than using a single small holdout set.

"""Kaggle competition: Predicting a Biological Response.

Blending {RandomForests, ExtraTrees, GradientBoosting} + stretching to
[0,1]. The blending scheme is related to the idea Jose H. Solorzano
presented here:
http://www.kaggle.com/c/bioresponse/forums/t/1889/question-about-the-process-of-ensemble-learning/10950#post10950
'''You can try this: In one of the 5 folds, train the models, then use
the results of the models as 'variables' in logistic regression over
the validation data of that fold'''. Or at least this is the
implementation of my understanding of that idea :-)

The predictions are saved in test.csv. The code below created my best
submission to the competition:
- public score (25%): 0.43464
- private score (75%): 0.37751
- final rank on the private leaderboard: 17th over 711 teams :-)

Note: if you increase the number of estimators of the classifiers,
e.g. n_estimators=1000, you get a better score/rank on the private
test set.

Copyright 2012, Emanuele Olivetti.
BSD license, 3 clauses.
"""

from __future__ import division
import numpy as np
import load_data
from sklearn.cross_validation import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression


def logloss(attempt, actual, epsilon=1.0e-15):
    """Logloss, i.e. the score of the bioresponse competition.
    """
    attempt = np.clip(attempt, epsilon, 1.0-epsilon)
    return - np.mean(actual * np.log(attempt) + (1.0 - actual) * np.log(1.0 - attempt))


if __name__ == '__main__':

    np.random.seed(0) # seed to shuffle the train set

    n_folds = 10
    verbose = True
    shuffle = False

    X, y, X_submission = load_data.load()

    if shuffle:
        idx = np.random.permutation(y.size)
        X = X[idx]
        y = y[idx]

    skf = list(StratifiedKFold(y, n_folds))

    clfs = [RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
            RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
            ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
            ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
            GradientBoostingClassifier(learn_rate=0.05, subsample=0.5, max_depth=6, n_estimators=50)]

    print "Creating train and test sets for blending."

    dataset_blend_train = np.zeros((X.shape[0], len(clfs)))
    dataset_blend_test = np.zeros((X_submission.shape[0], len(clfs)))

    for j, clf in enumerate(clfs):
        print j, clf
        dataset_blend_test_j = np.zeros((X_submission.shape[0], len(skf)))
        for i, (train, test) in enumerate(skf):
            print "Fold", i
            X_train = X[train]
            y_train = y[train]
            X_test = X[test]
            y_test = y[test]
            clf.fit(X_train, y_train)
            y_submission = clf.predict_proba(X_test)[:,1]
            dataset_blend_train[test, j] = y_submission
            dataset_blend_test_j[:, i] = clf.predict_proba(X_submission)[:,1]
        dataset_blend_test[:,j] = dataset_blend_test_j.mean(1)

    print
    print "Blending."
    clf = LogisticRegression()
    clf.fit(dataset_blend_train, y)
    y_submission = clf.predict_proba(dataset_blend_test)[:,1]

    print "Linear stretch of predictions to [0,1]"
    y_submission = (y_submission - y_submission.min()) / (y_submission.max() - y_submission.min())

    print "Saving Results."
    np.savetxt(fname='test.csv', X=y_submission, fmt='%0.9f')

CrossValidation

Cross-validation: one should build the model entirely on the training data. The test data should not be used during the model building and tuning phases.

One can use cross-validation to tune parameters, choose which algorithm to use, etc.

K-fold CV

------------

If K is large -- The training sets overlap heavily, so most of the data in them is the same, and each validation fold is small. The CV error estimate then has LOW BIAS and HIGH VARIANCE.

If K is small -- Each training set is much smaller than the full data set, and training and validation data can differ significantly from fold to fold. The CV error estimate then has HIGH BIAS and LOW VARIANCE.

High variance goes with OVERFITTING.

One can also use the bootstrap (sampling with replacement) or repeated subsampling without replacement.
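To make the effect of K concrete, here is a small standard-library sketch of how the fold sizes change with K, plus one bootstrap draw. The function names k_fold_indices and bootstrap_indices are made up for this illustration and are not from any package.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and yield (train, validation) index lists for each of k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, val

def bootstrap_indices(n, seed=0):
    """Sample n indices with replacement; rows never drawn form the out-of-bag set."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(sample))
    return sample, out_of_bag

n = 100
for k in (2, 10, n):  # k = n is leave-one-out
    train, val = next(k_fold_indices(n, k))
    print(f"K={k:3d}: train size {len(train)}, validation size {len(val)}")

sample, oob = bootstrap_indices(n)
print(f"bootstrap: {len(set(sample))} distinct rows drawn, {len(oob)} left out-of-bag")
```

With large K each training set is nearly the whole data set. With the bootstrap, the rows that were never drawn can serve as the validation set.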

#CrossValidation

Using Regression: R Notes

Steps:

1. Identify the kind of features the data has (numeric or categorical).

2. Regression needs Numerical features. So, we need to bucket or convert categorical features into numeric.

3. Distributions:

Check the distribution of the dependent variable. Linear regression generally works best when it is roughly normal, so some transformation might be required.

Compute the joint distribution of the dependent variable with the independent variables.

Numeric variables -- correlation, density

Categorical variables -- use table()

4. Distributions of Independent Variables:

a. Correlation

b. Graph using pairs, pairs.panels (library psych)

5. Use lm function to build REGRESSION MODEL.

6. Look at summary of the model using summary()

a. Look at the distribution of the residual errors.

b. R-squared value: how well the dependent variable is modeled, i.e. how much of its variance is explained.

c. Look at the Signif. values to identify how much predictive power each feature has.

The stars (for example, ***) indicate the predictive power of each feature in the model. The significance level (as listed by the Signif. codes in the footer) measures how surprising the estimate would be if the true coefficient were zero. The presence of three stars indicates a p-value between 0 and 0.001, which means that the feature is extremely unlikely to be unrelated to the dependent variable. A common practice is to use a significance level of 0.05 to denote a statistically significant variable. If the model had few features that were statistically significant, it may be cause for concern, since it would indicate that our features are not very predictive of the outcome. Here, our model has several significant variables, and they seem to be related to the outcome in logical ways.

7. In Regression, feature selection and model specs are the analyst's job.

a. Non-linear terms: x^2, xy, etc.

b. Transformation – converting a numeric variable to a binary indicator

c. Domain Knowledge Helps
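For intuition about what lm() and summary() report in steps 5 and 6, here is a minimal Python sketch of a one-predictor least-squares fit with its residuals and R-squared. The function name fit_simple_ols and the toy numbers are assumptions made for the example; it is not a substitute for R's output.

```python
def fit_simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x, returning R-squared and residuals too."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r ** 2 for r in residuals)          # unexplained variation
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)     # total variation
    return intercept, slope, 1.0 - ss_res / ss_tot, residuals

# Hypothetical data that is close to a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope, r_squared, residuals = fit_simple_ols(x, y)
print(f"intercept={intercept:.3f} slope={slope:.3f} R^2={r_squared:.4f}")
```

R-squared is one minus the ratio of residual to total variation, which is the "how much of the variance is described" reading from step 6b.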

#linear regression

Exploratory Data Science Recipes 1

Given a dataset in a tabular form, we want to study the relationships between predictor variables and dependent variable.

Recipes:

1. Given a data.frame, how do we split it into training, test, and calibration data?

a. Create a vector of size equal to the number of rows in the data.frame, populated with random numbers generated uniformly between 0 and 1, and store it as the rgroup column:

data.f$rgroup <- runif(nrow(data.f))

dTrainAll <- subset(data.f, rgroup<=0.9) #Approx 90%

dTest <- subset(data.f,rgroup>0.9)

###Split dTrainAll into Calibration Data and Train Data

useForCal <- rbinom(n=dim(dTrainAll)[[1]],size=1,prob=0.1)>0

dCal <-subset(dTrainAll,useForCal)

dTrain<-subset(dTrainAll,!useForCal)

Find out which variables are Categorical and which are Numeric

vars <-colnames(data.f)

catVars <- vars[sapply(dTrainAll[,vars],class) %in% c('factor','character')]

numericVars<-vars[sapply(dTrainAll[,vars],class) %in% c('numeric','integer')]
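The same recipe can be sketched in Python for a table held as a list of dicts. The names split_rows and column_kinds, the toy table, and the seed are all assumptions for illustration; the comments point back to the R lines they mimic.

```python
import random

def split_rows(rows, test_frac=0.1, cal_frac=0.1, seed=42):
    """Randomly route each row to train, calibration, or test."""
    rng = random.Random(seed)
    train, cal, test = [], [], []
    for row in rows:
        if rng.random() > 1.0 - test_frac:   # plays the role of rgroup > 0.9
            test.append(row)
        elif rng.random() < cal_frac:        # plays the role of rbinom(..., prob = 0.1) > 0
            cal.append(row)
        else:
            train.append(row)
    return train, cal, test

def column_kinds(rows):
    """Label columns numeric or categorical based on the first row's value types."""
    first = rows[0]
    numeric = [k for k, v in first.items()
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    categorical = [k for k in first if k not in numeric]
    return numeric, categorical

# Hypothetical toy table: one numeric and one categorical column.
rows = [{"age": 20 + i % 50, "plan": "basic" if i % 3 else "premium"} for i in range(1000)]
train, cal, test = split_rows(rows)
print(len(train), len(cal), len(test))
print(column_kinds(rows))
```

As in the R version, the split sizes are only approximately 90/10 because each row is assigned independently at random.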

Generative Learning Algorithms: NAIVE BAYES

1. Input Data x is discrete. For example in text classification, x is a vector of 0 and 1 (bag of words). The vector size is the size of the vocabulary V.

2. In generative algorithms, we build a model of the input data. For example, we can assume that the input data follows a multivariate Gaussian distribution.

Q. What is the data input model in NB ?

It's a MULTINOMIAL distribution. For a vector of size V, x can take 2^V possible values, so modeling x explicitly as a draw from a multinomial over all of them needs a parameter vector of dimension 2^V - 1, which is intractable.

Q. How do we solve this dimensionality problem?

By making the INDEPENDENCE assumption which is clearly not correct but works.

What is independence? If x and y are independent, then

p(x) = p(x|y)

In NB, the chain rule gives

p(x1,x2,…,xV|y) = p(x1|y) p(x2|y,x1) p(x3|y,x1,x2) …

and the independence assumption reduces this to

= p(x1|y) p(x2|y) … p(xV|y)

That is, words in a document are assumed to be independent of each other given the class.
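A quick Python sketch of why the independence assumption helps: it compares the parameter counts and evaluates the factorized likelihood for a 0/1 bag-of-words vector. The function names and the three word probabilities are hypothetical, chosen only for the example.

```python
def full_joint_parameters(vocab_size):
    """Free parameters needed to model p(x | y) for one class with no assumptions."""
    return 2 ** vocab_size - 1

def naive_bayes_parameters(vocab_size):
    """Free parameters per class once words are conditionally independent given y."""
    return vocab_size

def naive_bayes_likelihood(x, word_probs):
    """p(x | y) = product over words of p(x_i | y) for a 0/1 bag-of-words vector x."""
    likelihood = 1.0
    for present, p in zip(x, word_probs):
        likelihood *= p if present else (1.0 - p)
    return likelihood

print(full_joint_parameters(20), naive_bayes_parameters(20))

# Hypothetical p(word_i = 1 | y = spam) for a 3-word vocabulary.
p_word_given_spam = [0.8, 0.1, 0.5]
print(naive_bayes_likelihood([1, 0, 1], p_word_given_spam))  # 0.8 * 0.9 * 0.5
```

Even for a 20-word vocabulary the unrestricted model needs over a million parameters per class, while the factorized one needs 20.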

#machine learning #Naive Bayes

Generative Learning Models: Gaussian Discriminant Analysis Model

GDA is a classification algorithm. The basic premise is as follows. Suppose we have two classes with labels y=1 and y=0. GDA assumes (and this is a bad assumption to make if it is wrong) that the data from each class comes from a multivariate Gaussian distribution, with a different mean per class and a shared covariance matrix.

Generative models model the distribution of the data for each of the classes. Here the data is assumed to be generated by the Gaussian distributions. GDA finds the parameters that separate the two classes, such that P(y=1|x) is greater than 0.5 on one side of the boundary and less than 0.5 on the other.

GDA Parameters

----------------------

Gaussian distributions of the 2 classes: mean1, mean2, and a sigma (covariance) matrix.

Also the prior distribution over classes c1 and c2, which is, for example, Bernoulli with parameter PHI.

GDA and Logistic Regression

Numerical Optimization

------------------------------

We need to estimate the parameters that maximize the likelihood of the data we actually observed.

There are two concepts here:

Likelihood (MLE)

Log-Likelihood

GDA vs LR (Duality)

-------------------------------------

GDA and LR have a duality. If the data for the classes is indeed from Gaussian distributions, GDA makes sense because this is the assumption it works on.

P(y=1|x) can be expressed as an LR equation for some Theta and the same x.

If p(x|y) is Gaussian, GDA is better than LR.

However, LR does not assume that the data is Gaussian, and hence it is more robust to incorrect modelling assumptions. If the data is Poisson, for example, LR still works well but GDA does not.

That's why LR is more popular: it makes no assumptions about the distribution of the input data.
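The claim that P(y=1|x) under GDA can be written as an LR equation can be checked numerically in one dimension. The sketch below assumes a shared standard deviation sigma for both classes. The function names and parameter values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, sigma):
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gda_posterior(x, phi, mu0, mu1, sigma):
    """P(y=1 | x) by Bayes' rule with Gaussian class densities and Bernoulli prior phi."""
    joint1 = gaussian_pdf(x, mu1, sigma) * phi
    joint0 = gaussian_pdf(x, mu0, sigma) * (1.0 - phi)
    return joint1 / (joint0 + joint1)

def equivalent_logistic(phi, mu0, mu1, sigma):
    """Theta for which sigmoid(theta0 + theta1 * x) matches the GDA posterior."""
    theta1 = (mu1 - mu0) / sigma ** 2
    theta0 = math.log(phi / (1.0 - phi)) - (mu1 ** 2 - mu0 ** 2) / (2 * sigma ** 2)
    return theta0, theta1

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters: prior P(y=1) = 0.3, class means -1 and 2, shared sigma 1.5.
phi, mu0, mu1, sigma = 0.3, -1.0, 2.0, 1.5
theta0, theta1 = equivalent_logistic(phi, mu0, mu1, sigma)
for x in (-2.0, 0.0, 0.5, 3.0):
    print(x, round(gda_posterior(x, phi, mu0, mu1, sigma), 6), round(sigmoid(theta0 + theta1 * x), 6))
```

The two columns agree because the quadratic terms in x cancel when the covariance is shared. The converse does not hold: a logistic posterior does not imply Gaussian class densities.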

#machine learning #gaussian #logistic regression

R Text Mining library(help=tm)

Information on package ‘tm’

Description:

Package:            tm
Title:              Text Mining Package
Version:            0.5-9
Date:               2013-06-17
Authors@R:          c(person("Ingo", "Feinerer", role = c("aut", "cre"), email = "[email protected]"), person("Kurt", "Hornik", role = "aut"))
Depends:            R (>= 2.14.0), methods
Imports:            parallel, slam (>= 0.1-22)
Suggests:           filehash, proxy, Rgraphviz, SnowballC, XML
SystemRequirements: Antiword (http://www.winfield.demon.nl/) for reading MS Word files, pdftotext from Poppler (http://poppler.freedesktop.org/) for reading PDF
Description:        A framework for text mining applications within R.
License:            GPL (>= 2)
URL:                http://tm.r-forge.r-project.org/
Packaged:           2013-06-18 09:35:09 UTC; hornik
Author:             Ingo Feinerer [aut, cre], Kurt Hornik [aut]
Maintainer:         Ingo Feinerer <[email protected]>
NeedsCompilation:   yes
Repository:         CRAN
Date/Publication:   2013-06-18 12:14:27
Built:              R 3.0.0; x86_64-pc-linux-gnu; 2013-07-01 08:39:04 UTC; unix

Index:

DataframeSource              Data Frame Source
Dictionary                   Dictionary
DirSource                    Directory Source
FunctionGenerator            Function Generator
GmaneSource                  Gmane Source
PCorpus                      Permanent Corpus Constructor
PlainTextDocument            Plain Text Document
RCV1Document                 RCV1 Text Document
Reuters21578Document         Reuters-21578 Text Document
ReutersSource                Reuters-21578 XML Source
Source                       Access Sources
TermDocumentMatrix           Term-Document Matrix
TextDocument                 Access and Modify Text Documents
TextRepository               Text Repository
URISource                    Uniform Resource Identifier Source
VCorpus                      Volatile Corpus
VectorSource                 Vector Source
WeightFunction               Weighting Function
XMLSource                    XML Source
Zipf_plot                    Explore Corpus Term Frequency Characteristics
acq                          50 Exemplary News Articles from the Reuters-21578 XML Data Set of Topic acq
as.PlainTextDocument         Create Objects of Class PlainTextDocument
c.Corpus                     Combine Corpora, Documents, Term-Document Matrices, and Term Frequency Vectors
crude                        20 Exemplary News Articles from the Reuters-21578 XML Data Set of Topic crude
dissimilarity                Dissimilarity
findAssocs                   Find Associations in a Term-Document Matrix
findFreqTerms                Find Frequent Terms
getFilters                   List Available Filters
getReaders                   List Available Readers
getSources                   List Available Sources
getTokenizers                List Available Tokenizers
getTransformations           List Available Transformations
inspect                      Inspect Objects
makeChunks                   Split a Corpus into Chunks
materialize                  Materialize Lazy Mappings
meta                         Meta Data Management
ncol.TermDocumentMatrix      The Number of Rows/Columns/Dimensions/Documents/Terms of a Term-Document Matrix
plot.TermDocumentMatrix      Visualize a Term-Document Matrix
preprocessReut21578XML       Preprocess the Reuters-21578 XML archive.
prescindMeta                 Prescind Document Meta Data
readDOC                      Read In a MS Word Document
readGmane                    Read In a Gmane RSS Feed
readPDF                      Read In a PDF Document
readPlain                    Read In a Text Document
readRCV1                     Read In a Reuters Corpus Volume 1 Document
readReut21578XML             Read In a Reuters-21578 XML Document
readTabular                  Read In a Text Document
readXML                      Read In an XML Document
read_dtm_Blei_et_al          Read Document-Term Matrices
removeNumbers                Remove Numbers from a Text Document
removePunctuation            Remove Punctuation Marks from a Text Document
removeSparseTerms            Remove Sparse Terms from a Term-Document Matrix
removeWords                  Remove Words from a Text Document
rownames.TermDocumentMatrix  Row, Column, Dim Names, Document IDs, and Terms
sFilter                      Statement Filter
scan_tokenizer               Tokenizers
searchFullText               Full Text Search
stemCompletion               Complete Stems
stemDocument                 Stem Words
stopwords                    Stopwords
stripWhitespace              Strip Whitespace from a Text Document
termFreq                     Term Frequency Vector
tm_filter                    Filter and Index Functions on Corpora
tm_intersect                 Intersection between Documents and Words
tm_map                       Transformations on Corpora
tm_reduce                    Combine Transformations
tm_tag_score                 Compute a Tag Score
weightBin                    Weight Binary
weightSMART                  SMART Weightings
weightTf                     Weight by Term Frequency
weightTfIdf                  Weight by Term Frequency - Inverse Document Frequency
writeCorpus                  Write a Corpus to Disk

Further information is available in the following vignettes in directory ‘/home/sverma/R/x86_64-pc-linux-gnu-library/3.0/tm/doc’:

extensions: Extensions (source, pdf)
tm: Introduction to the tm Package (source, pdf)

#text mining #R

LDA: Topic Modeling using topicmodels package in R

Packages Required : topicmodels, tm (textmining), SnowballC (R interface to the C libstemmer library that implements Porter's word stemming algorithm for collapsing words)

library(topicmodels)
library(tm)
library("XML")
library("SnowballC")

set.seed(1102)

install.packages("corpus.JSS.papers",repos ="http://datacube.wu.ac.at/")

data("JSS_papers", package = "corpus.JSS.papers")

attributes(JSS_papers) #dim 556, 15 Matrix

remove_HTML_markup <- function(s) {
  doc <- htmlTreeParse(s, asText = TRUE, trim = FALSE)
  iconv(xmlValue(xmlRoot(doc)), "", "UTF-8")
}

# Prepare corpus and document-term matrix
corpus <- Corpus(VectorSource(sapply(JSS_papers[, "description"], remove_HTML_markup)))
dtm <- DocumentTermMatrix(corpus, control = list(stemming = TRUE, stopwords = TRUE, minWordLength = 3, removeNumbers = TRUE))
dtm <- removeSparseTerms(dtm, 0.99)
dim(dtm)

# LDA
jss_LDA <- LDA(dtm[1:450,], control = list(alpha = 0.1), k = 10)
post <- posterior(jss_LDA, newdata = dtm[-c(1:450),])
get_terms(jss_LDA, 5)

     Topic 1  Topic 2  Topic 3    Topic 4   Topic 5   Topic 6    Topic 7   Topic 8   Topic 9
[1,] "model"  "use"    "model"    "packag"  "test"    "use"      "packag"  "statist" "packag"
[2,] "use"    "packag" "estim"    "use"     "model"   "packag"   "provid"  "use"     "data"
[3,] "estim"  "method" "packag"   "data"    "use"     "cluster"  "data"    "comput"  "model"
[4,] "can"    "time"   "function" "user"    "statist" "data"     "analysi" "can"     "use"
[5,] "packag" "provid" "use"      "graphic" "method"  "function" "statist" "calcul"  "function"
     Topic 10
[1,] "use"
[2,] "data"
[3,] "program"
[4,] "can"
[5,] "factor"

#topic modeling #LDA #R

Installing topicmodels R package

Original Link http://theoryno3.blogspot.sg/2010/12/installing-topicmodels-r-package.html

It has been quite annoying trying to install the "topicmodels" package for R. But here's a rundown, in case others out there encounter the following error message when installing the package into a non-standard location.

ctm.c:29:25: error: gsl/gsl_rng.h: No such file or directory
ctm.c:30:28: error: gsl/gsl_vector.h: No such file or directory
ctm.c:31:28: error: gsl/gsl_matrix.h: No such file or directory

First, you will need to install the GNU GSL library. Pick it up from here: ftp://ftp.gnu.org/gnu/gsl/. The typical yum or manual compilation should work just fine. If you're installing this library into a non-standard location, take note of the installation path because you will need that below.

Second, even when passing the "--configure-vars" option to point to the location of the GSL include and shared library folder, R CMD INSTALL will fail. The solution? Here:

1) Download the topicmodels source from here: http://cran.r-project.org/web/packages/topicmodels/index.html

2) Unpack into a working folder.

3) Modify the src/Makevars file to read as follows:

LIB_GSL=/path/to/gsl/installation/
PKG_LIBS=-lgsl -lgslcblas -L${LIB_GSL}/lib
PKG_CPPFLAGS=-I$(LIB_GSL)/include

Of course, modify the value for LIB_GSL according to your installation path.

Now re-compress the folder structure like so: "tar zcvf topicmodels.tar.gz /path/to/working/folder"

Now run "R CMD INSTALL topicmodels.tar.gz".

Note that you will also encounter an issue with the vignette generation, so you'll need to install the OAIHarvester package in an R session: install.packages("OAIHarvester")

#topic modeling #LDA #R

LDA (Latent Dirichlet Allocation) Topic Modeling

Given a document corpus, we are interested in understanding the key topics into which these documents can be classified.

One such algorithm for Topic Modeling is LDA. LDA computes the hidden structure that most likely generated the document corpus.

1. We have a bunch of documents. Each document exhibits a few topics. The document corpus has a fixed number of topics (say 20, 50, or 100). A given document may have, say, 5 topics. Each document has a topic distribution.

2. A topic is a hidden variable. It can be viewed as a distribution over a fixed vocabulary set.

3. Observed and Hidden Variables: Documents are observed. The topic structure - the topics, per-document topic distribution and the per-document per-word topic assignment are LATENT.

4. LDA is a probabilistic algorithm. In generative probabilistic modeling, data is assumed to arise from a generative process with hidden variables. The generative process defines a joint probability distribution over multiple variables, some of which are observed and some of which are latent. This joint distribution is used to compute the conditional distribution of the hidden variables given the observed variables. This conditional distribution is called the posterior distribution.
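The generative process in point 4 can be sketched for a single document in plain Python. The two topics, the 4-word vocabulary, the alpha values, and the function names are all made up for illustration; a real LDA fit (such as the topicmodels package) works in the opposite direction, inferring the hidden structure from the words.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(topics, alpha, n_words, rng):
    """Return (theta, z, words): topic mix, per-word topic assignments, word ids."""
    theta = sample_dirichlet(alpha, rng)                           # latent
    z = [sample_categorical(theta, rng) for _ in range(n_words)]   # latent
    words = [sample_categorical(topics[k], rng) for k in z]        # observed
    return theta, z, words

rng = random.Random(1102)
# Hypothetical topics: each row is a distribution over a 4-word vocabulary.
vocab = ["model", "estim", "packag", "data"]
topics = [[0.6, 0.3, 0.05, 0.05], [0.05, 0.05, 0.5, 0.4]]
theta, z, words = generate_document(topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
print([vocab[w] for w in words])
```

Only the word list is observed. theta and z are the latent quantities whose posterior LDA tries to recover.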

#text classification #lda #topic modeling

Naive Bayes

1. Bayes' theorem: P(A|B) = P(B|A)P(A)/P(B)

P(A) - Prior, P(B|A) - Likelihood, P(B) - Marginal Likelihood

The goal is to compute the posterior probability P(A|B) from the remaining three probabilities.

P(B|A)P(A) = P(A intersection B), the joint probability

P(A intersection B) = P(A)P(B) if A and B are independent

2. NB uses Bayes Th for Classification. It computes probability.

Wrong Assumption:

All features are equally important and INDEPENDENT. Without the Independent assumption, the computation may be infeasible. Assuming independence among features makes the joint prob easy to compute. This is particularly true if the number of features is large.

Not good for datasets with large number of numeric features. Some discretization may be needed. BINNING technique can be useful

Probabilities which are computed are less reliable, but the prediction in terms of classes generally is OKAY.

3. The Laplace estimator: if a feature value has a count of zero (probability 0/n), add a small number (typically 1) to each count so that no probability is exactly zero.
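A minimal sketch of add-one (Laplace) smoothing in Python. The function name smoothed_probabilities and the counts are hypothetical.

```python
def smoothed_probabilities(counts, alpha=1):
    """Add alpha to every count so that no category ends up with probability zero."""
    total = sum(counts.values()) + alpha * len(counts)
    return {value: (count + alpha) / total for value, count in counts.items()}

# Hypothetical counts of a word feature among 20 spam messages.
counts = {"present": 0, "absent": 20}
print(smoothed_probabilities(counts))
```

Without smoothing, a single zero count would zero out the whole product of conditional probabilities for that class.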

#naive bayes #machine learning #bayes theorem

Learning from Data

HW1:

Bins and Marbles

3. We have 2 opaque bags, each containing 2 balls. One bag has 2 black balls and the other has a black ball and a white ball. You pick a bag at random and then pick one of the balls in that bag at random. When you look at the ball, it is black. You now pick the second ball from that same bag. What is the probability that this ball is also black?

A. Bayes' theorem

Assume: Bag1 has 2 black balls, Bag2 has 1 white and 1 black ball.

The second ball is black only if we picked Bag1, so we want the probability that we took Bag1 given that the first ball is black.

P(bag=1|firstball=black)

= P(firstball=black|bag=1)P(bag=1) / P(firstball=black)

= 1*0.5 / (P(firstball=black|bag=1)P(bag=1) + P(firstball=black|bag=2)P(bag=2))

= 0.5 / (1*0.5 + 0.5*0.5) = 0.5/0.75 = 2/3
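The same Bayes computation can be checked with exact fractions in Python. The dictionary names are just labels chosen for this sketch.

```python
from fractions import Fraction

# Prior over bags and the chance of drawing black first from each bag.
prior = {"bag1": Fraction(1, 2), "bag2": Fraction(1, 2)}
p_black_first = {"bag1": Fraction(1), "bag2": Fraction(1, 2)}

# Marginal probability that the first ball is black.
evidence = sum(prior[b] * p_black_first[b] for b in prior)

# Posterior probability of bag1, which is also P(second ball is black).
posterior_bag1 = prior["bag1"] * p_black_first["bag1"] / evidence
print(posterior_bag1)  # 2/3
```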

________________________________________________________

Consider a sample of 10 marbles drawn from a bin that has red and green marbles. The probability that any marble we draw is red is μ = 0.55 (independently, with replacement). We address the probability of getting no red marbles (ν = 0) in the following cases:

4. We draw only one such sample. Compute the probability that ν = 0. The closest answer is (closest is the answer that makes the expression |your answer − given option| closest to 0):

0.0003405 (binomial with nCr, n = r = 10 green marbles; the answer is (1-μ)^10)

5. We draw 1,000 independent samples. Compute the probability that (at least) one of the samples has ν = 0. The closest answer is:

Prob(at least one sample has ν=0) = 1 - Prob(no sample has ν=0)

= 1 - (1-0.0003405)^1000 = 0.289
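Both answers can be reproduced with a few lines of Python. The variable names are chosen here for readability.

```python
mu = 0.55                      # probability that a single marble is red
sample_size = 10
n_samples = 1000

# Question 4: all 10 marbles in one sample are green.
p_no_red = (1 - mu) ** sample_size

# Question 5: complement of "no sample out of 1,000 is all green".
p_at_least_one = 1 - (1 - p_no_red) ** n_samples

print(f"{p_no_red:.7f} {p_at_least_one:.3f}")
```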

#Learning from Data machine_learning

R : SVM ROC/AUC and Splitting Data Into Training and Testing Data

# Function begins
# Split the data set into a training set and a test set
split.data <- function(data, p = 0.7, s = 666) {
  set.seed(s)
  index <- sample(1:dim(data)[1])
  n.train <- floor(dim(data)[1] * p)
  train <- data[index[1:n.train], ]
  test <- data[index[(n.train + 1):dim(data)[1]], ]
  return(list(train = train, test = test))
}
# Function ends

The function takes a matrix and splits it into two parts, train and test.

# dati
dati = split.data(magic04, p = 0.7)
train <- dati$train
test <- dati$test
# str(train)
# str(test)

# SVM training
library(e1071)
model <- svm(train[,1:10], train[,11], probability = T)

# prediction on the test set
pred <- predict(model, test[,1:(dim(test)[[2]]-1)], probability = T)

# Check the predictions
table(pred, test[,dim(test)[2]])
pred.prob <- attr(pred, "probabilities")
pred.to.roc <- pred.prob[, 1]

# performance assessment
library(ROCR)
pred.rocr <- prediction(pred.to.roc, as.factor(test[,(dim(test)[[2]])]))
perf.rocr <- performance(pred.rocr, measure = "auc", x.measure = "cutoff")
cat("AUC =", deparse(as.numeric(perf.rocr@y.values)), "\n")
perf.tpr.rocr <- performance(pred.rocr, "tpr", "fpr")
plot(perf.tpr.rocr, colorize = T)

AUC = 0.914772127332079
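For reference, AUC is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. Here is a small Python sketch of that definition. The function name auc and the scores are invented for the example and are unrelated to the magic04 result above.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    positives = [s for s, t in zip(scores, labels) if t == 1]
    negatives = [s for s, t in zip(scores, labels) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positives) * len(negatives))

# Hypothetical classifier scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
print(auc(scores, labels))  # 14 of 16 positive/negative pairs are ordered correctly: 0.875
```

The pairwise loop is quadratic, which is fine for a sketch. Libraries such as ROCR compute the same quantity from the sorted scores.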