Supervised Learning Algorithms Done Blind
Have you ever had to analyze a data set without knowing what the features were called or what the data represented? This was a first for me. Luckily, this data challenge came with guidelines. It was a fun exercise, and I was rewarded when the description of the data set was revealed and I could infer what my results really meant!
The Problem
Given an X and Y training data set (first hint: Supervised Learning) of over 4000 instances and 600+ features and an `Xtest` data set of more than 2600 instances, I was to provide accuracy estimates for a classification model of 22 classes, and apply the model to the test data set.
Steps of the Analysis
I first needed to determine which classification algorithm to use. Since I decided to perform analyses in Python, I referred to the Scikit-learn flowchart (shown below) to determine which algorithm to dive into first.
With plenty of observations and the need to predict categories for a labelled data set, the chart suggested applying linear Support Vector Classification (linear SVC). This method works best when all features are on the same scale; inspecting the minimum and maximum of every feature indicated that they were.
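    import pandas as pd

    # xTrain is the training feature DataFrame loaded earlier in the full script
    xTrain.describe()  # per-feature count, mean, std, min, quartiles, max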
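With 600+ features, the `describe()` output is too wide to read at a glance. A minimal sketch of one way to condense the same check into a per-feature range table follows. The data frame `x_demo` is a made-up stand-in, not the challenge data, and the `span` column is only an illustrative helper.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training features: 100 rows, 5 columns in [-1, 1).
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.uniform(-1.0, 1.0, size=(100, 5)))

# One row per feature, with its minimum and maximum.
ranges = x_demo.agg(["min", "max"]).T
ranges["span"] = ranges["max"] - ranges["min"]

# Overall bounds across every feature; similar spans suggest a shared scale.
overall_min = ranges["min"].min()
overall_max = ranges["max"].max()

print(ranges)
print("overall range:", overall_min, "to", overall_max)
```

If the spans differed by orders of magnitude, rescaling the features before fitting an SVM would be the usual next step.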
Next, I randomly split the training data set into x and y training and validation subsets, holding out 33% of the instances so the model could be validated on data it had not seen.
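    import numpy as np
    # train_test_split lives in sklearn.model_selection in current scikit-learn;
    # the older sklearn.cross_validation module was removed in version 0.20
    from sklearn.model_selection import train_test_split

    x_train, x_val, y_train, y_val = train_test_split(
        xTrain, yTrain, test_size=0.33, random_state=42)

    # flatten the label columns into 1-D arrays, as scikit-learn expects
    y_train = np.ravel(y_train)
    y_val = np.ravel(y_val)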
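A single random split gives one accuracy estimate. As a point of comparison, here is a hedged sketch of k-fold cross-validation, which averages over several splits. This is not what the original script did; the synthetic data from `make_classification`, the `_demo` names, and the choice of 5 folds are all assumptions made for illustration.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 300 samples, 20 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=3, random_state=0)

# Fit and score a linear-kernel SVC on 5 different train/validation folds.
scores = cross_val_score(svm.SVC(kernel="linear"), X_demo, y_demo, cv=5)

print("fold accuracies:", scores)
print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

The spread across folds gives a rough idea of how much a single hold-out estimate might vary.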
I then trained the model by fitting scikit-learn's linear SVM to the randomly sampled x and y training subset, and used the fitted model to predict the classes of the x validation subset.
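The snippets in this post use a fitted classifier named `clf`, but the line that builds it is not shown. A minimal sketch of what that training step could look like is given here, assuming a linear-kernel `svm.SVC` (the full script may have used `svm.LinearSVC` or different parameters). The data is synthetic and every `_demo` name is hypothetical.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 300 samples, 20 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=3, random_state=42)

x_train_demo, x_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Linear-kernel support vector classifier, fitted on the training subset only.
clf_demo = svm.SVC(kernel="linear")
clf_demo.fit(x_train_demo, y_train_demo)

# Predict the held-out subset and report mean accuracy.
preds_demo = clf_demo.predict(x_val_demo)
score_demo = clf_demo.score(x_val_demo, y_val_demo)
print("validation accuracy: %.2f" % score_demo)
```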
    # clf is the linear SVC fitted on x_train and y_train
    preds = clf.predict(x_val)
    score = clf.score(x_val, y_val)  # mean accuracy

    # manual accuracy check
    correct = 0
    for i in range(len(y_val)):
        if y_val[i] == preds[i]:
            correct += 1
    acc = correct / float(len(y_val)) * 100.0  # percent correct
I then calculated the model's precision, recall, F-score and support by comparing the predicted response to the y validation response variable. A quick description of these metrics is given below in terms of True Positives (TP), False Positives (FP) and False Negatives (FN):
precision = TP/(TP + FP). Note that this is not the same as accuracy, which is the fraction of all predictions that are correct.
recall = TP/(TP + FN)
F-score: weighted harmonic mean of precision and recall. Values range from 0 to 1, where 1 indicates perfect precision and recall. The default F1 score gives the two equal weight.
support: the number of occurrences in each class
    from sklearn.metrics import classification_report

    # compare the validation labels with the predictions made on x_val
    print(classification_report(y_val, preds))
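To make the TP/FP/FN definitions concrete, here is a small worked example on made-up binary labels (not the challenge data). It computes precision, recall and F1 by hand, then checks them against scikit-learn's `precision_recall_fscore_support`.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy labels, invented for illustration; class 1 is treated as "positive".
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # predicted 1, truly 1
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted 1, truly 0
fn = sum(t == 1 and p == 0 for t, p in pairs)  # predicted 0, truly 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# scikit-learn's values for class 1 only; support is the count of true 1s.
p_sk, r_sk, f_sk, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[1])

print(precision, recall, f1)
print(p_sk[0], r_sk[0], f_sk[0], support[0])
```

In a multi-class report, the same calculation is repeated once per class, treating that class as positive and all others as negative.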
As shown in the table below, per-class scores ranged from 0.72 to 1.0. A manual check of mean accuracy (correct predictions/total number of predictions) agreed exactly with the model's reported score (94%). Another point to note is the discrepancy between precision and recall for the 13th and 14th classes.
Results were also visualized in a confusion matrix, as shown below. As you can see more clearly in this figure than in the table above, the 13th and 14th classes (12th and 13th if you count from zero!) were not classified as well as the others. I wonder why that is?
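A confusion matrix can also be queried directly to find which pair of classes is mixed up most often. The sketch below does this on made-up labels in which classes 1 and 2 are sometimes swapped; it is an illustration of the idea, not the code used to produce the figure.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration: classes 1 and 2 get confused.
y_true = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2, 2, 1, 2, 3, 3, 3]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Zero the diagonal so that only misclassifications remain.
errors = cm.copy()
np.fill_diagonal(errors, 0)

# Location of the largest off-diagonal count.
true_cls, pred_cls = np.unravel_index(np.argmax(errors), errors.shape)
true_cls, pred_cls = int(true_cls), int(pred_cls)

print(cm)
print("most common error: true class", true_cls, "predicted as", pred_cls)
```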
Out of curiosity, I compared the prediction accuracy of other algorithms (CART, K-nearest neighbors, and SVC with polynomial and radial basis function (RBF) kernels) to that of linear SVC. Linear SVC surpassed the accuracy of the decision tree (90%), K-nearest neighbors (87%) and polynomial SVC (90%). RBF SVC did as well as linear SVC. However, I would argue that the linear model is the most parsimonious.
    from sklearn import svm
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    # random forest (fitted in the script, but not part of the comparison above)
    rf = RandomForestClassifier(n_estimators=50, oob_score=True)
    rf.fit(x_train, y_train)

    # CART decision tree
    model = DecisionTreeClassifier()
    model.fit(x_train, y_train)
    predicted = model.predict(x_val)
    cartreport = classification_report(y_val, predicted)

    # K-nearest neighbors
    knn = KNeighborsClassifier()
    knn.fit(x_train, y_train)
    knnpreds = knn.predict(x_val)
    knnscore = knn.score(x_val, y_val)
    knnreport = classification_report(y_val, knnpreds)

    # SVC with a degree-3 polynomial kernel
    poly3svc = svm.SVC(kernel='poly', degree=3)
    poly3svc.fit(x_train, y_train)
    poly3svcPreds = poly3svc.predict(x_val)
    poly3svcScore = poly3svc.score(x_val, y_val)

    # SVC with an RBF kernel
    rbfsvc = svm.SVC(kernel='rbf')
    rbfsvc.fit(x_train, y_train)
    rbfsvcPreds = rbfsvc.predict(x_val)
    rbfsvcScore = rbfsvc.score(x_val, y_val)
Finally, the model was applied to the X test data set, and the predictions (row numbers and predicted responses) were submitted to the judge.
predTest = clf.predict(xTest)
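The exact submission format is not given here, so the following is only a sketch of how row numbers and predicted responses could be paired up and serialized as CSV. The column names, the 1-based row numbering and the prediction values are all assumptions made for illustration.

```python
import io

import numpy as np
import pandas as pd

# Made-up predictions standing in for the output of clf.predict(xTest).
pred_demo = np.array([3, 0, 12, 13, 7])

# Pair each prediction with its row number (1-based here, as an assumption).
submission = pd.DataFrame({
    "row": np.arange(1, len(pred_demo) + 1),
    "prediction": pred_demo,
})

# Write to an in-memory buffer; a file path could be passed instead.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```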
Results and Inferences
The accuracy on the test set matched the accuracy estimated on the validation set (94% - yes!!). And the meaning of the results became clear when I was told what the data represented: it was a digital recording of people reciting the first 22 letters of the alphabet! Can you guess why classes 13 and 14 were poorly classified? Well, it turns out they represent the letters m and n, which are easily mistaken for one another when spoken or recorded.
This was a fun exercise: going into a data set blindly, classifying it and testing the accuracy of the model, and then being rewarded with its meaning post hoc!
The code snippets above were extracted from the full script which can be found on my GitHub repo.












