Week 2: Peer-graded Assignment: Running a Random Forest
This assignment is for the Coursera course "Machine Learning for Data Analysis" by Wesleyan University.
It is for "Week 2: Peer-graded Assignment: Running a Random Forest".
I am working on Random Forest in Python.
1) Syntax used to run Random Forest
My binary response (target) variable is Personal Income (0 = low income, <= $23,000; 1 = high income, > $23,000 and <= $100,000). My explanatory (predictor) variables are Major depression (0 = no, 1 = yes), Gender (0 = female, 1 = male) and Dysthymia (0 = no, 1 = yes).
The training sample has 306 observations (rows), 60% of the original sample, and 3 explanatory variables (columns). The test sample has 205 observations, 40% of the original sample, with the same 3 explanatory variables.
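As a quick check of these sample sizes, the sketch below reproduces the 306/205 split from the 511 total observations. It assumes the test share is rounded up to a whole row and the training set takes the remainder, which is how I understand train_test_split to treat a fractional test_size; the variable names are only illustrative.

```python
import math

# Total observations implied by the two samples reported above
n_total = 306 + 205          # 511 rows
test_size = 0.4              # share held out for testing

# Assumption: the test share is rounded up, training gets the remainder
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test

print(n_train, n_test)
```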
2) Code used to run Random Forest
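# pandas, statistics and sklearn.metrics are used below (pd, statistics.mean, sklearn.metrics.*)
import statistics
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pylab as plt
import sklearn.metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier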
# bug fix for display formats to avoid run time errors
pd.set_option('display.float_format', lambda x:'%.2f'%x)
data = pd.read_csv("NESARC_Data_Set.csv", low_memory=False)
# convert values to numeric
data['MAJORDEPLIFE'] = pd.to_numeric(data['MAJORDEPLIFE'], errors='coerce')
data['S1Q10A'] = pd.to_numeric(data['S1Q10A'], errors='coerce')
data['GENDER'] = pd.to_numeric(data['SEX'], errors='coerce')
data['DYSLIFE'] = pd.to_numeric(data['DYSLIFE'], errors='coerce')
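# subset data to ages 18-35 with personal income from $0 to $100,000
sub1 = data[(data['AGE'] >= 18) & (data['AGE'] <= 35) & (data['S1Q10A'] >= 0) & (data['S1Q10A'] <= 100000)]
# work on a copy of the subset (this step is missing from the listing; B1 is assumed to be a copy of sub1)
B1 = sub1.copy()
# recode personal income into the binary target: 0 = low (<= $23,000), 1 = high (> $23,000 and <= $100,000)
# (the first branch is reconstructed from the variable description in section 1)
def INCOME(row):
    if row['S1Q10A'] <= 23000:
        return 0
    elif row['S1Q10A'] <= 100000:
        return 1
B1['INCOME'] = B1.apply(lambda row: INCOME(row), axis=1)
# recode explanatory variables to include 0
# (recode2 is not shown in the listing; assumed mapping from the NESARC SEX codes 1 = male, 2 = female)
recode2 = {1: 1, 2: 0}
B1['GENDER'] = B1['SEX'].map(recode2)
# convert INCOME and the predictors to numeric, using the recoded columns in B1
B1['INCOME'] = pd.to_numeric(B1['INCOME'], errors='coerce')
B1['GENDER'] = pd.to_numeric(B1['GENDER'], errors='coerce')
B1['DYSLIFE'] = pd.to_numeric(B1['DYSLIFE'], errors='coerce')
# drop rows with missing values (data_clean is not defined in the listing; assumed to be the model columns without NaNs)
data_clean = B1[['MAJORDEPLIFE', 'GENDER', 'DYSLIFE', 'INCOME']].dropna()
# split into training and testing sets
predictors = data_clean[['MAJORDEPLIFE', 'GENDER', 'DYSLIFE']]
targets = data_clean.INCOME
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)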
#Build model on training data
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=25)
classifier=classifier.fit(pred_train,tar_train)
predictions=classifier.predict(pred_test)
print(sklearn.metrics.confusion_matrix(tar_test,predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train,tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)
# Run the forest with different numbers of trees and see the effect
# of that on the accuracy of the prediction
# (trees and accuracy are not defined in the listing; 1 to 25 trees is assumed from the 25-tree forest above)
trees = list(range(1, 26))
accuracy = [0.0] * len(trees)
for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=trees[idx])
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)
plt.plot(trees, accuracy)
print(statistics.mean(accuracy))
Accuracy score vs Number of trees
As we can see, every forest size has the same accuracy of about 89.8%, so adding more trees does not change the prediction accuracy here.
The variables are listed in the order they were named earlier in the code: Major depression, Gender, and Dysthymia. As we can see, the variable with the highest importance score is Gender at 0.499, and the variable with the lowest importance score is Dysthymia at 0.08.
Confusion Matrix: The diagonal values, 184 and 0, are the number of true negatives (low income classified as low income) and the number of true positives (high income classified as high income), respectively. The 21 on the bottom left is the number of false negatives: high income classified as low income. The 0 on the top right is the number of false positives: low income classified as high income, of which there are none in our case.
In my confusion matrix, the model built on the training data incorrectly classified 21 of the 205 observations in the test sample, meaning that it misclassified about 10% of the observations in the test data set.
21 + 0 = 21 incorrectly classified
Test error rate = % misclassified = 21/205 = 10%
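Accuracy Score: It is approximately 0.8976, which means that the random forest model classified 89.76% of the test sample correctly as either low or high income. Because the model never predicts high income, this equals the share of low-income observations in the test sample (184/205).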
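The error rate and accuracy above follow directly from the four cells of the confusion matrix. Here is a minimal sketch of that arithmetic, with the counts typed in by hand from my output (rows are the actual class, columns the predicted class); the variable names are only illustrative.

```python
# Confusion matrix from the test sample: rows = actual, columns = predicted
# (0 = low income, 1 = high income); counts copied from the output above
confusion = [[184, 0],
             [21, 0]]

tn, fp = confusion[0]
fn, tp = confusion[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total        # share classified correctly
error_rate = (fp + fn) / total      # share misclassified
# Accuracy of always predicting the larger actual class (low income)
baseline = max(tn + fp, fn + tp) / total

print(f"accuracy   = {accuracy:.4f}")
print(f"error rate = {error_rate:.4f}")
print(f"baseline   = {baseline:.4f}")
```

Comparing the accuracy with the majority-class baseline is a useful check: here the two values coincide, which is what we would expect from a model that predicts low income for every test row.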
Given that we don't interpret individual trees in a random forest, the most helpful information to be gotten from a forest is arguably the measured importance of each explanatory variable (also called a feature), which is based on how many votes or splits each has produced in the 25-tree ensemble.
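To make the importance scores easier to read, each score can be paired with its variable name. The sketch below uses synthetic 0/1 data rather than the NESARC file, so the numbers it prints are not my results. The column names match my predictors, but the data, the 10% label noise and the random_state values are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

names = ['MAJORDEPLIFE', 'GENDER', 'DYSLIFE']

# Synthetic binary predictors (NOT the NESARC data)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 3))
# Synthetic target: follows the GENDER column, with about 10% of labels flipped
flip = rng.random(500) < 0.10
y = np.where(flip, 1 - X[:, 1], X[:, 1])

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# Pair each importance score with its variable name, highest first
ranked = sorted(zip(names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Printing the names next to the scores avoids having to remember the column order of the predictors.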