Decision Tree
Decision trees are predictive models that explore nonlinear relationships and interactions among explanatory variables. When the response variable is categorical, the model is called a classification tree. Decision trees create segmentations by repeatedly applying a series of rules to choose the variable sets that best predict the response variable.
My data set does not have categorical response or explanatory variables, so I created some for this exercise. High CO2 emissions are defined as 30E9 metric tons or more.
The generated decision tree can be found below:
Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable. All possible separations (categorical) or cut points (quantitative) are tested.
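To make the idea of testing every cut point concrete, here is a minimal sketch in plain Python. It assumes Gini impurity as the split criterion (the default for scikit-learn's DecisionTreeClassifier) and midpoints between neighbouring values as candidate cuts. The helper names gini and best_cut_point and the toy data are hypothetical, invented for illustration; this is not the library's actual implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_cut_point(values, labels):
    """Try the midpoint between each pair of neighbouring distinct values.

    Returns (cut, weighted_impurity) for the cut with the lowest weighted
    impurity, or (None, impurity_of_unsplit_node) if no cut improves on it.
    """
    distinct = sorted(set(values))
    best = (None, gini(labels))
    for low, high in zip(distinct, distinct[1:]):
        cut = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= cut]
        right = [y for x, y in zip(values, labels) if x > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (cut, score)
    return best


# Toy data, invented for illustration only (not taken from the data set)
gpa = [2.0, 2.5, 3.0, 3.5, 4.0]
smoker = [1, 1, 0, 0, 0]
print(best_cut_point(gpa, smoker))
```

A real tree repeats this search over every explanatory variable at every node and keeps the best split overall.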
This decision tree uses the following variables to predict the output variable (TREG1), which indicates whether a person is a smoker or not:
BIO_SEX – categorical – gender
GPA1 – numeric – current GPA
ALCEVR1 – binary – alcohol use
WHITE – binary – whether participant is white
BLACK – binary – whether participant is black
To train the decision tree, I split the dataset into training and test sets in a 70/30 proportion.
From the decision tree we can observe:
Participants who used alcohol were more likely to be smokers (up to 5 times more smokers among those who used alcohol).
Most smokers are white.
People with a lower GPA are more likely to be regular smokers.
Source code
import pandas as pd
import sklearn.metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from io import StringIO
from IPython.display import Image
import pydotplus

RND_STATE = 55324

# Load the data and drop rows with missing values
AH_data = pd.read_csv("data/tree_addhealth.csv")
data_clean = AH_data.dropna()
data_clean.dtypes
data_clean.describe()

# Explanatory variables and the response variable
predictors = data_clean[['BIO_SEX', 'GPA1', 'ALCEVR1', 'WHITE', 'BLACK']]
targets = data_clean.TREG1

# Split into training (70%) and test (30%) sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=0.3)

# Fit the classifier on the training set and predict on the test set
classifier = DecisionTreeClassifier(random_state=RND_STATE)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)

print("Confusion matrix:\n", sklearn.metrics.confusion_matrix(tar_test, predictions))
print("Accuracy: ", sklearn.metrics.accuracy_score(tar_test, predictions))

# Export the first levels of the tree to Graphviz and render it as a PNG
out = StringIO()
tree.export_graphviz(classifier, out_file=out,
                     feature_names=["sex", "gpa", "alcohol", "white", "black"],
                     proportion=True, filled=True, max_depth=4)
graph = pydotplus.graph_from_dot_data(out.getvalue())
img = Image(data=graph.create_png())
img

# Save the rendered tree to disk
with open("output.png", "wb") as f:
    f.write(img.data)
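The confusion matrix and accuracy printed by the code above are related in a simple way, which the sketch below shows in plain Python. It assumes binary labels coded as 0 and 1 and the same layout scikit-learn uses (rows are actual classes, columns are predicted classes). The helper names and the label lists are hypothetical, invented for illustration; they are not results from the data set.

```python
def confusion_matrix_2x2(actual, predicted):
    """Count (actual, predicted) pairs; rows are actual, columns are predicted."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Accuracy is the diagonal (correct predictions) divided by the total count."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


# Invented labels for illustration only (1 = smoker, 0 = non-smoker)
actual = [0, 0, 0, 1, 1, 0, 1, 0]
predicted = [0, 0, 1, 1, 0, 0, 1, 0]

matrix = confusion_matrix_2x2(actual, predicted)
print("Confusion matrix:", matrix)
print("Accuracy:", accuracy_from_matrix(matrix))
```

Reading the off-diagonal cells separately is useful here, because a tree can reach a high accuracy on an imbalanced response while still missing most of the smokers.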
















