Decision Tree
This is my first post on this Tumblr after a very long break. I am writing it because I enrolled in the Machine Learning for Data Analysis class on Coursera, and the assignments for this class have to be posted on Tumblr. So here is my assignment. (Sorry for my bad English; this is the first time I have posted in English.)
This week's assignment is to run a decision tree. We studied two tools this week, SAS and Python. I already installed SAS (it is free, yeah, because I use the University Edition), but it confused me a lot. I have not had much time to use it because the deadline is very short (I am also enrolled in 3 classes that have the same deadline, so it is very hard). So I chose Python to run the decision tree. I prefer Enthought for running Python because I am already familiar with it. Here is the code that I ran. It is basically the same code we got in the course, with a few small changes that come from the course discussion forum.
# -*- coding: utf-8 -*-
"""
Created on Sun Dec 13 21:12:54 2015
@author: ldierker
"""
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
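try:
    # scikit-learn 0.18 and later
    from sklearn.model_selection import train_test_split
except ImportError:
    # older scikit-learn versions, as used in the course
    from sklearn.cross_validation import train_test_split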
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
"""
Data Engineering and Analysis
"""
#Load the dataset
AH_data = pd.read_csv("tree_addhealth.csv")
data_clean = AH_data.dropna()
# print() so the output also shows when the file is run as a script
print(data_clean.dtypes)
print(data_clean.describe())
"""
Modeling and Prediction
"""
#Split into training and testing sets
predictors = data_clean[['BIO_SEX','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN','age','ALCEVR1','ALCPROBS1','marever1','cocever1','inhever1','cigavail','DEP1',
'ESTEEM1','VIOL1','PASSIST','DEVIANT1','SCHCONN1','GPA1','EXPEL1','FAMCONCT','PARACTV','PARPRES']]
targets = data_clean.TREG1
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
# Check the sizes of the training and test sets
print(pred_train.shape)
print(pred_test.shape)
print(tar_train.shape)
print(tar_test.shape)
#Build model on training data
classifier=DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)
predictions=classifier.predict(pred_test)
# Evaluate the predictions on the test set
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))
#Displaying the decision tree
from sklearn import tree
#from StringIO import StringIO
from io import BytesIO as StringIO
from IPython.display import Image
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
import pydotplus
graph=pydotplus.graph_from_dot_data(out.getvalue())
# Write the rendered tree to a PNG file
with open('picture_out1.png', 'wb') as f:
    f.write(graph.create_png())
Actually, I ran into some trouble running this code on my Mac:
1. KeyError: "['BIO_SEX' 'HISPANIC' 'WHITE' 'BLACK' 'NAMERICAN' 'ASIAN' 'age' 'ALCEVR1'\n 'ALCPROBS1' 'marever1' 'cocever1' 'inhever1' 'cigavail' 'DEP1' 'ESTEEM1'\n 'VIOL1' 'PASSIST' 'DEVIANT1' 'SCHCONN1' 'GPA1' 'EXPEL1' 'FAMCONCT'\n 'PARACTV' 'PARPRES'] not in index"
This error occurred because the file “tree_addhealth.csv” was changed when I downloaded it: there was an additional row above the header row. Just delete that first row and the error disappears.
2. TypeError: unicode argument expected, got 'str'
To fix this one, just change from io import StringIO into from io import BytesIO as StringIO. (This error comes up on Python 2, where io.StringIO only accepts unicode text.)
3. InvocationException: GraphViz's executables not found
This happens because GraphViz was not installed on my Mac. You can follow these steps:
1 log in and download Xcode and the Xcode Command Line Tools from https://developer.apple.com/downloads/
2 install Xcode and the Xcode Command Line Tools
3 agree to the Xcode license in the terminal:
sudo xcodebuild -license
4 get the MacPorts pkg installer for your version of OS X from https://www.macports.org/install.php#installing
5 install MacPorts for your version of OS X
6 in the terminal:
export PATH=/opt/local/bin:/opt/local/sbin:$PATH
sudo port -v selfupdate
7 install GraphViz via MacPorts. In the terminal:
sudo port install graphviz-gui
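The first and third problems can also be checked from Python before running the whole script. This is only a small sketch, assuming Python 3; the helper names find_header_row and graphviz_installed are made up for this example, and the sample CSV text is invented to imitate a file with one stray line above the header.

```python
import csv
import io
import shutil

# A few of the column names from the post; the full list works the same way.
EXPECTED = ["BIO_SEX", "HISPANIC", "WHITE", "TREG1"]


def find_header_row(lines, expected, max_rows=5):
    """Return the index of the first row that contains every expected
    column name, or None if no such row is found in the first max_rows rows."""
    for i, row in enumerate(csv.reader(lines)):
        if i >= max_rows:
            break
        if set(expected) <= {cell.strip() for cell in row}:
            return i
    return None


def graphviz_installed():
    """True if the GraphViz 'dot' executable can be found on PATH."""
    return shutil.which("dot") is not None


# Invented sample: one stray line sits above the real header.
sample = io.StringIO("some stray line\nBIO_SEX,HISPANIC,WHITE,TREG1\n1,0,1,0\n")
header_row = find_header_row(sample, EXPECTED)
print("header found on row", header_row)
print("GraphViz available:", graphviz_installed())
```

With the real file, you could pass open("tree_addhealth.csv") to find_header_row and then give the result to pandas, for example pd.read_csv("tree_addhealth.csv", skiprows=header_row), instead of deleting the first row by hand.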
After fixing these errors, the program ran. Yeaaah! And the result of this program is:
First, I will tell you about this picture.
Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable. All possible separations (categorical) or cut points (quantitative) are tested.
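To give an idea of what "testing all cut points" means for one quantitative variable, here is a minimal sketch in plain Python. It scores each candidate cut with Gini impurity, which is the default criterion of DecisionTreeClassifier, but it is not how scikit-learn implements the search internally. The ages and labels are invented toy data, and the function names are my own.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_cut_point(values, labels):
    """Try a cut halfway between each pair of adjacent distinct values and
    return (cut, weighted impurity) for the cut with the lowest impurity."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        cut = (lo + hi) / 2
        left = [y for x, y in pairs if x <= cut]
        right = [y for x, y in pairs if x > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (cut, score)
    return best


# Invented toy data: age and a 0/1 response.
ages = [13, 14, 15, 16, 17, 18]
smoker = [0, 0, 0, 1, 1, 1]
cut, impurity = best_cut_point(ages, smoker)
print("best cut:", cut, "weighted Gini:", impurity)
```

In this toy data the cut at 15.5 separates the two groups perfectly, so its weighted impurity is 0. A real tree repeats this search over every predictor at every node.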
The following explanatory variables were included as possible contributors to a classification tree model evaluating smoking experimentation (my response variable): age, gender, race/ethnicity (Hispanic, White, Black, Native American and Asian), alcohol use, marijuana use, cocaine use, inhalant use, availability of cigarettes in the home, whether or not either parent was on public assistance, any experience with being expelled from school, alcohol problems, deviance, violence, depression, self-esteem, parental presence, parental activities, family connectedness, school connectedness and grade point average. In the picture these 24 variables are not shown by name. They are labeled X[0] through X[23], in the same order as the predictor list in the code.
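Because the tree was exported without variable names, each node shows only a position such as X[9]. As far as I understand, export_graphviz numbers the columns from 0 in the order they were passed to the classifier, so a small lookup table (a sketch, using the predictor list from the code above) helps to read the picture:

```python
# Predictor names in the same order as in the script above.
predictor_names = [
    'BIO_SEX', 'HISPANIC', 'WHITE', 'BLACK', 'NAMERICAN', 'ASIAN', 'age',
    'ALCEVR1', 'ALCPROBS1', 'marever1', 'cocever1', 'inhever1', 'cigavail',
    'DEP1', 'ESTEEM1', 'VIOL1', 'PASSIST', 'DEVIANT1', 'SCHCONN1', 'GPA1',
    'EXPEL1', 'FAMCONCT', 'PARACTV', 'PARPRES',
]

# Map the label shown in the picture to the column name.
labels = {"X[%d]" % i: name for i, name in enumerate(predictor_names)}

for label, name in labels.items():
    print(label, "=", name)
```

Another option is to pass feature_names=predictor_names to tree.export_graphviz, so that the names are written in the picture itself.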
This tree is very confusing because it has so many leaves. That is because the tree is not pruned in Python, unlike in SAS. The tree built in Python is very large; its depth is more than 10. OK, let's zoom in on the picture to interpret our decision tree.
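One simple way to get a smaller tree that is easier to read is to limit its depth with the max_depth parameter of DecisionTreeClassifier. This is not the same thing as the pruning SAS does, and it is not what I did for the assignment. The sketch below uses synthetic data from make_classification (not the Add Health data), and the limit of 3 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Add Health data): 500 rows, 24 predictors.
X, y = make_classification(n_samples=500, n_features=24,
                           n_informative=6, random_state=0)

# Tree with no depth limit, like DecisionTreeClassifier() in the script.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Same classifier, but the depth is capped at an arbitrary value of 3.
small_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print("depth without limit:", full_tree.tree_.max_depth)
print("depth with max_depth=3:", small_tree.tree_.max_depth)
print("nodes without limit:", full_tree.tree_.node_count)
print("nodes with max_depth=3:", small_tree.tree_.node_count)
```

A shallower tree is easier to interpret, but it may classify less accurately, so the accuracy on the test set should be checked again after changing the limit.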
In this picture, the decision about which side to take is made on X[9]: if X[9] <= 25 is true, the observation goes to the left side, and if it is false, it goes to the right side. The same thing happens at every node further down the tree, each with its own variable and cut point: if the condition is true, go to the left child; otherwise, go to the right child.
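The left/right rule can be written in a few lines of plain Python. This is a toy sketch: the tree below has only the one split described above, the leaf names are invented, and the real tree has many more levels.

```python
def route(node, sample):
    """Follow the splits from a node down to a leaf and return the leaf label.
    At each split: condition true -> left child, otherwise -> right child."""
    while "leaf" not in node:
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"]


# Toy tree with a single split on X[9] <= 25 (leaf names are made up).
toy_tree = {
    "feature": 9,
    "threshold": 25,
    "left": {"leaf": "left child"},
    "right": {"leaf": "right child"},
}

# Two invented observations with 24 predictor values each.
sample_low = [0] * 24
sample_high = [0] * 24
sample_high[9] = 30

print(route(toy_tree, sample_low))
print(route(toy_tree, sample_high))
```

The first observation has X[9] = 0, so the condition is true and it goes left. The second has X[9] = 30, so it goes right.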









