Discover Top Posts Tagged with #assignment-1

Decision Tree

Decision trees are predictive models that explore nonlinear relationships and interactions among explanatory variables. When the response variable is categorical, the model is called a classification tree. Decision trees create segmentations by repeatedly applying a series of rules to choose the variable sets that best predict the response variable.

My data set does not have categorical response or explanatory variables, so I created some for this exercise. High CO2 emissions are defined as 30E9 metric tons or more.
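Turning a quantitative variable into a binary one, as described above, can be sketched as follows. This is a minimal illustration only: the function name and the sample values are made up for the example and are not taken from the original data set; only the 30E9 threshold comes from the text.

```python
# Threshold from the text: 30E9 metric tons or more counts as "high".
HIGH_CO2_THRESHOLD = 30e9


def is_high_co2(emissions_metric_tons):
    """Return 1 for emissions at or above the threshold, otherwise 0."""
    return 1 if emissions_metric_tons >= HIGH_CO2_THRESHOLD else 0


# Hypothetical emission values, invented purely for illustration.
sample_emissions = [1.2e9, 30e9, 4.5e10, 2.9e10]
high_co2_flags = [is_high_co2(value) for value in sample_emissions]
print(high_co2_flags)
```

The resulting 0/1 column can then serve as the categorical response for a classification tree.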

The generated decision tree can be found below:

Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable. All possible separations (categorical) or cut points (quantitative) are tested.
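To make the idea of testing every cut point concrete, here is a small sketch that scores each candidate threshold of one quantitative variable by weighted Gini impurity (the default split criterion in scikit-learn). The helper names and the toy GPA and smoker values are assumptions for illustration; a real tree repeats this search over every variable and every node.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_cut_point(values, labels):
    """Try each midpoint between sorted distinct values.

    Returns (cut, weighted_impurity) for the split with the lowest impurity.
    """
    distinct = sorted(set(values))
    best = (None, float("inf"))
    for low, high in zip(distinct, distinct[1:]):
        cut = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= cut]
        right = [y for x, y in zip(values, labels) if x > cut]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (cut, weighted)
    return best


# Toy data, invented for illustration: lower GPA values are labeled as smokers.
toy_gpa = [2.0, 2.5, 3.0, 3.5, 4.0]
toy_smoker = [1, 1, 0, 0, 0]
cut, impurity = best_cut_point(toy_gpa, toy_smoker)
print(cut, impurity)
```

On this toy data the split between 2.5 and 3.0 separates the two classes perfectly, so its weighted impurity is zero.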

This decision tree uses the following variables to predict the output variable (TREG1), whether a person is a smoker or not:

BIO_SEX – categorical – gender

GPA1 – numeric – current GPA

ALCEVR1 – binary – alcohol use

WHITE – binary – whether participant is white

BLACK – binary – whether participant is black

To train the decision tree, I split the given dataset into training and test sets in a 70/30 proportion.
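What a 70/30 split does can be sketched in plain Python. This is not how scikit-learn implements it; the function name and the ten placeholder rows are assumptions for the example. The point it shows is that the rows are shuffled before being cut, and that a fixed seed makes the split repeatable.

```python
import random


def split_70_30(rows, seed):
    """Shuffle a copy of rows with a fixed seed, then cut at 70 percent."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder row ids stand in for the real observations.
rows = list(range(10))
train_rows, test_rows = split_70_30(rows, seed=55324)
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one of the two sets, so the model is evaluated only on observations it did not see during training.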

From the decision tree we can observe:

Participants who used alcohol were more likely to be smokers (up to 5 times more smokers among those who used alcohol).

Most smokers are white.

People with a lower GPA are more likely to be regular smokers.

Source code

import pandas as pd
import sklearn.metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from io import StringIO
from IPython.display import Image
import pydotplus

# Fixed seed so the fitted tree is reproducible
RND_STATE = 55324

# Load the data and drop rows with missing values
AH_data = pd.read_csv("data/tree_addhealth.csv")
data_clean = AH_data.dropna()
data_clean.dtypes
data_clean.describe()

# Explanatory variables
predictors = data_clean[['BIO_SEX', 'GPA1', 'ALCEVR1', 'WHITE', 'BLACK']]

# Binary response: regular smoker or not
targets = data_clean.TREG1

# 70/30 train/test split
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=0.3)

# Fit the classification tree and predict on the test set
classifier = DecisionTreeClassifier(random_state=RND_STATE)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)

print("Confusion matrix:\n", sklearn.metrics.confusion_matrix(tar_test, predictions))
print("Accuracy: ", sklearn.metrics.accuracy_score(tar_test, predictions))

# Export the top levels of the tree to Graphviz and render it as a PNG
out = StringIO()
tree.export_graphviz(classifier, out_file=out, feature_names=["sex", "gpa", "alcohol", "white", "black"], proportion=True, filled=True, max_depth=4)
graph = pydotplus.graph_from_dot_data(out.getvalue())
img = Image(data=graph.create_png())
img

# Save the rendered tree to disk
with open("output" + ".png", "wb") as f:
    f.write(img.data)
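The confusion matrix and the accuracy printed by the script are closely related: accuracy is the share of test cases on the matrix diagonal. The sketch below rebuilds both by hand for binary 0/1 labels, using the same layout as scikit-learn (rows are actual classes, columns are predicted classes). The function names and the toy labels are assumptions for illustration, not output from the model above.

```python
def confusion_counts(actual, predicted):
    """Return [[TN, FP], [FN, TP]] for binary 0/1 labels."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels, invented for illustration.
toy_actual = [0, 0, 0, 1, 1, 0, 1, 0]
toy_predicted = [0, 0, 1, 1, 0, 0, 1, 0]
toy_matrix = confusion_counts(toy_actual, toy_predicted)
toy_accuracy = accuracy_from_matrix(toy_matrix)
print(toy_matrix, toy_accuracy)
```

Reading the off-diagonal cells separately shows whether the errors are mostly false positives or false negatives, which a single accuracy number hides.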

#assignment-1

Design Statement

Assignment 1:

My original intent was to use the Last.fm API to explore the data I had been tracking for many years about my song listening habits. Unfortunately, partway through the writing process I found that their API does not track the date that songs are played, so I could not create the type of visualization I initially intended to make. Instead, I shifted my idea to focus on music exploration and on investigating the relationship between how often a song is played and how recently it was added. The end result came out rather visually interesting. Using SVG gradients, I manipulated the color of each individual bar of the graph based on a time vs. play count algorithm that I wrote. Tracks that are listened to often appear in a hot pink color, while "cooler", older tracks appear in blue, with a wide range of tracks in between. I hoped to add more functionality through sorting and alternate data sets, but I ran out of time after struggling with D3 for a few weeks. In the end I'm satisfied with what I created, and I hope to extend it further in the future.
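The time vs. play count algorithm itself is not shown in this post, so the following is only one possible sketch of the idea, written in Python rather than D3. The scoring formula (plays per day since the track was added), the function names, and the two sample tracks are all assumptions; only the hot pink to blue color range comes from the description above.

```python
HOT_PINK = (255, 105, 180)  # CSS "hotpink"
COOL_BLUE = (0, 0, 255)     # CSS "blue"


def heat_score(play_count, days_since_added):
    """Hypothetical score: plays per day since the track was added."""
    return play_count / (days_since_added + 1)


def blend(score, max_score):
    """Linearly interpolate from blue (score 0) to hot pink (max_score)."""
    t = min(max(score / max_score, 0.0), 1.0) if max_score > 0 else 0.0
    channels = [round(c + (h - c) * t) for c, h in zip(COOL_BLUE, HOT_PINK)]
    return "#{:02x}{:02x}{:02x}".format(*channels)


# Made-up tracks: (title, play count, days since added).
tracks = [("new favorite", 120, 29), ("old track", 3, 899)]
scores = [heat_score(plays, days) for _, plays, days in tracks]
colors = [blend(score, max(scores)) for score in scores]
print(colors)
```

Each resulting hex string could be used as the fill or gradient stop for the corresponding bar.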

All data and images were self-created.

Project Repo: https://github.com/iamnbutler/music-stats

Live Example: http://nate.jp/music-stats/

#DATT3935 #assignment-1

