Run a Classification Tree
Building a Decision Tree Model
In this assignment I explore nonlinear relationships among a series of explanatory variables and a binary, categorical response variable.
Code
import pandas as pd
import numpy as np
from sklearn.metrics import *
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from io import StringIO
from IPython.display import Image
import pydotplus
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt
%matplotlib inline
rnd_state = 23468
Load data
data = pd.read_csv('Data/breast_cancer.csv')
data.info()
The output above shows an empty column, 'Unnamed: 32', which should be dropped before modelling.
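Dropping the empty column could look like the sketch below. It uses a small made-up frame rather than the real CSV: only the name 'Unnamed: 32' comes from the text, and the other columns and values are placeholders, so treat it as an illustration and not as the notebook's exact cell.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loaded CSV. Only 'Unnamed: 32' is taken from the text;
# the other columns and values are hypothetical.
data = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Drop the all-NaN column by name; errors="ignore" makes the cell safe to re-run.
data = data.drop(columns=["Unnamed: 32"], errors="ignore")

print(list(data.columns))
```

On the real frame, calling data.info() again afterwards should confirm that the column is gone.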
Plots
For visualization, the number of dimensions was reduced to two with the t-SNE method. The plot shows that the two classes are not cleanly separated, so a nonlinear method such as a decision tree may be a good fit for this problem.
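A minimal sketch of such a projection is given below. Two things in it are assumptions rather than the original cell: it uses scikit-learn's bundled breast cancer dataset as a stand-in for 'Data/breast_cancer.csv', and it standardizes the features before running t-SNE.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rnd_state = 23468  # same seed value as in the imports cell

# Assumption: the bundled dataset stands in for the CSV used in this post.
X, y = load_breast_cancer(return_X_y=True)

# Assumption: t-SNE works on distances, so features are put on one scale first.
X_scaled = StandardScaler().fit_transform(X)

# Project the 30 features down to 2 dimensions.
embedding = TSNE(n_components=2, random_state=rnd_state).fit_transform(X_scaled)

print(embedding.shape)
```

A scatter plot of embedding[:, 0] against embedding[:, 1], colored by y, gives the kind of picture described above.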
Decision tree
Confusion matrix:
Actual       0   1  All
Predicted
0           96   8  104
1            5  62   67
All        101  70  171
Accuracy: 0.9239766081871345
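As a quick check, the accuracy can be recomputed by hand from the confusion matrix above: it is the number of correct predictions (the diagonal) divided by the total number of test samples. The variable names in this sketch are illustrative.

```python
# Confusion matrix from above: rows are predicted classes, columns are actual classes.
cm = [
    [96, 8],   # predicted 0
    [5, 62],   # predicted 1
]

# Correct predictions sit on the diagonal.
correct = cm[0][0] + cm[1][1]
total = sum(sum(row) for row in cm)

accuracy = correct / total
print(f"{correct} of {total} correct")   # 158 of 171 correct
print(f"Accuracy: {accuracy:.3f}")       # Accuracy: 0.924
```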
Results
The generated decision tree is shown below:
Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable (breast cancer diagnosis: malignant or benign).
The dataset was split into training and test samples in a 70/30 ratio.
After fitting the classifier, the key metrics were calculated: the confusion matrix and accuracy = 0.924 (158 of 171 test samples classified correctly). This is a good result for a model trained on a small dataset.
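The split, fit and evaluation steps can be sketched roughly as follows. This is not the original notebook cell: it assumes scikit-learn's bundled breast cancer dataset in place of the CSV and default tree hyperparameters, so the exact numbers will differ from those reported here. Note also that the bundled data encodes 0 as malignant and 1 as benign, which may not match the encoding used in this post.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rnd_state = 23468

# Assumption: bundled dataset as a stand-in for 'Data/breast_cancer.csv'.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# 70/30 split into training and test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=rnd_state
)

# Fit the tree on the training part and predict on the held-out part.
clf = DecisionTreeClassifier(random_state=rnd_state)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Rows are predicted classes, columns are actual classes, as in the table above.
cm = pd.crosstab(
    pd.Series(pred, name="Predicted"),
    pd.Series(y_test.to_numpy(), name="Actual"),
    margins=True,
)
accuracy = accuracy_score(y_test, pred)

print(cm)
print("Accuracy:", accuracy)
```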
From the decision tree we can observe:
Malignant tumors tend to have markedly higher values for area, texture and concave points, while the corresponding values for benign tumors are significantly lower.
The most important features are:
concave points_worst = 0.707688
area_worst = 0.114771
concave points_mean = 0.034234
fractal_dimension_se = 0.026301
texture_worst = 0.026300
area_se = 0.025201
concavity_se = 0.024540
texture_mean = 0.023671
perimeter_mean = 0.010415
concavity_mean = 0.006880
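A ranking like the one above can be read from the fitted tree's feature_importances_ attribute. The sketch below again assumes scikit-learn's bundled dataset and default hyperparameters, so both the feature names (for example 'worst concave points' instead of 'concave points_worst') and the values will differ from the list above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rnd_state = 23468

# Assumption: bundled dataset as a stand-in for the CSV used in this post.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=rnd_state
)
clf = DecisionTreeClassifier(random_state=rnd_state).fit(X_train, y_train)

# Pair each importance with its column name and keep the ten largest.
importances = pd.Series(clf.feature_importances_, index=X.columns)
top10 = importances.sort_values(ascending=False).head(10)

for name, value in top10.items():
    print(f"{name} = {value:.6f}")
```

The importances are normalized to sum to 1, so a value such as 0.707688 means that a single feature accounts for about 71% of the total impurity reduction in the tree.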