Data Analysis with Classification Tree
Here I use the Gapminder data set to analyze which explanatory variables do the best job of classifying life expectancy among countries.
I am running a decision tree analysis on the following variables:
Response Variable: Life Expectancy (lifeexpectancy), converted to a categorical variable le_cat with two levels:
1: High – Life expectancy greater than 65 years
2: Low – Life expectancy less than or equal to 65 years
Explanatory Variables:
urbanrate
incomeperperson
employrate
alcconsumption
hivrate
suicideper100th
femaleemployrate
All of my explanatory variables are quantitative.
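The two-level recoding of the response variable can be sketched as follows. This is an illustrative Python stand-in, not the SAS data step used for the analysis; the function name le_cat_level and the sample values are hypothetical, and only the 65-year cut point and the level codes 1 and 2 come from the definition above.

```python
from typing import Optional

THRESHOLD_YEARS = 65.0  # cut point stated in the variable definition


def le_cat_level(lifeexpectancy: Optional[float]) -> Optional[int]:
    """Return 1 (High) if life expectancy > 65, 2 (Low) if <= 65, None if missing."""
    if lifeexpectancy is None:
        return None  # missing values stay missing rather than being forced into a level
    return 1 if lifeexpectancy > THRESHOLD_YEARS else 2


# Hypothetical values, for illustration only (not taken from the data set).
for value in (72.4, 65.0, 51.9, None):
    print(value, "->", le_cat_level(value))
```

Note that a value of exactly 65 falls into level 2 (Low), matching the "less than or equal to" wording above.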
A decision tree analysis was performed to test non-linear relationships between the explanatory variables above and the categorical response variable; all possible cut points were tested. I used the entropy “goodness of split” criterion to grow the tree and a cost-complexity algorithm to prune the full tree into the final subtree.
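To make the entropy criterion concrete, here is a minimal Python sketch of how the reduction in entropy from a single two-way split could be computed. It is not the SAS procedure's internal implementation; the function names are hypothetical, and the inputs are the rounded figures reported for the first split in the observations further down (82 countries with 91% high life expectancy, 60 countries with 33.33% high), so the result is only approximate.

```python
import math


def entropy(p_high: float) -> float:
    """Shannon entropy (in bits) of a two-class node where p_high is the share of 'High'."""
    if p_high <= 0.0 or p_high >= 1.0:
        return 0.0  # a pure node has zero entropy
    p_low = 1.0 - p_high
    return -(p_high * math.log2(p_high) + p_low * math.log2(p_low))


def entropy_reduction(n_left: int, p_left: float, n_right: int, p_right: float) -> float:
    """Parent entropy minus the size-weighted entropy of the two child nodes."""
    n = n_left + n_right
    p_parent = (n_left * p_left + n_right * p_right) / n
    weighted_children = (n_left / n) * entropy(p_left) + (n_right / n) * entropy(p_right)
    return entropy(p_parent) - weighted_children


# Rounded proportions from the reported first split (HIV rate < 0.577 vs >= 0.577).
gain = entropy_reduction(82, 0.91, 60, 1 / 3)
print(f"Approximate entropy reduction of the first split: {gain:.3f} bits")
```

When growing the tree, a split with a larger entropy reduction is preferred over one with a smaller reduction, which is why candidate cut points are compared on this quantity.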
The SAS code for my analysis is given below:
Output:
The final subtree (after pruning) generated by the snippet is:
Confusion Matrix:
Observations and Interpretations
1. Of the 213 countries in the data set, only 142 were included in the analysis, because only these records had data for all of the variables considered.
2. The initial tree after the “grow” step had 8 leaf nodes. After pruning, the adjusted subtree has 4 leaves.
3. The HIV rate was the first variable to separate the sample into two subgroups. The 82 countries with HIV rate < 0.577 form one subgroup, in which 91% have high life expectancy and only 9% have low life expectancy.
4. Of the 60 countries with a high HIV rate (>= 0.577), 33.33% have high life expectancy and 66.67% have low life expectancy.
5. This subgroup is then further divided by income per person at a cut point of 637.169.
6. Of the 30 countries with a high HIV rate and low income per person, none has high life expectancy.
7. On the other hand, 66.67% of the 30 countries with a high HIV rate and high income per person have high life expectancy.
8. The group of countries with a high HIV rate (>= 0.577) and high income per person (> 637.169) is further subdivided, using HIV rate again, into two subgroups:
a. HIV rate below 3.161. In this group, 91% of countries have high life expectancy.
b. HIV rate above 3.161. In this group, no country has high life expectancy.
9. From the confusion matrix,
a. The classification tree correctly classifies countries with high life expectancy 100% of the time.
b. It correctly classifies countries with low life expectancy about 90% of the time (1 - 0.0952 = 0.9048, or 90.48%).
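The four-leaf subtree described in observations 3 to 8, and the per-class rates read from a confusion matrix in observation 9, can be sketched in Python as below. This is an illustration rather than output from the SAS run. The cut points come from the observations above, but several things are assumptions: each leaf predicts its majority class, values exactly equal to 637.169 or 3.161 are sent to the branches marked in the comments, and the five records are hypothetical and not taken from the data set.

```python
from collections import Counter

HIGH, LOW = 1, 2  # le_cat levels as defined for the response variable


def predict_le_cat(hivrate: float, incomeperperson: float) -> int:
    """Walk the pruned tree and return the majority class of the leaf reached."""
    if hivrate < 0.577:
        return HIGH  # leaf 1: low HIV rate, mostly high life expectancy
    if incomeperperson <= 637.169:  # assumption: equality goes to the low-income branch
        return LOW  # leaf 2: high HIV rate and low income
    if hivrate < 3.161:
        return HIGH  # leaf 3: moderate HIV rate and higher income
    return LOW  # leaf 4: HIV rate of 3.161 or more (equality is an assumption)


def per_class_accuracy(actual, predicted):
    """Share of each actual class that was predicted correctly."""
    totals = Counter(actual)
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {level: correct[level] / totals[level] for level in totals}


# Hypothetical records (hivrate, incomeperperson, actual le_cat), for illustration only.
records = [
    (0.1, 25000.0, HIGH),
    (0.3, 900.0, LOW),   # lands in the low-HIV leaf, so it is misclassified as High
    (1.2, 400.0, LOW),
    (2.0, 5000.0, HIGH),
    (12.0, 3000.0, LOW),
]
actual = [r[2] for r in records]
predicted = [predict_le_cat(r[0], r[1]) for r in records]
accuracy = per_class_accuracy(actual, predicted)
print(accuracy)
```

The per-class accuracy is one minus the per-class error rate, which is the same arithmetic used in observation 9b.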














