K-means Cluster Analysis, 4.4 assignment
Dear fellow students,
I performed a K-means cluster analysis on the Gapminder data set to determine whether 9 clustering variables could be used to predict income per person in different countries. The clustering variables were:
femaleemployrate
employrate
internetuserate
lifeexpectancy
polityscore
relectricperperson
urbanrate
oilperperson
co2emissions
All clustering variables were quantitative and standardized to have a mean of 0 and a standard deviation of 1.
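The standardization step can be sketched in Python as follows. This is only an illustration: the post does not say which software was used, the function name standardize is hypothetical, and the random matrix stands in for the real Gapminder values.

```python
import numpy as np

def standardize(values):
    """Rescale each column to mean 0 and standard deviation 1."""
    values = np.asarray(values, dtype=float)
    # Assumption: population standard deviation (ddof=0); some packages
    # use the sample standard deviation (ddof=1) instead.
    return (values - values.mean(axis=0)) / values.std(axis=0)

# Hypothetical stand-in for the 9 clustering variables of 214 countries.
rng = np.random.default_rng(0)
raw = rng.normal(loc=50.0, scale=12.0, size=(214, 9))
standardized = standardize(raw)
```

After this step every column has the same scale, so no single variable dominates the Euclidean distances that K-means relies on.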
Data were randomly split into a training set that included 70% of the observations (N=150) and a test set that included 30% of the observations (N=64).
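A random 70/30 split like the one described can be sketched in Python. The function name split_train_test and the seed are assumptions made for illustration, and this is not necessarily how the split in this analysis was produced.

```python
import numpy as np

def split_train_test(n_rows, train_fraction=0.7, seed=123):
    """Return shuffled row indices for the training and test sets."""
    rng = np.random.default_rng(seed)   # the seed is an arbitrary assumption
    order = rng.permutation(n_rows)     # random ordering of all row indices
    n_train = round(train_fraction * n_rows)
    return order[:n_train], order[n_train:]

# 214 observations in total (150 + 64, as reported above).
train_idx, test_idx = split_train_test(214)
print(len(train_idx), len(test_idx))  # 150 64
```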
The elbow curve of R-square values for the 9 cluster solutions suggests that the 3-, 4-, 6-, 7- and 8-cluster solutions might be interpreted, since these show the most prominent bends.
The results below interpret the 4-cluster solution.
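The R-square values behind an elbow curve can be computed as one minus the ratio of within-cluster to total sum of squares. Below is a minimal Python sketch using scikit-learn, which is an assumption since the post does not name the software; the random matrix x_train is a hypothetical stand-in for the standardized training data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the standardized training set (150 x 9).
rng = np.random.default_rng(0)
x_train = rng.normal(size=(150, 9))

# Total sum of squared distances from the overall mean.
total_ss = ((x_train - x_train.mean(axis=0)) ** 2).sum()

r_square = {}
for k in range(1, 10):  # the 1- to 9-cluster solutions
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x_train)
    # inertia_ is the within-cluster sum of squared distances.
    r_square[k] = 1.0 - model.inertia_ / total_ss
```

Plotting r_square against k and looking for bends, where adding another cluster stops raising R-square by much, gives the candidate solutions.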
Canonical discriminant analysis was used to reduce the 9 clustering variables to a few canonical variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster indicated that the observations in clusters 2 and 3 were more numerous and more densely packed, and overlapped little with the other clusters. Cluster 4 was generally distinct but contained only a few observations. Observations in cluster 1 were more spread out than those in the other clusters, indicating higher within-cluster variance. This plot suggests that the best cluster solution may have fewer than 4 clusters. Because the data set is relatively small, it would also be important to evaluate solutions with fewer than 4 clusters (spoiler alert: I checked afterwards, and when the data are not split into training and test sets, the 3-cluster solution explains the differences better).
Plot of the first two canonical variables for the clustering variables by cluster.
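For readers who want to reproduce this kind of plot in Python, linear discriminant analysis in scikit-learn is closely related to canonical discriminant analysis and can supply the first two discriminant scores. This is a swapped-in technique under that assumption, not necessarily the procedure used above, and x_train is random hypothetical data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical stand-in for the standardized training set (150 x 9).
rng = np.random.default_rng(0)
x_train = rng.normal(size=(150, 9))

# Assign each observation to one of 4 clusters.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(x_train)

# Project the 9 variables onto the 2 directions that best separate the clusters.
lda = LinearDiscriminantAnalysis(n_components=2)
canonical = lda.fit_transform(x_train, labels)
# canonical[:, 0] and canonical[:, 1] can be scatter-plotted, colored by labels.
```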
The means of the clustering variables showed that countries in cluster 1 have the highest female and overall employment rates of all clusters, but the lowest internet usage, urbanization, life expectancy, oil and electricity usage, democracy scores and CO2 emissions. We can therefore predict that this is the cluster of countries with the lowest income per person.
Cluster 4 countries have the second-highest female and overall employment rates and life expectancy, and the highest (by a wide margin over the other clusters) internet usage, urbanization, oil and electricity usage, democracy scores and CO2 emissions. We can predict that this is the cluster of countries with the highest income per person.
Clusters 2 and 3 differ clearly from each other on all variables, with cluster 2 countries being more developed and having higher income per person than cluster 3 countries.
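A comparison of cluster means like the one above can be sketched in Python with pandas. The six rows below are made-up standardized values for illustration only, not the real Gapminder data, and only two of the clustering variables are shown.

```python
import pandas as pd

# Hypothetical standardized values; NOT the real results of this analysis.
data = pd.DataFrame({
    "cluster": [1, 1, 1, 4, 4, 4],
    "employrate": [0.9, 1.1, 1.3, 0.2, 0.4, 0.6],
    "urbanrate": [-1.0, -0.8, -0.9, 1.5, 1.7, 1.9],
})

# Mean of every clustering variable within each cluster.
cluster_means = data.groupby("cluster").mean()

# For each variable, the cluster with the highest and lowest mean.
highest = cluster_means.idxmax()
lowest = cluster_means.idxmin()
```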
An ANOVA was performed to test for significant differences between the clusters on income per person, and a Tukey test was used for post hoc comparisons between the clusters. Results indicated significant differences between the clusters on income (F=43.09, p<.0001). The Tukey post hoc comparisons showed significant differences between all pairs of clusters on income, except that clusters 1 and 3 were not significantly different from each other. Countries in cluster 4 had the highest income per person (mean=58210.61, sd=40741.34), and cluster 1 countries had the lowest (mean=956.84, sd=1367.61). Cluster membership explains about 51% of the variance in income per person (R-square=0.5083).
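The ANOVA, the Tukey comparisons and the R-square can be sketched in Python with SciPy (tukey_hsd needs SciPy 1.8 or later). The income values below are randomly generated and hypothetical, so the numbers this code produces are not the results reported above.

```python
import numpy as np
from scipy import stats

# Hypothetical income values for 4 clusters: (mean, sd, number of countries).
rng = np.random.default_rng(0)
settings = [(1000, 500, 40), (12000, 4000, 40), (3000, 1500, 50), (50000, 15000, 20)]
groups = [rng.normal(loc=m, scale=s, size=n) for m, s, n in settings]

# One-way ANOVA: do the cluster means differ on income?
f_stat, p_value = stats.f_oneway(*groups)

# R-square: share of the variance in income explained by cluster membership.
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_values - grand_mean) ** 2).sum()
r_square = ss_between / ss_total

# Tukey post hoc test: tukey.pvalue[i, j] compares cluster i with cluster j.
tukey = stats.tukey_hsd(*groups)
```

R-square here is the between-cluster sum of squares divided by the total sum of squares, which is why it is read as explained variance in income rather than as a share of observations.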
Please review my work. Thank you!
Sincerely,
Edita












