Discover Top Posts Tagged with #k-mean

K-means Cluster Analysis, 4.4 assignment

Dear fellow students,

I performed a K-means cluster analysis on the Gapminder data set to see whether 9 clustering variables could be used to predict income per person in different countries. The clustering variables were:

femaleemployrate

employrate

Internetuserate

lifeexpectancy

polityscore

relectricperperson

urbanrate

oilperperson

co2emissions

All clustering variables were quantitative and standardized to have a mean of 0 and a standard deviation of 1.
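As a rough illustration of the standardization step, here is a minimal Python sketch (not the SAS code used for the assignment). The `standardize` helper and the urbanrate values are made up for the example, and using the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

def standardize(values):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample SD (n - 1 in the denominator)
    return [(v - mu) / sigma for v in values]

# Made-up urbanrate values, for illustration only.
urbanrate = [20.0, 35.0, 50.0, 65.0, 80.0]
z = standardize(urbanrate)
```

After this step every variable is on the same scale, so no single variable dominates the distance calculations that K-means relies on.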

Data were randomly split into a training set that included 70% of the observations (N=150) and a test set that included 30% of the observations (N=64).
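A 70/30 random split could look roughly like the Python sketch below. The `train_test_split` helper, the seed, and the integer stand-ins for observations are assumptions for the example; the total of 214 is simply 150 + 64 from the numbers above.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=123):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(214))  # stand-ins for the 214 observations
train, test = train_test_split(rows)
```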

The elbow curve of R-square values for the nine cluster solutions suggests that the 3-, 4-, 6-, 7- and 8-cluster solutions might be interpreted, since these show the most prominent bends.
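To show what goes into such an elbow curve, here is a small Python sketch that runs K-means for several values of k and records R-square (1 minus within-cluster sum of squares over total sum of squares). Everything in it is an assumption for illustration: the toy 2-D points, the function names, and the deterministic farthest-point initialization, which is not necessarily what SAS does.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, k, max_iter=100):
    # Deterministic farthest-point initialization (an assumption of this sketch).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Move every center to the mean of its members (keep it if it has none).
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(centroid(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

def r_squared(points, centers, labels):
    """Share of total variation explained by the cluster assignment."""
    grand = centroid(points)
    total_ss = sum(sq_dist(p, grand) for p in points)
    within_ss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return 1 - within_ss / total_ss

# Toy data: three tight groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0),
          (1.0, 9.0), (1.5, 8.5), (2.0, 9.5)]

elbow = {k: r_squared(points, *kmeans(points, k)) for k in range(1, 6)}
for k, r2 in elbow.items():
    print(k, round(r2, 3))
```

Plotting R-square against k and looking for the point where the gains flatten out gives the "bend" that the elbow method refers to.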

The results below are for an interpretation of the 4-cluster solution.

Canonical discriminant analysis was used to reduce the 9 clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster indicated that the observations in clusters 2 and 3 were more densely packed and more numerous, and did not overlap very much with the other clusters. Cluster 4 was generally distinct, with only a few observations. Observations in cluster 1 were more spread out than those in the other clusters, indicating higher within-cluster variance. The results of this plot suggest that the best cluster solution may have fewer than 4 clusters. Also, the data set is relatively small, so it would be important to evaluate the cluster solutions with fewer than 4 clusters as well (spoiler alert: when the data are not split into training and test sets, the 3-cluster solution explains the differences better; I checked this afterwards).

Plot of the first two canonical variables for the clustering variables by cluster.

The means on the clustering variables showed that countries in cluster 1 have the highest female and overall employment rates, but the lowest internet usage, urbanization, life expectancy, oil and electricity usage, democracy scores and CO2 emissions. We can predict this would be the cluster of countries with the lowest income per person.

Cluster 4 countries have the 2nd highest female and overall employment rates and life expectancy, and the highest (by a wide margin over the other clusters) internet usage, urbanization, oil and electricity usage, democracy scores and CO2 emissions. We can predict this would be the cluster of countries with the highest income per person.

Clusters 2 and 3 differ markedly from each other on all variables, with cluster 2 countries clearly being more developed and having higher income per person than cluster 3 countries.

An ANOVA was performed to test for significant differences between the clusters on income per person, with a Tukey test for post hoc comparisons. The results indicate significant differences between the clusters on income (F=43.09, p<.0001). The Tukey post hoc comparisons showed significant differences between the clusters on income, with the exception that clusters 1 and 3 were not significantly different from each other. Countries in cluster 4 had the highest income per person (mean=58210.61, sd=40741.34), and cluster 1 countries had the lowest (mean=956.84, sd=1367.61). Cluster membership explains about 51% of the variance in income per person (R-square=0.5083).
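For readers who want to see where the F statistic and R-square come from, here is a minimal Python sketch of a one-way ANOVA. The `one_way_anova` helper and the three small groups are invented for illustration and are not the income data from this analysis. The sketch does not compute the p-value or the Tukey comparisons.

```python
def one_way_anova(groups):
    """Return (F statistic, R-square) for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    r_squared = ss_between / (ss_between + ss_within)
    return f_stat, r_squared

# Made-up values for three clusters, for illustration only.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, r2 = one_way_anova(groups)
print(f_stat, r2)
```

R-square here is the between-cluster sum of squares divided by the total sum of squares, which is why it is read as the share of variance in the outcome explained by cluster membership.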

Please review my work. Thank you!

Sincerely,

Edita

#sas #machine learning #data analysis #k-mean

Back Propagated K-Mean Clustering for Prediction of Slow Learners

By Vimika | Prince Verma, "Back Propagated K-Mean Clustering for Prediction of Slow Learners"

Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-1 | Issue-6, October 2017

URL: http://www.ijtsrd.com/papers/ijtsrd4695.pdf

http://www.ijtsrd.com/engineering/computer-engineering/4695/back-propagated-k-mean-clustering-for-prediction-of-slow-learners/vimika


#computer engineering #k-mean #svm #prediction

Algo(s) of the Day: kNN and K-Means (or "Which of these things is most like the other?")

k-Nearest Neighbor (kNN) and k-Means clustering are two of the most commonly used, and relatively easy to comprehend, methods for analyzing data by similarity. However, these algorithms serve different, but potentially overlapping, functions that can be confused with one another.

k-Means allows you to take an unlabelled data set of instances and put each instance into one of k groups. Because the data set is unlabelled, this is an unsupervised method.

For example, let's say you get a big pile of plant species that have never been named, but you know features (or measurements) of the plant species such as size, color, shape, weight, etc. You can use these features to start putting them into piles or groups of "similar" plants.
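As a rough sketch of the plant-pile idea, the Python below alternates the two k-Means steps (assign each plant to its nearest center, then move each center to the mean of its pile) until the piles stop changing. The two-feature plant measurements, the function names, and the choice to start from the first k plants are all assumptions made for this example.

```python
import math

def assign(points, centers):
    """Assignment step: index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def update(points, labels, k):
    """Update step: move each center to the mean of its group.
    This sketch assumes no group ever becomes empty."""
    centers = []
    for j in range(k):
        group = [p for p, lab in zip(points, labels) if lab == j]
        centers.append(tuple(sum(c) / len(group) for c in zip(*group)))
    return centers

# Hypothetical (size, weight) measurements for six unnamed plants.
plants = [(1.0, 2.0), (9.0, 8.0), (1.5, 1.8), (8.5, 9.0), (1.2, 2.2), (9.5, 8.5)]
k = 2
centers = plants[:k]  # naive start: the first k plants
labels = None
while True:
    new_labels = assign(plants, centers)
    if new_labels == labels:  # piles stopped changing
        break
    labels = new_labels
    centers = update(plants, labels, k)

print(labels)
```

The starting centers matter: a different start can lead to different piles, which is one reason real implementations usually try several starts.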

kNN allows you to use a labelled data set to classify an unlabelled instance based on its proximity to its k nearest neighbors. In this supervised method, the "training" of the model is carried out at the time of classifying an unlabelled instance. Depending on the data type, the hard work could be in creating the labelled data set (k-Means could potentially be used to create these labels) rather than training a model; the data set is the model. The expense comes when testing unknowns and, depending on the size of the data set, storing all that data.

For example, you obtain a pile of identified plant species, each described by a number of features (or measurements). But then a scientist discovers a new species and your job is to figure out which known species it is most similar to.
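The new-species example can be sketched in a few lines of Python. The `knn_classify` helper, the plant names, and the two-feature measurements are hypothetical, and a simple majority vote among the k nearest neighbors is just one common way to pick the label.

```python
import math
from collections import Counter

def knn_classify(labelled, query, k=3):
    """Label a query point by majority vote among its k nearest neighbors."""
    nearest = sorted(labelled, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (size, weight) measurements for identified plants.
known = [((1.0, 2.0), "fern"), ((1.5, 1.8), "fern"), ((1.2, 2.2), "fern"),
         ((9.0, 8.0), "cactus"), ((8.5, 9.0), "cactus"), ((9.5, 8.5), "cactus")]

new_plant = (8.0, 8.0)
prediction = knn_classify(known, new_plant)
print(prediction)
```

Note that there is no separate training step: every prediction scans the whole labelled data set, which is the storage and lookup cost mentioned above.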

Basic Assumptions: For these analyses, k represents either the number of clusters created (k-Means) or the number of neighbors (kNN) used for classification. Because the analyst must determine the optimal number for k, these methods are sometimes characterized as semi-supervised.

Both algorithms calculate some sort of similarity (often a distance metric) between data points. For example, Euclidean distance is a common metric calculated in feature space.

Both methods are non-parametric and therefore lack a need to assume anything about the distribution of the data (which is one less assumption to worry about, yay!).

In upcoming posts I will go into more detail (and Python code) on how these algos work. I might even compare the explicitly written code with (for example) scipy or R functions in an effort to demonstrate how data science can be as much an art form as it is a science. Beware of situations that lead to inaccurate solutions; these can often be avoided by visualizing your data (if possible). For example, how would you categorize the yellow tomato below?

#k-mean #knn #algorithm #data science

Image classification using feature values

Could the feature values obtained through DCT be grouped with mean shift rather than K-Means?

#k-mean #mean shift