@imsmaity96
K Means Cluster Analysis
K-Means clustering is one of the most popular unsupervised machine learning algorithms. In supervised learning, we provide a set of features together with labels; the model learns what the output should be for a given type of input, and when we give it a new input it predicts the output. In unsupervised learning no labels are provided. The algorithm finds patterns in the inputs and forms clusters. If the clusters are far from each other and the elements inside each cluster are close to each other, we consider it a good clustering.
K-Means clustering uses four simple steps to get the clusters. First, we choose the number of clusters (k). Second, a random centroid is selected for each cluster. Third, each point is assigned to the cluster with the nearest centroid. Fourth, the centroid of each cluster is recomputed from the points assigned to it. The last two steps repeat until the centroids no longer change.
K-Means clustering can be used in document classification, recommender systems, image classification, customer segmentation, etc.
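The four steps can be sketched in plain Python as follows. This is a minimal illustration, not the code used for the analysis in this post: the function names, the toy points and the choice of picking initial centroids from the data points are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    # Step 1 is the caller's choice of k.
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign every point to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each centroid (keep the old one if a cluster is empty).
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        # Stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

On data this cleanly separated the two groups are recovered whichever starting centroids are drawn; on real data the result can depend on the initialization, which is why library implementations typically rerun the algorithm from several starts.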
Data description
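We can clearly see that columns like Balance, Bonus_miles, Bonus_trans, Flight_miles_12mo and Days_since_enroll are on different scales, so we have to standardize them. K-Means relies on distances, so without standardization the columns with the largest ranges would dominate the clustering. Standardizing also helps reduce the training time and gives better output.
Let's check the inertia of the clusters by plotting the elbow curve. Inertia is the sum of squared distances of all points in a cluster from its centroid, added up over all clusters.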
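As a rough sketch of what standardization does, the snippet below rescales a column to zero mean and unit standard deviation (a z-score). The column names come from the dataset, but the values are made up for illustration, and the `standardize` helper is hypothetical rather than the scaler used in the analysis.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical values: the two columns live on very different scales.
balance = [28143, 19244, 41354, 14776, 97752]
days_since_enroll = [7000, 6968, 7034, 6952, 6935]

balance_scaled = standardize(balance)
days_scaled = standardize(days_since_enroll)
print(balance_scaled)
print(days_scaled)
```

After scaling, both columns contribute on a comparable footing to the distance computations.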
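To make the idea behind the elbow curve concrete, here is a small sketch that computes the inertia for k = 1, 2 and 3 on toy one-dimensional data. The partitions are chosen by hand for illustration (an assumption; in practice they come from running K-Means for each k), and inertia is taken as the sum of squared distances to the cluster centroid.

```python
def inertia(clusters):
    """Sum of squared distances of every point to the centroid of its cluster."""
    total = 0.0
    for cluster in clusters:
        centroid = sum(cluster) / len(cluster)
        total += sum((x - centroid) ** 2 for x in cluster)
    return total


# Hypothetical 1-D data with two obvious groups, partitioned by hand for each k.
partitions = {
    1: [[1, 2, 3, 10, 11, 12]],
    2: [[1, 2, 3], [10, 11, 12]],
    3: [[1, 2, 3], [10, 11], [12]],
}

inertia_by_k = {k: inertia(clusters) for k, clusters in partitions.items()}
for k, value in inertia_by_k.items():
    print(k, value)
```

The inertia drops sharply from k = 1 to k = 2 and only slightly from k = 2 to k = 3. That bend is the "elbow", and it suggests k = 2 for this toy data.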
Lasso Regression Analysis
LASSO regression is used to reduce model overfitting. It increases the bias and reduces the variance of the model.
LASSO stands for Least Absolute Shrinkage and Selection Operator, so the model itself is capable of feature selection. It shrinks the coefficients of the less important features and removes the features that are not important by setting their coefficients to zero.
LASSO regression is also known as L1 regularization. It penalizes the sum of the absolute values of the coefficients, which removes variables that don't contribute much to the model.
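The way the L1 penalty drives coefficients to exactly zero can be illustrated with the soft-thresholding operator. Under the simplifying assumption of orthonormal features, the LASSO coefficient is the ordinary least-squares coefficient shrunk towards zero by the penalty and set to zero if it is smaller than the penalty. The feature names, coefficient values and penalty below are hypothetical.

```python
def soft_threshold(coef, penalty):
    """Shrink a coefficient towards zero by `penalty`; small ones become exactly zero."""
    if coef > penalty:
        return coef - penalty
    if coef < -penalty:
        return coef + penalty
    return 0.0


# Hypothetical least-squares coefficients for three features.
ols_coefs = {"feature_a": 2.5, "feature_b": -0.3, "feature_c": 0.8}
penalty = 0.5

lasso_coefs = {name: soft_threshold(c, penalty) for name, c in ols_coefs.items()}
selected = [name for name, c in lasso_coefs.items() if c != 0.0]
print(lasso_coefs)
print(selected)
```

Here feature_b is dropped because its coefficient is smaller in magnitude than the penalty, while the other two are kept with smaller coefficients. This is the "shrinkage and selection" in the name.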
Conclusion
We can clearly see that the prediction accuracy is stable across both datasets. When we add more data the prediction error decreases. The R-squared values of .74 and .70 indicate that the model explains 74% of the variance in the target on the training data and 70% on the test data.
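For reference, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up numbers, not the values from this analysis.

```python
def r_squared(y_true, y_pred):
    """Share of the variance in y_true that is explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained part
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total variation
    return 1 - ss_res / ss_tot


# Hypothetical targets and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

score = r_squared(y_true, y_pred)
print(score)
```

A training score that is close to the test score, as with .74 and .70 above, suggests the model is not overfitting badly.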
Random Forest
Random forest is a supervised machine learning model. Like a decision tree, a random forest can be used to predict continuous and categorical variables.
Random forest is an ensemble model: it uses more than one model. It uses multiple decision trees to derive the output.
It uses the bagging method: it draws bootstrap samples (random subsets sampled with replacement) from the training data, trains a tree on each one, and takes the majority vote of the trees as the output. Bagging is short for bootstrap aggregation, because the output is decided by aggregating the outputs of models trained on bootstrap samples.
Since it derives its output by aggregating the votes of multiple models, it has lower variance than a single decision tree, usually at the cost of a slightly higher bias. Ensemble models are generally preferred for building classification models.
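Bagging with a majority vote can be sketched in a few lines. To keep the example short, each "tree" here is a one-split decision stump on a single feature, and the data is a toy set. Both are assumptions for illustration; a real random forest grows full trees and also samples a subset of features at each split.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Pick the threshold with the fewest errors; the stump predicts 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x > threshold) != bool(label) for x, label in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagged_stumps(data, n_models=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]
        thresholds.append(fit_stump(bootstrap))
    return thresholds


def predict(thresholds, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical (feature, label) pairs: label 1 for large feature values.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0), (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
model = bagged_stumps(data)
print(predict(model, 0.5), predict(model, 9.5))
```

Each stump sees a slightly different sample and may pick a different threshold. Averaging over their votes smooths out these differences, which is where the reduction in variance comes from.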
We can clearly see that parameter tuning slightly helped in improving performance.
Decision Tree
A decision tree can be easily interpreted and visualized, and the working of the model can be easily explained to stakeholders, unlike black-box algorithms such as neural networks and SVMs.
A decision tree can be used as a regression model and also as a classification model. The output of a regression model is a continuous variable, while the output of a classification model is a categorical variable.
The top node of the decision tree is called the root node or master node. It is split into child nodes; a node that is split further is the parent node of its own sub-nodes. A node at the end of a branch has no sub-nodes and is called a leaf node or terminal node.
The decision tree splits nodes based on the homogeneity of their elements and makes its decisions from the resulting groups. The root node has the least homogeneous elements and the leaf nodes have the most homogeneous elements.
If the model has a continuous variable as output, the decision tree uses the reduction in variance method. For a classification model, which has a categorical variable as output, the decision tree uses methods such as Gini impurity, information gain or chi-square to split the nodes based on the homogeneity of elements.
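Two of these split criteria, Gini impurity and reduction in variance, can be sketched as follows. A split is scored by how much it lowers the size-weighted impurity of the child nodes compared with the parent. The helper names and the toy labels and targets are assumptions made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure node, higher for a mixed one."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def variance(values):
    """Population variance of the target values in a node."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def weighted_impurity(left, right, impurity):
    """Impurity of two child nodes, weighted by their share of the samples."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


# Classification: a split that separates the two classes perfectly.
labels = [0, 0, 0, 1, 1, 1]
gini_gain = gini(labels) - weighted_impurity([0, 0, 0], [1, 1, 1], gini)

# Regression: a split that separates low targets from high targets.
targets = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
variance_reduction = variance(targets) - weighted_impurity(
    [1.0, 2.0, 3.0], [7.0, 8.0, 9.0], variance
)
print(gini_gain, variance_reduction)
```

At each node the tree evaluates candidate splits this way and keeps the one with the largest gain, so the child nodes become more homogeneous than the parent.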
Note: Decision trees have low bias and high variance. They tend to overfit the training data and fail to generalize, so the model will perform well in the training environment but might not perform well when deployed. We must therefore carefully choose the right bias-variance trade-off. We can prune the tree to reduce the variance.
Dataset Description
The avalanche dataset is by Microsoft and has 1095 rows and 7 columns.
The "avalanche" column is the target variable: zero means an avalanche did not occur and one means an avalanche occurred. The other columns are feature variables. The "tracked_out" column is the only categorical feature; the other columns are continuous variables.
We can clearly see that there are no null values in the dataset.
EDA
Feature vs Target variable
The no_visitors, fresh_thickness and tracked_out features show little correlation with the target output and do not contribute much to the target variable.
The surface_hoar, wind and weak_layers features vary with the target output and can be used to build the model.
From the confusion matrix we see that 104 outputs are true positives and 141 are true negatives. False positives and false negatives are 42.
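As a reminder of how the four cells of a confusion matrix are counted and turned into metrics, here is a small sketch. The labels below are made up for illustration and are not the avalanche predictions; the function name is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


# Hypothetical actual and predicted labels.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)  # share of actual positives that are found
print(tp, tn, fp, fn, accuracy, precision, recall)
```

For a problem like avalanche prediction, recall is worth watching closely, since a false negative (a missed avalanche) is usually more costly than a false alarm.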
Since the tree is large, I have used export_text instead of plotting the decision tree. We can see how the decision tree first splits on <= 20.50, then on <= 4.61 in the second layer, and so on.