Untitled @analysisbyzee - Tumblr Blog

HOW DOES BODY FAT HAVE AN IMPACT ON THE HEIGHT AND WEIGHT OF A PERSON?

First, all required libraries were imported. Next, the required dataset was loaded.

Here, I loaded the dataset, available at Kaggle.com in CSV format, using the read_csv() function. The dataset contains 15 attributes. These are:

Density determined from underwater weighing, Percent body fat from Siri's (1956) equation, Age (years), Weight (lbs), Height (inches), Neck circumference (cm), Chest circumference (cm), Abdomen 2 circumference (cm), Hip circumference (cm), Thigh circumference (cm), Knee circumference (cm), Ankle circumference (cm), Biceps (extended) circumference (cm), Forearm circumference (cm), Wrist circumference (cm)

The dataset contains estimates of the percentage of body fat determined by underwater weighing and various body circumference measurements for 252 men.

Research question: ARE BODY FAT AND HEIGHT OF A PERSON ASSOCIATED? HOW DOES BODY FAT HAVE AN IMPACT ON THE HEIGHT AND WEIGHT OF A PERSON?

Since both the response variable and the explanatory variable are quantitative, we apply the Pearson correlation coefficient for the analysis.

To get the Pearson correlation, the pearsonr function from the scipy.stats library is used. It returns both the correlation coefficient and the associated p-value. For the association between body fat and height of a person, the correlation coefficient is approximately -0.09 with a p-value of approximately 0.15. Since the p-value is greater than 0.05, the correlation is not statistically significant.
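The call described above can be sketched as follows. This is only an illustration on synthetic numbers, not the Kaggle data; the variable names `height` and `body_fat` and the generated values are assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in values (NOT the Kaggle data): two quantitative variables.
rng = np.random.default_rng(0)
height = rng.normal(70, 3, size=252)     # inches, hypothetical
body_fat = rng.normal(19, 8, size=252)   # percent, hypothetical

# pearsonr returns the correlation coefficient and its two-sided p-value.
r, p_value = stats.pearsonr(height, body_fat)
print("r =", r)
print("p-value =", p_value)
print("significant at 0.05:", p_value < 0.05)
```

With the real dataset, the two arrays would simply be the height and body fat columns of the loaded DataFrame.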

By squaring the r-value we get approximately 0.0081. This means that if we know the height of a person, we can predict only about 0.8 percent of the variability we'll see in the amount of body fat.

Therefore, from the above observations, we can say that there is no statistically significant linear association between body fat and height of a person.

Next, we check the association between weight and body fat of a person. The same procedure was used to get the correlation coefficient and p-value. For the association between body fat and weight of a person, the correlation coefficient was found to be approximately 0.61 and the p-value was 2.4731162567910132e-27. Since the p-value is less than 0.05, the relationship is statistically significant. Since r is closer to 1 than to 0, we can say that the association between body fat and weight is fairly strong, and it is also positive.

After squaring the r-value, we get 0.37. This can be interpreted in the following way: if we know the weight of a person, we can predict 37 percent of the variability we will see in the amount of body fat.
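The squaring step for both correlations can be checked with a couple of lines. The r values below are the rounded ones quoted in this post, so the results are approximate.

```python
# Correlation coefficients as reported in the text (rounded values).
r_height = -0.09
r_weight = 0.61

# Squaring r gives the share of variability in one variable
# that can be predicted from the other.
r2_height = r_height ** 2
r2_weight = r_weight ** 2

print(f"height: r^2 = {r2_height:.4f} ({r2_height:.1%})")
print(f"weight: r^2 = {r2_weight:.4f} ({r2_weight:.1%})")
```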

#data analytics #body fat #height #weight

CHI SQUARE TEST OF INDEPENDENCE FOR CHEST PAIN AND THALL

First, all required Python libraries need to be loaded. These include the seaborn and scipy.stats libraries.

I have decided that I am going to work with the heart.csv dataset available at Kaggle.com. The dataset contains 14 attributes. These are age, sex, chest pain type (4 values), resting blood pressure, serum cholesterol in mg/dl, fasting blood sugar > 120 mg/dl, resting electrocardiographic results (values 0,1,2), maximum heart rate achieved, exercise induced angina, old peak = ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment, number of major vessels (0-3) colored by fluoroscopy, thal: 0 = normal; 1 = fixed defect; 2 = reversible defect and output: 0 = less chance of heart attack, 1 = more chance of heart attack.

First, the pandas crosstab function is used to generate a contingency table of observed counts. Within the parentheses, I included my 4-level categorical response variable, chest pain, followed by a comma, and then my categorical explanatory variable, THALL.

We used the chi-square test to analyze the variables since both the response and the explanatory variable are categorical.

These results include the table of counts of the response variable by the explanatory variable.

To generate the column percentages, I used the counts for my contingency table Ct1.
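A minimal sketch of the contingency table and column percentages, using a small made-up sample instead of heart.csv. The column names follow the ones used elsewhere in this blog, and `colpct` is a name chosen for the example.

```python
import pandas as pd

# Small made-up sample (NOT the heart.csv data).
data = pd.DataFrame({
    "chest pain": [0, 0, 1, 2, 2, 3, 0, 1, 2, 0, 2, 1],
    "THALL":      [2, 3, 2, 2, 3, 3, 3, 2, 2, 1, 2, 2],
})

# Contingency table of observed counts: response first, explanatory second.
ct1 = pd.crosstab(data["chest pain"], data["THALL"])
print(ct1)

# Column percentages: divide each cell by its column total.
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)
```

Each column of `colpct` sums to 1, so it shows how chest pain levels are distributed within each THALL category.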

Now, looking at the chi-square results, the chi-square value is large, 41.69, and the p-value, shown in scientific notation, is quite small: approximately 3.7352035167743426e-06. This clearly tells us that chest pain and THALL are significantly associated.

By plotting the graph we infer that individuals found to have a reversible defect in their blood flow experienced greater chest pain.
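The test itself can be run with scipy's chi2_contingency function. The sketch below uses made-up counts (not the observed table from this analysis) purely to show the call and what it returns.

```python
import numpy as np
from scipy import stats

# Made-up observed counts (rows: chest pain levels, columns: THALL levels).
observed = np.array([
    [10, 60, 70],
    [4, 35, 10],
    [3, 65, 18],
    [1, 14, 8],
])

# chi2_contingency returns the statistic, the p-value, the degrees of freedom,
# and the expected counts under independence.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print("chi-square =", chi2)
print("p-value =", p_value)
print("degrees of freedom =", dof)
```

A pandas crosstab can be passed directly in place of the NumPy array.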

#data analytics


ANOVA test for THALL and maximum heart rate peaked

First, all Python libraries need to be loaded. In order to calculate the ANOVA F-statistic and its corresponding p-value, we also need to import the statsmodels formula API, which allows us to fit statistical models.

When examining the association between max_heart_rate (the maximum heart rate the patient reached; quantitative response) and THALL (a radioactive element injected into the bloodstream of the patient; categorical explanatory), an Analysis of Variance (ANOVA) revealed the following:

Using the ols function we get an F-statistic of 10.52 and a p-value of 1.34e-06. Since the p-value is lower than the cutoff point (0.05), we can safely reject the null hypothesis and say that there is an association between thallium present in the body and the maximum heart rate the patient reached.
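A sketch of how the F-statistic and p-value can be obtained with the ols function. The data here are synthetic, not heart.csv, so the numbers will not match the ones reported above; the column names are the ones assumed throughout this blog.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT heart.csv): a quantitative response and
# a categorical explanatory variable with three levels.
rng = np.random.default_rng(1)
sub = pd.DataFrame({
    "THALL": np.repeat([1, 2, 3], 30),
    "max_heart_rate": np.concatenate([
        rng.normal(135, 20, 30),
        rng.normal(155, 20, 30),
        rng.normal(143, 20, 30),
    ]),
})

# C(...) tells the formula API to treat THALL as categorical.
model = smf.ols(formula="max_heart_rate ~ C(THALL)", data=sub).fit()
print("F-statistic:", model.fvalue)
print("p-value:", model.f_pvalue)
```

Calling `print(model.summary())` shows the same two numbers in the full results table.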

Using the groupby function, we can eyeball the mean maximum heart rate of each group. According to the output, category '1' has the lowest mean (134.235294) and category '2' has the highest mean (155.771084). In other words, patients in THALL category '1' reached the lowest maximum heart rate on average, whereas patients in category '2' reached the highest.
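The group means can be computed as in the sketch below. The values are made up for illustration and are not taken from heart.csv.

```python
import pandas as pd

# Made-up values (NOT heart.csv); column names are assumptions.
sub = pd.DataFrame({
    "THALL": [1, 1, 2, 2, 2, 3, 3],
    "max_heart_rate": [130.0, 140.0, 150.0, 160.0, 158.0, 142.0, 146.0],
})

# Mean and standard deviation of the response within each category.
means = sub.groupby("THALL")["max_heart_rate"].mean()
sds = sub.groupby("THALL")["max_heart_rate"].std()
print(means)
print(sds)
```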

Post hoc comparisons of mean maximum heart rate by THALL category revealed that individuals with a fixed defect in their blood flow had a significantly different peak heart rate compared to those with normal blood flow or a reversible defect.


Here, the ANOVA test is used because we have a quantitative response variable and a categorical explanatory variable with more than two levels.


#heart attack #datasets #anova #data analysis

The univariate graph of age group(categorical variable):

This graph is unimodal. It seems to be skewed to the left as there are higher frequencies in high categories than the lower categories.

The univariate graph of chest pain(categorical variable):

This graph is unimodal, with its highest peak at category '0' of chest pain. It seems to be skewed to the right as there are higher frequencies in the lower categories than in the higher categories.

The univariate graph of max heart rate (quantitative variable):

This graph is unimodal, with its highest peak in the 160-180 range. It seems to be skewed to the left as there are higher frequencies in the higher categories than in the lower categories.

The univariate graph of output(categorical variable):

The above graph is a bivariate graph which shows the relationship between chest pain and output (or likelihood of heart attack). According to the graph, individuals who experienced chest pain of type 1 are more likely to have a heart attack. The graph is skewed left.

#data analysis #heart attack #analysis

CONVERTING AGE INTO CATEGORICAL VARIABLE

For this assignment, we created a new variable in order to compare age groups categorically. First, we created a variable with four age groups, cut at the 25th, 50th, 75th and 100th percentiles, by using the pandas qcut function.
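A minimal sketch of the quartile split, on a handful of made-up ages rather than the real data. The column name `AGEGROUP4` and the labels are assumptions chosen for the example.

```python
import pandas as pd

# Made-up ages (NOT heart.csv).
data = pd.DataFrame({"age": [29, 35, 41, 45, 50, 54, 57, 60, 63, 66, 70, 77]})

# Four groups with (roughly) equal counts, cut at the quartiles.
data["AGEGROUP4"] = pd.qcut(data["age"], 4, labels=["Q1", "Q2", "Q3", "Q4"])

counts4 = data["AGEGROUP4"].value_counts(sort=False)
print(counts4)
```

Because qcut places its cut points at quantiles, each group ends up with about the same number of observations.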

Then, we created customized splits using the pandas cut function. Here, I have created three age groups, 17-30, 31-50 and 51-80.
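The customized split can be sketched like this. The bin edges 17, 30, 50 and 80 are an assumption based on the three groups named above, and the ages are made up.

```python
import pandas as pd

# Made-up ages (NOT heart.csv).
data = pd.DataFrame({"age": [29, 30, 31, 45, 50, 51, 64, 77]})

# Bin edges are right-inclusive by default: (17, 30], (30, 50], (50, 80].
data["AGEGROUP3"] = pd.cut(data["age"], [17, 30, 50, 80])

# Counts and percentages per age group.
counts = data.groupby("AGEGROUP3", observed=False).size()
percentages = counts / counts.sum()
print(counts)
print(percentages)
```

Unlike qcut, cut uses fixed edges, so the groups can be very unequal in size.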

By using the crosstab function, we can check the count of observations within each of the three age group categories.

We then created a frequency distribution for the new AGEGROUP3 variable. Following are the results:

Of the total number, 0% lies in age group category (17, 30], 20.73% lies in age group category (30, 51], and 79.26% lies in age group category (51, 80]. From the above frequency distribution, we observe that the majority of individuals who had a heart attack were older (above 50).

#heart attack #data analysis #assignment #conversion #variable

Frequency Distribution table for chest pain and THALL attribute

First, the pandas and numpy libraries were imported. Next, the required dataset was loaded. Here, I loaded the dataset, available at Kaggle.com in CSV format, using the read_csv() function. The dataset contains 14 attributes. These are age, sex, chest pain type (4 values), resting blood pressure, serum cholesterol in mg/dl, fasting blood sugar > 120 mg/dl, resting electrocardiographic results (values 0,1,2), maximum heart rate achieved, exercise induced angina, old peak = ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment, number of major vessels (0-3) colored by fluoroscopy, thal: 0 = normal; 1 = fixed defect; 2 = reversible defect and output: 0 = less chance of heart attack, 1 = more chance of heart attack.

Using the len function we can check the total number of variables and observations. The dataset has 14 variables and 302 observations. Out of the 14 variables, only 3 were used for the frequency distribution tables.

Using the dtype attribute we can check the data type of a variable.

To convert it into numeric format we use to_numeric function available in the pandas library which was imported earlier.

#setting variables you will be working with to numeric
data['age'] = pd.to_numeric(data['age'])
data['sex'] = pd.to_numeric(data['sex'])
data['chest pain'] = pd.to_numeric(data['chest pain'])
data['resting blood pressure'] = pd.to_numeric(data['resting blood pressure'])
data['cholestrol'] = pd.to_numeric(data['cholestrol'])
data['fasting blood sugar'] = pd.to_numeric(data['fasting blood sugar'])
data['resting ecg'] = pd.to_numeric(data['resting ecg'])
data['max heart rate'] = pd.to_numeric(data['max heart rate'])
data['excercise included'] = pd.to_numeric(data['excercise included'])
data['old peak'] = pd.to_numeric(data['old peak'])
data['slp'] = pd.to_numeric(data['slp'])
data['caa'] = pd.to_numeric(data['caa'])
data['THALL'] = pd.to_numeric(data['THALL'])
data['output'] = pd.to_numeric(data['output'])

Next, we calculated the counts and percentages of observations for each variable using the value_counts function. To get percentages, the normalize parameter should be set to True.

#counts and percentages (i.e. frequency distributions) for each variable
print("Counts for sex: ")
c1 = data['sex'].value_counts(sort=False)
print (c1)
print("Percentages for sex: ")
p1 = data['sex'].value_counts(sort=False, normalize=True)
print (p1)
print("Counts for chest pain: ")
c2 = data['chest pain'].value_counts(sort=False)
print (c2)
print("Percentages for chest pain: ")
p2 = data['chest pain'].value_counts(sort=False, normalize=True)
print (p2)
print("Counts for THALL: ")
c3 = data['THALL'].value_counts(sort=False)
print (c3)
print("Percentages for THALL: ")
p3 = data['THALL'].value_counts(sort=False, normalize=True)
print (p3)

For the first question, "On a scale of 0-3 (from lowest to highest), what level of chest pain did you experience?", of the total number, 47.35% chose level 0, 16.55% chose level 1, 28.8% chose level 2 and 7.28% chose level 3.

For the second question, "On a scale of 0-3, what was the level of blood flow of the patient observed while exercising and resting?":

0 maps to null in the original dataset.

2 -This means that the blood flow was normal.

3 - This means that a reversible defect was found

Of the total number, 0.66% chose level 0, 5.62% chose level 1, 54.96% chose level 2 and 38.74% chose level 3. From the above frequency distribution, we observe that for the majority of individuals, blood flow was normal while exercising and resting.

Next, we calculated the frequency distributions for individuals aged 45 or older. This can be achieved by subsetting the dataset as follows.

sub1=data[(data['age']>=45)]

For the first question, "On a scale of 0-3 (from lowest to highest), what level of chest pain did you experience?", of the total number, 50.40% chose level 0, 15.04% chose level 1, 27.23% chose level 2 and 7.31% chose level 3.

For the second question, "On a scale of 0-3, what was the level of blood flow of the patient observed while exercising and resting?":

0 maps to null in the original dataset.

2 -This means that the blood flow was normal.

3 - This means that a reversible defect was found

Of the total number, 0.81% chose level 0, 5.69% chose level 1, 51.62% chose level 2 and 41.86% chose level 3. From the above frequency distribution, we observe that for the majority of individuals, blood flow was normal while exercising and resting.

We also observe that there is not much difference between the frequency distribution tables for the two cases.

THE WHOLE CODE:

import pandas as pd
import numpy as np

# any additional libraries would be imported here

column_names = ['age','sex','chest pain','resting blood pressure','cholestrol','fasting blood sugar','resting ecg','max heart rate','excercise included','old peak','slp','caa','THALL','output']
data = pd.read_csv("heart.csv", header=None, names=column_names)
data = data.iloc[1: , :] # removes the first row of dataframe

print (len(data)) # number of observations (rows)
print (len(data.columns)) # number of variables (columns)

# checking the format of your variables
data['resting blood pressure'].dtype

# setting variables you will be working with to numeric
data['age'] = pd.to_numeric(data['age'])
data['sex'] = pd.to_numeric(data['sex'])
data['chest pain'] = pd.to_numeric(data['chest pain'])
data['resting blood pressure'] = pd.to_numeric(data['resting blood pressure'])
data['cholestrol'] = pd.to_numeric(data['cholestrol'])
data['fasting blood sugar'] = pd.to_numeric(data['fasting blood sugar'])
data['resting ecg'] = pd.to_numeric(data['resting ecg'])
data['max heart rate'] = pd.to_numeric(data['max heart rate'])
data['excercise included'] = pd.to_numeric(data['excercise included'])
data['old peak'] = pd.to_numeric(data['old peak'])
data['slp'] = pd.to_numeric(data['slp'])
data['caa'] = pd.to_numeric(data['caa'])
data['THALL'] = pd.to_numeric(data['THALL'])
data['output'] = pd.to_numeric(data['output'])

# counts and percentages (i.e. frequency distributions) for each variable
print("Counts for sex: ")
c1 = data['sex'].value_counts(sort=False)
print (c1)
print("Percentages for sex: ")
p1 = data['sex'].value_counts(sort=False, normalize=True)
print (p1)
print("Counts for chest pain: ")
c2 = data['chest pain'].value_counts(sort=False)
print (c2)
print("Percentages for chest pain: ")
p2 = data['chest pain'].value_counts(sort=False, normalize=True)
print (p2)
print("Counts for THALL: ")
c3 = data['THALL'].value_counts(sort=False)
print (c3)
print("Percentages for THALL: ")
p3 = data['THALL'].value_counts(sort=False, normalize=True)
print (p3)

# subset data to adults greater than age 45
sub1 = data[(data['age']>=45)]

# make a copy of my new subsetted data
sub2 = sub1.copy()

# counts and percentages (i.e. frequency distributions) for each variable
print("Counts for sex: ")
c1 = sub2['sex'].value_counts(sort=False)
print (c1)
print("Percentages for sex: ")
p1 = sub2['sex'].value_counts(sort=False, normalize=True)
print (p1)
print("Counts for chest pain: ")
c2 = sub2['chest pain'].value_counts(sort=False)
print (c2)
print("Percentages for chest pain: ")
p2 = sub2['chest pain'].value_counts(sort=False, normalize=True)
print (p2)
print("Counts for THALL: ")
c3 = sub2['THALL'].value_counts(sort=False)
print (c3)
print("Percentages for THALL: ")
p3 = sub2['THALL'].value_counts(sort=False, normalize=True)
print (p3)

#frequency distribution #python #data analysis

Association between chest pain and heart attack- Literature Review

I have decided that I am particularly interested in the output variable, which shows the likelihood of getting a heart attack. I am not sure which variables I will use in relation to the output variable (e.g. fasting blood sugar or chest pain), so for now I will include all of the relevant variables in my personal codebook.

Since chest pain is the basic symptom of a heart attack, I have decided to explore the association between the type of chest pain and the chances of getting a heart attack.

So, my research questions for the assignment are: What are the chances of experiencing chest pain in a heart attack? Is chest pain a common symptom of a heart attack? Can severe chest pain result in a heart attack?

Article [1] reports research conducted on patients who had severe and mild chest pain to find out whether severe chest pain leads to a heart attack. Article [2] is another research article, which describes testing of low-risk patients presenting to the emergency department with chest pain.

According to the research mentioned above, a heart attack does not necessarily involve chest pain. While some heart attacks do feature classic symptoms like chest and arm pain, the idea that they all do is false.

[1] Body, R., Lewis, P.S., Carley, S., Burrows, G., Haves, B. and Cook, G., 2016. Chest pain: if it hurts a lot, is heart attack more likely?. European Journal of Emergency Medicine, 23(2), pp.89-94.

[2] Amsterdam, E.A., Kirk, J.D., Bluemke, D.A., Diercks, D., Farkouh, M.E., Garvey, J.L., Kontos, M.C., McCord, J., Miller, T.D., Morise, A. and Newby, L.K., 2010. Testing of low-risk patients presenting to the emergency department with chest pain: a scientific statement from the American Heart Association. Circulation, 122(17), pp.1756-1776.

#heart attack #chest pain #literature review #datasets #coding #data analysis

K-means Cluster Analysis for Heart attack Analysis

A k-means cluster analysis was conducted to identify underlying subgroups of individuals based on their similarity of responses on 13 variables that represent characteristics that could have an impact on maximum heart rate achieved.

First, all Python libraries required for the cluster analysis need to be loaded. These include the KMeans function from the sklearn.cluster library. Following are the libraries that are necessary to import:

Next, the required dataset is loaded. Here, I have uploaded the dataset available at Kaggle.com in the csv format using the read_csv() function. The dataset contains 14 attributes. These are age, sex, chest pain type (4 values), resting blood pressure, serum cholesterol in mg/dl, fasting blood sugar > 120 mg/dl, resting electrocardiographic results (values 0,1,2), maximum heart rate achieved, exercise induced angina, old peak = ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment, number of major vessels (0-3) colored by fluoroscopy, thal: 0 = normal; 1 = fixed defect; 2 = reversible defect and output: 0= less chance of heart attack 1= more chance of heart attack.

Out of the 14 variables, 13 were used in the cluster analysis. The variable for maximum heart rate achieved is used for validation.

Before clustering, we need to standardize the variables measured on different scales. This is done so that the solution is not driven by variables measured on larger scales. The describe function is used to see statistical details of a pandas dataframe. From the above data, we can see that our clustering variables are not standardized.

Here, we standardized the clustering variables to have a mean of 0 and a standard deviation of 1. The astype('float64') call ensures that all predictors have a numeric format.

clustervar=cluster.copy()

clustervar['sex']=preprocessing.scale(clustervar['sex'].astype('float64'))

clustervar['age']=preprocessing.scale(clustervar['age'].astype('float64'))

clustervar['resting blood pressure']=preprocessing.scale(clustervar['resting blood pressure'].astype('float64'))

clustervar['cholestrol']=preprocessing.scale(clustervar['cholestrol'].astype('float64'))

clustervar['old peak']=preprocessing.scale(clustervar['old peak'].astype('float64'))

clustervar['chest pain']=preprocessing.scale(clustervar['chest pain'].astype('float64'))

clustervar['fasting blood sugar']=preprocessing.scale(clustervar['fasting blood sugar'].astype('float64'))

clustervar['resting ecg']=preprocessing.scale(clustervar['resting ecg'].astype('float64'))

clustervar['excercise included']=preprocessing.scale(clustervar['excercise included'].astype('float64'))

clustervar['slp']=preprocessing.scale(clustervar['slp'].astype('float64'))

clustervar['caa']=preprocessing.scale(clustervar['caa'].astype('float64'))

clustervar['THALL']=preprocessing.scale(clustervar['THALL'].astype('float64'))

clustervar['output']=preprocessing.scale(clustervar['output'].astype('float64'))
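As a quick check of what preprocessing.scale does, the sketch below standardizes a tiny made-up frame (not heart.csv) in a loop and verifies that every column ends up with mean 0 and standard deviation 1. The loop is only a shorter alternative to the per-column lines above, not the code used in this analysis.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Tiny made-up frame (NOT heart.csv) with variables on different scales.
cluster = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44],
    "cholestrol": [233, 250, 204, 236, 354, 263],
    "sex": [1, 1, 0, 1, 0, 1],
})

# Standardize every column in a loop instead of one line per column.
clustervar = cluster.copy()
for col in clustervar.columns:
    clustervar[col] = preprocessing.scale(clustervar[col].astype("float64"))

# After scaling, each column has mean ~0 and population standard deviation ~1.
print(clustervar.mean().round(6))
print(clustervar.std(ddof=0).round(6))
```

Note that `ddof=0` is needed in the check because preprocessing.scale divides by the population standard deviation, while pandas defaults to the sample version.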

Now, the dataset is divided into a training set and a test set. This can be achieved by using the train_test_split() function. The size ratio is set to 70% for the training sample and 30% for the test sample. The random_state option specifies a random number seed (here I have selected 123) to ensure that the data are randomly split the same way if the code is run again.

# split data into train and test sets

clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)

The cdist function from the scipy.spatial.distance library is used to calculate the average distance of the observations from the cluster centroids. We ran the analysis for 1 to 9 clusters. The meandist list is used to store the average distance values that we calculate for the 1 to 9 cluster solutions. The model is then initialized by calling the KMeans function from the sklearn.cluster library. The function takes n_clusters, which indicates the number of clusters, as an argument. Here, we set n_clusters to k to tell Python to run the cluster analysis for 1 through 9 clusters. The model is then trained using the fit function, which takes the training features as its argument. The code following meandist.append computes the average distance between each observation and its nearest cluster centroid.

# k-means cluster analysis for 1-9 clusters

from scipy.spatial.distance import cdist

clusters=range(1,10)

meandist=[]

for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign=model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])

Next, we plot the elbow curve using the matplotlib plot function. The plot shows the decrease in the average minimum distance of the observations from the cluster centroids for each of the cluster solutions.

From the above graph, we can see that the average distance decreases as the number of clusters increases. There appear to be bends at two, three, seven, and eight clusters. These bends indicate that the average distance is leveling off, such that adding more clusters doesn't decrease the average distance as much. Notice that these bends are not very clear. This means that the elbow curve was inconclusive.

So we'll rerun the cluster analysis, this time asking for 3 clusters. To do so, simply initialize the model by calling the KMeans function and set n_clusters=3.
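A sketch of how such an elbow curve could be drawn. The distance values are placeholders, not the ones from this analysis, and the axis labels and title are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical average distances for the 1 to 9 cluster solutions
# (placeholders, NOT the values from this analysis).
clusters = range(1, 10)
meandist = [3.4, 3.1, 2.9, 2.8, 2.7, 2.6, 2.5, 2.45, 2.4]

fig, ax = plt.subplots()
ax.plot(clusters, meandist)
ax.set_xlabel("Number of clusters")
ax.set_ylabel("Average distance")
ax.set_title("Selecting k with the Elbow Method")
```

In an interactive session the backend line can be dropped and `plt.show()` called to display the figure.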

# Interpret 3 cluster solution

model3=KMeans(n_clusters=3)

model3.fit(clus_train)

clusassign=model3.predict(clus_train)

# plot clusters

Now, we used canonical discriminant analysis, a data reduction technique that creates a smaller number of variables that are linear combinations of the clustering variables. To conduct it, we used the PCA function from the sklearn.decomposition library. (Strictly speaking, PCA computes principal components rather than canonical variables, but it serves the same data reduction purpose here.)

PCA(2) asks Python to return the first two canonical variables. Then we create a matrix called plot_columns that will include the two canonical variables estimated by the canonical discriminant analysis. PCA_2.fit asks Python to fit the analysis that we specified with the PCA command, and _transform applies it to the clus_train data set to calculate the canonical variables. We then plot the two canonical variables by the cluster assignment values from the 3 cluster solution in a scatter plot using matplotlib.

From the above graph we can see that none of the clusters overlapped very much with the other clusters. This indicates less correlation among the observations. The observations in the green and yellow clusters had greater spread. The observations in the green cluster were also spread out more than those in the other clusters, showing high within-cluster variance. The results of this plot suggest that the best cluster solution may have fewer than 3 clusters, so it will be especially important to also evaluate the cluster solutions with fewer than 3 clusters.

The means on the clustering variables showed that, compared to the other clusters, individuals in cluster 0 had the highest likelihood of having a heart attack. They were less likely to get a blood disorder called thalassemia than cluster 1, and had the highest slope of the peak exercise ST segment, less exercise induced angina, moderate old peak and greater chances of having chest pain. On the other hand, individuals in cluster 1 had the least chance of having a heart attack. Compared to individuals in the other clusters, they had lower chances of having chest pain, greater resting blood pressure, lower resting ecg, the highest exercise induced angina, the highest old peak, the lowest slope of the peak exercise ST segment, the most major vessels colored by fluoroscopy, and the highest chances of getting a blood disorder called thalassemia. Individuals in cluster 2 appeared to have moderate chances of having a heart attack compared to the other two clusters. They had higher levels of resting ecg, the lowest levels of old peak, the fewest major vessels colored by fluoroscopy, and the lowest chances of getting thalassemia.
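The reduction step can be sketched as follows. The data are random numbers standing in for the standardized training set, and `n_init` and `random_state` are added here only to make the example reproducible; they are not taken from the original code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(123)
# Synthetic standardized data: 60 observations, 5 clustering variables.
clus_train = rng.normal(size=(60, 5))

# Three-cluster solution, as in the text.
model3 = KMeans(n_clusters=3, n_init=10, random_state=123)
model3.fit(clus_train)

# Reduce to two components and project the training data onto them.
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
print(plot_columns.shape)
print(pca_2.explained_variance_ratio_)
```

The two columns of `plot_columns` can then be passed to a matplotlib scatter plot, with `model3.labels_` as the color argument.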
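Cluster means like the ones described above can be obtained by attaching the cluster label to each training row and grouping by it. This is a simplified sketch on random data; the `assign` shortcut and the three column names are assumptions for the example, not the merge used in the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(123)
# Random stand-in for the standardized training data (NOT heart.csv).
clus_train = pd.DataFrame(rng.normal(size=(60, 3)),
                          columns=["age", "old peak", "slp"])

model3 = KMeans(n_clusters=3, n_init=10, random_state=123).fit(clus_train)

# Attach the cluster label to every training row, then average per cluster.
merged_train = clus_train.assign(cluster=model3.labels_)
clustergrp = merged_train.groupby("cluster").mean()
print(clustergrp)
```

Each row of `clustergrp` is one cluster, so the clusters can be compared variable by variable.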

Finally, let's see how the clusters differ on maximum heart rate achieved. We'll use analysis of variance to test whether there are significant differences between clusters on the quantitative max_heart_rate variable. To do this, we have to import the statsmodels.formula.api and the statsmodels.stats.multicomp libraries. We use the ols function to test the analysis of variance.

# validate clusters in training data by examining cluster differences in maximum heart rate using ANOVA

# first have to merge maximum heart rate with clustering variables and cluster assignment data

mhr_data=data['max_heart_rate']

# split maximum heart rate data into train and test sets

mhr_train, mhr_test = train_test_split(mhr_data, test_size=.3, random_state=123)

mhr_train1=pd.DataFrame(mhr_train)

mhr_train1.reset_index(level=0, inplace=True)

merged_train_all=pd.merge(mhr_train1, merged_train, on='index')

sub1 = merged_train_all[['max_heart_rate', 'cluster']].dropna()

import statsmodels.formula.api as smf

import statsmodels.stats.multicomp as multi

mhrmod = smf.ols(formula='max_heart_rate ~ C(cluster)', data=sub1).fit()

print (mhrmod.summary())

The analysis of variance summary table indicates that the clusters differed significantly on maximum heart rate achieved.

When we examine the means, we find that individuals in cluster 0 had achieved the highest maximum heart rate (mean = 158.276923, sd = 17.619687) and individuals in cluster 1 had achieved the lowest maximum heart rate (mean = 138.205479, sd = 24.218885).

The Tukey post hoc comparisons showed significant differences between clusters on maximum heart rate achieved.
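The post hoc step can be sketched with the MultiComparison class from statsmodels.stats.multicomp. The data below are synthetic, loosely shaped like three clusters with different mean heart rates, so the printed table is illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.stats.multicomp as multi

# Synthetic stand-in data (NOT the real cluster assignments).
rng = np.random.default_rng(2)
sub1 = pd.DataFrame({
    "cluster": np.repeat([0, 1, 2], 25),
    "max_heart_rate": np.concatenate([
        rng.normal(158, 18, 25),
        rng.normal(138, 24, 25),
        rng.normal(150, 20, 25),
    ]),
})

# Tukey HSD compares every pair of clusters while controlling the
# family-wise error rate.
mc1 = multi.MultiComparison(sub1["max_heart_rate"], sub1["cluster"])
res1 = mc1.tukeyhsd()
print(res1.summary())
```

The `reject` column of the summary shows which pairs of clusters differ significantly.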

#python #project #KMEANS #cluster #analysis #data analysis #unsupervised #machine learning #heart attack

First, all Python libraries required to create a lasso regression model need to be loaded. These include the train_test_split function from the sklearn.model_selection library and the LassoLarsCV function from the sklearn.linear_model library. Following are the libraries that are necessary to import:

column_names = ['age','sex','chest pain','resting blood pressure','cholestrol','fasting blood sugar','resting ecg','max_heart_rate','excercise included','old peak','slp','caa','THALL','output']

data= pd.read_csv("heart.csv",header=None,names=column_names)

data = data.iloc[1: , :] # removes the first row of dataframe

Now, we divide the columns in the dataset into dependent and independent variables. The max_heart_rate variable is selected as the target variable. The dataset contains 13 feature variables and 1 quantitative target variable.

#split dataset in features and target variable
feature_cols = ['age','sex','chest pain','resting blood pressure','cholestrol','fasting blood sugar','resting ecg','excercise included','old peak','slp','caa','THALL','output']
pred = data[feature_cols] # Features
tar = data.max_heart_rate # Target variable

To achieve a fair penalty term, we need to standardize all the predictors to have a mean equal to zero and a standard deviation equal to one, including the binary predictors. The astype('float64') call ensures that all predictors have a numeric format.

Now, the dataset is split into a training data set consisting of 70% of the total observations and a test data set consisting of the other 30% of the observations, by using the train_test_split function from the sklearn.model_selection library.

# split data into train and test sets

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, tar, test_size=.3, random_state=123)

Now, the lasso regression model is initialized by calling the LassoLarsCV function. Here, cv is set to 10 so that 10-fold cross-validation on the training data set is used to choose the final statistical model, and precompute is set to False so that Python does not use a precomputed matrix. The model is then trained using the fit function, which takes the training features and training target variable as arguments.

# specify the lasso regression model

model=LassoLarsCV(cv=10, precompute=False).fit(pred_train,tar_train)

By using the following code, a dictionary is created containing the variable labels and the associated regression coefficients.
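One way to build such a dictionary is sketched below. The predictors and target are synthetic (not heart.csv), with a negative effect of `age` and a positive effect of `slp` built in on purpose, so the coefficients are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoLarsCV

rng = np.random.default_rng(123)
# Synthetic standardized predictors and target (NOT heart.csv).
predictors = pd.DataFrame(rng.normal(size=(100, 4)),
                          columns=["age", "slp", "caa", "sex"])
target = (150 - 6 * predictors["age"] + 5 * predictors["slp"]
          + rng.normal(scale=2, size=100))

model = LassoLarsCV(cv=10, precompute=False).fit(predictors, target)

# Pair each predictor name with its estimated coefficient.
coefs = dict(zip(predictors.columns, model.coef_))
print(coefs)
```

Predictors whose coefficient is exactly 0 in this dictionary are the ones the lasso penalty removed from the model.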

We can see that predictors having regression coefficient as 0 are caa(variable for number of major vessels (0-3) colored by fluoroscopy), fasting blood sugar, 'resting ecg' and sex. This means that coefficients for these variables have shrunk to zero after applying the LASSO regression penalty, and are removed from the model. Therefore, only 9 variables were selected in the final model.

Following are the most strongly associated predictors of maximum heart rate achieved:

'slp': 5.597880552188525

'age': -6.338731102729565

slp is positively associated with maximum heart rate achieved and age is negatively associated with maximum heart rate achieved.

Following is the plot of change in the regression coefficient by values of the penalty parameter at each step of the selection process.

# plot coefficient progression

m_log_alphas = -np.log10(model.alphas_)

ax = plt.gca()

plt.plot(m_log_alphas, model.coef_path_.T)

plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',

label='alpha CV')

plt.ylabel('Regression Coefficients')

plt.xlabel('-log(alpha)')

plt.title('Regression Coefficients Progression for Lasso Paths')

From the graph we can infer that slp, the light blue line, had the largest regression coefficient and was entered into the model first. At step two, age, the dark blue line, and at step three, output, the light green line, were entered into the model.

Next, we plot the graph for change in the mean square error for the change in the penalty parameter alpha at each step in the selection process. The alpha values through the model selection process for each cross validation fold is plotted on the horizontal axis, and the mean square error for each cross validation fold is plotted on the vertical axis.

# plot mean square error for each fold

m_log_alphascv = -np.log10(model.cv_alphas_)

plt.figure()

plt.plot(m_log_alphascv, model.mse_path_, ':')

plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k',

label='Average across the folds', linewidth=2)

plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',

label='alpha CV')

plt.legend()

plt.xlabel('-log(alpha)')

plt.ylabel('Mean squared error')

plt.title('Mean squared error on each fold')

From the above graph we can infer that there is variability across the individual cross-validation folds. For each fold, the pattern of change in the mean squared error as variables are added to the model differs. By observing the black line depicting the average across all folds, we can infer that it initially decreases rapidly and then levels off to a point at which adding more predictors doesn't lead to much reduction in the mean squared error.

To compute the mean squared error, we need to import the mean_squared_error function from the sklearn.metrics library. Judging by the R-square values below, the model is somewhat less accurate in predicting the maximum heart rate achieved in the test data than in the training data.
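The black line in the plot is simply the mean of mse_path_ across the folds, and the dashed vertical line sits at the alpha where that mean is smallest. A small sketch on synthetic data (make_regression, not the heart data) shows this relationship:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsCV

# Synthetic stand-in for the heart data (assumption)
X, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                       noise=10.0, random_state=0)

model = LassoLarsCV(cv=10, precompute=False).fit(X, y)

# One row per candidate alpha, one column per cross-validation fold
print(model.mse_path_.shape)

# Average MSE across the folds (the black line in the plot)
mean_mse = model.mse_path_.mean(axis=-1)

# Alpha with the lowest average MSE; it should coincide with model.alpha_
best = int(np.argmin(mean_mse))
print('alpha with lowest average MSE:', model.cv_alphas_[best])
print('alpha chosen by the model    :', model.alpha_)
```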

The R square value for training data is 0.38 indicating that the model explained 38% of the variance in maximum heart rate achieved for the training set. And, the R square value for test data is 0.34 indicating that the model explained 34% of the variance in maximum heart rate achieved for the test set.
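The mean squared error and the R-square value are two views of the same thing: R-square is 1 minus the MSE divided by the variance of the target. The sketch below checks this on synthetic data (make_regression, not the heart data); the helper name mse_and_r2 is made up for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart data (assumption)
X, y = make_regression(n_samples=300, n_features=6, n_informative=3,
                       noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

model = LassoLarsCV(cv=10, precompute=False).fit(X_train, y_train)

def mse_and_r2(features, target):
    """Return the MSE and the R-square derived from it (hypothetical helper)."""
    mse = mean_squared_error(target, model.predict(features))
    r2 = 1.0 - mse / np.var(target)  # R-square = 1 - MSE / variance of target
    return mse, r2

train_mse, train_r2 = mse_and_r2(X_train, y_train)
test_mse, test_r2 = mse_and_r2(X_test, y_test)

print('training data MSE, R-square:', train_mse, train_r2)
print('test data MSE, R-square    :', test_mse, test_r2)
```

The R-square values computed this way should match what model.score() returns for the same data.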

The Whole Code:

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error

column_names = ['age','sex','chest pain','resting blood pressure','cholestrol','fasting blood sugar','resting ecg','max_heart_rate','excercise included','old peak','slp','caa','THALL','output']
data= pd.read_csv("heart.csv",header=None,names=column_names)
data = data.iloc[1: , :] # removes the first row of dataframe (In this case, )

#split dataset in features and target variable
feature_cols = ['age','sex','chest pain','resting blood pressure','cholestrol','fasting blood sugar','resting ecg','excercise included','old peak','slp','caa','THALL','output']
pred = data[feature_cols] # Features
tar = data.max_heart_rate # Target variable
print(data.dtypes)

# standardize predictors to have mean=0 and sd=1
predictors=pred.copy()
for col in feature_cols:
    predictors[col]=preprocessing.scale(predictors[col].astype('float64'))

# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, tar, test_size=.3, random_state=123)

# specify the lasso regression model
model=LassoLarsCV(cv=10, precompute=False).fit(pred_train,tar_train)

# print variable names and regression coefficients
dict(zip(predictors.columns, model.coef_))

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')

# MSE from training and test data
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print ('training data MSE')
print(train_error)
print ('test data MSE')
print(test_error)

# R-square from training and test data
rsquared_train=model.score(pred_train,tar_train)
rsquared_test=model.score(pred_test,tar_test)
print ('training data R-square')
print(rsquared_train)
print ('test data R-square')
print(rsquared_test)

#python #machine learning #lasso #regression #heart attack #analysis #data analysis

Random Forest Classifier for Heart Attack Analysis

Primarily, all the Python libraries required to create a random forest classifier need to be loaded, including features from the sklearn library (the imports are listed in the whole code at the end of this post). The dataset is then loaded with the read_csv() function:

column_names = ['age','sex','chest pain','resting blood pressure','cholestrol','fasting blood sugar','resting ecg','max heart rate','excercise included','old peak','slp','caa','THALL','output']

data= pd.read_csv("heart.csv",header=None,names=column_names)

data = data.iloc[1: , :] # removes the first row of dataframe

Now, we divide the columns in the dataset into dependent and independent variables. The output variable is selected as the target variable for the heart disease prediction system. The dataset contains 13 feature variables and 1 target variable.

feature_cols = ['age','sex','chest pain','resting blood pressure','cholestrol','fasting blood sugar','resting ecg','max heart rate','excercise included','old peak','slp','caa','THALL']

pred = data[feature_cols] # Features

tar = data.output # Target variable

Now, the dataset is divided into a training set and a test set. This can be achieved by using the train_test_split() function. The size ratio is set as 60% for the training sample and 40% for the test sample.

pred_train, pred_test, tar_train, tar_test = train_test_split(pred, tar, test_size=0.4, random_state=1)

Using the shape function, we observe that the training sample has 181 observations (nearly 60% of the original sample) and 13 explanatory variables whereas the test sample contains 122 observations(nearly 40 % of the original sample) and 13 explanatory variables.
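The 181/122 split can be reproduced without the data file. The sketch below assumes the data set has 303 rows (181 + 122) and uses an all-zero placeholder array, since only the shapes matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the assumed size of the data set:
# 303 observations and 13 explanatory variables
pred = np.zeros((303, 13))
tar = np.zeros(303)

pred_train, pred_test, tar_train, tar_test = train_test_split(
    pred, tar, test_size=0.4, random_state=1)

# The test size is rounded up (40% of 303 is 121.2), the rest goes to training
print(pred_train.shape)
print(pred_test.shape)
```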

Next, we need to import RandomForestClassifier from sklearn.ensemble. Here, the random forest classifier is initialized and number of trees is set as 25. The model is then trained using the fit function which takes training features and training target variables as arguments.

#Import Random Forest Model

from sklearn.ensemble import RandomForestClassifier

#Create a random forest classifier

claf=RandomForestClassifier(n_estimators=25)

claf=claf.fit(pred_train,tar_train)

res=claf.predict(pred_test)

To check the accuracy of the model, we use the accuracy_score function of the sklearn.metrics library. Our model has a classification rate of 80.99%. Therefore, we can say that our model has good accuracy: about 81% of the individuals in the test sample were classified correctly with respect to their chances of having a heart attack.

To find out the correct and incorrect classifications of the random forest, we use the confusion_matrix function. Our model predicted 46 true negatives and 52 true positives for having a heart attack. The model also predicted 13 false negatives and 10 false positives.
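The classification rate follows directly from the four counts of the confusion matrix. Here is a short worked example using the numbers reported above; the sensitivity and specificity lines are an extra I added for illustration and are not part of the original analysis.

```python
# Counts reported by the confusion matrix above
tn, fp, fn, tp = 46, 10, 13, 52

total = tn + fp + fn + tp

# Share of all test observations that were classified correctly
accuracy = (tn + tp) / total

# Share of actual positives that were found (extra, for illustration)
sensitivity = tp / (tp + fn)

# Share of actual negatives that were found (extra, for illustration)
specificity = tn / (tn + fp)

print('observations:', total)
print('accuracy    :', round(accuracy, 4))
print('sensitivity :', round(sensitivity, 4))
print('specificity :', round(specificity, 4))
```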

To generate importance scores, we initialize the extra tree classifier, and then fit a model. Finally, we ask Python to print the feature importance scores calculated from the forest of trees that we've grown. The variables are listed in the order they've been named earlier in the code.

From this data we can infer that the variables with the highest importance scores are resting electrocardiographic results (0.133) and chest pain (0.13). The variable with the lowest importance score, at 0.01, is fasting blood sugar.

Now, we use code that builds random forests with different numbers of trees, from 1 to 25, computes the correct classification rate (accuracy score) on the test sample for each, and stores the results in an array.
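Because the importance scores are printed in column order without labels, it is easy to mix them up. A possible way to pair them with the variable names and sort them is sketched below on synthetic data (make_classification, with made-up labels v0 to v4); n_estimators and random_state are my own choices, not settings from the analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the heart data (assumption)
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
names = ['v0', 'v1', 'v2', 'v3', 'v4']  # hypothetical labels

# Fit an Extra Trees model, as in the post
ET_model = ExtraTreesClassifier(n_estimators=100, random_state=0)
ET_model.fit(X, y)

# Pair each label with its importance score and sort from high to low
ranked = sorted(zip(names, ET_model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(name, round(score, 3))
```

The scores sum to 1, so each one can be read as a share of the total importance.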

trees=range(25)

accuracy=np.zeros(25)

for idx in range(len(trees)):
    claf=RandomForestClassifier(n_estimators=idx + 1)
    claf=claf.fit(pred_train,tar_train)
    res=claf.predict(pred_test)
    accuracy[idx]=sklearn.metrics.accuracy_score(tar_test, res)

From the graph we can infer that a single tree already has about 80% accuracy, and the accuracy climbs to only about 82% as successive trees are grown. So we can say that it may be perfectly appropriate to interpret a single decision tree for this data.

The Whole Code:

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

column_names = ['age','sex','chest pain','resting blood pressure','cholestrol','fasting blood sugar','resting ecg','max heart rate','excercise included','old peak','slp','caa','THALL','output']
data= pd.read_csv("heart.csv",header=None,names=column_names)
data = data.iloc[1: , :] # removes the first row of dataframe (In this case, )
data

#split dataset in features and target variable
feature_cols = ['age','sex','chest pain','resting blood pressure','cholestrol','fasting blood sugar','resting ecg','max heart rate','excercise included','old peak','slp','caa','THALL']
pred = data[feature_cols] # Features
tar = data.output # Target variable

pred_train, pred_test, tar_train, tar_test = train_test_split(pred, tar, test_size=0.4, random_state=1) # 60% training and 40% test
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape

#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

#Create a random forest classifier
claf=RandomForestClassifier(n_estimators=25)
claf=claf.fit(pred_train,tar_train)
res=claf.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test,res)
sklearn.metrics.accuracy_score(tar_test, res)

# fit an Extra Trees model to the data
ET_model = ExtraTreesClassifier()
ET_model.fit(pred_train,tar_train)

# display the relative importance of each attribute
print(ET_model.feature_importances_)

trees=range(25)
accuracy=np.zeros(25)
for idx in range(len(trees)):
    claf=RandomForestClassifier(n_estimators=idx + 1)
    claf=claf.fit(pred_train,tar_train)
    res=claf.predict(pred_test)
    accuracy[idx]=sklearn.metrics.accuracy_score(tar_test, res)

plt.cla()
plt.plot(trees, accuracy)

#python #randomforest #classifier #data mining #machine learning #data analysis #heart attack

Classification Decision Tree for Heart Attack Analysis

Primarily, the required dataset is loaded. Here, I have uploaded the dataset available at Kaggle.com in the csv format.

All the Python libraries required to create a classification decision tree need to be loaded as well (the imports are listed in the whole code at the end of this post).

The dataset is loaded with the read_csv() function:

column_names = ['age','sex','chest pain','resting blood pressure','cholestrol','fasting blood sugar','resting ecg','max heart rate','excercise included','old peak','slp','caa','THALL','output']

data= pd.read_csv("heart.csv",header=None,names=column_names)

data = data.iloc[1: , :] # removes the first row of dataframe

feature_cols = ['age','sex','chest pain','resting blood pressure','cholestrol','fasting blood sugar','resting ecg','max heart rate','excercise included','old peak','slp','caa','THALL']

pred = data[feature_cols] # Features

tar = data.output # Target variable

pred_train, pred_test, tar_train, tar_test = train_test_split(pred, tar, test_size=0.4, random_state=1)

Using the shape function, we observe that the training sample has 181 observations (nearly 60% of the original sample) and 13 explanatory variables, whereas the test sample contains 122 observations (nearly 40% of the original sample) and 13 explanatory variables.

Now, we need to create an object claf_mod to initialize the decision tree classifier. The model is then trained using the fit function, which takes the training features and the training target variable as arguments.

# To create an object of Decision Tree classifier

claf_mod = DecisionTreeClassifier()

# Train the model

claf_mod = claf_mod.fit(pred_train,tar_train)

To check the accuracy of the model, we use the accuracy_score function of the sklearn.metrics library. Our model has a classification rate of 58.19%. This is only modestly better than chance, so the unpruned tree leaves room for improvement in finding out whether a person has a heart attack.

To find out the correct and incorrect classifications of the decision tree, we use the confusion_matrix function. Our model predicted 18 true negatives and 53 true positives for having a heart attack. The model also predicted 31 false negatives and 20 false positives.

To display the decision tree we use export_graphviz function. The resultant graph is unpruned.

dot_data = StringIO()

export_graphviz(claf_mod, out_file=dot_data,

filled=True, rounded=True,

special_characters=True,class_names=['0','1'])

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

graph.write_png('heart attack.png')

Image(graph.create_png())

To get a pruned tree, we changed the criterion to entropy and initialized the object again. The maximum depth of the tree is set to 3 to avoid overfitting.

# Create Decision Tree classifier object

claf_mod = DecisionTreeClassifier(criterion="entropy", max_depth=3)

# Train Decision Tree Classifier

claf_mod = claf_mod.fit(pred_train,tar_train)

#Predict the response for test dataset

tar_pred = claf_mod.predict(pred_test)

By optimizing the performance, the classification rate of the model increased to 72.13%.
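The effect of limiting the depth can be tried out on any data set. The sketch below uses synthetic data from make_classification (not the heart data), so the accuracies it prints are not the ones reported in this post; the fixed random_state values are my own additions to make the run repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data (assumption)
X, y = make_classification(n_samples=303, n_features=13, n_informative=5,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)

# Unpruned tree: grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Pruned tree: entropy criterion and at most 3 levels of splits
small_tree = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                                    random_state=1).fit(X_train, y_train)

full_acc = accuracy_score(y_test, full_tree.predict(X_test))
small_acc = accuracy_score(y_test, small_tree.predict(X_test))

print('unpruned: depth', full_tree.get_depth(), 'accuracy', full_acc)
print('pruned  : depth', small_tree.get_depth(), 'accuracy', small_acc)
```

Whether the smaller tree does better on the test sample depends on the data; on the heart data it did, as reported above.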

By passing the object again into the export_graphviz function, we obtain the pruned graph.

From the above graph, we can infer that :

1) individuals having cholesterol less than 338 mg/dl, age less than or equal to 70.5 years, and whose previous peak was less than or equal to 1.55: 84 of them are more likely to have a heart attack whereas 42 of them are less likely to have a heart attack.

2) individuals having cholesterol less than 338 mg/dl, age less than or equal to 70.5 years, and whose previous peak was more than 1.55: 6 of them are less likely to have a heart attack whereas 38 of them are more likely to have a heart attack.

3) individuals having cholesterol less than 338 mg/dl and age less than or equal to 76.5 years: are less likely to have a heart attack

4) individuals having cholesterol less than 338 mg/dl and age more than 76.5 years: are more likely to have a heart attack

5) individuals having cholesterol more than 338 mg/dl : are less likely to have a heart attack
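Rules like the ones listed above can also be printed as plain text, which avoids the graphviz and pydotplus dependencies. This is an alternative I am sketching here, not what was used for the graph in this post: it relies on the export_text function from sklearn.tree and runs on synthetic data with made-up labels f0 to f3.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the heart data (assumption)
X, y = make_classification(n_samples=303, n_features=4, n_informative=3,
                           n_redundant=0, random_state=1)
names = ['f0', 'f1', 'f2', 'f3']  # hypothetical labels

# Same settings as the pruned tree in the post
claf_mod = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                                  random_state=1)
claf_mod = claf_mod.fit(X, y)

# One line per split or leaf, indented by depth
rules = export_text(claf_mod, feature_names=names)
print(rules)
```

Each path from the top of the printout down to a "class:" line corresponds to one rule of the kind listed above.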

The Whole Code:

from google.colab import files
uploaded = files.upload()

import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn.metrics import classification_report
import sklearn.metrics #Import scikit-learn metrics module for accuracy calculation
from sklearn.tree import export_graphviz
from io import StringIO # sklearn.externals.six was removed from recent scikit-learn releases
from IPython.display import Image
import pydotplus

column_names = ['age','sex','chest pain','resting blood pressure','cholestrol','fasting blood sugar','resting ecg','max heart rate','excercise included','old peak','slp','caa','THALL','output']
data= pd.read_csv("heart.csv",header=None,names=column_names)
data = data.iloc[1: , :] # removes the first row of dataframe (In this case, )

#split dataset in features and target variable
feature_cols = ['age','sex','chest pain','resting blood pressure','cholestrol','fasting blood sugar','resting ecg','max heart rate','excercise included','old peak','slp','caa','THALL']
pred = data[feature_cols] # Features
tar = data.output # Target variable

pred_train, pred_test, tar_train, tar_test = train_test_split(pred, tar, test_size=0.4, random_state=1) # 60% training and 40% test
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape

# To create an object of Decision Tree classifier
claf_mod = DecisionTreeClassifier()
# Train the model
claf_mod = claf_mod.fit(pred_train,tar_train)
#Predict the response for test dataset
tar_pred = claf_mod.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test,tar_pred)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",sklearn.metrics.accuracy_score(tar_test, tar_pred))

dot_data = StringIO()
export_graphviz(claf_mod, out_file=dot_data, filled=True, rounded=True, special_characters=True,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('heart attack.png')
Image(graph.create_png())

# Create Decision Tree classifier object
claf_mod = DecisionTreeClassifier(criterion="entropy", max_depth=3)
# Train Decision Tree Classifier
claf_mod = claf_mod.fit(pred_train,tar_train)
#Predict the response for test dataset
tar_pred = claf_mod.predict(pred_test)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",sklearn.metrics.accuracy_score(tar_test, tar_pred))

dot_data = StringIO()
export_graphviz(claf_mod, out_file=dot_data, filled=True, rounded=True, special_characters=True, class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('improved heart attack.png')
Image(graph.create_png())

#python #decision tree #heart attack #analysis #project #machine learning #data analysis
