Primarily, all python libraries need to be loaded that are required in creation for a lasso regression model. These also include the train_test_split function from the sklearn.cross_validation library, and the LassoLarsCV function from the sklearn.linear_model library. Following are the libraries that are necessary to import:
Next, the required dataset is loaded. Here, I have uploaded the dataset available at Kaggle.com in the csv format using the read_csv() function. The dataset contains 14 attributes. These are age, sex, chest pain type (4 values), resting blood pressure, serum cholesterol in mg/dl, fasting blood sugar > 120 mg/dl, resting electrocardiographic results (values 0,1,2), maximum heart rate achieved, exercise induced angina, old peak = ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment, number of major vessels (0-3) colored by fluoroscopy, thal: 0 = normal; 1 = fixed defect; 2 = reversible defect and output: 0= less chance of heart attack 1= more chance of heart attack
column_names = ['age','sex','chest pain','resting blood pressure','cholestrol','fasting blood sugar','resting ecg','max_heart_rate','excercise included','old peak','slp','caa','THALL','output']
data= pd.read_csv("heart.csv",header=None,names=column_names)
data = data.iloc[1: , :] # removes the first row of dataframe
Now, we divide the columns in the dataset as dependent or independent variables. The output variable is selected as max_heart_rate variable for heart disease prediction system. The dataset contains 13 feature variables and 1 quantitative target variable.
#split dataset in features and target variable
feature_cols = ['age','sex','chest pain','resting blood pressure','cholestrol','fasting blood sugar','resting ecg','excercise included','old peak','slp','caa','THALL','output']
pred = data[feature_cols] # Features
tar = data.max_heart_rate # Target variable
To achieve a fair penalty term, we need to standardize all the predictors to have a mean equal to zero and a standard deviation equal to one, including the binary predictors. The as type float 64 code ensures that all predictors will have a numeric format.
Now, the dataset is split into a training data set consisting of 70% of the total observations in the data set. And a test data set consisting of the other 30% of the observations by using the train test split function from the sklearn cross validation library
# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, tar,
test_size=.3, random_state=123)
Now, the Lasso regression model initialized by calling the LassoLarsCV function. Here, cv is set as 10 so that ten random folds from the training data set is chosen in the final statistical model and precompute is set as false so that Python does not apply a precomputed matrix. The model is then trained using the fit function which takes training features and training target variables as arguments.
# specify the lasso regression model
model=LassoLarsCV(cv=10, precompute=False).fit(pred_train,tar_train)
By using the following code, a dictionary is created containing the variable labels and the associated regression coefficients.
We can see that predictors having regression coefficient as 0 are caa(variable for number of major vessels (0-3) colored by fluoroscopy), fasting blood sugar, 'resting ecg' and sex. This means that coefficients for these variables have shrunk to zero after applying the LASSO regression penalty, and are removed from the model. Therefore, only 9 variables were selected in the final model.
Following are the most strongly associated predictors of maximum heart rate achieved:
'age': -6.338731102729565
slp is positively associated with maximum heart rate achieved and age is negatively associated with maximum heart rate achieved.
Following is the plot of change in the regression coefficient by values of the penalty parameter at each step of the selection process.
# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
From the graph we can infer that slp, the light blue line, had the largest regression coefficient and was entered into the model first. At second two, age, the dark blue line and at step three, output, the light green line were entered into the model.
Next, we plot the graph for change in the mean square error for the change in the penalty parameter alpha at each step in the selection process. The alpha values through the model selection process for each cross validation fold is plotted on the horizontal axis, and the mean square error for each cross validation fold is plotted on the vertical axis.
# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k',
label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
From the above graph we can infer that there is variability across the individual cross-validation. For each fold, the pattern for change in the mean square error as variables are added to the model differs. By observing the black line depicting average across all folds, we can infer that initially it decreases rapidly and then levels off to a point at which adding more predictors doesn't lead to much reduction in the mean square error.
To compute the mean square error, we need to import the mean squared error function from the sklearn metrics library. We can see that the model is less accurate is predicting the maximum heart rate achieved in the training data.
The R square value for training data is 0.38 indicating that the model explained 38% of the variance in maximum heart rate achieved for the training set. And, the R square value for test data is 0.34 indicating that the model explained 34% of the variance in maximum heart rate achieved for the test set.
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
column_names = ['age','sex','chest pain','resting blood pressure','cholestrol','fasting blood sugar','resting ecg','max_heart_rate','excercise included','old peak','slp','caa','THALL','output']
data= pd.read_csv("heart.csv",header=None,names=column_names)
data = data.iloc[1: , :] # removes the first row of dataframe (In this case, )
#split dataset in features and target variable
feature_cols = ['age','sex','chest pain','resting blood pressure','cholestrol','fasting blood sugar','resting ecg','excercise included','old peak','slp','caa','THALL','output']
pred = data[feature_cols] # Features
tar = data.max_heart_rate # Target variable
print(data.dtypes)
# standardize predictors to have mean=0 and sd=1
predictors=pred.copy()
from sklearn import preprocessing
predictors['age']=preprocessing.scale(predictors['age'].astype('float64'))
predictors['sex']=preprocessing.scale(predictors['sex'].astype('float64'))
predictors['chest pain']=preprocessing.scale(predictors['chest pain'].astype('float64'))
predictors['resting blood pressure']=preprocessing.scale(predictors['resting blood pressure'].astype('float64'))
predictors['cholestrol']=preprocessing.scale(predictors['cholestrol'].astype('float64'))
predictors['fasting blood sugar']=preprocessing.scale(predictors['fasting blood sugar'].astype('float64'))
predictors['resting ecg']=preprocessing.scale(predictors['resting ecg'].astype('float64'))
predictors['excercise included']=preprocessing.scale(predictors['excercise included'].astype('float64'))
predictors['old peak']=preprocessing.scale(predictors['old peak'].astype('float64'))
predictors['slp']=preprocessing.scale(predictors['slp'].astype('float64'))
predictors['caa']=preprocessing.scale(predictors['caa'].astype('float64'))
predictors['THALL']=preprocessing.scale(predictors['THALL'].astype('float64'))
predictors['output']=preprocessing.scale(predictors['output'].astype('float64'))
# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, tar,
test_size=.3, random_state=123)
# specify the lasso regression model
model=LassoLarsCV(cv=10, precompute=False).fit(pred_train,tar_train)
# print variable names and regression coefficients
dict(zip(predictors.columns, model.coef_))
# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k',
label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print ('training data MSE')
print(train_error)
print ('test data MSE')
print(test_error)
# R-square from training and test data
rsquared_train=model.score(pred_train,tar_train)
rsquared_test=model.score(pred_test,tar_test)
print ('training data R-square')
print(rsquared_train)
print ('test data R-square')
print(rsquared_test)