Assignment 3
In the third assignment I perform a multiple linear regression of the interest rate on the client's annual income, the term in which the client is supposed to repay the loan, the installment, the loan purpose, and the client's home ownership.
I centered the continuous variables and recoded the categorical ones like this:
data import1;
set work.import1;
annual_inc_c = annual_inc - 75133.67;
installment_c = installment - 447.9723112;
funded_amnt_c = funded_amnt - 15438.95;
/* check of the centering */
run;
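The constants subtracted above are typed in by hand, so they go stale if the data change. As a hedged alternative to that step (not in addition to it), the means can be computed by SAS itself. This is a minimal sketch: the output data set name import1_centered is an assumption introduced here, and PROC SQL is used only because it remerges the column mean onto every row automatically (it writes a NOTE about this to the log).

```sas
/* Sketch: center on the computed sample mean instead of a typed constant. */
/* import1_centered is a hypothetical name, not part of the original code. */
proc sql;
  create table work.import1_centered as
  select *,
         annual_inc  - mean(annual_inc)  as annual_inc_c,
         installment - mean(installment) as installment_c,
         funded_amnt - mean(funded_amnt) as funded_amnt_c
  from work.import1;
quit;

/* Check of the centering: each mean should be zero up to rounding. */
proc means data=work.import1_centered mean;
  var annual_inc_c installment_c funded_amnt_c;
run;
```

If this variant is used, the later steps would read from import1_centered instead of import1.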
DATA import1;
SET import1;
/*home_ownership_c = .; */
IF (home_ownership="MORTGAGE") THEN home_ownership_c = 0;
IF (home_ownership="OWN") THEN home_ownership_c = 1;
IF (home_ownership="RENT") THEN home_ownership_c = 2;
/*purpose_c = .;*/
IF (title = "Business") THEN purpose_c = 0;
IF (title = "Car financing") THEN purpose_c = 1;
IF (title = "Credit card refinancing") THEN purpose_c = 2;
IF (title = "Debt consolidation") THEN purpose_c = 3;
IF (title = "Green loan") THEN purpose_c = 4;
IF (title = "Home buying") THEN purpose_c = 5;
IF (title = "Home improvement") THEN purpose_c = 6;
IF (title = "Major purchase") THEN purpose_c = 7;
IF (title = "Medical expenses") THEN purpose_c = 8;
IF (title = "Moving and relocation") THEN purpose_c = 9;
IF (title ="Other") THEN purpose_c = 10;
IF (title ="Paying off higher inter") THEN purpose_c = 11;
IF (title = "Vacation") THEN purpose_c = 12;
/* grade_c = .;*/
IF (grade = "A") THEN grade_c = 0;
IF (grade = "B") THEN grade_c = 1;
IF (grade = "C") THEN grade_c = 2;
IF (grade = "D") THEN grade_c = 3;
IF (grade = "E") THEN grade_c = 4;
IF (grade = "F") THEN grade_c = 5;
IF (grade = "G") THEN grade_c = 6;
/*term_c = .;*/
IF (term = "36 months") THEN term_c = 0;
IF (term = "60 months") THEN term_c = 1;
RUN;
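Because each recoded variable is assigned by a chain of separate IF statements, any category that is not listed ends up with a missing code and is silently dropped from the regression. A quick cross-tabulation of each original variable against its code, sketched below, makes such gaps visible. It assumes only the variables created in the step above.

```sas
/* Sketch: verify the recoding by listing each original value with its code. */
/* The MISSING option keeps rows whose code was never assigned.              */
proc freq data=import1;
  tables home_ownership*home_ownership_c
         title*purpose_c
         grade*grade_c
         term*term_c / list missing;
run;
```

Each original value should map to exactly one code; a row with a missing code points to a category the IF statements do not cover.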
Then I ran the multiple linear regression:
PROC GLM data=import1;
model int_rate = term_c annual_inc_c purpose_c home_ownership_c installment_c / solution clparm;
run;
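In the model above, purpose_c and home_ownership_c enter as ordinary numeric regressors, so each gets a single slope per unit of its code, although the codes only label unordered categories. One possible alternative is to let PROC GLM create the dummy variables through a CLASS statement. The sketch below is not the model that produced the reported results; it assumes the original character variables term, title and home_ownership are still present in import1.

```sas
/* Sketch: treat the categorical predictors as CLASS variables.        */
/* Assumes term, title and home_ownership still exist in import1.      */
proc glm data=import1;
  class term title home_ownership;
  model int_rate = term annual_inc_c title home_ownership installment_c
        / solution clparm;
run;
quit;
```

With CLASS variables the SOLUTION table reports one estimate per category level relative to a reference level, so the coefficients are read as differences between categories rather than as slopes.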
I found that the model explains 28% of the variation in the interest rate (R2 = 28%).
The p-value of every explanatory variable is < 0.05, hence they are all statistically significant.
Before I drew the plots, I calculated the data for them like this:
ods graphics on;
PROC GLM PLOTS(unpack)=all data=import1;
model int_rate = term_c annual_inc_c purpose_c home_ownership_c installment_c / solution clparm;
/* save residuals and influence diagnostics; rstud = externally studentized residual */
output out=results residual=res student=stdres rstudent=rstud cookd=cookd h=leverage;
run;
Thanks to this code I was able to draw:
Standardized residual plot
which was created like this:
PROC GPLOT data=results;
label stdres = "Standardized Residual" id = "id";
plot stdres*id / vref=0;
run;
It shows that a lot of observations have a standardized residual larger than 2 in absolute value, which means that the model fits these observations poorly and that they are candidates for deletion.
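To put a number on "a lot", the observations beyond the cutoff of 2 can be flagged and counted. The sketch below assumes the results data set and the stdres variable from the OUTPUT statement above. The names flagged and large_resid are introduced here for illustration only.

```sas
/* Sketch: flag and count observations with |standardized residual| > 2. */
/* flagged and large_resid are hypothetical names.                       */
data work.flagged;
  set work.results;
  /* 1 if beyond the cutoff, 0 otherwise (a missing residual gives 0) */
  large_resid = (abs(stdres) > 2);
run;

proc freq data=work.flagged;
  tables large_resid;
run;
```

As a point of comparison, with normally distributed errors roughly 5% of standardized residuals are expected to fall outside plus or minus 2, so the share reported by PROC FREQ is more informative than the raw count.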
QQ plot for residuals
proc univariate data=results;
var res;
/* compare the residuals with a normal distribution using estimated mean and std. dev. */
qqplot res / normal(mu=est sigma=est);
run;
It shows that, despite some outliers, the residuals are approximately normally distributed.
In total, the model has a high R2 of 94%, which is strongly tied to one of the variables, which was excluded; some observations have inappropriate residuals and hence must be removed.












