Data Analyst's Blog @dataanalystblog-blog - Tumblr Blog

Assignment 3

In the third assignment I am going to perform multiple linear regression on the relationship between the interest rate and the client's annual income, the term over which the client is supposed to repay the loan, the installment, and the client's home ownership.

I centered my variables and recoded them like this:

data import1;

set work.import1;

annual_inc_c = annual_inc - 75133.67;

installment_c = installment -447.9723112;

funded_amnt_c = funded_amnt -15438.95;

/* check centering */

run;

DATA import1;

SET import1;

/*home_ownership_c = .; */

IF (home_ownership="MORTGAGE") THEN home_ownership_c = 0;

IF (home_ownership="OWN") THEN home_ownership_c = 1;

IF (home_ownership="RENT") THEN home_ownership_c = 2;

/*purpose_c = .;*/

IF (title = "Business") THEN purpose_c = 0;

IF (title = "Car financing") THEN purpose_c = 1;

IF (title = "Credit card refinancing") THEN purpose_c = 2;

IF (title = "Debt consolidation") THEN purpose_c = 3;

IF (title = "Green loan") THEN purpose_c = 4;

IF (title = "Home buying") THEN purpose_c = 5;

IF (title = "Home improvement") THEN purpose_c = 6;

IF (title = "Major purchase") THEN purpose_c = 7;

IF (title = "Medical expenses") THEN purpose_c = 8;

IF (title = "Moving and relocation") THEN purpose_c = 9;

IF (title ="Other") THEN purpose_c = 10;

IF (title ="Paying off higher inter") THEN purpose_c = 11;

IF (title = "Vacation") THEN purpose_c = 12;

/* grade_c = .;*/

IF (grade = "A") THEN grade_c = 0;

IF (grade = "B") THEN grade_c = 1;

IF (grade = "C") THEN grade_c = 2;

IF (grade = "D") THEN grade_c = 3;

IF (grade = "E") THEN grade_c = 4;

IF (grade = "F") THEN grade_c = 5;

IF (grade = "G") THEN grade_c = 6;

/*term_c = .;*/

IF (term = "36 months") THEN term_c = 0;

IF (term = "60 months") THEN term_c = 1;

RUN;
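The recoding above can be checked with a frequency table of the recoded variable. The following is only a minimal Python sketch of that check (Python being the other language used in the course), not the SAS code I ran. The sample values and the names `HOME_OWNERSHIP_CODES` and `recode` are made up for illustration.

```python
from collections import Counter

# Hypothetical raw values; the real column comes from the Lending Club file.
home_ownership = ["MORTGAGE", "RENT", "OWN", "RENT", "MORTGAGE", "RENT"]

# Same codes as in the SAS step above; one category is coded 0.
HOME_OWNERSHIP_CODES = {"MORTGAGE": 0, "OWN": 1, "RENT": 2}


def recode(values, codes):
    """Map each raw category to its numeric code; unknown categories become None."""
    return [codes.get(value) for value in values]


home_ownership_c = recode(home_ownership, HOME_OWNERSHIP_CODES)

# Frequency table of the recoded variable, sorted by code (None, if any, goes last).
freq = Counter(home_ownership_c)
for code in sorted(freq, key=lambda c: (c is None, c)):
    print(code, freq[code])
```

A category that is missing from the mapping shows up as `None` in the table, which makes a forgotten or misspelled level easy to spot.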

Then I ran a multiple linear regression:

PROC GLM;

model int_rate=term_c annual_inc_c purpose_c home_ownership_c installment_c/solution clparm;

run;

I found out that this model explains part of the variation in the interest rate: R2 is 28%.

The p-value of every explanatory variable is < 0.05, hence all of them are statistically significant.

Before I drew the plots, I calculated the data for them like this:

ods graphics on;
PROC GLM PLOTS(unpack)=all data=import1;
model int_rate=term_c annual_inc_c purpose_c home_ownership_c installment_c/solution clparm;
/* save residuals and influence statistics for the plots; each statistic gets its own variable name */
output out=results residual=res student=stdres rstudent=rstud cookd=cookd h=leverage;
run;

Thanks to this code I was able to draw:

Standardized residual plot

which was created like this:

PROC GPLOT data=results;
label stdres = "Standardized Residual" id = "id";
plot stdres*id / vref=0;
run;

It shows that a lot of observations have a standardized residual larger than 2 in absolute value. That suggests the model fits these observations poorly, and they should be examined as outliers and possibly removed.
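How many is "a lot" can be judged against what a normal distribution would give: about 4.6% of values lie beyond ±2 standard deviations. The sketch below is a rough Python illustration, not part of my SAS program. It standardizes residuals by their plain standard deviation (a simplification that ignores leverage, unlike the student= statistic in PROC GLM), and the residual values are invented. The row count 69,674 is the size of my Lending Club sample.

```python
from statistics import NormalDist, fmean, pstdev


def share_beyond(residuals, cutoff=2.0):
    """Share of residuals whose standardized value exceeds the cutoff in absolute value."""
    mu = fmean(residuals)
    sd = pstdev(residuals)
    z_scores = [(r - mu) / sd for r in residuals]
    return sum(abs(z) > cutoff for z in z_scores) / len(z_scores)


# Invented residuals with one clear outlier (5.0).
residuals = [-0.5, 0.3, 1.2, -0.8, 0.1, -0.2, 0.4, -0.6, 5.0, 0.1]
observed_share = share_beyond(residuals)

# Share expected beyond +/-2 if the residuals were exactly normal.
expected_share = 2 * (1 - NormalDist().cdf(2.0))

# Expected count of such observations in a sample of 69,674 rows.
expected_count = round(expected_share * 69674)

print(observed_share, expected_share, expected_count)
```

If the observed share is far above the expected one, that supports the conclusion that the residuals are a problem; if it is close, large residuals alone are weak evidence.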

QQ plot for residuals

proc univariate data=results;
var res term_c annual_inc_c purpose_c home_ownership_c installment_c;
qqplot;
run;

It shows that, despite some outliers, the residuals are approximately normally distributed.
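For readers who want to see what a QQ plot actually compares, here is a minimal Python sketch that builds the points by hand. The function name `qq_points` and the sample values are made up, and the plotting position (i - 0.5)/n is just one common choice, not necessarily the one PROC UNIVARIATE uses.

```python
from statistics import NormalDist


def qq_points(sample):
    """Pair each sorted sample value with the matching standard normal quantile."""
    n = len(sample)
    ordered = sorted(sample)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Invented residuals; in practice these would be the values of res in results.
points = qq_points([0.4, -1.1, 0.2, 2.5, -0.3, -0.6, 0.9])
for theoretical_q, sample_q in points:
    print(round(theoretical_q, 3), sample_q)
```

If the sample is close to normal, the points fall near a straight line; outliers show up as points that bend away from it at the ends.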

In total, the model had a high R2 of 94% only because of one variable that is very strongly correlated with the interest rate, so that variable was excluded. Some observations have inappropriate residuals and hence must be removed.

#coursera #course assignment

Regression Modelling in Practice week 2 assignment

In week 2 we are supposed to do the following with our chosen data set:

If you have a categorical explanatory variable, make sure one of your categories is coded "0" and generate a frequency table for this variable to check your coding. If you have a quantitative explanatory variable, center it so that the mean = 0 (or really close to 0) by subtracting the mean, and then calculate the mean to check your centering.

Test a linear regression model and summarize the results in a couple of sentences. Make sure to include statistical results (regression coefficients and p-values) in your summary.

In exactly this way:

Post your program and output

Post a frequency table for your (recoded) categorical explanatory variable or report the mean for your centered explanatory variable.

Write a few sentences describing the results of your linear regression analysis.

So what I did is:

I wanted to see the relationship between funded amount and annual income.

First, because my explanatory variable is quantitative, I centered it with this code:

1) I found the mean

proc means; var annual_inc; run; /*75133.67*/

2) I added centered variable into existing dataset

data import; set work.import; annual_inc_c = annual_inc - 75133.67; run;

3) I checked the new mean

proc means; var annual_inc_c; run;

The new variable is centered.
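The same three steps (find the mean, subtract it, check the new mean) look like this in Python. This is only a small sketch with invented incomes, not my real data. Because the mean is computed rather than typed in by hand, the centered mean comes out at 0 up to floating-point error, whereas a rounded constant such as 75133.67 leaves it only close to 0.

```python
from statistics import fmean

# Invented annual incomes; the real values come from the annual_inc column.
annual_inc = [42000.0, 58000.0, 75000.0, 91000.0, 120000.0]

# 1) find the mean
mean_inc = fmean(annual_inc)

# 2) add the centered variable
annual_inc_c = [x - mean_inc for x in annual_inc]

# 3) check the new mean (should be 0 or very close to it)
centered_mean = fmean(annual_inc_c)

print(mean_inc, centered_mean)
```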

Second I ran linear regression like this:

PROC GLM; model funded_amnt=annual_inc_c /solution clparm; run;

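The results of the linear regression model indicated that clients’ annual income (Beta=0.059, p=.0001) was significantly and positively associated with the funded amount, i.e. the total loan provided. In other words, each additional unit of annual income is associated with a funded amount that is higher by 0.059 units on average.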
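To show where a Beta like this comes from, here is a minimal least-squares sketch in Python for one explanatory variable. The function name `fit_line` and the numbers are invented, so the slope it prints is not my result. With a centered explanatory variable the intercept equals the mean of the response, which is one reason for centering.

```python
from statistics import fmean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Invented data: centered annual income and funded amount.
annual_inc_c = [-30000.0, -10000.0, 0.0, 10000.0, 30000.0]
funded_amnt = [12000.0, 14500.0, 15000.0, 16500.0, 17000.0]

intercept, slope = fit_line(annual_inc_c, funded_amnt)
print(intercept, slope)
```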

#coursera #assignment #regression #course

Regression Modelling in Practice week 1 assignment

The first week's assignment in the Coursera course by Wesleyan University is to find a suitable data set for our own research, so that we can try different types of regression and finish the course with new knowledge.

I browsed many sites and tried to find my dream data set that meets these conditions:

has some quantitative response variable

has some non-quantitative (binary) response variable (result is either 1 or 0)

has many variables

has enough data

I finally found it on Kaggle.com, where I downloaded about 10 data sets. Only one of them eventually met these conditions, and that one I downloaded from the original website.

Sample

The sample is from Lending Club. Lending Club is a peer-to-peer (P2P) platform where investors and debtors meet directly. Investors can choose which loans they will invest in. When they fund a loan/debtor, they receive monthly installments.

In my sample I took data from 2015, which I downloaded from the Kaggle website. The data have N = 69,674 rows in total. According to Lending Club: “Data set contains complete loan data for all loans issued through the time period stated, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the "present" contains complete loan data for all loans issued through the previous completed calendar quarter.”

Procedure

The data was collected throughout the examination process of each application of the loan. Each application was treated separately.

Measures

P2P platforms are mainly used for paying off borrowers' debts from other loans at a lower interest rate.

Forbes states that “In fact, according to peer-to-peer platform Lending Club, its borrowers — on average — secure a 24% lower interest rate when using its peer-to-peer loans to consolidate debt.”

This means that P2P is sought out by people who already have some or many loans and want to pay them off, which makes them riskier, since they were not able to repay those loans on time.

This data set gives me plenty of opportunities to study. I would like to examine the relationship between installment, interest rate, funded amount, etc.

The biggest challenge may be to predict whether the loan will or won't be paid fully.

I hope I will get as much out of the data as I can.

#coursera #assignment #datascience #regression

Regression Modeling in Practice

Hey,

I am Barbora from the Czech Republic. I have recently enrolled in a Coursera course called Regression Modeling in Practice taught by lecturers from Wesleyan University.

I took several Coursera courses before, so I knew what to expect. So far I have to say that this one took my heart and mind into a different dimension. The lectures are well structured and nicely explained, I am not losing pace, and I feel really comfortable while watching the videos and testing the shown material in SAS or Python.

Here are the 5 biggest advantages of this course:

Very nice structure

Easy to follow pace

Learn how to write code in SAS and Python - so you can learn how to use both in 1 course

You have to start a Tumblr and begin blogging = here I am

You get some preparation for some bigger project

I have already seen 3 weeks out of 4, but due to the pre-Christmas workload I had no time to find a good dataset or start thinking about its usefulness, research questions, etc. Today is the day to start with the assignments :)

#datascience #data #coursera #courseraassignment #analyst #sas
