Multiple Regression using R
Introduction:
Multiple regression is an extension of simple linear regression that can be used to analyse the relationship between one response variable and two or more predictor variables, whereas simple linear regression has one response variable and one predictor variable. The predictor variables are the independent variables and the response variable is the dependent variable. Consider the equation for a multiple regression with three predictors:
Y = m1x1 + m2x2 + m3x3 + b
Where Y is the response variable,
x1, x2, x3 are the predictor variables,
m1, m2, m3 are the regression coefficients (slopes) and b is the intercept.
Let us discuss two problems regarding multiple regression
Analysis using R:
R is one of the most widely used tools for multiple regression, and it is easy to use and handle.
DATA SET USED:
· https://github.com/grantgasser/Complete-Multiple-Regression
Using this dataset, we study the relation between degree of brand liking (Y) and the moisture content (X1) and sweetness (X2) of the product. The results were obtained from an experiment based on a completely randomized design.
The steps to be followed are:
1. Load and view the dataset
2. Check the linearity of the data in R
3. Plot the graph
4. Implement the multiple regression
5. Predict and interpret
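As a minimal sketch of the first step, the snippet below loads a CSV file and inspects it. The real file name and column layout in the repository are not given here, so the sketch writes a small simulated stand-in table (hypothetical design levels and arbitrary coefficients, not the actual brand preference data) to a temporary file and reads it back. The names `standin`, `csv_path` and `brand` are illustrative only.

```r
# Simulated stand-in data: NOT the real brand preference data.
# The design levels and the coefficients (40, 4, 4) are assumptions.
set.seed(1)
standin <- data.frame(
  X1 = rep(c(4, 6, 8, 10), each = 4),  # hypothetical moisture levels
  X2 = rep(c(2, 4), times = 8)         # hypothetical sweetness levels
)
standin$Y <- 40 + 4 * standin$X1 + 4 * standin$X2 +
  rnorm(nrow(standin), sd = 2)

# Write to a temporary CSV so the loading step can be shown end to end.
csv_path <- tempfile(fileext = ".csv")
write.csv(standin, csv_path, row.names = FALSE)

# Step 1: load and view the dataset.
brand <- read.csv(csv_path)
head(brand)     # first few rows
str(brand)      # column types
summary(brand)  # per-column summary statistics
```

With the real data, only the `read.csv()` call and the lines after it would be needed, pointed at the downloaded file.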
Brand Preference:
In a small-scale experimental study of the relation between degree of brand liking (Y) and the moisture content (X1) and sweetness (X2) of the product, the following results were obtained from the experiment based on a completely randomized design (data are coded):
Analysis of the data:
Scatter plot:
The diagnostic aids show, firstly, that there are no outliers and that the distribution of each variable looks normal. Additionally, the correlation matrix shows that Y and X1 have a strong positive correlation, that Y and X2 are also positively correlated, though less strongly than Y and X1, and that there is no correlation between X1 and X2.
Correlation Matrix:
The correlation matrix is computed to check the pairwise correlations between the variables.
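A scatter plot matrix and a correlation matrix like the ones described above could be produced roughly as follows. This is a sketch on simulated stand-in data (hypothetical design levels and coefficients, not the real brand preference data); `brand` and `cor_mat` are illustrative names.

```r
# Simulated stand-in data: NOT the real brand preference data.
set.seed(1)
brand <- data.frame(
  X1 = rep(c(4, 6, 8, 10), each = 4),
  X2 = rep(c(2, 4), times = 8)
)
brand$Y <- 40 + 4 * brand$X1 + 4 * brand$X2 + rnorm(nrow(brand), sd = 2)

vars <- brand[, c("Y", "X1", "X2")]

# Scatter plot matrix: one panel per pair of variables.
pairs(vars, main = "Scatter plot matrix (stand-in data)")

# Pearson correlation matrix of the same variables.
cor_mat <- cor(vars)
round(cor_mat, 3)
```

In a balanced design such as this stand-in, where every X1 level is combined equally often with every X2 level, the correlation between X1 and X2 is zero by construction.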
Summary:
The multiple R-squared is 0.9521 and the adjusted R-squared is 0.9447. When X2 is added to a model that already contains X1, its p-value is about 2.01e-05. The F-statistic is larger than 1. The fitted regression model is Y = 37.65 + 4.425 X1 + 4.375 X2. Holding the other variable constant, a one-unit increase in X1 results in a 4.425 rise in the degree of brand liking, while a one-unit increase in X2 results in a 4.375 rise. Because the p-value for each variable is less than 0.05, both X1 and X2 are significant.
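A summary of this kind comes from fitting the model with `lm()` and calling `summary()` on the result. The sketch below uses simulated stand-in data (not the real brand preference data), so its numbers will differ from those quoted above. The object name `fit` follows the prediction code later in this article.

```r
# Simulated stand-in data: NOT the real brand preference data.
set.seed(1)
brand <- data.frame(
  X1 = rep(c(4, 6, 8, 10), each = 4),
  X2 = rep(c(2, 4), times = 8)
)
brand$Y <- 40 + 4 * brand$X1 + 4 * brand$X2 + rnorm(nrow(brand), sd = 2)

# Fit the multiple regression of Y on both predictors.
fit <- lm(Y ~ X1 + X2, data = brand)

# Coefficients, standard errors, p-values, R-squared and the F-statistic.
fit_summary <- summary(fit)
fit_summary

coef(fit)                  # intercept and the two slopes
fit_summary$r.squared      # multiple R-squared
fit_summary$adj.r.squared  # adjusted R-squared

# Sequential ANOVA table: the X2 row tests X2 given X1 is in the model.
anova(fit)
```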
QQ plot:
In this QQ plot the points fall close to a straight line, which suggests that the residuals follow a normal distribution. There are no apparent outliers.
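A normal QQ plot of the residuals can be drawn with base R as sketched below, again on simulated stand-in data rather than the real brand preference data.

```r
# Simulated stand-in data: NOT the real brand preference data.
set.seed(1)
brand <- data.frame(
  X1 = rep(c(4, 6, 8, 10), each = 4),
  X2 = rep(c(2, 4), times = 8)
)
brand$Y <- 40 + 4 * brand$X1 + 4 * brand$X2 + rnorm(nrow(brand), sd = 2)

fit <- lm(Y ~ X1 + X2, data = brand)
res <- residuals(fit)

# Sample quantiles of the residuals against theoretical normal quantiles.
qqnorm(res, main = "Normal QQ plot of residuals (stand-in data)")
# Reference line through the first and third quartiles.
qqline(res)
```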
Shapiro test:
The Shapiro-Wilk test is a statistical test of normality. It can be used to check whether a data set is consistent with a normal distribution. By analysing the values, we get:
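One way to run the test is with `shapiro.test()` from base R. The sketch below applies it to the residuals of a model fitted on simulated stand-in data, not the real brand preference data. Testing the residuals rather than the raw columns is an assumption made here for illustration.

```r
# Simulated stand-in data: NOT the real brand preference data.
set.seed(1)
brand <- data.frame(
  X1 = rep(c(4, 6, 8, 10), each = 4),
  X2 = rep(c(2, 4), times = 8)
)
brand$Y <- 40 + 4 * brand$X1 + 4 * brand$X2 + rnorm(nrow(brand), sd = 2)

fit <- lm(Y ~ X1 + X2, data = brand)

# Null hypothesis: the residuals come from a normal distribution.
sw <- shapiro.test(residuals(fit))
sw

sw$statistic  # the W statistic
sw$p.value    # a p-value above 0.05 gives no evidence against normality
```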
Model validation:
Regression vs Residual Plot:
The plot above is a residual plot, which indicates a pattern in the residuals against the fitted values. Although the distribution appears to be roughly normal, there are outlying points on both sides of the median, with more of them to the right.
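A residuals versus fitted values plot can be sketched in base R as follows. The data are simulated stand-ins, not the real brand preference data, so the pattern shown will not match the plot discussed above.

```r
# Simulated stand-in data: NOT the real brand preference data.
set.seed(1)
brand <- data.frame(
  X1 = rep(c(4, 6, 8, 10), each = 4),
  X2 = rep(c(2, 4), times = 8)
)
brand$Y <- 40 + 4 * brand$X1 + 4 * brand$X2 + rnorm(nrow(brand), sd = 2)

fit <- lm(Y ~ X1 + X2, data = brand)
fitted_vals <- fitted(fit)
res <- residuals(fit)

# Residuals against fitted values; a random scatter around zero
# suggests the linear model and constant variance are reasonable.
plot(fitted_vals, res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted (stand-in data)")
abline(h = 0, lty = 2)  # horizontal reference line at zero
```

Calling `plot(fit, which = 1)` gives R's built-in version of the same diagnostic.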
Breusch-Pagan test:
This test can be used to determine whether heteroscedasticity is present in a regression model.
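The idea behind the test can be sketched by hand in base R: regress the squared residuals on the predictors and compare n times the R-squared of that auxiliary regression with a chi-squared distribution. This is the studentized (Koenker) form of the statistic, computed on simulated stand-in data rather than the real brand preference data. The names `aux`, `bp_stat` and `p_value` are illustrative.

```r
# Simulated stand-in data: NOT the real brand preference data.
set.seed(1)
brand <- data.frame(
  X1 = rep(c(4, 6, 8, 10), each = 4),
  X2 = rep(c(2, 4), times = 8)
)
brand$Y <- 40 + 4 * brand$X1 + 4 * brand$X2 + rnorm(nrow(brand), sd = 2)

fit <- lm(Y ~ X1 + X2, data = brand)

# Auxiliary regression of the squared residuals on the predictors.
brand$res_sq <- residuals(fit)^2
aux <- lm(res_sq ~ X1 + X2, data = brand)

# Studentized Breusch-Pagan statistic: n * R-squared of the auxiliary fit.
n <- nrow(brand)
bp_stat <- n * summary(aux)$r.squared
bp_df <- 2  # number of predictors in the auxiliary regression

# Null hypothesis: constant error variance (homoscedasticity).
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
c(BP = bp_stat, df = bp_df, p.value = p_value)

# If the lmtest package is installed, lmtest::bptest(fit) is the
# packaged way to run this test.
```

A small p-value (for example below 0.05) would point to heteroscedasticity. Otherwise there is no evidence against constant variance.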
Prediction and Confidence level:
# newX1 and newX2 must already hold the new predictor values
newX <- data.frame(X1 = newX1, X2 = newX2)
# Confidence interval (95%)
predict(fit, newX, interval = "confidence")
# Prediction interval (95%)
predict(fit, newX, interval = "prediction")
Output:
Interpretation:
From the above analysis, the R-squared is about 95%, which is very good: the model fits the data closely and the overall relationship is significant.