Data Analysis with Multiple Regression
Here I use the GapMinder dataset to examine my research question: whether breast cancer rate can be predicted by urbanization rate and alcohol consumption.
I am considering the following variables:
Predictor Variables: urbanrate, alcconsumption
Response Variable: breastcancerper100th
Since both my independent predictors are quantitative, I will need to center them before using them in the regression analysis.
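As a language-neutral illustration of what centering does (this is not the SAS code used in this post), here is a minimal Python sketch. The values are hypothetical and are not taken from the GapMinder file; the helper name `center` is my own.

```python
from statistics import mean

def center(values):
    """Subtract the mean from every value so the centered variable has mean 0."""
    m = mean(values)
    return [v - m for v in values]

# Hypothetical urbanization rates (percent), for illustration only.
urbanrate = [30.0, 45.5, 60.0, 74.5, 90.0]
urbanrate_c = center(urbanrate)

print(urbanrate_c)        # [-30.0, -14.5, 0.0, 14.5, 30.0]
print(mean(urbanrate_c))  # 0.0
```

After centering, a value of 0 corresponds to a country at the mean of the predictor, which is what makes the regression intercept easier to interpret.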
Here is the SAS code to find out the means for my variables:
This snippet generates the means as follows:
Next, I center the quantitative predictors with the code snippet below and plot a regression line for each one to get an idea of how each predictor relates to the response variable:
The snippet produces the below scatter plots with regression lines (the response variable against one predictor variable at a time):
Since both scatterplots indicate a positive linear trend, I will go ahead and add the variables to my regression model one at a time.
Adding only urbanrate to begin with:
This produces:
urbanrate is statistically significant, as can be seen from the p-value < 0.0001. However, R-square is only 0.32, meaning this model accounts for only 32% of the variability in breast cancer rate.
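To make the R-square figure more concrete, here is a rough Python sketch of a one-predictor least-squares fit and the R-square that goes with it. It is only an illustration of the arithmetic, not the SAS procedure used here, and the data points and the function name `fit_line` are hypothetical.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1, r_squared)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # R-square = 1 - (unexplained variation / total variation)
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return b0, b1, 1 - ss_res / ss_tot

# Hypothetical centered predictor and response values, for illustration only.
x_c = [-30.0, -14.5, 0.0, 14.5, 30.0]
y = [20.0, 35.0, 30.0, 50.0, 55.0]

b0, b1, r2 = fit_line(x_c, y)
print(f"intercept={b0:.3f} slope={b1:.3f} R-square={r2:.3f}")
```

Because the predictor is centered, the intercept in a one-predictor model like this equals the mean of the response. An R-square of 0.32 would mean that `ss_res` is still 68% of `ss_tot`.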
Next, I’ll add another predictor, alcconsumption, to see if that improves the model. This can be done with the snippet below:
This time it produces:
As can be clearly seen, both predictors are statistically significant (p-value < 0.0001), and together they increase R-square to 0.45, meaning this modified model accounts for 45% of the variability in breast cancer rate.
Going further, I added a quadratic term for urbanrate to my model:
This time, I have also added code to request residual statistics. This gives me the output below:
All three predictors are still statistically significant (p-value < 0.05), and the R-square has risen to 0.48.
Because I also requested residual statistics, the same code produces the following diagnostics:
Q-Q Plot
Outlier and Leverage Diagnostic:
Studentized Residuals
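One common way to read a studentized residual plot is to count the share of points outside the -2 to 2 band and compare it with the roughly 5% expected under a normal distribution. The Python sketch below shows that arithmetic using the counts reported in the summary below (8 of 165). The helper `share_outside` and the example residuals are hypothetical, for illustration only.

```python
def share_outside(residuals, limit=2.0):
    """Fraction of studentized residuals whose absolute value exceeds `limit`."""
    return sum(1 for r in residuals if abs(r) > limit) / len(residuals)

# Counts reported in the summary: 8 of 165 observations outside the -2 to 2 band.
reported_share = 8 / 165
print(f"{reported_share:.1%}")  # 4.8%

# Hypothetical residuals, only to show the helper in use.
example = [-2.4, -1.1, -0.3, 0.0, 0.4, 0.9, 1.6, 2.1]
print(share_outside(example))  # 0.25
```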
Summary of Observations:
1. Each predictor (urbanrate and alcconsumption) individually shows a positive linear relationship with the response variable, breast cancer rate.
2. Urbanrate, on its own, can be used as a predictor for breast cancer rate (p-value < 0.0001). This model allows us to account for about 32% of the variability in breast cancer rate. The regression equation we get from this model is:
breastcancerper100th = 37.90639200 + 0.57169795 * urbanrate_c
(where urbanrate_c is the centered urbanrate)
3. Adding alcohol consumption rate to the model (along with urbanrate) improves the model. Both predictors remain statistically significant (p-value < 0.0001). This model brings R-square up to 0.446; in other words, we can account for around 45% of the variability in breast cancer rate using this model. The regression equation for this model is:
breastcancerper100th = 37.90973617 + 0.46996409 * urbanrate_c + 1.64345926 * alcconsumption_c
(where urbanrate_c is the centered urbanrate and alcconsumption_c is centered alcconsumption)
4. Adding a quadratic term for urbanrate to the model keeps all the predictors statistically significant (p-values < 0.05) and pushes the R-square up to 0.484. In other words, using this model, we can account for around 48% of the variability in breast cancer rate. The regression equation from this model is:
breastcancerper100th = 33.47240804 + 0.49935408 * urbanrate_c + 0.00856208 * urbanrate_c * urbanrate_c + 1.77696015 * alcconsumption_c
(where urbanrate_c is the centered urbanrate and alcconsumption_c is centered alcconsumption)
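5. While adding the variables one at a time, I did not see any evidence of confounding: urbanrate remained statistically significant after alcconsumption and the quadratic term were added.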
6. From the studentized residual plot, I see that only 8 of the 165 residuals fall outside the -2 to 2 range. This amounts to about 4.8%, just under the roughly 5% expected under a standard normal distribution.
7. From the outlier and leverage plot, I see 6 outliers, none of which fall in the high-leverage category. There are 11 high-leverage points, but none of them are outliers.
8. From the Q-Q plot, I see that the residual quantiles roughly follow the straight line, although they deviate a little at the lower end and, in particular, at the higher end. So the residuals do not follow a normal distribution exactly, and even though this model does a decent job of predicting breast cancer rate, there might still be room for improvement.
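The three fitted equations in items 2 to 4 can be compared by evaluating them for the same inputs. The Python sketch below does that with the coefficients copied from those equations. The function names and the example country (10 units above the mean urban rate, 2 units above the mean alcohol consumption) are my own assumptions for illustration.

```python
# Coefficients copied from the regression equations in items 2 to 4.
def model_linear(urbanrate_c):
    return 37.90639200 + 0.57169795 * urbanrate_c

def model_two_predictors(urbanrate_c, alcconsumption_c):
    return 37.90973617 + 0.46996409 * urbanrate_c + 1.64345926 * alcconsumption_c

def model_quadratic(urbanrate_c, alcconsumption_c):
    return (33.47240804
            + 0.49935408 * urbanrate_c
            + 0.00856208 * urbanrate_c ** 2
            + 1.77696015 * alcconsumption_c)

# Hypothetical country, expressed in centered units (value minus the sample mean).
u_c, a_c = 10.0, 2.0

predictions = {
    "urbanrate only": model_linear(u_c),
    "urbanrate + alcconsumption": model_two_predictors(u_c, a_c),
    "with quadratic urbanrate term": model_quadratic(u_c, a_c),
}
for name, value in predictions.items():
    print(f"{name}: {value:.2f}")
```

With both centered predictors set to 0, each function returns its intercept, which is the predicted breast cancer rate for a country at the mean of both predictors.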