Test a Basic Linear Regression Model
The program in SAS [link]
Here is a scatter plot showing a linear association between the overall polity score for countries and alcohol consumption per adult from the gap minder data set. That is we can a draw a straight line to the scatter plot and this regression line does a relatively not bad job of catching the association.
You can see that it looks like alcohol consumption rate increases as democracy score increases. We can actually fit a line that curves, to better fit the association, by adding a polynomial term. For example, we could add a quadratic term, to draw the line of best fit that captures the curvature that we're seeing.
Now the scatterplot shows the original linear regression line in blue and the quadratic regression line in green. Notice how the quadratic line does a better job of capturing the association at lower and higher democracy score. The points at these levels are closer to the quadratic, or second-order polynomial curve. Meaning that the expected or predicted values are closer to the observed values. So based on just looking at the two curves, it looks like the green quadratic curve fits the data better than the blue straight line.
But we can be even more sure of this conclusion if we test to see whether adding a second order polynomial term to our aggression model gives us a significantly better fitting model. We do this by simply adding another variable that is the squared value of the explanatory x variable, x squared, to the regression model.
First, let's test the regression model for just the linear association between democracy score and alcohol consumption rate using the ols function from the statsmodels API formula library. Note that we have centered democracy score quantitative explanatory variable polityscore_c. Centering is especially important when testing a polynomial regression model. Because it makes it considerably easier to interpret the regression coefficients.
If we look at the results, we can see from the significant p-value and positive parameter estimate (=0.3541) that alcohol consumption rate is positively associated with countries' democracy score. So the linear association, the blue line in the scatter plot, is statistically significant. But the R-square is 18.6%, indicating that the linear association of democracy score is capturing only about 18.6% of the variability in alcohol consumption. But what happens if we allow that straight line to curve by adding a second order polynomial to that regression equation. The Python code to do this is here.
When we look at the table of results, we see that the value for the linear term for democracy score is positive (=0.5250), and the p-value is less than 0.05. In addition, the quadratic term is positive insignificant, indicating that the curvilinear pattern we observed in our scatter plot is statistically significant. In addition, you see that the R square increases to 23.3%. Which means that adding the quadratic term for democracy score, increase the amount of variability in alcohol consumption rate that can be explained by democracy score from 18.6% to 23.3%. Together, these results suggest that the best fitting line for this association is one that includes some curvature.
Let's add another centered explanatory variable, urbanrate, to our regression equation. Here's the regression equation for this model and the python code. This is the same gap minder model that we tested previously with the exception that we have added the centered urbanrate_c explanatory variable. And here are the results.
As you can see from the table above the coefficients for the linear and quadratic democracy score variables, remain significant after adjusting for the urban rate. The urban rate is also statistically significant. The positive regression coefficient indicates that countries with a high urban rate, tend to have a higher alcohol consumption rate.
In fact, urban rate and democracy score together, explain about 25.3% of the variability in alcohol consumption rate. So, there's clearly some error in estimating the response value with this model. In this regression model, the residual is the difference between the predicted alcohol consumption rate and the actual observed alcohol consumption rate for each country.
We can take a look at this residual variability, which not only helps us to see how large the residuals are but also allows us to see whether our regression assumptions are met. And whether there are any outlying observations, that might be unduly influencing the estimation of the regression coefficient.
The easiest way to evaluate residuals is to graph them. First, we can use a qq-plot to evaluate the assumption that the residuals from our aggression model are normally distributed. A qq-plot plots the quantiles of the residuals that we would theoretically see if the residuals followed a normal distribution, against the quantiles for the residuals estimated from our aggression model. The python code to generate a qq-plot is here.
The qqplot for our regression model shows that the residuals generally follow a straight line, but deviate at the lower and higher quantiles. This indicates that our residuals did not follow perfect normal distribution. This could mean that the curvilinear association that we observed in our scatter plot may not be fully estimated by the quadratic democracy score term. There might be other explanatory variables that we might consider including in our model, that could improve estimation of the observed curvilinearity.
To evaluate the overall fit of the predicted values of the response variable to the observed values and to look for outliers, we can examine a plot of the standardized residuals for each of the observations.
If we take a look at this plot, we see that most of the residuals fall within 2 standard deviations of the mean. So basically, they're either between -2 or 2, and all but a few countries (=7) have residuals that are more than 2 standard deviations above or below the mean of 0.
With the standard normal distribution, we would expect 95% of the values of the residuals to fall between two standard deviations of the mean. Residual values that are more than two standard deviations from the mean in either direction are a warning sign that we may have some outliers. There are 7 observations that have three or more standard deviations from the mean. So we have some extreme outliers.
1.9% of the residuals exceeded an absolute value of 2.5 and 4.4% were greater than an absolute value of 2.0. This suggests that the fit of the model is relatively poor and could be improved. The biggest contributor to poor model fit is leaving out important explanatory variables. In order to improve the fit of this model, we should include more explanatory variables to better explain the variability in our alcohol consumption rate response variable.
The following Python code can be used to generate a few more plots to help us determine how specific explanatory variables contribute to the fit of our model.
The primary plots of interest are the plots of the residuals for each observation of different of values of urban rates in the upper right-hand corner and partial regression plot which is in the lower left-hand corner. The plot in the upper right-hand corner shows the residuals for each observation at different values of Internet use rate. Finally, because we have multiple explanatory variables, we might want to take a look at the contribution of each individual explanatory variable to model fit, controlling for the other explanatory variables. One type of plot that does this, is the partial regression residual plot.
Another plot, in the lower left-hand corner, is a partial regression residual plot. It attempts to show the effect of adding urban rate as an additional explanatory variable to the model. Given that one or more explanatory variables are already in the model. For the urban rate variable, the values in the scatter plot are two sets of residuals. The residuals from a model predicting the alcohol consumption rate response from the other explanatory variables, excluding urban rate, are plotted on the vertical access, and the residuals from the model predicting urban rate from all the other explanatory variables are plotted on the horizontal access. What this means is that the partial regression plot shows the relationship between the response variable and specific explanatory variable, after controlling for the other explanatory variables.
When we take a look at the plot for urban rate in the lower left-hand corner, we see that, in contrast to the plot of the residuals at different values of urban rate without adjusting for the polity score variables, which is shown above, the partial regression plot for urban rate does not clearly indicate a nonlinear association. Rather, the residuals are spread out in a random pattern around the partial regression line. In addition, many of the residuals are pretty far from this line, indicating a great deal of alcohol consumption rate prediction error. This suggests that although urban rate shows a statistically significant association with alcohol consumption rate, this association is pretty weak after controlling for democracy score.
Let's take a look at a leverage plot to identify observations that have an unusually large influence on the estimation of the predicted value of the response variable, alcohol consumption rate, or that are outliers, or both. The leverage of an observation can be thought of in terms of how much the predicted scores for the other observations would differ if the observations in question were not included in the analysis. The leverage always takes on values between 0 and 1. A point with zero leverage has no effect on the regression model. And outliers are observations with residuals greater than 2 or less than -2. We use the following Python code to generate a leverage plot.
One of the first things we see in the leverage plot is that we have a few outliers, contents that have residuals greater than 2. We've already identified some of these outliers in some of the other plots we've looked at, but this plot also tells us that these outliers have small or close to zero leverage values, meaning that although they are outlying observations, they do not have an undue influence on the estimation of the regression model. On the other hand, we see that there are a few cases with higher than average leverage. But three, in particular, is more obvious in terms of having an influence on the estimation of the predicted value of alcohol consumption rate. This observation has a high leverage but is not an outlier. We don't have any observations that are both high leverage and outliers.