Week 3 - Regression Modeling in practice - Test a Multiple Regression Model
Hi there,
This week we are going to build a multiple regression model for our response variable in Gapminder dataset (EMPLOYRATE), and several (possible) explanatory variables (ie URBANRATE, INTERNETUSERATE, INCOMEPERPERSON).
Assignment:
Write a blog entry that summarize in a few sentences 1) what you found in your multiple regression analysis. Discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (Beta coefficients and p-values) in your summary. 2) Report whether your results supported your hypothesis for the association between your primary explanatory variable and the response variable. 3) Discuss whether there was evidence of confounding for the association between your primary explanatory and response variable (Hint: adding additional explanatory variables to your model one at a time will make it easier to identify which of the variables are confounding variables); and 4) generate the following regression diagnostic plots:
a) q-q plot
b) standardized residuals for all observations
c) leverage plot
d) Write a few sentences describing what these plots tell you about your regression model in terms of the distribution of the residuals, model fit, influential observations, and outlier
(SIDE NOTE 1: The research question is explained in
http://playingaroundwithdataanalysis.tumblr.com/post/139295250087/data-analysis-and-interpretation-coursera-week)
(SIDE NOTE 2 - Previous posts for Regression Modeling in practice:
Week 2: http://playingaroundwithdataanalysis.tumblr.com/post/140170375832/week-2-regression-modeling-in-practice-basics)
First, we start by centering all our possible explanatory variables that we deem may have an association with our response variable: URBANRATE; INTERNETUSERATE and INCOMEPERPERSON, That is, we substract their mean values so their mean is 0. This produces a set of derived variables, that we call URBANRATE_c, INCOMEPERPERSON_c and INTERNETUSERATE_c
By running .describe() on these variables, we can see that their mean now is 0, or very close to 0.
Then, we begin exploring the individual relationships between EMPLOYRATE and each of the other explanatory variables:
1) Linear regression model between EMPLOYRATE and URBANRATE:
# Employrate - vs Urbanization rate regression test reg1 = smf.ols('EMPLOYRATE ~ URBANRATE_c', data=sub1).fit() print (reg1.summary())
p-value for test is low: 2.61e-06
Coefficients for URBANRATE are Intercept=59.05 and slope=-0.16. Both with p-value =0.000. The slope value is close to 0, so in any case the relatiohsip is not very strong.
R squared at 12.8% means that the model is only capturing for 12% of the variability of the response variable.
So, there seems to be statistically significant negative association between EMPLOYRATE and URBANRATE:
2) Linear regression model between EMPLOYRATE and INTERNETUSERATE:
# Employrate - vs Internet rate regression test reg1 = smf.ols('EMPLOYRATE ~ INTERNETUSERATE_c', data=sub1).fit() print (reg1.summary())
There is a negative relationship betwen Employrate and Interentuse rate. Although statistically significant (p-value = 0.012), the slope coefficient is close to 0, which suggests a weak linear relationship. The R squared is low, meaning that the model is only capturing the 4% of variability of the response variable.
3) Linear regression model between EMPLOYRATE and INCOMEPERPERSON:
# Employrate - vs income grate regression test
reg1 = smf.ols('EMPLOYRATE ~ INCOMEPERPERSON_c', data=sub1).fit() print (reg1.summary())
Here p-value is high: p-value=0.7 .
There is no significant relationship between employ rate and income per person
Now, we are going to add the three variables to the regression model, to see if maybe it is urbanization rate rather than internet use rate that are associated significantly with Employment rate (one may be confounder variable).
With the multiple regression, we will account for partialiing out the part of the association that can be accounted for by the other.
In other words:
- is urbanization rate positively associated with the employment rate after controlling for internet use rate and income rate? .
- Similarly, is internet use rate positively associated with the employrate after controlling urbanization rate?
With the multiple regression we will be able to conclude if both urbanization rate and internet use rate are significantly associated with employment rate. We will be able to tell this by examining the p-values after a multiple regression test.
On the other hand, we have seen that, in the case of urban rate,a straight line is not doing a good job in estimating the association between two variables. We can use polynomial regression, to find a line that curves, to better fit the association, by adding a polynomial term.
For space reasons, we will not include all data for the regression test including URBANRATE and the square of URBANRATE, but the result is that there is a (very weak) although statistically significant relationship between EMPLOYRATE and both variables (URBANRATE and URBANRATE^2).
We will be building a multiple regression model including these two variables and adding the other two possible lurking variables (INTERNETUSERATE and INCOMEPERPERSON), to see if their p values are statistically significant and contribute to modeling the response variable EMPLOYRATE when each of them is being adjusted:
4) Multiple regression model:
reg3 = smf.ols('EMPLOYRATE ~ URBANRATE_c + I(URBANRATE_c**2) + INTERNETUSERATE_c + INCOMEPERPERSON_c', data=sub1).fit() print (reg3.summary())
TEST INTERPRETATION:
Intercept: it is the value of the response variable when all other explanatory variables are 0. Because we centered the explanatory variables, the intercept is the value of the employment rate at the mean of urbanrate, internet use rate and Income per person. So: urbanrate when both urban rate, internet use and income per user rates are at their means is 56.93 of every 100 people.
- The urban rate variables remain significant after adjusting the internet user rate (p-value=0.000, beta=-0.17)
- The quadratic term for urban rate remain significant after adjusting for other variables, although very weakly (p-value=0.001,beta=0.0047)
- Internet use rate is not statistically significant, (p-value= 0.132, but beta=0.0003))
- income per user is also statistically significant, although extremely weakly (p-value=0.012, beta=0.0003).
Each observation has an estimated response value, which is also referred to as the predicted or fitted response variable, based on the regression equation. We see in the R-squared value, that only about 22% of the variability in employment rate is explained by the explanatory variables. So, there is still clearly some error in estimating the response value with this model.
RESIDUALS EVALUATION:
Let's evaluate the residuals:
1) Q-Q plot:
We use a qq-plot to evaluate the assumption that the residuals from our regression model are normally distributed. A qq-plot plots the quantiles of the residuals that we would theoretically see if the residuals followed a normal distribution, against the quantiles for the residuals estimated from our regression model.
#Q-Q plot for normality fig4=sm.qqplot(reg3.resid, line='r')
The qqplot for our regression model shows that the residuals generally follow a straight line, but deviate at the lowere and higher quantiles. This indicates that our residuals do not follow a perfect normal distribution. This could mean that the curvilinear association that we observed in our scatter plot may not be fully estimated by the quadratic urban rate term.
We need to plot the standardized residuals distribution. The standardized residuals are simply the residual values that are transformed to have a mean of zero and a standard dev of 1:
# simple plot of residuals #First, convert the array of standardized residuals to a Pandas DataFrame: stdres=pandas.DataFrame(reg3.resid_pearson) #plot the scatter plot of residuals from this model, ls=None means do not draw line that connects residuals: plt.plot(stdres, 'o', ls='None') #draw horizontal line at y=0 l = plt.axhline(y=0, color='r') #draw horizontal green lines at y=1, y=-1 l = plt.axhline(y=1, color='g') l = plt.axhline(y=-1, color='g') #draw horiontal blue lines at y=2, y=-2 l = plt.axhline(y=2, color='b') l = plt.axhline(y=-2, color='b') #draw horiontal yellow lines at y=2.5, y=-2.5 l = plt.axhline(y=2.5, color='y') l = plt.axhline(y=-2.5, color='y')
plt.ylabel('Standardized Residual') plt.xlabel('Observation Number')
We see that most of residuals fall inside 1 std dev of the mean. Few residuals have values that are beyond 2 std dev of the mean. With the standard normal distribution, we woudl expect 95% of the points in this interval of 2 std dev. No point more than 3 std dev of the mean (no extreme outliers).
Here, none of the residuals exceed value of 2.5 (or maybe 1), but ther are around than 5% of points with an absoulte value of 2 (it looks like 8 points fall there).
We should include more explanatory variables to better explain the variability in our eployment rate
In order to determine how specific explanatory variables contribute to the fit of our model there are additional python plots.
2) Standardized residuals for all observations:
A) For URBANRATE:
# additional regression diagnostic plots #each unit is 12 points, so this will be a 960x640 figure: fig2 = plt.figure(figsize(12,8)) fig2 = sm.graphics.plot_regress_exog(reg3, "URBANRATE_c", fig=fig2)
For URBANRATE, the residuals are rather distributed evenly across all values of URBANRATE. In the partial regression plot, we can see a slightly better fit to the regression line, but still many values far from this line.
B) For INCOMEPERPERSON:
# additional regression diagnostic plots #each unit is 12 points, so this will be a 960x640 figure: fig2 = plt.figure(figsize(12,8)) fig2 = sm.graphics.plot_regress_exog(reg3, "INCOMEPERPERSON_c", fig=fig2)
For INCOMEPEPERSON, we see that the biggest absolute value of residuals are almost all sprad along the low values of INCOME per person rate. However, we can see also high absolute values for higher rates of income.
If we explore the partial regression for this variable, we can see again that the residuals are rather scattered along the regression line, most of them really far from the line. It looks again like the association is weak between income per person and employment after controlling for other explanatory variables.
C) For INTERNETUSERATE:
# additional regression diagnostic plots #each unit is 12 points, so this will be a 960x640 figure: fig2 = plt.figure(figsize(12,8)) fig2 = sm.graphics.plot_regress_exog(reg3, "INTERNETUSERATE_c", fig=fig2)
There is a funnel shaped pattern to the residuals, where we see that the absolute values of the residuals are large for small values of INTERNETUSE rate. They get smaller and closer to zero as the Internet use rate increases, but start to get larger when this internet rate increases again. This is consistent with the conclusion that the model does not predict employment rate as well for countries that have high or low levels of internet rate. The model is particularly poor in trying to predict the employment rate for countries with low internet use rate. This suggests that the relationship between employment rate and internet use rate may not be curvilinear, but polynomic, for example.
This suggests that, although INTERNETUSE rate shows a statistically significant relationship with EMPLOYRATE, this association is pretty weak after controlling for other variables, such as URBANIZATION rate or INCOME level rate.
3) Leverage plot:
Finally, we can examine a leverage plot to identify observations that have an inusual large influence on the estimation of the predicted value of the response variable, or that are outliers:
How much of the predicted scores for the other observations would differ if the observations in question would not be included in the analysis. The leverage always takes on values between 0 (no effect on the regression model) and 1:
# leverage plot fig3=sm.graphics.influence_plot(reg3, size=8) print(fig3)
INTERPRETATION:
There are a few outliers greater than an absolute value of 2 but with low leverage, meaning that they have small influence in the model. On the other hand, we see other outlisers that have a greater than average leverage. There is one punt 104, that seemsn to have a high influence. It is a high leverage but is not an outlier. We have some observations that are both high leverage and outliers: like 145.
FINAL CONCLUSIONS:
We are seeing some weaknesses in our model. Even if we include other polinomial terms in our regression model (like third degree polynomials for URBANRATE, or quadratic terms for INCOMEPERPERSON or INTERNETUSERATE), even if p-values are low enough, the beta coefficients are extremely low, around 0 in all cases except for URBANRATE:
reg4 = smf.ols('EMPLOYRATE ~ URBANRATE_c + I(URBANRATE_c**2) + I(URBANRATE_c**3)+ INTERNETUSERATE_c + I(INTERNETUSERATE_c**2)+ INCOMEPERPERSON_c + I(INCOMEPERPERSON_c**2) +I(INTERNETUSERATE_c**2)', data=sub1).fit() print (reg4.summary())
This means that INTERNETUSERATE or INCOMEPERPERSON do not seem to moderate too much the relationship between URBANRATE and EMPLOYRATE.
The model is misspecified, in the sense that there must be other confounders for which we do not have data. This is consistent with the fact that the EMPLOYRATE is associated with multiple and complex socioeconomic variables. Unfortunately, we do not have the data for controlling those variables.
Indeed, the level of aggregation (per country level, per year) is too high to be able to account for specific outliers effects, economic variables etc.
Some of these effects to account for may be:
- Economic crisis of 2008: Indeed, the gapmidner dataset contains data for some variables from 2007 and for others from 2008. This is indeed a big important fact, as the employment destruction in 2008 was huge in some countries.
- Political situations specific to the countries, which may lead to outliers.
- Not the rate of urbanization itself, but how this urbanization had been growing in the previous years. We would need more longitudinal data to account for this.
Python code:
%pylab inline %matplotlib inline
import numpy as numpyp import pandas as pandas import statsmodels.api as sm import statsmodels.formula.api as smf import seaborn as sbs import matplotlib.pyplot as plt
#LOAD THE DATASET AND CONVERT COLUMNS TO UPPER-CASE TO AVOID ERRORS data = pandas.read_csv('gapminder.csv', low_memory=False, error_bad_lines=False ) data.columns= map(str.upper, data.columns)
#upper-case all Dataframe column names data.columns= map(str.upper, data.columns)
#bug fix for display formats to avoid runtime errors pandas.set_option('display.float_format', lambda x: '%f' %x)
#Set PANDAS to show all columns in a Dataframe #(the default display of pandas has a limit) pandas.set_option('display.max_columns', None) pandas.set_option('display.max_rows', None)
#setting variables you will be working with to numeric data['INTERNETUSERATE'] = data['INTERNETUSERATE'].convert_objects(convert_numeric=True) data['URBANRATE'] = data['URBANRATE'].convert_objects(convert_numeric=True) data['INCOMEPERPERSON'] = data['INCOMEPERPERSON'].convert_objects(convert_numeric=True) data['EMPLOYRATE'] = data['EMPLOYRATE'].convert_objects(convert_numeric=True)
sub1=data.copy()
#center explanatory variables:
sub1['URBANRATE'] = sub1['URBANRATE'][~np.isnan(sub1['URBANRATE'])] sub1['URBANRATE_c']=sub1['URBANRATE'] - sub1['URBANRATE'].mean() sub1['URBANRATE_c'].describe()
sub1['INCOMEPERPERSON'] = sub1['INCOMEPERPERSON'][~np.isnan(sub1['INCOMEPERPERSON'])] sub1['INCOMEPERPERSON_c']=sub1['INCOMEPERPERSON'] - sub1['INCOMEPERPERSON'].mean() sub1['INCOMEPERPERSON_c'].describe()
sub1['INTERNETUSERATE'] = sub1['INTERNETUSERATE'][~np.isnan(sub1['INTERNETUSERATE'])] sub1['INTERNETUSERATE_c']=sub1['INTERNETUSERATE'] - sub1['INTERNETUSERATE'].mean() sub1['INTERNETUSERATE_c'].describe()


















