The role of the residential environment @mertsalov - Tumblr Blog

Results

Part 3. Preliminary Results

Descriptive Statistics

Table 1 shows descriptive statistics for life expectancy at birth and the quantitative predictors. The total average life expectancy was:

2012: 70.2 years (sd= 8.5) with a minimum life expectancy at birth of 48.8 years and a maximum of 83.1 years.

2013: 70.5 years (sd= 8.3) with a minimum life expectancy at birth of 48.9 years and a maximum of 83.3 years.

Table 1. Descriptive Statistics for Data Analytic Variables, 2012-2013 (N=195)

Bivariate Analyses

Scatter plots of total life expectancy against the quantitative predictors (Figure 1, Figure 2) revealed that life expectancy was longer in countries with larger GDP or adjusted net national income per capita. Taking the logarithm of the predictors gives a more linear relationship and a larger Pearson correlation coefficient (see Table 2).
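As a rough illustration of why the log transform helps, here is a minimal sketch that computes the Pearson correlation before and after taking logarithms. The numbers are made up for the example (they are not the World Bank values), and `pearson_r` is a helper name introduced here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative values, NOT the World Bank data:
# life expectancy rises by 4 years each time income doubles.
gdp_per_capita = [500, 1000, 2000, 4000, 8000, 16000, 32000, 64000]
life_expectancy = [50.0, 54.0, 58.0, 62.0, 66.0, 70.0, 74.0, 78.0]

r_raw = pearson_r(gdp_per_capita, life_expectancy)
r_log = pearson_r([math.log(g) for g in gdp_per_capita], life_expectancy)
print(f"r (raw GDP) = {r_raw:.3f}")
print(f"r (log GDP) = {r_log:.3f}")
```

Because the toy relationship is exactly linear in log income, the second coefficient is essentially 1, while the first is noticeably smaller.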

Table 2. Pearson’s correlation coefficients

Total life expectancy, together with the values of this indicator for men and women, is shown in Figure 3 (data are presented for 2012-2013).

Figure 3. Total life expectancy

ANOVA results (see Figure 4) indicated that total life expectancy differed significantly as a function of GDP group (see Table 3).

Figure 4. ANOVA results

Table 3. ANOVA results

Regression Analysis

Figures 5 and 6 show regression models built from the original features and from their logarithms. The models built on the log-transformed features fit best. The quality of the models built on predictor 1 (GDP or log GDP per capita) and on predictor 2 (adjusted net national income or log adjusted net national income per capita) is approximately the same.

Figure 5. Regression Analysis (data 2012)

Figure 6. Regression Analysis (data 2013)

Log adjusted net national income per capita was most strongly associated with total life expectancy at birth (Tables 4 and 5).

Table 4. Variable names and regression coefficients:

Table 5. R-square from training and test data

The mean squared error (MSE) for each fold is shown in Figure 7.

Figure 7. MSE

Total life expectancy at birth was shorter in countries with lower values of GDP or adjusted net national income per capita. Higher values of these indicators corresponded to a longer total life expectancy in the respective countries. These predictors accounted for 63% of the variance in total life expectancy.

#Descriptive Statistics #Bivariate Analyses #ANOVA #Regression Analysis

Methods

Part 2. Sample, measures, and analyses.

Sample

In this study the World Bank data set was used.

This World Bank data set is a subset of data extracted from the primary World Bank collection of development indicators, compiled from officially recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates. The data set consists of over 80 variables on N=248 countries for the years 2012 and 2013. All variables have valid observations for a minimum of 190 countries.

After excluding missing data (my selection criteria), there are 195 entries left in the data set.

Measures

Definitions for the variables that were analyzed

Response variable: Life expectancy at birth, male/female/total (years).

Life expectancy at birth indicates the number of years a newborn infant would live if prevailing patterns of mortality at the time of its birth were to stay the same throughout its life.

Aggregation Method: Weighted average.

Development Relevance: Mortality rates for different age groups (infants, children, and adults) and overall mortality indicators (life expectancy at birth or survival to a given age) are important indicators of health status in a country. Because data on the incidence and prevalence of diseases are frequently unavailable, mortality rates are often used to identify vulnerable populations. And they are among the indicators most frequently used to compare socioeconomic development across countries.

Limitations and Exceptions: Annual data series from United Nations Population Division's World Population Prospects are interpolated data from 5-year period data. Therefore they may not reflect real events as much as observed data.

Periodicity: Annual

Statistical Concept and Methodology: Life expectancy at birth used here is the average number of years a newborn is expected to live if mortality patterns at the time of its birth remain constant in the future. It reflects the overall mortality level of a population, and summarizes the mortality pattern that prevails across all age groups in a given year. It is calculated in a period life table which provides a snapshot of a population's mortality pattern at a given time. It therefore does not reflect the mortality pattern that a person actually experiences during his/her life, which can be calculated in a cohort life table. High mortality in young age groups significantly lowers the life expectancy at birth. But if a person survives his/her childhood of high mortality, he/she may live much longer. For example, in a population with a life expectancy at birth of 50, there may be few people dying at age 50. The life expectancy at birth may be low due to the high childhood mortality so that once a person survives his/her childhood, he/she may live much longer than 50 years.

Predictors included the gross domestic product (GDP) per capita (current US$) and adjusted net national income per capita (current US$) by country for 2012 and 2013.

Gross domestic product (GDP) is a monetary measure of the market value of all final goods and services produced in a period (quarterly or yearly) of time. Nominal GDP estimates are commonly used to determine the economic performance of a whole country or region, and to make international comparisons. Nominal GDP per capita does not, however, reflect differences in the cost of living and the inflation rates of the countries; therefore using a basis of GDP per capita at purchasing power parity (PPP) is arguably more useful when comparing differences in living standards between nations.

GDP per capita is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in current U.S. dollars.

Adjusted net national income is the gross national income (GNI) minus consumption of fixed capital and natural resources depletion. GNI is the total domestic and foreign output claimed by residents of a country, consisting of gross domestic product (GDP), plus factor incomes earned abroad by the country's residents, minus income earned in the domestic economy by nonresidents. Comparing GNI with GDP shows whether a nation's resources go toward domestic capital creation or flow abroad.

Aggregation Method: Weighted average.

Periodicity: Annual.

Analyses

The distributions for the predictors and the life expectancy response variable were evaluated by calculating the mean, standard deviation and minimum and maximum values for the quantitative variables.

Scatter plots and box plots were also examined, and Pearson correlation and Analysis of variance (ANOVA) were used to test bivariate associations between individual predictors and the life expectancy response variable.

Lasso regression with the least angle regression selection algorithm was used to identify the subset of variables that best predicted the life expectancy response variable. The lasso regression model was estimated on a training data set consisting of a random sample of 70% of the countries (N=136), and a test data set included the other 30% of the countries (N=59). All predictor variables were standardized to have a mean of 0 and a standard deviation of 1 prior to conducting the lasso regression analysis. Cross validation was performed using k-fold cross validation with 5 folds. The change in the cross validation mean squared error at each step was used to identify the best subset of predictor variables. Predictive accuracy was assessed by determining the mean squared error of the model estimated on the training data when applied to observations in the test data set.
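The data preparation steps described above (standardizing, the 70/30 split, and the 5 folds) can be sketched with the Python standard library alone. This is an illustration under assumptions, not the program used for the analysis: the helper names, the seed, and the use of the sample standard deviation are all choices made here.

```python
import random
import statistics

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (sample SD assumed)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def train_test_split(indices, train_fraction, seed):
    """Shuffle indices reproducibly and split them into train and test lists."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_train = int(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

def k_folds(indices, k):
    """Deal already-shuffled indices into k folds of near-equal size."""
    return [indices[i::k] for i in range(k)]

n_countries = 195  # sample size reported in the text
train_idx, test_idx = train_test_split(range(n_countries), 0.7, seed=123)
folds = k_folds(train_idx, 5)

print(len(train_idx), len(test_idx))  # 136 59
print([len(f) for f in folds])

# Standardizing a toy predictor (made-up values).
z = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
print([round(v, 3) for v in z])
```

The split sizes match the N=136 and N=59 reported above. The lasso fit itself is left out, since it needs a statistics library.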

#Sample #Measures #Analyses #GDP #GNI #Life Expectancy

The Association between Income and Life Expectancy

Part 1. Title and Introduction to the Research Question

The purpose of this study was to determine the relationship between a country's gross domestic product (GDP) per capita, adjusted net national income per capita and life expectancy. GDP and adjusted net national income are the most accurate characteristics that determine the level of the country’s economic development.

As an ordinary person, I want to know how life expectancy differs between countries with different levels of economic development. In addition, I'm interested to know whether life expectancy is related to such external factors as the level of economic development, and whether there is a significant difference in life expectancy between countries with the same GDP and/or adjusted net national income.

Getting the answer to this question is also of practical importance. As people become more educated, live longer and maintain good health, older people, as never before, can and do make a more meaningful contribution to society. By promoting their active participation in the life of society and its development, we can ensure that their invaluable talents and experience are well used. Older people who can and want to work should be able to do this.

#Life Expectancy #Income

Running a k-means Cluster Analysis

Part 16

A k-means cluster analysis was conducted to identify underlying subgroups of the countries based on their similarity of responses on 5 variables that represent characteristics that could have an impact on alcohol consumption rate per adult (variable alcconsumption) in these countries.

Clustering variables included quantitative variables measuring

polityscore – overall polity score, calculated by subtracting an autocracy score from a democracy score and describes the summary measure of a country's democratic and free nature: -10 is the lowest value, 10 the highest,

urbanrate (urban population [% of total]),

incomeperperson (Gross Domestic Product per capita in constant 2000 US$),

internetuserate (Internet users [per 100 people]),

employrate (total employees age 15+ [% of population]).

All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.

This data set has a relatively small number of observations, so it was not split into training and test data sets.

The program in SAS

The program in Python [link]

Results

A series of k-means cluster analyses were conducted on the data set specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
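To make the r-square used in the elbow curve concrete, here is a small dependency-free sketch of k-means with Euclidean distance and the share of variance accounted for by the clusters (1 minus within-cluster sum of squares over total sum of squares). The toy points, the function names, and the deterministic start centroids are assumptions made for the example. They are not the SAS procedure's seeding and not the gapminder data.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm. Start centroids are evenly spaced rows of the
    data: a deterministic simplification chosen for this sketch."""
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centre.
        new_centers = [centroid(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

def r_square(points, centers, clusters):
    """Proportion of total variance accounted for by the clusters."""
    grand = centroid(points)
    total_ss = sum(squared_distance(p, grand) for p in points)
    within_ss = sum(squared_distance(p, centers[j])
                    for j, c in enumerate(clusters) for p in c)
    return 1 - within_ss / total_ss

# Toy two-dimensional points (the analysis above uses 5 standardized variables).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
r2_by_k = {}
for k in range(1, 4):
    centers, clusters = kmeans(points, k)
    r2_by_k[k] = r_square(points, centers, clusters)
    print(k, round(r2_by_k[k], 3))
```

With one cluster the r-square is 0 by construction. The jump from k=1 to k=2 followed by a tiny gain at k=3 is the kind of bend an elbow curve is meant to reveal.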

Elbow curve of r-square values for the nine cluster solutions

The elbow curve was inconclusive, suggesting that the 2, 3 and 7-cluster solutions might be interpreted. The results below are for an interpretation of the 3-cluster solution.

The first thing we want to try is to graph the clusters in a scatter plot to see whether or not they overlap with each other in terms of their location in the p-dimensional space (here p=5 variables). Five dimensions are impossible to visualize directly: a scatter plot works for two or three dimensions, but not for 5.

So what we're going to do is use canonical discriminant analysis, which is a data reduction technique that creates a smaller number of variables that are linear combinations of the 5 clustering variables.

The new variables, called canonical variables, are ordered in terms of the proportion of variance in the clustering variables that is accounted for by each of them. So the first canonical variable will account for the largest proportion of the variance, the second canonical variable will account for the next largest proportion, and so on. Usually, the majority of the variance in the clustering variables will be accounted for by the first couple of canonical variables, and those are the variables we can plot.

The results show that the 5 variables are now reduced to two canonical variables that can be used to visualize the location of the clusters in a two-dimensional space.

Let's just plot the two canonical variables using the sgplot procedure. We will use the data set from the canonical discriminant analysis that includes the canonical variables, which we called cluscan. Here's the scatter plot.

The observations in clusters 1, 2 and 3 are not densely packed, but the clusters do not overlap with each other, meaning that they are only weakly related to one another and that the within-cluster variance is relatively low. The clusters are relatively distinct. This suggests that the best cluster solution may have three clusters.

Take a look at the cluster means table to examine the patterns of means on the clustering variables for each cluster.

The means on the clustering variables show that, compared to the other clusters, countries in cluster 2 had low levels on most clustering variables. They had relatively low polity score, urban rate, income per person and internet use rate, but a moderate employment rate. Cluster 1 had higher levels on the clustering variables than cluster 2, but lower levels than cluster 3. Countries in cluster 3 had the highest levels on almost all clustering variables. The only exception is polity score, which is somewhat lower than in cluster 1.

Finally let's see how the clusters differ in alcohol consumption rate.

In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on alcohol consumption rate. The model statement specifies the model with alcconsumption as the response variable and cluster as the explanatory variable. Then, because our categorical cluster variable has three categories, we requested a Tukey's test to evaluate post hoc comparisons between the clusters. Here are the results.

The box plot shows the mean alcohol consumption rate by cluster.

Countries in cluster 2 had the lowest alcohol consumption rate. Countries in clusters 1 and 3 had the highest rates of alcohol consumption, roughly equal to each other.

Results indicated significant differences between the clusters on alcohol consumption (F(2, 53)=7.44, p=0.0014).
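For readers who want to see where the F statistic and its degrees of freedom come from, here is a minimal one-way ANOVA sketch. The three small groups are invented for the example and `one_way_anova` is a name introduced here. In the result above, F(2, 53) corresponds to 3 clusters (3 - 1 = 2) and 56 countries (56 - 3 = 53).

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Between-group sum of squares: group size times squared deviation
    # of the group mean from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares: squared deviations from each group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Invented toy groups, not the cluster data.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova(toy_groups)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

The p-value is omitted because it needs the F distribution, which the standard library does not provide.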

The Tukey's test shows that the clusters differed significantly in mean alcohol consumption per adult, with the exception of clusters 1 and 3, which did not differ significantly from each other. Countries in cluster 2 had the lowest alcohol consumption (mean=5.00, sd=6.02), and clusters 1 and 3 had the highest alcohol consumption (mean=10.99, sd=4.45) and (mean=10.48, sd=4.09) respectively.

#cluster analysis

Running a Lasso Regression Analysis

Part 15

A lasso regression analysis was conducted to identify a subset of variables from a pool of 7 categorical and quantitative predictor variables that best predicted a quantitative response variable measuring alcohol consumption per adult. The categorical predictor was polityscore (overall polity score, calculated by subtracting an autocracy score from a democracy score; it summarizes a country's democratic and free nature, with -10 the lowest value and 10 the highest). Quantitative predictor variables included urbanrate (urban population [% of total]), incomeperperson (Gross Domestic Product per capita in constant 2000 US$), internetuserate (Internet users [per 100 people]), employrate (total employees age 15+ [% of population]), oilperperson (oil consumption per capita [tonnes per year and person]) and relectricperperson (residential electricity consumption, per person [kWh]). All predictor variables were standardized to have a mean of zero and a standard deviation of one.

The program in SAS

The program in Python [link]

Results

The first thing we see is some information about the SURVEYSELECT procedure we used to split the observations in the total data set, into training and test data. Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations.

Next, we see the information about the lasso regression. It shows the dependent variable, alcconsumption (alcohol consumption per adult), and the selection method that was used. The criterion for choosing the best model was 10-fold cross validation, with random assignment of observations to the folds. The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.

We can also see the total number of observations in the data set and the number of observations used for training and testing the statistical models. The number of parameters to be estimated is 8: the intercept plus the 7 predictors.

Next is the table with the LAR selection information. It shows the steps in the analysis and the variable that is entered at each step. The ASE and Test ASE are the average squared error, which is the same as the mean squared error, for the training data and the test data. At the beginning there are no predictors in the model, just the intercept. Then variables are entered one at a time in order of the magnitude of the reduction in the mean, or average, squared error, so they are ordered in terms of how important they are in predicting alcohol consumption. According to the lasso regression results, the most important predictor of alcohol consumption was Internet use rate, followed by oil consumption per capita and so on.

You can also see how the average squared error declines as variables are added to the model, indicating that the prediction accuracy improves as each variable is added. The CV PRESS shows the residual sum of squares summed across the cross-validation folds. There's an asterisk at step 2: this is the model selected as the best model by the procedure. You can see that this is the model with the lowest summed residual sum of squares, and that adding other variables to this model actually increases it.

Finally, you could also see that the training data ASE continues to decline as variables are added. This is to be expected as model complexity increases. This is an example of the bias variance tradeoff.

The first plot shows the change in the regression coefficients at each step, and the vertical line represents the selected model.

This plot shows the relative importance of the predictor selected at each step of the selection process, how the regression coefficients changed with the addition of a new predictor at each step, and the step at which each variable entered the model.

In our case, as also indicated in the summary table above, internetuserate and oilperperson had the largest regression coefficients, followed by employrate. Oilperperson and employrate were negatively associated with alcohol consumption, and internetuserate was positively associated with alcohol consumption.

The lower plot shows how the chosen selection criterion, in this case CVPRESS (the residual sum of squares summed across all the cross-validation folds in the training set), changes as variables are added to the model. Initially it decreases rapidly, and then at a certain point it begins to increase (the point at which adding more predictors leads to little or no reduction in the residual sum of squares).

The next plot shows at which step in the selection process different selection criteria would choose the best model.

The other criteria selected more complex models than the criterion based on cross validation, possibly selecting an overfitted model.

The next plot shows the change in the validation mean squared error at each step.

The final plot shows the change in the average or mean square error at each step in the process.

As expected, the selected model was less accurate in predicting alcohol consumption rate in the test data, but the test average squared error at each step was pretty close to the training average squared error overall. This suggests that prediction accuracy was pretty stable across the two data sets.

Finally, the output shows the R-Square (23.43%) and adjusted R-Square (19.3%) for the selected model and the mean square error for both the training and test data.

#lasso regression

Running a Random Forest

Part 14

Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable. The following explanatory variables were included as possible contributors to a random forest evaluating alcohol consumption rate (my response variable): urban rate, Gross Domestic Product per capita, Democracy score, and Internet users (per 100 people).

The program in SAS

The program in Python [link]

Results

We can see in the model information section that variables to try is equal to 2, indicating that a random selection of two explanatory variables was selected to test each possible split for each node in each tree within the forest.

By default, SAS will grow 100 trees (Maximum Trees = 100) and select 60% of the sample when performing the bagging process (Inbag Fraction = 0.6). The prune fraction specifies the fraction of training observations that are available for pruning a split. The value can be any number from 0 to 1, although a number close to 1 would leave little to grow the tree. The value of Prune Fraction is 0. In other words, the default is not to prune.

Leaf size specifies the smallest number of training observations that a new branch can have. The value of Leaf Size Setting is 1.

The split criterion used in HPFOREST is the Gini index.
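As a reminder of what the Gini index measures, the sketch below computes the Gini impurity of a set of class labels and the reduction in impurity achieved by a candidate split. The function names and the toy labels are introduced here for illustration and are not HPFOREST internals.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions (0 = pure node)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_reduction(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
             + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

# Toy binary labels: 1 = high alcohol consumption, 0 = low.
parent = [1, 1, 0, 0]
print(gini_impurity(parent))                   # 0.5
print(gini_reduction(parent, [1, 1], [0, 0]))  # 0.5
```

A split that separates the two classes perfectly removes all of the parent's impurity, which is why the reduction equals the parent's Gini impurity here.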

In terms of missing data, if the value of our target or response variable is missing, the observation is excluded from the model. If the value of an explanatory variable is missing, PROC HPFOREST uses the missing value as a legitimate value by default.

Notice, too, that the number of observations read from my data set was 213. Within the baseline fit statistics output, you can see the misclassification rate of the random forest. Here we see that the forest misclassified 43.7% of the sample, suggesting that it correctly classified 56.3% of the sample.

The first ten and last ten observations of the fit statistics table.

PROC HPFOREST computes fit statistics for a sequence of forests that have an increasing number of trees. Forest models provide an alternative estimate of average square error and misclassification rate, called the out of bag or OOB estimate. The OOB estimate is a convenient substitute for an estimate that is based on test data and is a less biased estimate of how the model will perform on future data. We end up with near perfect prediction in the training samples as the number of trees grown gets closer to 100. When those same models are tested on the out of bag sample, the misclassification rate is around 24%.

The final table in our output represents arguably the largest contribution of random forests: the variable importance rankings.

The number of rules column shows the number of splitting rules that use a variable. Each measure is computed twice, once on training data and once on the out of bag data.

The rows are sorted by the out of bag Gini measure or OOB Gini measure. The variables are listed from highest importance to lowest importance in predicting alcohol consumption rate.

Here we see that some of the most important variables in predicting alcohol consumption rate include urban rate, Gross Domestic Product per capita, Democracy score, Internet users.

Summary

The explanatory variable with the highest relative importance score was polity score. The accuracy of the random forest was 70%, with the subsequent growing of multiple trees rather than a single tree adding little to the overall accuracy of the model, suggesting that interpretation of a single decision tree may be appropriate.

#Random Forest

Running a Classification Tree

Part 13

Program in SAS

The program in Python [link]

Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable. All possible separations (categorical) or cut points (quantitative) are tested. For the present analyses, the entropy “goodness of split” criterion was used to grow the tree and a cost complexity algorithm was used for pruning the full tree into a final subtree.
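The entropy criterion mentioned above can be illustrated in a few lines. This is a sketch with invented labels and helper names, not the SAS implementation: it computes the entropy of a node and the information gain of a candidate split.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of the class distribution in a node."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) \
             + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy binary labels: 1 = high alcohol consumption, 0 = low.
parent = [1, 1, 0, 0]
print(entropy(parent))                           # 1.0
print(information_gain(parent, [1, 1], [0, 0]))  # 1.0
```

When growing a tree, the cut point with the largest gain among all candidates would be chosen at each node.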

The following explanatory variables were included as possible contributors to a classification tree model evaluating alcohol consumption rate (my response variable): urbanrate, polityscore, incomeperperson, internetuserate.

Take a look at the output.

We can see in the model information table that the decision tree that SAS grew has 30 leaves before pruning and 3 leaves following pruning. The model event level lets us confirm that the tree is predicting the value 1, that is, high-level (> 6 liters per adult) alcohol consumption (our target variable). That is why it is necessary to recode low-level alcohol consumption to 2, keeping 1 equal to high-level alcohol consumption.

Notice too that the number of observations read from the gapminder data set was 152 and the number of observations used was also 152. These two numbers are equal because the data management part of the program kept only observations with valid data for the target variable and every explanatory variable. Observations with missing data on even one variable were set aside.

Next, SAS creates a plot of the cross-validated average squared error (ASE) against the number of leaves in each of the trees grown on the training sample.

A vertical reference line is drawn for the tree with the number of leaves that has the lowest cross-validated ASE, in this case the 3-leaf tree. The horizontal reference line represents the average squared error plus one standard error for this complexity parameter. The 1-SE rule is often applied when pruning via the cost-complexity method, to potentially select a smaller tree that has only a slightly higher error rate than the minimum ASE. Selecting the smallest tree that has an ASE below the horizontal reference line in effect implements the 1-SE rule.
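The 1-SE rule can be written down in a few lines. The cross-validation figures below are hypothetical (they are not the values from the SAS plot) and `one_se_rule` is a name introduced for this sketch.

```python
def one_se_rule(cv_results):
    """cv_results: list of (n_leaves, ase, standard_error) tuples.
    Return the leaf count of the smallest tree whose cross-validated ASE
    is no more than the minimum ASE plus one standard error."""
    best = min(cv_results, key=lambda r: r[1])
    threshold = best[1] + best[2]  # the horizontal reference line
    eligible = [r for r in cv_results if r[1] <= threshold]
    return min(eligible, key=lambda r: r[0])[0]

# Hypothetical cross-validation results: (leaves, ASE, standard error).
cv_results = [
    (1, 0.250, 0.020),
    (2, 0.190, 0.018),
    (3, 0.150, 0.015),
    (5, 0.145, 0.016),  # lowest ASE
    (8, 0.160, 0.017),
]
print(one_se_rule(cv_results))  # 3
```

In this made-up example the 5-leaf tree has the lowest ASE, but the 3-leaf tree falls under the reference line and is smaller, so the rule prefers it.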

The final smaller tree

Model-based confusion matrix which shows how well the final classification tree performed.

The total model classified 81.6% of the sample correctly, with 67.1% sensitivity and 96.1% specificity.

The total model correctly classified 1-0.3289=67.11% of those countries which have high-level alcohol consumption and 96.05% of those which have low-level alcohol consumption.
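The relationship between these percentages is easiest to see from the confusion matrix counts. The counts below are an assumption: they are not taken from the SAS output, but were chosen so that they add up to the 152 observations and reproduce the reported rates under an even 76/76 split between high and low consumption.

```python
def classification_rates(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity

# Assumed counts, consistent with the reported percentages (not from the output).
tp, fn = 51, 25  # high-consumption countries: correctly / incorrectly classified
tn, fp = 73, 3   # low-consumption countries: correctly / incorrectly classified

accuracy, sensitivity, specificity = classification_rates(tp, fn, tn, fp)
print(f"accuracy    = {accuracy:.1%}")
print(f"sensitivity = {sensitivity:.2%}")
print(f"specificity = {specificity:.2%}")
```

Under these assumed counts, the 0.3289 in the text is the false negative rate, 25/76.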

ROC-curve shows sensitivity, that is the true positive rate, and specificity, the true negative rate plotted against each other.

Variable importance table

The importance score measures a variable's ability to mimic the chosen tree and to act as a stand-in for variables appearing as primary splits. Here we see that the important variables are internetuserate and polityscore.

#Classification Tree

Test a Logistic Regression Model

Part 12

Program in SAS

Program in Python [link]

Summary

Frequency tables

Model 1 (alcgroup ~ urbangroup)

Notice that our regression is significant, with a p-value <0.0001.

The odds ratio is greater than 1 (OR = 5.252, 95% CI = 2.666-10.347, p-value <0.0001), which means that the odds of belonging to alcohol consumption group 1 (alcconsumption > 5.795 liters per adult) are higher among high urban rate countries (urbanrate > 56.75%) than among those with low urban rate.

Countries with high urban rate have 5.252 times the odds of higher alcohol consumption (more than 5.795 liters per adult) compared with low urban rate countries.
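For a single binary predictor, the odds ratio from logistic regression is the cross-product ratio of the 2x2 frequency table, and a Wald 95% confidence interval can be computed on the log scale. The sketch below uses hypothetical counts (the frequency table values are not reproduced in this post), and `odds_ratio_ci` is a name introduced here.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.
    a: exposed with outcome,   b: exposed without outcome,
    c: unexposed with outcome, d: unexposed without outcome."""
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Hypothetical counts: rows = high / low urban rate,
# columns = high / low alcohol consumption.
odds_ratio, lower, upper = odds_ratio_ci(a=20, b=10, c=10, d=20)
print(f"OR = {odds_ratio:.3f}, 95% CI = {lower:.3f}-{upper:.3f}")
```

An interval that excludes 1, as in the result above, corresponds to a significant association at the 0.05 level.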

Model 2 (alcgroup ~ urbangroup + politygroup)

What happens when we control for polity score?

As you can see, both urban rate and democracy score are independently associated with the likelihood of having high alcohol consumption per adult. Both urbangroup and politygroup are positively associated with the likelihood of being a country with high alcohol consumption, and both predictors are binary, so the odds ratios can be interpreted as follows.

The countries with high urban rate have 4.709 times (95% CI = 2.301-9.637) the odds of high alcohol consumption per adult compared with the countries with low urban rate, after controlling for democracy score. Also, the countries with a positive democracy score have 4.796 times (95% CI = 2.148-10.707) the odds of high alcohol consumption per adult compared with the countries with a negative democracy score, after controlling for urban rate. Because the confidence intervals on our odds ratios overlap, we cannot say that polity score is more strongly associated with the level of alcohol consumption per adult than urban rate is.

Model 3 (alcgroup ~ urbangroup + incomeperpersongroup)

If we take a new explanatory variable, Gross Domestic Product per capita in constant 2000 US$, convert it to a two-level variable (level 0: incomeperperson <= 2191 and level 1: incomeperperson > 2191), and add this new variable to the first model, we get the following result.

Urban rate is no longer significantly associated with alcohol consumption rate (p-value = 0.1246). Here we have an example of confounding. The level of the income per person confounds the relationship between urban rate and alcohol consumption rate because the p-value for urban rate is no longer significant when income per person level is included in the model.

Further, because urban rate is no longer associated with the alcohol consumption rate, we wouldn't interpret the corresponding odds ratio, but would interpret the significant odds ratio between income per person level and alcohol consumption rate. That is, the countries with a high level of income per person have 4.562 times the odds of a high rate of alcohol consumption per adult compared with the countries with a low level of income per person, after controlling for urban rate.

#logistic regression

Test a Basic Linear Regression Model

Part 11

The program in SAS [link]

Reading data

Polynomial Regression

Here is a scatter plot showing a linear association between the overall polity score for countries and alcohol consumption per adult from the gapminder data set. That is, we can draw a straight line through the scatter plot, and this regression line does a reasonably good job of capturing the association.

You can see that it looks like alcohol consumption rate increases as democracy score increases. We can actually fit a line that curves, to better fit the association, by adding a polynomial term. For example, we could add a quadratic term, to draw the line of best fit that captures the curvature that we're seeing.

Now the scatterplot shows the original linear regression line in blue and the quadratic regression line in green. Notice how the quadratic line does a better job of capturing the association at lower and higher democracy score. The points at these levels are closer to the quadratic, or second-order polynomial curve. Meaning that the expected or predicted values are closer to the observed values. So based on just looking at the two curves, it looks like the green quadratic curve fits the data better than the blue straight line.

But we can be even more sure of this conclusion if we test whether adding a second-order polynomial term to our regression model gives us a significantly better fitting model. We do this by simply adding another variable, the squared value of the explanatory x variable (x squared), to the regression model.

First, let's test the regression model for just the linear association between democracy score and alcohol consumption rate using the ols function from the statsmodels API formula library. Note that we have centered the quantitative explanatory variable, democracy score, as polityscore_c. Centering is especially important when testing a polynomial regression model because it makes the regression coefficients considerably easier to interpret.
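To make the centering step concrete, here is a minimal sketch in plain Python. The scores are made up for illustration, and the name polityscore_c_sq for the squared term is my own assumption; only polityscore_c appears in the program.

```python
from statistics import mean

# Hypothetical polity scores (the real values come from the Gapminder data).
polityscore = [-10, -7, -3, 0, 4, 8, 9, 10]

# Centering: subtract the sample mean so the centered variable has mean 0.
polity_mean = mean(polityscore)
polityscore_c = [x - polity_mean for x in polityscore]

# Quadratic term: the square of the centered variable
# (polityscore_c_sq is a hypothetical name used only in this sketch).
polityscore_c_sq = [x ** 2 for x in polityscore_c]

print(polity_mean, sum(polityscore_c))
```

After centering, the intercept of the model is the expected response at the average democracy score rather than at a score of zero.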

If we look at the results, we can see from the significant p-value and positive parameter estimate (=0.3541) that alcohol consumption rate is positively associated with countries' democracy score. So the linear association, the blue line in the scatter plot, is statistically significant. But the R-square is 18.6%, indicating that the linear association with democracy score captures only about 18.6% of the variability in alcohol consumption. What happens if we allow that straight line to curve by adding a second-order polynomial term to the regression equation? The Python code to do this is here.

When we look at the table of results, we see that the value for the linear term for democracy score is positive (=0.5250), and the p-value is less than 0.05. In addition, the quadratic term is positive and significant, indicating that the curvilinear pattern we observed in our scatter plot is statistically significant. The R-square also increases to 23.3%, which means that adding the quadratic term for democracy score increases the amount of variability in alcohol consumption rate that can be explained by democracy score from 18.6% to 23.3%. Together, these results suggest that the best fitting line for this association is one that includes some curvature.

Evaluating Model Fit

Let's add another centered explanatory variable, urbanrate, to our regression equation. Here's the regression equation for this model and the Python code. This is the same Gapminder model that we tested previously, except that we have added the centered urbanrate_c explanatory variable. And here are the results.

As you can see from the table above, the coefficients for the linear and quadratic democracy score variables remain significant after adjusting for the urban rate. The urban rate is also statistically significant. The positive regression coefficient indicates that countries with a high urban rate tend to have a higher alcohol consumption rate.

In fact, urban rate and democracy score together, explain about 25.3% of the variability in alcohol consumption rate. So, there's clearly some error in estimating the response value with this model. In this regression model, the residual is the difference between the predicted alcohol consumption rate and the actual observed alcohol consumption rate for each country.

We can take a look at this residual variability, which not only helps us to see how large the residuals are but also allows us to see whether our regression assumptions are met. And whether there are any outlying observations, that might be unduly influencing the estimation of the regression coefficient.

The easiest way to evaluate residuals is to graph them. First, we can use a qq-plot to evaluate the assumption that the residuals from our regression model are normally distributed. A qq-plot plots the quantiles of the residuals that we would theoretically see if the residuals followed a normal distribution against the quantiles of the residuals estimated from our regression model. The Python code to generate a qq-plot is here.
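Separately from the linked program, here is a minimal sketch of what a qq-plot actually pairs up, using only the Python standard library. The residuals are made up, and the plotting positions (i - 0.5) / n are one common convention that I am assuming here; plotting libraries may use a slightly different one.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Return (theoretical, observed) quantile pairs for a normal qq-plot."""
    n = len(residuals)
    m, s = mean(residuals), stdev(residuals)
    # Observed quantiles: the standardized residuals in increasing order.
    observed = sorted((r - m) / s for r in residuals)
    # Theoretical quantiles of the standard normal distribution at the
    # assumed plotting positions (i - 0.5) / n.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))

# Hypothetical residuals, for illustration only.
example_residuals = [-2.1, -0.8, -0.3, 0.0, 0.2, 0.5, 0.9, 3.0]
points = qq_points(example_residuals)

for theo, obs in points:
    print(f"{theo:6.2f} {obs:6.2f}")
```

If the residuals were perfectly normal, the pairs would fall on a straight line; departures at the two ends correspond to the deviations at the lower and higher quantiles.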

The qq-plot for our regression model shows that the residuals generally follow a straight line but deviate at the lower and higher quantiles. This indicates that our residuals do not follow a perfectly normal distribution. This could mean that the curvilinear association that we observed in our scatter plot may not be fully estimated by the quadratic democracy score term. There might be other explanatory variables that we could include in our model to improve estimation of the observed curvilinearity.

To evaluate the overall fit of the predicted values of the response variable to the observed values and to look for outliers, we can examine a plot of the standardized residuals for each of the observations.

If we take a look at this plot, we see that most of the residuals fall within 2 standard deviations of the mean; that is, they lie between -2 and 2. Only a few countries (=7) have residuals that are more than 2 standard deviations above or below the mean of 0.

With the standard normal distribution, we would expect about 95% of the residuals to fall within two standard deviations of the mean. Residual values that are more than two standard deviations from the mean in either direction are a warning sign that we may have some outliers. Here, 7 observations lie more than two standard deviations from the mean, so we may have some outliers.

1.9% of the residuals exceeded an absolute value of 2.5 and 4.4% were greater than an absolute value of 2.0. This suggests that the fit of the model is relatively poor and could be improved. The biggest contributor to poor model fit is leaving out important explanatory variables. In order to improve the fit of this model, we should include more explanatory variables to better explain the variability in our alcohol consumption rate response variable.
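The percentages above are simple counts of standardized residuals beyond a cutoff. Here is a minimal sketch of that calculation, together with the share a standard normal distribution would put beyond the same cutoff. The residual values are made up for illustration.

```python
from statistics import NormalDist

def share_beyond(std_residuals, cutoff):
    """Percentage of standardized residuals with absolute value above cutoff."""
    count = sum(1 for r in std_residuals if abs(r) > cutoff)
    return 100.0 * count / len(std_residuals)

def normal_share_beyond(cutoff):
    """Percentage expected beyond +/- cutoff under a standard normal distribution."""
    return 100.0 * 2 * (1 - NormalDist().cdf(cutoff))

# Hypothetical standardized residuals, for illustration only.
std_resid = [-2.7, -1.4, -0.6, -0.2, 0.1, 0.3, 0.8, 1.1, 2.2, 0.4]

pct_over_2 = share_beyond(std_resid, 2.0)    # 2 of the 10 values
pct_over_2_5 = share_beyond(std_resid, 2.5)  # 1 of the 10 values

print(pct_over_2, pct_over_2_5, round(normal_share_beyond(2.0), 2))
```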

The following Python code can be used to generate a few more plots to help us determine how specific explanatory variables contribute to the fit of our model.

Summary

The primary plots of interest are the plot in the upper right-hand corner, which shows the residuals for each observation at different values of urban rate, and the partial regression plot in the lower left-hand corner. Because we have multiple explanatory variables, we might also want to take a look at the contribution of each individual explanatory variable to model fit, controlling for the other explanatory variables. One type of plot that does this is the partial regression residual plot.

The plot in the lower left-hand corner is a partial regression residual plot. It attempts to show the effect of adding urban rate as an additional explanatory variable to the model, given that one or more explanatory variables are already in the model. For the urban rate variable, the values in the scatter plot are two sets of residuals. The residuals from a model predicting the alcohol consumption rate response from the other explanatory variables, excluding urban rate, are plotted on the vertical axis, and the residuals from the model predicting urban rate from all the other explanatory variables are plotted on the horizontal axis. What this means is that the partial regression plot shows the relationship between the response variable and a specific explanatory variable after controlling for the other explanatory variables.
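The two sets of residuals can be sketched in a few lines of plain Python. This is a simplified illustration with made-up data and only one other explanatory variable (the real model also has a quadratic polity term); the response is constructed without noise so that the adjusted effect of urban rate is known to be 2.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def residuals(x, y):
    """Residuals (observed minus predicted) from regressing y on x."""
    a, b = simple_ols(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Hypothetical data: the response is built as 1 + 2*urban + 3*polity, no noise.
polity = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
urban = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0]
alcohol = [1 + 2 * u + 3 * p for u, p in zip(urban, polity)]

# Vertical axis: response residuals after removing the other explanatory variable.
resid_y = residuals(polity, alcohol)
# Horizontal axis: urban rate residuals after removing the other explanatory variable.
resid_x = residuals(polity, urban)

# The slope through these points is the adjusted effect of urban rate.
_, partial_slope = simple_ols(resid_x, resid_y)
print(round(partial_slope, 6))
```

The slope of the line through the plotted points equals the urban rate coefficient from the full model, which is why the plot can be read as the association after controlling for the other variables.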

When we take a look at the plot for urban rate in the lower left-hand corner, we see that, in contrast to the plot of the residuals at different values of urban rate without adjusting for the polity score variables, which is shown above, the partial regression plot for urban rate does not clearly indicate a nonlinear association. Rather, the residuals are spread out in a random pattern around the partial regression line. In addition, many of the residuals are pretty far from this line, indicating a great deal of alcohol consumption rate prediction error. This suggests that although urban rate shows a statistically significant association with alcohol consumption rate, this association is pretty weak after controlling for democracy score.

Let's take a look at a leverage plot to identify observations that have an unusually large influence on the estimation of the predicted value of the response variable, alcohol consumption rate, or that are outliers, or both. The leverage of an observation can be thought of in terms of how much the predicted scores for the other observations would differ if the observations in question were not included in the analysis. The leverage always takes on values between 0 and 1. A point with zero leverage has no effect on the regression model. And outliers are observations with residuals greater than 2 or less than -2. We use the following Python code to generate a leverage plot.
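Separately from the linked program, here is a minimal sketch of how leverage is computed in the simplest case, a regression with an intercept and a single explanatory variable. That simplification is my assumption for illustration; the model in this post has several explanatory variables. The data are made up.

```python
from statistics import mean

def leverage_simple(x):
    """Hat values for a regression with an intercept and one predictor x."""
    n = len(x)
    mx = mean(x)
    sxx = sum((xi - mx) ** 2 for xi in x)
    # Leverage grows with the squared distance of x from its mean.
    return [1 / n + (xi - mx) ** 2 / sxx for xi in x]

# Hypothetical predictor values; the last one sits far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
h = leverage_simple(x)

print([round(v, 3) for v in h])
```

In this one-predictor case the leverage values add up to 2, the number of estimated coefficients, so a value well above 2/n marks an observation with higher than average leverage.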

One of the first things we see in the leverage plot is that we have a few outliers, countries that have residuals greater than 2 or less than -2. We've already identified some of these outliers in some of the other plots we've looked at, but this plot also tells us that these outliers have small or close to zero leverage values, meaning that although they are outlying observations, they do not have an undue influence on the estimation of the regression model. On the other hand, we see that there are a few cases with higher than average leverage. Three in particular stand out in terms of having an influence on the estimation of the predicted value of alcohol consumption rate. These observations have high leverage but are not outliers. We don't have any observations that are both high leverage and outliers.

SAS results [link]

#Linear Regression

Test a Basic Linear Regression Model

Part 10

The program in SAS

The program in Python [links]

Results

Quantitative explanatory variable

Categorical explanatory variable

Summary

Quantitative explanatory variable

The F-statistic is 14.73 and the p-value is very small (p = 0.0002), considerably less than our alpha level of 0.05. This tells us that we can reject the null hypothesis and conclude that urban rate is significantly associated with alcohol consumption per adult.

The coefficients of the linear regression (model: alcconsumption ~ zurbanrate) are 6.85 for the intercept and 1.41 for the slope.

Categorical explanatory variable

The F-statistic is 17.84 and the p-value is very small (p < 0.0001), considerably less than our alpha level of 0.05. This tells us that we can reject the null hypothesis and conclude that polity group is significantly associated with alcohol consumption per adult.

The results of the linear regression model indicated that polity group (beta=2.93, p<0.0001) was significantly and positively associated with the alcohol consumption per adult.

#linear regression

Information About the Data

Part 9

Sample

Gapminder contains data for all 192 UN members (as of April 2008), aggregating data for Serbia and Montenegro. Additionally, it includes data for 24 other areas, generating a total of 213 areas (n = 213).

Gapminder data include over 200 indicators for countries, including gross domestic product, total employment rate, and estimated HIV prevalence, democracy score (polity), alcohol consumption per adult (age 15+) and urban population (% of total). The last three indicators are included in the data analysis for this Coursera course.

Procedures

GapMinder collects data from a handful of sources, including the Institute for Health Metrics and Evaluation, US Census Bureau’s International Database, United Nations Statistics Division, and the World Bank.

Indicators used in this course:

urbanrate – urban population (% of total), calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects. Source: World Bank staff estimates based on United Nations, World Urbanization Prospects (http://data.worldbank.org/indicator). For the current analysis, it was binned into five categories: 1st – urbanrate <= 20%, 2nd – 20% < urbanrate <= 40%, 3rd – 40% < urbanrate <= 60%, 4th – 60% < urbanrate <= 80%, 5th – urbanrate > 80%.

alcconsumption – recorded and estimated average alcohol consumption, adult (15+) per capita consumption in litres of pure alcohol. Source: WHO (http://www.who.int/en/).

polityscore – democracy score (based on Polity IV). Source: Polity IV Project (http://www.systemicpeace.org/polity/polity4.htm)

Measures

Population is available in Gapminder World in the indicator “population, total”. The urban population indicator (urban population, % of total) contains observations for 2008. Urban population refers to people living in urban areas as defined by national statistical offices.

Estimating total average alcohol consumption is very complicated. Much of the alcohol consumed is unregistered (home brewed, illegal etc). In general, in countries with high GDP per Capita, the registered alcohol sales make up a larger part of total consumption than in countries with low GDP per Capita. It is thus especially difficult to estimate total consumption for low-income countries. The data here is based on estimations made by a team of experts at WHO. The data come from two different studies from different years. Note also that the average consumption does not inform about how many in the population have alcohol problems - a country with a low average can have a large part of the population who abstain from alcohol completely, while there are other groups with high-risk consumption; and a country with high average can have a large part of the population who drink regularly but not at risk levels.

The overall polity score comes from the Polity IV dataset and is calculated by subtracting an autocracy score from a democracy score. It is a summary measure of a country's democratic and free nature: -10 is the lowest value, 10 the highest. My sample also uses the converted range value (pltscore variable) from 0 (the lowest value) to 100 (the highest).
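The conversion to the 0 to 100 range can be sketched as a simple linear rescaling. The exact formula used for pltscore is not shown in this post, so the one below is an assumption that merely maps -10 to 0 and 10 to 100.

```python
def to_pltscore(polityscore):
    """Map a polity score in [-10, 10] linearly onto [0, 100].

    Assumed formula for illustration; the program behind this post
    may compute pltscore differently.
    """
    return (polityscore + 10) * 5

for score in (-10, 0, 10):
    print(score, to_pltscore(score))
```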

Testing a Potential Moderator

Part 8

1. ANOVA

Is our explanatory variable associated with our response variable for each population subgroup or each level of our third variable?

That is, are urban rate groups and alcohol consumption rates associated for countries belonging to the different democracy score groups? To answer this, we are going to run two separate ANOVAs, one for each level of the third variable, that is, one for each democracy score group.

The program in SAS

The program in Python [link]

Results

Summary

The ANOVA table examining the relationship between urban rate groups and alcohol consumption rates for countries in democracy group 1 shows a large F-value (22.59) and a significant p-value (<0.0001). When examining the means table, we see that for countries in democracy group 1, urbgroup=1 is associated with a greater alcohol consumption rate, 9.78 liters per adult on average, than urbgroup=0, at 5.5 liters per adult on average.

The association between urban rate groups and alcohol consumption rates for countries in democracy group 0 is not significant, with a large p-value of 0.6. Here, these results are shown graphically.

The relationship between urban rate groups and alcohol consumption rates depends on the democracy group (polityscore<=0 or polityscore>0) in which the country is located. In democracy group 0 (polityscore<=0), alcohol consumption per adult does not depend on the level of the urban population; the difference in alcohol consumption per adult between the urban rate groups is not statistically significant. In democracy group 1 (polityscore>0), alcohol consumption per adult for countries in urban rate group 1 (urbanrate>57%) is significantly greater than for countries in urban rate group 0.

Thus, we can say there's a significant statistical interaction between urban rate group and democracy group in predicting alcohol consumption per adult. The democracy group, our third variable, moderates the association between urban rate group and alcohol consumption. That is, it is a third variable that affects the strength of the relationship between the explanatory variable (urbgroup – urban rate group) and the response variable (alcconsumption – alcohol consumption per adult).

2. Chi-square

Now let’s evaluate the third variable as a potential moderator in the context of Chi-square test of independence.

To ask whether urban rate is associated with alcohol consumption, we are going to create another variable for alcohol consumption, reflecting low alcohol consumption (<=5.92 liters per adult) and high alcohol consumption (>5.92 liters per adult).

We then request a chi-square test of independence, examining the association between this new alcohol consumption group variable (alcgroup) and the urban population group variable (urbgroup).
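The grouping step can be sketched as follows. The 5.92 liter and 57% cut-offs are the ones mentioned in this post; coding the low and high alcohol groups as 0 and 1 is my assumption, and the country values are made up.

```python
from collections import Counter

def alcgroup(alcconsumption):
    """0 = low (<= 5.92 liters per adult), 1 = high (> 5.92); None if missing.

    The 0/1 coding is an assumption made for this sketch.
    """
    if alcconsumption is None:
        return None
    return 1 if alcconsumption > 5.92 else 0

def urbgroup(urbanrate):
    """0 = urban rate <= 57%, 1 = urban rate > 57%; None if missing."""
    if urbanrate is None:
        return None
    return 1 if urbanrate > 57 else 0

# Hypothetical (alcconsumption, urbanrate) pairs; None marks a missing value.
countries = [(3.1, 35.0), (5.92, 57.0), (9.4, 80.2), (12.0, 44.0), (None, 61.0)]

# Cross-tabulate the two groups, skipping rows with missing values.
table = Counter(
    (alcgroup(alc), urbgroup(urb))
    for alc, urb in countries
    if alc is not None and urb is not None
)
print(dict(table))
```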

The program in SAS

The program in Python [link]

As we can see from the large chi-square value and significant p-value urban population rates and alcohol consumption rates are significantly associated.

Examining the column percentages for the high alcohol consumption group (alcconsumption > 5.92 liters per adult), we see that high alcohol consumption is more common among countries with the higher rate of urban population.

Would a third variable moderate this relationship? Might there be a statistical interaction between a third variable and urban rate group in predicting our response variable, alcohol consumption group?

We’re going to evaluate democracy score level as a third variable. Our question will be, does democracy score level affect either the strength or the direction of the relationship between urban population rates and alcohol consumption rates? Is urban population related to alcohol consumption rate for each level of this third variable?

Results

Summary

For democracy group 0 (democracy score <= 0) the chi-square value is small and p-value is quite large. Consequently, the difference in alcohol consumption between these two groups is not significant. For democracy group 1 (democracy score > 0) we find a large chi-square value and small p-value which is statistically significant. The relationship between urban population rates and alcohol consumption rates is statistically significant for those with high democracy score. The democracy group moderates the relationship between urban rate group and alcohol consumption.

3. Correlation

Let's test for moderation in the context of our third inferential test, correlation. Does the correlation between urban rate and alcohol consumption rate differ between countries with different democracy levels?

The program in SAS

The program in Python [link]

Results

When we examine the correlation coefficients between urban rate and alcohol consumption rate for each of the polity score groups, we find the following.

Summary

For the low democracy group (democracy score <= 0), the correlation between urban rate and alcohol consumption is -0.15 and the p-value is not significant. For the high-level democracy countries, the association between urban rate and alcohol consumption is 0.44 with a significant p-value < 0.0001.
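Computing a correlation separately within each democracy group can be sketched in plain Python like this. The rows are made up, chosen so that the two groups show opposite directions; they are not the Gapminder values.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def r_by_group(rows):
    """rows: iterable of (group, x, y); returns {group: correlation}."""
    groups = {}
    for group, x, y in rows:
        xs, ys = groups.setdefault(group, ([], []))
        xs.append(x)
        ys.append(y)
    return {g: pearson_r(xs, ys) for g, (xs, ys) in groups.items()}

# Hypothetical (democracy group, urban rate, alcohol consumption) rows.
rows = [
    (0, 20.0, 5.0), (0, 40.0, 4.0), (0, 60.0, 3.0),
    (1, 30.0, 4.0), (1, 50.0, 8.0), (1, 70.0, 12.0),
]
r_groups = r_by_group(rows)
print(r_groups)
```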

Generating a Correlation Coefficient

Part 7

Scatter plots for gap minder variables

We use the following scatter plots to visualize the association between two quantitative variables. The first scatter plot shows the rate of alcohol consumption by the rate of the country's population living in urban settings. The second shows the rate of alcohol consumption by democracy score. From looking at the scatter plots, we can guess that the associations are positive; that is, a higher alcohol consumption rate is associated with both higher urban rates and greater democracy scores.

The program in SAS

The program in Python

Results

SUMMARY

For the association between urbanrate and alcconsumption, the correlation coefficient is approximately 0.27434 with a p value of 0.0002. This tells us that the relationship is statistically significant.

For the association between polityscore and alcconsumption, the correlation coefficient is approximately 0.43095 and also has a significant p value of 0.0001. This tells us that the relationship is statistically significant.

Now we can actually interpret the scatter plots and the coefficients together. The association between polityscore and alcconsumption is moderately strong and it's also positive, as the scatter plot had already shown us. The association between urbanrate and alcconsumption is also positive but more modest at 0.27. Both are statistically significant. That is, for both associations, it's highly unlikely that a relationship of this magnitude would be due to chance alone.
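Squaring a correlation coefficient gives the share of variability in one variable that is accounted for by the other. A quick sketch with the two coefficients reported above:

```python
# Correlation coefficients reported in the summary above.
r_urban = 0.27434
r_polity = 0.43095

# r squared: the fraction of variability in alcohol consumption
# accounted for by each variable on its own.
r2_urban = r_urban ** 2
r2_polity = r_polity ** 2

print(f"urbanrate: {r2_urban:.3f}, polityscore: {r2_polity:.3f}")
```

The value for polityscore comes out at about 0.186, which appears to agree with the 18.6% R-square reported for the linear regression on democracy score.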

Running a Chi-Square Test of Independence

Part 6

How are urban rate groups related to alcohol consumption rate among the countries? Or, in hypothesis testing terms, are urban rate and alcohol consumption per adult independent or dependent? That is, are the rates of alcohol consumption equal or not equal among countries from the different urban rate categories?
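For readers who want to see what the chi-square test of independence computes, here is a minimal sketch of the statistic and its degrees of freedom in plain Python. The counts are made up, and the p-value lookup is left out because it needs a chi-square distribution function.

```python
def chi_square(table):
    """Chi-square statistic and degrees of freedom for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts: rows = alcohol group (low, high),
# columns = two urban rate categories.
counts = [[30, 10], [10, 30]]
stat, dof = chi_square(counts)
print(stat, dof)
```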

The program in SAS

The program in Python (link)

Results

Our p-value of 0.0001 clearly tells us that urban population and rates of alcohol consumption are associated.

We therefore accept the alternative hypothesis Ha: not all alcohol consumption rates are equal across urban rate categories. It may be that only two of the urban rate categories differ from one another in alcohol consumption.

To determine which groups are different from the others, we will need to perform a post hoc test. If we reject the null hypothesis, we need to perform comparisons for each pair of alcohol consumption’s rates across the five urban population categories. In the case of 5 groups, we actually need to perform 10 pairwise comparisons.
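The count of 10 comparisons is just the number of ways to choose 2 groups out of 5. A small sketch that lists them:

```python
from itertools import combinations

# The five urban population categories.
urban_groups = [1, 2, 3, 4, 5]

# Every unordered pair of categories needs its own comparison.
pairs = list(combinations(urban_groups, 2))

print(len(pairs))
print(pairs)
```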

The family-wise error rate for 10 different comparisons is 0.40. This means that if we do not protect against type 1 error, we will wrongly reject the null hypothesis, and accept the alternative hypothesis that there is an association, about 40% of the time. That is why we will use the post hoc Bonferroni adjustment. The goal of the Bonferroni adjustment is to control the family-wise error rate, also known as the maximum overall type 1 error rate, so that we can evaluate which pairs of alcohol consumption rates are different from one another. We adjust the significance level to make it more difficult to reject the null hypothesis. The adjusted level is calculated by dividing 0.05 by 10, which gives 0.005.
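Both numbers can be checked with a couple of lines. The family-wise formula below assumes the 10 comparisons are independent, which is the usual simplification behind the 0.40 figure.

```python
alpha = 0.05        # significance level for a single comparison
n_comparisons = 10  # pairwise comparisons among 5 groups

# Chance of at least one false rejection across all comparisons,
# assuming the comparisons are independent.
family_wise_error = 1 - (1 - alpha) ** n_comparisons

# Bonferroni adjustment: divide alpha by the number of comparisons.
bonferroni_alpha = alpha / n_comparisons

print(round(family_wise_error, 2), bonferroni_alpha)
```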

For the actual post hoc testing, we need to run Chi-Square test for each of the 10 paired comparisons.

The program in SAS

The program in Python (link)

Results

Here are p values that are less than 0.005.

As we can see, urban population group 2 (that is, 20% < urban rate <= 40%) has significantly different alcohol consumption rates from groups 4 and 5.

Creating graphs for the data

Part 4

The program in SAS

The program in Python

Results

The univariate bar graph for categorical variable the summary measure of a country's democratic and free nature by group

The univariate bar graph of urban population

This graph is unimodal, with its highest peak at the category of 60 to 80% urban rate. It seems to be skewed to the left; however, the skewness is not pronounced.

Univariate histogram for quantitative variable: alcohol consumption per adult

Bivariate bar graph C->Q

In this bar chart, we can see differences in mean alcohol consumption per adult based on countries' urban population groups. And the relationship seems not to be linear. Though, as you can also see from the Y axis, differences between mean alcohol consumption for each urban population group are small. Also, what linear relationship we do see seems to be positive.

Scatterplot for the Association Between Urban Population and Alcohol Consumption

Scatterplot for the Association Between Democracy score and Alcohol Consumption

The graphs above plot the alcohol consumption per adult (liters) of a country against the country's urban population and democracy score. We can see that each scatter plot shows some relationship or trend between the two variables.

An increasing slope, as we can see here, between urban rate/democracy score and alcohol consumption per adult, indicates the relationship is positive. That is, higher values on one of the variables seem to be associated with higher values on the other, and lower values on one are associated with lower values on the other.

Running an analysis of variance

Part 5

The program in SAS

The program in Python

Results

Frequency distributions

Plots

ANOVA procedures

Conclusions

Model Interpretation for ANOVA:

When examining the association between alcohol consumption per adult (quantitative response) and democracy groups (categorical explanatory), an Analysis of Variance (ANOVA) revealed that alcohol consumption in polity group 0 (-10 ≤ polity score ≤ 0, mean = 3.94, s.d. = ±4.22) was significantly lower than in polity group 1 (0 < polity score ≤ 10, mean = 7.94, s.d. = ±5.05), F(1, 156)=24.03, p = 2.36e-06. Note that the degrees of freedom can be found in the OLS table as the DF model and DF residuals. In this example, 24.03 is the actual F value from the OLS table.
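To show where an F value and its two degrees of freedom come from, here is a minimal one-way ANOVA sketch in plain Python. The two small groups are made up for illustration and have nothing to do with the Gapminder values.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1                 # "DF model" in the OLS table
    df_within = len(all_values) - len(groups)    # "DF residuals" in the OLS table
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical observations for two groups.
f_value, df_model, df_resid = one_way_anova([[1, 2, 3], [4, 5, 6]])
print(f_value, df_model, df_resid)
```

With two groups the model degrees of freedom is 1, and the residual degrees of freedom is the number of observations minus 2, which is how F(1, 156) arises from 158 countries.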

Model Interpretation for post hoc ANOVA results:

ANOVA revealed that urban population (collapsed into 5 ordered categories, which is the categorical explanatory variable) and alcohol consumption per adult (quantitative response variable) were significantly associated, F(4, 178)=5.605, p=0.000285. Post hoc comparisons of mean alcohol consumption per adult by pairs of urban rate categories revealed that countries in urban population group 2 (20-40% of total) reported significantly less alcohol consumption than countries in urban population groups 4 and 5 (4: 60-80%, 5: 80-100% of total). All other comparisons were statistically similar.

#Alcohol Consumption #urban population #democracy score

Making Data Management Decisions

Part 3

The program in SAS

The program in Python

Frequency distributions

Conclusions

As can be seen from the above tables, most countries (81 of 187) are in alcohol consumption group #1, for which the condition is that alcohol consumption is less than or equal to 5 liters per person. For 26 countries there is no alcohol consumption data.

The values of the democracy score index were converted to a range from 0 to 100. For this indicator, the number of missing values is 52.

The total number of countries is 213.

Full results obtained with SAS and Python are available on the following links: Results in SAS Results in Python
