Playing around with data analysis @playingaroundwithdataanalys-blog - Tumblr Blog

Week 3 - Regression Modeling in practice - Test a Multiple Regression Model

Hi there,

This week we are going to build a multiple regression model for our response variable in the Gapminder dataset (EMPLOYRATE) and several possible explanatory variables (i.e. URBANRATE, INTERNETUSERATE, INCOMEPERPERSON).

Assignment:

Write a blog entry that summarizes in a few sentences 1) what you found in your multiple regression analysis. Discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (Beta coefficients and p-values) in your summary. 2) Report whether your results supported your hypothesis for the association between your primary explanatory variable and the response variable. 3) Discuss whether there was evidence of confounding for the association between your primary explanatory and response variable (Hint: adding additional explanatory variables to your model one at a time will make it easier to identify which of the variables are confounding variables); and 4) generate the following regression diagnostic plots:

a) q-q plot

b) standardized residuals for all observations

c) leverage plot

d) Write a few sentences describing what these plots tell you about your regression model in terms of the distribution of the residuals, model fit, influential observations, and outliers.

(SIDE NOTE 1: The research question is explained in

http://playingaroundwithdataanalysis.tumblr.com/post/139295250087/data-analysis-and-interpretation-coursera-week)

(SIDE NOTE 2 - Previous posts for Regression Modeling in practice:

Week 2: http://playingaroundwithdataanalysis.tumblr.com/post/140170375832/week-2-regression-modeling-in-practice-basics)

First, we start by centering all the possible explanatory variables that we think may be associated with our response variable: URBANRATE, INTERNETUSERATE and INCOMEPERPERSON. That is, we subtract their mean values so that their mean is 0. This produces a set of derived variables that we call URBANRATE_c, INCOMEPERPERSON_c and INTERNETUSERATE_c.

By running .describe() on these variables, we can see that their mean now is 0, or very close to 0.
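As a small illustration of the centering step, here is a minimal sketch with made-up numbers (the array below is hypothetical, not Gapminder data). It subtracts the mean of the observed values and checks that the centered mean is 0 up to rounding error.

```python
import numpy as np

# Hypothetical urbanization rates (percent); np.nan marks a missing value.
urban_rate = np.array([30.0, 55.5, 72.3, np.nan, 91.0, 18.2])

# Subtract the mean of the observed values; missing values stay missing.
urban_rate_c = urban_rate - np.nanmean(urban_rate)

# The mean of the centered variable should be 0, or extremely close to it.
centered_mean = np.nanmean(urban_rate_c)
print(abs(centered_mean) < 1e-9)
```

Centering does not change the slope of a regression; it only changes what the intercept means.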

Then, we begin exploring the individual relationships between EMPLOYRATE and each of the other explanatory variables:

1) Linear regression model between EMPLOYRATE and URBANRATE:

# Employrate - vs Urbanization rate regression test
reg1 = smf.ols('EMPLOYRATE ~ URBANRATE_c', data=sub1).fit()
print (reg1.summary())

p-value for test is low: 2.61e-06

Coefficients for URBANRATE are Intercept=59.05 and slope=-0.16, both with p-value=0.000. The slope value is close to 0, so in any case the relationship is not very strong.

R squared at 12.8% means that the model captures only about 12.8% of the variability of the response variable.

So, there seems to be a statistically significant negative association between EMPLOYRATE and URBANRATE.

2) Linear regression model between EMPLOYRATE and INTERNETUSERATE:

# Employrate - vs Internet rate regression test
reg1 = smf.ols('EMPLOYRATE ~ INTERNETUSERATE_c', data=sub1).fit()
print (reg1.summary())

There is a negative relationship between employment rate and internet use rate. Although it is statistically significant (p-value = 0.012), the slope coefficient is close to 0, which suggests a weak linear relationship. The R squared is low, meaning that the model captures only 4% of the variability of the response variable.

3) Linear regression model between EMPLOYRATE and INCOMEPERPERSON:

# Employrate - vs income rate regression test
reg1 = smf.ols('EMPLOYRATE ~ INCOMEPERPERSON_c', data=sub1).fit()
print (reg1.summary())

Here the p-value is high: p-value=0.7.

There is no significant relationship between employment rate and income per person.

Now, we are going to add the three variables to the regression model, to see if maybe it is urbanization rate rather than internet use rate that is significantly associated with employment rate (one may be a confounding variable).

With the multiple regression, we partial out the part of each association that can be accounted for by the other variables.

In other words:

- Is urbanization rate positively associated with the employment rate after controlling for internet use rate and income per person?

- Similarly, is internet use rate positively associated with the employrate after controlling urbanization rate?

With the multiple regression we will be able to conclude if both urbanization rate and internet use rate are significantly associated with employment rate. We will be able to tell this by examining the p-values after a multiple regression test.
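To make the idea of "controlling for" another variable concrete, here is a minimal sketch on simulated data (everything in it is hypothetical: the variables `x`, `y`, `z` and the helper `ols_coefs` are invented for illustration, and it uses plain least squares instead of statsmodels). The response depends only on the confounder `z`, yet `x` looks associated with it until `z` enters the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)              # hypothetical confounder
x = 0.8 * z + rng.normal(size=n)    # explanatory variable, partly driven by z
y = 2.0 * z + rng.normal(size=n)    # response depends on z only, not on x

def ols_coefs(y, *columns):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

slope_alone = ols_coefs(y, x)[1]        # x on its own looks associated with y
slope_adjusted = ols_coefs(y, x, z)[1]  # shrinks toward 0 once z is controlled

print(round(slope_alone, 2), round(slope_adjusted, 2))
```

When a coefficient changes this much after another variable is added, that variable is a candidate confounder, which is why adding explanatory variables one at a time is useful.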

On the other hand, we have seen that, in the case of urban rate, a straight line is not doing a good job of estimating the association between the two variables. We can use polynomial regression to fit a curved line that better captures the association, by adding a polynomial term.

For space reasons, we will not include all data for the regression test including URBANRATE and the square of URBANRATE, but the result is that there is a (very weak) although statistically significant relationship between EMPLOYRATE and both variables (URBANRATE and URBANRATE^2).
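As an illustration of what adding a polynomial term does, here is a minimal sketch on simulated data (the numbers are made up and `r_squared` is a hypothetical helper). It fits a straight line and a quadratic curve with `numpy.polyfit` and compares how much variability each one explains.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-40, 40, 120)                                # hypothetical centered predictor
y = 57 - 0.15 * x + 0.005 * x**2 + rng.normal(0, 3, x.size)  # curved trend plus noise

def r_squared(y, fitted):
    """Proportion of the variance in y explained by the fitted values."""
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

linear_fit = np.polyval(np.polyfit(x, y, 1), x)      # straight line
quadratic_fit = np.polyval(np.polyfit(x, y, 2), x)   # adds the squared term

r2_linear = r_squared(y, linear_fit)
r2_quadratic = r_squared(y, quadratic_fit)
print(r2_quadratic > r2_linear)
```

The quadratic fit can never explain less than the linear one on the same data, so what matters is whether the squared term is significant and the gain is meaningful.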

We will be building a multiple regression model including these two variables and adding the other two possible lurking variables (INTERNETUSERATE and INCOMEPERPERSON), to see if their p values are statistically significant and contribute to modeling the response variable EMPLOYRATE when each of them is being adjusted:

4) Multiple regression model:

reg3 = smf.ols('EMPLOYRATE ~ URBANRATE_c + I(URBANRATE_c**2) + INTERNETUSERATE_c + INCOMEPERPERSON_c', data=sub1).fit()
print (reg3.summary())

TEST INTERPRETATION:

Intercept: it is the value of the response variable when all explanatory variables are 0. Because we centered the explanatory variables, the intercept is the value of the employment rate at the mean of urban rate, internet use rate and income per person. So: the employment rate when urban rate, internet use rate and income per person are all at their means is 56.93 of every 100 people.
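The effect of centering on the intercept can be checked with a minimal sketch on simulated data (the predictors `x1` and `x2` are hypothetical). With centered predictors and only linear terms, the least-squares intercept equals the sample mean of the response. With a squared term, as in reg3, the intercept is still the predicted value at the means of the predictors, but it no longer coincides exactly with the sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.uniform(10, 100, n)   # hypothetical predictor
x2 = rng.uniform(0, 90, n)     # hypothetical predictor
y = 60 - 0.1 * x1 + 0.05 * x2 + rng.normal(0, 5, n)

# Design matrix: a column of ones plus the centered predictors.
X = np.column_stack([np.ones(n), x1 - x1.mean(), x2 - x2.mean()])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept = coefs[0]

print(abs(intercept - y.mean()) < 1e-8)
```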

- The urban rate variable remains significant after adjusting for internet use rate and income per person (p-value=0.000, beta=-0.17).

- The quadratic term for urban rate remains significant after adjusting for the other variables, although very weakly (p-value=0.001, beta=0.0047).

- Internet use rate is not statistically significant (p-value=0.132, beta=0.0003).

- Income per person is also statistically significant, although extremely weakly (p-value=0.012, beta=0.0003).

Each observation has an estimated response value, which is also referred to as the predicted or fitted response variable, based on the regression equation. We see in the R-squared value, that only about 22% of the variability in employment rate is explained by the explanatory variables. So, there is still clearly some error in estimating the response value with this model.

RESIDUALS EVALUATION:

Let's evaluate the residuals:

1) Q-Q plot:

We use a qq-plot to evaluate the assumption that the residuals from our regression model are normally distributed. A qq-plot plots the quantiles of the residuals that we would theoretically see if the residuals followed a normal distribution, against the quantiles for the residuals estimated from our regression model.

#Q-Q plot for normality
fig4 = sm.qqplot(reg3.resid, line='r')
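As a rough sketch of what sits behind such a plot, the pairs of quantiles can be computed by hand with the standard library. The residual values and the helper `qq_points` are hypothetical, and the plotting positions (i - 0.5) / n are one common convention that may differ from the one statsmodels uses.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Pair each sorted, standardized residual with its theoretical normal quantile."""
    n = len(residuals)
    m, s = mean(residuals), stdev(residuals)
    sample = sorted((r - m) / s for r in residuals)
    # Plotting positions (i - 0.5) / n are an assumption of this sketch.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))

residuals = [-7.1, -3.4, -1.2, -0.3, 0.4, 1.1, 2.5, 3.0, 5.0]  # hypothetical values
points = qq_points(residuals)

for theoretical_q, sample_q in points:
    print(round(theoretical_q, 2), round(sample_q, 2))
```

If the residuals were normally distributed, the two columns would be roughly equal, which is the straight line we look for in the plot.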

The qqplot for our regression model shows that the residuals generally follow a straight line, but deviate at the lower and higher quantiles. This indicates that our residuals do not follow a perfect normal distribution. This could mean that the curvilinear association that we observed in our scatter plot may not be fully estimated by the quadratic urban rate term.

We need to plot the standardized residuals distribution. The standardized residuals are simply the residual values that are transformed to have a mean of zero and a standard dev of 1:

# simple plot of residuals
#First, convert the array of standardized residuals to a Pandas DataFrame:
stdres = pandas.DataFrame(reg3.resid_pearson)
#plot the scatter plot of residuals from this model; ls='None' means do not draw a line that connects the residuals:
plt.plot(stdres, 'o', ls='None')
#draw horizontal line at y=0
l = plt.axhline(y=0, color='r')
#draw horizontal green lines at y=1, y=-1
l = plt.axhline(y=1, color='g')
l = plt.axhline(y=-1, color='g')
#draw horizontal blue lines at y=2, y=-2
l = plt.axhline(y=2, color='b')
l = plt.axhline(y=-2, color='b')
#draw horizontal yellow lines at y=2.5, y=-2.5
l = plt.axhline(y=2.5, color='y')
l = plt.axhline(y=-2.5, color='y')

plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
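To show what the standardized residuals are numerically, here is a minimal sketch on simulated data (all values hypothetical). It assumes the usual definition, where each residual is divided by the residual standard error, and then counts the share of points beyond 2 standard deviations.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 150
x = rng.normal(size=n)                   # hypothetical centered predictor
y = 58 - 2.0 * x + rng.normal(0, 9, n)   # hypothetical response

X = np.column_stack([np.ones(n), x])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coefs

# Residual standard error: n minus the number of fitted parameters in the denominator.
resid_std_error = np.sqrt(np.sum(resid**2) / (n - X.shape[1]))
std_resid = resid / resid_std_error

# Share of observations with a standardized residual beyond +/- 2.
share_beyond_2 = np.mean(np.abs(std_resid) > 2)
print(round(share_beyond_2, 3))
```

With normally distributed errors we would expect this share to be around 5%.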

We see that most of the residuals fall within 1 standard deviation of the mean, and few have values beyond 2 standard deviations. With a standard normal distribution, we would expect 95% of the points to fall within 2 standard deviations. No point lies more than 3 standard deviations from the mean (no extreme outliers).

Here, almost none of the residuals exceed an absolute value of 2.5 (perhaps one does), but around 5% of the points have an absolute value above 2 (it looks like 8 points fall there).

We should include more explanatory variables to better explain the variability in our employment rate.

In order to determine how specific explanatory variables contribute to the fit of our model there are additional python plots.

2) Standardized residuals for all observations:

A) For URBANRATE:

# additional regression diagnostic plots
#12 x 8 inch figure (960x640 pixels at 80 dpi):
fig2 = plt.figure(figsize=(12, 8))
fig2 = sm.graphics.plot_regress_exog(reg3, "URBANRATE_c", fig=fig2)

For URBANRATE, the residuals are distributed rather evenly across all values of URBANRATE. In the partial regression plot, we can see a slightly better fit to the regression line, but still many values far from this line.

B) For INCOMEPERPERSON:

For INCOMEPERPERSON, we see that the residuals with the biggest absolute values are almost all spread along the low values of income per person. However, we can also see high absolute values for higher incomes.

If we explore the partial regression for this variable, we can see again that the residuals are rather scattered along the regression line, most of them really far from the line. It looks again like the association is weak between income per person and employment after controlling for other explanatory variables.

C) For INTERNETUSERATE:

There is a funnel-shaped pattern to the residuals: their absolute values are large for small values of internet use rate, get smaller and closer to zero as the internet use rate increases, and start to get larger again at the highest rates. This is consistent with the conclusion that the model does not predict employment rate as well for countries that have high or low levels of internet use. The model is particularly poor at predicting the employment rate for countries with a low internet use rate. This suggests that the relationship between employment rate and internet use rate may not be linear but curvilinear (polynomial, for example).

This suggests that, although INTERNETUSERATE on its own shows a statistically significant relationship with EMPLOYRATE, this association is pretty weak after controlling for other variables, such as urbanization rate or income per person.

3) Leverage plot:

Finally, we can examine a leverage plot to identify observations that have an unusually large influence on the estimation of the predicted value of the response variable, or that are outliers.

Leverage measures how much the predicted scores for the other observations would differ if the observation in question were not included in the analysis. The leverage always takes on values between 0 (no effect on the regression model) and 1:

# leverage plot
fig3 = sm.graphics.influence_plot(reg3, size=8)
print(fig3)
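For readers who want to see where the leverage values come from, here is a minimal sketch on simulated data (hypothetical values, plain numpy instead of statsmodels). The leverage of each observation is the corresponding diagonal element of the hat matrix, and an observation far from the others in the explanatory variable gets the largest value.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
x = rng.normal(size=n)
x[0] = 8.0   # one hypothetical observation far from the others

X = np.column_stack([np.ones(n), x])
# Hat matrix H = X (X'X)^-1 X'; its diagonal holds the leverage of each observation.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

print(leverage.argmax(), round(leverage.max(), 2))
```

The leverages add up to the number of fitted parameters (2 here), so "greater than average leverage" means greater than that number divided by the number of observations.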

INTERPRETATION:

There are a few outliers with an absolute value greater than 2 but with low leverage, meaning that they have a small influence on the model. On the other hand, we see other outliers that have greater than average leverage. There is one point, 104, that seems to have a high influence: it has high leverage but is not an outlier. We also have some observations that are both high leverage and outliers, such as 145.

FINAL CONCLUSIONS:

We are seeing some weaknesses in our model. Even if we include other polynomial terms in our regression model (such as a third-degree term for URBANRATE, or quadratic terms for INCOMEPERPERSON or INTERNETUSERATE), and even where the p-values are low enough, the beta coefficients are extremely small, around 0 in all cases except for URBANRATE:

reg4 = smf.ols('EMPLOYRATE ~ URBANRATE_c + I(URBANRATE_c**2) + I(URBANRATE_c**3) + INTERNETUSERATE_c + I(INTERNETUSERATE_c**2) + INCOMEPERPERSON_c + I(INCOMEPERPERSON_c**2)', data=sub1).fit()
print (reg4.summary())

This means that INTERNETUSERATE or INCOMEPERPERSON do not seem to moderate too much the relationship between URBANRATE and EMPLOYRATE.

The model is misspecified, in the sense that there must be other confounders for which we do not have data. This is consistent with the fact that the EMPLOYRATE is associated with multiple and complex socioeconomic variables. Unfortunately, we do not have the data for controlling those variables.

Indeed, the level of aggregation (per country level, per year) is too high to be able to account for specific outliers effects, economic variables etc.

Some of these effects to account for may be:

- Economic crisis of 2008: the Gapminder dataset contains data from 2007 for some variables and from 2008 for others. This is an important fact, as the destruction of employment in 2008 was huge in some countries.

- Political situations specific to the countries, which may lead to outliers.

- Not the rate of urbanization itself, but how this urbanization had been growing in the previous years. We would need more longitudinal data to account for this.

Python code:

%pylab inline
%matplotlib inline

import numpy as np
import pandas as pandas
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sbs
import matplotlib.pyplot as plt

#LOAD THE DATASET AND CONVERT COLUMNS TO UPPER-CASE TO AVOID ERRORS
#(on_bad_lines='skip' replaces error_bad_lines=False, which newer pandas versions no longer accept)
data = pandas.read_csv('gapminder.csv', low_memory=False, on_bad_lines='skip')

#upper-case all Dataframe column names
data.columns = data.columns.str.upper()

#bug fix for display formats to avoid runtime errors
pandas.set_option('display.float_format', lambda x: '%f' %x)

#Set PANDAS to show all columns in a Dataframe
#(the default display of pandas has a limit)
pandas.set_option('display.max_columns', None)
pandas.set_option('display.max_rows', None)

#setting variables you will be working with to numeric
#(convert_objects was removed from pandas; to_numeric with errors='coerce' is the replacement)
for col in ['INTERNETUSERATE', 'URBANRATE', 'INCOMEPERPERSON', 'EMPLOYRATE']:
    data[col] = pandas.to_numeric(data[col], errors='coerce')

sub1=data.copy()

#center explanatory variables:

sub1['URBANRATE'] = sub1['URBANRATE'][~np.isnan(sub1['URBANRATE'])]
sub1['URBANRATE_c'] = sub1['URBANRATE'] - sub1['URBANRATE'].mean()
sub1['URBANRATE_c'].describe()

sub1['INCOMEPERPERSON'] = sub1['INCOMEPERPERSON'][~np.isnan(sub1['INCOMEPERPERSON'])]
sub1['INCOMEPERPERSON_c'] = sub1['INCOMEPERPERSON'] - sub1['INCOMEPERPERSON'].mean()
sub1['INCOMEPERPERSON_c'].describe()

sub1['INTERNETUSERATE'] = sub1['INTERNETUSERATE'][~np.isnan(sub1['INTERNETUSERATE'])]
sub1['INTERNETUSERATE_c'] = sub1['INTERNETUSERATE'] - sub1['INTERNETUSERATE'].mean()
sub1['INTERNETUSERATE_c'].describe()

Week 2 - Regression Modeling in Practice - Basics of Linear Regression

Hi there,

This week we are going to build a basic linear regression model between two quantitative variables in our Gapminder dataset, which will be our response variable and our explanatory variable.

Assignment:

1) If you have a quantitative explanatory variable, center it so that the mean = 0 (or really close to 0) by subtracting the mean, and then calculate the mean to check your centering.

2) Test a linear regression model and summarize the results in a couple of sentences. Make sure to include statistical results (regression coefficients and p-values) in your summary.

The research question is explained in

http://playingaroundwithdataanalysis.tumblr.com/post/139295250087/data-analysis-and-interpretation-coursera-week

As we have explained in week 1, we are going to test the relationship between two quantitative variables, EMPLOYRATE (response variable) and URBANRATE (explanatory variable).

We proceed with the data management part. This includes centering the explanatory variable by subtracting its mean value. This will create a new variable called URBANRATE_c:

sub1['URBANRATE_c']=sub1['URBANRATE'] - sub1['URBANRATE'].mean()

If we calculate the summary statistics for this new variable, we can confirm that the mean is now 0:

sub1['URBANRATE_c'].describe()

count   193.000000
mean      0.000000
std      23.707742

Let’s see visually the linear relationship between both variables via a scatter plot:

scat1 = sbs.regplot(x="URBANRATE_c", y="EMPLOYRATE", fit_reg=True, data=sub1)
plt.xlabel('Urbanization Rate')
plt.ylabel('Employment Rate')
plt.title('Scatterplot for the Association Between Urban Rate and Employment Rate')

We can appreciate a (rather scattered) negative relationship between URBANRATE and EMPLOYRATE.

Now, let’s calculate the linear regression model:

print ("OLS regression model for the association between urban rate and employment rate")
reg1 = smf.ols('EMPLOYRATE ~ URBANRATE_c', data=sub1).fit()
print (reg1.summary())

TEST INTERPRETATION:

“Dep. Variable” shows the name of the response variable: EMPLOYRATE. Number of observations is 164, which are the observations that had valid data on both the explanatory and the response variables.

F-Statistic is 24.74, and the p-value is very small: 2.61e-06, so we can reject the null hypothesis and conclude that employment rate is significantly associated with the urban rate.

Parameter estimates (coef column): the intercept is 59.11, and the slope of the line is -0.16. This suggests a negative relationship between employment rate and urbanization.

This means that we could model the relationship between employment rate and urbanization in the following way, where urban rate is the centered variable URBANRATE_c:

employment rate = -0.1607 * urban rate + 59.11
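The fitted equation can be turned into a small prediction helper. This is a minimal sketch: the function `predict_employ_rate` is hypothetical, and the sample mean used for centering is a made-up placeholder, since the post does not report the mean of URBANRATE.

```python
INTERCEPT = 59.11   # reported intercept
SLOPE = -0.1607     # reported slope for URBANRATE_c

def predict_employ_rate(urban_rate, urban_rate_mean):
    """Predicted employment rate for a raw urban rate, given the mean used for centering."""
    return INTERCEPT + SLOPE * (urban_rate - urban_rate_mean)

# The sample mean below is a made-up placeholder, not a value from the post.
example_mean = 55.0
at_mean = predict_employ_rate(55.0, example_mean)
ten_points_above = predict_employ_rate(65.0, example_mean)

print(at_mean, round(ten_points_above, 3))
```

A country at the mean urban rate gets the intercept as its prediction, and each additional percentage point of urbanization lowers the predicted employment rate by about 0.16 points.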

The P>|t| values are both 0.000. These are the p-values for the intercept and for the association of our explanatory variable with the response variable. This means the coefficients are statistically significant.

The R-squared value is 0.128; it is the proportion of the variance in the response variable that can be explained by the explanatory variable. We now know that this model accounts for 12.8% of the variability we observe in our response variable, EMPLOYRATE.

We will need to check for other possible confounders or lurking variables, that may be moderating the relationship between employment rate and urbanization rate (next week). Indeed, this negative association is rather surprising.

DISCUSSION:

So, as previous exploratory inspections of the data had already suggested, it looks like increasing rates of urbanization are associated with lower employment rates.

In previous weeks we hypothesized about the possible reasons behind that. On one hand, there must be some lurking variables or confounders, as every economic aggregated factor hides many factors behind.

On another hand, it is possible that countries where the urbanization rate has grown beyond economic sustainability are rendering people jobless.

Another possibility is that there are certain assumptions for the test that have not been met in the research question and the data itself. For example, in this post we took some assumptions about the underlying data:

- The Pearson correlation coefficient, also called the product moment correlation coefficient, is the coefficient that measures a linear association between two numerical variables. It assumes both variables have normal distributions, so that the two variables are bivariate normally distributed.

- This implies that the scatterplot of the data should show an approximate ellipsoidal shape, and that each of the variables separately follow a normal distribution. In our case, we are not very far away from this ellipsoidal shape but one could argue that this assumption is not really met.

- In general, the Pearson correlation coefficient is sensitive to outliers and to skewness in the distribution of one or both variables. We saw during Week 2 of the initial course on Data Analysis and Interpretation that some variable distributions (through plotting their histograms) were somewhat skewed. So, we are pretty sure that these assumptions are not met, and therefore we would need to resort to alternative correlation coefficients, such as the Spearman correlation. Some of the conditions for applying it are: both variables are ordinal, they are not linearly related, they contain one or more outliers, they don't follow a bivariate normal distribution, or you cannot check this distribution due to lack of data. To be done as further optional work....
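To illustrate the sensitivity to outliers mentioned in the points above, here is a minimal sketch with made-up data (the helpers `pearson` and `spearman` are hypothetical, and the rank computation assumes there are no ties). A single extreme value pulls the Pearson coefficient well below 1, while the Spearman coefficient, which only uses ranks, is unaffected.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient."""
    return np.corrcoef(x, y)[0, 1]

def spearman(x, y):
    """Pearson correlation of the ranks (as written, valid only when there are no ties)."""
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(x), rank(y))

x = np.arange(1.0, 11.0)   # hypothetical data: y increases with x
y = x.copy()
y[-1] = 1000.0             # one extreme value in y

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(round(r_pearson, 2), round(r_spearman, 2))
```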

Python code:

%pylab inline
%matplotlib inline

import numpy as np
import pandas as pandas
import statsmodels.api
import statsmodels.formula.api as smf
import seaborn as sbs
import matplotlib.pyplot as plt

#LOAD THE DATASET AND CONVERT COLUMNS TO UPPER-CASE TO AVOID ERRORS
#(on_bad_lines='skip' replaces error_bad_lines=False, which newer pandas versions no longer accept)
data = pandas.read_csv('gapminder.csv', low_memory=False, on_bad_lines='skip')

#upper-case all Dataframe column names
data.columns = data.columns.str.upper()

#bug fix for display formats to avoid runtime errors
pandas.set_option('display.float_format', lambda x: '%f' %x)

#Set PANDAS to show all columns in a Dataframe
#(the default display of pandas has a limit)
pandas.set_option('display.max_columns', None)
pandas.set_option('display.max_rows', None)

sub1=data.copy()

scat1 = sbs.regplot(x="URBANRATE_c", y="EMPLOYRATE", fit_reg=True, data=sub1)
plt.xlabel('Urbanization Rate')
plt.ylabel('Employment Rate')
plt.title('Scatterplot for the Association Between Urban Rate and Employment Rate')

print ("OLS regression model for the association between urban rate and employment rate")
reg1 = smf.ols('EMPLOYRATE ~ URBANRATE_c', data=sub1).fit()
print (reg1.summary())

Week 4 - Regression Modeling in Practice - Logistic Regression

Hi there,

This week we want to try out logistic regression and its concepts. During the last weeks I have been exploring the relationship between employment rate, urbanization and other variables in the Gapminder dataset, with some results, so this time I have decided to change my research question and move to the NESARC dataset.

The reason is that there are interesting binary variables there on which to explore relationships following chi-square tests and logistic regression.

I have decided to work on the following research question:

- Is generalized anxiety significantly associated with the following factors: traumatic events during childhood (parental divorce or parental death), age and sex?

NOTE: this post will be long, as I need to explore the variables to use, explain data management, univariate and bivariate analysis, and perform some chi-square tests before going to logistic regression. Logistic regression tests are at the end of this post.

Variables in dataset:

First, I need to look for the appropriate variables in the NESARC dataset:

1. S1Q2D DID BIOLOGICAL OR ADOPTIVE PARENTS GET DIVORCED OR PERMANENTLY STOP LIVING TOGETHER BEFORE RESPONDENT WAS 18

6914 1. Yes

30261 2. No

65 9. Unknown

5853 BL. NA, did not live or unknown if lived with biological or adoptive mother and father before age 18

2. S1Q2K DID BIOLOGICAL OR ADOPTIVE PARENT DIE BEFORE RESPONDENT WAS 18

4515 1. Yes

37358 2. No

184 9. Unknown

1036 BL. NA, did not live or unknown if lived with biological or adoptive parent(s) before age 18

SECTION 14: DSM-IV DIAGNOSES

3. 3630-3630 GENAXLIFE GENERALIZED ANXIETY DISORDER - LIFETIME (NON-HIERARCHICAL)

41155 0. No

1938 1. Yes

OTHER VARIABLES:

79-79 SEX SEX

18518 1. Male

24575 2. Female

AGE

DATA MANAGEMENT

Let’s start with some data management:

1. Convert variables to numeric

2. Recode variables S1Q2D and S1Q2K to have each two binary values, 0 and 1, as well as give them more explanatory names.
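The recoding in step 2 can be sketched as follows, with made-up responses (the array below is hypothetical, not NESARC data). It follows the same rule as the functions in the code at the end of this post: only an explicit "yes" becomes 1, and every other code becomes 0.

```python
import numpy as np

# Hypothetical S1Q2D responses: 1 = yes, 2 = no, 9 = unknown, nan = blank.
s1q2d = np.array([1, 2, 2, 9, np.nan, 1])

# 1 only for an explicit "yes"; every other code, including unknown and blank, becomes 0.
divorce_family = np.where(s1q2d == 1, 1, 0)

print(divorce_family.tolist())
```

Note that this rule counts "unknown" and blank answers as "no", which is a simplification; treating them as missing would be the more cautious alternative.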

UNIVARIATE ANALYSIS

Then, follow with some univariate analysis of the main variables:

1. Univariate analysis for GENAXLIFE (prevalence of generalized anxiety in the population):

sns.countplot(x='GENAXLIFE', data=data)
plt.xlabel('Incidence generalized anxiety')
plt.title('Incidence of generalized anxiety')

data['GENAXLIFE'] = data['GENAXLIFE'].astype('category')
c_genanx = data.groupby('GENAXLIFE').size()
print(c_genanx)

GENAXLIFE
0    41155
1     1938
dtype: int64

p_genanx = data.groupby('GENAXLIFE').size() * 100 / len(data)
print(p_genanx)

GENAXLIFE
0   95.50
1    4.50
dtype: float64

2. Univariate analysis for DEATHFAMILY (prevalence of event of death in the family before age 18 in the population):

In percentages:

DEATHFAMILY
0   89.52
1   10.48
dtype: float64

sns.countplot(x='DEATHFAMILY', data=sub1)
plt.xlabel('Incidence of death in family before age 18')
plt.title('Incidence of death in family before age 18')

3. Univariate analysis for DIVORCEFAMILY (prevalence of event of divorce in the family before age 18 in the population):

In percentages:

DIVORCEFAMILY
0   83.96
1   16.04
dtype: float64

BIVARIATE ANALYSIS:

And some bivariate analysis between generalized anxiety and our explanatory binary variables DEATHFAMILY , DIVORCEFAMILY and SEX:

# factorplot was renamed catplot in seaborn 0.9
sns.catplot(x='DEATHFAMILY', y='GENAXLIFE', data=sub1, kind='bar', ci=None)
plt.xlabel('Death in the family before 18')
plt.ylabel('Proportion Generalized Anxiety among death in the family before 18')

sns.catplot(x='DIVORCEFAMILY', y='GENAXLIFE', data=sub1, kind='bar', ci=None)
plt.xlabel('DIVORCE in the family before 18')
plt.ylabel('Proportion Generalized Anxiety among divorce in the family before 18')

sns.catplot(x='SEX', y='GENAXLIFE', data=sub1, kind='bar', ci=None)
plt.xlabel('SEX')
plt.ylabel('PREVALENCE OF Generalized Anxiety among SEX')

Chi-Square Statistic (C-> C)

Now, perform chi-square statistics between our response variable GENAXLIFE and each of our explanatory variables, DIVORCEFAMILY and DEATHFAMILY:

When the Explanatory Variable is Categorical and the response variable is also categorical, the test to perform to see statistical significance of a relationship between both variables is the Chi Square Statistical analysis.

A. DIVORCEFAMILY AND GENERALIZED ANXIETY

# contingency table of observed counts
ct1 = pandas.crosstab(sub1['GENAXLIFE'], sub1['DIVORCEFAMILY'])
print (ct1)

DIVORCEFAMILY      0     1
GENAXLIFE
0              34620  6535
1               1559   379

We can see that among those with generalized anxiety, 1559 had no divorce in the family, while 379 did. There may be a positive relationship: while divorce in the family occurred for around 16% of the general population (see the univariate analysis), among those suffering from generalized anxiety the figure is about 19.6% (379 of 1938).

If we compute the chi-square test:

# chi-square
print ('chi-square value, p value, expected counts FOR DIVORCE IN THE FAMILY')
cs1 = scipy.stats.chi2_contingency(ct1)
print (cs1)

The result is:

chi-square value, p value, expected counts FOR DIVORCE IN THE FAMILY
(18.307910232889945, 1.8792522804654385e-05, 1, array([[ 34551.93987423,   6603.06012577],
       [  1627.06012577,    310.93987423]]))

We obtain from this test a small p-value (1.879e-05) and a chi-square value of 18.3.

It looks like there is a statistically significant association between generalized anxiety and the event of divorce in the family before the age of 18.
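The chi-square value and p-value reported above can be reproduced by hand from the observed counts. This is a minimal sketch using only the standard library; it applies Yates' continuity correction, which scipy uses by default for 2x2 tables, and relies on the fact that with 1 degree of freedom the chi-square tail probability can be written with the complementary error function.

```python
import math

observed = [[34620, 6535],   # GENAXLIFE = 0: no divorce, divorce
            [1559, 379]]     # GENAXLIFE = 1: no divorce, divorce

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count in each cell if the two variables were independent.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Yates' continuity correction: shrink each |observed - expected| by 0.5.
chi2 = sum((abs(o - e) - 0.5) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

# With 1 degree of freedom the upper tail probability reduces to erfc.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 2), p_value)
```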

B. DEATHFAMILY AND GENERALIZED ANXIETY

# contingency table of observed counts
ct2 = pandas.crosstab(sub1['GENAXLIFE'], sub1['DEATHFAMILY'])
print (ct2)

DEATHFAMILY      0     1
GENAXLIFE
0            36866  4289
1             1712   226

Here, it also looks like there might be a positive relationship between a death in the family and generalized anxiety.

# chi-square
print ('chi-square value, p value, expected counts FOR DEATH IN THE FAMILY')
cs2 = scipy.stats.chi2_contingency(ct2)
print (cs2)

chi-square value, p value, expected counts FOR DEATH IN THE FAMILY
(2.9029957506164368, 0.088415095558617249, 1, array([[ 36843.05084352,   4311.94915648],
       [  1734.94915648,    203.05084352]]))

We see in this second test that the p-value is not significant (p=0.088). So here we cannot reject the null hypothesis that a death in the family before age 18 is not associated with the prevalence of generalized anxiety.

C. SEX

# contingency table of observed counts
ct2 = pandas.crosstab(sub1['GENAXLIFE'], sub1['SEX'])
print (ct2)

SEX            1      2
GENAXLIFE
0          17952  23203
1            566   1372

# column percentages
colsum = ct2.sum(axis=0)
colpct = ct2/colsum
print(colpct)

SEX          1    2
GENAXLIFE
0         0.97 0.94
1         0.03 0.06

In the percentages table, it catches our attention that the prevalence of generalized anxiety seems to be much higher in women than in men (the percentage roughly doubles, from 3% to 6%).

At first sight, generalized anxiety appears to be considerably more prevalent among women than among men.

A chi-square test can help us determine whether this difference is significant.

When we perform this test, we obtain a very low p-value (7.09e-36) and a very high Chi-square value (156.35). This indicates there is a very strong association between sex and generalized anxiety, with more prevalence in women.

LOGISTIC REGRESSION MODELS AND ODDS RATIOS

1. We first calculate a logit model with DIVORCEFAMILY as the only explanatory variable:

lreg2 = smf.logit(formula = 'GENAXLIFE ~ DIVORCEFAMILY', data = sub1).fit()
print (lreg2.summary())
# odds ratios
print ("Odds Ratios")
print (numpy.exp(lreg2.params))

We can see here that the p-value of the test is low (2.66e-05), and the p-value for the DIVORCEFAMILY variable is 0.000.

The odds ratio is 1.29. This confirms that there is a positive association between divorce and generalized anxiety.

The 95% confidence interval for the odds ratio for DIVORCEFAMILY is (1.15-1.45).
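With a single binary explanatory variable, the odds ratio and its interval can also be recovered directly from the contingency table of GENAXLIFE by DIVORCEFAMILY shown earlier. This is a minimal sketch using the standard library; the variable names are made up for illustration, and the interval is the usual Wald interval on the log odds ratio.

```python
import math

# Counts from the GENAXLIFE x DIVORCEFAMILY contingency table.
anx_div, noanx_div = 379, 6535          # divorce in the family
anx_nodiv, noanx_nodiv = 1559, 34620    # no divorce in the family

# Odds of generalized anxiety with divorce, divided by the odds without it.
odds_ratio = (anx_div / noanx_div) / (anx_nodiv / noanx_nodiv)

# Standard error of log(OR) and a 95% Wald interval.
se_log_or = math.sqrt(1 / anx_div + 1 / noanx_div + 1 / anx_nodiv + 1 / noanx_nodiv)
lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(round(odds_ratio, 2), round(lower, 2), round(upper, 2))
```

Because the whole interval lies above 1, the odds of generalized anxiety are higher for respondents who experienced a divorce in the family.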

2. We then calculate another logit model including AGE and SEX explanatory variables:

lreg2 = smf.logit(formula = 'GENAXLIFE ~ DIVORCEFAMILY + AGE + SEX', data = sub1).fit()
print (lreg2.summary())
# odds ratios
print ("Odds Ratios")
print (numpy.exp(lreg2.params))

Indeed, the above logistic regression test confirms that there is a statistically significant association (LLR p-value=6.127e-39) between generalized anxiety, reflected in the variable GENAXLIFE, and the following explanatory variables:

- DIVORCEFAMILY (p=0.000), odds ratio=1.29. This odds ratio suggests a positive relationship: the presence of divorce in childhood increases the odds of suffering generalized anxiety in adulthood.

- SEX (p=0.000), odds ratio=1.87. This odds ratio suggests an even stronger positive relationship between sex (female) and generalized anxiety.

As for the variable AGE, there is no significant relationship with generalized anxiety that suggests that there is more prevalence of generalized anxiety in certain specific ranges of age (P is very high).

If we calculate the 95% confidence intervals for odds ratio:

params = lreg2.params
conf = lreg2.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (numpy.exp(conf))

we obtain the following:

                    Lower CI  Upper CI   OR
Intercept               0.03      0.03 0.03
DIVORCEFAMILY[T.1]      1.15      1.45 1.29
SEX[T.2]                1.70      2.07 1.87
AGE                     1.00      1.00 1.00

For the statistically significant explanatory variables DIVORCEFAMILY and SEX, the intervals are (1.15 - 1.45) and (1.70 - 2.07) respectively.
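As a side note on where such intervals come from: a logit coefficient lives on the log-odds scale, so the odds ratio and its bounds are obtained by exponentiating the coefficient and its confidence limits. Below is a minimal sketch, assuming a normal approximation with z = 1.96. The function name and the example coefficient and standard error are hypothetical, picked only so that the result lands near the DIVORCEFAMILY row; they are not taken from the model output.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Return (odds ratio, lower bound, upper bound) for a logit coefficient.

    beta is the coefficient on the log-odds scale, se its standard error.
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical inputs for illustration (not read from the summary above)
or_, lo, hi = odds_ratio_ci(beta=0.255, se=0.059)
print(round(or_, 2), round(lo, 2), round(hi, 2))
```

An interval that excludes 1 (as for DIVORCEFAMILY and SEX) goes together with a significant coefficient, while an interval around 1.00 (as for AGE) does not.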

Python code:

%pylab inline
%matplotlib inline

import numpy
import pandas
import statsmodels.api as sm
import seaborn as sns
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import scipy.stats

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%.2f' % x)

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

##############################################################################
# DATA MANAGEMENT
##############################################################################

# setting variables you will be working with to numeric
# (convert_objects no longer exists in current pandas; pandas.to_numeric replaces it)
for col in ['IDNUM', 'S1Q2D', 'S1Q2K', 'AGE', 'SEX', 'GENAXLIFE']:
    data[col] = pandas.to_numeric(data[col], errors='coerce')

sub1 = data.copy()

# Recode the two variables corresponding to death or divorce in the family before 18:

def DEATHFAMILY(x):
    if x['S1Q2K'] == 1:
        return 1
    else:
        return 0

sub1['DEATHFAMILY'] = sub1.apply(lambda x: DEATHFAMILY(x), axis=1)
print(pandas.crosstab(sub1['S1Q2K'], sub1['DEATHFAMILY']))

def DIVORCEFAMILY(x):
    if x['S1Q2D'] == 1:
        return 1
    else:
        return 0

sub1['DIVORCEFAMILY'] = sub1.apply(lambda x: DIVORCEFAMILY(x), axis=1)
print(pandas.crosstab(sub1['S1Q2D'], sub1['DIVORCEFAMILY']))

# univariate analysis

sub1['DEATHFAMILY'] = sub1['DEATHFAMILY'].astype('category')
sub1['DIVORCEFAMILY'] = sub1['DIVORCEFAMILY'].astype('category')
sub1['SEX'] = sub1['SEX'].astype('category')
sub1['GENAXLIFE'] = pandas.to_numeric(sub1['GENAXLIFE'], errors='coerce')

sns.countplot(x='GENAXLIFE', data=data)
plt.xlabel('Incidence generalized anxiety')
plt.title('Incidence of generalized anxiety')

data['GENAXLIFE'] = data['GENAXLIFE'].astype('category')
c_genanx = data.groupby('GENAXLIFE').size()
print(c_genanx)

p_genanx = data.groupby('GENAXLIFE').size() * 100 / len(data)
print(p_genanx)

sns.countplot(x='DEATHFAMILY', data=sub1)
plt.xlabel('Incidence of death in family before age 18')
plt.title('Incidence of death in family before age 18')

sns.countplot(x='DIVORCEFAMILY', data=sub1)
plt.xlabel('Incidence of divorce in family before age 18')
plt.title('Incidence of divorce in family before age 18')

# bivariate analysis

# chi-square tests:

# contingency table of observed counts
ct1 = pandas.crosstab(sub1['GENAXLIFE'], sub1['DIVORCEFAMILY'])
print(ct1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)
# colpct is a two dimensional array, where the columns represent an array (axis=0), and the rows represent another array (axis=1)

# chi-square
print('chi-square value, p value, expected counts FOR DIVORCE IN THE FAMILY')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)

# contingency table of observed counts
ct2 = pandas.crosstab(sub1['GENAXLIFE'], sub1['DEATHFAMILY'])
print(ct2)

# column percentages
colsum = ct2.sum(axis=0)
colpct = ct2 / colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts FOR DEATH IN THE FAMILY')
cs2 = scipy.stats.chi2_contingency(ct2)
print(cs2)

# LOGISTIC REGRESSION

# logistic regression with DIVORCEFAMILY
lreg2 = smf.logit(formula='GENAXLIFE ~ DIVORCEFAMILY', data=sub1).fit()
print(lreg2.summary())

# odds ratios
print("Odds Ratios")
print(numpy.exp(lreg2.params))

Data Analysis Tools - Week 4 - Exploring Statistical Interactions

Hi there,

Last week we explored the relationship between URBANRATE and EMPLOYRATE as quantitative variables, and drew some interesting conclusions.

We want to explore this week the effect of moderator variables.

Moderator variables are third variables that affect either the strength or the direction of the relationship between the two variables that we are exploring. The question is: is the response variable related to the explanatory variable at each level of the third variable (the moderator)?

In the context of the relationship we are researching, we will consider INCOMEGDP as the moderator variable. It is derived from INCOMEPERPERSON, which is a quantitative variable, so we need to convert it to categorical.

In order to do this, we create a new categorical variable based on three levels of the variable INCOMEPERPERSON, with cuts at 744 and 9425 dollars per person per year (cut figures inspired by https://en.wikipedia.org/wiki/Developing_country):

def incomegdp(row):
    if row['INCOMEPERPERSON'] <= 744.239:
        return 1
    elif row['INCOMEPERPERSON'] <= 9425.326:
        return 2
    elif row['INCOMEPERPERSON'] > 9425.326:
        return 3

data_clean['INCOMEGDP'] = data_clean.apply(lambda row: incomegdp(row), axis=1)

chk1 = data_clean['INCOMEGDP'].value_counts(sort=False, dropna=False)
print(chk1)

1    43
2    76
3    37
dtype: int64

Next, we create three subsets of the data, one per income group, identified by the codes 1, 2 and 3:

sub1 = data_clean[(data_clean['INCOMEGDP'] == 1)]
sub2 = data_clean[(data_clean['INCOMEGDP'] == 2)]
sub3 = data_clean[(data_clean['INCOMEGDP'] == 3)]

Now, we draw the scatterplots and calculate the r-coefficients for each of these subsets of data:

1) LOW income countries

scat1 = sns.regplot(x="URBANRATE", y="EMPLOYRATE", data=sub1)
plt.xlabel('Urban Rate')
plt.ylabel('Employ Rate')
plt.title('Scatterplot for the Association Between Urban Rate and Employ Rate for LOW income countries')
print(scat1)

print(scipy.stats.pearsonr(sub1['URBANRATE'], sub1['EMPLOYRATE']))

(-0.52276426265535059, 0.00032274862552258819)

The R-value is negative and the p-value is low enough to reject the null hypothesis: for LOW income countries, there does appear to be a statistically significant negative correlation between urban rate and employment rate. The more urbanized the country, the lower the employment rate. A first hypothesis to explain this is that in LOW income countries urbanization has proceeded at a fast pace, without guarantees for citizens, thereby concentrating large unemployed populations in cities.

2) MEDIUM income countries

scat2 = sns.regplot(x="URBANRATE", y="EMPLOYRATE", data=sub2)
plt.xlabel('Urban Rate')
plt.ylabel('Employ Rate')
plt.title('Scatterplot for the Association Between Urban Rate and Employ Rate for MEDIUM income countries')
print(scat2)

print(scipy.stats.pearsonr(sub2['URBANRATE'], sub2['EMPLOYRATE']))

(-0.19332386782897013, 0.094277702317991791)

The R-value is negative but low, and the p-value is too high, so we cannot reject the null hypothesis here: for MEDIUM income countries, there does not appear to be a statistically significant correlation between urban rate and employment rate.

3) HIGH income countries

scat3 = sns.regplot(x="URBANRATE", y="EMPLOYRATE", data=sub3)
plt.xlabel('Urban Rate')
plt.ylabel('Employ Rate')
plt.title('Scatterplot for the Association Between Urban Rate and Employ Rate for HIGH income countries')
print(scat3)

print(scipy.stats.pearsonr(sub3['URBANRATE'], sub3['EMPLOYRATE']))

(0.056821549585708134, 0.73835006605811548)

The R-value is positive but extremely low, and the p-value is very high. We cannot reject the null hypothesis here either: we cannot conclude that there is a correlation between urban rate and employment rate in HIGH income countries.
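One way to see why the three groups end up with such different p-values: the significance of a Pearson r depends both on its size and on the number of observations, through the statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. Here is a minimal sketch, assuming the group sizes are the INCOMEGDP counts printed earlier (43, 76 and 37); the function name is hypothetical.

```python
import math

def pearson_t(r, n):
    """t statistic for testing that a Pearson correlation is zero (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# r values from the pearsonr outputs above; n is assumed to be the INCOMEGDP count of each group
groups = {"LOW": (-0.52276, 43), "MEDIUM": (-0.19332, 76), "HIGH": (0.05682, 37)}
for name, (r, n) in groups.items():
    print(name, round(pearson_t(r, n), 2))
```

The larger the absolute value of t, the smaller the two-sided p-value, so only the LOW income group ends up clearly significant.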

Data Analysis Tools - Week 3 - Pearson Correlation

Hi there,

After spending the previous two weeks on different tests for categorical variables, this week we go back to our Q-> Q analysis and work with explanatory and response variables that are both quantitative.

(SIDE NOTE: you can see previous two weeks here:

week 1 (ANOVA test):

http://playingaroundwithdataanalysis.tumblr.com/post/139746508217/data-analysis-tools-week-1-running-and-analysis

week 2 (chi-square test): http://playingaroundwithdataanalysis.tumblr.com/post/140019560812/data-analysis-tools-week-2-chi-square-test-of)

To explore the relationship between two quantitative variables, we need to perform a Pearson correlation.

The Pearson coefficient r represents how strong the linear relationship between both variables is, and can take values from -1 to 1. On the other hand, r squared represents the fraction of the variability of one variable that can be predicted from the other. For example, if r squared=0.37, we can predict 37% of the variability we will see in the response variable (say, the rate of internet use); it also means that 63% of the variability is unaccounted for.
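To make the definition concrete, here is a minimal sketch of how r and r squared can be computed by hand in plain Python. The function name and the toy numbers are made up for illustration; they are not Gapminder values.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up toy values, for illustration only
x = [10, 20, 30, 40, 50]
y = [62, 60, 55, 57, 50]
r = pearson_r(x, y)
print(r, r ** 2)
```

scipy.stats.pearsonr, used in the rest of this post, returns the same coefficient together with its p-value.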

We have been previously testing association between the urban rate (URBANRATE) and employment rate variables (EMPLOYRATE).

When fitting a linear regression model with EMPLOYRATE as the dependent variable and URBANRATE as the independent variable, we get a negative relationship, as we can visually inspect in the scatter plot:

scat1 = sns.regplot(x="URBANRATE", y="EMPLOYRATE", fit_reg=True, data=data)
plt.xlabel('Urban Rate')
plt.ylabel('Employment Rate')
plt.title('Scatterplot for the Association Between Urban Rate and Employment Rate')

We can see a negative correlation between both variables EMPLOYRATE and URBANRATE. So, as previous exploratory inspections of the data had already pointed to, it looks like increasing rates of urbanization are associated with higher rates of unemployment. In previous weeks we hypothesized about the possible reasons behind that. On one hand, there must be some lurking variables or confounders, as every aggregated economic indicator hides many factors behind it. On the other hand, it is possible that countries where the urbanization rate has grown beyond economic sustainability are rendering people jobless.

In order to ensure significance of this resulting association, now, let's test Pearson correlation to check the strength and statistical significance of this correlation:

data_clean=data.dropna()

print('association between urbanrate and employmentrate')
print(scipy.stats.pearsonr(data_clean['URBANRATE'], data_clean['EMPLOYRATE']))

association between urbanrate and employmentrate
(-0.36552454654455263, 2.7043735860123006e-06)

RESULT: The association between urban rate and employment rate is very significant: **p-value is 2.70e-06 <<< .05**. Both variables have a negative correlation **(r=-0.366)**, with increasing urbanization associated with decreasing employment rates.

Although initially we could argue that a higher employment rate is associated with a higher urbanization level in a country (more opportunities = more jobs and employment), in our analysis we have seen a negative correlation between the two. In fact, in our previous analyses with ANOVA and chi-square, we saw that the countries with a lower urbanization rate have a better employment rate, and that the difference in employment rate between the moderately urban and the highly urban countries is not very significant.

To explore confounding factors to this association, now, let's explore the relationship between another two quantitative variables: EMPLOYRATE and INTERNETUSERATE:

scat1 = sns.regplot(x="INTERNETUSERATE", y="EMPLOYRATE", fit_reg=True, data=data)
plt.xlabel('Internet Use Rate')
plt.ylabel('Employment Rate')
plt.title('Scatterplot for the Association Between Internet Use Rate and Employment Rate')

print('association between Internet Use rate and Employrate')
print(scipy.stats.pearsonr(data_clean['INTERNETUSERATE'], data_clean['EMPLOYRATE']))

association between Internet Use rate and Employrate
(-0.20688588905526098, 0.0095608612514560005)

Here, we see that the association between Internet use rate and employment rate is also negative, but weaker (r=-0.207), with a p-value of 0.00956, so it is statistically significant. According to this relationship, the higher the Internet penetration in a country, the lower its employment rate! These are indeed surprising conclusions, and they suggest that there are lurking variables around. What if we consider another variable, INCOMEPERPERSON?

Lastly, let's see the relationship between INCOMEPERPERSON and EMPLOYRATE:

scat1 = sns.regplot(x="INCOMEPERPERSON", y="EMPLOYRATE", fit_reg=True, data=data)
plt.xlabel('Income Per person Rate')
plt.ylabel('Employment Rate')
plt.title('Scatterplot for the Association Between Income Per person and Employment Rate')

print('association between Income Per person rate and Employrate')
print(scipy.stats.pearsonr(data_clean['INCOMEPERPERSON'], data_clean['EMPLOYRATE']))

association between Income Per person rate and Employrate
(-0.033451718257051913, 0.67845844748680695)

Here, we cannot conclude there is any statistically significant relationship between INCOMEPERPERSON and EMPLOYRATE. So, we cannot conclude that the more income per person the higher is going to be the employrate. This is actually surprising as well.

Next week, we will explore moderator variables, that is, third variables that affect either the strength or the direction of the relationship between the two variables that we are exploring.

Is the response variable related to the explanatory variable for each level of the third variable or moderator?

Our moderator to test will be incomeGDP, in the context of the relationship between URBANRATE and EMPLOYRATE.

Python code:

%pylab inline
%matplotlib inline

import numpy as np
import pandas
import seaborn as sns
import scipy
from scipy import stats, integrate
import matplotlib.pyplot as plt

# LOAD THE DATASET AND CONVERT COLUMNS TO UPPER-CASE TO AVOID ERRORS
# (on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False)
data = pandas.read_csv('gapminder.csv', low_memory=False, on_bad_lines='skip')
data.columns = map(str.upper, data.columns)

# bug fix for display formats to avoid runtime errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Set PANDAS to show all columns in a Dataframe
# (the default display of pandas has a limit)
pandas.set_option('display.max_columns', None)
pandas.set_option('display.max_rows', None)

# In order to plot the scatter plot, we need to remove all NaN values.
# Also, we have noticed that incomeperperson variable has empty values. We
# need to convert them to NaN first (numpy is imported as np above):
data['INCOMEPERPERSON'] = data['INCOMEPERPERSON'].replace(' ', np.nan)

data_clean = data.dropna()

print('association between urbanrate and employmentrate')
print(scipy.stats.pearsonr(data_clean['URBANRATE'], data_clean['EMPLOYRATE']))

Data Analysis Tools- Week 2 - Chi Square Test of Independence

Hi there,

This week, after running an analysis of variance (ANOVA) on the relationship between different urbanization levels and the employment rate, we are going to play with another concept: the Chi Square Test of Independence.

A summary of the conclusion we drew there:

“Indeed, we can see that UrbLevel1 mean seems to be different to UrbLevel2 and Urblevel3 respectively with statistical significance (we can reject the null hypothesis there). However, UrbLevel2 and UrbLevel3 do not seem to be different and we would not reject null hypothesis there.”

*Chi-Square Statistic (C-> C)*

In order to perform this test, I need to choose two categorical variables:

- UrbanizationGroup, a derived variable that I created for the ANOVA test, to categorize the levels of urbanization of a country in three levels (I am referring to this derived variable in previous post for Week 1, and a more detailed analysis of this variable in Week 3 for previous course):

http://playingaroundwithdataanalysis.tumblr.com/post/139742078627/data-management-and-visualization-week-3

- UrbLevel1: Countries having 30 percent or less of their population living in urban areas.
- UrbLevel2: Countries having between 30 and 75 percent of their population living in urban areas.
- UrbLevel3: Countries having more than 75 percent of their population living in urban areas.

- Another categorical variable: I will use another derived variable that I created by binning the variable INCOMEPERPERSON into two groups or categories: Income1 countries (underdeveloped) and Income2 countries (developed). This variable is called INCOMEGDPGROUP, and it will be the response variable.

data_valid_incomeperperson = sub1['INCOMEPERPERSON'][~np.isnan(sub1['INCOMEPERPERSON'])]

data_valid_incomeperperson.describe()

count       181.000000
mean       8637.722641
std       14286.776127
min         115.305996
25%         744.239413
50%        2557.433638
75%        9243.587053
max      105147.437697
Name: INCOMEPERPERSON, dtype: float64

These are the two levels for my new created variable INCOMEGDPGROUP:

-- Income1: Countries with income per capita of US$12,475 or less. This groups together low income countries (income per capita of US$1,026 or less), lower middle income countries (between US$1,026 and US$4,036) and upper middle income countries (between US$4,036 and US$12,476).

-- Income2: High income countries, with income per capita above US$12,475.

The single cut at 12475 is the one used in the code below, and it is consistent with the resulting split of 143 and 38 countries.

sub1['INCOMEGDPGROUP'] = pandas.cut(sub1['INCOMEPERPERSON'], [0, 12475, 150000], labels=["Income1", "Income2"])

sub1['INCOMEGDPGROUP'] = sub1['INCOMEGDPGROUP'].astype('category')
c_gdpgroup = sub1.groupby('INCOMEGDPGROUP').size()
print(c_gdpgroup)

INCOMEGDPGROUP
Income1    143
Income2     38
dtype: int64

We can check with crosstab function that the assignment is ok with regards to INCOMEPERPERSON variable:

print(pandas.crosstab(sub1['INCOMEPERPERSON'], sub1['INCOMEGDPGROUP']))

If we plot the variable via bar chart:

Now, we get the contingency table of the observed counts:

We set INCOMEGDPGROUP as the two-level response variable and URBANIZATIONGROUP as the three-level explanatory variable. If the chi-square test shows a statistically significant relationship between these two variables, I will need to run three post-hoc chi-square tests, one for each pair of urbanization levels.

# contingency table of observed counts
ct2 = pandas.crosstab(sub1['INCOMEGDPGROUP'], sub1['URBANIZATIONGROUP'])
print(ct2)

URBANIZATIONGROUP  UrbLevel1  UrbLevel2  UrbLevel3
INCOMEGDPGROUP
Income1                   29         96         17
Income2                    1         10         27

In percentages:

colsum = ct2.sum(axis=0)
colpct = ct2 / colsum
print(colpct)

URBANIZATIONGROUP  UrbLevel1  UrbLevel2  UrbLevel3
INCOMEGDPGROUP
Income1             0.966667    0.90566   0.386364
Income2             0.033333    0.09434   0.613636

In this case, the output of the crosstab function is set with the explanatory variable URBANIZATIONGROUP three levels across the top, and the response 2-level variable INCOMEGDPGROUP down the side. Each column is the percentage we want to interpret. In other words, we are interested in whether the rate of INCOME differs according to which explanatory group the observations belong to (UrbLevel1, UrbLevel2, UrbLevel3).

If I wanted to graph the percentage of countries in a given INCOMEGDPGROUP within each UrbLevel category, I would set the explanatory variable to categorical and the response variable to numeric, and then plot a bivariate bar chart. We cannot do this here because INCOMEGDPGROUP is not coded as 0 and 1, but as two different labels:

sub1['URBANIZATIONGROUP'] = sub1['URBANIZATIONGROUP'].astype('category')
sub1['INCOMEGDPGROUP'] = pandas.to_numeric(sub1['INCOMEGDPGROUP'], errors='coerce')

Let’s do the Chi-square test:

print('chi-square value, p value, expected counts')
cs2 = scipy.stats.chi2_contingency(ct2)
print(cs2)

The results are:

chi-square value, p value, expected counts
(57.178489370819108, 3.8357367308973694e-13, 2, array([[ 23.66666667,  83.62222222,  34.71111111],
       [  6.33333333,  22.37777778,   9.28888889]]))

The chi-square value is high (57.18, with 2 degrees of freedom), and the p-value is very small, so there is a statistically significant relationship between both categorical variables (we reject the null hypothesis).
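To see where these numbers come from, here is a minimal sketch that rebuilds the expected counts and the chi-square statistic in plain Python from the observed counts of the contingency table. The function name is hypothetical. The p-value line relies on the fact that, with exactly 2 degrees of freedom, the chi-square tail probability is exp(-x / 2); it would not be valid for other degrees of freedom.

```python
import math

def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom, expected counts) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count of a cell = row total * column total / grand total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof, expected

# Observed counts from the contingency table above (rows: Income1, Income2)
observed = [[29, 96, 17], [1, 10, 27]]
stat, dof, expected = chi_square(observed)
# Closed form valid only because dof == 2 here
p_value = math.exp(-stat / 2)
print(stat, dof, p_value)
```

No continuity correction is applied in this sketch. scipy.stats.chi2_contingency applies Yates' correction by default only when there is 1 degree of freedom, so the sketch agrees with the 2x3 result but would differ slightly on 2x2 tables.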

Post-hoc tests:

Now we want to see why the null hypothesis is rejected, that is, how the income levels differ across the urbanization categories: maybe only two of the urbanization levels differ from one another, or maybe all three differ among them.

To determine which groups are different from each other, we need to run post-hoc tests, conducting comparisons between pairs of rates in a way that avoids excessive type 1 error (ie avoids rejecting the null hypothesis when the null hypothesis is true).

To perform the post-hoc tests with a Bonferroni adjustment of the p-value, we need to run a chi-square test for each of the 3 paired comparisons (UrbLevel1-UrbLevel2, UrbLevel1-UrbLevel3, UrbLevel2-UrbLevel3) and compare each p-value with the adjusted threshold 0.05/3 = 0.017. I need to recode variables to this end:

1) UrbLevel 1 vs UrbLevel2

recode = {'UrbLevel1': 'UrbLevel1', 'UrbLevel2': 'UrbLevel2'}
sub1['COMP1v2'] = sub1['URBANIZATIONGROUP'].map(recode)

ct3 = pandas.crosstab(sub1['INCOMEGDPGROUP'], sub1['COMP1v2'])
print(ct3)

# column percentages
colsum = ct3.sum(axis=0)
colpct = ct3 / colsum
print(colpct)

COMP1v2         UrbLevel1  UrbLevel2
INCOMEGDPGROUP
Income1                29         96
Income2                 1         10
COMP1v2         UrbLevel1  UrbLevel2
INCOMEGDPGROUP
Income1          0.966667    0.90566
Income2          0.033333    0.09434

print('chi-square value, p value, expected counts')
cs3 = scipy.stats.chi2_contingency(ct3)
print(cs3)

chi-square value, p value, expected counts
(0.49379897084047991, 0.48223754701148502, 1, array([[ 27.57352941,  97.42647059],
       [  2.42647059,   8.57352941]]))

We can see here that the p-value is too high (0.48), therefore we cannot reject the null hypothesis here: income behavior seems to be equal or similar between UrbLevel1 and UrbLevel2. Now, we run the second comparison:

2) UrbLevel1 vs UrbLevel3

recode2 = {'UrbLevel1': 'UrbLevel1', 'UrbLevel3': 'UrbLevel3'}
sub1['COMP1v3'] = sub1['URBANIZATIONGROUP'].map(recode2)

ct4 = pandas.crosstab(sub1['INCOMEGDPGROUP'], sub1['COMP1v3'])
print(ct4)

# column percentages
colsum = ct4.sum(axis=0)
colpct = ct4 / colsum
print(colpct)

COMP1v3         UrbLevel1  UrbLevel3
INCOMEGDPGROUP
Income1                29         17
Income2                 1         27
COMP1v3         UrbLevel1  UrbLevel3
INCOMEGDPGROUP
Income1          0.966667   0.386364
Income2          0.033333   0.613636

print('chi-square value, p value, expected counts')
cs4 = scipy.stats.chi2_contingency(ct4)
print(cs4)

chi-square value, p value, expected counts
(23.131137069452286, 1.5132035021203873e-06, 1, array([[ 18.64864865,  27.35135135],
       [ 11.35135135,  16.64864865]]))

Here, it looks like there are differences in behavior between groups 1 and 3: the p-value (1.5e-06) is far below 0.05/3, so we can safely reject the null hypothesis.

Now, we run the third comparison:

3) UrbLevel2 vs UrbLevel3

recode3 = {'UrbLevel2': 'UrbLevel2', 'UrbLevel3': 'UrbLevel3'}
sub1['COMP2v3'] = sub1['URBANIZATIONGROUP'].map(recode3)

ct5 = pandas.crosstab(sub1['INCOMEGDPGROUP'], sub1['COMP2v3'])
print(ct5)

# column percentages
colsum = ct5.sum(axis=0)
colpct = ct5 / colsum
print(colpct)

COMP2v3         UrbLevel2  UrbLevel3
INCOMEGDPGROUP
Income1                96         17
Income2                10         27
COMP2v3         UrbLevel2  UrbLevel3
INCOMEGDPGROUP
Income1           0.90566   0.386364
Income2           0.09434   0.613636

print('chi-square value, p value, expected counts')
cs5 = scipy.stats.chi2_contingency(ct5)
print(cs5)

chi-square value, p value, expected counts
(42.371977105446803, 7.5463572238555074e-11, 1, array([[ 79.85333333,  33.14666667],
       [ 26.14666667,  10.85333333]]))

Similarly, we can reject the null hypothesis here and conclude that urbanization groups 2 and 3 behave differently with regards to the income rate.
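The Bonferroni decision for the three comparisons can be sketched in a few lines: with 3 comparisons, each p-value is compared with 0.05 / 3 instead of 0.05. The helper name below is hypothetical; the p-values are copied from the three post-hoc outputs above.

```python
def bonferroni(p_values, alpha=0.05):
    """Compare each p-value with alpha divided by the number of comparisons."""
    threshold = alpha / len(p_values)
    return threshold, {name: p < threshold for name, p in p_values.items()}

# p-values copied from the three post-hoc chi-square outputs above
p_values = {
    "UrbLevel1 vs UrbLevel2": 0.48223754701148502,
    "UrbLevel1 vs UrbLevel3": 1.5132035021203873e-06,
    "UrbLevel2 vs UrbLevel3": 7.5463572238555074e-11,
}
threshold, significant = bonferroni(p_values)
print(round(threshold, 4))
for name, is_sig in significant.items():
    print(name, "reject null" if is_sig else "cannot reject null")
```

Only the comparisons involving UrbLevel3 pass the adjusted threshold, which matches the conclusions drawn above.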

Regression Modeling in Practice - Week 1 - Writing about your data

Assignment summary:

“This week’s assignment is to submit a blog entry in which you describe 1) your sample, 2) the data collection procedure, and 3) a measures section describing your variables and how you managed them to address your own research question. “

Gapminder http://www.gapminder.org defines itself as a “fact-based world view”.

Gapminder Foundation is a non-profit venture registered in Stockholm, Sweden, that promotes sustainable global development and achievement of the United Nations Millennium Development Goals by increased use and understanding of statistics and other information about social, economic and environmental development at local, national and global levels.

Since its conception in 2005, Gapminder has grown to include over 200 indicators, including gross domestic product, total employment rate, and estimated HIV prevalence. Gapminder contains data for all 192 UN members, aggregating data for Serbia and Montenegro. Additionally, it includes data for 24 other areas, generating a total of 215 areas. GapMinder collects data from a handful of sources, including the Institute for Health Metrics and Evaluation, US Census Bureau’s International Database, United Nations Statistics Division, and the World Bank.

Gapminder contains multiple visualizations and explanations about the data gathered.

IMPORTANT: See previous posts (course Data Analysis and Interpretation) about:

- Research question: explore the relationship between employment rate (response variable) vs other explanatory variables (mainly Urbanization rate, but also Internet use rate).

Week1:

http://playingaroundwithdataanalysis.tumblr.com/post/139295250087/data-analysis-and-interpretation-coursera-week

- Initial analysis of variables used (frequency counts, univariate analysis):

Week 2:

http://playingaroundwithdataanalysis.tumblr.com/post/139306452207/data-analysis-and-interpretation-week-2-writing

Samples:

Gapminder dataset includes several economic and health indicators at a country level of aggregation of a big sample of countries around the world. That is, 1 single observation per country. For some variables some observations are missing.

The number of countries and territories to include is arbitrary, but they have decided to include the following entities:

192 UN members (as of April 2008)

51 other entities listed in the “List of countries” in Wikipedia (2008-05-13). These include the Vatican, dependent territories, special entities and disputed territories. We have excluded the two “sub-dependencies” Ascension Island and Tristan da Cunha, although they are listed by Wikipedia.

4 French overseas territories (Guadeloupe, Martinique, Reunion and French Guyana), although they are considered an integral part of France

10 former states

2 ad-hoc areas: “Serbia excluding Kosovo” and “the Channel Islands”. The latter is the collective name of the two dependent territories Guernsey and Jersey.

Below is an explanation about the methodology and sources used to extract each of the variables that are of interest in our research question:

The collected data is observational at country aggregate level.

- urbanrate: 2008 urban population (% of total). Urban population refers to people living in urban areas as defined by national statistical offices. It is calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects. Source: United Nations, World Urbanization Prospects. http://data.worldbank.org/indicator

Below is an example of a visualization for Urban population indicator in Gapminder website:

As per the univariate analysis for this data:

In the case of EMPLOYRATE, there are 33 missing values. In the case of URBANRATE there are 9 missing values. In the case of INTERNETUSERATE there are 19 missing values.

These are the stats for this variable in our dataset:

data['URBANRATE'].describe()

count   193.000000
mean     56.483938
std      23.707742
min      10.400000
max     100.000000

- employrate: 2007 total employees age 15+ (% of population). Definition of indicator: percentage of total population, age above 15, that has been employed during the given year. Source: International Labour Organization http://www.ilo.org/global/lang--en/index.htm

Below is an example of a visualization for % employ rate indicator in Gapminder website:

These are the stats for this variable in our dataset:

data['EMPLOYRATE'].describe()

count   169.000000
mean     58.746746
std      10.490075
min      32.000000
max      83.199997

- internetuserate: 2010 Internet users (per 100 people). Definition of indicator: Internet users are people with access to the worldwide network. Source: World Bank

Below is an example of one of the visualizations in gapminder web page for Internet Use rate:

These are the stats of this variable in our dataset:

data['INTERNETUSERATE'].describe()

count   183.000000
mean     35.540207
std      27.758392
min       0.210066
max      95.638113

Procedures:

The procedures used to collect the data depend on the source being used, but mostly rely on surveys and statistical population studies.

The first point of data collection is at each country level, but the procedures for aggregating those data depend on each source organization (ie World Bank, etc).

The periods of collection have been listed above when explaining each of the variable data.

Measures:

The measures for each of the variables have been included above, in the explanation of each variable.

All the data used to assess the research question are quantitative data.

INTERNETUSERATE, EMPLOYRATE and URBANRATE are relative measures, given as quantitative percentages (from 0 to 100) within each country.

INCOMEPERPERSON is given as an absolute quantitative value of US$ /person/year.

Data management actions

In order to proceed with data analysis, generally the only data management necessary is:

- to convert measures to numeric according to Pandas

- Remove NA values in order to perform some plots or tests.

There are no other data management actions necessary whenever quantitative explanatory or response variables are needed.

In case some categorical variable is needed for specific sections of the research, I have decided to bin some of these variables in specific groups.

For example, I divide countries according to their URBANRATE into three levels (UrbLevel1, UrbLevel2, UrbLevel3), given 2 thresholds.

See more info about data management decisions here:

http://playingaroundwithdataanalysis.tumblr.com/post/139742078627/data-management-and-visualization-week-3

Data Analysis Tools - Week 1 - Running an Analysis of Variance

Hi,

This is the assignment description: Run an analysis of variance. You will need to analyze and interpret post hoc paired comparisons in instances where your original statistical test was significant, and you were examining more than two groups (i.e. more than two levels of a categorical, explanatory variable).

The details about my research question are described in a previous blog post:

http://playingaroundwithdataanalysis.tumblr.com/post/139295250087/data-analysis-and-interpretation-coursera-week

Note:

- Previous posts for Data Analysis and Interpretation course:

Week1:

http://playingaroundwithdataanalysis.tumblr.com/post/139295250087/data-analysis-and-interpretation-coursera-week

Week 2:

http://playingaroundwithdataanalysis.tumblr.com/post/139306452207/data-analysis-and-interpretation-week-2-writing

Week 3:

http://playingaroundwithdataanalysis.tumblr.com/post/139742078627/data-management-and-visualization-week-3

So, I will be using the Gapminder dataset to explore the relationship between Urbanization rate and Employment rate:

I am now going to explore a C-> Q association within my main research question.

I want to check for the association between employment rate (Quantitative variable) and different levels of urbanization (categorical variable as explanatory).

With this I want to answer: *"Is the employment rate more strongly associated with less urbanized countries than with highly urbanized countries?" "Does the behavior of the employment rate differ between groups of countries defined by their urbanization rate?"*

In order to use a categorical variable, I will take the derived variable from the previous week, called URBANIZATIONGROUP, which has three levels according to the urbanization rate:

http://playingaroundwithdataanalysis.tumblr.com/post/139742078627/data-management-and-visualization-week-3

I will take the variable "URBANIZATIONGROUP" as categorical explanatory and Gapminder’s EMPLOYRATE as quantitative response in an ANOVA F-Test.

To get an idea of how the mean employment rate and its variability behave across the different categories of urbanization, I will draw boxplots of employment rate for each urbanization group.

data_valid_urbanrate = sub1['URBANRATE'][~np.isnan(sub1['URBANRATE'])]

sub1['URBANIZATIONGROUP']= pandas.cut(data_valid_urbanrate, [0, 29, 74, 100 ], labels=["UrbLevel1", "UrbLevel2", "UrbLevel3"])

c5 = sub1['URBANIZATIONGROUP'].value_counts(sort=False, dropna=True)
print(c5)

p5 = sub1['URBANIZATIONGROUP'].value_counts(sort=False, dropna=True, normalize=True)
print(p5)

Output of frequency counts for URBANIZATIONGROUP (3 levels):

UrbLevel1     31
UrbLevel2    114
UrbLevel3     48
dtype: int64

UrbLevel1    0.153465
UrbLevel2    0.564356
UrbLevel3    0.237624
dtype: float64
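The proportions printed above add up to about 0.955 rather than 1. One possible reading, offered here as an assumption rather than something the post states, is that they were computed over all 202 rows of the dataset (including the 9 countries with no URBANRATE), whereas dividing by the 193 valid rows gives the 0.16 / 0.59 / 0.25 split reported in the Week 3 post. A quick check of that arithmetic:

```python
# Counts per level as reported in the post
counts = {"UrbLevel1": 31, "UrbLevel2": 114, "UrbLevel3": 48}

n_valid = sum(counts.values())   # countries with a valid URBANRATE
n_all = 202                      # total rows reported in the Week 2 post (assumed denominator)

# Share of each level relative to all rows and relative to valid rows only
share_of_all = {k: round(v / n_all, 6) for k, v in counts.items()}
share_of_valid = {k: round(v / n_valid, 6) for k, v in counts.items()}

print(n_valid)
print(share_of_all)
print(share_of_valid)
```

Which denominator to report depends on whether the missing countries should count as part of the population being described.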

We are going to run an ANOVA F-test for “Urbanization Group” as a categorical variable and the quantitative “EMPLOYRATE” as response variable. The NULL hypothesis we are going to test is:

“there is no significant relation between the level of urbanization of a country to its employment rate”.

import statsmodels.formula.api as smf

# using ols function for calculating the F-statistic and associated p value
model1 = smf.ols(formula='EMPLOYRATE ~ C(URBANIZATIONGROUP)', data=sub1)
results1 = model1.fit()
print(results1.summary())

This tests whether the mean employment rate is the same for the three urbanization groups. The p-value is 0.000503. Because it is so small (well below 0.05), we can safely reject the null hypothesis that the mean employment rate is the same across the three urbanization levels. This means that there is a significant association between the employment rate and the urbanization level of a country.
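To make the F statistic a bit less of a black box, here is a minimal sketch of how a one-way ANOVA F value is computed. It uses made-up numbers rather than the Gapminder data, and the helper name `one_way_anova_f` is hypothetical; the OLS summary reports the same kind of ratio together with its p-value.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of lists of numbers."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)

    # Between-group sum of squares: group size times the squared distance
    # of each group mean from the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Within-group sum of squares: squared distance of each value
    # from its own group mean.
    ss_within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups for x in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up values for three hypothetical groups (not Gapminder data)
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(f_stat, df_b, df_w)
```

A large F means the group means are far apart relative to the spread inside each group, which is what a small p-value reflects.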

sub2 = sub1[['EMPLOYRATE', 'URBANIZATIONGROUP']].dropna()

# pandas.to_numeric replaces convert_objects(), which was removed from recent pandas releases
sub2['EMPLOYRATE'] = pandas.to_numeric(sub2['EMPLOYRATE'], errors='coerce')

ct1 = sub2.groupby('EMPLOYRATE').size()
print(ct1)

Let's draw the boxplot for the three urbanization groups, to see how the employment rate varies across them. We will use the boxplot function in seaborn to draw the boxes, and swarmplot to draw the data points.

https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.boxplot.html

ax = sns.boxplot(x="URBANIZATIONGROUP", y="EMPLOYRATE", data=sub2)
ax = sns.swarmplot(x="URBANIZATIONGROUP", y="EMPLOYRATE", data=sub2, color=".25")

We can see in the above box plot:

Even though the difference in medians between moderately urban (UrbLevel2) and most urban (UrbLevel3) countries does not look large, visually there seems to be a clear difference in median employment rate between the least urban countries (UrbLevel1) and the other two categories. The spread within the least urban countries also seems wider, showing more variance within that category.

We will need to run a post-hoc test to assess which of the three group means differ from each other with statistical significance.

print('means for EMPLOYRATE by URBANIZATION GROUP')
m1 = sub2.groupby('URBANIZATIONGROUP').mean()
print(m1)

means for EMPLOYRATE by URBANIZATION GROUP
                   EMPLOYRATE
URBANIZATIONGROUP
UrbLevel1           68.042308
UrbLevel2           57.213131
UrbLevel3           58.000000

print('STD DEV for EMPLOYRATE by URBANIZATION GROUP')
m2 = sub2.groupby('URBANIZATIONGROUP').std()
print(m2)

STD DEV for EMPLOYRATE by URBANIZATION GROUP
                   EMPLOYRATE
URBANIZATIONGROUP
UrbLevel1            9.769920
UrbLevel2           10.067848
UrbLevel3            8.624841

import statsmodels.stats.multicomp as multi

mc1 = multi.MultiComparison(sub2['EMPLOYRATE'], sub2['URBANIZATIONGROUP'])
res1 = mc1.tukeyhsd()
print(res1.summary())

Multiple Comparison of Means - Tukey HSD, FWER=0.05
====================================================
  group1    group2   meandiff   lower    upper  reject
----------------------------------------------------
UrbLevel1 UrbLevel2  -10.8292 -15.8859 -5.7725   True
UrbLevel1 UrbLevel3  -10.0423  -15.852 -4.2326   True
UrbLevel2 UrbLevel3    0.7869  -3.5513   5.125  False
----------------------------------------------------

Indeed, we can see that the UrbLevel1 mean differs from both the UrbLevel2 and the UrbLevel3 means with statistical significance (we can reject the null hypothesis for those two comparisons). However, UrbLevel2 and UrbLevel3 do not seem to differ, and we would not reject the null hypothesis for that pair.
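As a small sanity check on how to read the Tukey table, the sketch below recomputes the meandiff column from the group means reported earlier and applies the usual reading that a comparison is rejected when its confidence interval excludes zero. The numbers are copied from the outputs in this post; the variable names are only illustrative.

```python
# Group means as reported in the post
means = {"UrbLevel1": 68.042308, "UrbLevel2": 57.213131, "UrbLevel3": 58.000000}

# (group1, group2, lower, upper) copied from the Tukey HSD table in the post
intervals = [
    ("UrbLevel1", "UrbLevel2", -15.8859, -5.7725),
    ("UrbLevel1", "UrbLevel3", -15.852, -4.2326),
    ("UrbLevel2", "UrbLevel3", -3.5513, 5.125),
]

rows = []
for g1, g2, lower, upper in intervals:
    meandiff = round(means[g2] - means[g1], 4)   # group2 mean minus group1 mean
    reject = not (lower <= 0.0 <= upper)         # reject when the interval excludes 0
    rows.append((g1, g2, meandiff, reject))
    print(g1, g2, meandiff, reject)
```

The recomputed differences match the table, and only the UrbLevel2 vs UrbLevel3 interval straddles zero.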

Data Management and Interpretation - Week 4 - Creating graphs for your data

Hi there,

- Previous posts for Data Analysis and Interpretation course:

Week1:

http://playingaroundwithdataanalysis.tumblr.com/post/139295250087/data-analysis-and-interpretation-coursera-week

Week 2:

http://playingaroundwithdataanalysis.tumblr.com/post/139306452207/data-analysis-and-interpretation-week-2-writing

Week 3:

http://playingaroundwithdataanalysis.tumblr.com/post/139742078627/data-management-and-visualization-week-3

This week, we are going to do two different types of tasks:

STEP 1: Create graphs of your variables one at a time (univariate graphs).

STEP 2: Create a graph showing the association between your explanatory and response variables (bivariate graph).

STEP 1- UNIVARIATE GRAPHS

Plotting univariate distributions with seaborn library See: http://stanford.edu/~mwaskom/software/seaborn/tutorial/distributions.html

The most convenient way to take a quick look at a univariate distribution in seaborn is the distplot() function. By default, this will draw a histogram and fit a kernel density estimate (KDE).

http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.distplot.html#seaborn.distplot

seaborn.distplot(a, bins=None, hist=True, kde=True, rug=False, fit=None, hist_kws=None, kde_kws=None, rug_kws=None, fit_kws=None, color=None, vertical=False, norm_hist=False, axlabel=None, label=None, ax=None)

This function combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions. It can also fit scipy.stats distributions and plot the estimated PDF over the data.

So now, for our dataset and our variables, let's start plotting some distributions with distplot. Watch out for NaN values (we included them in our frequency distribution tables): we need to remove them before plotting. If there are missing values in the data passed to distplot, it will throw the error "cannot convert nan to integer values". That is why we use the np.isnan() function from the numpy package to copy the valid values into new variables, which we then pass as arguments to distplot:

data_valid_employrate = data['EMPLOYRATE'][~np.isnan(data['EMPLOYRATE'])]
data_valid_urbanrate = data['URBANRATE'][~np.isnan(data['URBANRATE'])]
data_valid_internetuserate = data['INTERNETUSERATE'][~np.isnan(data['INTERNETUSERATE'])]
data_valid_incomeperperson = data['INCOMEPERPERSON'][~np.isnan(data['INCOMEPERPERSON'])]
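For reference, the boolean-mask filter used above can be sketched on a tiny made-up Series. The sketch also shows that pandas' own dropna() gives the same result, which is an alternative I am pointing out here and not something the course code uses.

```python
import numpy as np
import pandas as pd

# Hypothetical employment rates with two missing entries (not Gapminder data)
rates = pd.Series([58.4, np.nan, 61.5, 47.3, np.nan, 65.0], name="EMPLOYRATE")

valid_by_mask = rates[~np.isnan(rates)]   # boolean mask, as in the post
valid_by_dropna = rates.dropna()          # pandas shortcut for the same filter

print(len(rates), len(valid_by_mask))
print(valid_by_mask.equals(valid_by_dropna))
```

Both versions keep the original index labels, so the filtered values can still be matched back to their countries.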

When drawing histograms, the main choice you have is the number of bins to use and where to place them. distplot() uses a simple rule to make a good guess for what the right number is by default, but trying more or fewer bins might reveal other features in the data.
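To see what changing the number of bins does without drawing anything, here is a minimal sketch with numpy.histogram on made-up percentages. The bin counts 4 and 8 are arbitrary choices for illustration, not values used in this post.

```python
import numpy as np

# Made-up percentages standing in for a rate variable (not Gapminder data)
values = np.array([32.0, 41.1, 46.0, 47.3, 51.2, 55.9, 56.4, 58.4,
                   59.9, 61.5, 63.8, 65.0, 68.3, 71.3, 78.2, 83.2])

for n_bins in (4, 8):
    # counts has one entry per bin; edges has n_bins + 1 boundaries
    counts, edges = np.histogram(values, bins=n_bins)
    width = edges[1] - edges[0]
    print(n_bins, "bins of width", round(float(width), 2), "->", counts.tolist())
```

Fewer bins give a smoother but coarser picture, while more bins can reveal gaps or secondary peaks at the cost of noisier counts.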

Let’s see the distribution of the EMPLOYRATE variable:

sns.distplot(data_valid_employrate, bins=20, hist=True, kde=True, rug=True, color='blue', axlabel='%employrate')
plt.xlabel('Employment rate')
plt.title('Employment rate across countries in Gapminder dataset')

Number of countries per percentage of employment during the year. This histogram is fairly normal, with a mean of about 58.7% and a median of about 58.4%. The minimum and maximum values of the employment rate are 32.0% and 83.2% (figures taken from the describe() output in the Week 2 post).

The distribution is unimodal and roughly symmetric around 55%.

Now, let’s see the distribution of the URBANRATE variable:

sns.distplot(data_valid_urbanrate, bins=20, hist=True, kde=True, rug=True, color='red', axlabel='%urbanrate')
plt.xlabel('Urbanization rate')
plt.title('Urbanization rate across countries in Gapminder dataset')

The histogram for urbanization rate looks fairly uniform.

The relatively "normal" distribution of the employment rate shows fewer countries with very low or very high employment rates, while the relatively "uniform" distribution of the urban rate shows a fairly even number of countries across low and high levels of urbanization. Within that, this variable shows three main "modes": around 25%, around 70-75%, and a smaller one around 90%.

Finally, let’s see the plot for INTERNETUSERATE variable:

sns.distplot(data_valid_internetuserate, bins=20, hist=True, kde=True, rug=True, color='g', axlabel='%internetuserate')
plt.xlabel('Internet use rate')
plt.title('Internet use rate across countries in Gapminder dataset')

At first sight, we see a right-skewed distribution: a few observations are much larger than the rest, while most values are small. This is in line with the general view that internet penetration remains very low in many countries.

STEP 2 - BIVARIATE GRAPHS

Since both the explanatory and the response variables are quantitative (URBANRATE and EMPLOYRATE), the graph we need is a scatterplot, which shows the relationship between two quantitative variables.

We set the following Hypothesis to test:

- H0: there is no relationship or association between urban rate and employment rate

- H1: there is a significant relationship or association between urban rate and employment rate.

We set the EMPLOYRATE as the quantitative response variable, and the URBANRATE as the quantitative independent or explanatory variable.

scat1 = sns.regplot(x="URBANRATE", y='EMPLOYRATE', fit_reg=True, data=data)
plt.xlabel('Urban RATE')
plt.ylabel('Employ Rate')
plt.title('Scatterplot for the Association between URBAN Rate and EMPLOY Rate')

It looks like there is a negative correlation: the more urbanized a country is, the lower its employment rate.
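A scatterplot only suggests the direction of the association. As a minimal sketch of how that direction and its strength could be put into a number, the hypothetical helper `pearson_r` below computes a Pearson correlation coefficient from first principles on made-up pairs. This is not the analysis run in this post, and the values are not from Gapminder.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up (urban rate, employment rate) pairs, not Gapminder data
urban = [15.0, 30.0, 45.0, 60.0, 75.0, 90.0]
employ = [78.0, 70.0, 61.0, 59.0, 55.0, 56.0]

r = pearson_r(urban, employ)
print(round(r, 3))
```

A value close to -1 would indicate a strong negative linear association, while a value near 0 would mean the downward slope in the plot is weak.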

Several open questions here:

- Is urbanization rendering people jobless?

- Does it mean that the people who find jobs are in the rural areas of the country?

- Does it mean that urbanization planning is deficient across countries?

- Is increasing employment helping people to leave cities and go back to their native towns?

- Are these jobs created in rural areas by people who return from the city?

- Is rapid urbanisation in developing countries leading to increasing relocation of people from rural to urban areas, with increasing pressure on the limited resources available in the cities?

- What if we explore the correlation between employment and urbanization within different ranges of urbanization? Would these correlations become stronger or more scattered?

As next steps in our work, we could do the following:

Use a derived variable based on the level of income per capita (INCOMEGDPGROUP), as we presume that there may be significant differences in the level of association between urbanization and employment across countries that belong to different segments of development...

#data_analysis #univariate #bivariate #histogram

Data Management and Visualization - Week 3 - Making Data Management Decisions

Hi there,

This week, after last week's exploratory data analysis, I am going to start making some decisions about the data, including generating secondary variables that I may need later to explore my research question.

- Previous posts for Data Analysis and Interpretation course:

Week1:

http://playingaroundwithdataanalysis.tumblr.com/post/139295250087/data-analysis-and-interpretation-coursera-week

Week 2:

http://playingaroundwithdataanalysis.tumblr.com/post/139306452207/data-analysis-and-interpretation-week-2-writing

There are several possible data management operations that we could perform now:

1) Setting aside missing data (removing for example the typical “Don’t Know/Don’t Answer” codes from the responses of a poll)

2) Coding valid data and recoding values (for example, skipping the questions related to previous questions in a poll that do not apply for the case). Another example would be to rename certain values for variables that do not have a self-explanatory form (like levels of certain quantity, etc)

3) Creating secondary variables, so: creating some variables from other variables. One example would be a secondary variable built from operations between other two variables (ie the number of cigarettes smoked per month as the multiplication of days * num cigarettes per day).

4) Grouping variables within individual variables. For example, create a new variable that categorizes another quantitative variable into categories or levels.
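Operations 1), 3) and 4) from the list above can be sketched in a few lines of pandas. The poll data, the column names and the 99 "Don't know" code are all made up for illustration; none of them come from the course datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical poll data; column names and codes are made up for illustration.
poll = pd.DataFrame({
    "smoke_days_per_month": [30, 9, 12, 99, 4],   # 99 = "Don't know" code (assumed)
    "cigs_per_day": [10, 2, 5, 3, 1],
})

# 1) Set aside missing data: turn the "Don't know" code into NaN.
poll["smoke_days_per_month"] = poll["smoke_days_per_month"].replace(99, np.nan)

# 3) Secondary variable: cigarettes per month = days * cigarettes per day.
poll["cigs_per_month"] = poll["smoke_days_per_month"] * poll["cigs_per_day"]

# 4) Grouping: bin the quantitative variable into labelled levels.
poll["smoker_level"] = pd.cut(
    poll["cigs_per_month"], [0, 20, 100, 1000], labels=["Light", "Medium", "Heavy"]
)

print(poll)
```

Note how the row with the "Don't know" code stays in the table but ends up with NaN in both derived columns, so it is set aside without being deleted.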

For my research, none of cases 1), 2) or 3) applies, because I do not want to exclude any observation (that is, exclude any country a priori).

I have therefore decided to focus on 4), grouping values within individual quantitative variables. These are described in the derived variables sections below:

1) Derived variables: URBANIZATIONGROUP

Apart from the missing values (countries with NA), which are already excluded, there are no values we want to restrict in our set, unlike in the class video examples. What we can do is create a secondary variable that subdivides the urban rate into tiers or levels of urbanization. We will now create a derived variable called “URBANIZATIONGROUP” to categorize the level of urbanization of a country. We will assign 3 groups, corresponding to the following intervals of the quantitative variable URBANRATE:
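The interval list itself did not survive in this post, but the code at the end of it bins URBANRATE with the edges [0, 29, 74, 100]. As a sketch of what those edges imply, assuming pandas' default right-inclusive intervals and using made-up rates that sit on and around the edges:

```python
import pandas as pd

# Made-up urbanization rates (percent), chosen to sit on and around the bin edges
urban_rate = pd.Series([10.4, 29.0, 29.5, 74.0, 74.5, 100.0])

# Same edges and labels as in the post; intervals are right-inclusive by default:
# (0, 29] -> UrbLevel1, (29, 74] -> UrbLevel2, (74, 100] -> UrbLevel3
groups = pd.cut(urban_rate, [0, 29, 74, 100],
                labels=["UrbLevel1", "UrbLevel2", "UrbLevel3"])

print(pd.DataFrame({"URBANRATE": urban_rate, "URBANIZATIONGROUP": groups}))
```

With these defaults a rate of exactly 29 lands in UrbLevel1 and a rate of exactly 74 lands in UrbLevel2, while a rate of 0 would fall outside every interval and come back as NaN.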

These are the counts for each of these three levels in our dataset:

UrbLevel1 31

UrbLevel2 114

UrbLevel3 48

or, in percentages:

UrbLevel1 0.160622

UrbLevel2 0.590674

UrbLevel3 0.248705

To see if these levels have been assigned correctly, we execute the crosstab function with these two variables.

An extract of the output:

Let’s plot the distribution of the three levels in the dataset with a bar chart (code):

sns.countplot(x='URBANIZATIONGROUP', data=data)
plt.xlabel('Urbanization groups')
plt.title('urb countries in Gapminder dataset')

2) Derived variables: INCOMEGDPGROUP

Now, let's create another derived variable based on the level of income per capita, as we presume that there may be significant differences in the level of association between urbanization and employment across countries that belong to different segments of development. This variable, which we call INCOMEGDPGROUP, will be derived from the INCOMEPERPERSON variable in the Gapminder dataset.

We will use two divisions:

- Income1: Low income countries (GNI per capita of US$1,026 or less) and lower middle income countries (GNI per capita between US$1,026 and US$4,036).

- Income2: Upper middle income countries (GNI per capita between US$4,036 and US$12,476) and high income countries (GNI per capita above US$12,476).
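The rule described above can be sketched as a small function. Two caveats: I am assuming that the boundary value US$4,036 itself falls in Income1 (the text does not say), and the post's own code further down cuts INCOMEPERPERSON at 12475 instead, so this sketch follows the prose and not the code. `income_group` is a hypothetical helper, and the sample incomes are made up.

```python
def income_group(gni_per_capita, cutoff=4036.0):
    """Return 'Income1' or 'Income2' for a GNI per capita in US$ (hypothetical helper)."""
    # None and NaN (NaN is the only value not equal to itself) stay unclassified
    if gni_per_capita is None or gni_per_capita != gni_per_capita:
        return None
    return "Income1" if gni_per_capita <= cutoff else "Income2"

# Made-up incomes around the thresholds quoted in the post
samples = [800.0, 1026.0, 4036.0, 4036.01, 12476.0, 30000.0, float("nan")]
labels = [income_group(x) for x in samples]
print(labels)
```

Passing cutoff=12475 would reproduce the split used in the code listing, which puts upper middle income countries in Income1 instead.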

Let’s see the distribution of the two levels of this variable in the dataset with a bar chart (code):

sns.countplot(x='INCOMEGDPGROUP', data=data)
plt.xlabel('Income GDP groups')
plt.title('gdp countries in Gapminder dataset')

3) Subsetting input data according to INCOMEGDPGROUP.

Finally, we will create two different subsets of data, containing the countries that belong to INCOMEGDPGROUP 1 and 2 respectively:

- One data frame containing data only for Income1 countries:

data_income1 = data[(data['INCOMEGDPGROUP'] == "Income1")]
# make a copy of my new subsetted data
data_income1 = data_income1.copy()

We explore some stats for these subsets and the variables we have been exploring (EMPLOYRATE; URBANRATE; INTERNETUSERATE) using describe() function:

- Another one containing data only for Income2 countries, and stats below:

data_income2 = data[(data['INCOMEGDPGROUP'] == "Income2")]
# make a copy of my new subsetted data
data_income2 = data_income2.copy()

Indeed, we can see that the averages for INCOMEPERPERSON in both groups differ a lot: while the mean for this variable is 38 for Income1 group, it is 143 for Income2 group. There are other interesting differences for other variables too:

- The INTERNETUSERATE mean is much higher in the second segment (more developed countries) than in the first segment. No surprises here.

- The EMPLOYRATE mean is also much higher in the first segment than in the second, and the same happens with URBANRATE. If we look at the minimum value for EMPLOYRATE, however, it is higher in Income1 countries than in Income2! This must be an outlier.

Python code:

#Create secondary variable URBANIZATIONGROUP:

data['URBANIZATIONGROUP']= pandas.cut(data_valid_urbanrate, [0, 29, 74, 100 ], labels=["UrbLevel1", "UrbLevel2", "UrbLevel3"])

#Value counts for URBANIZATIONGROUP:

data['URBANIZATIONGROUP'] = data['URBANIZATIONGROUP'].astype('category')
c_urbgroup = data.groupby('URBANIZATIONGROUP').size()
print(c_urbgroup)
p_urbgroup = data.groupby('URBANIZATIONGROUP').size() * 100 / len(data)
print(p_urbgroup)

#Now, to see if the categorization worked correctly, and determine which rates were included in each level of the new urbanizationgroup variable,
#we use the crosstab function:

print(pandas.crosstab(data['URBANIZATIONGROUP'], data['URBANRATE']))

#Plotting bars for the three categories of URBANIZATIONGROUP:

sns.countplot(x='URBANIZATIONGROUP', data=data)
plt.xlabel('Urbanization groups')
plt.title('urb countries in Gapminder dataset')

#explore variable INCOMEPERPERSON

# pandas.to_numeric replaces convert_objects(), which was removed from recent pandas releases
data["INCOMEPERPERSON"] = pandas.to_numeric(data["INCOMEPERPERSON"], errors='coerce')
# Make a crosstab
freq_table_incomeperperson = pandas.crosstab(index=data["INCOMEPERPERSON"], columns="count")
freq_table_incomeperperson

data_valid_incomeperperson = data['INCOMEPERPERSON'][~np.isnan(data['INCOMEPERPERSON'])]

data_valid_incomeperperson.describe()

#create derived variable INCOMEGDPGROUP:

data['INCOMEGDPGROUP']= pandas.cut(data['INCOMEPERPERSON'], [0, 12475, 150000], labels=["Income1", "Income2"])

c6 = data['INCOMEGDPGROUP'].value_counts(sort=False, dropna=True)
print(c6)

p6 = data['INCOMEGDPGROUP'].value_counts(sort=False, dropna=True, normalize=True)
print(p6)

print(pandas.crosstab(data['INCOMEPERPERSON'], data['INCOMEGDPGROUP']))

data['INCOMEGDPGROUP'] = data['INCOMEGDPGROUP'].astype('category')
c_gdpgroup = data.groupby('INCOMEGDPGROUP').size()
print(c_gdpgroup)

p_gdpgroup = data.groupby('INCOMEGDPGROUP').size() * 100 / len(data)
print(p_gdpgroup)

#subset data per income levels:

data_income1 = data[(data['INCOMEGDPGROUP'] == "Income1")]
# make a copy of my new subsetted data
data_income1 = data_income1.copy()

data_income2 = data[(data['INCOMEGDPGROUP'] == "Income2")]
# make a copy of my new subsetted data
data_income2 = data_income2.copy()

Data analysis and interpretation-Week 2 - Writing your first program

I have chosen Python to start with my analysis.

Initially, I seemed to run into this known issue with Spyder: https://github.com/spyder-ide/spyder/issues/2984. As I had problems trying to render any image, I decided to use IPython notebooks instead. IPython notebooks are very convenient, because they allow me to add text between code snippets and figures, so that I can go back to them more easily in the future.

The first thing I noticed when loading the dataset in Python is that the data file seems to contain some malformed lines, which I need to skip by passing “error_bad_lines=False”:

data = pandas.read_csv('gapminder.csv', low_memory=False, error_bad_lines=False )

I first tried forcing the delimiter to be a comma, with no luck. As this was a non-starter, I decided to ignore these errors with the “error_bad_lines=False” parameter of the read_csv function.

This produces the following warning for 9 countries, a consideration I will need to remember:

'Skipping line 43: expected 16 fields, saw 17\n Skipping line 44: expected 16 fields, saw 17\n Skipping line 85: expected 16 fields, saw 17\n Skipping line 101: expected 16 fields, saw 17\n Skipping line 102: expected 16 fields, saw 17\n Skipping line 114: expected 16 fields, saw 17\n Skipping line 115: expected 16 fields, saw 17\n Skipping line 127: expected 16 fields, saw 17\n Skipping line 212: expected 16 fields, saw 17\n'

These lines correspond to the following countries, which are missing from the analysis:

Congo Rep., Costa Rica, Iceland, Kuwait, Kyrgyzstan, Madagascar, Malawi, Monaco and Zimbabwe.
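Before skipping lines blindly, it can help to see which rows have the wrong number of fields. The sketch below does that with the standard csv module on a tiny made-up file. The stray extra comma in the sample is only an assumption about what a malformed row might look like, not a diagnosis of the real gapminder.csv, and `find_bad_lines` is a hypothetical helper.

```python
import csv
import io

def find_bad_lines(text, delimiter=","):
    """Return the expected field count and (line_number, n_fields) for rows that differ."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    bad = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the 1-based number of the line just read
            bad.append((reader.line_num, len(row)))
    return expected, bad

# Tiny made-up file: the second data row contains a stray extra comma
sample = (
    "country,employrate,urbanrate\n"
    "Spain,47.3,77.1\n"
    "Congo, Rep.,64.3,61.3\n"
    "Chile,51.0,88.4\n"
)

expected, bad = find_bad_lines(sample)
print(expected, bad)
```

Run against the real file, a list like this would show whether the 9 skipped rows share a common cause that could be repaired instead of dropped.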

After printing some general data about our dataset, these are the main stats:

- 202 observations (countries)

- 16 columns, whose headers are:

Index(['country', 'incomeperperson', 'alcconsumption', 'armedforcesrate', 'breastcancerper100th', 'co2emissions', 'femaleemployrate', 'hivrate', 'internetuserate', 'lifeexpectancy', 'oilperperson', 'polityscore', 'relectricperperson', 'suicideper100th', 'employrate', 'urbanrate'], dtype='object') 202

STARTING THE UNIVARIATE ANALYSIS

Next, we start with the UNIVARIATE analysis, that is, examining each variable separately.

First, we take a look at the distribution of the variables EMPLOYRATE, URBANRATE and INTERNETUSERATE, that is, which values they take and how often they take them:

These frequency tables describe the three observed variables: they are continuous variables that take values on a continuum from 0 to 100. In some cases a value is repeated across countries, but most values appear only once. Each variable also has some NaN (missing) values.

In the case of EMPLOYRATE, there are 33 missing values. In the case of URBANRATE there are 9 missing values. In the case of INTERNETUSERATE there are 19 missing values. We will need to account for this during our analysis. If you pass a variable with many unique values to value_counts(), such as a numeric variable, it will still produce a table of counts for each unique value, but the counts may not be particularly meaningful:

1) Variable EMPLOYRATE:

c1 = data["EMPLOYRATE"].value_counts(sort=False)
print(c1)

employment rate nan 33 50.500000 1 61.500000 3 64.500000 1 63.500000 1 56.500000 1 53.500000 3 81.500000 1 60.500000 1 42.500000 1 54.500000 2 49.500000 1 71.300003 1 52.500000 1 58.500000 1 57.500000 2 73.199997 1 63.799999 2 60.700001 1 51.400002 1 41.099998 1 75.199997 1 56.400002 1 70.400002 2 51.299999 1 50.700001 1 78.199997 2 46.200001 1 42.000000 1 71.800003 1 43.099998 1 46.000000 2 32.000000 1 51.000000 2 65.699997 1 56.900002 1 56.000000 2 46.799999 1 59.000000 1 61.000000 2 62.299999 1 65.000000 3 47.099998 1 64.300003 1 56.299999 2 71.000000 1 72.000000 1 66.900002 1 40.099998 1 76.000000 1 77.000000 1 68.300003 1 78.900002 1 53.099998 1 83.000000 1 73.099998 1 55.900002 3 54.599998 1 66.000000 2 48.700001 2 56.799999 2 68.099998 1 38.900002 1 74.699997 1 47.799999 1 68.000000 1 62.700001 1 46.900002 1 44.700001 1 42.799999 1 59.799999 1 41.200001 1 57.599998 1 46.400002 1 55.700001 1 71.599998 1 45.700001 1 49.599998 1 58.400002 2 66.800003 1 80.699997 1 60.900002 1 63.900002 1 58.200001 2 55.599998 1 65.599998 1 57.900002 1 62.400002 1 59.099998 2 57.200001 1 53.400002 2 50.900002 2 57.099998 1 59.299999 1 61.799999 1 72.800003 1 66.599998 1 63.200001 1 51.200001 2 55.099998 1 42.400002 2 57.299999 1 63.099998 1 55.400002 1 61.700001 1 44.200001 1 73.599998 1 83.199997 2 64.900002 1 81.300003 1 52.099998 1 47.299999 3 59.700001 1 59.900002 3 54.400002 1 58.799999 2 75.699997 1 65.900002 1 44.799999 1 58.900002 2 67.300003 1 71.699997 1 52.700001 1 63.700001 1 65.099998 1 44.299999 1 48.599998 2 60.400002 2 79.800003 1 68.900002 1 41.599998 1 61.299999 1 37.400002 1 dtype: int64

p1 = data["EMPLOYRATE"].value_counts(sort=False, normalize=True)
print(p1)

percentage employment rate nan 0.163366 50.500000 0.004950 61.500000 0.014851 64.500000 0.004950 63.500000 0.004950 56.500000 0.004950 53.500000 0.014851 81.500000 0.004950 60.500000 0.004950 42.500000 0.004950 54.500000 0.009901 49.500000 0.004950 71.300003 0.004950 52.500000 0.004950 58.500000 0.004950 57.500000 0.009901 73.199997 0.004950 63.799999 0.009901 60.700001 0.004950 51.400002 0.004950 41.099998 0.004950 75.199997 0.004950 56.400002 0.004950 70.400002 0.009901 51.299999 0.004950 50.700001 0.004950 78.199997 0.009901 46.200001 0.004950 42.000000 0.004950 71.800003 0.004950 43.099998 0.004950 46.000000 0.009901 32.000000 0.004950 51.000000 0.009901 65.699997 0.004950 56.900002 0.004950 56.000000 0.009901 46.799999 0.004950 59.000000 0.004950 61.000000 0.009901 62.299999 0.004950 65.000000 0.014851 47.099998 0.004950 64.300003 0.004950 56.299999 0.009901 71.000000 0.004950 72.000000 0.004950 66.900002 0.004950 40.099998 0.004950 76.000000 0.004950 77.000000 0.004950 68.300003 0.004950 78.900002 0.004950 53.099998 0.004950 83.000000 0.004950 73.099998 0.004950 55.900002 0.014851 54.599998 0.004950

...

2) Variable URBANRATE

urbanization rate

urbanization rate nan 9 74.500000 1 73.500000 1 67.500000 1 26.460000 1 66.500000 1 87.300000 1 52.040000 1 71.100000 1 85.580000 1 73.480000 1 41.000000 1 17.000000 1 60.180000 1 92.680000 1 51.640000 1 32.580000 1 23.000000 1 52.740000 1 61.000000 1 86.960000 1 12.980000 1 73.920000 1 29.540000 1 82.440000 1 33.320000 1 15.100000 1 64.920000 1 57.180000 1 39.380000 1 14.320000 1 93.160000 1 42.000000 1 73.200000 1 77.540000 1 77.480000 1 48.360000 1 18.800000 1 54.340000 1 37.760000 1 69.460000 1 29.520000 1 70.360000 1 81.820000 1 43.840000 1 48.780000 1 56.420000 1 59.620000 1 95.640000 1 69.900000 1 88.920000 1 92.000000 1 77.360000 1 37.860000 1 68.460000 1 35.420000 1 12.540000 1 85.040000 1 24.760000 1 47.440000 1 38.580000 1 24.940000 1 21.600000 1 39.840000 1 71.620000 1 93.320000 1 29.840000 1 34.440000 1 47.880000 1 46.720000 1 50.020000 1 56.700000 1 34.480000 1 27.300000 1 60.560000 1 26.680000 1 27.840000 2 30.880000 1 16.540000 1 77.120000 1 51.460000 1 25.460000 1 77.200000 1 74.920000 1 19.560000 1 53.300000 1 10.400000 1 36.820000 1 24.780000 1 68.080000 1 36.280000 1 73.460000 1 77.880000 1 46.840000 1 73.640000 1 65.220000 1 84.540000 1 41.760000 1 74.820000 1 72.840000 1 42.380000 1 18.340000 1 86.560000 1 69.020000 1 36.160000 1 100.000000 4 80.460000 1 36.520000 1 92.300000 1 61.340000 1 88.520000 1 42.720000 1 51.920000 1 25.520000 1 65.580000 2 81.700000 1 21.560000 1 88.740000 1 48.600000 1 98.320000 1 71.900000 1 17.240000 1 13.220000 1 68.680000 1 32.180000 1 91.660000 1 32.320000 1 83.700000 1 27.140000 1 47.040000 1 36.840000 2 52.360000 1 59.460000 1 28.380000 1 71.400000 1 56.740000 1 94.260000 1 97.360000 1 42.480000 1 75.660000 1 24.040000 1 41.200000 1 46.780000 1 37.340000 1 30.460000 1 51.700000 1 83.520000 1 48.580000 1 61.320000 1 60.740000 1 63.860000 1 30.840000 1 68.120000 1 20.720000 1 88.440000 1 48.620000 1 86.680000 1 64.780000 1 82.420000 1 80.400000 1 28.080000 1 67.980000 1 57.940000 1 59.580000 1 60.300000 1 
60.700000 1 17.960000 1 56.560000 1 56.020000 1 66.480000 1 43.440000 1 67.160000 1 66.600000 1 57.280000 1 66.960000 1 54.220000 1 43.100000 1 92.260000 1 71.080000 1 98.360000 1 63.300000 1 60.140000 1 94.220000 1 41.420000 1 89.940000 1 78.420000 1 54.240000 1 56.760000 1 dtype: int64

percentage urbanization rate nan 0.044554 74.500000 0.004950 73.500000 0.004950 67.500000 0.004950 26.460000 0.004950 66.500000 0.004950 87.300000 0.004950 52.040000 0.004950 71.100000 0.004950 85.580000 0.004950 73.480000 0.004950

3) Variable INTERNETUSERATE

internet use rate nan 19 1.400061 1 39.820178 1 74.163040 1 2.199998 1 56.300034 1 9.196775 1 8.370207 1 90.016190 1 69.339971 1 2.699966 1 77.638535 1 13.000111 1 43.055067 1 51.280478 1 9.549931 1 40.772851 1 62.471230 1 2.450362 1

...

percentage internet use rate nan 0.094059 1.400061 0.004950 39.820178 0.004950 74.163040 0.004950 2.199998 0.004950 56.300034 0.004950 9.196775 0.004950 8.370207 0.004950 90.016190 0.004950 69.339971 0.004950 2.699966 0.004950 77.638535 0.004950 13.000111 0.004950 43.055067 0.004950 51.280478 0.004950 9.549931 0.004950 40.772851 0.004950 62.471230 0.004950
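Because almost every value in these tables appears only once, a binned frequency table is usually easier to read. Here is a minimal sketch on made-up rates; the 20-point bin edges are an arbitrary choice for illustration and are not used elsewhere in this post.

```python
import pandas as pd

# Made-up rates between 0 and 100 (not Gapminder data)
rates = pd.Series([1.4, 2.2, 8.4, 9.2, 13.0, 39.8, 40.8, 43.1,
                   51.3, 56.3, 62.5, 69.3, 74.2, 77.6, 90.0])

# One row per distinct value: as long as the data itself
per_value = rates.value_counts(sort=False)

# One row per 20-point interval, listed in interval order
per_bin = pd.cut(rates, bins=[0, 20, 40, 60, 80, 100]).value_counts().sort_index()

print(len(per_value))
print(per_bin)
```

The binned table has five rows regardless of how many countries are in the data, which makes the shape of the distribution visible at a glance.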

Main stats for these three variables:

data_valid_employrate.describe()

count    169.000000
mean      58.746746
std       10.490075
min       32.000000
25%       51.200001
50%       58.400002
75%       65.000000
max       83.199997

data_valid_urbanrate.describe()

count    193.000000
mean      56.483938
std       23.707742
min       10.400000
25%       36.840000
50%       57.180000
75%       73.920000
max      100.000000

data_valid_internetuserate.describe()

count    183.000000
mean      35.540207
std       27.758392
min        0.210066
25%        9.999254
50%       31.568098
75%       55.646421
max       95.638113

Derived variables (this is an advance of week 3):

We can see that there are no values we want to restrict in our set that may not have been restricted yet (the missing countries or with NA values), as in the class videos example.

We could only try to create a secondary variable that tries to subdivide the urban rate into different tiers or levels of urbanization.

Now, we will proceed to create a secondary derived variable called "URBANIZATIONGROUP" for categorizing the level of urbanization of a country. We will assign 3 groups, based on custom thresholds of the variable URBANRATE (the bin edges 0, 29, 74 and 100 used in the code below):

UrbLevel1: Countries having 29 percent or less of their population living in urban areas.

UrbLevel2: Countries having more than 29 and up to 74 percent of their population living in urban areas.

UrbLevel3: Countries having more than 74 percent of their population living in urban areas.

These are the counts for each of these three levels in our dataset:

UrbLevel1 31
UrbLevel2 114
UrbLevel3 48

or, in percentages:

UrbLevel1 0.160622
UrbLevel2 0.590674
UrbLevel3 0.248705

To see if these levels have been assigned correctly, we execute the crosstab function with these two variables:

print(pandas.crosstab(x2_valid_data_urban['URBANIZATIONGROUP'], x2_valid_data_urban['URBANRATE']))

THIS IS THE PYTHON CODE (I am using Python 2.7):

#convert variables to numeric

# pandas.to_numeric replaces convert_objects(), which was removed from recent pandas releases
data["EMPLOYRATE"] = pandas.to_numeric(data["EMPLOYRATE"], errors='coerce')
data["URBANRATE"] = pandas.to_numeric(data["URBANRATE"], errors='coerce')
data["INTERNETUSERATE"] = pandas.to_numeric(data["INTERNETUSERATE"], errors='coerce')

#get frequency counts (tables):

print("employment rate")
c1 = data["EMPLOYRATE"].value_counts(sort=False, dropna=False)
print(c1)

#in order to ask for percentages of each value based on those counts:
print("percentage employment rate")
p1 = data["EMPLOYRATE"].value_counts(sort=False, normalize=True, dropna=False)
print(p1)

print("urbanization rate")
c2 = data["URBANRATE"].value_counts(sort=False, dropna=False)
print(c2)
#in order to ask for percentages of each value based on those counts:
print("percentage urbanization rate")
p2 = data["URBANRATE"].value_counts(sort=False, normalize=True, dropna=False)
print(p2)

print("internet use rate")
c3 = data["INTERNETUSERATE"].value_counts(sort=False, dropna=False)
print(c3)
#in order to ask for percentages of each value based on those counts:
print("percentage internet use rate")
p3 = data["INTERNETUSERATE"].value_counts(sort=False, normalize=True, dropna=False)
print(p3)

#get frequency tables using crosstab functions:

freq_table_employrate = pandas.crosstab(index=data["EMPLOYRATE"], columns="count")
freq_table_employrate

freq_table_urbanrate = pandas.crosstab(index=data["URBANRATE"], columns="count")
freq_table_urbanrate

freq_table_internetuserate = pandas.crosstab(index=data["INTERNETUSERATE"], columns="count")
freq_table_internetuserate

freq_table_employrate/freq_table_employrate.sum()

freq_table_urbanrate/freq_table_urbanrate.sum()

freq_table_internetuserate/freq_table_internetuserate.sum()

#Next, we create new variable URBANIZATIONGROUP to categorize countries based on their percentage of urban population
print('URBANIZATIONGROUP - 3 categories - custom groups based on thresholds')

x_valid_data_urban.loc[:,('URBANIZATIONGROUP')]= pandas.cut(x_valid_data_urban['URBANRATE'], [0, 29, 74, 100 ], labels=["UrbLevel1", "UrbLevel2", "UrbLevel3"])

x2_valid_data_urban=x_valid_data_urban.copy()

c5 = x2_valid_data_urban['URBANIZATIONGROUP'].value_counts(sort=False, dropna=True) print(c5)

p5 = x2_valid_data_urban['URBANIZATIONGROUP'].value_counts(sort=False, dropna=True, normalize=True) print(p5)

x2_valid_data_urban.describe()
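To make the grouping rule explicit, here is a minimal standard-library sketch of how a single URBANRATE value ends up in a group. It assumes the default behaviour of pandas.cut, where bins are closed on the right, so the thresholds above give (0, 29], (29, 74] and (74, 100]. The helper name urbanization_group is my own and is not part of pandas or of the original code.

```python
from bisect import bisect_left

# Thresholds and labels copied from the pandas.cut call above.
EDGES = [0, 29, 74, 100]
LABELS = ["UrbLevel1", "UrbLevel2", "UrbLevel3"]


def urbanization_group(rate):
    """Return the group label for one URBANRATE value, or None if no bin applies.

    Hypothetical helper that imitates right-closed bins:
    (0, 29], (29, 74], (74, 100].
    """
    if rate is None or rate != rate:  # None or NaN: no group
        return None
    idx = bisect_left(EDGES, rate)  # index of the first edge >= rate
    if idx == 0 or idx == len(EDGES):  # at/below the lowest edge, or above the top edge
        return None
    return LABELS[idx - 1]
```

Note that with these bins a value of exactly 0 falls outside the first interval and stays ungrouped; pandas.cut accepts include_lowest=True if that edge should be included.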

#data_analysis #exploratory data analysis #frequency distribution #stats

Data analysis and interpretation- Coursera - Week 1 assignment: choosing the research question(s)

Title: Explore the relationship between the urbanization rate and the employment rate across the countries in the Gapminder dataset. Measure the influence of other variables, such as income per capita and internet usage rate.

1. Finding the Research question to investigate.

After carefully reviewing the different datasets provided in the Coursera outline, I decided on the Gapminder dataset.

Gapminder defines itself as a “fact-based world view”.

I wanted to explore the relationship between employment rates and levels of urbanization across different countries around the world.

As per UNESCO’s report [1], by 2030 it is estimated that all developing regions will have more people living in urban areas than rural areas, with virtually all the world’s population growth concentrated in urban areas over the next 30 years.

Although cities, with their better urban development and education, are the hub for a large share of qualified jobs, it is also in cities that important problems around poverty divides and unemployment arise. Most urban youth, particularly young migrants, live in unplanned settlement areas, often in squalid conditions, and are vulnerable to high levels of unemployment. In [4] it is argued that rural-urban migration was counter-productive because migrants moved for the wrong reasons. Does the demand for jobs exceed the supply of jobs in many industrialised countries with high or rapid urbanization?

Second research question:

As a secondary question in my research, I consider the fact that lack of education is an important barrier to youth employment. UNESCO’s report [1] mentions qualifications and education as key factors in fostering new entrepreneurs. The internet plays an important role here, so my question is whether internet network deployment in highly urbanized countries can play a role in creating new jobs, that is, whether internet usage can help improve the employment rate. On the other hand, some voices argue that technology will lead to de-urbanization [3].

2. Choosing the dataset.

Gapminder dataset.

3. Variables to use and codebook:

Sample: 289 countries in the world, evaluated according to different criteria, were drawn from the Gapminder dataset

- employrate: 2007 total employees age 15+ (% of population). Percentage of total population, age above 15, that has been employed during the given year. Source: International Labour Organization

- internetuserate: 2010 Internet users (per 100 people). Internet users are people with access to the worldwide network. Source: World Bank

4. Literature Review

UNESCO’s study [1] provides an overview of patterns of urbanization across developing countries in relation to urban economic growth and its implications for urban poverty, particularly among urban youth. The study highlights several case studies that illustrate many of the dynamics in contexts of rapid urbanisation (e.g. Brazil, Ghana), boom economies (e.g. Vietnam) and high urban youth unemployment (e.g. Egypt).

In [2] it is argued that if urbanization is significant for both internet usage and dot-com concentration, this may suggest that the Internet may not be diminishing the advantages for cities.

Bibliography and references:

[1] Urbanization and the Employment Opportunities of Youth in Developing Countries - Ursula Grant, 2012 http://unesdoc.unesco.org/images/0021/002178/217879E.pdf

[2] Regional Development and Conditions for Innovation in the Network Society - M. S. van Geenhuizen, David V. Gibson, Manuel V. Heitor, 2005

[3] http://bigthink.com/experts-corner/technology-will-lead-to-de-urbanization-2

[4] “Urbanization and growth: setting the context” Patricia Clarke Annez and Robert M. Buckley. World Bank publication http://siteresources.worldbank.org/EXTPREMNET/Resources/489960-1338997241035/Growth_Commission_Vol1_Urbanization_Growth_Ch1_Urbanization_Growth_Setting_Context.pdf

5. Hypothesis to test

The urbanization rate is positively correlated with the employment rate across different countries. High internet usage in urban areas can foster employment opportunities.
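As a rough illustration of what "positively correlated" means in practice, here is a minimal standard-library sketch of the Pearson correlation coefficient. The function name pearson_r and the toy numbers are my own and are not taken from the Gapminder data. A positive r would be consistent with the first part of the hypothesis, although a proper test also needs a p-value (for example from scipy.stats.pearsonr).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / float(len(xs))
    mean_y = sum(ys) / float(len(ys))
    # Sum of cross-products and sums of squares of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up toy values (NOT Gapminder data), only to exercise the function.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [2.0, 4.0, 6.0, 8.0]
toy_r = pearson_r(toy_x, toy_y)
```

With the real data, xs and ys would be the URBANRATE and EMPLOYRATE columns after dropping rows with missing values in either variable.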
