Data Analysis @rsaf-blog - Tumblr Blog

Regression model - Week 4

Data preparation :

Our research question : Are "racial background" and "feeling about weight" associated with whether parents let teens make their own decisions about what they wear?

Binary response variable : H1WP3 > WEAR_DECISION : Do your parents let you make your own decisions about what you wear? (0: No, 1: Yes, 6,7,8,9: NA)

Explanatory variables :

primary explanatory variable > RACIAL_BACKGROUND (1:White 2:Black or African American 3:American Indian or Native American 4:Asian or Pacific Islander 5:Other )

secondary explanatory variable > WEIGHT_FEELING : How do you think of yourself in terms of weight? (1: very underweight, 2: slightly underweight, 3: about the right weight, 4: slightly overweight, 5: very overweight, 6,8 : NA)

As a check, here is a description of the variables :

1## Results

a)CATEGORICAL EXPLANATORY VARIABLES WITH 3+ CATEGORIES

Let's treat RACIAL_BACK as quantitative (it has 5 levels).

> We found a non-significant association here, with a p-value = 0.732 > 0.005

If we add the secondary variable WEIGHT_FEELING, which is categorical with more than 2 levels :

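This OLS regression also gives non-significant results (values -1.05, -0.86, -0.88, -0.55, 0.27 ; several are negative, so these are presumably the test statistics rather than the p-values themselves, since a p-value always lies between 0 and 1).

If we set a reference category for WEIGHT_FEELING, we can already guess the result (non-significant p-values), but for practice, here is the OLS regression :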

b) Logistic regression

b1) logistic regression with WEIGHT_FEELING

# Odds ratios :

# odds ratios with 95% confidence intervals

The odds ratios are very close to 1, which means there is no evidence that the odds of WEAR_DECISION depend on WEIGHT_FEELING.
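As a reminder of where such numbers come from, here is a minimal sketch of how an odds ratio and its 95% confidence interval are derived from a logistic regression coefficient. The coefficient and standard error below are made-up values for illustration, not the output of my model, and `odds_ratio_ci` is just a helper name I chose.

```python
import math

def odds_ratio_ci(beta, std_err, z=1.96):
    """Return (odds ratio, lower bound, upper bound) for a logit coefficient."""
    return (math.exp(beta),
            math.exp(beta - z * std_err),
            math.exp(beta + z * std_err))

# Hypothetical coefficient and standard error (NOT taken from the model above)
beta, std_err = 0.03, 0.05
odds, lower, upper = odds_ratio_ci(beta, std_err)
print("OR = %.3f, 95%% CI = [%.3f, %.3f]" % (odds, lower, upper))

# An interval that contains 1 means the association is not significant
print("CI contains 1 :", lower < 1 < upper)
```

An odds ratio close to 1 with an interval that straddles 1 is what "no dependency" looks like in this kind of output.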

b2) logistic regression with WEIGHT_FEELING and RACIAL_BACK

# odds ratios with 95% confidence intervals

The odds ratios are very close to 1, which means there is no evidence that the odds of WEAR_DECISION depend on WEIGHT_FEELING and RACIAL_BACK.

2## Hypothesis vs results

We have non-significant p-values for each explanatory variable (RACIAL_BACK and WEIGHT_FEELING). So we can say that the Ho hypothesis cannot be rejected. It means we found no evidence that our two explanatory variables are associated with the response variable.

3## Evidence of confounding variables

Unfortunately, we didn't find any association between our explanatory variables and the response variable. So there is no association for a third variable to confound, and we have no evidence of confounding in our case.

Multiple regression - part 2

Assignment 4 : EVALUATING MODEL FIT

a) QQPlot

b1) simple plot of residuals

b2) additional regression diagnostic plots

c) leverage plot

Multiple regression - part 1

Data preparation : Adhealth dataset

For this assessment, let's say our research question is as follows : Are being taught in school about "where to go for help with a health problem?" and racial background associated with the way teens define their "general health"?

Explanatory variables :

were teens taught in school about "where to go for help with a health problem?" (0: NO, 1: YES)

racial background (1:White, 2:Black or African American, 3:American Indian or Native American, 4:Asian or Pacific Islander, 5:Other )

Response variable : “general health” (1:excellent | 2:very good | 3:good | 4:fair | 5:poor)

We have a simple description here :

MULTIPLE REGRESSION & CONFIDENCE INTERVALS

Fig1 - first order (linear) scatterplot

Fig 2 : fit second order polynomial (Quadratic)

Center quantitative EVs for regression analysis

We don’t need to do this because we have categorical explanatory variables.

linear regression analysis

quadratic (polynomial) regression analysis

adding TAUGHT_IN_SCHOOL to explanatory variable

Assignment 1 : what I found in my multiple regression analysis ?

The last OLS regression result shows us :

Ev = RACIAL_BACK

c = number of pairs among 5 groups = 5! / (2! (5 - 2)!) = 5x4/2 = 10

p/c = 0.05/10 = 0.005
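The same arithmetic can be checked in a few lines of Python. This is only a sketch of the Bonferroni adjustment as I use it in this post ; the two p-values are the ones quoted for the linear and quadratic RACIAL_BACK terms.

```python
import math

alpha = 0.05
n_groups = 5                      # RACIAL_BACK has 5 levels

# number of distinct pairs of groups : 5! / (2! * 3!)
n_comparisons = math.comb(n_groups, 2)
adjusted_alpha = alpha / n_comparisons

print(n_comparisons, adjusted_alpha)

# p-values quoted in this post, compared with the adjusted threshold
for label, p in [("linear", 0.023), ("quadratic", 0.018)]:
    print(label, "significant" if p < adjusted_alpha else "not significant")
```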

a) the condition is not met to say that RACIAL_BACK is associated with GENERAL_HEALTH. In fact, we have a non-significant p-value = 0.023 > "p/c" = 0.005

b) even for the quadratic (order = 2) analysis, we have p-value = 0.018 > "p/c" = 0.005 (the same conclusion)

Ev = TAUGHT_IN_SCHOOL

With 2 levels there is only one comparison, so the threshold stays at 0.05. We have a p-value = 0.476 > 0.05, meaning a non-significant p-value (the same conclusion as above)

Beta coefficients

The regression expression is : GENERAL_HEALTH = 2.643 - 0.447(RACIAL_BACK) + 0.082(RACIAL_BACK**2) + 0.098(TAUGHT_IN_SCHOOL)
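To get a feeling for what this expression predicts, here is a small sketch that simply evaluates it for every combination of the two explanatory variables. The coefficients are copied from the expression above ; the function name is mine.

```python
def predicted_general_health(racial_back, taught_in_school):
    """Evaluate the fitted expression quoted above (coefficients copied from the post)."""
    return (2.643
            - 0.447 * racial_back
            + 0.082 * racial_back ** 2
            + 0.098 * taught_in_school)

# RACIAL_BACK takes the values 1..5, TAUGHT_IN_SCHOOL the values 0 and 1
for racial_back in range(1, 6):
    for taught in (0, 1):
        print(racial_back, taught,
              round(predicted_general_health(racial_back, taught), 3))
```

Whatever the racial background, moving TAUGHT_IN_SCHOOL from 0 to 1 shifts the prediction by the same 0.098.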

Assignment 2 : Results vs hypothesis

Those results show that we have non-significant p-values for each explanatory variable (RACIAL_BACK and TAUGHT_IN_SCHOOL). So we can say that the Ho hypothesis cannot be rejected. It means we found no evidence that our two explanatory variables are associated with the response variable.

Assignment 3 : Evidence of confounding

Unfortunately, we didn't find any association between our explanatory variables and the response variable. So there is no association for a third variable to confound, and we have no evidence of confounding in our case.

Code & output : Testing simple linear regression

Code

import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

data = pandas.read_csv('../../../datasets & codebooks/addhealth_pds.csv', low_memory=False)
print(data.describe())

# Response variable : C
# H1GH1 : In general, how is your health? Would you say… (excellent | very good | good | fair | poor)
data['H1GH1'] = pandas.to_numeric(data['H1GH1'], errors='coerce')

# TAUGHT_IN_SCHOOL and GENERAL_HEALTH are the recoded H1TS15 and H1GH1 (the recoding step is not shown here)
sub1 = data[['TAUGHT_IN_SCHOOL', 'GENERAL_HEALTH']].dropna()
print(sub1.describe())

print('Frequency table for our Explanatory variable')
print('count :')
c1 = sub1['TAUGHT_IN_SCHOOL'].value_counts(sort=False, dropna=True)
print(c1)

print('percentages :')
p1 = sub1['TAUGHT_IN_SCHOOL'].value_counts(sort=False, normalize=True)
print(p1)

############################################################################################
# BASIC LINEAR REGRESSION
############################################################################################
scat1 = seaborn.regplot(x="TAUGHT_IN_SCHOOL", y="GENERAL_HEALTH", scatter=True, data=sub1)
plt.xlabel('Was Taught in School')
plt.ylabel('General health')
plt.title('Scatterplot for the Association Between TAUGHT_IN_SCHOOL and GENERAL_HEALTH')
print(scat1)

print('Testing Linear regression model for C -> Q :')
# formula : RESP_VAR ~ EXP_VAR
reg1 = smf.ols('GENERAL_HEALTH ~ TAUGHT_IN_SCHOOL', data=sub1).fit()
print(reg1.summary())

Outputs

Testing simple linear regression

Data preparation (Codebook : Adhealth)

For this assessment, let's say our research question is as follows : Is being taught in school about "where to go for help with a health problem?" associated with the way teens define their "general health"?

Explanatory variable : were teens taught in school about "where to go for help with a health problem?" (0: NO, 1: YES)

Response variable : “general health” (1:excellent | 2:very good | 3:good | 4:fair | 5:poor)

This is a quick description of our variables:

Frequency table for the Explanatory Variable

We got a simple frequency table as below

Testing linear regression model

As we have a 'Categorical EXP_VAR -> Quantitative RESP_VAR' model here, we'll use the formula 'RESP_VAR ~ EXP_VAR' in the code. The output is as follows :

A little checking with a scatter plot gives this graph :

Summary

The regression output shows an intercept equal to 2.195 and a slope of -0.1141, with a p-value = 0.000109 < 0.001 (significant). As a regression equation : GENERAL_HEALTH = 2.195 - 0.1141 x TAUGHT_IN_SCHOOL

With a binary variable TAUGHT_IN_SCHOOL (0 or 1), the regression expression above shows that the association is weak in practice, even though the p-value is small (the scatter plot gives an idea). In fact, -0.1141 is simply the difference in mean GENERAL_HEALTH between the two groups, which is very small on a 1 to 5 scale.
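To see what the coefficient means when the explanatory variable only takes the values 0 and 1, here is a small sketch on made-up scores (not the AddHealth data). It computes the least-squares line by hand and compares it with the group means : the intercept is the mean of group 0 and the slope is the difference between the two group means.

```python
from statistics import mean

# Made-up scores, for illustration only (NOT the AddHealth data)
x = [0, 0, 0, 0, 1, 1, 1, 1]          # TAUGHT_IN_SCHOOL (0: no, 1: yes)
y = [2, 3, 2, 3, 2, 2, 3, 1]          # GENERAL_HEALTH (1: excellent ... 5: poor)

# least-squares slope and intercept for y = intercept + slope * x
x_bar, y_bar = mean(x), mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

mean_no = mean(yi for xi, yi in zip(x, y) if xi == 0)
mean_yes = mean(yi for xi, yi in zip(x, y) if xi == 1)

print("intercept :", intercept, "| mean of group 0 :", mean_no)
print("slope     :", slope, "| difference of means :", mean_yes - mean_no)
```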

#linear regression

Write about data : Measures

Our research question was : Does "racial background" have an impact on the way teens define their general health if they are taught in school about "where to go for help with a health problem"?

Response variable which is Categorical : In general, how is your health? Would you say… (excellent | very good | good | fair | poor)

Explanatory variable which is categorical : racial background (1:White 2:Black or African American 3:American Indian or Native American 4:Asian or Pacific Islander 5:Other )

Our Moderator : taught in school about “where to go for help with a health problem" (yes | no )

Because we have a C -> C relation here, we used the "Chi Square test of independence". We ran two separate analyses :

without the moderator

with the moderator

Write about data : Procedure

Data were collected through a survey covering multiple factors. The objective is to know which factors influence adolescents' health and risk behaviors, including personal traits, families, friendships, romantic relationships, peer groups, schools, neighborhoods, and communities. (There are 5 waves for the whole project.) The current wave covers all data collected between 1994 and 1995. Systematic sampling methods and implicit stratification ensure that the 80 high schools selected are representative of US schools with respect to region of country, urbanicity, size, type, and ethnicity. Eligible high schools included an 11th grade and enrolled more than 30 students. More than 70 percent of the originally sampled high schools participated. Each school that declined to participate was replaced by a school within the stratum.

#procedure

Write about Data : Sample

The sample is from the 1st wave of The National Longitudinal Study of Adolescent Health (AddHealth), which is a representative school-based survey of adolescents in grades 7-12 in the United States [N=6504 Participants]. Our sample data is a subset with the following variables, because my research question was : Does "racial background" have an impact on the way teens define their "general health" if they are taught in school about "where to go for help with a health problem"?

A simple description of our sample is as follows :

#sample

Exploring statistical interactions

Intro

For this assessment, I chose this relationship : Does "racial background" [H1GI8] have an impact on the way teens define their general health [H1GH1] if they are taught in school about "where to go for help with a health problem" [H1TS15]?

(A copy of the python program is at the end)

So we have :

1) Response variable which is Categorical

H1GH1 > In general, how is your health? Would you say... (excellent | very good | good | fair | poor)

2) Explanatory variable which is categorical

H1GI8 > racial background (1:White 2:Black or African American 3:American Indian or Native American 4:Asian or Pacific Islander 5:Other )

Our Moderator is : H1TS15 >> taught in school about “where to go for help with a health problem" (yes | no )

Because we have C -> C relation here, we will use the "Chi Square test of independence”.
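For readers who want to see what the test computes, here is a minimal sketch of the chi-square statistic done by hand on a made-up table of counts (not the AddHealth cross tab). For a table like this one, scipy.stats.chi2_contingency, which my program uses, returns the same statistic together with the p-value.

```python
# Made-up 2 x 3 table of observed counts (rows : response, columns : explanatory)
observed = [[30, 20, 10],
            [20, 20, 20]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# expected count under independence : row total * column total / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print("chi-square = %.3f with %d degrees of freedom" % (chi2, dof))
```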

A] Test without moderator variable

We have here the cross tab result :

B] With moderator variable

B1] WAS NOT TAUGHT ...

The cross tab above shows a non-significant p-value = 0.04 > p/c = 0.005 (p/c : Bonferroni adjustment). This tells us we cannot conclude that there is a significant relationship between the two variables.

B2] WAS TAUGHT ...

That cross tab also shows that the two variables are not significantly associated (p-value = 0.85 > p/c = 0.005).

Conclusion

Even after separating the teenagers into two groups :
- those who WERE NOT taught "where to go for help in case of health problem"
- those who WERE taught "where to go for help in case of health problem"
we have no statistical evidence that our moderator variable has a significant impact on the relationship between our explanatory and response variables. In other words, there is no significant statistical interaction here.

Python Program

import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
import statsmodels.stats.proportion as sm

data = pandas.read_csv('../../datasets & codebooks/addhealth_pds.csv', low_memory=False)

# setting variables you will be working with to numeric
# Explanatory variables : C
# 2 levels
# H1TS15 : Taught in School : where to go for help with a health problem
data['H1TS15'] = pandas.to_numeric(data['H1TS15'], errors='coerce')
# more than 2 levels
# H1GI8 : Which one category best describes your racial background?
data['H1GI8'] = pandas.to_numeric(data['H1GI8'], errors='coerce')

# Response variable : C
# H1GH1 : In general, how is your health? Would you say… (excellent | very good | good | fair | poor)
data['H1GH1'] = pandas.to_numeric(data['H1GH1'], errors='coerce')

# SETTING MISSING DATA & RENAME VARIABLE
data['TAUGHT_IN_SCHOOL'] = data['H1TS15'].replace([6, 8], numpy.nan)
data['RACIAL_BACK'] = data['H1GI8'].replace([6, 7, 8, 9], numpy.nan)
data['GENERAL_HEALTH'] = data['H1GH1'].replace([6, 8], numpy.nan)

c1 = data['TAUGHT_IN_SCHOOL'].value_counts(sort=False, dropna=True)
print(c1)

c2 = data['RACIAL_BACK'].value_counts(sort=False, dropna=True)
print(c2)

# .copy() so that the columns modified later belong to sub1 itself
sub1 = data[['TAUGHT_IN_SCHOOL', 'RACIAL_BACK', 'GENERAL_HEALTH']].dropna().copy()

###### Without moderator
print('1) Without moderator')
# contingency table of observed counts
ct1 = pandas.crosstab(sub1['GENERAL_HEALTH'], sub1['RACIAL_BACK'])
print(ct1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)

# set variable types
sub1["RACIAL_BACK"] = sub1["RACIAL_BACK"].astype('category')
sub1['GENERAL_HEALTH'] = pandas.to_numeric(sub1['GENERAL_HEALTH'], errors='coerce')

# bivariate bar graph (catplot replaces the old factorplot ; errorbar=None needs seaborn >= 0.12, use ci=None on older versions)
seaborn.catplot(x="RACIAL_BACK", y="GENERAL_HEALTH", data=sub1, kind="bar", errorbar=None)
plt.xlabel('Racial background')
plt.ylabel('General health')

###### With moderator
print('2) With moderator')

### Subset 1 : was not taught in school about "where to go for help with a health problem"
sub_no = sub1[(sub1['TAUGHT_IN_SCHOOL'] == 0)]

print('association between Racial background and General health for those who WERE NOT taught in school about "where to go for help with a health problem"')
# contingency table of observed counts
ct2 = pandas.crosstab(sub_no['GENERAL_HEALTH'], sub_no['RACIAL_BACK'])
print(ct2)

# column percentages
colsum = ct2.sum(axis=0)
colpct = ct2 / colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs2 = scipy.stats.chi2_contingency(ct2)
print(cs2)

seaborn.catplot(x="RACIAL_BACK", y="GENERAL_HEALTH", data=sub_no, kind="point", errorbar=None)
plt.xlabel('Racial background (was not taught in school)')
plt.ylabel('General health')
plt.title('association between Racial background and General health for those who WERE NOT taught in school about "where to go for help with a health problem"')

### Subset 2 : was taught in school about "where to go for help with a health problem"
sub_yes = sub1[(sub1['TAUGHT_IN_SCHOOL'] == 1)]

print('association between Racial background and General health for those who WERE taught in school about "where to go for help with a health problem"')
# contingency table of observed counts (response in rows, explanatory in columns, as for ct2)
ct3 = pandas.crosstab(sub_yes['GENERAL_HEALTH'], sub_yes['RACIAL_BACK'])
print(ct3)

# column percentages
colsum = ct3.sum(axis=0)
colpct = ct3 / colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs3 = scipy.stats.chi2_contingency(ct3)
print(cs3)

seaborn.catplot(x="RACIAL_BACK", y="GENERAL_HEALTH", data=sub_yes, kind="point", errorbar=None)
plt.xlabel('Racial background (was taught in school)')
plt.ylabel('General health')
plt.title('association between Racial background and General health for those who WERE taught in school about "where to go for help with a health problem"')

Pearson Correlation in AddHealth

Intro : My new association

For this assessment, a Q -> Q relation is needed, so I picked a new association to examine : the association between "How many cigarettes did you smoke each day" [h1to7 = NB_CIG_DAY] and "the money earned in a non-summer & summer week" [h1ee5 + h1ee7 = EARNED_MONEY]

Quantitative response variable[EARNED_MONEY] : How much money do you earn in a typical non-summer/summer week from all your jobs combined ?

Quantitative explanatory variables[NB_CIG_DAY]>> During the past 30 days, on the days you smoked, how many cigarettes did you smoke each day?

Note : A copy of the python program is at the end of the blog.

Model of Pearson Correlation

After running the corresponding Scatterplot & Pearson Correlation for the explanatory variable NB_CIG_DAY and the response variable EARNED_MONEY, we have this following output :

We have here a Pearson correlation coefficient r = 0.15, which is positive but close to 0. That shows us the relation between those two variables is a weak positive relationship.

The square of r = 0.022, which tells us the explanatory variable predicts only about 2.2% of the variability in the response variable.
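As a reminder of how r and r squared are obtained, here is a minimal sketch that computes the Pearson coefficient by hand on made-up values (not the AddHealth data). My program uses scipy.stats.pearsonr, which also returns the p-value ; the helper `pearson_r` below is only for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    var_x = sum((a - x_bar) ** 2 for a in x)
    var_y = sum((b - y_bar) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up values, for illustration only (NOT the AddHealth data)
cigarettes = [1, 2, 3, 5, 10, 20]
earned = [40, 10, 60, 30, 80, 50]

r = pearson_r(cigarettes, earned)
print("r = %.3f, r squared = %.3f" % (r, r ** 2))

# With the coefficient reported in this post
print("0.15 squared =", round(0.15 ** 2, 4))
```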

Conclusion

The "# of smoked cigarettes per day" is only weakly related to the "earned money", and it predicts very little of the variability in "earned money".

Python Program

import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('../../datasets & codebooks/addhealth_pds.csv', low_memory=False)

# setting variables you will be working with to numeric
# Explanatory variable : Q
data['H1TO7'] = pandas.to_numeric(data['H1TO7'], errors='coerce')

# Response variable : Q
data['H1EE5'] = pandas.to_numeric(data['H1EE5'], errors='coerce')
data['H1EE7'] = pandas.to_numeric(data['H1EE7'], errors='coerce')

# SETTING MISSING DATA & RENAME VARIABLE
# 96, 97 and 98 are treated as missing-data codes here (8 is a valid count of cigarettes)
data['NB_CIG_DAY'] = data['H1TO7'].replace([96, 97, 98], numpy.nan)

data['H1EE5'] = data['H1EE5'].replace([996, 997, 998, 999], numpy.nan)
data['H1EE7'] = data['H1EE7'].replace([996, 997, 998, 999], numpy.nan)

# Define new quantitative variable : EARNED_MONEY (NaN if either part is missing)
data['EARNED_MONEY'] = data['H1EE5'] + data['H1EE7']

# Scatterplot
scat1 = seaborn.regplot(x="NB_CIG_DAY", y="EARNED_MONEY", fit_reg=True, data=data)
plt.xlabel('# smoked cigarettes per day')
plt.ylabel('Earned money')
plt.title('Scatterplot for the Association Between "smoked cigarettes per day" and "Earned money"')

data_clean = data[['NB_CIG_DAY', 'EARNED_MONEY']].dropna()

print('association between NB_CIG_DAY and EARNED_MONEY')
print(scipy.stats.pearsonr(data_clean['NB_CIG_DAY'], data_clean['EARNED_MONEY']))

#pearson correlation #predictive variability

Chi square tests - Addhealth dataset

Intro

To practice properly for this assessment, I chose new variables, so that the C->C relation is clearer (it is the condition for the Chi Square test of independence). Note : the python program is at the end of this blog.

My new association

For this assessment, I was interested in examining the association between "the general health of teens" [H1GH1 = GENERAL_HEALTH] and whether teens were taught at school about "where to go for help with a health problem?" [H1TS15 = TAUGHT_IN_SCHOOL], and "the racial background of teens" [H1GI8 = RACIAL_BACK].

Categorical response variable

[GENERAL_HEALTH] : In general, how is your health? Would you say… (excellent | very good | good | fair | poor)

Categorical explanatory variables

[TAUGHT_IN_SCHOOL] : Taught in School : "where to go for help with a health problem?" (for 2 levels) [1:yes 0: no]

[RACIAL_BACK] : Which one category best describes your racial background? (for more than 2 levels) [1:White 2:Black or African American 3:American Indian or Native American 4:Asian or Pacific Islander 5:Other ]

1) Model of interpretation of X2 t-o-i (2 levels)

After running the Chi square test of independence for the explanatory variable TAUGHT_IN_SCHOOL (categorical with 2 levels) and the response variable GENERAL_HEALTH (categorical), we have this following result :

And the graph proportion gives this :

We have a p-value = 0.634 > 0.05 here, so we can say that the Ho hypothesis cannot be rejected. It means we have no evidence that the 2 variables are associated.

2) Model of interpretation Post hoc tests (more than 2 levels)

We now change the explanatory variable to a new one, "RACIAL_BACK" (categorical with 5 levels). The Bonferroni adjustment gives us the value of "p/c" :

c = number of pairs among 5 groups = 5! / (2! (5 - 2)!) = 5x4/2 = 10

p/c = 0.05/10 = 0.005

The general X2 test gives this output :

This result alone cannot give us a good interpretation, because we need to run post hoc tests in this case.

a) Compare group 1 to 2 :

p-value = 0.02 > p/c = 0.005. This means the 2 groups are not significantly different

b) Compare group 1 to 3 :

p-value = 0.98 > p/c = 0.005. This means the 2 groups are not significantly different either

c) Compare group 1 to 4 :

p-value = 0.54 > p/c = 0.005. This means the 2 groups are not significantly different either

d) Compare group 1 to 5 :

p-value = 0.93 > p/c = 0.005. This means the 2 groups are not significantly different either

e) Conclusion

We see here that none of the groups compared differs significantly from group 1 ("without a significant dependence"). Unfortunately, we don't have any case with p-value < 0.005. So our test ends here.
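A short sketch of the bookkeeping behind these post hoc tests : it lists every pair of RACIAL_BACK groups, derives the Bonferroni threshold from the number of pairs, and checks the p-values reported in this post against it. Only the four comparisons against group 1 were run here, so the other pairs are simply flagged as not tested.

```python
from itertools import combinations

groups = [1, 2, 3, 4, 5]          # RACIAL_BACK levels
alpha = 0.05

pairs = list(combinations(groups, 2))
adjusted_alpha = alpha / len(pairs)
print(len(pairs), "pairwise comparisons, threshold p/c =", adjusted_alpha)

# p-values reported in this post for the comparisons against group 1
reported = {(1, 2): 0.02, (1, 3): 0.98, (1, 4): 0.54, (1, 5): 0.93}

for pair in pairs:
    if pair in reported:
        verdict = "significant" if reported[pair] < adjusted_alpha else "not significant"
        print(pair, reported[pair], verdict)
    else:
        print(pair, "not tested in this post")
```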

This conclusion is confirmed with this percentage graph here-below :

3) Python Program

import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('../../datasets & codebooks/addhealth_pds.csv', low_memory=False)

# Response variable : C
# H1GH1 : In general, how is your health? Would you say… (excellent | very good | good | fair | poor)
data['H1GH1'] = pandas.to_numeric(data['H1GH1'], errors='coerce')

# Subset 1 : my needed variables
# (TAUGHT_IN_SCHOOL, RACIAL_BACK and GENERAL_HEALTH are the recoded variables ; the recoding step is not shown here)
sub1 = data[['TAUGHT_IN_SCHOOL', 'RACIAL_BACK', 'GENERAL_HEALTH']].dropna().copy()

print('############# simple X2 t-o-i (2 levels) #############')

# contingency table of observed counts
ct0 = pandas.crosstab(sub1['GENERAL_HEALTH'], sub1['TAUGHT_IN_SCHOOL'])
print(ct0)

# column percentages
colsum0 = ct0.sum(axis=0)
colpct0 = ct0 / colsum0
print(colpct0)

# chi-square
print('chi-square value, p value, expected counts')
cs0 = scipy.stats.chi2_contingency(ct0)
print(cs0)

# 0 : no
# 1 : yes
sub1["TAUGHT_IN_SCHOOL"] = sub1["TAUGHT_IN_SCHOOL"].astype('category')

# graph percent (catplot replaces the old factorplot ; errorbar=None needs seaborn >= 0.12, use ci=None on older versions)
seaborn.catplot(x="TAUGHT_IN_SCHOOL", y="GENERAL_HEALTH", data=sub1, kind="bar", errorbar=None)
plt.xlabel('Taught in school')
plt.ylabel('General health')

print('############# post hoc (more than 2 levels) #############')

# contingency table of observed counts
ct1 = pandas.crosstab(sub1['GENERAL_HEALTH'], sub1['RACIAL_BACK'])
print(ct1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)

# 1:White
# 2:Black or African American
# 3:American Indian or Native American
# 4:Asian or Pacific Islander
# 5:Other
sub1["RACIAL_BACK"] = sub1["RACIAL_BACK"].astype('category')

# graph percent
seaborn.catplot(x="RACIAL_BACK", y="GENERAL_HEALTH", data=sub1, kind="bar", errorbar=None)
plt.xlabel('Racial background')
plt.ylabel('General health')

# post hoc : compare group 1 to each of the groups 2, 3, 4 and 5
for other in [2, 3, 4, 5]:
    # keep only the two groups being compared, every other group becomes NaN
    recode = {1: 1, other: other}
    comp = 'COMP1v%d' % other
    sub1[comp] = sub1['RACIAL_BACK'].map(recode)

    # contingency table of observed counts
    ct = pandas.crosstab(sub1['GENERAL_HEALTH'], sub1[comp])
    print(ct)

    # column percentages
    colsum = ct.sum(axis=0)
    colpct = ct / colsum
    print(colpct)

    print('chi-square value, p value, expected counts')
    cs = scipy.stats.chi2_contingency(ct)
    print(cs)

print('ok')

#chi square

ANOVA in Addhealth

Intro

My old research question has C->C variables, whereas the ANOVA test needs C->Q. So for this exercise, I picked other variables, as the course suggested. (Note : a copy of the python program is attached at the end of this blog to keep the text readable.)

My new association

I was interested in examining the association between "the number of smoking days for past month" [H1TO5] and "whether teens were taught at school about smoking cigarettes" [H1TS3], and "the racial background of teens" [H1GI8].

Quantitative response variable [H1TO5] : During the past 30 days, on how many days did you smoke cigarettes?

Categorical explanatory variables

[H1TS3] : Was “smoking“ taught in School? (for 2 levels)

[H1GI8] : Which one category best describes your racial background? (for more than 2 levels)

Model Interpretation for ANOVA

After examining the association between "the number of smoking days for past month" (quantitative response) and "whether teens were taught at school about smoking cigarettes" (categorical explanatory), the Analysis of Variance (ANOVA) revealed that those who were not taught about cigarette smoking (Mean=9.00, s.d. ± 10.58) did not differ significantly from those who were taught about cigarette smoking (Mean=10.06, s.d. ± 12.43), F(1, 128) = 0.753, p = 0.10 > 0.05
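To make the F statistic less of a black box, here is a minimal sketch of a one-way ANOVA computed by hand on made-up numbers of smoking days (not the AddHealth data). In my program the same quantity comes from the OLS summary of statsmodels ; the helper `one_way_anova_f` below is only for illustration.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)

    # variability of the group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # variability of the observations around their own group mean
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up numbers of smoking days, for illustration only (NOT the AddHealth data)
not_taught = [2, 5, 30, 10, 1]
taught = [3, 30, 15, 7, 1]

f_stat, df1, df2 = one_way_anova_f([not_taught, taught])
print("F(%d, %d) = %.3f" % (df1, df2, f_stat))
```

A small F means the difference between the group means is small compared with the spread inside each group.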

Model Interpretation for post hoc ANOVA results

The ANOVA revealed that "the racial background" (the categorical explanatory variable) and "the number of smoking days for past month" (the quantitative response variable) were not significantly associated, F (4, 125) = 1.297, p=0.275 > 0.05.

Post hoc comparisons of mean "number of smoking days for past month" by "racial background" categories revealed that "Black or African American" teens reported a somewhat different "number of smoking days for past month" compared to "White", "American Indian or Native American", "Asian or Pacific Islander" or "Other race" teens, but the difference is not statistically significant. All other comparisons were statistically similar.

Python Program

import pandas
import numpy
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# 'data' is assumed to be loaded from addhealth_pds.csv with pandas.read_csv, as in my other posts

# setting variables you will be working with to numeric
# Explanatory variables : C
# 2 levels
# H1TS3 : Taught in School : smoking
data['H1TS3'] = pandas.to_numeric(data['H1TS3'], errors='coerce')

# more than 2 levels
# H1GI8 : Which one category best describes your racial background?
data['H1GI8'] = pandas.to_numeric(data['H1GI8'], errors='coerce')

# Response variable : Q
# H1TO5 : During the past 30 days, on how many days did you smoke cigarettes?
data['H1TO5'] = pandas.to_numeric(data['H1TO5'], errors='coerce')

# SETTING MISSING DATA & RENAME VARIABLE
data['TAUGHT_IN_SCHOOL'] = data['H1TS3'].replace([6, 8], numpy.nan)
data['RACIAL_BACK'] = data['H1GI8'].replace([6, 7, 8, 9], numpy.nan)
data['NB_SMOKING_DAY'] = data['H1TO5'].replace([96, 97, 98], numpy.nan)

# Subset 1 : my needed variables
sub1 = data[['TAUGHT_IN_SCHOOL', 'RACIAL_BACK', 'NB_SMOKING_DAY']].dropna()

ct1 = sub1.groupby('NB_SMOKING_DAY').size()
print(ct1)

# using ols function for calculating the F-statistic and associated p value
model1 = smf.ols(formula='NB_SMOKING_DAY ~ C(TAUGHT_IN_SCHOOL)', data=sub1)
results1 = model1.fit()
print(results1.summary())

# Subset 2 : 2 levels
sub2 = sub1[['TAUGHT_IN_SCHOOL', 'NB_SMOKING_DAY']].dropna()
print('means for NB_SMOKING_DAY by TAUGHT_IN_SCHOOL status')
m1 = sub2.groupby('TAUGHT_IN_SCHOOL').mean()
print(m1)

print('standard deviations for NB_SMOKING_DAY by TAUGHT_IN_SCHOOL status')
sd1 = sub2.groupby('TAUGHT_IN_SCHOOL').std()
print(sd1)

# Subset 3 : more than 2 levels >> By racial background
sub3 = sub1[['RACIAL_BACK', 'NB_SMOKING_DAY']].dropna()
model2 = smf.ols(formula='NB_SMOKING_DAY ~ C(RACIAL_BACK)', data=sub3).fit()
print(model2.summary())

print('means for NB_SMOKING_DAY by RACIAL_BACK status')
m2 = sub3.groupby('RACIAL_BACK').mean()
print(m2)

print('standard deviations for NB_SMOKING_DAY by RACIAL_BACK status')
sd2 = sub3.groupby('RACIAL_BACK').std()
print(sd2)

# post hoc test : Tukey's Honestly Significant Difference
mc1 = multi.MultiComparison(sub3['NB_SMOKING_DAY'], sub3['RACIAL_BACK'])
res1 = mc1.tukeyhsd()
print(res1.summary())

#anova #hypothesis test

Addhealth - Graphs

Intro

In the Addhealth dataset, I was interested in parents' relationship with their children, and wanted to determine whether being a "born again" christian is related to this relationship.

I created a new variable named "PARENTS_AFFECTION_LEVEL". This variable is defined by summing the values of these variables :
- mother's closeness (category response : 0 to 4)
- mother's care (category response : 0 to 4)
- father's closeness (category response : 0 to 4)
- father's care (category response : 0 to 4)

So, the PARENTS_AFFECTION_LEVEL category will be 0 to 16.
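A minimal sketch of how such a score can be built. The function name and the choice to return None when one of the four answers is missing are my assumptions for this illustration ; the actual program may handle missing data differently.

```python
def parents_affection_level(mother_close, mother_care, father_close, father_care):
    """Sum four 0-4 answers; return None if any answer is missing (assumption)."""
    answers = [mother_close, mother_care, father_close, father_care]
    if any(a is None for a in answers):
        return None
    return sum(answers)

# a teen with complete answers, then a teen with one missing answer
print(parents_affection_level(4, 4, 3, 4))
print(parents_affection_level(4, 4, None, 4))
```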

1) Univariate graph

a) for all data

After drawing PARENTS_AFFECTION_LEVEL as a univariate graph, we have these outputs

Distribution & description :

We see here that :

- the center of the distribution is 14,

- the range is 17 points (16 - 0 + 1)

Graph :

b) for subsets (just to have a comparison)

The subset 1 is the group of "born again" teens

The subset 2 is the group of "non born again" teens

2) Bivariate graph

Let's note that our explanatory variable (X) is PARENTS_AFFECTION_LEVEL, and our response variable (Y) is the "born again" variable. The flowchart guide tells us that we will have a Categorical to Categorical graph (C -> C). In fact, PARENTS_AFFECTION_LEVEL is a categorical variable (no need to collapse it), and the BORN_AGAIN variable is also a categorical variable.

At the end, we'll have this graph :

To read it, let's take the common range (shared value) for "born again" (1) and "non born again" (0) people [ marked with the blue range below ].

We see here that there is a positive correlation between PARENTS_AFFECTION_LEVEL and BORN_AGAIN children.

#univariate #bivariate

Data management : Addhealth

Just to summarize the context : the idea is to compare "born again" & "not born again" christians, in terms of their relationship with their parents.

So, in this data management step, the methodology is to take the christians who said yes to the variable H1RE5, and those who said no to the same question. Then, compare the parents' closeness & care for both groups (distribution).

Program :

If you want to know what the program looks like in python, feel free to take a look here

Output :

The results are compared in 3 columns in these images. The Excel file is here, but you can view the images as follows :

Summary :

In the Addhealth dataset, 27.4% said they are "born again" and 29.5% said "no" (the rest are missing data).

1) if we take "born again" group [group 1]

- 68.27% of them feel "very close to their Mother"

- 87.10% of them feel "their Mother cares very much of them"

- 43.10% of them feel "very close to their Father"

- 61.21% of them feel "their Father cares very much of them"

2) if we take "not born again" group [group 2]

- 65.22% of them feel "very close to their Mother"

- 84.35% of them feel "their Mother cares very much of them"

- 34.56% of them feel "very close to their Father"

- 53.85% of them feel "their Father cares very much of them"

My conclusion :

Teens in group 1 (1784 teens = 27.42%) feel they have a better relationship with their parents than those in group 2 (1918 teens = 29.48%). Just to note, 2802 teens (43.08%) are missing data for the "born again" variable.

Analysis in Python

Program code

Here is an overview of the code used to view distributions in the Addhealth dataset :

-----------begin--------- [[

# import libraries
import pandas
import numpy

# import .csv file
data = pandas.read_csv('addhealth_pds.csv', low_memory=False)

# setting variables to numeric
# (pandas.to_numeric replaces convert_objects, which recent pandas versions no longer provide)
data['H1WP9'] = pandas.to_numeric(data['H1WP9'], errors='coerce')
data['H1WP10'] = pandas.to_numeric(data['H1WP10'], errors='coerce')
data['H1WP13'] = pandas.to_numeric(data['H1WP13'], errors='coerce')
data['H1WP14'] = pandas.to_numeric(data['H1WP14'], errors='coerce')
data['H1RE1'] = pandas.to_numeric(data['H1RE1'], errors='coerce')
data['H1RE5'] = pandas.to_numeric(data['H1RE5'], errors='coerce')

##### counts & percentages (i.e. frequency distributions) for each variable

# Section 16 : Relations with Parents
print('###### H1WP9 : How close do you feel to your {MOTHER/ADOPTIVE MOTHER/ STEPMOTHER/ FOSTER MOTHER/etc.}? ')
print('count :')
c1 = data['H1WP9'].value_counts(sort=False)
print(c1)

print('percentages :')
p1 = data['H1WP9'].value_counts(sort=False, normalize=True)
print(p1)

print('###### H1WP13 : How close do you feel to your {FATHER/ADOPTIVE FATHER/STEPFATHER/FOSTER FATHER/etc.}? ')
print('count :')
c3 = data['H1WP13'].value_counts(sort=False)
print(c3)

print('percentages :')
p3 = data['H1WP13'].value_counts(sort=False, normalize=True)
print(p3)

# Section 37 : Religion
print('###### H1RE1 : What is your religion? ')
print('count :')
c5 = data['H1RE1'].value_counts(sort=False)
print(c5)

print('percentages :')
p5 = data['H1RE1'].value_counts(sort=False, normalize=True)
print(p5)

]]-----------end------------
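Since the same count/percentage pattern repeats for every variable, it could be folded into a small helper. The sketch below is one possible way to do that, not part of the original program: `show_distribution` is a hypothetical name, and the toy DataFrame stands in for the real Addhealth file.

```python
import pandas as pd


def show_distribution(frame, column, question):
    """Print counts and percentages for one variable and return both."""
    counts = frame[column].value_counts(sort=False)
    percentages = frame[column].value_counts(sort=False, normalize=True)
    print("######", column, ":", question)
    print("count :")
    print(counts)
    print("percentages :")
    print(percentages)
    return counts, percentages


# Toy data for illustration only (not the real survey answers).
toy = pd.DataFrame({"H1WP9": [5, 5, 4, 7], "H1RE1": [4, 22, 4, 0]})

questions = {
    "H1WP9": "How close do you feel to your mother?",
    "H1RE1": "What is your religion?",
}

# One call per variable instead of a copy of the same six lines.
results = {col: show_distribution(toy, col, q) for col, q in questions.items()}
```

Adding another variable then only means adding one entry to the `questions` dictionary.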

Program output (for three variables)

And we got the following output :

-----------------begin------------[[

###### H1WP9 : How close do you feel to your {MOTHER/ADOPTIVE MOTHER/ STEPMOTHER/ FOSTER MOTHER/etc.}?
count :
4    1229
8       3
1      25
5    4239
2     156
6       2
3     480
7     370
dtype: int64
percentages :
4    0.188961
8    0.000461
1    0.003844
5    0.651753
2    0.023985
6    0.000308
3    0.073801
7    0.056888
dtype: float64

###### H1WP13 : How close do you feel to your {FATHER/ADOPTIVE FATHER/STEPFATHER/FOSTER FATHER/etc.}?
count :
4    1211
8       1
1      75
5    2467
2     184
6       4
3     610
7    1952
dtype: int64
percentages :
4    0.186193
8    0.000154
1    0.011531
5    0.379305
2    0.028290
6    0.000615
3    0.093788
7    0.300123
dtype: float64

###### H1RE1 : What is your religion?
count :
0      751
4     1590
8       95
12      80
16     192
20       1
24       5
28     182
96      25
1       27
5      569
9        8
13     236
17     134
21      27
25      25
2       64
6        9
10      67
14     370
18      17
22    1448
26      54
98     111
3       59
7       25
11      80
15       2
19     216
23      22
27      10
99       3
dtype: int64
percentages :
0     0.115467
4     0.244465
8     0.014606
12    0.012300
16    0.029520
20    0.000154
24    0.000769
28    0.027983
96    0.003844
1     0.004151
5     0.087485
9     0.001230
13    0.036285
17    0.020603
21    0.004151
25    0.003844
2     0.009840
6     0.001384
10    0.010301
14    0.056888
18    0.002614
22    0.222632
26    0.008303
98    0.017066
3     0.009071
7     0.003844
11    0.012300
15    0.000308
19    0.033210
23    0.003383
27    0.001538
99    0.000461
dtype: float64

]]-------------end----------

Summary

In the AddHealth survey, 6504 adolescents were asked :

How close do you feel to your {MOTHER/ADOPTIVE MOTHER/ STEPMOTHER/ FOSTER MOTHER/etc.}? [H1WP9]

How close do you feel to your {FATHER/ADOPTIVE FATHER/STEPFATHER/FOSTER FATHER/etc.}? [H1WP13]

What is your religion? [H1RE1]

For the 1st question above, 65.17% of them said "very much" (category 5), and we note that 5.6% fell into category 7 (legitimate skip [no MOM]).

For the 2nd question, the picture was quite different : only 37.93% said "very much", and 30.01% have no Dad (legitimate skip).

With the 3rd question, the top of the list is Baptist (24.44%, category 4) and Catholic (22.26%, category 22). But we also found that 11.54% belong to category 0 (none [skip to the next section]).
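As a quick check, the percentages quoted in this summary can be recomputed from the counts in the program output. The sketch below does this for H1WP9 only; the counts are copied from the output above, and the variable names are just for illustration.

```python
import pandas as pd

# Counts for H1WP9, copied from the program output (category -> count).
h1wp9_counts = pd.Series({1: 25, 2: 156, 3: 480, 4: 1229,
                          5: 4239, 6: 2, 7: 370, 8: 3})

# Total number of adolescents in the file.
total = h1wp9_counts.sum()

# Share of each category, in percent.
shares = h1wp9_counts / total * 100

print(total)
print(shares.round(2))
```

The total comes out at 6504, and categories 5 and 7 give about 65.18% and 5.69% when rounded to two decimals, in line with the 65.17% and 5.6% quoted above.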

#python analysis #spider #Addhealth #distribution

Research on “Add health”

I have always been fascinated by people's behavior, so the AddHealth dataset is a good place to begin. When I think back on my adolescence, my relationship with my parents always comes to mind, followed by questions like : "Did my relationship with them have an impact on who I am now?". So I think I'll choose the "relation with parents" (section 16) as my topic of interest, and I'll refine my question with these specific variables in section 16 :

H1WP9 : How close do you feel to your {MOTHER/ADOPTIVE MOTHER/ STEPMOTHER/ FOSTER MOTHER/etc.}?

H1WP10 : How much do you think she cares about you?

H1WP13 : How close do you feel to your {FATHER/ADOPTIVE FATHER/STEPFATHER/FOSTER FATHER/etc.}?

H1WP14 : How much do you think he cares about you?

I'll choose "religion" as a second topic. Globally, my question is : Is "relation with parents" (section 16) associated with religion (section 37) ? That leads me to look at these variables in section 37 :

H1RE1 : What is your religion?

H1RE5 : (if christian) Do you think of yourself as a Born-Again Christian?

After looking online for research on similar topics, I think the following literature gives a global idea of what my hypothesis should be :

1) Relationship between Perceived Parenting Style, Perceived Parental Acceptance-Rejection (PAR) and Perception of God among Young Adults

2) The Relative Importance of Parents and Peers for Adolescent Religious Orientation: An Australian Study

3) Following the Leaders: Parents' Influence on Adolescent Religious Activity

4) Transmission of Religious Values: Relations Between Parents' and Daughters' Beliefs

Conclusion : In every domain of our life, we don't have assurance in our opinions or in our identity. I think parents have a big responsibility if they want their children to be steady in how they express themselves.

My hypothesis for this topic is : The closeness (H1WP9, H1WP13) and the care (H1WP10, H1WP14) of parents influence whether Christian (H1RE1) children know with assurance if they are "born-again" (H1RE5).

I don't know if we can reverse the statement as follows, but we'll see at the end of the research : Christian children who can say with assurance that they are "born-again" have a good relationship (closeness + care) with their parents.

#Addhealth #data analysis #parenting #religion
